# Hidden Markov Models The HMM module (`pysinger.hmm`) contains the two coupled forward HMMs that are the computational heart of SINGER, plus the coalescent calculator and emission models that feed them. ## CoalescentCalculator ```{eval-rst} .. autoclass:: pysinger.hmm.coalescent.CoalescentCalculator :members: ``` Computes the **piecewise-exponential coalescent CDF** for a set of branches. The coalescence rate at time $t$ equals the number of lineages alive at $t$: $$ \lambda(t) = \sum_b \mathbb{1}[t_{\text{lower}}(b) \leq t < t_{\text{upper}}(b)] $$ ### Construction (`compute`) 1. **Rate changes**: Record $+1$ at each branch's lower time and $-1$ at each upper time. 2. **Cumulative rates**: Running sum gives the piecewise-constant rate function $\lambda(t)$. 3. **CDF**: The survival probability $S(t) = \exp(-\int_0^t \lambda(s)\,ds)$ is piecewise-exponential. Between consecutive rate-change times $[t_k, t_{k+1})$ with rate $\lambda_k$: $$ S(t_{k+1}) = S(t_k) \cdot e^{-\lambda_k (t_{k+1} - t_k)} $$ The cumulative probability $F(t) = 1 - S(t)$ is stored as a `SortedDict` for $O(\log n)$ lookup. ### Interpolation (`prob`) For a time $t$ between grid points $[t_k, t_{k+1})$: $$ F(t) = F(t_k) + \Delta F \cdot \frac{\text{expm1}(-\lambda_k \cdot (t - t_k))}{\text{expm1}(-\lambda_k \cdot (t_{k+1} - t_k))} $$ This is exact (not an approximation) because the rate is constant within the interval. ### Quantile (`quantile`) The inverse CDF: given $p$, find $t$ such that $F(t) = p$. Uses the inverse formula: $$ t = t_k - \frac{1}{\lambda_k} \log\!\Bigl(1 - \frac{p - F(t_k)}{F(t_{k+1}) - F(t_k)} \cdot (1 - e^{-\lambda_k \Delta t})\Bigr) $$ ### `weight(lb, ub)` and `time(lb, ub)` - `weight(lb, ub)` $= F(ub) - F(lb)$: probability of coalescence in $[lb, ub]$. - `time(lb, ub)`: the exponential median -- the time $t$ at which $F(t)$ is midway between $F(lb)$ and $F(ub)$. Used as the representative time for each HMM interval. ## Emission models ```{eval-rst} .. autoclass:: pysinger.hmm.emission.BinaryEmission :members: .. autoclass:: pysinger.hmm.emission.PolarEmission :members: ``` Both models compute the **ratio** of the data likelihood with vs. without the threaded lineage. This ratio is the emission probability used by the HMM. ### BinaryEmission (used by TSP) For a branch $b$ being split at time $t$ by a new lineage with query node $q$, the three sub-branch lengths are: $$ \ell_{\text{lower}} = t - t_{\text{lower}}, \quad \ell_{\text{upper}} = t_{\text{upper}} - t, \quad \ell_{\text{query}} = t - t_q $$ **Null emission** (no mutations in the bin): $$ e_{\text{null}} = \frac{e^{-\theta \ell_l} \cdot e^{-\theta \ell_u} \cdot e^{-\theta \ell_q}}{e^{-\theta(\ell_l + \ell_u)}} = e^{-\theta \ell_q} $$ **Mutation emission**: For each segregating site, compute the majority-rule state $s_m = \mathbb{1}[s_l + s_u + s_q > 1.5]$ and count transitions on each sub-branch. The probability is a product of Poisson terms: $$ e_{\text{mut}} = \prod_{\text{sites}} \frac{e^{-\theta \ell_l} (\theta/\Delta x)^{d_l} \cdot e^{-\theta \ell_u} (\theta/\Delta x)^{d_u} \cdot e^{-\theta \ell_q} (\theta/\Delta x)^{d_q}}{e^{-\theta(\ell_l+\ell_u)} (\theta/\Delta x)^{d_{\text{old}}}} $$ ### PolarEmission (used by BSP) Extends BinaryEmission with: - **Penalty factor** for derived alleles in the query node (configurable via `penalty`). - **Root reward**: when the branch reaches the root and the majority state is derived while the lower node is ancestral, applies a factor of `ancestral_prob / (1 - ancestral_prob)`. - Simplified null emission: only depends on $\ell_q$ (finite branch) or $\ell_l + \ell_q$ (root branch). ## BSP — Branch Sequence Propagator ```{eval-rst} .. autoclass:: pysinger.hmm.bsp.BSP :members: ``` The BSP is the **branch-dimension HMM**. Its state space is a list of `Interval` objects, one per (branch, time-range) cell in the current marginal tree. ### Forward pass For each genomic bin $i$: 1. **Recombination or advance**: If position $i$ coincides with a recombination breakpoint, call `transfer(r)`. Otherwise call `forward(rho)`. 2. **Emission**: If mutations are present, call `mut_emit(theta, bin_size, mutations, query_node)`. Otherwise call `null_emit(theta, query_node)`. ### `forward(rho)` — no topology change $$ \alpha_i^{(x+1)} = \alpha_i^{(x)} \cdot (1 - p_{\text{recomb},i}) + R \cdot w_i $$ where $p_{\text{recomb},i} = \rho \cdot \Delta t_i \cdot e^{-\rho \Delta t_i}$ is the recombination probability for interval $i$ with $\Delta t_i = t_i - t_c$, and $w_i$ is the coalescent re-landing weight (coalescent probability $\times$ recombination probability, normalised). ### `transfer(r)` — topology change Probability mass is redistributed across the new state space. The BSP classifies each old interval by its branch type (source / target / other) and routes mass to the corresponding new intervals. The `_generate_intervals` method builds the new state space, applying a pruning cutoff to discard low-probability partial intervals. ### Traceback `sample_joining_branches()` walks backward through the forward probabilities, at each step deciding whether the lineage stayed on the same branch or jumped. Jumps are sampled proportional to the recombination probability $\times$ forward probability. The result is a `SortedDict[position -> Branch]`. ## TSP — Time Sequence Propagator ```{eval-rst} .. autoclass:: pysinger.hmm.tsp.TSP :members: ``` The TSP is the **time-dimension HMM**. Conditioned on the BSP's sampled branch path, it samples a coalescence time for each genomic bin. ### State space For a branch $[l, u]$, the time axis is discretised into $K$ intervals using **exponential-quantile spacing**: boundaries are placed at $F_{\text{Exp}(1)}^{-1}(p)$ for evenly-spaced $p$ values. This concentrates grid points where the coalescent density is highest. The gap parameter $q$ (default 0.02) controls resolution: $K = \lceil (p_u - p_l) / q \rceil$. ### Forward pass The PSMC transition kernel is a tridiagonal matrix: - **Diagonal**: $P_{\text{stay}} = P_{\text{PSMC}}(\rho, t_i, [l_i, u_i]) / P_{\text{PSMC}}(\rho, t_i, [l_0, u_K])$ - **Lower diagonal**: probability of transitioning from interval $i+1$ to $i$ - **Upper diagonal**: probability of transitioning from interval $i-1$ to $i$ The $O(K)$ recursion uses `lower_sums` (cumulative sum from below) and `upper_sums` (cumulative sum from above): $$ \alpha_i^{(x+1)} = L_i + D_i \cdot \alpha_i^{(x)} + U_i \cdot \sum_{j>i} \alpha_j^{(x)} $$ ### Transfer at branch changes Three cases mirror the BSP: - **Source $\to$ merging**: Collapse to a point mass at the deleted node's time, with intervals above and below. - **Target $\to$ recombined**: Expand from a point mass, distributing probability across the new branch's time range. - **Regular**: Transfer intervals by time overlap between old and new branches. ### Traceback `sample_joining_nodes()` walks backward, sampling an `Interval` at each step and converting it to a `Node` at the interval's representative time (exponential median with jitter via `_exp_median`).