Hidden Markov Models
The HMM module (pysinger.hmm) contains the two coupled forward HMMs that are the computational heart of SINGER, plus the coalescent calculator and emission models that feed them.
CoalescentCalculator
- class pysinger.hmm.coalescent.CoalescentCalculator(cut_time: float)[source]
Piecewise-exponential coalescent CDF for a set of branches.
Usage:
cc = CoalescentCalculator(cut_time=0.0) cc.compute(set_of_branches) p = cc.weight(lb, ub) # probability of coalescence in [lb, ub] t = cc.time(lb, ub) # representative time in [lb, ub]
- compute(branches: Set['Branch']) → None[source]
Recompute the CDF from branches.
Mirrors Coalescent_calculator::compute.
- prob(x: float) → float[source]
Return cumulative coalescence probability at time x.
Mirrors Coalescent_calculator::prob.
- quantile(p: float) → float[source]
Return the time t such that prob(t) == p.
Mirrors Coalescent_calculator::quantile.
Computes the piecewise-exponential coalescent CDF for a set of branches. The coalescence rate at time \(t\) equals the number of lineages alive at \(t\):
Construction (compute)
Rate changes: Record \(+1\) at each branch’s lower time and \(-1\) at each upper time.
Cumulative rates: Running sum gives the piecewise-constant rate function \(\lambda(t)\).
CDF: The survival probability \(S(t) = \exp(-\int_0^t \lambda(s)\,ds)\) is piecewise-exponential. Between consecutive rate-change times \([t_k, t_{k+1})\) with rate \(\lambda_k\):
The cumulative probability \(F(t) = 1 - S(t)\) is stored as a SortedDict for \(O(\log n)\) lookup.
Interpolation (prob)
For a time \(t\) between grid points \([t_k, t_{k+1})\):
This is exact (not an approximation) because the rate is constant within the interval.
Quantile (quantile)
The inverse CDF: given \(p\), find \(t\) such that \(F(t) = p\). Uses the inverse formula:
weight(lb, ub) and time(lb, ub)
weight(lb, ub)\(= F(ub) - F(lb)\): probability of coalescence in \([lb, ub]\).time(lb, ub): the exponential median – the time \(t\) at which \(F(t)\) is midway between \(F(lb)\) and \(F(ub)\). Used as the representative time for each HMM interval.
Emission models
- class pysinger.hmm.emission.BinaryEmission[source]
Symmetric infinite-sites emission model.
Mirrors Binary_emission.cpp.
The emission probability for a branch (lower, upper) and query node with representative time t on the branch is:
- P(data) ∝ exp(-θ * l_lower) * (θ/bin_size)^s_lower
exp(-θ * l_upper) * (θ/bin_size)^s_upper
exp(-θ * l_query) * (θ/bin_size)^s_query
divided by the “old” probability (without the threaded lineage) to get the ratio used in the HMM.
- emit(branch: Branch, time: float, theta: float, bin_size: float, emissions: list, node: Node) → float[source]
Emit using pre-computed diff counts (used by TSP).
- class pysinger.hmm.emission.PolarEmission(penalty: float = 1.0, ancestral_prob: float = 0.5)[source]
Polarised emission model for ancestral/derived alleles.
Mirrors Polar_emission.cpp.
- Additional parameters:
penalty: Cost of derived alleles in the query node. ancestral_prob: Prior probability that allele is ancestral at root.
- emit(branch: Branch, time: float, theta: float, bin_size: float, emissions: list, node: Node) → float[source]
Emit using pre-computed diff counts (used by TSP).
Both models compute the ratio of the data likelihood with vs. without the threaded lineage. This ratio is the emission probability used by the HMM.
BinaryEmission (used by TSP)
For a branch \(b\) being split at time \(t\) by a new lineage with query node \(q\), the three sub-branch lengths are:
Null emission (no mutations in the bin):
Mutation emission: For each segregating site, compute the majority-rule state \(s_m = \mathbb{1}[s_l + s_u + s_q > 1.5]\) and count transitions on each sub-branch. The probability is a product of Poisson terms:
PolarEmission (used by BSP)
Extends BinaryEmission with:
Penalty factor for derived alleles in the query node (configurable via
penalty).Root reward: when the branch reaches the root and the majority state is derived while the lower node is ancestral, applies a factor of
ancestral_prob / (1 - ancestral_prob).Simplified null emission: only depends on \(\ell_q\) (finite branch) or \(\ell_l + \ell_q\) (root branch).
BSP — Branch Sequence Propagator
- class pysinger.hmm.bsp.BSP[source]
Branch Sequence Propagator — forward HMM for the branch dimension.
State space: a list of Interval objects. Forward probabilities: forward_probs[step][interval_idx].
Mirrors BSP_smc in the C++ code.
- forward(rho: float) → None[source]
Advance forward by one bin with recombination rate rho.
Mirrors BSP_smc::forward.
- mut_emit(theta: float, bin_size: float, mut_set: Set[float], query_node: 'Node' | None) → None[source]
Apply mutation emission.
Mirrors BSP_smc::mut_emit.
- null_emit(theta: float, query_node: 'Node' | None) → None[source]
Apply null-emission (no mutations in bin).
Mirrors BSP_smc::null_emit.
- sample_joining_branches(start_index: int, coordinates: List[float]) → sortedcontainers.SortedDict[source]
Traceback to sample a joining-branch map.
Mirrors BSP_smc::sample_joining_branches. Returns SortedDict[pos → Branch].
- sanity_check(r: Recombination) → None[source]
Zero out invalid point-mass intervals at recombination nodes.
Mirrors BSP_smc::sanity_check.
- start(branches: Set[Branch], t: float) → None[source]
Initialise the forward pass at the left boundary.
branches is the set of branches in the starting tree. t is the cut time (lineage starts at time t).
Mirrors BSP_smc::start.
- transfer(r: Recombination) → None[source]
Apply topology change r to the interval state space.
Mirrors BSP_smc::transfer.
The BSP is the branch-dimension HMM. Its state space is a list of Interval objects, one per (branch, time-range) cell in the current marginal tree.
Forward pass
For each genomic bin \(i\):
Recombination or advance: If position \(i\) coincides with a recombination breakpoint, call
transfer(r). Otherwise callforward(rho).Emission: If mutations are present, call
mut_emit(theta, bin_size, mutations, query_node). Otherwise callnull_emit(theta, query_node).
forward(rho) — no topology change
where \(p_{\text{recomb},i} = \rho \cdot \Delta t_i \cdot e^{-\rho \Delta t_i}\) is the recombination probability for interval \(i\) with \(\Delta t_i = t_i - t_c\), and \(w_i\) is the coalescent re-landing weight (coalescent probability \(\times\) recombination probability, normalised).
transfer(r) — topology change
Probability mass is redistributed across the new state space. The BSP classifies each old interval by its branch type (source / target / other) and routes mass to the corresponding new intervals. The _generate_intervals method builds the new state space, applying a pruning cutoff to discard low-probability partial intervals.
Traceback
sample_joining_branches() walks backward through the forward probabilities, at each step deciding whether the lineage stayed on the same branch or jumped. Jumps are sampled proportional to the recombination probability \(\times\) forward probability. The result is a SortedDict[position -> Branch].
TSP — Time Sequence Propagator
- class pysinger.hmm.tsp.TSP[source]
Time Sequence Propagator — forward HMM for the time dimension.
State space: a list of Interval objects (finely gridded over a Branch). Forward probabilities: forward_probs[step][interval_idx].
Mirrors TSP_smc in the C++ code.
- mut_emit(theta: float, bin_size: float, mut_set: Set[float], query_node: Node) → None[source]
Apply mutation emission.
Mirrors TSP_smc::mut_emit.
- null_emit(theta: float, query_node: Node) → None[source]
Apply null emission (no mutations).
Mirrors TSP_smc::null_emit.
- recombine(prev_branch: Branch, next_branch: Branch) → None[source]
Re-sample interval distribution at a recombination.
Mirrors TSP_smc::recombine.
- sample_joining_nodes(start_index: int, coordinates: List[float]) → Dict[float, Node | None][source]
Traceback to sample a joining-node map.
Returns Dict[pos → Node]. Mirrors TSP_smc::sample_joining_nodes.
- start(branch: Branch, t: float) → None[source]
Initialise forward pass at the left boundary.
Mirrors TSP_smc::start.
- transfer(r: Recombination, prev_branch: Branch, next_branch: Branch) → None[source]
Apply topology change r.
Mirrors TSP_smc::transfer.
The TSP is the time-dimension HMM. Conditioned on the BSP’s sampled branch path, it samples a coalescence time for each genomic bin.
State space
For a branch \([l, u]\), the time axis is discretised into \(K\) intervals using exponential-quantile spacing: boundaries are placed at \(F_{\text{Exp}(1)}^{-1}(p)\) for evenly-spaced \(p\) values. This concentrates grid points where the coalescent density is highest.
The gap parameter \(q\) (default 0.02) controls resolution: \(K = \lceil (p_u - p_l) / q \rceil\).
Forward pass
The PSMC transition kernel is a tridiagonal matrix:
Diagonal: \(P_{\text{stay}} = P_{\text{PSMC}}(\rho, t_i, [l_i, u_i]) / P_{\text{PSMC}}(\rho, t_i, [l_0, u_K])\)
Lower diagonal: probability of transitioning from interval \(i+1\) to \(i\)
Upper diagonal: probability of transitioning from interval \(i-1\) to \(i\)
The \(O(K)\) recursion uses lower_sums (cumulative sum from below) and upper_sums (cumulative sum from above):
Transfer at branch changes
Three cases mirror the BSP:
Source \(\to\) merging: Collapse to a point mass at the deleted node’s time, with intervals above and below.
Target \(\to\) recombined: Expand from a point mass, distributing probability across the new branch’s time range.
Regular: Transfer intervals by time overlap between old and new branches.
Traceback
sample_joining_nodes() walks backward, sampling an Interval at each step and converting it to a Node at the interval’s representative time (exponential median with jitter via _exp_median).