API reference
Full auto-generated API documentation for all public classes and functions.
Sampler
Sampler — top-level MCMC sampler for pysinger.
Mirrors Sampler.cpp / Sampler.hpp.
Typical usage:
sampler = Sampler(Ne=10000, recomb_rate=1e-8, mut_rate=1e-8)
sampler.load_vcf("data/chr1.vcf", start=0, end=1_000_000)
sampler.iterative_start()
sampler.internal_sample(num_iters=100, spacing=1)
- class pysinger.sampler.Sampler(Ne: float = 1.0, recomb_rate: float = 0.0, mut_rate: float = 0.0, recomb_map: RateMap | None = None, mut_map: RateMap | None = None)[source]
Bayesian ARG sampler.
- Parameters:
Ne (float) – Haploid effective population size (used as a time-scaling factor).
recomb_rate (float) – Per-base-pair per-generation recombination rate. Scaled internally by Ne to get the coalescent-unit rate.
mut_rate (float) – Per-base-pair per-generation mutation rate. Scaled by Ne.
recomb_map (RateMap or None) – Variable-rate maps. If provided, take precedence over the scalar rates for rho/theta computation.
mut_map (RateMap or None) – Variable-rate maps. If provided, take precedence over the scalar rates for rho/theta computation.
- internal_sample(num_iters: int, spacing: int = 1) None[source]
Run num_iters MCMC iterations.
Each iteration proposes at least
spacing * sequence_lengthbp of re-threading moves.Mirrors Sampler::internal_sample.
- iterative_start(max_retries: int = 5) None[source]
Thread all sample nodes one by one to build an initial ARG.
Mirrors Sampler::iterative_start.
If a threading step fails (e.g. HMM underflow on long sequences), the entire build is retried with a fresh RNG state up to max_retries times.
- load_vcf(vcf_file: str, start: float = 0.0, end: float = inf, haploid: bool = False) None[source]
Load genotype data from a VCF file.
- Parameters:
vcf_file – Path to the .vcf file.
start – Genomic region (half-open).
end – Genomic region (half-open).
haploid – If True, treat each column as one haplotype (no phased parsing).
Data structures
Node — a vertex in the ARG carrying allele state information.
Mirrors Node.cpp / Node.hpp. The C++ version uses a std::map with a cached iterator; here we use sortedcontainers.SortedDict and rely on its O(log n) bisect operations, which are fast enough for a readable PoC.
- class pysinger.data.node.Node(time: float, index: int = 0, mutation_sites: sortedcontainers.SortedDict = <factory>)[source]
A node in the Ancestral Recombination Graph.
- Parameters:
time (float) – Coalescence time (in units of 2Ne generations). Leaf nodes have time == 0; the global root sentinel has time == inf.
index (int) – Integer label. Sample nodes receive 0-based indices; internal nodes receive larger indices; the root sentinel has index == -1.
mutation_sites (sortedcontainers.SortedDict) – Maps genomic position → allele state (1 = derived). A sentinel entry at position -1 with state 0 is kept so look-ups before the first real position always succeed.
- get_state(pos: float) float[source]
Return the allele state at pos (exact match only).
Mirrors C++ Node::get_state: returns the stored value if there is an exact key at pos, otherwise 0. This means mutations are point events — a mutation at position 100 does NOT carry its state forward to position 101 or beyond.
- index: int = 0
- mutation_sites: sortedcontainers.SortedDict
- time: float
Branch — a directed edge between two nodes, hashable by node identity.
Mirrors Branch.cpp / Branch.hpp.
In C++ each Branch holds shared_ptr<Node> lower_node and upper_node and compares them by pointer identity. Here we do the same using Python object identity (id()). The class is immutable after construction so it can safely serve as a dict key or set member.
- class pysinger.data.branch.Branch(lower_node: 'Node' | None = None, upper_node: 'Node' | None = None)[source]
Directed edge (lower_node, upper_node) in the ARG.
Invariant: lower_node.time <= upper_node.time (except for the null branch where both are None).
Hashability: based on Python object identity of the two node pointers, exactly as the C++ branch_hash struct.
- property length: float
Branch length in time units. Returns inf for branches to root.
- lower_node
- upper_node
Tree — a marginal coalescent tree at a single genomic position.
Mirrors Tree.cpp / Tree.hpp.
- The Tree is represented by two dicts:
parents: child_node -> parent_node children: parent_node -> set of child_nodes
The topology is updated forward/backward along the genome by applying Recombination records.
- class pysinger.data.tree.Tree[source]
Marginal coalescent tree at one genomic position.
- parents
- Type:
child → parent mapping (all non-root nodes present as keys).
- add(added_branch: Branch, joining_branch: Branch, cut_node: 'Node' | None) None[source]
Insert added_branch joining at joining_branch.
Mirrors Tree::add.
- backward_update(r: Recombination) None[source]
Undo recombination r moving left along the genome.
- find_joining_branch(removed_branch: Branch) Branch[source]
Return Branch(sibling, grandparent) for the removed branch.
Mirrors Tree::find_joining_branch.
- find_sibling(n: Node) 'Node' | None[source]
Return the sibling of n (the other child of n’s parent).
- forward_update(r: Recombination) None[source]
Apply recombination r moving right along the genome.
- internal_backward_update(r: Recombination, cut_time: float) None[source]
backward_update restricted to branches with upper_node.time > cut_time.
- internal_forward_update(r: Recombination, cut_time: float) None[source]
forward_update restricted to branches with upper_node.time > cut_time.
Recombination — the ARG topology change at a single recombination breakpoint.
Mirrors Recombination.cpp / Recombination.hpp.
A Recombination stores the set of branches deleted and inserted at a genomic position pos. From the deleted/inserted sets it derives the key named branches (source, target, merging, recombined, transfer branches) and the associated times (start_time). These are used by the BSP/TSP transfer steps and by the MCMC threader.
- class pysinger.data.recombination.Recombination(deleted_branches: Set | None = None, inserted_branches: Set | None = None)[source]
Topology change record for one recombination breakpoint.
- pos
- Type:
Genomic position of the breakpoint (set externally).
- deleted_branches
- Type:
Branches that exist before (left of) pos.
- inserted_branches
- Type:
Branches that exist after (right of) pos.
- Derived attributes (computed by find_recomb_info)
- source_branch
- Type:
The lineage that recombines.
- target_branch
- Type:
The lineage it joins at coalescence.
- inserted_node
- Type:
New internal node created at the coalescence.
- deleted_node
- Type:
Old internal node removed by the topology change.
- merging_branch
- Type:
The branch that takes over after the removed node.
- recombined_branch
- Type:
Part of the source lineage below start_time.
- start_time
- Type:
Height at which recombination begins.
- lower/upper_transfer_branch
- Type:
For BSP interval transfer.
- source_sister_branch, source_parent_branch
- Type:
auxiliary.
- add(prev_added_branch: Branch, next_added_branch: Branch, prev_joining_branch: Branch, next_joining_branch: Branch, cut_node: 'Node' | None = None) None[source]
Update this recombination when a new lineage is threaded in.
Mirrors Recombination::add.
- remove(prev_removed_branch: Branch, next_removed_branch: Branch, prev_split_branch: Branch, next_split_branch: Branch, cut_node: 'Node' | None = None) None[source]
Update this recombination when the surrounding topology is pruned.
Mirrors Recombination::remove (both overloads).
Interval and IntervalInfo — time-height intervals on ARG branches.
Mirrors Interval.cpp / Interval.hpp.
An Interval represents a window [lb, ub] in coalescent time on a specific Branch. The BSP stores a list of current Intervals as its forward-pass state space; the TSP does the same.
IntervalInfo is a lightweight, hashable summary used as a dict key inside BSP.transfer() to accumulate transfer weights from multiple source intervals into a single target interval.
- class pysinger.data.interval.Interval(branch: Branch, lb: float, ub: float, start_pos: int)[source]
A (branch, lb, ub) time interval with associated forward-HMM state.
- branch
- Type:
The ARG branch this interval lives on.
- lb
- Type:
Lower time bound.
- ub
- Type:
Upper time bound.
- start_pos
created (used during traceback).
- Type:
Index into the coordinate grid where this interval was
- source_weights
- Type:
Weights of source intervals (BSP transfer).
- source_intervals
- Type:
Corresponding source Interval objects.
- node
- Type:
Optional Node pointer (used by TSP for point-mass intervals).
- branch
- fill_time() None[source]
Set self.time to the exponential-median of [lb, ub].
Mirrors Interval::fill_time() in the C++ code.
- lb
- node: 'Node' | None
- source_intervals: List['Interval']
- source_weights: List[float]
- start_pos
- time: float
- ub
- weight: float
- class pysinger.data.interval.IntervalInfo(branch: Branch, lb: float, ub: float, seed_pos: int = 0)[source]
Lightweight key for BSP.transfer() accumulation maps.
Mirrors C++ Interval_info — a (branch, lb, ub) triple with a seed_pos tiebreaker. Must be hashable so it can serve as a dict key.
- branch
- lb
- seed_pos
- ub
ARG — the Ancestral Recombination Graph, the central data structure.
Mirrors ARG.cpp / ARG.hpp.
An ARG is encoded as a sorted map of Recombination records keyed by genomic position. A marginal tree at position x is obtained by replaying all records from position 0 up to (and including) x.
- The two main MCMC operations are:
- remove(cut_point) — extract a single lineage from the ARG, leaving
the remaining haplotypes connected.
add(joining, added) — thread the lineage back in at new positions.
- class pysinger.data.arg.ARG(Ne: float = 1.0, sequence_length: float = 1.0)[source]
Ancestral Recombination Graph.
- Ne
- Type:
Effective population size (diploid).
- sequence_length
- Type:
Length of the genomic region (base pairs).
- root
- Type:
Sentinel root node (time=inf, index=-1).
- sample_nodes
- Type:
Set of leaf (sample) nodes.
- node_set
- Type:
All non-root nodes.
- recombinations
sentinels at 0 and INT_MAX.
- Type:
SortedDict[pos → Recombination]. Always has
- mutation_sites
- Type:
Sorted set of positions carrying derived alleles.
- mutation_branches
- Type:
pos → set of Branches carrying the mutation.
- coordinates
- Type:
Grid of genomic positions (HMM bins).
- rhos
- Type:
Scaled recombination rate per bin.
- thetas
- Type:
Scaled mutation rate per bin.
- MCMC working variables (reset by clear_remove_info)
- removed_branches
- Type:
pos → Branch that was removed.
- joining_branches
- Type:
pos → Branch that “jumped over” the cut.
- cut_tree
- Type:
Marginal tree at the cut position.
- cut_pos, cut_node
- Type:
Position and sentinel node for the cut.
- start, end
- Type:
Genomic extent of the current MCMC window.
- add(new_joining_branches: sortedcontainers.SortedDict, added_branches: sortedcontainers.SortedDict) None[source]
Thread the removed lineage back in at new positions.
Mirrors ARG::add.
- approx_sample_recombinations() None[source]
Sample start_times and finalize derived fields for all recombinations.
Mirrors ARG::approx_sample_recombinations / RSP_smc::approx_sample_recombination. Sets source_branch, start_time, then calls _find_target_branch and _find_recomb_info so that merging_branch (used by BSP) is valid.
- compute_rhos_thetas(r: float, m: float) None[source]
Compute per-bin scaled recombination/mutation rates.
Mirrors ARG::compute_rhos_thetas(double r, double m).
- discretize(bin_size: float) None[source]
Build coordinate grid, placing breakpoints at recombinations.
Mirrors ARG::discretize.
- get_check_points() Set[float][source]
Return the set of recombination positions that need sanity checks.
- get_query_node_at(x: float) Node | None[source]
Return the query node (lower_node of the removed branch at x).
- get_tree_at(x: float) Tree[source]
Return the marginal tree at position x.
Replays all Recombination records with pos <= x.
- remove(cut_point: Tuple[float, Branch, float]) None[source]
Remove a lineage from the ARG.
cut_point = (pos, center_branch, cut_time).
After this call: - self.removed_branches maps positions to the removed branch. - self.joining_branches maps positions to the “joining” branch. - self.start / self.end delimit the affected genomic region. - self.start_tree is the marginal tree at start (minus the cut).
HMM
CoalescentCalculator — piecewise-exponential coalescent CDF/quantile.
Mirrors Coalescent_calculator.cpp / Coalescent_calculator.hpp.
Given a set of branches spanning a range of times, computes the probability that a new lineage coalesces in any time interval [lb, ub]. The coalescent rate at time t is the number of lineages currently alive at t (i.e., spanning t).
The CDF is piecewise-exponential: within each interval the rate is constant, so the survival function is exp(−rate * Δt).
- class pysinger.hmm.coalescent.CoalescentCalculator(cut_time: float)[source]
Piecewise-exponential coalescent CDF for a set of branches.
Usage:
cc = CoalescentCalculator(cut_time=0.0) cc.compute(set_of_branches) p = cc.weight(lb, ub) # probability of coalescence in [lb, ub] t = cc.time(lb, ub) # representative time in [lb, ub]
- compute(branches: Set['Branch']) None[source]
Recompute the CDF from branches.
Mirrors Coalescent_calculator::compute.
- prob(x: float) float[source]
Return cumulative coalescence probability at time x.
Mirrors Coalescent_calculator::prob.
- quantile(p: float) float[source]
Return the time t such that prob(t) == p.
Mirrors Coalescent_calculator::quantile.
Emission models for the BSP/TSP forward HMM.
Mirrors Binary_emission.cpp, Polar_emission.cpp, and Emission.hpp.
- The abstract base class Emission defines two methods:
null_emit(branch, time, theta, query_node) — no mutation in bin mut_emit(branch, time, theta, bin_size, mut_set, query_node) — mutations in bin
BinaryEmission — symmetric infinite-sites model (C++ Binary_emission). PolarEmission — polarised (ancestral/derived) model (C++ Polar_emission).
- class pysinger.hmm.emission.BinaryEmission[source]
Symmetric infinite-sites emission model.
Mirrors Binary_emission.cpp.
The emission probability for a branch (lower, upper) and query node with representative time t on the branch is:
- P(data) ∝ exp(-θ * l_lower) * (θ/bin_size)^s_lower
exp(-θ * l_upper) * (θ/bin_size)^s_upper
exp(-θ * l_query) * (θ/bin_size)^s_query
divided by the “old” probability (without the threaded lineage) to get the ratio used in the HMM.
- emit(branch: Branch, time: float, theta: float, bin_size: float, emissions: list, node: Node) float[source]
Emit using pre-computed diff counts (used by TSP).
- class pysinger.hmm.emission.Emission[source]
Abstract emission model for the HMM.
- emit(branch: Branch, time: float, theta: float, bin_size: float, emissions: list, node: Node) float[source]
Emit using pre-computed diff counts (used by TSP).
- class pysinger.hmm.emission.PolarEmission(penalty: float = 1.0, ancestral_prob: float = 0.5)[source]
Polarised emission model for ancestral/derived alleles.
Mirrors Polar_emission.cpp.
- Additional parameters:
penalty: Cost of derived alleles in the query node. ancestral_prob: Prior probability that allele is ancestral at root.
- emit(branch: Branch, time: float, theta: float, bin_size: float, emissions: list, node: Node) float[source]
Emit using pre-computed diff counts (used by TSP).
BSP — Branch Sequence Propagator, the forward HMM for branch threading.
Mirrors BSP_smc.cpp / BSP_smc.hpp.
The BSP computes the HMM forward probabilities over a set of Interval objects (each representing a (branch, time) cell). At each genomic position the HMM:
Optionally applies a Recombination transfer (moves probability mass from old intervals to new ones according to the topology change).
Advances by one bin (forward step), mixing staying-in-place with recombining to a new time.
Multiplies by the emission probability (null or with mutations).
After the forward pass, sample_joining_branches() performs a traceback to return a map of pos → Branch for threading.
- class pysinger.hmm.bsp.BSP[source]
Branch Sequence Propagator — forward HMM for the branch dimension.
State space: a list of Interval objects. Forward probabilities: forward_probs[step][interval_idx].
Mirrors BSP_smc in the C++ code.
- forward(rho: float) None[source]
Advance forward by one bin with recombination rate rho.
Mirrors BSP_smc::forward.
- mut_emit(theta: float, bin_size: float, mut_set: Set[float], query_node: 'Node' | None) None[source]
Apply mutation emission.
Mirrors BSP_smc::mut_emit.
- null_emit(theta: float, query_node: 'Node' | None) None[source]
Apply null-emission (no mutations in bin).
Mirrors BSP_smc::null_emit.
- sample_joining_branches(start_index: int, coordinates: List[float]) sortedcontainers.SortedDict[source]
Traceback to sample a joining-branch map.
Mirrors BSP_smc::sample_joining_branches. Returns SortedDict[pos → Branch].
- sanity_check(r: Recombination) None[source]
Zero out invalid point-mass intervals at recombination nodes.
Mirrors BSP_smc::sanity_check.
- start(branches: Set[Branch], t: float) None[source]
Initialise the forward pass at the left boundary.
branches is the set of branches in the starting tree. t is the cut time (lineage starts at time t).
Mirrors BSP_smc::start.
- transfer(r: Recombination) None[source]
Apply topology change r to the interval state space.
Mirrors BSP_smc::transfer.
TSP — Time Sequence Propagator, the forward HMM for coalescence times.
Mirrors TSP_smc.cpp / TSP_smc.hpp.
The TSP operates on a single Branch and samples a representative coalescence time in each genomic bin. It uses a PSMC-style transition kernel (psmc_prob) rather than the coalescent-CDF-based kernel of BSP.
Key public API
start(branch, t) — initialise at the left boundary forward(rho) — advance one bin transfer(r, prev_branch, next_branch)— apply topology change recombine(prev_branch, next_branch) — re-sample at a recombination null_emit(theta, query_node) — apply no-mutation emission mut_emit(theta, bin_size, mut_set, query_node) — apply mutation emission sample_joining_nodes(start_index, coordinates) — traceback → Dict[pos, Node]
- class pysinger.hmm.tsp.TSP[source]
Time Sequence Propagator — forward HMM for the time dimension.
State space: a list of Interval objects (finely gridded over a Branch). Forward probabilities: forward_probs[step][interval_idx].
Mirrors TSP_smc in the C++ code.
- mut_emit(theta: float, bin_size: float, mut_set: Set[float], query_node: Node) None[source]
Apply mutation emission.
Mirrors TSP_smc::mut_emit.
- null_emit(theta: float, query_node: Node) None[source]
Apply null emission (no mutations).
Mirrors TSP_smc::null_emit.
- recombine(prev_branch: Branch, next_branch: Branch) None[source]
Re-sample interval distribution at a recombination.
Mirrors TSP_smc::recombine.
- sample_joining_nodes(start_index: int, coordinates: List[float]) Dict[float, Node | None][source]
Traceback to sample a joining-node map.
Returns Dict[pos → Node]. Mirrors TSP_smc::sample_joining_nodes.
- start(branch: Branch, t: float) None[source]
Initialise forward pass at the left boundary.
Mirrors TSP_smc::start.
- transfer(r: Recombination, prev_branch: Branch, next_branch: Branch) None[source]
Apply topology change r.
Mirrors TSP_smc::transfer.
MCMC
Threader — MCMC move: remove a lineage and re-thread it.
Mirrors Threader_smc.cpp / Threader_smc.hpp.
- Two public entry points:
thread(arg, node) — add a new leaf node (initial threading) internal_rethread(arg, cut) — MCMC move with Metropolis acceptance
- class pysinger.mcmc.threader.Threader(cutoff: float = 0.0, gap: float = 0.02)[source]
BSP + TSP threader for adding/rethreading a lineage in an ARG.
- Parameters:
cutoff (float) – BSP state-space pruning threshold (bsp_c in C++).
gap (float) – TSP time grid quantile gap (tsp_q in C++).
I/O
VCF reader — load phased or haploid VCF into Node objects.
Mirrors Sampler::naive_read_vcf / naive_read_vcf_haploid in Sampler.cpp.
Returns a list of Node objects with mutation sites set, one node per haplotype (2 per individual for diploid, 1 per individual for haploid).
- pysinger.io.vcf_reader.read_vcf_haploid(vcf_file: str, start_pos: float = 0.0, end_pos: float = inf) Tuple[List[Node], float][source]
Read a haploid VCF and return (nodes, sequence_length).
One Node per individual (column after FORMAT).
Mirrors Sampler::naive_read_vcf_haploid.
- pysinger.io.vcf_reader.read_vcf_phased(vcf_file: str, start_pos: float = 0.0, end_pos: float = inf) Tuple[List[Node], float][source]
Read a phased VCF and return (nodes, sequence_length).
One Node per haplotype (2 × n_individuals). Mutations are stored as positions relative to start_pos.
Mirrors Sampler::naive_read_vcf.
- Parameters:
vcf_file – Path to the .vcf file.
start_pos – Genomic region to load (half-open [start_pos, end_pos)).
end_pos – Genomic region to load (half-open [start_pos, end_pos)).
- Returns:
nodes (List[Node]) – Haplotype nodes (index 0 .. 2n-1).
sequence_length (float) – end_pos - start_pos.
tskit writer — convert a pysinger ARG to a tskit TreeSequence.
Mirrors the conceptual output of SINGER’s write() method, but produces a tskit.TreeSequence that can be analysed with the tskit/msprime ecosystem.
- The conversion:
Add all non-root nodes (time = node.time * 2*Ne) as tskit individuals.
Walk the ARG recombinations to build a list of (left, right, parent, child) edges spanning each genomic interval.
Sort edges (required by tskit) and call tables.sort() + tables.tree_sequence().
- pysinger.io.tskit_writer.arg_to_tskit(arg: ARG, Ne: float = 1.0)[source]
Convert arg to a
tskit.TreeSequence.- Parameters:
arg (ARG) – The pysinger ARG object.
Ne (float) – Effective population size used to convert coalescent time units to generations (t_generations = t_coalescent * 2 * Ne).
- Returns:
ts
- Return type:
tskit.TreeSequence
Rates
RateMap — piecewise-constant rate map (recombination or mutation).
Mirrors Rate_map.cpp / Rate_map.hpp.
Reconstruction
FitchReconstruction — Fitch parsimony ancestral state reconstruction.
Mirrors Fitch_reconstruction.cpp / Fitch_reconstruction.hpp.
Given a Tree, performs a two-pass Fitch parsimony reconstruction at each genomic position to assign ancestral states to internal nodes.
Pass 1 (pruning / up-pass): bottom-up, assigns a state to each node based on the intersection (or union when intersection is empty) of its children. We encode ambiguity as 0.5.
Pass 2 (peeling / down-pass): top-down, resolves ambiguous states by propagating the parent’s definite state downward.
- class pysinger.reconstruction.fitch.FitchReconstruction(tree: Tree)[source]
Two-pass Fitch parsimony reconstruction on a Tree.
Usage:
fr = FitchReconstruction(tree) fr.reconstruct(pos) # assigns states for position *pos* fr.update(recombination) # advance tree topology
- reconstruct(pos: float) None[source]
Perform Fitch reconstruction at genomic position pos.
Writes the inferred ancestral states back into each node via
node.write_state(pos, state).Mirrors Fitch_reconstruction::reconstruct.
- update(r: Recombination) None[source]
Advance the topology by applying recombination r.
Mirrors Fitch_reconstruction::update.