Data structures
All core types live in pysinger.data. They form the ARG representation that the HMM and MCMC modules operate on.
Node
- class pysinger.data.node.Node(time: float, index: int = 0, mutation_sites: sortedcontainers.SortedDict = <factory>)[source]
A node in the Ancestral Recombination Graph.
- Parameters:
time (float) – Coalescence time (in units of 2Ne generations). Leaf nodes have time == 0; the global root sentinel has time == inf.
index (int) – Integer label. Sample nodes receive 0-based indices; internal nodes receive larger indices; the root sentinel has index == -1.
mutation_sites (sortedcontainers.SortedDict) – Maps genomic position → allele state (1 = derived). A sentinel entry at position -1 with state 0 is kept so look-ups before the first real position always succeed.
- get_state(pos: float) float[source]
Return the allele state at pos (exact match only).
Mirrors C++ Node::get_state: returns the stored value if there is an exact key at pos, otherwise 0. This means mutations are point events — a mutation at position 100 does NOT carry its state forward to position 101 or beyond.
A vertex in the ARG. Leaf nodes (samples) have time = 0; internal nodes have time > 0 (in coalescent units); the root sentinel has time = inf, index = -1.
Identity semantics. Node uses Python object identity for equality and hashing (__eq__ is self is other, __hash__ returns id(self)). This mirrors the C++ implementation where nodes are compared by pointer. It means two Node objects with identical fields are still considered different – what matters is the object itself, not its contents.
Allele states. Each node stores a SortedDict mapping genomic position \(\to\) allele state (0 = ancestral, 1 = derived). A sentinel at position \(-1\) with state 0 ensures lookups before the first real site always return 0. get_state(pos) returns an exact-match lookup (mutations are point events, not intervals).
Branch
- class pysinger.data.branch.Branch(lower_node: 'Node' | None = None, upper_node: 'Node' | None = None)[source]
Directed edge (lower_node, upper_node) in the ARG.
Invariant: lower_node.time <= upper_node.time (except for the null branch where both are None).
Hashability: based on Python object identity of the two node pointers, exactly as the C++ branch_hash struct.
- property length: float
Branch length in time units. Returns inf for branches to root.
A directed edge (lower_node, upper_node) with lower_node.time <= upper_node.time. Branches are immutable after construction (__setattr__ raises). Equality and hashing are based on node identity (not node field values), matching the C++ branch_hash struct.
The null branch (Branch()) has both nodes as None and is used as a sentinel throughout the codebase (e.g., to mark the end of a removed lineage). bool(branch) returns False for null branches.
Tree
- class pysinger.data.tree.Tree[source]
Marginal coalescent tree at one genomic position.
- parents
- Type:
child → parent mapping (all non-root nodes present as keys).
- add(added_branch: Branch, joining_branch: Branch, cut_node: 'Node' | None) None[source]
Insert added_branch joining at joining_branch.
Mirrors Tree::add.
- backward_update(r: Recombination) None[source]
Undo recombination r moving left along the genome.
- find_joining_branch(removed_branch: Branch) Branch[source]
Return Branch(sibling, grandparent) for the removed branch.
Mirrors Tree::find_joining_branch.
- find_sibling(n: Node) 'Node' | None[source]
Return the sibling of n (the other child of n’s parent).
- forward_update(r: Recombination) None[source]
Apply recombination r moving right along the genome.
- internal_backward_update(r: Recombination, cut_time: float) None[source]
backward_update restricted to branches with upper_node.time > cut_time.
- internal_forward_update(r: Recombination, cut_time: float) None[source]
forward_update restricted to branches with upper_node.time > cut_time.
A marginal coalescent tree at one genomic position. Stored as two dicts:
parents: Dict[Node, Node]– child \(\to\) parentchildren: Dict[Node, Set[Node]]– parent \(\to\) children
The topology is updated by applying Recombination records:
forward_update(r): deleter.deleted_branches, insertr.inserted_branchesbackward_update(r): reverse offorward_update
find_joining_branch(removed_branch) is a key method used during MCMC. Given the branch being removed, it returns Branch(sibling, grandparent) – the branch that “takes over” when the removed branch’s coalescence node is pruned.
Recombination
- class pysinger.data.recombination.Recombination(deleted_branches: Set | None = None, inserted_branches: Set | None = None)[source]
Topology change record for one recombination breakpoint.
- pos
- Type:
Genomic position of the breakpoint (set externally).
- deleted_branches
- Type:
Branches that exist before (left of) pos.
- inserted_branches
- Type:
Branches that exist after (right of) pos.
- Derived attributes (computed by find_recomb_info)
- source_branch
- Type:
The lineage that recombines.
- target_branch
- Type:
The lineage it joins at coalescence.
- inserted_node
- Type:
New internal node created at the coalescence.
- deleted_node
- Type:
Old internal node removed by the topology change.
- merging_branch
- Type:
The branch that takes over after the removed node.
- recombined_branch
- Type:
Part of the source lineage below start_time.
- start_time
- Type:
Height at which recombination begins.
- lower/upper_transfer_branch
- Type:
For BSP interval transfer.
- source_sister_branch, source_parent_branch
- Type:
auxiliary.
- add(prev_added_branch: Branch, next_added_branch: Branch, prev_joining_branch: Branch, next_joining_branch: Branch, cut_node: 'Node' | None = None) None[source]
Update this recombination when a new lineage is threaded in.
Mirrors Recombination::add.
- remove(prev_removed_branch: Branch, next_removed_branch: Branch, prev_split_branch: Branch, next_split_branch: Branch, cut_node: 'Node' | None = None) None[source]
Update this recombination when the surrounding topology is pruned.
Mirrors Recombination::remove (both overloads).
A topology change at a single genomic position. Stores:
deleted_branches: branches that exist before (left of) this positioninserted_branches: branches that exist after (right of) this position
From these, it derives several named branches used by the HMM:
Branch |
Meaning |
|---|---|
|
The lineage that recombines (detaches) |
|
The lineage it re-coalesces with |
|
Sibling-to-grandparent branch after the source node is removed |
|
Source lineage below the recombination height, re-attached |
|
Target branch below the new coalescence node |
|
Target branch above the new coalescence node |
|
Height at which recombination occurs |
These derived fields are computed by _find_nodes(), _find_target_branch(), and _find_recomb_info(), and are essential for the BSP/TSP transfer steps.
trace_forward(t, branch) and trace_backward(t, branch) map a branch at time \(t\) across the topology change, used by ARG.remove() to trace a lineage through recombination events.
Interval and IntervalInfo
- class pysinger.data.interval.Interval(branch: Branch, lb: float, ub: float, start_pos: int)[source]
A (branch, lb, ub) time interval with associated forward-HMM state.
- branch
- Type:
The ARG branch this interval lives on.
- lb
- Type:
Lower time bound.
- ub
- Type:
Upper time bound.
- start_pos
created (used during traceback).
- Type:
Index into the coordinate grid where this interval was
- source_weights
- Type:
Weights of source intervals (BSP transfer).
- source_intervals
- Type:
Corresponding source Interval objects.
- node
- Type:
Optional Node pointer (used by TSP for point-mass intervals).
- class pysinger.data.interval.IntervalInfo(branch: Branch, lb: float, ub: float, seed_pos: int = 0)[source]
Lightweight key for BSP.transfer() accumulation maps.
Mirrors C++ Interval_info — a (branch, lb, ub) triple with a seed_pos tiebreaker. Must be hashable so it can serve as a dict key.
Interval represents a (branch, [lb, ub]) cell in the HMM state space. The BSP maintains a list of Interval objects as its current states. Key fields:
weight: coalescent probability mass \(F(u_b) - F(l_b)\) (set byCoalescentCalculator)time: representative time point (exponential median viafill_time())source_weights/source_intervals: traceback pointers from BSP transfer
fill_time() computes the exponential median of \([lb, ub]\): the time \(t\) where \(F_{\text{Exp}(1)}(t) = \frac{1}{2}[F_{\text{Exp}(1)}(lb) + F_{\text{Exp}(1)}(ub)]\). For the standard exponential, \(F(t) = 1 - e^{-t}\), giving:
IntervalInfo is a lightweight hashable key (branch, lb, ub) used as dict keys in BSP.transfer() to accumulate probability mass from multiple source intervals into a single target.
ARG
- class pysinger.data.arg.ARG(Ne: float = 1.0, sequence_length: float = 1.0)[source]
Ancestral Recombination Graph.
- Ne
- Type:
Effective population size (diploid).
- sequence_length
- Type:
Length of the genomic region (base pairs).
- root
- Type:
Sentinel root node (time=inf, index=-1).
- sample_nodes
- Type:
Set of leaf (sample) nodes.
- node_set
- Type:
All non-root nodes.
- recombinations
sentinels at 0 and INT_MAX.
- Type:
SortedDict[pos → Recombination]. Always has
- mutation_sites
- Type:
Sorted set of positions carrying derived alleles.
- mutation_branches
- Type:
pos → set of Branches carrying the mutation.
- coordinates
- Type:
Grid of genomic positions (HMM bins).
- rhos
- Type:
Scaled recombination rate per bin.
- thetas
- Type:
Scaled mutation rate per bin.
- MCMC working variables (reset by clear_remove_info)
- removed_branches
- Type:
pos → Branch that was removed.
- joining_branches
- Type:
pos → Branch that “jumped over” the cut.
- cut_tree
- Type:
Marginal tree at the cut position.
- cut_pos, cut_node
- Type:
Position and sentinel node for the cut.
- start, end
- Type:
Genomic extent of the current MCMC window.
- add(new_joining_branches: sortedcontainers.SortedDict, added_branches: sortedcontainers.SortedDict) None[source]
Thread the removed lineage back in at new positions.
Mirrors ARG::add.
- approx_sample_recombinations() None[source]
Sample start_times and finalize derived fields for all recombinations.
Mirrors ARG::approx_sample_recombinations / RSP_smc::approx_sample_recombination. Sets source_branch, start_time, then calls _find_target_branch and _find_recomb_info so that merging_branch (used by BSP) is valid.
- compute_rhos_thetas(r: float, m: float) None[source]
Compute per-bin scaled recombination/mutation rates.
Mirrors ARG::compute_rhos_thetas(double r, double m).
- discretize(bin_size: float) None[source]
Build coordinate grid, placing breakpoints at recombinations.
Mirrors ARG::discretize.
- get_check_points() Set[float][source]
Return the set of recombination positions that need sanity checks.
- get_query_node_at(x: float) Node | None[source]
Return the query node (lower_node of the removed branch at x).
- get_tree_at(x: float) Tree[source]
Return the marginal tree at position x.
Replays all Recombination records with pos <= x.
- remove(cut_point: Tuple[float, Branch, float]) None[source]
Remove a lineage from the ARG.
cut_point = (pos, center_branch, cut_time).
After this call: - self.removed_branches maps positions to the removed branch. - self.joining_branches maps positions to the “joining” branch. - self.start / self.end delimit the affected genomic region. - self.start_tree is the marginal tree at start (minus the cut).
The central data structure. An ARG is encoded as a SortedDict[position -> Recombination] with sentinels at position 0 and \(\infty\). Retrieving the marginal tree at position \(x\) is done by replaying all recombinations up to \(x\).
Key operations
remove(cut_point): Extract a lineage from the ARG. Traces the cut branch forward and backward through recombination records, building removed_branches and joining_branches maps. Modifies the recombination records in-place (via Recombination.remove()), re-maps mutations, and updates the start_tree / end_tree.
add(joining_branches, added_branches): Thread a lineage back in. Walks through the added positions, either modifying existing recombination records (via Recombination.add()) or creating new ones. Imputes mutation states for the newly threaded nodes via majority-rule.
get_arg_length(): Total ARG branch length \(= \sum_{\text{trees}} \sum_{\text{branches } b} (t_{\text{upper}}(b) - t_{\text{lower}}(b)) \times \text{genomic span}\). Used by the rescaling step.
approx_sample_recombinations(): After threading, assigns start_time and derived fields to all recombination records by finding the source branch and sampling a recombination time in \([\max(t_c, t_{\text{lower}}), \min(t_{\text{upper}}, t_{\text{inserted}})]\).
sample_internal_cut(): Sample a random (position, branch, time) triple for the next MCMC move. The time is drawn uniformly over the tree height; the branch is the one spanning that time.