Data structures

All core types live in pysinger.data. They form the ARG representation that the HMM and MCMC modules operate on.

Node

class pysinger.data.node.Node(time: float, index: int = 0, mutation_sites: sortedcontainers.SortedDict = <factory>)[source]

A node in the Ancestral Recombination Graph.

Parameters:
  • time (float) – Coalescence time (in units of 2Ne generations). Leaf nodes have time == 0; the global root sentinel has time == inf.

  • index (int) – Integer label. Sample nodes receive 0-based indices; internal nodes receive larger indices; the root sentinel has index == -1.

  • mutation_sites (sortedcontainers.SortedDict) – Maps genomic position → allele state (1 = derived). A sentinel entry at position -1 with state 0 is kept so look-ups before the first real position always succeed.

add_mutation(pos: float) None[source]

Record that this node carries the derived allele at pos.

get_state(pos: float) float[source]

Return the allele state at pos (exact match only).

Mirrors C++ Node::get_state: returns the stored value if there is an exact key at pos, otherwise 0. This means mutations are point events — a mutation at position 100 does NOT carry its state forward to position 101 or beyond.

write_state(pos: float, state: float) None[source]

Overwrite the allele state; state==0 removes the entry.

A vertex in the ARG. Leaf nodes (samples) have time = 0; internal nodes have time > 0 (in coalescent units); the root sentinel has time = inf, index = -1.

Identity semantics. Node uses Python object identity for equality and hashing (__eq__ is self is other, __hash__ returns id(self)). This mirrors the C++ implementation where nodes are compared by pointer. It means two Node objects with identical fields are still considered different – what matters is the object itself, not its contents.

Allele states. Each node stores a SortedDict mapping genomic position \(\to\) allele state (0 = ancestral, 1 = derived). A sentinel at position \(-1\) with state 0 ensures lookups before the first real site always return 0. get_state(pos) returns an exact-match lookup (mutations are point events, not intervals).

Branch

class pysinger.data.branch.Branch(lower_node: 'Node' | None = None, upper_node: 'Node' | None = None)[source]

Directed edge (lower_node, upper_node) in the ARG.

Invariant: lower_node.time <= upper_node.time (except for the null branch where both are None).

Hashability: based on Python object identity of the two node pointers, exactly as the C++ branch_hash struct.

is_null() bool[source]

True iff this is the null (empty) branch.

property length: float

Branch length in time units. Returns inf for branches to root.

A directed edge (lower_node, upper_node) with lower_node.time <= upper_node.time. Branches are immutable after construction (__setattr__ raises). Equality and hashing are based on node identity (not node field values), matching the C++ branch_hash struct.

The null branch (Branch()) has both nodes as None and is used as a sentinel throughout the codebase (e.g., to mark the end of a removed lineage). bool(branch) returns False for null branches.

Tree

class pysinger.data.tree.Tree[source]

Marginal coalescent tree at one genomic position.

parents
Type:

child → parent mapping (all non-root nodes present as keys).

children
Type:

parent → set of children (only internal nodes as keys).

add(added_branch: Branch, joining_branch: Branch, cut_node: 'Node' | None) None[source]

Insert added_branch joining at joining_branch.

Mirrors Tree::add.

backward_update(r: Recombination) None[source]

Undo recombination r moving left along the genome.

copy() Tree[source]

Shallow-copy the topology dicts (nodes are shared).

delete_branch(branch: Branch) None[source]

Remove branch from the tree.

find_joining_branch(removed_branch: Branch) Branch[source]

Return Branch(sibling, grandparent) for the removed branch.

Mirrors Tree::find_joining_branch.

find_sibling(n: Node) 'Node' | None[source]

Return the sibling of n (the other child of n’s parent).

forward_update(r: Recombination) None[source]

Apply recombination r moving right along the genome.

get_branches() Set['Branch'][source]

Return all branches in the tree as a set of Branch objects.

insert_branch(branch: Branch) None[source]

Add branch to the tree.

internal_backward_update(r: Recombination, cut_time: float) None[source]

backward_update restricted to branches with upper_node.time > cut_time.

internal_forward_update(r: Recombination, cut_time: float) None[source]

forward_update restricted to branches with upper_node.time > cut_time.

length() float[source]

Total branch length (excluding root sentinel branches).

remove(branch: Branch, cut_node: Node) None[source]

Remove branch from the tree and replace with cut lineage.

Mirrors Tree::remove. After this call the tree has cut_node as a leaf connected to branch.lower_node, and the sibling is directly connected to branch.upper_node’s parent.

A marginal coalescent tree at one genomic position. Stored as two dicts:

  • parents: Dict[Node, Node] – child \(\to\) parent

  • children: Dict[Node, Set[Node]] – parent \(\to\) children

The topology is updated by applying Recombination records:

  • forward_update(r): delete r.deleted_branches, insert r.inserted_branches

  • backward_update(r): reverse of forward_update

find_joining_branch(removed_branch) is a key method used during MCMC. Given the branch being removed, it returns Branch(sibling, grandparent) – the branch that “takes over” when the removed branch’s coalescence node is pruned.

Recombination

class pysinger.data.recombination.Recombination(deleted_branches: Set | None = None, inserted_branches: Set | None = None)[source]

Topology change record for one recombination breakpoint.

pos
Type:

Genomic position of the breakpoint (set externally).

deleted_branches
Type:

Branches that exist before (left of) pos.

inserted_branches
Type:

Branches that exist after (right of) pos.

Derived attributes (computed by find_recomb_info)
source_branch
Type:

The lineage that recombines.

target_branch
Type:

The lineage it joins at coalescence.

inserted_node
Type:

New internal node created at the coalescence.

deleted_node
Type:

Old internal node removed by the topology change.

merging_branch
Type:

The branch that takes over after the removed node.

recombined_branch
Type:

Part of the source lineage below start_time.

start_time
Type:

Height at which recombination begins.

lower/upper_transfer_branch
Type:

For BSP interval transfer.

source_sister_branch, source_parent_branch
Type:

auxiliary.

add(prev_added_branch: Branch, next_added_branch: Branch, prev_joining_branch: Branch, next_joining_branch: Branch, cut_node: 'Node' | None = None) None[source]

Update this recombination when a new lineage is threaded in.

Mirrors Recombination::add.

affect(b: Branch) bool[source]

True iff b is in deleted_branches.

create(b: Branch) bool[source]

True iff b is in inserted_branches.

remove(prev_removed_branch: Branch, next_removed_branch: Branch, prev_split_branch: Branch, next_split_branch: Branch, cut_node: 'Node' | None = None) None[source]

Update this recombination when the surrounding topology is pruned.

Mirrors Recombination::remove (both overloads).

trace_backward(t: float, curr_branch: Branch) Branch[source]

Return the branch that maps to curr_branch before pos.

Mirrors Recombination::trace_backward.

trace_forward(t: float, curr_branch: Branch) Branch[source]

Return the branch at time t that curr_branch maps to after pos.

Mirrors Recombination::trace_forward.

A topology change at a single genomic position. Stores:

  • deleted_branches: branches that exist before (left of) this position

  • inserted_branches: branches that exist after (right of) this position

From these, it derives several named branches used by the HMM:

Branch

Meaning

source_branch

The lineage that recombines (detaches)

target_branch

The lineage it re-coalesces with

merging_branch

Sibling-to-grandparent branch after the source node is removed

recombined_branch

Source lineage below the recombination height, re-attached

lower_transfer_branch

Target branch below the new coalescence node

upper_transfer_branch

Target branch above the new coalescence node

start_time

Height at which recombination occurs

These derived fields are computed by _find_nodes(), _find_target_branch(), and _find_recomb_info(), and are essential for the BSP/TSP transfer steps.

trace_forward(t, branch) and trace_backward(t, branch) map a branch at time \(t\) across the topology change, used by ARG.remove() to trace a lineage through recombination events.

Interval and IntervalInfo

class pysinger.data.interval.Interval(branch: Branch, lb: float, ub: float, start_pos: int)[source]

A (branch, lb, ub) time interval with associated forward-HMM state.

branch
Type:

The ARG branch this interval lives on.

lb
Type:

Lower time bound.

ub
Type:

Upper time bound.

start_pos

created (used during traceback).

Type:

Index into the coordinate grid where this interval was

weight
Type:

Coalescent probability mass in [lb, ub] (set by BSP).

time
Type:

Representative time point in [lb, ub] (median or median-CDF).

source_weights
Type:

Weights of source intervals (BSP transfer).

source_intervals
Type:

Corresponding source Interval objects.

node
Type:

Optional Node pointer (used by TSP for point-mass intervals).

fill_time() None[source]

Set self.time to the exponential-median of [lb, ub].

Mirrors Interval::fill_time() in the C++ code.

full(cut_time: float) bool[source]

True iff this interval spans the entire branch above cut_time.

class pysinger.data.interval.IntervalInfo(branch: Branch, lb: float, ub: float, seed_pos: int = 0)[source]

Lightweight key for BSP.transfer() accumulation maps.

Mirrors C++ Interval_info — a (branch, lb, ub) triple with a seed_pos tiebreaker. Must be hashable so it can serve as a dict key.

Interval represents a (branch, [lb, ub]) cell in the HMM state space. The BSP maintains a list of Interval objects as its current states. Key fields:

  • weight: coalescent probability mass \(F(u_b) - F(l_b)\) (set by CoalescentCalculator)

  • time: representative time point (exponential median via fill_time())

  • source_weights / source_intervals: traceback pointers from BSP transfer

fill_time() computes the exponential median of \([lb, ub]\): the time \(t\) where \(F_{\text{Exp}(1)}(t) = \frac{1}{2}[F_{\text{Exp}(1)}(lb) + F_{\text{Exp}(1)}(ub)]\). For the standard exponential, \(F(t) = 1 - e^{-t}\), giving:

\[ t = -\log\!\Bigl(1 - \tfrac{1}{2}\bigl[(1-e^{-lb}) + (1-e^{-ub})\bigr]\Bigr) \]

IntervalInfo is a lightweight hashable key (branch, lb, ub) used as dict keys in BSP.transfer() to accumulate probability mass from multiple source intervals into a single target.

ARG

class pysinger.data.arg.ARG(Ne: float = 1.0, sequence_length: float = 1.0)[source]

Ancestral Recombination Graph.

Ne
Type:

Effective population size (diploid).

sequence_length
Type:

Length of the genomic region (base pairs).

root
Type:

Sentinel root node (time=inf, index=-1).

sample_nodes
Type:

Set of leaf (sample) nodes.

node_set
Type:

All non-root nodes.

recombinations

sentinels at 0 and INT_MAX.

Type:

SortedDict[pos → Recombination]. Always has

mutation_sites
Type:

Sorted set of positions carrying derived alleles.

mutation_branches
Type:

pos → set of Branches carrying the mutation.

coordinates
Type:

Grid of genomic positions (HMM bins).

rhos
Type:

Scaled recombination rate per bin.

thetas
Type:

Scaled mutation rate per bin.

MCMC working variables (reset by clear_remove_info)
removed_branches
Type:

pos → Branch that was removed.

joining_branches
Type:

pos → Branch that “jumped over” the cut.

cut_tree
Type:

Marginal tree at the cut position.

cut_pos, cut_node
Type:

Position and sentinel node for the cut.

start, end
Type:

Genomic extent of the current MCMC window.

add(new_joining_branches: sortedcontainers.SortedDict, added_branches: sortedcontainers.SortedDict) None[source]

Thread the removed lineage back in at new positions.

Mirrors ARG::add.

add_sample(n: Node) None[source]

Register sample node n and update mutation_sites.

approx_sample_recombinations() None[source]

Sample start_times and finalize derived fields for all recombinations.

Mirrors ARG::approx_sample_recombinations / RSP_smc::approx_sample_recombination. Sets source_branch, start_time, then calls _find_target_branch and _find_recomb_info so that merging_branch (used by BSP) is valid.

build_singleton_arg(n: Node) None[source]

Build an ARG containing a single sample node n.

clear_remove_info() None[source]

Reset MCMC working state.

compute_rhos_thetas(r: float, m: float) None[source]

Compute per-bin scaled recombination/mutation rates.

Mirrors ARG::compute_rhos_thetas(double r, double m).

discretize(bin_size: float) None[source]

Build coordinate grid, placing breakpoints at recombinations.

Mirrors ARG::discretize.

get_arg_length() float[source]

Total ARG length (sum of branch_length * genomic_span).

get_check_points() Set[float][source]

Return the set of recombination positions that need sanity checks.

get_index(x: float) int[source]

Return index i such that coordinates[i] <= x < coordinates[i+1].

get_query_node_at(x: float) Node | None[source]

Return the query node (lower_node of the removed branch at x).

get_tree_at(x: float) Tree[source]

Return the marginal tree at position x.

Replays all Recombination records with pos <= x.

remove(cut_point: Tuple[float, Branch, float]) None[source]

Remove a lineage from the ARG.

cut_point = (pos, center_branch, cut_time).

After this call: - self.removed_branches maps positions to the removed branch. - self.joining_branches maps positions to the “joining” branch. - self.start / self.end delimit the affected genomic region. - self.start_tree is the marginal tree at start (minus the cut).

sample_internal_cut() Tuple[float, Branch, float][source]

Sample a random (pos, branch, cut_time) for the next MCMC step.

Mirrors ARG::sample_internal_cut.

The central data structure. An ARG is encoded as a SortedDict[position -> Recombination] with sentinels at position 0 and \(\infty\). Retrieving the marginal tree at position \(x\) is done by replaying all recombinations up to \(x\).

Key operations

remove(cut_point): Extract a lineage from the ARG. Traces the cut branch forward and backward through recombination records, building removed_branches and joining_branches maps. Modifies the recombination records in-place (via Recombination.remove()), re-maps mutations, and updates the start_tree / end_tree.

add(joining_branches, added_branches): Thread a lineage back in. Walks through the added positions, either modifying existing recombination records (via Recombination.add()) or creating new ones. Imputes mutation states for the newly threaded nodes via majority-rule.

get_arg_length(): Total ARG branch length \(= \sum_{\text{trees}} \sum_{\text{branches } b} (t_{\text{upper}}(b) - t_{\text{lower}}(b)) \times \text{genomic span}\). Used by the rescaling step.

approx_sample_recombinations(): After threading, assigns start_time and derived fields to all recombination records by finding the source branch and sampling a recombination time in \([\max(t_c, t_{\text{lower}}), \min(t_{\text{upper}}, t_{\text{inserted}})]\).

sample_internal_cut(): Sample a random (position, branch, time) triple for the next MCMC move. The time is drawn uniformly over the tree height; the branch is the one spanning that time.