MCMC threading

The MCMC module (pysinger.mcmc) contains the Threader class that combines the BSP and TSP into a single threading operation, used both for initial ARG construction and for MCMC re-threading moves.

Threader

class pysinger.mcmc.threader.Threader(cutoff: float = 0.0, gap: float = 0.02)[source]

BSP + TSP threader for adding/rethreading a lineage in an ARG.

Parameters:
  • cutoff (float) – BSP state-space pruning threshold (bsp_c in C++).

  • gap (float) – TSP time grid quantile gap (tsp_q in C++).

internal_rethread(arg: ARG, cut_point: Tuple[float, Branch, float]) None[source]

MCMC move: remove and re-thread a lineage segment.

Proposes a new threading, accepts with Metropolis ratio. Mirrors Threader_smc::internal_rethread.

thread(arg: ARG, node: Node) None[source]

Add node as a new leaf and thread its lineage through arg.

Mirrors Threader_smc::thread.

thread(arg, node) — initial threading

Used during iterative_start() to add a new leaf node to the ARG:

arg.add_sample(node)         # register node, set up removed_branches
├─ BSP forward pass          # which branch to join at each position?
├─ sample_joining_branches   # stochastic traceback → pos → Branch map
├─ TSP forward pass          # when to coalesce on each branch?
├─ sample_joining_points     # stochastic traceback → pos → Node map
├─ arg.add(joining, added)   # thread the lineage into the ARG
├─ arg.approx_sample_recombs # assign recombination times
└─ arg.clear_remove_info     # clean up working state

internal_rethread(arg, cut_point) — MCMC move

Used during internal_sample() for Metropolis–Hastings re-threading:

arg.remove(cut_point)        # extract lineage → removed_branches, joining_branches
├─ BSP forward pass          # propose new branch path
├─ sample_joining_branches
├─ TSP forward pass          # propose new coalescence times
├─ sample_joining_points
├─ acceptance_ratio          # Metropolis ratio = h_old / h_new
├─ if accept:
│    arg.add(new_joining, new_added)
│  else:
│    arg.add(old_joining, old_removed)     # restore original
├─ arg.approx_sample_recombs
└─ arg.clear_remove_info

Acceptance ratio

The Metropolis ratio is:

\[ \alpha = \frac{h_{\text{old}}}{h_{\text{new}}} \]

where \(h\) is the effective tree height at the cut position. For the cut tree, this is the maximum child node time. If the joining branch reaches the root sentinel, \(h\) is replaced by the coalescence node time. This favours proposals that produce comparable or shorter tree heights.

Emission model assignment

The BSP uses PolarEmission (accounts for ancestral/derived polarisation), while the TSP uses BinaryEmission (symmetric model). This matches the C++ implementation where the branch-level HMM benefits from polarity information but the time-level HMM does not.

Sampler

class pysinger.sampler.Sampler(Ne: float = 1.0, recomb_rate: float = 0.0, mut_rate: float = 0.0, recomb_map: RateMap | None = None, mut_map: RateMap | None = None)[source]

Bayesian ARG sampler.

Parameters:
  • Ne (float) – Haploid effective population size (used as a time-scaling factor).

  • recomb_rate (float) – Per-base-pair per-generation recombination rate. Scaled internally by Ne to get the coalescent-unit rate.

  • mut_rate (float) – Per-base-pair per-generation mutation rate. Scaled by Ne.

  • recomb_map (RateMap or None) – Variable-rate maps. If provided, take precedence over the scalar rates for rho/theta computation.

  • mut_map (RateMap or None) – Variable-rate maps. If provided, take precedence over the scalar rates for rho/theta computation.

internal_sample(num_iters: int, spacing: int = 1) None[source]

Run num_iters MCMC iterations.

Each iteration proposes at least spacing * sequence_length bp of re-threading moves.

Mirrors Sampler::internal_sample.

iterative_start(max_retries: int = 5) None[source]

Thread all sample nodes one by one to build an initial ARG.

Mirrors Sampler::iterative_start.

If a threading step fails (e.g. HMM underflow on long sequences), the entire build is retried with a fresh RNG state up to max_retries times.

load_vcf(vcf_file: str, start: float = 0.0, end: float = inf, haploid: bool = False) None[source]

Load genotype data from a VCF file.

Parameters:
  • vcf_file – Path to the .vcf file.

  • start – Genomic region (half-open).

  • end – Genomic region (half-open).

  • haploid – If True, treat each column as one haplotype (no phased parsing).

iterative_start()

  1. Build a singleton ARG with the first sample node.

  2. For each remaining sample, create a Threader and call thread(arg, node).

  3. Rescale all internal node times so that expected mutations match observed.

Includes retry logic: if a threading step fails due to HMM underflow (more likely on long sequences), the entire build is retried with a fresh RNG state.

internal_sample(num_iters, spacing)

For each iteration:

  1. Repeat until at least spacing * sequence_length bp have been updated:

    • Sample a random cut point.

    • Create a Threader and call internal_rethread(arg, cut_point).

    • If the rethread fails (RuntimeError from underflow), restore the original ARG state and break.

  2. Compute last_arg_length (pre-rescale).

  3. Rescale all node times and recombination start times.

_rescale()

Computes a global scale factor:

\[ s = \frac{S_{\text{obs}}}{\theta_{\text{scaled}} \cdot L_{\text{total}}} \]

Multiplies all internal node times and recombination start_time values by \(s\). This keeps coalescence times calibrated to the mutation rate throughout the MCMC.