I/O and reconstruction

VCF reader

VCF reader — load phased or haploid VCF into Node objects.

Mirrors Sampler::naive_read_vcf / naive_read_vcf_haploid in Sampler.cpp.

Returns a list of Node objects with mutation sites set, one node per haplotype (2 per individual for diploid, 1 per individual for haploid).

pysinger.io.vcf_reader.read_vcf_haploid(vcf_file: str, start_pos: float = 0.0, end_pos: float = inf) Tuple[List[Node], float][source]

Read a haploid VCF and return (nodes, sequence_length).

One Node per individual (column after FORMAT).

Mirrors Sampler::naive_read_vcf_haploid.

pysinger.io.vcf_reader.read_vcf_phased(vcf_file: str, start_pos: float = 0.0, end_pos: float = inf) Tuple[List[Node], float][source]

Read a phased VCF and return (nodes, sequence_length).

One Node per haplotype (2 × n_individuals). Mutations are stored as positions relative to start_pos.

Mirrors Sampler::naive_read_vcf.

Parameters:
  • vcf_file – Path to the .vcf file.

  • start_pos – Genomic region to load (half-open [start_pos, end_pos)).

  • end_pos – Genomic region to load (half-open [start_pos, end_pos)).

Returns:

  • nodes (List[Node]) – Haplotype nodes (index 0 .. 2n-1).

  • sequence_length (float) – end_pos - start_pos.

Two functions:

  • read_vcf_phased(path, start, end): Parses a phased VCF (0|1 format). Each diploid individual produces 2 Node objects (one per haplotype). Each node’s mutation_sites is populated with positions where it carries the derived allele.

  • read_vcf_haploid(path, start, end): Treats each sample column as a single haplotype (for haploid data or pre-split files).

Both return (List[Node], sequence_length). The sequence length is the end parameter (or the last variant position + 1 if end = inf).

tskit writer

tskit writer — convert a pysinger ARG to a tskit TreeSequence.

Mirrors the conceptual output of SINGER’s write() method, but produces a tskit.TreeSequence that can be analysed with the tskit/msprime ecosystem.

The conversion:
  1. Add all non-root nodes (time = node.time * 2*Ne) as tskit individuals.

  2. Walk the ARG recombinations to build a list of (left, right, parent, child) edges spanning each genomic interval.

  3. Sort edges (required by tskit) and call tables.sort() + tables.tree_sequence().

pysinger.io.tskit_writer.arg_to_tskit(arg: ARG, Ne: float = 1.0)[source]

Convert arg to a tskit.TreeSequence.

Parameters:
  • arg (ARG) – The pysinger ARG object.

  • Ne (float) – Effective population size used to convert coalescent time units to generations (t_generations = t_coalescent * 2 * Ne).

Returns:

ts

Return type:

tskit.TreeSequence

arg_to_tskit(arg, Ne) converts a pysinger ARG to a tskit.TreeSequence:

  1. Discover nodes: Walk all marginal trees by replaying recombinations. Collect every node that appears in a parent/child relationship (excluding the root sentinel at index = -1).

  2. Add nodes to tskit: Each node becomes a tskit node with time = node.time * Ne (converting coalescent units to generations). Sample nodes get NODE_IS_SAMPLE flag.

  3. Emit edges: Replay recombinations again. For each tree interval \([x_k, x_{k+1})\), emit a tskit edge (left, right, parent, child) for every parent-child pair where the parent is not the root sentinel and parent.time > child.time.

  4. Sort and build: Call tables.sort() and tables.tree_sequence().

The resulting tree sequence can be used with the full tskit API for computing diversity, TMRCA, drawing trees, etc.

Rate maps

class pysinger.rates.rate_map.RateMap[source]

Piecewise-constant genetic rate map.

Loaded from a file with lines: left right rate

Provides cumulative distance lookup for converting genomic positions to rate-weighted distances.

load_map(filename: str) None[source]

Load a rate map from a whitespace-delimited file.

RateMap stores a piecewise-constant rate function (recombination or mutation) as parallel arrays of (left, right, rate) intervals. Supports:

  • load_map(path): Read a 3-column file (left, right, rate).

  • cumulative_distance(pos): Integrated rate from 0 to pos.

  • segment_distance(a, b): Integrated rate from a to b.

When provided to the Sampler, rate maps override the scalar recomb_rate / mut_rate for computing per-bin rhos and thetas.

Fitch parsimony reconstruction

class pysinger.reconstruction.fitch.FitchReconstruction(tree: Tree)[source]

Two-pass Fitch parsimony reconstruction on a Tree.

Usage:

fr = FitchReconstruction(tree)
fr.reconstruct(pos)      # assigns states for position *pos*
fr.update(recombination) # advance tree topology
reconstruct(pos: float) None[source]

Perform Fitch reconstruction at genomic position pos.

Writes the inferred ancestral states back into each node via node.write_state(pos, state).

Mirrors Fitch_reconstruction::reconstruct.

update(r: Recombination) None[source]

Advance the topology by applying recombination r.

Mirrors Fitch_reconstruction::update.

Fitch parsimony assigns ancestral states to internal nodes with the minimum number of mutations. It runs on each marginal tree for each segregating site.

Algorithm

Pass 1 — Pruning (bottom-up):

  • Leaf nodes: state = observed allele.

  • Internal nodes: if both children agree, take that state. If they disagree, mark as ambiguous (0.5). If one child is ambiguous, take the other’s state.

Pass 2 — Peeling (top-down):

  • Root: resolve ambiguity to ancestral (0).

  • Internal nodes: if ambiguous, take parent’s resolved state. If definite, keep it.

After both passes, node.write_state(pos, state) is called for each internal node, updating the node’s mutation_sites map. This is used during ARG._impute() to assign allele states to newly threaded coalescence nodes.