I/O and reconstruction
VCF reader
VCF reader — load phased or haploid VCF into Node objects.
Mirrors Sampler::naive_read_vcf / naive_read_vcf_haploid in Sampler.cpp.
Returns a list of Node objects with mutation sites set, one node per haplotype (2 per individual for diploid, 1 per individual for haploid).
- pysinger.io.vcf_reader.read_vcf_haploid(vcf_file: str, start_pos: float = 0.0, end_pos: float = inf) Tuple[List[Node], float][source]
Read a haploid VCF and return (nodes, sequence_length).
One Node per individual (column after FORMAT).
Mirrors Sampler::naive_read_vcf_haploid.
- pysinger.io.vcf_reader.read_vcf_phased(vcf_file: str, start_pos: float = 0.0, end_pos: float = inf) Tuple[List[Node], float][source]
Read a phased VCF and return (nodes, sequence_length).
One Node per haplotype (2 × n_individuals). Mutations are stored as positions relative to start_pos.
Mirrors Sampler::naive_read_vcf.
- Parameters:
vcf_file – Path to the .vcf file.
start_pos – Genomic region to load (half-open [start_pos, end_pos)).
end_pos – Genomic region to load (half-open [start_pos, end_pos)).
- Returns:
nodes (List[Node]) – Haplotype nodes (index 0 .. 2n-1).
sequence_length (float) – end_pos - start_pos.
Two functions:
read_vcf_phased(path, start, end): Parses a phased VCF (0|1format). Each diploid individual produces 2Nodeobjects (one per haplotype). Each node’smutation_sitesis populated with positions where it carries the derived allele.read_vcf_haploid(path, start, end): Treats each sample column as a single haplotype (for haploid data or pre-split files).
Both return (List[Node], sequence_length). The sequence length is the end parameter (or the last variant position + 1 if end = inf).
tskit writer
tskit writer — convert a pysinger ARG to a tskit TreeSequence.
Mirrors the conceptual output of SINGER’s write() method, but produces a tskit.TreeSequence that can be analysed with the tskit/msprime ecosystem.
- The conversion:
Add all non-root nodes (time = node.time * 2*Ne) as tskit individuals.
Walk the ARG recombinations to build a list of (left, right, parent, child) edges spanning each genomic interval.
Sort edges (required by tskit) and call tables.sort() + tables.tree_sequence().
- pysinger.io.tskit_writer.arg_to_tskit(arg: ARG, Ne: float = 1.0)[source]
Convert arg to a
tskit.TreeSequence.- Parameters:
arg (ARG) – The pysinger ARG object.
Ne (float) – Effective population size used to convert coalescent time units to generations (t_generations = t_coalescent * 2 * Ne).
- Returns:
ts
- Return type:
tskit.TreeSequence
arg_to_tskit(arg, Ne) converts a pysinger ARG to a tskit.TreeSequence:
Discover nodes: Walk all marginal trees by replaying recombinations. Collect every node that appears in a parent/child relationship (excluding the root sentinel at
index = -1).Add nodes to tskit: Each node becomes a tskit node with
time = node.time * Ne(converting coalescent units to generations). Sample nodes getNODE_IS_SAMPLEflag.Emit edges: Replay recombinations again. For each tree interval \([x_k, x_{k+1})\), emit a tskit edge
(left, right, parent, child)for every parent-child pair where the parent is not the root sentinel andparent.time > child.time.Sort and build: Call
tables.sort()andtables.tree_sequence().
The resulting tree sequence can be used with the full tskit API for computing diversity, TMRCA, drawing trees, etc.
Rate maps
- class pysinger.rates.rate_map.RateMap[source]
Piecewise-constant genetic rate map.
Loaded from a file with lines: left right rate
Provides cumulative distance lookup for converting genomic positions to rate-weighted distances.
RateMap stores a piecewise-constant rate function (recombination or mutation) as parallel arrays of (left, right, rate) intervals. Supports:
load_map(path): Read a 3-column file (left, right, rate).cumulative_distance(pos): Integrated rate from 0 topos.segment_distance(a, b): Integrated rate fromatob.
When provided to the Sampler, rate maps override the scalar recomb_rate / mut_rate for computing per-bin rhos and thetas.
Fitch parsimony reconstruction
- class pysinger.reconstruction.fitch.FitchReconstruction(tree: Tree)[source]
Two-pass Fitch parsimony reconstruction on a Tree.
Usage:
fr = FitchReconstruction(tree) fr.reconstruct(pos) # assigns states for position *pos* fr.update(recombination) # advance tree topology
- reconstruct(pos: float) None[source]
Perform Fitch reconstruction at genomic position pos.
Writes the inferred ancestral states back into each node via
node.write_state(pos, state).Mirrors Fitch_reconstruction::reconstruct.
- update(r: Recombination) None[source]
Advance the topology by applying recombination r.
Mirrors Fitch_reconstruction::update.
Fitch parsimony assigns ancestral states to internal nodes with the minimum number of mutations. It runs on each marginal tree for each segregating site.
Algorithm
Pass 1 — Pruning (bottom-up):
Leaf nodes: state = observed allele.
Internal nodes: if both children agree, take that state. If they disagree, mark as ambiguous (0.5). If one child is ambiguous, take the other’s state.
Pass 2 — Peeling (top-down):
Root: resolve ambiguity to ancestral (0).
Internal nodes: if ambiguous, take parent’s resolved state. If definite, keep it.
After both passes, node.write_state(pos, state) is called for each internal node, updating the node’s mutation_sites map. This is used during ARG._impute() to assign allele states to newly threaded coalescence nodes.