Module sequencing

Sequence-based species identification for Mycobacteriaceae.

§Generation of species identification databases

The database sequences are flagged as type material using the INSD Collaboration type_material qualifier. Type material ties formal species names to physical specimens (culture collections for prokaryotes, museum or herbarium specimens for eukaryotes), as annotated in the NCBI Taxonomy Database.

See fn fetch_myco_sequences() in build.rs for details on how the sequences were fetched from NCBI at build time.

myco_erm41.fasta is generated at build time but unused; erm41 identification uses per-subspecies references (erm41_abscessus_ATCC_19977.fasta, erm41_bolletii_CIP_108541.fasta, erm41_massiliense_CCUG_48898.fasta) instead.

Re-exports§

pub use batch::SampleSusceptibilityRecord;

Modules§

batch
bed
erm41
hsp65
ntfy_notify
pnca
rpob
rrl
rrs
serde_helpers
tb_data

Structs§

Ab1Channels: Parsed channel intensity data from an AB1 chromatogram.
Erm41ViewState: Chromatogram display parameters for the erm(41) region.
GappedAlignment: Alignment result from align_to_ref: gapped strings plus reference start position.
RrlNtmViewState: Chromatogram display parameters for the rrl / NTM macrolide-resistance region.
SeqData: Top-level result for a processed AB1 read.
SeqIdHit: Best-hit result from aligning an AB1 read against the reference sequences.
SusceptibilityCalls: Susceptibility calls derived from AB1 capillary sequencing, keyed by gene target.
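The relationship between GappedAlignment (gapped strings plus a reference start position) and a reference-coordinate lookup such as base_at_ref_pos can be sketched as below. This is a minimal illustration, not the crate's implementation: the struct GappedAlignmentSketch, its field names, the 0-based coordinate convention, and the function signature are all assumptions made for the example.

```rust
/// Hypothetical stand-in for `GappedAlignment`; the field names are assumptions.
pub struct GappedAlignmentSketch {
    /// Reference row of the alignment, '-' where the query has an insertion.
    pub ref_gapped: String,
    /// Query row of the alignment, '-' where the query has a deletion.
    pub query_gapped: String,
    /// Reference coordinate (assumed 0-based) of the first alignment column.
    pub ref_start: usize,
}

/// Return the query base aligned to `ref_pos`, or `None` if the position is
/// outside the aligned region or the query has a deletion ('-') there.
pub fn base_at_ref_pos_sketch(aln: &GappedAlignmentSketch, ref_pos: usize) -> Option<char> {
    let mut pos = aln.ref_start;
    for (r, q) in aln.ref_gapped.chars().zip(aln.query_gapped.chars()) {
        if r == '-' {
            // Insertion in the query: no reference coordinate is consumed.
            continue;
        }
        if pos == ref_pos {
            return if q == '-' { None } else { Some(q) };
        }
        pos += 1;
    }
    // `ref_pos` lies before `ref_start` or past the last aligned column.
    None
}
```

Walking the two gapped rows column by column keeps the lookup independent of how many insertions precede the requested position, at the cost of a linear scan per call.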

Constants§

ACC_GASTRI 🔒
ACC_KANSASII 🔒
ACC_MARINUM 🔒
ACC_ULCERANS 🔒
DESC_ABSCESSUS
DESC_BOLLETII
DESC_MASSILIENSE
ERM41_ANCHOR_L 🔒
ERM41_ANCHOR_R 🔒
ERM41_FWD_END 🔒
ERM41_FWD_START 🔒
KANSASII_GASTRI_ACCS 🔒
MARINUM_ULCERANS_ACCS 🔒
MIN_PNCA_REF_LEN 🔒
MIN_RPOB_REF_LEN 🔒
MIN_RRL_REF_LEN 🔒
MIN_RRS_REF_LEN 🔒
MIN_SEQ_ID_IDENTITY
PDF_COL_HEADERS 🔒
PDF_COL_X 🔒
PDF_MARGIN_B 🔒
PDF_MARGIN_L 🔒
PDF_MARGIN_T 🔒
PDF_PAGE_H 🔒
PDF_PAGE_W 🔒
PDF_ROW_H 🔒
PDF_TABLE_W 🔒
PNCA_FWD_END 🔒
PNCA_FWD_START 🔒
REF_ERM41_ABSCESSUS 🔒
REF_ERM41_BOLLETII 🔒
REF_ERM41_MASSILENSE 🔒
REF_MYCO_HSP65 🔒: hsp65 / groEL2 reference sequences — Mycobacteriaceae type strains, fetched from NCBI at build time.
REF_MYCO_RPOB 🔒: rpoB reference sequences — Mycobacteriaceae type strains, fetched from NCBI at build time.
REF_MYCO_RRL 🔒: 23S rRNA (rrl) reference sequences — Mycobacteriaceae type strains, fetched from NCBI at build time.
REF_MYCO_RRS 🔒: 16S rRNA (rrs) reference sequences — Mycobacteriaceae type strains, fetched from NCBI at build time.
REF_PNCA 🔒: pncA CDS plus a 50 bp upstream promoter flank for each M. tuberculosis complex member with a distinct reference sequence, fetched from NCBI at build time (see pnca module docs). The sequences are concatenated into one multi-FASTA so that identify_sequence_pnca() can search all of them via parse_multi_fasta, the same way identify_sequence_rrl_ntm() searches REF_MYCO_RRL.
RRL_ANCHOR_L 🔒
RRL_ANCHOR_R 🔒
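Several of the reference constants above are multi-FASTA strings that are split into (accession, description, sequence) tuples before searching. The following is a rough sketch of that kind of parser, not the crate's parse_multi_fasta: the function name, the return type, the uppercasing of bases, and the assumption that each header has the form ">ACCESSION free-text description" are all illustrative choices.

```rust
/// Parse a multi-FASTA string into (accession, description, sequence) tuples.
/// Assumption: each header looks like `>ACCESSION free-text description`.
pub fn parse_multi_fasta_sketch(text: &str) -> Vec<(String, String, Vec<u8>)> {
    let mut records: Vec<(String, String, Vec<u8>)> = Vec::new();
    for line in text.lines() {
        let line = line.trim();
        if line.is_empty() {
            continue;
        }
        if let Some(header) = line.strip_prefix('>') {
            // The first whitespace-delimited token is the accession;
            // everything after it is the description.
            let (acc, desc) = match header.split_once(char::is_whitespace) {
                Some((a, d)) => (a, d.trim()),
                None => (header, ""),
            };
            records.push((acc.to_string(), desc.to_string(), Vec::new()));
        } else if let Some(last) = records.last_mut() {
            // Sequence line: append uppercased bases to the current record.
            last.2.extend(line.bytes().map(|b| b.to_ascii_uppercase()));
        }
        // Sequence lines that appear before the first header are ignored.
    }
    records
}
```

Called on a string such as ">ACC1 Example species strain X\nacgt\nACGT\n" (a placeholder record, not a real accession), the sketch yields a single tuple whose sequence joins both lines.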

Functions§

align_to_ref: Align query (Sanger read) against reference (gene sequence) using semiglobal dynamic-programming alignment (Smith-Waterman-style scoring with free reference end-gaps, so the full query is placed within the reference).
base_at_ref_pos: Return the query base at a given reference position, or None if the position is outside the aligned region or the query has a deletion ('-') there.
build_report_pdf: Build a landscape A4 PDF report from AB1 scan records. Filtered to gene-identified records with identity ≥ MIN_SEQ_ID_IDENTITY, same as the CSV output, and to samples no older than report_max_age_days (see TBConfig::report_max_age_days).
dedup_substring_same_desc 🔒: Within each description group, remove entries whose sequence (uppercased) is a contiguous substring of a longer entry that shares the same description. Longer entries survive; the shorter entries are redundant for alignment purposes because the aligner will find the same best position inside the longer reference.
format_pairwise_alignment 🔒
parse_ab1_quality: Extract per-base quality scores from AB1 data. Tries edited quality scores (PCON tag 2) first, falling back to raw scores (PCON tag 1). Each byte is a Phred quality score corresponding to the base at the same index in PBAS.
parse_ab1_sequence: Extract basecalls from AB1 data. Tries the edited basecalls (PBAS tag 2) first, falling back to raw basecalls (PBAS tag 1).
parse_fasta_seq 🔒: Parse a FASTA string, returning just the sequence bytes (ignores header).
parse_multi_fasta 🔒: Parse a multi-FASTA string into (accession, description, sequence) tuples.
pdf_current_date 🔒
pdf_days_to_ymd 🔒
pdf_hline 🔒
pdf_sus 🔒
pdf_truncate 🔒
pdf_write_row 🔒
reverse_complement
scan_window 🔒
trim_start_end: Trim a basecall sequence to the amplicon region defined by a primer pair.
trim_to_min_quality: Trim leading and trailing low-quality bases using a sliding-window average.
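The sliding-window trimming described for trim_to_min_quality can be sketched roughly as follows. This is one plausible reading of "sliding-window average", not the crate's code: the function name, the window and min_avg parameters, and the choice to return a half-open index range are assumptions made for illustration.

```rust
/// Return the half-open range `[start, end)` of bases to keep, or `(0, 0)`
/// if no window reaches the threshold.
/// Assumptions: `window` and `min_avg` are hypothetical parameters; the real
/// function may use different inputs, thresholds, and return type.
pub fn trim_to_min_quality_sketch(qual: &[u8], window: usize, min_avg: f64) -> (usize, usize) {
    if window == 0 || qual.len() < window {
        return (0, 0);
    }
    // Mean Phred score of the window starting at index `i`.
    let avg = |i: usize| -> f64 {
        let sum: u32 = qual[i..i + window].iter().map(|&q| u32::from(q)).sum();
        f64::from(sum) / window as f64
    };
    let last = qual.len() - window;
    // Leading trim: first window from the left whose mean passes.
    let start = match (0..=last).find(|&i| avg(i) >= min_avg) {
        Some(i) => i,
        None => return (0, 0),
    };
    // Trailing trim: last window from the right whose mean passes.
    // One always exists because the window at `start` passed.
    let end = (start..=last)
        .rev()
        .find(|&i| avg(i) >= min_avg)
        .map(|i| i + window)
        .unwrap_or(start + window);
    (start, end)
}
```

With qualities [5, 5, 30, 30, 30, 30, 5, 5], a window of 2, and a minimum mean of 20.0, the sketch keeps indices 2 through 5. The same range can then be applied to the basecall string, since each quality byte corresponds to the base at the same index.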