MicroSNP – Tool for finding SNPs in Metagenomic data
Sep 19, 2021
Microbial DNA variation
If you compare any two human genomes, you will find around 1 SNP (a change in a single nucleotide) per kilobase [1]. Variation in human DNA can largely be attributed to inheritance: the individual differences passed down from our parents and ancestral lineage.
Within a bacterial species, the rate of allelic variation is much higher between any two isolates. This is due to the liberal way bacteria “have sex” and share genes via horizontal gene transfer.
To make things even more complicated, bacteria can carry a genetic “toolbox” of mobile genetic elements acquired through bacteriophage insertions (prophages), plasmids, or pathogenicity islands. Some of these mobile elements can be cut out or pasted in as the bacteria need them.
We can see the potential for a high baseline level of genetic variation between individual bacteria from the same species.
What is a SNP?
A SNP (single nucleotide polymorphism, pronounced “snip”) is a single-base difference at a given position in one genome compared with another, e.g.:
A → T
A single nucleotide substitution in a coding region can change the encoded amino acid, which can ultimately lead to misfolding, disordered peptides, or other changes in the protein’s structure.
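To make the definition concrete, here is a minimal sketch that compares two already-aligned sequences base by base and reports every single-base difference. The function name `find_snps` and the toy sequences are invented for illustration. Real variant calling works from read alignments rather than two clean strings.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch.

    Assumes both sequences are already aligned and have the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


# Toy sequences, invented for illustration only
reference = "ATGGCATTC"
sample = "ATGGCTTTC"
print(find_snps(reference, sample))  # prints [(6, 'A', 'T')]
```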
Variants or polymorphisms
SNPs are one type of variant that can exist between genomes. Structural variants also exist in the form of:
- Insertions
- Deletions
- Inversions
- Duplications

The database dbSNP catalogues SNPs and other short variants (such as small insertions and deletions) present in the human genome.
As far as I know, there exists no comprehensive unified database for all microbial SNPs.
BacWGSTdb is a SNP-based database for tracking pathogens, which aims to trace their evolution and source over time.
Why look for structural variants in metagenomic data?
- To understand antimicrobial resistance genes, or which form of an allele makes a bacterium resistant to antibiotics or drug treatment.
- To look for correlations between alleles in a microbial population and an individual’s health (e.g. microbiome profiling).
- To compare multiple samples from a microbiome or environment and observe which allele form is dominant or expressed across disease states, countries, body sites, plant locations, conditions, or timepoints.
And many, many more scenarios…
Limitations of current SNP finding software
I found that many programs are designed for diploid genomes (e.g. human). Things become complicated when dealing with haploid genomes, especially when reporting across multiple species within the same sample.
One such tool I tried was SnpEff.
It works well when analysing SNPs in a single species (e.g. comparing multiple human genomes), but it becomes impractical once you have to scrape information out of its HTML reports and reformat it concisely.
Below are examples of the HTML output SnpEff creates:



bcftools
This led me to discover bcftools, a command-line program maintained as part of the samtools project for variant calling and for manipulating VCF and BCF files.
It works with and creates:
- VCF: Variant Call Format
- BCF: Binary Call Format, the binary, compressed counterpart of VCF
Variant calling refers to identifying genetic variants between genomes.

If you are going to be looking for variants or SNPs, you will need to become comfortable with the VCF format.
For an overview of VCF files, check out this presentation.
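As a quick orientation, the sketch below parses one VCF data line by hand. It relies only on the eight fixed, tab-separated columns defined by the VCF specification (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The names `VcfRecord`, `parse_vcf_line` and `is_snp` are hypothetical, and the example record is invented. For real work a dedicated parser or bcftools itself is the safer choice.

```python
from typing import NamedTuple, Optional


class VcfRecord(NamedTuple):
    chrom: str
    pos: int                 # 1-based position on the contig
    ref: str
    alt: list[str]           # one entry per alternate allele
    qual: Optional[float]    # None when the column is "."
    info: dict[str, str]


def parse_vcf_line(line: str) -> VcfRecord:
    """Parse the eight fixed columns of a single VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, qual, _filter, info = fields[:8]
    info_dict = {}
    for entry in info.split(";"):
        if entry == ".":
            continue
        key, _, value = entry.partition("=")  # flag-style entries get ""
        info_dict[key] = value
    return VcfRecord(
        chrom=chrom,
        pos=int(pos),
        ref=ref,
        alt=alt.split(","),
        qual=None if qual == "." else float(qual),
        info=info_dict,
    )


def is_snp(record: VcfRecord) -> bool:
    """True when REF and every ALT allele are exactly one base long."""
    return len(record.ref) == 1 and all(len(a) == 1 for a in record.alt)


# Invented example record, for illustration only
line = "contig_1\t1042\t.\tA\tT\t87.3\tPASS\tDP=35;MQ=60"
record = parse_vcf_line(line)
print(record.pos, record.ref, record.alt, is_snp(record))
```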
By working directly with a VCF file using bcftools, you can quickly filter data by:
- Variant position
- Variant type
- Sample
- Quality score
- Depth
- Genotype quality
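The filters listed above map onto options of `bcftools view`. The sketch below only builds the command line rather than running it, so it works without bcftools installed. The helper name `build_view_command`, the file name, the sample name and the thresholds are all hypothetical. It also assumes the VCF carries `INFO/DP` and `FMT/GQ` tags, which depends on how the variants were called.

```python
import shlex


def build_view_command(vcf_path, region=None, variant_type=None, sample=None,
                       min_qual=None, min_depth=None, min_gq=None):
    """Build a `bcftools view` command as an argument list (not executed)."""
    cmd = ["bcftools", "view"]
    if region:
        cmd += ["-r", region]        # position; needs a bgzipped, indexed file
    if variant_type:
        cmd += ["-v", variant_type]  # e.g. "snps" or "indels"
    if sample:
        cmd += ["-s", sample]        # keep only this sample's column
    filters = []
    if min_qual is not None:
        filters.append(f"QUAL>={min_qual}")
    if min_depth is not None:
        filters.append(f"INFO/DP>={min_depth}")  # assumes an INFO/DP tag
    if min_gq is not None:
        filters.append(f"FMT/GQ>={min_gq}")      # assumes a FORMAT/GQ tag
    if filters:
        cmd += ["-i", " && ".join(filters)]
    cmd.append(vcf_path)
    return cmd


# Hypothetical file, sample, and thresholds
cmd = build_view_command("calls.vcf.gz", region="contig_1:1000-2000",
                         variant_type="snps", sample="mouse01_t1",
                         min_qual=30, min_depth=10)
print(shlex.join(cmd))
```

Keeping the command as a list means it can be handed directly to `subprocess.run` without any shell quoting issues.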
MicroSNP
Due to the challenges of analysing data from a complex experimental design, I wrote a Python wrapper around bcftools to automate the identification of SNPs and structural variants. The whole thing could probably be written in bash, but Python felt more portable and fun to work with.
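To give a flavour of what wrapping bcftools from Python can look like, here is a minimal sketch that pipes `bcftools mpileup` into `bcftools call`. It is not MicroSNP's actual code. The function names and file paths are hypothetical, and haploid calling via `--ploidy 1` is an assumption that suits bacterial genomes. See the repository for the real implementation.

```python
import subprocess


def build_call_commands(reference_fasta, bam_path, out_vcf):
    """Return the (mpileup, call) argument lists for one sample."""
    mpileup = ["bcftools", "mpileup", "-O", "u", "-f", reference_fasta, bam_path]
    # -m: multiallelic caller, -v: output variant sites only
    # --ploidy 1: treat all samples as haploid (assumption for bacteria)
    call = ["bcftools", "call", "-m", "-v", "--ploidy", "1",
            "-O", "z", "-o", out_vcf]
    return mpileup, call


def call_variants(reference_fasta, bam_path, out_vcf):
    """Run mpileup | call; requires bcftools to be available on PATH."""
    mpileup_cmd, call_cmd = build_call_commands(reference_fasta, bam_path, out_vcf)
    mpileup = subprocess.Popen(mpileup_cmd, stdout=subprocess.PIPE)
    call = subprocess.run(call_cmd, stdin=mpileup.stdout)
    mpileup.stdout.close()
    mpileup_status = mpileup.wait()
    if mpileup_status != 0 or call.returncode != 0:
        raise RuntimeError(f"bcftools pipeline failed for {bam_path}")
    return out_vcf


# Hypothetical paths: only the commands are built here, nothing is executed
mpileup_cmd, call_cmd = build_call_commands(
    "reference.fasta", "mouse01_t1.bam", "mouse01_t1.vcf.gz"
)
print(" ".join(mpileup_cmd), "|", " ".join(call_cmd))
```

Separating command construction from execution makes it easy to loop over many samples and to inspect or log each command before anything runs.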
The experiment involved adding a phage cocktail to germ-free mice colonised with a native population of bacteria associated with inflammatory microbiome disorders. The goal was to see whether introduced phages would select for genetic variants in the bacterial population.
The study included:
- 15 mice
- 5–7 timepoints per mouse
- ~15 bacterial and phage species

All were examined for SNPs and structural variants across mice and timepoints.
To see the program and installation instructions, check out the GitHub repository: https://github.com/linda5mith/microsnp