MicroSNP – Pipeline for finding SNPs and structural variants in metagenomic data

Microbial DNA variation

If you compare any two human genomes, you will find around one SNP (a change in a single nucleotide) every kilobase [1]. Variation in human DNA can largely be chalked up to inheritance: individual differences passed down from our parents and ancestral lineage.

Within a bacterial species, the rate of allelic variation between any two isolates is much higher. This is due to the liberal way bacteria have sex and share genes via horizontal gene transfer. To make things even more complicated, bacteria can carry a genetic ‘toolbox’ of mobile genetic elements acquired from bacteriophage insertions (prophages), plasmids or pathogenicity islands. Some of these mobile elements can be cut out or pasted in as the bacteria need them.

So the baseline level of genetic variation between individual bacteria of the same species can be high.

What is a SNP?

A SNP, or single nucleotide polymorphism, is a single-base difference at a given position in one genome when compared to another genome e.g. A->T.

A single mutation that substitutes one nucleotide for another can change the amino acid encoded, which can ultimately lead to misfolding, disordered peptides or other changes in a protein’s structure.
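To make the definition concrete, here is a minimal Python sketch that lists the single-base differences between two sequences. It assumes the sequences are already aligned and the same length, which real genomes rarely are without an alignment step first. The function name `find_snps` and the example sequences are made up for illustration.

```python
def find_snps(reference, sample):
    """Return (position, ref_base, alt_base) for each single-base mismatch.

    Assumes both sequences are already aligned and of equal length.
    Positions are 1-based.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, ref_base, alt_base)
        for position, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


# Toy example: the fifth base differs (C in the reference, T in the sample).
print(find_snps("ATGCCGTA", "ATGCTGTA"))
```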

Variants or polymorphisms

SNPs are one type of variant that can exist between genomes. Structural variants also exist in the form of:

  • Insertions
  • Deletions
  • Inversions
  • Duplications

The database dbSNP catalogues SNPs and other short variants in the human genome. As far as I know, there is no comprehensive unified database for all microbial SNPs. BacWGSTdb is a SNP-based database for tracking pathogens, which aims to follow the evolution and source of pathogens over time.

Why look for structural variants in metagenomic data?

  • When trying to understand antimicrobial resistance genes, or which form of an allele makes a bacterium resistant to an antibiotic or other drug treatment.
  • To look for correlations between the existence of alleles in a microbial population or species and an individual’s health e.g. microbiome profiling.
  • To compare multiple samples from a microbiome or environment in order to observe what form of a gene or allele in a species is the most dominant or expressed in different disease states, countries, locations of the body or plant, conditions, timepoints etc.
  • And many, many more scenarios…

Limitations of current SNP finding software

I found there are lots of programs which are designed for diploid genomes e.g. human. Things get more complicated when you are dealing with haploid genomes and want a report on multiple species within the same sample.

One such tool I tried was snpEff. This would be great if I was just looking for SNPs in a single species e.g. comparing multiple human genomes. It becomes less feasible when you are trying to scrape the information from an HTML report and reformat it in a concise way. This is an example of the HTML report snpEff creates:

snpEff example HTML output


This led me to bcftools, a command-line program from the samtools project for variant calling and for manipulating vcf and bcf files. It reads and writes vcf (variant call format) files, a standard format for storing information about genetic variants.

Variant calling is the term for identifying variants between genomes, typically by comparing sequencing reads against a reference. A bcf is a binary-compressed version of a vcf.
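To give a feel for the format, here is a rough Python sketch that reads one tab-separated vcf data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and classifies the variant by comparing the lengths of the REF and ALT alleles. The helper names and the example record are invented for illustration, and the classification is deliberately simplified compared to what bcftools does.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of a single vcf data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _variant_id, ref, alt, qual, _filter, info_text = fields[:8]
    info = {}
    if info_text != ".":
        for entry in info_text.split(";"):
            key, _, value = entry.partition("=")
            info[key] = value if value else True  # flags such as INDEL have no value
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alts": alt.split(","),
        "qual": None if qual == "." else float(qual),
        "info": info,
    }


def classify(ref, alt):
    """Simplified variant typing based on allele lengths."""
    if alt.startswith("<") or "[" in alt or "]" in alt:
        return "symbolic"  # e.g. <DEL> or breakend notation
    if len(ref) == 1 and len(alt) == 1:
        return "snp"
    if len(ref) == len(alt):
        return "mnp"
    return "insertion" if len(alt) > len(ref) else "deletion"


# Hypothetical record, not taken from real data.
record = parse_vcf_line("contig_1\t1042\t.\tA\tT\t58.0\tPASS\tDP=31")
print(record["pos"], [classify(record["ref"], alt) for alt in record["alts"]])
```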

If you are going to be looking for variants or SNPs, you will need to become comfortable with the vcf format. For an overview of vcfs, check out this nice presentation I found:


By working with a vcf file directly using bcftools, you can quickly filter the data by location of variant (position), the type of variant, the sample, quality score, depth, genotype quality etc.
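As a sketch of what that filtering can look like from Python, the function below only builds a `bcftools view` command; it does not run it. It uses the `-r` (region), `-v` (variant types), `-s` (samples) and `-i` (include expression) options. The function name, file name and thresholds are placeholders, and `INFO/DP` assumes your vcf actually carries a DP tag in its INFO column.

```python
import shlex


def build_filter_command(vcf_path, region=None, variant_types=None,
                         samples=None, min_qual=None, min_depth=None):
    """Assemble a `bcftools view` command as an argument list (nothing is executed)."""
    cmd = ["bcftools", "view"]
    if region:
        cmd += ["-r", region]  # region queries need a bgzipped, indexed file
    if variant_types:
        cmd += ["-v", ",".join(variant_types)]  # e.g. snps, indels
    if samples:
        cmd += ["-s", ",".join(samples)]
    conditions = []
    if min_qual is not None:
        conditions.append(f"QUAL>={min_qual}")
    if min_depth is not None:
        conditions.append(f"INFO/DP>={min_depth}")  # assumes a DP tag is present
    if conditions:
        cmd += ["-i", " && ".join(conditions)]
    cmd.append(vcf_path)
    return cmd


# Placeholder file name and thresholds.
command = build_filter_command("calls.vcf.gz", variant_types=["snps"],
                               min_qual=30, min_depth=10)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run` once bcftools is installed and on your PATH.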


Due to the challenges of analysing data from a complex experimental design, I decided to write a Python wrapper around bcftools to automate the finding of SNPs and structural variants. You could probably write the whole thing in bash, but I thought Python would be more portable and fun to work with.
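To illustrate the general idea of wrapping bcftools in Python (this is not the actual MicroSNP code), here is a minimal sketch that pipes `bcftools mpileup` into `bcftools call` using `subprocess`. The function names and file names are placeholders, and `--ploidy 1` is my assumption for treating the genomes as haploid, so check it against the options in your bcftools version.

```python
import subprocess


def build_calling_commands(reference_fasta, bam_path, output_vcf):
    """Return the two command lists for an mpileup | call pipeline."""
    mpileup = ["bcftools", "mpileup", "-Ou", "-f", reference_fasta, bam_path]
    call = ["bcftools", "call", "-m", "-v", "--ploidy", "1",  # assumed haploid setting
            "-Oz", "-o", output_vcf]
    return mpileup, call


def call_variants(reference_fasta, bam_path, output_vcf):
    """Run the pipeline; requires bcftools to be installed and on the PATH."""
    mpileup, call = build_calling_commands(reference_fasta, bam_path, output_vcf)
    with subprocess.Popen(mpileup, stdout=subprocess.PIPE) as pileup:
        # Feed the pileup straight into the caller without a temporary file.
        subprocess.run(call, stdin=pileup.stdout, check=True)
    if pileup.returncode != 0:
        raise subprocess.CalledProcessError(pileup.returncode, mpileup)


# Show the commands for one hypothetical sample without running anything.
for command in build_calling_commands("reference.fa", "sample.bam", "sample.vcf.gz"):
    print(" ".join(command))
```

Keeping command construction separate from execution makes the wrapper easy to test without bcftools installed.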

The experiment involved adding a phage cocktail to germ-free mice which had been colonised with a native population of common bacteria associated with certain inflammatory microbiome disorders. The aim was to see if the introduced phages would select for genetic variants in the bacterial population.

There were 15 mice, with 5-7 timepoints for each, and around 15 species of bacteria and phage to be examined for SNPs and structural variants across all mice and timepoints.
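To get a sense of the scale, here is a small sketch that enumerates every (mouse, timepoint, species) combination for the lower and upper bounds of that design. The labels are invented placeholders, and it assumes every species is examined at every timepoint, which may not match the real experiment exactly.

```python
from itertools import product


def plan_jobs(mice, timepoints_by_mouse, species):
    """Yield one (mouse, timepoint, species) job per combination."""
    for mouse in mice:
        for timepoint, organism in product(timepoints_by_mouse[mouse], species):
            yield mouse, timepoint, organism


# Placeholder labels: 15 mice and 15 species, with 5 or 7 timepoints per mouse.
mice = [f"mouse_{i:02d}" for i in range(1, 16)]
species = [f"species_{i:02d}" for i in range(1, 16)]
fewest = {mouse: [f"t{i}" for i in range(1, 6)] for mouse in mice}
most = {mouse: [f"t{i}" for i in range(1, 8)] for mouse in mice}

print(len(list(plan_jobs(mice, fewest, species))))  # 15 * 5 * 15 = 1125
print(len(list(plan_jobs(mice, most, species))))    # 15 * 7 * 15 = 1575
```

Somewhere between roughly 1,100 and 1,600 variant-calling runs is far too many to do by hand.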

Anyway, to see the program and installation instructions, check out the GitHub repository:

