Genomics approaches to studying the microbiome

Microbes in a community interact with each other and the host, so it is important to capture as much of the diversity of the microbiome within its phenotypic context as possible.

Understanding how microbial species and the host are related through genome-based and high-throughput techniques is a hot topic in microbiome research.

In my previous post on the plant microbiome I dropped many references to genomics studies which have informed so much of our knowledge on plant-microbe interactions. In this post, I will discuss the modern genomics techniques used in studying the microbiome and the genetic markers which have been set as goal posts.

The main questions when approaching a microbiome study are:

  • Who is there?
    • Taxa identification, their abundance and distribution.
  • What are they doing?
    • What functional chemistry is being carried out and what environmental products are being consumed and excreted?
  • How are they doing it?
    • Which enzyme pathways are present? Is pathway expression dominant in one species over another?

Abundance and prevalence are concepts that also need consideration. Abundance means there is a high quantity of a species present in a sample. If a species is prevalent it is consistently found across samples. If it is abundant and prevalent, the species is consistently found in high quantities across samples. So how can we explore these metrics?

HOW can we study the Microbiome?

Classic microbiology techniques involve isolating and culturing microbes from their environment. Culture-dependent techniques are limited in that they require specific conditions to grow microorganisms which don’t capture or reflect the vast majority of microbial diversity occurring in a natural environment.

It was observed that a sample under the microscope contained many more bacterial cells than when grown under standard cell-culturing conditions. This was dubbed the great plate count anomaly.

Great plate count anomaly.

Culture-independent methods came in the form of analyzing microbial DNA through genetic sequencing.

  • 16S rRNA surveys – Next-generation sequencing (NGS) of PCR amplified SSU (small subunit) ribosomal RNA products) identifies which bacteria are present.
  • Metagenomics – measures the abundance of genes present in the sample (all the DNA).
  • Metascriptomics – measures mRNA expression or community-wide gene expression.
  • Proteomics – finds the proteins translated.
  • Metabolomics – characterizes the metabolites.
  • Metaphenomics – a combination of the above within the context of a specific environment or phenotype.

Because sequences are generated directly from the environmental sample, the cultivation of microbial isolates is not necessary.

This image has an empty alt attribute; its file name is 1-s2.0-S2001037015000318-gr1.jpg
Different sequencing and bioinformatic strategies for human microbiome analysis.

Challenges that arise from designing sequencing experiments come from variation across samples in microbial content from the same environment. This complicates the detection of statistically significant and biologically meaningful differences among small sets of samples.

Obtaining appropriate controls for a complex environmental sample can also be difficult. In the case of plant microbiome studies, the soil can be sterilized to remove microbes whereas when studying the human microbiome it is recommended to use samples from the same habitat but over varying periods of time.


The omics studies allow for a global analysis to gain a microbial community snapshot. The addition of an “omics” suffix to a molecular term implies a comprehensive, or global, assessment of a set of molecules (

Omics approaches fall under the field of functional genomics which studies how genes and intergenic regions of the genome on a “genome-wide” scale (i.e. all or multiple genes/regions at the same time) contribute to different biological processes.

Genomics was the first omics discipline which studied the entire genome as opposed to investigating single genes or variants.

For a long time, little was known about the soil microbiome. It was labelled a “black box” since it was difficult to pull apart the role of microbes and the influence of their environment. Advances in sequencing technology has led to the development of remarkable approaches and protocols to decipher the incredible microbial world.

Functional genomics can answer questions such as:

  • Which microbial populations are associated with a specific phenotype or disease?
  • What is the relative abundance of a species of microbe within an environment?
  • Which microbial genes are expressed in a specific condition or environment?
  • How does a microbial population change in response to their environment?
  • What metabolites do microbes produce in a certain context?

We are now able to hone in on a number of outputs or products at varying levels of an organism’s or community’s genetic potential e.g. an organism’s entire gene expression (transcriptomics) or protein translation potential (proteomics).

BIOLOGY IS A big data problem

Modern biology requires the collaboration between the brains of the smartest biologists and computer scientists to algorithmically understand and optimize the mountains of data that spews out of next-generation sequencing machines.

Illumina announces $1000 whole human genome sequencing ...

Next-generation sequencing (NGS) allows several organisms to be detected in parallel in a sample by directly determining the nucleic acid sequence of a given DNA or cDNA molecule.

The drop in cost and time of sample processing and sequencing is leading to the generation of terabytes of data which need to be analyzed. Advances in the generation of sequencing data is said to be exponential, thereby increasing faster than computational power is growing.

The result of NGS is a number of short reads (segments of DNA) which need to be compared or mapped to a reference genome (entire strand of DNA) to assemble a new complete genome.

Figure 1 from FPGA Acceleration of Short Read Alignment | Semantic ...

This is called the read alignment problem.

Billions of reads of DNA or RNA nucleotides are aligned to a reference genome and are represented as rows within a matrix.

Compared to the human genome (~3 billion base pairs) which is 3.2GB in size, the bacterial genome has the benefit of being much smaller at an average size of ~3.87 MB (~3-4 million base pairs).

To read more about read mapping I recommend this excellent blog post.

Something I would like to know is how much data is generated in a human microbiome vs. a soil microbiome experiment?

Metagenomics – who is there?

Metagenomics focuses on the microbial community composition by identifying the proportion of sequences (DNA or 16S rRNA) that map to different taxa.

It involves the isolation, sequencing and analysis of the DNA of microbial communities directly from the environment.

This omics view captures the genetic composition of material extracted from an environment. It identifies who is there – the taxonomic composition of the sample and the functional gene information of the microbial identities.

Shotgun metagenomics, from sampling to analysis.

What is the difference between metagenomics and meta analyses? In case you thought metagenomics was the comparison of multiple whole genome datasets, metagenomics is simply the umbrella term for the tools to study all the genetic material of all the organisms in a sample whereas meta analyses carries out the comparison of features between multiple datasets.

First we need to foray into the world of 16S rRNA and understand its legacy in identifying microbial species.

16S ribosomal RNA

To observe and understand their organisms of interest, microbial ecologists use molecular sequencing to measure the abundance of different microbes in the sampled environment.

The 16S rRNA gene is considered the “barcode gene” in microbial ecology and is most commonly used to quantify microbial community structure and diversity. So much so that it is the most sequenced taxonomic marker with over 4 million database entries for 16S rRNA sequences existing in publicly available databases.

The 16S rRNA subunit is:

  • present in all bacteria.
  • highly conserved over time within a species.
  • has a high sequence similarity between species.
    • to an extent that it can fail to distinguish between certain species as described for Escherichia coli and Shigella spp. the 16S rRNA genes of which share >99% sequence identity (Devanga Ragupathi et al., 2017).
  • possesses unique markers or variable regions for taxonomic recognition and the classification of kingdoms.

Many labs use 16S rRNA sequencing to identify unknown bacterial strains. For identification purposes you only need to amplify and sequence regions of the 16S rRNA gene, not the whole gene, since sequence differences are so well defined. Sequencing of 16S rRNA has identified ~11,000 bacterial and archaeal species so far.

Ribosomal RNAs were first suggested as a molecular marker for studying molecular evolution by Carl Woese in the 1970s. From analyzing 16S rRNA sequences he proposed a new tree or phylogeny of life.

Originally there was a 5 kingdom taxonomy consisting of Animalia, Plantae, Fungi, Prostists and Monera. Woese et al. 1990 suggested that there should only be 3 kingdoms composed of Bacteria, Archaea and the Eurcarya.

On the molecular level he recognized that the living world should originate or branch from 3 distinct groups as the sequence differences in the 16S rRNA gene between his proposed 3 kingdoms were much more significant than between the plant and animal lineage for example.

The 16S rRNA gene differs between species and therefore is used for their identification and classification.

A reconstruction of the phylogenetic tree of life accurately portraying microbial evolution was possible through comparison of species’ small subunit rRNA sequence.

To this day subjects from eubacteria, archaebacteria and eukaryotes can be distinguished from one another based on this method.
Figure 1 | Variable regions of the 16S ribosomal RNA.
  • Eubacteria – hairpin loop lies between positions 500 and 545. This loop or bulge protrudes from the 5th or 6th base pair and consists of six nucleotides.
  • Eukaryotes – are identified by region between positions 585 and 655.
  • Archaebacterial – their 16S rRNAs have a unique region between positions 180 and 197 or 405 and 498.

Due to the abundance of 16S rRNA sequences available to refer to, this gene is the cornerstone for the current systematic classification of bacteria and archaea.

Amplicon Sequencing of 16s rRNA

Amplicon sequencing uses PCR to create sequences of DNA called amplicons. Slide taken from metagenomics video.


Polymerase chain reaction (PCR) sparked a revolution in molecular biology. It allowed for the selection of a specific region of DNA or RNA to be amplified i.e. generate millions of copies of the selected molecular region from a small sample.

By ingeniously taking advantage of the enzymatic properties of Taq polymerase – a heat stable form of DNA polymerase isolated from the thermophilic bacteria Thermus aquaticus – a sample’s DNA can be denatured at high temperatures while the Taq polymerase can retain its DNA replicating enzymatic activity, allowing the extension and synthesis of new strands from the original double-stranded template.

Since sequencing the 16S rRNA gene is an accurate and fast method for bacterial identification, the 16S rRNA gene is a common primer target in PCR to extract DNA from samples.

Primers mark the start and end of the piece of DNA desired for amplification.

Real-time PCR analysis is well suited to studies where analysis of a small number of genes in a small number of samples is needed.

16S rRNA sequencing

Closed reference

Once the sequences are extracted, clustering between sequences is performed to find those surpassing a required similarity threshold, usually 97% similarity or a distance of 3% (distances between pairs of sequences). This is called operational taxonomic unit (OTU) picking. Species taxonomy is then assigned by comparing the sequence to a database of 16S reference genomes.

Open reference

For segments of sequence that do not match a reference genome, de novo clustering algorithms are performed where sequenced reads are clustered against one another without comparison to an external reference sequence. From here the peptide amino acid product is converted from the amino acid sequence to understand the gene’s potential function.

16S rrNA advantages vs disadvantages


  • Relatively cheap so you can sequence many samples.
  • Therefore increasing the statistical power of your findings.
  • Extensive online databases to compare your sequence to.

Disadvantages of only using a single gene for sequencing:

  • Not quantitative in measuring the relative proportions of bacteria present as
    • Many microbes have multiple copies of the 16S rRNA gene in their genome.
  • Limited resolution in identification of taxa in comparison to metagenomics (NGS has resolution down to a single nucleic acid).
    • Can ID most often to the genus level, sometimes species level.
  • Compositional in that it measures the relative proportion of bacteria present as opposed to total bacteria.
The bar plots in (B) show the difference between the count of molecules and the proportion of molecules for two features, A (red) and B (gray) in three samples. In sample 2 and 3 the total bacteria counts for species A and species B are very different. After 16S sequencing the relative proportions of A and B in samples 2 and 3 turn out to have the same relative abundance even though the counts are much different.

While 16S rRNA sequencing is useful for identifying the relative number or abundance of organisms in a sample whole genome sequencing is necessary to identify the functional and metabolic potential of the organism.

Metagenomic whole genome sequencing

One of the main advantages of whole-genome sequencing over 16S sequencing is that it can capture sequences from all the organisms, including viruses and fungi, which cannot be captured with 16S sequencing alone.

Whole Metagenome sequencing Steps

  1. Extract DNA.
  2. Library preparation and sequencing.
2. All DNA from the sample is fragmented and sequenced without biased PCR amplification.

Mapping-based analysis

3. Generate reads and perform sequence alignment (using BLAST, bowtie2, DIAMOND) to map to existing gene in a database.

3. Reads are aligned to known sequences, pangenomes and protein databases.

A number of steps are required to process and analyze the raw reads generated from NGS which can be accomplished with a package like humann2. Different tiers represent steps in the mapping algorithm.

Tier 1 – Map to clade-specific marker genes which are known to uniquely identify a species e.g. using MetaPhlAn 2.0. By focusing on a small subset of genes, this is an efficient way of narrowing down the organisms present in a sample.

Tier 2 – In this stage, you take the species’ sequences where known genes were identified in tier 1 and map your reads to an online database of pangenomes. A pangenome is the entire gene set of all strains of a species.

Tier 3 – The ambiguous reads that were unclassified or unidentified from tier 2 get filtered down to tier 3 where the DNA reads are translated to their corresponding protein sequences and compared to a protein database. From here the function of these novel reads can be deduced from the protein structure.

assembly-BASED analysis

An alternative to mapping your reads to a reference genome is de novo or assembly-based analysis. This is necessary when studying obscure samples or organisms that do not have significant database entries.

Assembly-based analysis is a complex challenge accomplished by de Brujin’s graph algorithm. A set of slides describing de Brujin’s algorithm can be found here.

(A) In the de Bruijn graph approach, short reads are split into short k-mers before the de Bruijn graphs are built. (B) In the Hamiltonian approach, the k-mers (or sequences) are the nodes, whereas they are the edges in the Eulerian approach. The k-mers are connected to neighbors by overlapping prefix and suffix (k-1)-mers.
The present and future of de novo whole-genome assembly.

The reads are split up into small sections or k-mers with k denoting the number of nucleotides. From here a graph is formed and traversed depending on the probability of a nucleotide occurring in each position in the k-mer.

enter image description here

If the species’ genome is small, assembly may result in a complete genome or more commonly you can end up with a bunch of contigs which are variably-sized chunks of genome.

Metagenomic Results

Examples of some beautiful visualizations generated from metagenomic data:
Metagenomic Profiling of Soil Microbes.
Microbial life in the phyllosphere.

Microbial life in the phyllosphere.
Composition of the gut microbiotas of four snake species by bacterial (A) phylum and (B) genus.

16S vs. Whole genome sequencing

When deciding whether to use 16S surveys over metagenomics depending on your experimental design you need to consider that:

  • 16S amplicon sequencing is more sensitive to the presence or identification of rare organisms present in the sample.
  • Whole genome sequencing is less biased in measuring the relative abundance of organisms.

Metagenomics drawbacks

Although it is valuable to gleam the microbial taxa present in the sample it tells us little about gene expression in situ. This is because it detects all the DNA present which can include DNA from dead and dormant cells, DNA trapped in biofilms and from microbes in vastly varying physiological states where only a fraction of their genes are expressed at any given time.

For example, Verrumicrobia are often found in soil metagenomes but have low levels of gene expression in metatranscriptome data. Meaning although they are present in high amounts they are not very active members of the community.

Metatranscriptomics – What can they do?

The next step in a microbiome experiment is to examine which genes are expressed by the community. This is accomplished by looking at which organisms are actively transcribing which genes into RNA. The sum of all actively transcribed genes into RNA is called the transcriptome.

Known as “expression profiling“, metatranscriptomics is one of the most common study types.

The quantification can be done by extracting RNA (mostly mRNA but sometimes total RNA) from samples following a treatment of at fixed time-points in a time-series to view snapshots of gene expression patterns.

Translation: DNA to mRNA to Protein | Learn Science at Scitable
Translation: DNA to mRNA to Protein

messenger RNA

If you are still with me, I want to just briefly touch base with what mRNA or messenger RNA’s structure and role in protein synthesis is.

mRNA is the single-stranded, final genetic sequence that is ready to be read by a ribosome to create a protein. It is composed of introns (regions that do not make the cut for the final sequence) and exons (regions which will remain in the final sequence and encode proteins).

Since mRNA is the active genetic sequence in a cell which codes for protein, by looking at mRNA we can study the diversity of active genes within a microbial community, quantify their expression levels and monitor how these levels change in different conditions (e.g., physiological vs. pathological conditions in an organism).


Metatranscriptomics uses RNA-seq to determine which genes and pathways are being actively expressed within a community by examining mRNA.

RNA sequencing (RNA-Seq) is the application of NGS to cDNA molecules. The cDNA is obtained from reverse transcription of RNA.

There are many advantages to using RNA-Seq:

  • it offers extremely high resolution and sensitivity for genes expressed at low levels.
  • it is not limited to prior known knowledge about the organism i.e. availability of reference genomes.

This makes RNA-seq and NGS suited for discovery-based projects and in depth analyses aiming to:

  • identify new RNA transcripts.
  • study non-coding RNAs.
  • map transcription start sites.
  • characterize the exact location of epigenetic modifications.

The downsides of RNA-seq is its expense and its intensive computational processing and data analysis requirements.

RNA-Seq Process

Overview of a RNA-seq experiment for detecting differential expression
RNA-seq experimental design for detecting differential expression (DE).

The metatranscriptome process involves sample preparation, RNA sequencing using NGS and then a data analysis pipeline. You need to:

  1. Extract the RNA.
  2. Convert the RNA to cDNA (complementary DNA).
  3. Fragment DNA and sequence (single-end vs. paired-end).
  4. Align reads to find relative expression.

Reference based alignment

Introduction to RNA sequencing - Easy Guides - Wiki - STHDA

mRNA contains the spliced out exon regions put together.

Here reads are mapped directly to a reference genome, allowing for splice junctions across exons. Since the exon contigs do not align completely, the algorithm needs to take this into account. Modern tools for mRNA alignment include STAR, HiSAT2 and Bowtie.

Splice-aware alignment

Introduction to RNA-Seq - ppt download
Introduction to RNA-Seq slides.

A splice-aware aligner does not introduce a long gap where the the intron lies (which can be variable in length and also very long) in the final sequence, which occurs in regular DNA-DNA alignment. Instead, it ignores the introns and only aligns to the exons.

TopHat2 is a splice-aware software built on the splice-unaware Bowtie.


Modified Figure 1a,b
How “Pseudoalignments” Work in kallisto

A new program for quantifying abundances of transcripts from RNA-Seq data is kallisto. This program reinvented the wheel by working on the principle of pseudo-alignment.

My understanding of this algorithm is that it removes the need to align each individual base from each sequence of the original reads with those of the bases in the contigs of the reference transcriptome.

Instead of alignment, it directly compares raw RNA-Seq reads with that of a reference transcript. Quoting this blog post, Lior states kallisto has to “determine, for each read, not where in each transcript it aligns, but rather which transcripts it is compatible with”.

By bypassing the alignment step makes this algorithm extremely lightweight and fast. Read Nicolas Bray’s kallisto thesis here.

Metatranscriptome Challenges

Microbiome samples are a challenge as bacteria do not have polyA tail:

  1. Lack of polyA signal makes it difficult to isolate bacterial mRNA and results in (massive) rRNA contamination.
  2. Environmental microbiome samples lack reference genomes making it difficult to map reads back to their source transcripts.

RNA-Seq data analysis

RNA-seq processing pipeline used to generate gene expression data in Expression Atlas
RNA-seq processing pipeline used to generate gene expression data in Expression Atlas.

The cleaning and analysis of RNA-Seq data involves many steps beyond the scope of this blog post but I will give an overview of the pipeline.

Once raw fastq sequences have been obtained, data analysis involves:

  • quality control.
  • read alignment (with or without a reference genome).
  • quantification of gene and transcript expression and
  • detection of differentially expressed genes.

Stranded RNA-Seq data

Depending on the program you’re using you’ll need to decide on the parameters or settings of alignment. Sometimes a read will map to an overlap of 2 different genes. In this case you will need to set whether that instance will be defined as ‘ambiguous’ or else if it will be mapped to the gene with the largest amount of overlap.

HTSeq-count can count the number of genes that map or align to each gene.

Intro to RNA-Seq data analysis.

Once the stranded data has been aligned to a gene, a count matrix is constructed using genes as the rows and the samples as the columns. From here you can understand if your gene of interest is up-regulated or down-regulated in your experiment.

Principle Component Analysis (PCA) plots are generated to visualize the variation present in your dataset. This is a quality control step that can be generated using R’s DESeq2

The number of principle components is equal to the number of samples in the dataset. By measuring the distance between samples or data points you can see which has the largest spread or variation.

The proportion of variance is explained by each component and can be used to mark the difference between groups.

Features with a low variance score can be excluded from the analysis.

Differential gene expression analysis

Hierarchical Clustering with correlation heatmaps

Heatmaps are a common way of visualizing gene expression data. The genes and/or samples are clustered together based on the similarity of their gene expression pattern.
Diversity and Co-occurrence Patterns of Soil Bacterial and Fungal Communities in Seven Intercropping Systems

Heatmap data is arranged into rows which represent genes and columns which are the samples. The genes are shown to be either up-regulated or down-regulated shown by changes in colour intensity.

How to make heatmaps with R:

Further interpretation and analysis is needed to find out if differentially expressed genes are associated with a specific biological process or molecular function.


The proteome captures all the proteins which were translated. This is the logical next step from transcriptome analysis. Since not all expressed genes are translated into proteins this is a more accurate representation of functions that were carried out in the sample’s environment at that snapshot in time.


Metabolites are the intermediary or end products of metabolic reactions. Looking at the metabolites present can help understand what biochemical reactions are occurring in the environment and thus construct microbial biochemical pathways and networks.


A relatively new approach and concept to omics studies is understanding the product of the combined genetic potential of the microbiome within its environment environment (resources available, spatial, biotic and abiotic constraints).

This image has an empty alt attribute; its file name is 1-s2.0-S1369527417302205-gr2.jpg
The soil microbiome — from metagenomics to metaphenomics

Metaphenomics attempts to answer the problem of uncovering what is going on at any given moment within the microbial community. By understanding gene expression in situ, the complexity of microbe-microbe interactions as well as microbe-plant ones can be formulated. Biochemical pathways can be elucidated.

Understanding the fine scale distribution of microbes and resources is required to predict species physiology and metabolic interactions among community members, that comprise the collective soil metaphenome.

The end goal of this would be to isolate or categorize microbial groups based on metabolic function or their specific phenotype, for example:

  • Cellulose decomposition.
  • Methanogenesis.
  • Sulfate reduction.

A couple of suggested ways of studying the metaphenome is to introduce soil with known metabolic capabilities into ‘synthetic’ communities of microbes and measuring what genes are expressed or if the composition of microbes in the environment fluctuates.


I hope this brief foray into the world of microbiome genetics has illuminated some of the processes underlying modern research methods. Hopefully in a future blog post I can delve deeper into these workflows using real NGS data.


Cover photo art | Antoine Dore

Learn about functional genomics for free |

Metagenomics slides taken from Introduction to Metagenomics for Researchers video.

RNA-Seq course 2019 |


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.