Genomic approaches to studying the microbiome

May 18, 2020

Genomic approaches to studying the microbiome

Microbes in a community interact with each other and the host, so it is important to capture as much of the diversity of the microbiome within its phenotypic context as possible.

Understanding how microbial species and the host are related through genome-based and high-throughput techniques is a hot topic in microbiome research.

In my previous post on the plant microbiome I dropped many references to genomics studies which have informed so much of our knowledge on plant–microbe interactions. In this post, I will discuss the modern genomics techniques used in studying the microbiome and the genetic markers which have been set as goal posts.

The main questions when approaching a microbiome study are:

  • Who is there? — taxa identification, their abundance and distribution.
  • What are they doing? — what functional chemistry is being carried out and what environmental products are being consumed and excreted?
  • How are they doing it? — which enzyme pathways are present? Is pathway expression dominant in one species over another?

Abundance = high quantity in a sample.
Prevalence = consistently found across samples.
If both → consistently found in high quantities across samples.


HOW can we study the Microbiome?

Classic microbiology techniques involve isolating and culturing microbes. These culture-dependent techniques are limited — most microbes cannot be grown under lab conditions.

This mismatch between observed cells vs. cultured cells was dubbed the great plate count anomaly.

Great plate count anomaly

Culture-independent methods use DNA sequencing:

  • 16S rRNA surveys — identify bacteria by sequencing ribosomal RNA.
  • Metagenomics — survey all DNA present.
  • Metatranscriptomics — measure mRNA expression.
  • Proteomics — detect proteins.
  • Metabolomics — identify metabolites.
  • Metaphenomics — combine multiple omics in context.

Sequencing and bioinformatics strategies

Challenges:

  • Variation between samples complicates analysis.
  • Controls can be difficult (e.g., sterilized soil vs. human longitudinal sampling).

OMICS approaches

The suffix omics = comprehensive, global study (omics.org).

Omics overview

Functional genomics examines how genes and regions across the genome contribute to biological processes.

Questions it can answer:

  • Which populations are associated with a phenotype/disease?
  • What is the relative abundance of species?
  • Which genes are expressed under specific conditions?
  • How does a population shift under environmental change?
  • What metabolites are produced?

BIOLOGY IS A big data problem

NGS generates vast amounts of sequencing data, requiring computational power and bioinformatics expertise.

Illumina genome sequencing

Polynucleotide

The cost of sequencing has plummeted, but data output is growing faster than Moore’s law for computing.

📺 Watch: The $1000 genome

Short reads must be mapped to a reference genome to assemble full genomes.

Short read alignment

This is the read alignment problem.

Reads aligned to a reference genome

Compared to the human genome (~3 billion bp, 3.2GB), bacterial genomes are tiny (~3–4 million bp, ~3.87MB) (source).


Metagenomics — who is there?

Metagenomics = sequencing all DNA in a community. It identifies taxa present and their functional genes.

Shotgun metagenomics workflow

Metagenomics ≠ meta-analysis. Metagenomics = genetic survey of an environment. Meta-analysis = comparison across datasets.


16S ribosomal RNA

16S rRNA is the barcode gene in microbial ecology. It’s:

  • present in all bacteria
  • highly conserved within species
  • has variable regions useful for taxonomy
  • sometimes insufficient (e.g., E. coli vs. Shigella >99% identical Devanga Ragupathi et al., 2017)

16S rRNA gene

Carl Woese pioneered 16S phylogenetics in the 1970s, proposing the 3-domain system (Bacteria, Archaea, Eukarya) Woese et al., 1990.

16S rRNA variation

Variable regions

Advantages of 16S:

  • cheap
  • high sample throughput
  • huge databases

Disadvantages:

  • not quantitative (multiple 16S copies per genome)
  • limited resolution (often genus-level)
  • compositional (relative, not absolute)

16S compositional bias


Metagenomic whole genome sequencing

WGS captures all DNA, including viruses/fungi missed by 16S.

Metagenome analysis

Steps:

  1. Extract DNA
  2. Prepare library + sequence
  3. Map reads to marker genes, pangenomes, or protein DBs (MetaPhlAn, HUMAnN2)
  4. Or assemble de novo (de Bruijn graphs)

Mapping reads

Assembly de Bruijn graphs

Output: complete genomes (rare) or contigs (common).


Metatranscriptomics — what can they do?

Looks at RNA transcripts to see which genes are actively expressed.

Translation DNA→RNA→Protein

mRNA = single-stranded transcript coding for proteins. By sequencing RNA (RNA-Seq), we capture active functions.

Process:

  1. Extract RNA
  2. Reverse transcribe to cDNA
  3. Fragment + sequence
  4. Align reads → quantify expression

RNA-Seq workflow

Approaches:

  • Reference-based alignment (STAR, HiSAT2, Bowtie)
  • Splice-aware alignment (TopHat2, etc.)
  • Pseudo-alignment (kallisto Bray et al. 2015)

Challenges:

  • bacteria lack polyA tails → high rRNA contamination
  • many microbes lack reference genomes

RNA-Seq pipeline


Differential gene expression analysis

Heatmaps, PCA, clustering are used to visualize expression.

Heatmap of soil microbes


Proteome

The proteome = all proteins expressed. More accurate than transcriptome since not all transcripts → proteins.


Metabolome

Metabolites = intermediates + end products of metabolism. Reveal biochemical pathways.


Metaphenome

The metaphenome = combined functional output of a microbial community in its environment.

Soil metaphenomics

Attempts to connect in situ expression to ecosystem-level functions like cellulose decomposition, methanogenesis, sulfate reduction.


Conclusion

I hope this foray into microbiome genomics illuminated some of the methods shaping modern research. In future posts I’ll dive deeper into workflows using real NGS data.


References