Metagenomics and discovering viruses from sequences
Analyzing shotgun metagenomics data has become an exciting method of virus discovery in modern microbiology. Research insights gleaned this way are described as in silico, i.e. obtained by computer.
Most of the virus genomes we currently have access to come from metagenomics (the sequencing of all DNA from the many species found in a single sampled environment). Even though a particular virus can be highly abundant in its environment, more often than not its host has not been identified, and so the virus has never been cultured in the lab.
From these sequences we know that the human gut virome is rich in bacteriophages. Gut virome composition is also stable and highly specific to each individual.
Up to 80% of all viral sequences in a shotgun metagenomics dataset have no detectable homology to sequences in reference databases. These unknown viral sequences have been termed viral dark matter and present a unique opportunity to mine new viral signals and discover new viruses from publicly available genomics data.
One example of a virus discovered through bioinformatics and metagenomics is crAssphage (cross-assembly phage), named after the cross-assembly software, crAss, which aided in its discovery.
crAssphage Fun facts
- CrAssphage is the most abundant phage in our gut.
- It can account for up to 90% of the reads in the viral portion of the gut metagenome and around 22% of the reads of the total metagenome in some individuals.
- 98-100% of healthy adults from Western samples carry at least 1 type of crAssphage.
- CrAssphage-like sequences have been found in faecal samples from remote regions such as rural Malawi and from the Amazonas of Venezuela.
- It is found in all age groups (from as young as 1 year old to 65+) and is thought to be acquired during early childhood.
- It appears to be a benign gut inhabitant, as no associations between it and diet or health have emerged.
- CrAssphage is “a cosmopolitan virus that may have coevolved with the human lineage and is an integral part of the normal human gut virome.”
- It is found in human populations all over the world, and crAssphage populations are very region-specific. (You can identify the region someone is from by their crAssphage population!)
- CrAss-like phages form an order of phages containing up to 78 different genera.
I highly recommend watching the 2 excellent presentations by Rob Edwards (5:10) and Colin Hill (26:39) from the recent iVoM episode to get a better understanding of crAssphage biology and the current ongoing research.
Cross Assembly (crAss)
So what is cross-assembly? Cross-assembly involves combining the reads from multiple metagenomics datasets and making a single assembly from them, i.e. a single assembly from two or more metagenomes. The resulting contigs may be interpreted as ‘metagenomic entities’ or traits that are shared between the sampled environments.
This makes cross-assembly a reference-independent comparative metagenomics method in which you compare all the reads of multiple different samples. If a contig contains reads from all samples (a cross-contig), or from some samples and not others, you can deduce the interrelationships between their metagenomes.
Cross-contigs are contigs that are common across 2 or more metagenomes.
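As a rough illustration of the idea, assume the assembler can tell you which contig each read ended up in, and that every read is tagged with its sample of origin. The data layout and the function name below are hypothetical and are not the input format of the crAss program itself; this is only a minimal sketch of how cross-contigs could be picked out.

```python
from collections import defaultdict


def find_cross_contigs(read_assignments, min_samples=2):
    """Return {contig: {sample: read_count}} for contigs that contain
    reads from at least `min_samples` different samples.

    read_assignments: iterable of (contig_id, sample_id) pairs,
    one pair per assembled read (hypothetical layout).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for contig_id, sample_id in read_assignments:
        counts[contig_id][sample_id] += 1
    return {
        contig: dict(per_sample)
        for contig, per_sample in counts.items()
        if len(per_sample) >= min_samples
    }


# Toy, made-up data: three samples and three contigs.
reads = [
    ("contig_1", "sample_A"), ("contig_1", "sample_B"), ("contig_1", "sample_C"),
    ("contig_2", "sample_A"), ("contig_2", "sample_A"),
    ("contig_3", "sample_B"), ("contig_3", "sample_C"),
]
cross = find_cross_contigs(reads)
print(sorted(cross))  # contig_1 and contig_3 hold reads from more than one sample
```

Raising `min_samples` to the total number of samples keeps only the contigs shared by every sample, which is the kind of contig that makes a good candidate for a ubiquitous entity.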
This is how researchers discovered crAssphage. They found a sequence that was common to all subjects and then went down the rabbit hole of investigating it further.
Benefits of cross-assembly? When analyzing metagenomics data, the first port of call is often to align your assembled reads to reference genomes and see if you get any hits to known sequences. For viral sequences, there are often no reference genomes, or the existing databases are limited. Also, if your goal is viral discovery, you will need to uncover patterns in your data without relying on a reference. This is where deep learning and alternative methods of analysis such as cross-assembly come in.
Discovery of crAssphage approach (Dutilh, 2014)
Data: They re-analyzed publicly available viral metagenomes isolated from faecal samples of 12 individuals, comprising 4 pairs of healthy female monozygotic twins and their mothers. The 4 families were unrelated.
Software: They used the cross-assembly program crAss.
Data analysis approach: From a total of 1,584,658 metagenomic reads derived from the 12 individuals’ faecal viral metagenomes, they performed a cross-assembly, which resulted in 7,584 cross-contigs.
One short cross-contig, which they named contig07548, contained reads from all 12 subjects, hinting that it belonged to a common, ubiquitous viral entity.
Using depth profile binning as well as homology binning they collected all the contigs that were likely derived from the same ubiquitous viral genome as contig07548.
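To give a feel for what depth profile binning means, here is a minimal sketch. It assumes each contig has a vector of read depths, one value per sample, and it groups contigs whose depth profile correlates strongly with that of a seed contig. The use of Pearson correlation, the 0.9 threshold, and all the names and numbers are assumptions made for illustration, not the settings used in the original study.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-variation signal
    return cov / math.sqrt(var_x * var_y)


def bin_by_depth_profile(depths, seed, min_r=0.9):
    """Return contigs whose depth profile correlates with the seed's.

    depths: {contig_id: [depth in sample 1, depth in sample 2, ...]}
    min_r: hypothetical correlation threshold.
    """
    seed_profile = depths[seed]
    return sorted(
        contig
        for contig, profile in depths.items()
        if contig != seed and pearson(seed_profile, profile) >= min_r
    )


# Made-up depths across four samples.
depths = {
    "seed":     [10, 0, 25, 5],
    "contig_a": [20, 1, 52, 9],   # rises and falls with the seed
    "contig_b": [3, 30, 2, 40],   # opposite pattern
    "contig_c": [7, 7, 7, 7],     # flat
}
binned = bin_by_depth_profile(depths, "seed")
print(binned)
```

The reasoning is that pieces of the same genome should be abundant in the same samples and rare in the same samples, so their depth profiles move together.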
Through separate assembly of 1 sample’s viral metagenome they extracted the most complete circular crAssphage sequence of ~97 kb.
Predicting the host: Phage can only persist when a host is present, so the authors used co-occurrence profiling to predict phage-host interactions by looking for correlations between the presence of the crAssphage sequence and the presence of bacterial sequences in the reads.
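A co-occurrence profile can be sketched as follows, assuming you already have the relative abundance of the phage and of each candidate bacterial taxon in every sample. The choice of Spearman rank correlation, the taxon names, and the numbers are all assumptions for illustration; the text above does not say which correlation measure the authors used.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def rank_candidate_hosts(phage_abundance, bacterial_abundance):
    """Sort taxa from most to least correlated with the phage."""
    scores = {
        taxon: spearman(phage_abundance, profile)
        for taxon, profile in bacterial_abundance.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up relative abundances across five samples.
phage = [0.02, 0.30, 0.00, 0.15, 0.08]
bacteria = {
    "taxon_X": [0.10, 0.45, 0.05, 0.30, 0.20],
    "taxon_Y": [0.40, 0.05, 0.50, 0.10, 0.30],
    "taxon_Z": [0.20, 0.20, 0.25, 0.15, 0.22],
}
ranking = rank_candidate_hosts(phage, bacteria)
for taxon, score in ranking:
    print(taxon, round(score, 3))
```

A high correlation is only a hint: it says the phage and the taxon tend to appear together, not that one infects the other.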
CRISPR spacer prediction: CRISPR spacers are fragments of viral DNA that bacteria incorporate into their own genome. These fragments were collected from viruses that previously tried to attack the bacterial cell. This process is part of the CRISPR-Cas system, which bacteria evolved as a defense mechanism against foreign invaders. By keeping this record of past attackers, bacteria can recognize viral invaders of the same kind and express a Cas enzyme such as Cas9 to cut the intruding viral DNA.
By comparing the crAssphage genome against the co-occurrence profiles and CRISPR spacers of 404 intestinal bacterial strains, they predicted that the likely host belongs to the phylum Bacteroidetes. Bacteroidetes are among the most common gut inhabitants, so with crAssphage being so abundant, it makes sense for a member of Bacteroidetes to be its host.
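The spacer-matching part of this can be sketched in a few lines. The version below only looks for exact matches of each spacer, on either strand, in the phage genome, and counts the hits per candidate host. Real pipelines typically use an alignment tool that tolerates a few mismatches, so treat exact matching, the function names, and the toy sequences as simplifying assumptions.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def count_spacer_hits(phage_genome, spacers_by_host):
    """Count, per host, how many spacers occur exactly in the phage genome.

    spacers_by_host: {host_name: [spacer sequences from that host's CRISPR arrays]}
    Both strands are checked, since a spacer can match either one.
    """
    genome = phage_genome.upper()
    hits = {}
    for host, spacers in spacers_by_host.items():
        matched = 0
        for spacer in spacers:
            forward = spacer.upper()
            if forward in genome or reverse_complement(forward) in genome:
                matched += 1
        hits[host] = matched
    return hits


# Made-up sequences, far shorter than real genomes or spacers.
phage_genome = "ATGCGTACGTTAGCCGATAGGCTTAACGGATCCGTA"
spacers_by_host = {
    "host_1": ["ACGTTAGCCGATAG", "GGGGGGGGGGGGGG"],  # first spacer matches the forward strand
    "host_2": ["TACGGATCCGTTAA"],                    # matches the reverse strand
    "host_3": ["CCCCCCCCAAAAAA"],                    # no match
}
hits = count_spacer_hits(phage_genome, spacers_by_host)
print(hits)
```

A host whose CRISPR arrays contain spacers matching the phage has evidently met that phage (or a close relative) before, which is why spacer hits are used as evidence for a phage-host link.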
First isolated crAssphage, ΦcrAss001 (Shkoporov, 2018)
The first crAss-like phage was successfully cultured by Andrey Shkoporov and colleagues, who named it ΦcrAss001. With the triumph of culturing the first crAss-like phage came confirmation of the specifics of its genome and host.
Findings: As well as confirming the exact nature of the genome and the host, some interesting morphological and biological traits were uncovered:
- The ΦcrAss001 genome is circular, 102 kb in size, and has unusual structural traits.
- Electron microscopy confirmed ΦcrAss001 has a podovirus-like morphology.
- The host is Bacteroides intestinalis.
- Despite the absence of obvious lysogeny genes, ΦcrAss001 replicates in a way that does not disrupt proliferation of the host bacterium, and is able to maintain itself in continuous host culture for several weeks.

Isolation approach: Faecal samples collected from 20 healthy Irish adults were processed to enrich for the phage present. These samples were then pooled and used to infect 54 bacterial strains commonly found in the human gut microbiome.
After three rounds of growing the numerous bacterial strains, a cell-free supernatant was extracted from each of the 54 bacterial samples and underwent shotgun metagenomic sequencing.
From analysing the assembled reads for each of the bacterial strains, they found that a 102.7 kb contig dominated the reads from the Bacteroides intestinalis culture. By comparing this contig to the one identified by Dutilh in the cross-assembly study, they could conclude that it was a crAssphage they had isolated.

Since crAss-like phages are notoriously difficult to culture, this points to these viruses having a narrow host range.
crAss-like viruses in general and ΦcrAss001 in particular are likely to be narrow specialists, rather than generalists in terms of their host range.
Future crAssphage study
It will be super interesting to see the functions and role of this ubiquitous phage in our microbiome elucidated further. Since it is such an old and dominant gut inhabitant, you would imagine it occupies an important niche and benefits its host and its environment somehow.
Sources
Reference-independent comparative metagenomics using cross-assembly: crAss
A highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes
Global phylogeography and ancient evolution of the widespread human gut virus crAssphage
Streamlining CRISPR spacer-based bacterial host predictions to decipher the viral dark matter