Also referred to as "community genomics" or "environmental genomics", metagenomics is the sequencing and analysis of DNA of microorganisms recovered from an environment, without the need for culturing them.
We currently have little information on the vast majority of microorganisms present in Earth's different environments, mainly because we cannot culture them in the laboratory; estimates are that <1% of all bacterial species have been cultured. Historically, this inability stems from a lack of knowledge of their physiologies and of the environmental cues that would allow the design of suitable culture media, although new cultivation technologies are beginning to address this problem (Zengler, Walcher et al. 2005). Nonetheless, most of our knowledge has been gleaned from the relatively small number of presently cultured representatives. Imagine if aliens landed on the continent of Antarctica, studied penguins, and considered this an adequate representation of all life on Earth!
Metagenomics enables us to study microorganisms by deciphering their genetic information from DNA that is extracted directly from communities of environmental microorganisms, thus sidestepping the need for culturing or isolation. This discipline builds on the successes of culture-independent 16S rRNA surveys of environmental samples (Olsen, Lane et al. 1986). It offers insights into the evolutionary history as well as previously unrecognized physiological abilities of microbial communities specialized to live in a given environmental niche. That is, it allows us to address the questions "who is there?," "what are they doing?," and "how are they doing it?" The resultant wealth of genes and molecular structures deciphered from uncultured microorganisms has tremendous potential in the development of novel biocatalysts for industrial and medical applications.
Some of the specific aims of a metagenomics project may include:
- Examining phylogenetic diversity using 16S rRNA and other phylogenetically-informative genes - diversity patterns of microorganisms can be used for monitoring and predicting environmental conditions and change.
- Examining genes/operons for desirable enzyme candidates (e.g., cellulases, chitinases, lipases, antibiotics, other natural products); these may be exploited for industrial or medical applications.
- Examining variation or diversity within genes for key enzymes; this may help in identifying or designing optimal catalysts.
- Examining secretory, regulatory, and signal transduction mechanisms associated with samples or genes of interest. Learning how these are organized and potentially regulated may aid in enhancing enzymatic activities.
- Examining transporter systems - what can we predict about nutrient pools or substrates that might be encountered in the environment?
- Examining bacteriophage or plasmid sequences. These potentially influence diversity and structure of microbial communities.
- Examining potential lateral gene transfer events. Knowledge of genome plasticity may give us an idea of selective pressures for gene capture and evolution within a habitat.
- Examining genes/operons for nutrient gathering, auto-inducers (for community sensing), central intermediary metabolism, etc. These may provide insights into syntrophic interactions or reveal the basis for the success of organisms in their environment.
- Examining metabolic pathways. A comprehensive understanding of this may lead to a directed approach towards designing culture media for the growth of previously-uncultured microbes.
- Examining genes that predominate in a given environment compared to others. Recognition of "marker" genes in an environment can assist in developing methods to understand community responses, interactions, and processes.
Finally, metagenomic data and metadata can be leveraged towards designing low- and high-throughput experiments focused on defining the roles of genes and microorganisms in the establishment of a dynamic microbial community.
How to Do a Metagenomics Project
A model metagenomics project (see figure below) begins with the isolation of DNA from a mixed microbial population collected from any given environment (e.g., sewage digestor, dental plaque, termite gut, medical implant, your computer keyboard). Environmental DNA is then sheared into fragments that are used in construction of a DNA clone library. Clone libraries are either small- or medium-insert (2-15 kb insert size) libraries or large-insert bacterial artificial chromosome (BAC) or fosmid libraries (up to 150 kb insert size). Either kind may be sequenced in a random or targeted fashion.
In a random sequencing approach, the clones are randomly chosen and end-sequenced, and the resulting sequences are assembled into larger contiguous pieces ("contigs") by matching up overlapping sequences. The resulting data are contigs of different lengths as well as shorter unassembled fragments. The availability of completely sequenced "reference" genomes may assist in the assembly process for closely related genomes. In the absence of these, contigs may be assigned to various "bins" based on their G+C content, codon usage, sequence coverage, presence of short n-mers (nucleotide frequency), and other parameters, allowing them to be sorted into groups that can be viewed as a "species." Coding sequences (CDSs, or colloquially "genes") are then predicted from these sequence data using various methods. Often in the random sequencing approach, identified genes may not be attributable to a particular microbial species (i.e., there is no taxonomic or phylogenetic affiliation). These nonetheless represent abilities of the general microbial community and may reveal characteristics of their environment.
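To make the binning idea concrete, here is a minimal Python sketch that computes two of the signals mentioned above, G+C content and short n-mer frequencies, and then sorts contigs into coarse G+C bins. The function names, the fixed-width bins, and the toy contigs are assumptions made for illustration only; real binning tools combine several signals (including coverage and codon usage) and use far more careful statistics.

```python
from collections import Counter
from itertools import product


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous (A/C/G/T) bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def kmer_profile(seq: str, k: int = 4) -> dict[str, float]:
    """Relative frequency of every possible k-mer, counted over overlapping windows."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[km] for km in kmers)  # windows with ambiguous bases are ignored
    return {km: (counts[km] / total if total else 0.0) for km in kmers}


def bin_by_gc(contigs: dict[str, str], n_bins: int = 10) -> dict[int, list[str]]:
    """Group contig names into n_bins equal-width G+C bins (a deliberately crude scheme)."""
    bins: dict[int, list[str]] = {}
    for name, seq in contigs.items():
        index = min(int(gc_content(seq) * n_bins), n_bins - 1)
        bins.setdefault(index, []).append(name)
    return bins


# Hypothetical toy contigs; real contigs are thousands of bases long.
toy_contigs = {"c1": "ATATATGC", "c2": "TTAAGCAT", "c3": "GCGCGCAT"}
gc_bins = bin_by_gc(toy_contigs)
```

In this toy input, c1 and c2 share a low G+C content and land in the same bin, while the G+C-rich c3 lands in another. The n-mer profile returned by kmer_profile could be compared between contigs in the same way, for example with a distance measure of your choice.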
In a "targeted" sequencing approach, clones are first screened for the presence of a desirable gene (e.g., by PCR amplification) or a gene function (by functional assay). Sequencing targeted large-insert clones in their entirety allows the possibility of recovering complete operons, e.g., those encoding metabolic pathways.
A common approach is to target fosmids bearing phylogenetically informative genes such as 16S rRNA. In this method, known as "phylogenetic anchoring," if a 16S rRNA gene is detected, the fosmid insert is sequenced in its entirety, allowing us to assign the genomic DNA sequence to a specific phylotype. So, unlike random sequencing, this approach helps affiliate phylogeny (rRNA) with putative functional genes (predicted from flanking insert sequences), and permits us to confirm/dispute old assumptions of "who does what." A classic example of this was the discovery of rhodopsin-like photoreceptor genes (light-driven proton pumps for energy production) in Monterey Bay BAC clones harboring 16S rRNA genes (Beja, Aravind et al. 2000). Where previously rhodopsins (and rhodopsin-based phototrophy) were thought to be exclusively archaeal in origin, metagenome data from phylogenetically anchored clones indicated that this functionality exists in marine γ- and α-proteobacteria as well (Beja, Aravind et al. 2000; Sabehi, Beja et al. 2004; Sabehi, Loy et al. 2005). Thus, metagenomics is emerging as a powerful method to study the function and physiology of the unexplored microbial biosphere, and is causing us to reevaluate basic precepts of microbial ecology and evolution.
Fosmids bearing process-specific or biomarker genes (e.g., for processes that may be prominent in the environment under study, like methane oxidation or denitrification) may also be targeted for sequencing to expand information on pathways for these processes.
An obvious limitation of these approaches is that they rely on "what we already know," i.e., PCR primers or probes are designed based on consensus sequences of known genes. Random sequencing of clones, on the other hand, offers the potential to discover completely unknown genes or genes that are too different from currently known genes to be amplified by PCR or hybridize with probes. The optimal strategy may be to combine both random and targeted approaches. In this strategy, genes of interest (e.g., 16S rRNA genes from unknown phylotypes) or novel genes identified from the random sequencing phase may be used to screen and target other clones for sequencing or identify linking clones and expand genome coverage.
Data is obtained in the form of numerous sequence fragments, and near-complete genomes can be re-assembled from overlapping sequence data, particularly for DNA samples from low-complexity environments (i.e., lower species diversity, e.g., acid mine drainage biofilm). However, in higher complexity environments (e.g., Sargasso Sea microbiome), genome reconstruction is a significant challenge. Then again, complete genome sequences are not always necessary for a meaningful understanding of microbial communities - metabolic pathways and interactions can still be pieced together from fragmentary data (Tringe, von Mering et al. 2005). These types of studies also provide a surplus of novel genes and molecular structures that may have potential in development of novel biocatalysts for industrial and medical applications.
Despite these methodologies, many genes may go unnoticed due to their "unclonability" in a heterologous or non-native host like Escherichia coli (the most commonly used host for cloning libraries). Failure to produce clones representing these novel genes arises primarily from their toxicity in E. coli. Basically, these genes may be too "foreign," and their expressed proteins may disrupt the operation of the host cell. Sequencing technologies like 454 can address this problem because they eliminate the cloning step (required for traditional Sanger sequencing-based technologies). However, there are currently some disadvantages to applying this technology in metagenomics, such as shorter sequence reads (~100-150 nucleotides, as opposed to ~800 nucleotides for Sanger sequencing) and the lack of paired-end sequences, which create major challenges for assembly and annotation.
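The idea of re-assembling longer sequences from overlapping fragments can be illustrated with a toy greedy merger in Python. This is only a sketch under strong assumptions: reads are error-free, all come from the same strand, and overlaps must match exactly. The function names, the minimum overlap of 3 bases, and the example reads are hypothetical, and production assemblers use very different, far more robust algorithms.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair with the longest exact overlap until none remains."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep input order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return seqs  # contigs plus any fragments that overlap nothing
        merged = seqs[best_i] + seqs[best_j][best_len:]
        seqs = [s for k, s in enumerate(seqs) if k not in (best_i, best_j)]
        seqs.append(merged)


# Three overlapping toy reads and one read that overlaps nothing.
toy_reads = ["ATGGCGT", "GCGTACC", "TACCTGA", "TTTTTTTT"]
contigs = greedy_assemble(toy_reads)
```

Here the first three reads chain into a single contig, while the fourth stays behind as an unassembled fragment, mirroring the mix of contigs and leftover fragments described above. A greedy strategy like this is easily misled by repeats and by near-identical sequences from related strains, which is one reason high-complexity samples are so hard to assemble.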
How do environments get chosen for a metagenomics project? There are many rationales ranging from an exploratory survey to establish a framework for other studies to targeting certain physiological groups to exploit their abilities for biotechnological applications. The three "Cs" of metagenomics - Concept, Complexity and Cost - are all important considerations.
Metagenomics Case Studies
SARGASSO SEA MICROBIOME (Venter, Remington et al. 2004)
The Sargasso Sea community metagenomic survey in 2004 yielded in excess of 1625 million basepairs (Mbp) of data (almost 1000 times more than the first sequenced bacterial genome in 1995 - 1.8 Mbp from Haemophilus influenzae). The rationale for selecting the Sargasso Sea site was manifest. Marine organisms represent the closest living descendants of the progenitor forms of life and contribute to key elemental budgets in the biogeochemical cycles of our planet. The Sargasso Sea was also a well-studied marine environment and was predicted to possess "relatively" low microbial species diversity due to nutrient-poor conditions. It would therefore be more amenable to metagenomics than another marine site with greater diversity of microorganisms. A random sequencing approach was used to generate more than 1625 Mbp of data from more than 1,800 different bacterial species (including 148 novel bacterial phylotypes), encoding over 1.2 million new gene sequences. One of the highlights of the study was the expansion of a family of proteorhodopsin genes. Proteorhodopsins are light-driven proton pumps used for energy production (phototrophy). The discovery of an expanded set of proteorhodopsins with different spectral properties has potential biotechnological applications in optical data storage and signal processing. The data also revealed the presence of an ammonium monooxygenase gene (for ammonium oxidation, producing nitrite) in archaea-associated assemblies, prompting a reevaluation of the precept that oceanic nitrification is an exclusively bacterial occupation. The catalog of genes generated from this study builds an essential framework for subsequent genomic comparisons by enriching gene databases with information from unexplored organisms. It also provides a surplus of novel genes or molecular structures that have potential in the development of novel biocatalysts.
ACID MINE DRAINAGE BIOFILM (Tyson, Chapman et al. 2004)
Acid mine drainage (AMD) is a worldwide ecological disaster resulting from commercial mining operations and exacerbated by the activity of uncultured extremophiles that flourish in this inhospitable environment. A random sequencing approach was used to sequence the metagenome of a microbial biofilm community from AMD in Iron Mountain, California. In striking contrast to the Sargasso Sea survey, the AMD biofilm consisted of only five dominant species, and 75 Mbp of data was sufficient to reconstruct two near-complete genome sequences and gather detailed information about metabolic pathways and even strain-level variation. Insights gained from this data were also leveraged towards isolating an uncultivated bacterial species from the AMD biofilm (Tyson, Lo et al. 2005). This is one of the major ambitions for metagenomics: figuring out optimal culture conditions (that represent the natural environment) for isolation of new species.
METHANE-OXIDIZING ARCHAEA FROM DEEP SEA SEDIMENTS (Hallam, Putnam et al. 2004)
Anaerobic methane oxidation (CH4 + H2O → CO2 + H2) by archaea in marine deep sediments plays a major role in reducing methane (a greenhouse gas) released from oceans into the atmosphere. A metagenomic study of such uncultured communities from a methane seep in Eel River Basin off the California coastline was undertaken. This study utilized an enrichment step to reduce complexity of the sample and select for archaeal DNA. The enrichment step involved size selection (for archaeal cells and sulfate-reducing bacteria) by density centrifugation and size-fractionation prior to clone library construction. Random and targeted sequencing of fosmid clones produced ~120 Mbp of DNA that was examined for methanogenesis pathways. The conspicuous absence of a single key enzyme of the 7-step methanogenesis pathway in these samples lent support to a previously proposed "reverse-methanogenesis" hypothesis (Hallam, Girguis et al. 2003). The conjecture is that a forward methanogenesis pathway (CO2 + H2 → CH4 + H2O) may have been altered or reversed to give rise to the remarkable anaerobic methane oxidation capabilities of methanotrophs.
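The two reactions in the paragraph above are written schematically and are not stoichiometrically balanced. For reference, the balanced overall forms can be written as follows; this is standard chemistry rather than a result of the study, and it ignores the coupling to a partner process (such as sulfate reduction) that makes the reverse direction energetically feasible in nature.

```latex
\begin{align*}
\mathrm{CH_4 + 2\,H_2O} &\rightarrow \mathrm{CO_2 + 4\,H_2}
  && \text{(anaerobic methane oxidation, the ``reverse'' direction)} \\
\mathrm{CO_2 + 4\,H_2} &\rightarrow \mathrm{CH_4 + 2\,H_2O}
  && \text{(methanogenesis from CO$_2$ and H$_2$, the ``forward'' direction)}
\end{align*}
```

Each side of either equation carries one carbon, eight hydrogen, and two oxygen atoms, so one reaction is exactly the reverse of the other.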
HUMAN DISTAL GUT COMMUNITIES (Gill, Pop et al. 2006)
The human gut hosts a large community of microbes that are an integral part of human physiology. These organisms may encode >100 times as many genes as the human genome itself. Random sequencing of DNA libraries created from fecal flora of two healthy humans generated ~78 Mbp of data. A comparison of this metagenomic data with that of the human genome and sequenced prokaryotic genomes revealed a relative abundance of pathways for methanogenesis, vitamin synthesis, and degradation of polysaccharides and xenobiotic compounds (e.g., plant glycans obtained through diet, chlorinated organic toxins). These findings have implications for human susceptibility to cancer, obesity, drug metabolism, etc.
SYMBIONT COMMUNITY FROM MARINE WORM (Woyke, Teeling et al. 2006)
Like humans, lower eukaryotes also enjoy symbiotic relationships with various prokaryotic microbial species. The metagenome of bacterial symbionts of a marine oligochaete worm (Olavius algarvensis) that has uniquely evolved to do without a mouth, gut, or excretory system was sequenced. The hypothesis is that these bacterial symbionts compensate for the loss of digestive and excretory systems in their host. About 204 Mbp of randomly sequenced data from small and large clone libraries allowed partial assembly of four symbiotic species. Reconstruction of their physiologies from the genomic data revealed extensive carbon-fixing capabilities, among others, that may fulfill their host's energy and waste management requirements.
A list of ongoing/completed metagenomics projects can be found at:
http://www.genomesonline.org/gold.cgi?want=Metagenomes
http://egg.umh.es/micromar/
http://rdp.cme.msu.edu/
What are the Challenges
Metagenomics is a burgeoning field with new challenges encountered at every step. The gamut of challenges runs from inefficiencies in sampling, DNA extraction methods, and construction of libraries to inadequacies in data analysis and visualization tools. Added to this are severe computational power and data storage constraints due to the huge amounts of genomic data flooding in from initiatives worldwide. For example, the global ocean sampling project alone will add greater than 8000 Mbp of genomic data, essentially doubling the current amount of genomic information for all prokaryotic species. Just a few of these challenges for metagenomics are described below. For detailed reviews of challenges in assembly and comparative metagenomics, see Chen and Pachter (2005), Tringe and Rubin (2005), and DeLong (2005).
Low abundance species overlooked
The high complexity environment of the Sargasso Sea, comprising ~1800 different species, was daunting in terms of metagenome assembly and analysis. Many current assembly programs are confounded by the large volume of complex, polymorphic metagenomic data, as are the annotation programs, which are designed for use on "closed" (completely assembled) microbial genomes. Assembly is also hampered by shallow sequence coverage resulting from failures to sample uniformly, particularly in high-complexity environments where relative abundance of individual species varies. Most of the sequences obtained may be from the most predominant species in the environment, while sequences from low-abundance species may go undetected. These low-abundance species may well play a critical role in the ecophysiology of the habitat.
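A back-of-the-envelope calculation shows why low-abundance species are so easily missed. The Python sketch below estimates the average sequencing depth one genome receives and, using the classic Lander-Waterman approximation, the fraction of that genome expected to be touched by at least one read. It assumes reads are sampled uniformly and in proportion to each organism's share of the community DNA, which real samples rarely satisfy; the function names and all the numbers in the example are hypothetical.

```python
import math


def expected_coverage(n_reads: int, read_len: int,
                      genome_size: int, dna_fraction: float) -> float:
    """Mean depth for one genome, given its fraction of the total community DNA."""
    return n_reads * read_len * dna_fraction / genome_size


def fraction_covered(coverage: float) -> float:
    """Lander-Waterman estimate of the fraction of bases hit by at least one read."""
    return 1.0 - math.exp(-coverage)


# Hypothetical survey: one million 800-nucleotide reads, 2 Mbp genomes.
dominant = expected_coverage(1_000_000, 800, 2_000_000, dna_fraction=0.5)
rare = expected_coverage(1_000_000, 800, 2_000_000, dna_fraction=0.001)
rare_fraction = fraction_covered(rare)  # roughly a third of the rare genome
```

Under these assumptions the dominant organism is sequenced to a depth in the hundreds, while an organism contributing 0.1% of the DNA gets well under 1x coverage, so most of its genome is never sampled and what is sampled cannot be assembled.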
Lack of reference genomes
Sometimes, assembly can be assisted by the availability of a pre-existing reference genome that can serve as a blueprint for piecing environmental genomic data together. Of course, such reference genomes are presently only available for a subset of cultured species, so assembling genomes of more divergent or novel species is not always facilitated. Finally, intraspecies heterogeneity or polymorphisms, or high levels of sequence conservation between phylogenetically unrelated genomes, all can confound the assembly software and result in false or chimeric assemblies.
Sequencing complex environments cost prohibitive
So, why not just sequence more data to improve coverage? Cost is the obvious rate-limiting step. However, new technologies like 454 and Solexa offer lower costs and other advantages compared to Sanger sequencing technologies. For example, 454 technology has eliminated the need for constructing clone libraries prior to sequencing (no cloning bias) and reduced sequencing bias (e.g., hard stops and hairpin structures are not a hindrance). However, one major current disadvantage of 454 sequencing is that much shorter sequence reads are generated (~100-150 nucleotides, as opposed to ~800 nucleotides for Sanger sequencing-based technologies), creating further challenges for assembly and annotation. Lack of paired-end sequences is also a drawback. Undoubtedly, as sequencing technologies and bioinformatics tools evolve, these challenges will dissipate.
Standardizing metadata
Metadata refers to the temporal, spatial, and physicochemical data associated with the sampling site from which organisms were derived for the metagenomics study. Typical examples are time, date, latitude, longitude, temperature, pH, salinity, etc. The purpose of making such metadata available is to enable correlation of deciphered ecology with the environmental conditions that may favor one population structure over another. Presently, there are no established standards for submission of metadata, and a Genomics Standards Consortium is soliciting opinions from the research community to define a minimal set of metadata required for every genomic and metagenomic project.
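As an illustration of what a structured metadata record might look like, here is a small Python sketch built from the example fields listed above. It is not a proposed or existing standard: the class name, field names, units, and the choice of which fields are mandatory are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class SampleMetadata:
    """Hypothetical minimal record describing where and when a sample was taken."""
    sample_id: str
    collected_at: datetime                  # date and time of sampling
    latitude_deg: float                     # decimal degrees, north positive
    longitude_deg: float                    # decimal degrees, east positive
    temperature_c: Optional[float] = None   # degrees Celsius, if measured
    ph: Optional[float] = None              # left unchecked: extreme sites can be very acidic
    salinity_psu: Optional[float] = None    # practical salinity units, if measured

    def __post_init__(self) -> None:
        # Only coordinates are validated; other fields are environment-dependent.
        if not -90.0 <= self.latitude_deg <= 90.0:
            raise ValueError(f"latitude out of range: {self.latitude_deg}")
        if not -180.0 <= self.longitude_deg <= 180.0:
            raise ValueError(f"longitude out of range: {self.longitude_deg}")


# Invented values for illustration only.
example = SampleMetadata(
    sample_id="demo-001",
    collected_at=datetime(2003, 2, 26, 10, 30),
    latitude_deg=31.5,
    longitude_deg=-63.6,
    temperature_c=20.0,
    salinity_psu=36.6,
)
```

Keeping every sample in the same typed structure makes it straightforward to compare community composition across sites by temperature, salinity, or location, which is the kind of correlation that standardized metadata is meant to enable.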
One of the chief goals of the CAMERA initiative is to provide bioinformatics resources freely to the scientific community in order to help expand the utility of metagenomic studies.
Recommended Reading
Beja, O., L. Aravind, et al. (2000). "Bacterial rhodopsin: evidence for a new type of phototrophy in the sea." Science 289(5486): 1902-6.
Breitbart, M., I. Hewson, et al. (2003). "Metagenomic analyses of an uncultured viral community from human feces." J Bacteriol 185(20): 6220-3.
Breitbart, M., P. Salamon, et al. (2002). "Genomic analysis of uncultured marine viral communities." Proc Natl Acad Sci U S A 99(22): 14250-5.
Chen, K., and L. Pachter (2005). "Bioinformatics for whole-genome shotgun sequencing of microbial communities." PLoS Comput Biol 1(2): 106-12.
DeLong, E.F. (2005). "Microbial community genomics in the ocean." Nat Rev Microbiol 3(6): 459-69.
Gill, S.R., M. Pop, et al. (2006). "Metagenomic analysis of the human distal gut microbiome." Science 312(5778): 1355-9.
Hallam, S.J., P.R. Girguis, et al. (2003). "Identification of methyl coenzyme M reductase A (mcrA) genes associated with methane-oxidizing archaea." Appl Environ Microbiol 69(9): 5483-91.
Hallam, S.J., N. Putnam, et al. (2004). "Reverse methanogenesis: testing the hypothesis with environmental genomics." Science 305(5689): 1457-62.
Handelsman, J., M. R. Rondon, et al. (1998). "Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products." Chem Biol 5(10): R245-9.
Olsen, G.J., D.J. Lane, et al. (1986). "Microbial ecology and evolution: a ribosomal RNA approach." Annu Rev Microbiol 40: 337-65.
Sabehi, G., O. Beja, et al. (2004). "Different SAR86 subgroups harbour divergent proteorhodopsins." Environ Microbiol 6(9): 903-10.
Sabehi, G., A. Loy, et al. (2005). "New insights into metabolic properties of marine bacteria encoding proteorhodopsins." PLoS Biol 3(8): e273.
Short, J. M. (1997). "Recombinant approaches for accessing biodiversity." Nat Biotechnol 15(13): 1322-3.
Tringe, S.G., and E.M. Rubin (2005). "Metagenomics: DNA sequencing of environmental samples." Nat Rev Genet 6(11): 805-14.
Tringe, S.G., C. von Mering, et al. (2005). "Comparative metagenomics of microbial communities." Science 308(5721): 554-7.
Tyson, G.W., J. Chapman, et al. (2004). "Community structure and metabolism through reconstruction of microbial genomes from the environment." Nature 428(6978): 37-43.
Tyson, G.W., I. Lo, et al. (2005). "Genome-directed isolation of the key nitrogen fixer Leptospirillum ferrodiazotrophum sp. nov. from an acidophilic microbial community." Appl Environ Microbiol 71(10): 6319-24.
Venter, J.C., K. Remington, et al. (2004). "Environmental genome shotgun sequencing of the Sargasso Sea." Science 304(5667): 66-74.
Worden, A.Z., M.L. Cuvelier, and D.H. Bartlett (2006). "In-depth analyses of marine microbial community genomics." Trends Microbiol 14(8): 331-6.
Woyke, T., H. Teeling, et al. (2006). "Symbiosis insights through metagenomic analysis of a microbial consortium." Nature 443(7114): 950-5.
Zengler, K., M. Walcher, et al. (2005). "High-throughput cultivation of microorganisms using microcapsules." Methods Enzymol 397: 124-30.
Glossary
BAC, bacterial artificial chromosome; a DNA vector derived from bacteria and used for cloning relatively large DNA fragments (150-kb average insert size) in Escherichia coli cells.
Biofilm, a complex aggregation of microorganisms marked by the secretion of a protective and adhesive matrix. Biofilms are also often characterized by surface attachment, structural heterogeneity, genetic diversity, complex community interactions, and an extracellular matrix of polymeric substances.
Coding sequence, open reading frame likely to code for a protein.
Extremophile, an organism that grows optimally in "extreme" conditions, including extreme temperature, pressure, pH, and ionic concentration.
Fosmid, a low-copy-number vector that is based on the Escherichia coli F-factor replicon. Cloned sequences (40 kb average insert size) are more stable in fosmids than in multi-copy vectors.
Methane Seep, oil and methane gas bubbling up from undersea sediment layers.
n-mers, short nucleotide subsequences of different lengths.
Phototrophy, ability of organisms to use light to generate energy for cellular activity, growth, and reproduction.
Phylotypes, bacterial 16S rDNA sequence (or phylogenetic) types determined by an arbitrarily chosen level of genetic identity.
Polymorphism, a naturally occurring variation in the sequence of genetic information on a segment of DNA among individuals.
16S rRNA, highly conserved ribosomal RNA component used for discerning evolutionary relationships among prokaryotic organisms. rRNAs are ancient molecules, functionally constant, universally distributed, and moderately well conserved across broad phylogenetic distances.
Syntrophic, metabolic relationship in which two (or more) species living together can utilize a substrate that neither could use by itself.
Xenobiotic, a chemical which is found in an organism but which is not normally produced or expected to be present in it.
Rekha Seshadri, PhD
(J. Craig Venter Institute)