NIH Human Microbiome Project 2


This material provides a quick tour of much of the data available from the Human Microbiome Project, but it is not an exhaustive inventory of all data sets and analysis products. Many approximations and generalizations are made for the sake of intelligibility. It is also focused on the subset of data products that are likely to be both tractable and interesting for the average researcher.

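The HMP is generating large amounts of genomic and metagenomic sequence data. There are two primary portals for accessing data: NCBI, which holds most of the raw sequence data, and the DCC, which hosts value-added sequence data.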

Framework of Sequence data: Cohort type and Data type

One way of organizing much (though not all) of the metagenomic sequence data generated under the project is to split it by cohort type and data type.

There are two primary cohort types:

  1. Center "Healthy Cohort": This is a single cohort of 300 healthy individuals, each sampled at five major body sites (oral, airways, skin, gut, vagina) at up to three timepoints. Each body site comprised a number of body subsites, for a total of 15 to 18 samples per individual per timepoint.
  2. Demonstration Project "disease cohorts": These 15 projects each have one or more cohorts aimed at studying specific health conditions. Each project developed sampling, processing, and 16S or whole metagenome shotgun sequencing approaches suited to its condition of interest. These cohorts include both controls and affected individuals.

There are three primary data types:

  1. Reference microbial genomes: Most of these are not derived from specific cohorts
  2. Whole metagenome shotgun (mWGS) sequence
  3. 16S metagenomic sequence

The resulting division can be roughly represented by the following table:

| Data type | Center "Healthy Cohort" | Demonstration Project "disease cohorts" (NCBI BioProject 46305) |
|---|---|---|
| Reference microbial genomes (NCBI BioProject 28331) | ~1000 strains | Hundreds of strains |
| mWGS metagenomic sequence (NCBI BioProject 43017) | Subset of the 300 subjects, multiple timepoints, 15+ body sites | 5 projects, each with unique sampling sites, conditions, etc. |
| 16S metagenomic sequence (NCBI BioProject 48489) | 300 subjects, multiple timepoints, 15+ body sites | 14 projects, each with unique sampling sites, conditions, etc.; 4 projects contain both 16S and mWGS components |
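
For scripted access, the BioProject IDs from the table above can be kept in a small lookup. The following is a minimal sketch only: the dictionary keys and the function name are hypothetical labels chosen for illustration, and the URL pattern is the ordinary NCBI BioProject page address rather than anything specific to the HMP.

```python
# BioProject IDs as listed in the table above. The keys are hypothetical
# short labels introduced for this sketch, not official HMP names.
HMP_BIOPROJECTS = {
    "reference_genomes": 28331,
    "mwgs": 43017,
    "16s": 48489,
    "demonstration_projects": 46305,
}


def bioproject_url(key):
    """Return the NCBI BioProject page URL for one of the labels above."""
    return f"https://www.ncbi.nlm.nih.gov/bioproject/{HMP_BIOPROJECTS[key]}"


if __name__ == "__main__":
    for key in HMP_BIOPROJECTS:
        print(key, bioproject_url(key))
```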

There are other data types being generated under the project and many nuances even within this approximate organization. All of the sequence data listed above is openly available for download. To protect subject privacy, data has been filtered to remove contaminating human sequence.

Framework of Clinical Data

In addition to the generation of metagenomic sequence data (mWGS and/or 16S), information, or metadata, about the human subjects was also collected. To protect subject privacy, those data are available only through NCBI's dbGaP to qualified researchers. "Qualified researchers" are defined as PI-level investigators at legitimate institutions who can describe how they plan to use the data and can follow a series of precautions to safeguard patient privacy. Detailed information on accessing private data is available at the NCBI dbGaP site.

Only the following clinical metadata are available outside of dbGaP, directly embedded in the sequence file metadata:

  1. Unique subject ID
  2. Body site
  3. Sex (male/female)
  4. Visit number

No approval is necessary to access these data.

Accessing sequence data

Most of the raw sequence data reside at NCBI's Sequence Read Archive (SRA). The most straightforward way to identify all of the SRA data associated with a particular dataset is to enter through the BioProject pages referred to above. Each project-level BioProject page provides links to all associated SRA experiments (accession prefix: SRX). Alternately, it is also possible to begin in the SRA and search for all experiments that are linked to a given BioProject ID. Both processes can be performed manually through NCBI's website or by using E-utilities.
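
As a rough illustration of the E-utilities route, the sketch below builds an ESearch query against the SRA database for records linked to a BioProject and parses the JSON reply. It makes several assumptions: that the BioProject accession is the numeric ID prefixed with `PRJNA`, that the `[BioProject]` search field is the appropriate filter, and that the helper names (`build_sra_search_url`, `parse_esearch_ids`, `fetch_sra_ids`) are hypothetical names chosen here. ESearch returns internal SRA record IDs, not SRX accessions, so a follow-up ESummary or EFetch call would be needed to recover the accessions.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_sra_search_url(bioproject_id, retmax=100):
    """Build an ESearch URL for SRA records linked to a BioProject.

    Assumes the accession is 'PRJNA' plus the numeric BioProject ID.
    """
    params = {
        "db": "sra",
        "term": f"PRJNA{bioproject_id}[BioProject]",
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def parse_esearch_ids(payload):
    """Return (total_count, list_of_record_ids) from an ESearch JSON reply."""
    result = json.loads(payload)["esearchresult"]
    return int(result["count"]), list(result["idlist"])


def fetch_sra_ids(bioproject_id, retmax=100):
    """Query NCBI over the network and return the parsed ESearch result."""
    with urlopen(build_sra_search_url(bioproject_id, retmax)) as response:
        return parse_esearch_ids(response.read().decode("utf-8"))


if __name__ == "__main__":
    # 43017 is the mWGS BioProject ID listed earlier in this document.
    total, ids = fetch_sra_ids(43017, retmax=5)
    print(total, ids)
```

Keeping URL construction and response parsing separate from the network call makes both easy to check offline. For large result sets, the `retstart` parameter can be used to page through results.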

The DCC hosts value-added sequence data, with datasets representing numerous steps along common analysis paths. This is intended to allow researchers to begin analysis pipelines mid-stream, dedicating time to the areas they find most important.
