Genome-wide mapping of DNase Hypersensitive sites using Massively Parallel Signature Sequencing (MPSS)
Gregory E. Crawford1, Ingeborg E. Holt1, James Whittle1, Bryn D. Webb1,
Denise Tai1, Sean Davis1, Elliott H. Margulies1, YiDong Chen1,
John A. Bernat2, David Ginsburg2, Daixing Zhou3, Shujun Luo3,
Thomas J. Vasicek3, Tyra G. Wolfsberg1, and Francis S. Collins1
1National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892
2University of Michigan, Department of Human Genetics, Ann Arbor, MI 48109
3Solexa, Inc., Hayward, CA 94545
Abstract
A major goal in genomics is to understand how genes are regulated in different tissues, stages of development, diseases, and species.
Mapping DNaseI hypersensitive (HS) sites within nuclear chromatin is a powerful and well-established method of identifying many different
types of regulatory elements, but in the past has been limited to analysis of single loci. We have recently described a protocol to
generate a genome-wide library of DNase HS sites. Here, we report high throughput analysis, using massively parallel signature sequencing
(MPSS), of 230,000 tags from a DNase library generated from quiescent human CD4+ T cells. Of the tags that uniquely map to the genome,
we identified 14,200 clusters of sequences that group within close proximity to each other. Using a real-time PCR strategy,
we determined that the majority of these clusters represent valid DNase HS sites. Approximately 80% of these DNase HS sites
uniquely map within one or more annotated regions of the genome believed to contain regulatory elements, including regions
2kb upstream of genes, CpG islands, and highly conserved sequences. Most DNase hypersensitive sites identified in CD4+ T cells are
also hypersensitive in CD8+ T cells, B cells, hepatocytes, human umbilical vein endothelial cells (HUVECs), and HeLa cells.
However, ~10% of the DNase HS sites are lymphocyte specific, indicating that this protocol can identify gene regulatory elements
that control cell type specificity. This strategy, which can be applied to any cell line or tissue, will enable a better understanding
of how chromatin structure dictates cell function and fate.
Note: The individual sequence files were updated on May 25, 2005. Before that date,
the files did not contain the complete data set. If you downloaded data prior to May 25, 2005,
please retrieve the data again to obtain the full list of coordinates. The DNase HS clusters
files were not affected.
Table Column Headers
Individual Sequences:
chr: chromosome
coord: coordinate of DNase sequence
strand: strand of DNase sequence
2kb_upstream: + indicates that the sequence falls within 2 kb upstream of an mRNA RefSeq
CpG_Island: + indicates that the sequence falls within a CpG Island
MCS: + indicates that the sequence falls within an MCS (multi-species conserved sequence)
DNase HS clusters:
chr: chromosome
start: first coordinate of cluster
stop: last coordinate of cluster
name: cluster identifier
count: number of DNase sequences in cluster
2kb_upstream: + indicates that the midpoint of cluster falls within 2 kb upstream of an mRNA RefSeq
CpG_Island: + indicates that the cluster region overlaps with a CpG Island
MCS: + indicates that the cluster region overlaps with an MCS (multi-species conserved sequence)
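The column layout of the DNase HS clusters file can be read into typed records along the following lines. This is a minimal sketch, not the authors' tooling: it assumes the file is tab-delimited with a header row that uses the column names listed above, and that any value other than "+" means no overlap. The names `ClusterRecord` and `parse_clusters` are hypothetical, and the example rows are made up for illustration.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class ClusterRecord:
    chrom: str
    start: int
    stop: int
    name: str
    count: int
    upstream_2kb: bool
    cpg_island: bool
    mcs: bool


def _is_plus(row, column):
    # Assumption: "+" marks an overlap; anything else (or a missing value) does not.
    return (row.get(column) or "").strip() == "+"


def parse_clusters(handle):
    """Yield ClusterRecord objects from a tab-delimited file with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        yield ClusterRecord(
            chrom=row["chr"],
            start=int(row["start"]),
            stop=int(row["stop"]),
            name=row["name"],
            count=int(row["count"]),
            upstream_2kb=_is_plus(row, "2kb_upstream"),
            cpg_island=_is_plus(row, "CpG_Island"),
            mcs=_is_plus(row, "MCS"),
        )


# Made-up rows for illustration only; these are not data from the study.
example = io.StringIO(
    "chr\tstart\tstop\tname\tcount\t2kb_upstream\tCpG_Island\tMCS\n"
    "chr1\t1000\t1350\t500bp_199_4\t4\t+\t+\t-\n"
    "chr2\t5000\t5200\t500bp_200_2\t2\t-\t-\t+\n"
)
records = list(parse_clusters(example))
```

If the downloaded files use a different delimiter or lack a header row, the `delimiter` argument and the column lookup would need to be adjusted accordingly.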
Description of DNase HS clusters
DNase HS clusters are multiple DNaseI library sequences that map within 500 bases of each other.
Each cluster has a unique identifier; the final number in each identifier gives the number of
sequences that map within that particular cluster. For example, 500bp_199_4 denotes a cluster
of 4 sequences (with the unique identifier 199) in which the sequences map less than 500 bp
from each other.
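The grouping rule and the identifier format can be illustrated with a short sketch. It is not the authors' implementation. It assumes that clustering is done per chromosome by chaining sorted coordinates, so that each sequence only needs to lie within the window of its nearest neighbor, and it treats the window boundary as inclusive; the page does not specify either point. The function names and the `max_gap` parameter are hypothetical.

```python
def cluster_coordinates(coords, max_gap=500):
    """Group coordinates from one chromosome into clusters.

    Assumption: neighbouring coordinates at most max_gap apart join the
    same cluster (single-linkage chaining over the sorted positions).
    """
    groups = []
    for pos in sorted(coords):
        if groups and pos - groups[-1][-1] <= max_gap:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    # Singlets are not clusters; keep only groups of two or more sequences.
    return [g for g in groups if len(g) >= 2]


def parse_cluster_name(name):
    """Split an identifier such as '500bp_199_4' into (window, id, count)."""
    window, cluster_id, count = name.split("_")
    return window, int(cluster_id), int(count)


# Made-up coordinates for illustration only.
clusters = cluster_coordinates([100, 450, 900, 5000, 9000, 9100])
parsed = parse_cluster_name("500bp_199_4")
```

With chaining, the first and last sequences of a cluster can be more than 500 bp apart (100 and 900 in the example), so a stricter all-pairs rule would produce smaller clusters.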
Genome Assembly
Coordinates were derived using the UCSC human genome assembly of July 2003 (hg16, NCBI Build 34).
Verification
A real-time PCR assay was used to verify which sequences represent valid DNaseI hypersensitive sites in CD4+ T cells.
Approximately 20% of individual sequences (singlets), 50% of clusters of 2 sequences, 80% of clusters
of 3 sequences, and 100% of clusters of 4 or more sequences were valid.
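The approximate validation rates above can be used to estimate how many entries in a given list are likely to be valid sites. The sketch below is a simple expected-value calculation under the assumption that the reported rates apply uniformly to every entry of a given size; the function names and the example counts are hypothetical and are not figures from the study.

```python
# Approximate validation rates reported above, keyed by sequences per entry.
VALIDATION_RATE = {1: 0.20, 2: 0.50, 3: 0.80}


def validation_rate(n_sequences):
    """Return the approximate fraction of valid sites for a given entry size."""
    if n_sequences < 1:
        raise ValueError("an entry must contain at least one sequence")
    # Clusters of 4 or more sequences were all valid in the real-time PCR assay.
    return VALIDATION_RATE.get(n_sequences, 1.0)


def expected_valid_sites(size_counts):
    """Estimate valid sites from a mapping of entry size -> number of entries."""
    return sum(validation_rate(size) * n for size, n in size_counts.items())


# Made-up counts for illustration only.
estimate = expected_valid_sites({1: 100, 2: 10, 3: 5, 4: 2})
```

The same lookup can serve as a per-cluster confidence weight when filtering the cluster file by its count column.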
* DNase HS clusters identify individual sequence coordinates that map within 500 bp of each other.
Send comments, suggestions, and problem reports to bioinformatics@nhgri.nih.gov