----------------------
README File Accompanying GSE18199: 
Comprehensive mapping of long-range interactions reveals folding principles of the human genome.
----------------------

Several types of data accompany this submission.

1. Raw Hi-C Data (Illumina Paired-end reads)
2. Processed Data Files
3. Heatmaps
4. Eigenvector Files
5. 3D-FISH Data
6. Peano Curves
7. Globules
8. Additional ENCODE Data: DNAseI and ChIP-Seq
9. Supplemental Code

See below for more information about each data type.

Please Cite:

Lieberman-Aiden E*, van Berkum NL*, Williams L, Imakaev M et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 2009 Oct 9;326(5950):289-93. PMID: 19815776

----------------------
1. Raw Hi-C Data

Paired-end Illumina reads are provided in standard FASTQ format.

----------------------
2. Processed Data Files

Each processed data file lists paired-end alignments, with both ends of a read pair described in a single record.
 
Processed data file header:
read name, chromosome1, position1, strand1, restrictionfragment1, chromosome2, position2, strand2, restrictionfragment2 

chromosome1, position1, strand1, restrictionfragment1 correspond to the first paired end.

chromosome2, position2, strand2, restrictionfragment2 correspond to the second paired end.

"chromosome" nomenclature (autosomes are denoted by their usual numbers, 1-22):
0 Mitochondrial
23 X
24 Y

"position" is the position, in base pairs, where the alignment starts. The published alignments were created with maq-0.7.1, using the command:

maq map -e 150 <input>.map <output>.bfq

The command is used separately for the reads at each end. All alignments are based on the hg18 assembly of the human genome.

"strand" is Watson (denoted with a 0) or Crick (denoted with a 1).

"restrictionfragment" is a different way of numbering the position of a read. Instead of numbering which base pair the alignment starts at, each read is assigned to a restriction fragment of the chromosome.  Each restriction fragment on the chromosome is assigned a number indicating its position on the chromosome in the standard ordering. The restriction fragments used correspond to the restriction enzyme used in the experiment, for instance, HindIII or NcoI.
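
The record layout described above can be read with a short parser. The following is a minimal sketch, not the code used for the paper. It assumes the nine fields are separated by whitespace (the delimiter is not stated in this README) and that chromosomes 1-22 are denoted by their usual numbers. The names ReadEnd, parse_end, and parse_line are hypothetical, and the sample record is made up for illustration.

```python
from typing import NamedTuple

# Special chromosome codes listed in this README; any other code is
# assumed to be an autosome denoted by its own number.
CHROM_NAMES = {0: "M", 23: "X", 24: "Y"}


class ReadEnd(NamedTuple):
    chromosome: str
    position: int   # base pair where the alignment starts
    strand: str     # "Watson" (0) or "Crick" (1)
    fragment: int   # restriction fragment number on the chromosome


def parse_end(chrom, pos, strand, frag):
    """Convert the four text fields describing one end of a read pair."""
    code = int(chrom)
    name = CHROM_NAMES.get(code, str(code))
    strand_name = "Watson" if int(strand) == 0 else "Crick"
    return ReadEnd(name, int(pos), strand_name, int(frag))


def parse_line(line):
    """Parse one record, assuming nine whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    return fields[0], parse_end(*fields[1:5]), parse_end(*fields[5:9])


# Made-up record, for illustration only.
name, end1, end2 = parse_line("read_1 23 1500000 0 412 0 9000 1 3")
print(name, end1, end2)
```

If the files turn out to be comma- or tab-delimited, only the `line.split()` call needs to change.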

----------------------
3. Heatmaps

Heatmaps were generated by dividing the chromosome up into 100 Kb or 1 Mb windows, as indicated in the title of the file.
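
As an illustration of the windowing step, the sketch below bins the two positions of each intrachromosomal read pair into fixed-size windows and counts pairs per window pair. It is a minimal example under stated assumptions, not the code used to generate the published heatmaps: window indices are taken to be zero-based, the function names are hypothetical, and the positions are made up.

```python
def window_index(position, window_size):
    """Zero-based index of the window containing a base-pair position."""
    return position // window_size


def observed_matrix(pairs, n_windows, window_size=1_000_000):
    """Count read pairs per window pair on one chromosome (symmetric)."""
    counts = [[0] * n_windows for _ in range(n_windows)]
    for pos1, pos2 in pairs:
        i = window_index(pos1, window_size)
        j = window_index(pos2, window_size)
        counts[i][j] += 1
        if i != j:
            # Mirror off-diagonal entries so the matrix stays symmetric.
            counts[j][i] += 1
    return counts


# Made-up positions on a 3 Mb toy chromosome, binned at 1 Mb.
pairs = [(150_000, 2_400_000), (2_900_000, 300_000), (1_200_000, 1_700_000)]
obs = observed_matrix(pairs, n_windows=3)
for row in obs:
    print(row)
```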

There are four types of heatmaps, one for each of the kinds of data listed below. Headers are present in the first row (resp., column) of each heatmap to indicate the window corresponding to each row (resp., column).

(1) Observed number of reads.
(2) Expected number of reads.
(3) Observed/Expected ratio.
(4) Pearson correlation.
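
The relationship between the four kinds of data can be sketched for a single intrachromosomal matrix. This README does not state how the expected values were computed, so the example below makes an assumption: the expected count for a pair of windows is the average observed count over all window pairs at the same genomic separation. The Pearson matrix is then taken as the correlation between rows of the observed/expected matrix. Function names and the toy matrix are hypothetical; NumPy is assumed to be installed.

```python
import numpy as np


def expected_by_distance(observed):
    """Assumed model: mean observed count at each genomic separation."""
    n = observed.shape[0]
    expected = np.zeros_like(observed, dtype=float)
    for d in range(n):
        mean_d = np.diagonal(observed, offset=d).mean()
        idx = np.arange(n - d)
        expected[idx, idx + d] = mean_d
        expected[idx + d, idx] = mean_d
    return expected


def observed_over_expected(observed, expected):
    """Elementwise ratio, left at zero wherever the expected count is zero."""
    ratio = np.zeros_like(expected)
    np.divide(observed, expected, out=ratio, where=expected > 0)
    return ratio


# Made-up symmetric matrix of observed counts for three windows.
observed = np.array([[10, 4, 1],
                     [4, 12, 6],
                     [1, 6, 8]], dtype=float)
expected = expected_by_distance(observed)
ratio = observed_over_expected(observed, expected)
pearson = np.corrcoef(ratio)  # correlation between rows of the ratio matrix
print(expected, ratio, pearson, sep="\n\n")
```

The same ratio and correlation steps apply to interchromosomal matrices, but those need a different expected model, since genomic separation is not defined between chromosomes.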

Intrachromosomal heatmaps are available for every chromosome. Interchromosomal heatmaps are available for every pair of chromosomes. 

Visualizations of these heatmaps are available at the supplemental website, hic.umassmed.edu, which was created by Brian Lajoie.

----------------------
4. Eigenvector Files

The eigenvector files show the values of the eigenvectors which may be used to determine compartment identity. The first 3 principal components are provided.

The header is:

chromosome, windownumber, value of eigenvector

"chromosome" nomenclature is as above.

"windownumber" is the index of the window obtained by dividing the chromosome up into 100 Kb or 1 Mb windows, as indicated in the title of the file.

For GM12878 cells at 1 Mb resolution, the first eigenvector corresponds to the compartmental structure for all chromosomes except 4 and 5, where the second eigenvector is used. At 100 Kb resolution, the first eigenvector is used for all but chromosomes 2 and 10. For K562 cells, the first eigenvector is always used.
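
The selection rule above can be written as a small lookup. This is only a sketch of the rule as stated in this README: the function name and the resolution labels "1Mb" and "100Kb" are hypothetical, and None is returned for the two cases (GM12878, 100 Kb, chromosomes 2 and 10) where the README does not say which eigenvector to use.

```python
def compartment_eigenvector(cell_line, resolution, chromosome):
    """Return the 1-based eigenvector index, or None if it is not stated."""
    if cell_line == "K562":
        return 1
    if cell_line == "GM12878":
        if resolution == "1Mb":
            return 2 if chromosome in (4, 5) else 1
        if resolution == "100Kb":
            # The README names the exceptions but not their replacement.
            return None if chromosome in (2, 10) else 1
    raise ValueError(f"no rule for {cell_line} at {resolution}")


for chrom in (1, 4, 5):
    print(chrom, compartment_eigenvector("GM12878", "1Mb", chrom))
```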

----------------------
5. 3D-FISH Data

Four sets of pairwise distances, measured in microns, are provided, one set for each of the triads of genomic positions listed below. Each row corresponds to a given locus.

L1, L2, L3
L2, L3, L4
L5, L6, L7
L6, L7, L8

Data generated by Tobias Ragoczy, Agnes Telling, M. A. Bender, and Mark Groudine.

----------------------
6. Peano Curves

The exact trajectories of several iterations of various Peano (space-filling) curves are provided. These curves are:

(1) Hilbert Curve (Hilbert)
(2) Peano Curve (Peano)
(3) Symmetrized Peano Curve (PeanoSym)
(4) Quadratic Gosper Curve (qGosper)
(5) 3D Hilbert Curve (Hilbert3D)
(6) 3D Peano Curve (Peano3D)
(7) Randomized 3D Peano Curve (Peano3D_Random; ten randomly generated instances of this ensemble have been included.)

Each file contains a list of points (in two or three dimensions) which correspond to successive points on the various curves.
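
To show what a list of successive points on such a curve looks like, the sketch below generates one iteration of the 2D Hilbert curve with the standard index-to-coordinate conversion. It is an independent illustration, not the code that produced the provided files, and the starting corner and orientation of the curves in those files may differ. The function names are hypothetical.

```python
def hilbert_point(order, d):
    """Map index d along a Hilbert curve of the given order to (x, y)."""
    n = 2 ** order          # the curve fills an n x n grid
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Reflect the sub-square before swapping axes.
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_curve(order):
    """Successive points of the curve, in order along the trajectory."""
    return [hilbert_point(order, d) for d in range(4 ** order)]


first = hilbert_curve(1)
third = hilbert_curve(3)
print(first)
print(len(third))
```

Successive points are always one grid step apart, which is the property that makes the curve a continuous trajectory through every cell of the grid.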

----------------------
7. Globules

1000 sample fractal globules and 1000 sample equilibrium globules are provided. The *.dat files contain a header line specifying the length of the polymer, and additional lines specifying the starting point and ending point of each successive monomer. The chain is continuous; the endpoint of the preceding monomer is the starting point of the next monomer.
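
A reader for this layout might look like the sketch below. The exact field layout of the *.dat files is not given here, so the example makes two loud assumptions: the header line begins with a single integer, and every later line holds six whitespace-separated numbers (x, y, z of the starting point followed by x, y, z of the ending point). The function names and the sample lines are hypothetical.

```python
def parse_globule(lines):
    """Parse header and monomer lines under the assumed six-column layout."""
    length = int(lines[0].split()[0])
    segments = []
    for line in lines[1:]:
        if not line.strip():
            continue  # skip blank lines
        values = [float(v) for v in line.split()]
        if len(values) != 6:
            raise ValueError(f"expected 6 numbers, got {len(values)}")
        segments.append((tuple(values[:3]), tuple(values[3:])))
    return length, segments


def is_continuous(segments):
    """Check that each monomer starts where the preceding one ends."""
    return all(prev[1] == nxt[0] for prev, nxt in zip(segments, segments[1:]))


# Made-up three-monomer chain, for illustration only.
sample = [
    "3",
    "0 0 0 1 0 0",
    "1 0 0 1 1 0",
    "1 1 0 1 1 1",
]
length, segments = parse_globule(sample)
print(length, len(segments), is_continuous(segments))
```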

Globule data was generated by Maxim Imakaev.

----------------------
8. Additional ENCODE Data: DNAseI and ChIP-Seq

Additional DNAseI (Lab of John Stamatoyannopoulos) and ChIP-Seq (Lab of Bradley Bernstein) data tracks were used in the paper and are available from the UCSC browser (http://genome.ucsc.edu/).

DNAseI:
wgEncodeUwDnaseSeqRawSignalRep2K562
wgEncodeUwDnaseSeqRawSignalRep2Gm06990

For DNAseI, the wiggle files available at the UCSC Browser were used.

ChIP-Seq:
wgEncodeBroadChipSeqSignalGm12878H3k36me3
wgEncodeBroadChipSeqSignalGm12878H3k27me3

For ChIP-Seq, raw read counts were used. These may be directly downloaded at:
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/encodeDCC/wgEncodeBroadChipSeq/

----------------------
9. Supplemental Code

Two supplemental codesets are currently available.

(1) Supplemental Codeset II: Cumulative distance plots for 3D-FISH.
This code was used to construct Figures 3E, 3F, and S2.

(2) Supplemental Codeset III: Contact Probabilities on Peano Curves.
This code was used to construct Figures S13-S19.
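
For readers who want a feel for the quantity named in Codeset III before opening it, the sketch below estimates a contact probability along a curve given as a list of successive points. It is not the supplemental code. It assumes that two points are "in contact" when their Euclidean distance is at most a chosen threshold, and it reports the fraction of point pairs at a given separation along the curve that are in contact. The function name, the threshold, and the four-point example are hypothetical.

```python
import math


def contact_probability(points, separation, max_distance=1.0):
    """Fraction of pairs `separation` steps apart that lie within max_distance."""
    n_pairs = len(points) - separation
    if separation < 1 or n_pairs <= 0:
        raise ValueError("separation must be between 1 and len(points) - 1")
    contacts = sum(
        1
        for i in range(n_pairs)
        if math.dist(points[i], points[i + separation]) <= max_distance
    )
    return contacts / n_pairs


# Four successive points tracing three sides of a unit square.
square = [(0, 0), (0, 1), (1, 1), (1, 0)]
p1 = contact_probability(square, 1)
p2 = contact_probability(square, 2)
p3 = contact_probability(square, 3)
print(p1, p2, p3)
```

Applying the same function to the 3D curve files only requires three-coordinate points, since `math.dist` accepts points of any matching dimension.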