Analysis Challenges of Next-Generation Sequencing
Next-generation sequencing has significantly reduced the time and cost of
performing sequencing experiments. However, translating the massive volume
of next-generation sequencing data into meaningful results poses new
challenges. CAMDA2009 will mainly focus on the ChIP-seq assay. The
challenges of the ChIP-seq assay include:
- Experiment design: number of replicates;
- Quality assessment: sequencing bias; sequencing errors; batch effects
- Data management and visualization
- Sequence alignment: alignment speed; error
handling; multiple matches
- Preprocessing: normalization; filtering, peak
- Differential analysis: statistical models;
- Annotation: associate ChIP loci with genes;
functional analysis (biological pathways)
- Motif identification
To address these analysis challenges, CAMDA2009 adopts the approach of a
community-wide experiment, letting the scientific community analyze the
same public ChIP-seq dataset. The dataset is described in detail in the
following sections.
Data description (Summary)
Data type: ChIP-seq data
(Chromatin ImmunoPrecipitation followed by high-throughput tag sequencing)
Platform: GA II platform from Illumina / Solexa
1. The human transcription factor Pol II in HeLa S3
cells (the same protocol as used for the STAT1 samples except that the
cells were not stimulated using gamma-interferon prior to formaldehyde
2. Input DNA dataset for HeLa S3 cells
3. The human transcription factor STAT1 in HeLa S3
cells (The STAT1 ChIP was performed using HeLa S3 cells that are stimulated
4. Input DNA dataset for gamma-interferon stimulated
HeLa S3 cells
Reference: PeakSeq enables systematic scoring of ChIP-seq experiments
relative to controls, Nature Biotechnology (Vol 27-1, Jan 2009)
Download the dataset:
The dataset can be accessed at GEO (GSE12738). However, the GEO record
includes only the mapped sequence reads, not the raw sequence reads.
The paper's companion website provides more detailed information about the
data and includes both mapped and raw sequence data.
Summary of the samples
See the detailed sample information table (tab-separated text file).
Illumina Sequence File Format
Illumina sequence files are in FASTQ format, which combines the sequence
reads and their quality scores in one file. Each quality score is encoded
as a single ASCII character. In Solexa-derived FASTQ files, the score
equals the ASCII character value minus 64 (e.g., ASCII @ is decimal 64 and
corresponds to a Solexa quality score of 0). This differs from BioPerl,
for instance, which recovers quality scores by subtracting 33 from the
ASCII character value (so that ! with decimal value 33 encodes a score
of 0).
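The two offsets described above can be illustrated with a few lines of base R. This is only a sketch: the function name decode_quality and the quality strings are made up for illustration and are not part of the dataset or of any package.

```r
# Decode an ASCII-encoded quality string into integer scores.
# offset = 64 follows the Solexa-derived convention described above;
# offset = 33 follows the BioPerl convention.
decode_quality <- function(qual_string, offset = 64L) {
  utf8ToInt(qual_string) - offset
}

# Hypothetical quality strings, for illustration only
solexa_scores  <- decode_quality("@ABh")                 # offset 64
bioperl_scores <- decode_quality("!\"#I", offset = 33L)  # offset 33

print(solexa_scores)
print(bioperl_scores)
```

Decoding a file with the wrong offset shifts every score by 31, so it is worth checking which convention a file uses before interpreting the scores.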
The following are the first four lines of an example sequence file. The
first line starts with @, followed by the sequence identifier. The second
line is the sequence, the third line starts with +, and the fourth line is
the ASCII-encoded quality score of the sequence.
First 4 lines of an
example FASTQ sequence file:
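As a stand-in illustration of this four-line layout, the base R sketch below splits FASTQ lines into identifier, sequence and quality string. The record shown is made up (it is not taken from the dataset), and parse_fastq is a hypothetical helper name.

```r
# A made-up four-line FASTQ record (NOT taken from the dataset)
fastq_lines <- c(
  "@example_read_1",
  "GATTACAGATTACA",
  "+example_read_1",
  "hhhhhhhhggffee"
)

# Split a character vector of FASTQ lines into id, sequence and quality.
# Assumes every record occupies exactly four lines.
parse_fastq <- function(lines) {
  stopifnot(length(lines) > 0, length(lines) %% 4 == 0)
  idx <- seq(1, length(lines), by = 4)
  stopifnot(all(startsWith(lines[idx], "@")),
            all(startsWith(lines[idx + 2], "+")))
  data.frame(
    id       = substring(lines[idx], 2),  # drop the leading "@"
    sequence = lines[idx + 1],
    quality  = lines[idx + 3],
    stringsAsFactors = FALSE
  )
}

rec <- parse_fastq(fastq_lines)
# The sequence and its quality string must have the same length
same_length <- nchar(rec$sequence) == nchar(rec$quality)
print(rec)
```

For real data, the readFastq function from the ShortRead package is the more practical route; the sketch only shows how the four lines relate to each other.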
ELAND stands for Efficient Large-Scale Alignment of Nucleotide Databases.
ELAND searches a set of large DNA files for a large number of short DNA
reads, allowing up to 2 errors per match. Reads can be at most 32 bases
long; an upcoming release of Eland will support longer read lengths
(> 32 bp).
The columns from left to right are:
1. Sequence name
2. Sequence
3. Type of match. Codes are (from the Eland manual): NM (no match); QC (no
match due to quality control failure); RM (no match due to repeat masking);
U0 (best match was unique and exact); U1 (best match was unique, with 1
mismatch); U2 (best match was unique, with 2 mismatches); R0 (multiple
exact matches found); R1 (multiple 1 mismatch matches found, no exact
matches); R2 (multiple 2 mismatch matches found, no exact or 1-mismatch
matches).
4. Number of exact matches found
5. Number of 1-error matches found
6. Number of 2-error matches found
(For unique best match, U, only:)
7. Genome file in which match was found
8. Position of match
9. Direction of match (strand of match) (F, forward; R, reverse)
10. How N characters in read were interpreted (., not applicable; D,
deletion; I, insertion)
(For U1 and U2 matches only:)
11. Position and type of the first substitution error
(For U2 matches only:)
12. Position and type of the second substitution error
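To make the column layout concrete, the base R sketch below splits one result line into named fields. It assumes whitespace-separated lines with twelve columns, where the first two hold the read name and the read sequence; the field names, the helper functions and both example lines are made up for illustration and are not taken from the dataset or the Eland manual.

```r
# Hypothetical field names for the twelve columns
eland_fields <- c("name", "sequence", "match_type", "n_exact", "n_1error",
                  "n_2error", "genome_file", "position", "direction",
                  "n_interpretation", "subst1", "subst2")

# Parse one result line. Fields that are absent at the end of the line
# (for example for NM reads) are filled with NA.
parse_eland_line <- function(line) {
  parts <- strsplit(trimws(line), "[ \t]+")[[1]]
  length(parts) <- length(eland_fields)  # pad with NA (or truncate)
  as.list(setNames(parts, eland_fields))
}

# A read has a unique best match if its code is U0, U1 or U2
is_unique <- function(rec) rec$match_type %in% c("U0", "U1", "U2")

# Made-up example lines (NOT taken from the dataset)
hit <- parse_eland_line(
  ">read_1\tGATTACAGATTACAGATTACAGATTA\tU1\t0\t1\t0\tchr1.fa\t12345\tF\t..\t7A")
miss <- parse_eland_line(
  ">read_2\tGATTACAGATTACAGATTACAGATTA\tNM\t0\t0\t0")

print(is_unique(hit))
print(is_unique(miss))
```

Keeping only reads for which is_unique returns TRUE is one simple way to handle multiple matches, at the cost of discarding reads that fall in repeated regions.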
Sequence alignment tools
Maq, Eland, SOAP, RMAP,
SHRiMP and others
For comments on these analysis tools, see:
Data input using functions in the ShortRead Bioconductor package
Illumina sequence data in FASTQ format
Suppose sequenceDir is the directory containing the sequence data. The
function readFastq can read multiple sequence files at once. It uses the
parameter pattern, a grep-style pattern, to select files in the
“sequenceDir” directory. By default, it reads all files in the directory.
The variable name seqData below is arbitrary.
library(ShortRead)
seqData <- readFastq(sequenceDir, pattern="HeLa*")
# retrieve sequence data
sread(seqData)
# retrieve sequence quality
quality(seqData)
# coerce the quality scores to numeric
as(quality(seqData), "numeric")
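A note on file selection: the sketch below assumes that the pattern argument is matched the way list.files matches it in base R, that is, as a regular expression and not as a shell wildcard. Under that assumption, "HeLa*" selects more files than the wildcard reading suggests. The directory and file names are made up for illustration.

```r
# Create a few empty files with made-up names in a temporary directory
demoDir <- file.path(tempdir(), "pattern_demo")
dir.create(demoDir, showWarnings = FALSE)
created <- file.create(file.path(demoDir, c("HeLa_PolII.fastq",
                                            "input_HeLa.fastq",
                                            "HeL_other.fastq",
                                            "K562.fastq")))

# As a regular expression, "HeLa*" means "HeL" followed by zero or more
# "a" characters, anywhere in the file name
regex_hits <- list.files(demoDir, pattern = "HeLa*")

# glob2rx() converts the shell-style wildcard into an anchored regular
# expression, so only names that start with "HeLa" are selected
glob_hits <- list.files(demoDir, pattern = glob2rx("HeLa*"))

print(regex_hits)
print(glob_hits)
```

If only files whose names begin with HeLa are wanted, passing an anchored pattern such as the one returned by glob2rx avoids accidental matches.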
Mapped sequence data by Eland
We can use the “readAligned” function to read the mapped sequence files.
Suppose mappedSeqDir is the directory containing the mapped sequence data.
Like readFastq, readAligned can read multiple mapped sequence files at
once. To read the mapped sequences in our dataset, set the type parameter
to “SolexaResult”. See the help page of readAligned for other supported
file formats. The variable name mappedData below is arbitrary.
mappedData <- readAligned(mappedSeqDir, type="SolexaResult")
# show the slot names in the data
slotNames(mappedData)
"chromosome" "position" "strand"
"alignQuality" "alignData" "quality"