Analysis Challenges of Next-Generation Sequencing
Next-generation sequencing
has significantly reduced the time and cost of performing sequencing
experiments. However, translating the massive next-generation sequencing
data into meaningful results faces new challenges. CAMDA2009 will mainly
focus on the ChIP-seq assay. The challenges of ChIP-seq assay include:
- Experiment design: number of replicates;
sequencing depth
Quality assessment: sequencing bias; sequencing errors; batch effects
- Data management and visualization
- Sequence alignment: alignment speed; error
handling; multiple matches
- Preprocessing: normalization; filtering, peak
finding
- Differential analysis: statistical models;
multiple testing;
- Annotation: associate ChIP loci with genes;
functional analysis (biological pathways)
- Motif identification
To face these analysis
challenges, CAMDA2009 adoptes the approach of a community-wide experiment,
letting the scientific community analyze the same public ChIP-seq dataset.
The details of the dataset will be illustrated in the subsequent sections.
Data description (Summary)
Data type: ChIP-seq data
(Chromatin ImmunoPrecipitation following by high-thoughput tag sequencing)
Platform: GA II platform
from Illumina / Solexa
Samples:
1. The human transcription factor Pol II in HeLa S3
cells (the same protocol as used for the STAT1 samples except that the
cells were not stimulated using gamma-interferon prior to formaldehyde
fixation)
2. Input DNA dataset for HeLa S3 cells
3. The human transcription factor STAT1 in HeLa S3
cells (The STAT1 ChIP was performed using HeLa S3 cells that are stimulated
using gamma-interferon).
4. Input DNA dataset for gamma-interferon stimulated
HeLa S3 cells
Publication: PeakSeq
enables systematic scoring of ChIP-seq experiments relative to controls,
Nature Biotechnology: (Vol 27-1, Jan 2009)
Download the dataset:
The dataset can be
accessed at GEO
(GSE12738) . But it only includes the mapped sequence reads, and does
not include the raw sequence read.
The paper companion
website provides more detailed data information, and includes both
mapped and raw sequence data.
Summary of the samples
Download
detailed sample information table (Tab separated txt file).
Data formats
Illumina Sequence File Format
Illumina sequence files
are in FASTQ format.
It combines the sequence reads and quality score into one file. The quality
score is represented as an ascii character. The ascii character is
associated with the quality score. The encoding implicit in Solexa-derived
fastq files is that each character code corresponds to a score equal to the
ASCII character value minus 64 (e.g., ASCII @ is decimal 64, and
corresponds to a Solexa quality score of 0). It is different from BioPerl,
for instance, which recovers quality scores by subtracting 33 from the
ASCII character value (so that, for instance, !, with decimal value 33,
encodes value 0).
Following is the first
four lines of an example sequence file. The first line starts with @ and
followed by the sequence identifier. The second line is the sequence and
the fourth line is the ASCII encoded quality score of the sequence.
First 4 lines of an
example FASTQ sequence file:
@FC201WVA_20080307:6:1:85:638
GATAATACCTATTAAATAGTCCAAATTA
+FC201WVA_20080307:6:1:85:638
[[[[[[[[[[[[[[[[[[[[[[[[[YYY
Eland format
ELAND stands for
Efficient Large-Scale Alignment of Nucleotide Databases.
ELAND searches a set of
large DNA files for a large number of short DNA reads (≤ 32 bases, upcoming
new release of Eland will support longer read lengths (>32 bp)) allowing
up to 2 errors per match.
The columns from left
to right are:
1.
Sequence name
2.
Sequence
3.
Type of match. Codes
are (from the Eland manual): NM (no match); QC (no match due to quality
control failure); RM (no match due to repeat masking); U0 (best match was
unique and exact); U1 (best match was unique, with 1 mismatch); U2 (best
match was unique, with 2 mismatches); R0 (multiple exact matches found); R1
(multiple 1 mismatch matches found, no exact matches); R2 (multiple 2
mismatch matches found, no exact or 1-mismatch matches).
4.
Number of exact
matches found
5.
Number of 1-error
matches found
6.
Number of 2-error
matches found (For unique best match, U, only:)
7.
Genome file in which
match was found
8.
Position of match
9.
Direction of match
(strand of match) (F, forward; R, reverse)
10.
How N characters in
read were interpreted (., not applicable; D, deletion; I, insertion)
11.
Position and type of
the first substitution error (For U1 and U2 matches only:)
12. Position and type of the second substitution error
(For U2 matches only:)
Suggested Tools
Sequence alignment tools
Maq, Eland, SOAP, RMAP,
SHRiMP and others
See some comments of
analysis tools at:
http://massgenomics.wordpress.com/2008/05/14/short-read-aligners-maq-eland-and-others/
Analysis tools
ShortRead
(R/Bioconductor)
PeakSeq
CisGenome
Basic Tutorial
Data input using functions in ShortRead Bioconductor package
Input
Illumina sequence data in FASTQ format
#
suppose sequenceDir is the directory keeping the sequence data. Function
readFastq can input multiple sequence files together. It uses parameter
pattern to select the files in the “sequenceDir” directory. In default, it
will input all files in the directory.
rfq
<- readFastq(sequenceDir, pattern="HeLa*")
#
retrieve sequence data
sequenceRead
<- sread(rfq)
#
retrieve sequence data
sequenceID
<- id(rfq)
#
retrieve sequence quality
sequenceQuality
<- quality(rfq)
sequenceQuality[1]
#
Coerce the quality score as numeric
as(sequenceQuality[1],
'numeric')
Input
mapped sequence data by Eland
We
can use the “readAligned” function to input the mapped sequence files.
Suppose mappedSeqDir is the directory keeping the mapped sequence data.
Similar with function readFastq, function readAligned can also input
multiple mapped sequence files together. To input the mapped sequence in
our dataset, we can set the type parameter as “SolexaResult”. Please read
the help of readAligned for other supported file formats.
alignedSeq
<- readAligned(mappedSeqDir, type="SolexaResult" )
#
show the slot names in the data
slotNames(alignedSeq)
#
"chromosome" "position" "strand"
"alignQuality" "alignData" "quality"
"sread" "id"
|