Data processing pipeline

We developed a semi-automated data processing pipeline specifically for transcription factor and histone modification ChIP-seq experiments. The pipeline can accommodate datasets from the public domain as well as private ones, and makes use of several freely available tools developed by the community. HAEMCODE aims to accommodate all relevant functional genomic data related to the haematopoietic system. At present we have focused our efforts on transcription factor and histone modification datasets; however, the interface has been designed to accommodate the DNase and RNA expression data that will appear in the near future. We welcome suggestions from users regarding the types of dataset that should be incorporated. Haematopoietic datasets were selected using a complementary approach combining literature searches with surveillance of the latest datasets released on the NCBI Gene Expression Omnibus.

Briefly, raw reads are converted to fastQ format using fastq-dump. Read quality is assessed with the fastQC program, and samples are discarded if quality is unsatisfactory. Our quality control checks form an iterative process. We run fastQC to detect adapter sequences; if adapters are present, we remove them using trimGalore and run fastQC again. We then run bowtie2 and check whether the overall alignment rate exceeds 60%. If not, we visually inspect the bigWig in conjunction with the control sample, which may lead to sample rejection if there are no clear TF binding peaks (see here for the list of discarded samples). The bowtie2 command for unpaired samples is run as follows:
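The iterative QC steps above can be sketched as follows. The tool names are real, but the file names and the bare invocations are illustrative assumptions, not the pipeline's verbatim commands:

```shell
# Illustrative QC loop: SRA archive -> fastQ -> quality/adapter check ->
# optional trimming -> re-check. Assumes sra-tools, FastQC and Trim Galore
# are installed; SRR000001 / GSM123 are placeholder accession names.
fastq-dump SRR000001.sra          # convert the raw SRA archive to fastQ
fastqc GSM123.fastq               # quality report; flags adapter content
trim_galore GSM123.fastq          # run only if fastQC reported adapters
fastqc GSM123_trimmed.fq          # confirm adapters were removed
```

If the second fastQC pass still flags problems, the sample is a candidate for rejection as described above.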

bowtie2 -k 2 -N 1 --mm -x genomeReference -U GSM123.fastq -S GSM123.sam -p 20

Similarly, for paired-end libraries:
bowtie2 -k 2 -N 1 --mm -x genomeReference -1 GSM123_1.fastq -2 GSM123_2.fastq -S GSM123.sam -p 20

The Bowtie2 options are described here. We keep only uniquely matching reads using a grep command that removes reads with a second-best scoring alignment from the SAM file.
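One common idiom for this filter relies on the XS:i: tag, which bowtie2 adds to reads that have a second-best alignment; excluding lines carrying that tag (header lines never do) keeps only uniquely aligned reads. The toy SAM file below and the exact grep pattern are assumptions for illustration, not the pipeline's verbatim command:

```shell
# Build a two-read toy SAM file: uniq_read aligns once, multi_read carries
# the XS:i: tag that bowtie2 adds when a second-best alignment exists.
printf '@HD\tVN:1.0\nuniq_read\t0\tchr1\t100\t42\t36M\t*\t0\t0\t*\t*\tAS:i:0\nmulti_read\t0\tchr1\t200\t1\t36M\t*\t0\t0\t*\t*\tAS:i:0\tXS:i:0\n' > toy.sam
# Drop every alignment that has a second-best score, keeping the header.
grep -v 'XS:i:' toy.sam > toy.unique.sam
cat toy.unique.sam   # header line plus uniq_read only
```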

Aligned reads (SAM files) are then converted to BED format, and all regions are extended to 200 bp to match the theoretical length of the cDNA fragments, taking strand orientation into account. The BED files are then used to generate bigWigs with genomeCoverageBed followed by bedGraphToBigWig (bedtools and UCSC tools).
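The conversion steps could look like the following sketch. These are assumed commands: the file names, the chromosome-sizes file and the 36 bp read length are illustrative, not taken from the pipeline itself:

```shell
READLEN=36                                   # illustrative read length
# SAM -> BAM -> BED (samtools and bedtools)
samtools view -bS GSM123.sam | bamToBed -i - > GSM123.bed
# Extend each read to 200 bp from its 5' end, respecting strand (-s),
# then sort by chromosome and start position for genomeCoverageBed
slopBed -i GSM123.bed -g genome.chrom.sizes -s -l 0 -r $((200 - READLEN)) \
    | sort -k1,1 -k2,2n > GSM123.ext.bed
# BED -> bedGraph -> bigWig (bedtools, then UCSC tools)
genomeCoverageBed -bg -i GSM123.ext.bed -g genome.chrom.sizes > GSM123.bedGraph
bedGraphToBigWig GSM123.bedGraph genome.chrom.sizes GSM123.bw
```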

If the experiment is part of a replicate set, the replicates are compared using wigCorrelate. If replicates agree at 80% or more, the individual fastQ files are merged and the pipeline is run again from the top. The final step consists of calling peaks with MACS2, producing BED files of peak coordinates. Peaks are called against a matched control sample (IgG, WCE...) from the same experiment when available. We use a p-value threshold adapted to each experiment, from 1e-4 to 1e-15, which we set by visually inspecting the ChIP-seq profiles. Briefly, we generate multiple peak files over a range of p-value thresholds for each experiment and visualize the peaks together with the bigWig density profile in the UCSC genome browser. UCSC sessions are sent for review to Professor Gottgens and Dr. Ruau, who check for over-calling or under-calling based on knowledge of the transcription factor studied. The need for expert curation and experiment-specific thresholds in TF ChIP-seq studies has been stated before by leaders in the field (see for example reviews by Dr Keji Zhao, such as Schones & Zhao (2008) Nature Reviews Genetics 9, 179-191). We do not use the q-value (FDR) parameter in MACS2, as TFs can sometimes bind very few regions and the FDR correction consistently labels those regions as non-significant.
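A sketch of the replicate comparison and peak-calling step is given below. The invocations are assumptions: accession names, the control file and the genome flag are placeholders, and the p-value shown is just one point in the 1e-4 to 1e-15 range described above:

```shell
# Compare replicate bigWigs; merging of the fastQ files proceeds only if
# the replicates agree at 80% or more
wigCorrelate GSM123_rep1.bw GSM123_rep2.bw
# Call peaks against a matched control at an experiment-specific p-value
# (-g mm assumes a mouse genome; -t/-c are treatment and control BAMs)
macs2 callpeak -t GSM123.bam -c GSM123_IgG.bam -n GSM123 -p 1e-5 -g mm
```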

The following figure summarizes our approach.