HAEMCODE

The HAEMCODE similarity analysis tool allows users to compare the peaks of selected ChIP-seq experiments with one another. We compute similarity using the Dice coefficient instead of the traditional Pearson correlation, for the reasons explained in the following paragraphs.

Binding events, or peaks, in a ChIP-seq experiment represent the interaction of a transcription factor (TF) protein with DNA in the genome. When comparing multiple experiments, we represent each peak profile as a binary vector and assemble the vectors into a matrix with experiments as columns and genomic regions as rows. If a TF binds a particular region, the event is encoded as 1, and as 0 otherwise. As the number of experiments in the matrix grows, the number of regions bound by a single transcription factor grows disproportionately compared to the number of common regions. Computing the pairwise Pearson correlation on such data yields coefficients that are mostly negative and close to zero, because of the overwhelming number of zeros in the matrix.
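The effect described above can be reproduced on toy data. The following is a minimal sketch, not the HAEMCODE implementation: the experiment names, the region identifiers, and the helper functions binary_matrix and pearson are all hypothetical and chosen only for illustration. It encodes two non-overlapping peak sets over 1000 regions and computes their Pearson correlation.

```python
from math import sqrt

def binary_matrix(peaks_by_experiment, regions):
    """Rows are genomic regions, columns are experiments (1 = bound, 0 = not bound).

    Hypothetical helper: peaks_by_experiment maps an experiment name to the
    set of region identifiers in which a peak was called.
    """
    names = sorted(peaks_by_experiment)
    rows = [[1 if r in peaks_by_experiment[n] else 0 for n in names]
            for r in regions]
    return names, rows

def pearson(x, y):
    """Plain Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

# Toy data (assumed): 1000 regions, two TFs binding 50 regions each, no overlap.
regions = list(range(1000))
peaks = {"TF_A": set(range(0, 50)), "TF_B": set(range(50, 100))}

names, rows = binary_matrix(peaks, regions)
col_a = [row[names.index("TF_A")] for row in rows]
col_b = [row[names.index("TF_B")] for row in rows]

r = pearson(col_a, col_b)
print(round(r, 4))  # small negative value, about -0.0526
```

Although the two profiles share no region at all, the coefficient is only slightly below zero, because the 900 regions bound by neither factor dominate the calculation.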

This observation led us to consider the meaning of negative correlation when dealing with ChIP-seq data. Here, negative correlation does not mean that binding profiles are opposite, but rather that the transcription factors bind different genomic coordinates. Hence the Pearson correlation carries little information in this context. We chose instead to look at the simple percentage of agreement between two experiments using the Dice coefficient. The Dice coefficient is designed to measure similarity between asymmetric binary vectors, meaning that we do not consider agreement (1-1) as having the same importance as disagreement (1-0 or 0-1), and regions bound in neither experiment (0-0) do not contribute to the score.
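For two binary vectors, the Dice coefficient is twice the number of regions bound in both experiments divided by the total number of bound regions in the two experiments. The sketch below illustrates that standard definition; it is not taken from the HAEMCODE code base, the function name dice and the toy vectors are hypothetical, and returning 0.0 when neither experiment has any peak is an assumed convention.

```python
def dice(x, y):
    """Dice coefficient between two equal-length binary vectors.

    2 * (number of 1-1 positions) / (number of 1s in x + number of 1s in y).
    Positions that are 0 in both vectors never enter the formula.
    """
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    total = sum(x) + sum(y)
    # Assumed convention: two empty profiles get a similarity of 0.0.
    return 2 * both / total if total else 0.0

# Toy profiles (assumed): 3 peaks and 3 peaks, 2 of them shared.
exp_1 = [1, 1, 1, 0, 0, 0, 0, 0]
exp_2 = [1, 1, 0, 1, 0, 0, 0, 0]

score = dice(exp_1, exp_2)
print(round(score, 3))  # 0.667

# Adding regions bound by neither experiment leaves the score unchanged.
padded = dice(exp_1 + [0] * 1000, exp_2 + [0] * 1000)
print(padded == score)  # True
```

Because shared zeros are ignored, the score stays stable as more experiments, and therefore more mostly empty rows, are added to the matrix.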

Similarity analysis