The biological world is highly stochastic and inhomogeneous in its behavior. Transitions between homogeneous and inhomogeneous regions of DNA, also known as change points, carry important biological information. Our goal is to employ rigorous methods of information theory to quantify structural properties of DNA sequences. In particular, we adopt the Stein-Ziv lemma to find an asymptotically optimal discriminant function that determines whether two DNA segments are generated by the same source while ensuring an exponentially small probability of false positives. We then apply the Minimum Description Length (MDL) principle to select the parameters of our segmentation algorithm. Finally, we perform extensive experiments on human chromosome 9. After grouping A and G (purines) and T and C (pyrimidines), we discover change points between coding and noncoding regions as well as the beginning of a CpG island.
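The purine/pyrimidine grouping mentioned above reduces the four-letter DNA alphabet to two symbols before segmentation. A minimal sketch of that preprocessing step, assuming the common IUPAC codes R (purine) and Y (pyrimidine) as output symbols; the function name `to_purine_pyrimidine` is hypothetical and the paper's own encoding may differ:

```python
# Illustrative only: maps A/G to 'R' (purine) and T/C to 'Y' (pyrimidine).
# Any other character (for example 'N' for an unknown base) is left unchanged.
_GROUPING = str.maketrans("AGTCagtc", "RRYYRRYY")


def to_purine_pyrimidine(sequence: str) -> str:
    """Return the sequence rewritten over the two-symbol alphabet {R, Y}."""
    return sequence.translate(_GROUPING)


if __name__ == "__main__":
    print(to_purine_pyrimidine("ACGT"))
```

The reduced sequence can then be passed to whatever segmentation procedure is used; this sketch covers only the alphabet reduction, not the discriminant function or the MDL-based parameter selection.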
Citation:
Wojciech Szpankowski, Wenhui Ren, Lukasz Szpankowski, "An Optimal DNA Segmentation Based on the MDL Principle," IEEE Computer Society Bioinformatics Conference (CSB'03), p. 541, 2003.