M.A. McClure, Dept. of Biol. Sci., Nevada Univ., Las Vegas, NV, USA
R. Raman, Dept. of Biol. Sci., Nevada Univ., Las Vegas, NV, USA
Complex genome analysis is the study of nucleic acid and protein sequences to further the understanding of the molecular evolutionary mechanisms and the frequency of events instrumental in the construction of genomes. The cornerstone of these studies is the multiple alignment of homologous sequences. To date, no method exists that can correctly identify the most conserved features of distantly related proteins without refinement by human pattern recognition skills. The recent application of hidden Markov model (HMM) approaches to the problem of multiple protein sequence alignment offers a new method of analysis. The quality of the alignment produced by an HMM depends on the quality of the model itself. We measure the quality of a model by the correspondence between the optimal model (the model with the highest average entropy/position) and the biologically informative model, which by definition is the one that captures a specific set of biological features common to a protein family. The studies reported here on the effect of model length and training set size suggest that both play a critical role in generating biologically informative HMMs.
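As an illustration of the model-quality score mentioned above, the following is a minimal sketch of an average entropy per position calculation. The abstract does not give the exact definition used in the paper, so this assumes one plausible reading: the Shannon entropy of each position's emission distribution, averaged over all model positions. The function names and the toy model are hypothetical and are not taken from the paper.

```python
import math


def position_entropy(probs):
    """Shannon entropy, in bits, of one position's emission distribution."""
    # Zero-probability symbols contribute nothing to the entropy.
    return -sum(p * math.log2(p) for p in probs if p > 0)


def average_entropy_per_position(emissions):
    """Mean Shannon entropy over all positions of a model.

    `emissions` is a list with one probability distribution per position
    (an assumed, simplified stand-in for HMM match-state emissions).
    """
    if not emissions:
        raise ValueError("model has no positions")
    return sum(position_entropy(p) for p in emissions) / len(emissions)


# Hypothetical 3-position model over a toy 4-letter alphabet.
toy_model = [
    [1.0, 0.0, 0.0, 0.0],      # fully conserved position
    [0.5, 0.5, 0.0, 0.0],      # two equally likely symbols
    [0.25, 0.25, 0.25, 0.25],  # uniform, uninformative position
]

print(average_entropy_per_position(toy_model))
```

Under this reading, a score like this could be compared across models of different lengths or trained on different numbers of sequences, which are the two parameters the abstract names.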
Index Terms:
genetics; DNA; hidden Markov models; biology computing; parameterization studies; highly divergent protein sequences; complex genome analysis; nucleic acid; protein sequences; molecular evolutionary mechanisms; homologous sequences; distantly related proteins; HMM approaches; multiple protein sequence alignment; optimal model; highest average entropy/position model; biologically informative model; biological features; protein family; model length; training set size; biologically informative HMMs
Citation:
M.A. McClure, R. Raman, "Parameterization studies of hidden Markov models representing highly divergent protein sequences," 28th Hawaii International Conference on System Sciences (HICSS'95), p. 184, 1995