Use of profile hidden Markov models in viral discovery: current insights

Authors Reyes A, Alves JMP, Durham AM, Gruber A

Alejandro Reyes,1–3 João Marcelo P Alves,4 Alan Mitchell Durham,5 Arthur Gruber4

1Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia; 2Department of Pathology and Immunology, Center for Genome Sciences and Systems Biology, Washington University in Saint Louis, St Louis, MO, USA; 3Max Planck Tandem Group in Computational Biology, Universidad de los Andes, Bogotá, Colombia; 4Department of Parasitology, Institute of Biomedical Sciences, 5Department of Computer Science, Institute of Mathematics and Statistics, Universidade de São Paulo, São Paulo, Brazil

Abstract: Sequence similarity searches are the bioinformatic cornerstone of molecular sequence analysis for all domains of life. However, large amounts of divergence between organisms, such as those seen among viruses, can significantly hamper analyses. Profile hidden Markov models (profile HMMs) are among the most successful approaches for dealing with this problem and represent an invaluable tool for viral identification efforts. Profile HMMs are statistical models that convert information from a multiple sequence alignment into a set of probability values that reflect position-specific variation levels across all members of a group of evolutionarily related sequences. Because profile HMMs represent a wide spectrum of variation, they show higher sensitivity than conventional similarity methods such as BLAST for the detection of remote homologs. In recent years, there has been an effort to compile viral sequences from different viral taxonomic groups into integrated databases, such as the Prokaryotic Virus Orthologous Groups (pVOGs) database and the vFam database of profile HMMs, which provide functional annotation, multiple sequence alignments, and profile HMMs. Since these databases rely on viral sequences collected from GenBank and RefSeq, they suffer to varying extents from uneven taxonomic sampling, with low sequence representation of many viral groups, which affects the efficacy of the models. One of the interesting applications of viral profile HMMs is the detection and sequence reconstruction of specific viral genomes from metagenomic data. In fact, several DNA assembly programs that use profile HMMs as seeds have been developed to identify and build gene-sized assemblies or viral genome sequences of unrestrained length, using conventional and progressive assembly approaches, respectively. In this review, we address these aspects and cover some up-to-date information on viral genomics that should be considered in the choice of molecular markers for viral discovery. Finally, we propose a roadmap for rational development of viral profile HMMs and discuss the main challenges associated with this task.

Keywords: profile hidden Markov models, viral discovery, DNA assembly, metagenomic analysis, molecular markers, de novo diagnosis

