Arcady Mushegian
$665,188
Harvard University
Massachusetts
Biological Sciences (BIO)
The characterization of protein properties and functions is fundamental for the life sciences and bioengineering. Think of the spike protein that controls the entry of SARS-CoV-2 into human cells, the many proteins that control the growth and spread of cancer tumors, or the newly discovered enzymes that can convert plastic waste into usable proteins. One way to increase the number of functionally characterized proteins is to find similarities to proteins whose function is already known. The scientific community has long had several foundational and widely used technologies that compare proteins based on their amino acid sequences, a problem known as protein homology detection. Computational methods such as BLAST and HMMER, which find similarities among proteins in sequence databases, are used routinely by experimental biologists in all branches of the life sciences. Still, many proteins found in living cells remain functionally uncharacterized. This project aims to implement, within the HMMER software package, a computational method able to recognize biological sequence similarities that current methods cannot yet detect. By establishing homologies between protein families currently assumed to be unrelated, this method, which uses statistical models of sequence evolution, will provide hints to the function of many more proteins and open the door for them to become bioengineering targets. Graduate and undergraduate students will be trained in statistical bioinformatics in the course of this project.
Protein homology analysis now relies on profile searches much more than on pairwise sequence searches, but profile parameters are still typically calibrated only to the divergence of the observed sequences in the training alignment, not to the expected evolutionary distance of remote homologs. This project proposes to turn those currently fixed parameters into functions of evolutionary divergence. Turning the standard probabilistic methods of homology detection into divergence-parameterized models is novel and should improve sensitivity to very remote homologs. Substitution events have been modeled with PAM/BLOSUM matrices as well as with more explicit substitution models, but developing evolutionary models of insertion and deletion events has proven difficult. This project will build upon mathematical foundations for evolutionary models, developed earlier by this group and others, that are suitable for the standard profile methods (with insertions and deletions) used for homology detection. Those evolutionary models will be implemented in a next version of the HMMER software suite for remote homology detection and multiple sequence alignment. Extending the parameters of the homology methods into divergence regimes beyond what has been observed, guided by an evolutionary model, is expected to yield more sensitive homology detection. The method will also provide a natural tool for setting statistical bounds on the detectability of homologs and for identifying clade-specific genes, integrating homology and phylogeny into a single, more powerful detection tool.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.