Tajana S Rosing
University of California-San Diego
Biological Sciences (BIO)
As the COVID-19 pandemic spreads rapidly around the world, public health officials need to be able to answer questions such as “How is COVID-19 spreading through the population?” and “How many individual outbreaks exist within a given community?” With increasing access to sequencing technologies, scientists can analyze the genome sequences of collected SARS-CoV-2 viral samples to gain information that aids the development of vaccines and drugs, and to infer the most likely evolutionary history of the virus, which can help epidemiologists track its spread across populations. This evolutionary history is useful to epidemiologists only if it can be updated in real time, but as the volume of available data rapidly grows, scientists will require scalable computational tools to conduct these analyses. The goal of this project is to develop novel algorithms, software tools, and hardware systems that will scale to the massive amounts of data being generated in this pandemic, which will in turn aid phylogenomic analysis of the virus, effective tracking of its spread, and the development of novel vaccines and drugs. As a broader impact, this project will help with replicability and reproducibility of genetic and epidemiological research results. Furthermore, the existence of such a system will aid in fighting future viral outbreaks. This project provides professional development opportunities for an early career scientist.
The standard viral phylogenetic inference workflow consists of quality checking and filtering, multiple sequence alignment, phylogenetic inference, phylogenetic rooting, phylogenetic dating, and transmission clustering. The researchers have identified multiple sequence alignment and phylogenetic inference as the computational bottlenecks of the workflow: both scale poorly as a function of the number of input sequences.
The objective of this project is the development of a user-friendly, scalable, and modular workflow for conducting real-time computational phylogenetic analysis of assembled viral genomes, with a primary focus on SARS-CoV-2. The proposed solution includes: (1) the development of a novel software tool for orchestrating the automated end-to-end workflow, (2) the development of novel algorithms (and software implementations of these algorithms) to speed up the computational bottlenecks of the workflow, (3) the development of novel hardware systems for accelerating the workflow, and (4) a real-time, publicly accessible repository in which researchers can access the most up-to-date analysis results (with intermediate files) for all currently available SARS-CoV-2 genomes, to prevent repeated computation. The analysis infrastructure built in this project will be broadly applicable to any viral pathogen for which phylogenetic inference is biologically and epidemiologically meaningful.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.