Automating Viral Genome Annotation and Quality Control for Viruses of Public Health Importance

Program Manager:

CATHERINE MARY FARRELL

Active Dates:

Sept. 27, 2025 -- Aug. 31, 2027

Awarded Amount:

$529,500

Investigator(s):

Alexander L Greninger

Awardee Organization:

UNIVERSITY OF WASHINGTON
Washington

Funding ICs:

National Library of Medicine (NLM)

Abstract:

The National Center for Biotechnology Information (NCBI) and International Nucleotide Sequence Database Collaboration (INSDC) databases have been cornerstones of public sharing of pathogen genomic data for basic science and epidemiological investigations. To ensure the integrity of sequence databases, all sequences must be validated and curated prior to deposition into GenBank. A major bottleneck in the rapid sharing of viral sequencing data is the requirement that gene and/or protein annotations be included, and sequences curated, prior to deposition in NCBI GenBank. Annotations are critical for cross-referencing other NCBI databases, while curation is required to ensure the accuracy and usability of the databases. However, correctly annotating sequences and performing appropriate quality control can be challenging for non-specialist submitters. Notably, this limitation is restricted to viral sequences, as prokaryotic and eukaryotic genome annotation and quality control have been automated via NCBI's Prokaryotic and Eukaryotic Genome Annotation Pipelines. Recently, NCBI has created an open-source viral annotation tool called VADR (Viral Annotation DefineR). VADR validates and annotates viral sequences using RefSeq-based models. VADR is currently limited to a select number of human viruses, including SARS-CoV-2, influenza virus, monkeypox virus, norovirus, dengue virus, and respiratory syncytial virus. The implementation of VADR has reduced manual reviews by NCBI indexers by >95% for these viruses, illustrating its critical role in prescreening submitted viral sequences. However, for most viruses relevant to clinical infectious diseases and public health, there is no way to rapidly submit unannotated consensus sequences to open databases. Here, we propose to accelerate the public sharing of viral sequences by building, validating, and implementing sustainable, open-source VADR models for human viruses relevant to public health.
Specifically, we will build, validate, and implement appropriate open-source models for VADR for respiratory viruses, viruses associated with vaccine-preventable diseases, and hepatitis viruses, taking advantage of the moderate numbers of viral sequences available for these viruses. Our group has significant knowledge of viral genome quality control, annotation, and submission, including the challenges associated with these steps as currently constructed in NCBI. Using viral genome data from other sequencing projects, we will validate the accuracy and sustainability of our newly generated VADR models on sequences not yet present in NCBI GenBank. Working together with NCBI, we will deploy these open-source models to automate analysis for GenBank by the end of the grant period. The proposed work will both ease submission and increase open genomic data for viruses of public health importance. This work will also ensure that we are ready for increasing amounts of routine public health sequencing and help overall preparedness for the next pandemic.