$401,943
National Institute of Environmental Health Sciences
North Carolina
National Institute of Environmental Health Sciences (NIEHS)
Cell lineage mapping and state transition modeling during normal or disease progression using time-course single-cell data sets remains a challenge. First, current models generally assume continuity of the state space, whereas a process like spermatogenesis has been shown to comprise an initial discrete process followed by continuous transitions. Second, most trajectory models use prior knowledge, in the form of differentiation markers and cell of origin, to drive the developmental reconstruction, and these assumptions may not hold. Third, most single-cell time-dependent data are collected from developmental staging studies, and studying developmental stages ignores variation associated with phenotypic plasticity. An important developmental process that exhibits such a complex phenotype, characterized by reversible biological processes, is the Epithelial Mesenchymal Transition (EMT). To overcome these limitations, we propose a new computational framework called Dynamic Spanning Forests (DSF). DSF is a fast and scalable computational framework that takes as input temporal or staging single-cell data collected at discrete time points and outputs a mixture of dependent trees. DSF uses binary decision-tree models to select statistically significant features associated with multimodality and skewness of the marginal distributions, as well as with the underlying dynamic cell-to-cell variation. The selected features are then used to connect all cells with a minimum spanning tree, which is then broken up into a minimum spanning forest based on the tree motifs. The subtrees of the minimum spanning forest are derived by combining tree agglomerative hierarchical clustering (TAHC) with a dynamic branch-cutting method based on the shape of the underlying dendrogram. We also show how the DSF algorithm can be combined with correlation search engines for new scientific discoveries.
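The step from a minimum spanning tree to a spanning forest can be illustrated with a minimal Python sketch. This is not the DSF implementation: it skips the decision-tree feature selection, and it replaces TAHC and dynamic branch cutting with a much simpler stand-in that removes the heaviest tree edges. The function name `spanning_forest` and the parameter `n_trees` are hypothetical, and Euclidean distance between cells is an assumption.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform


def spanning_forest(X, n_trees):
    """Split an MST over cells into `n_trees` subtrees (illustrative only).

    X is a (cells x selected features) array. Assumes distinct cells, because
    SciPy treats zero distances as missing edges. Cutting the heaviest edges
    is a simple stand-in for TAHC with dynamic branch cutting.
    """
    if n_trees < 1:
        raise ValueError("n_trees must be at least 1")

    # Pairwise Euclidean distances between cells (assumed metric).
    dist = squareform(pdist(X, metric="euclidean"))

    # Minimum spanning tree connecting all cells: n_cells - 1 edges.
    mst = minimum_spanning_tree(dist).tocoo()

    # Removing the (n_trees - 1) heaviest edges of a tree leaves n_trees subtrees.
    order = np.argsort(mst.data)
    keep = order[: len(order) - (n_trees - 1)]
    forest = coo_matrix(
        (mst.data[keep], (mst.row[keep], mst.col[keep])), shape=mst.shape
    )

    # Each connected component of the forest is one subtree.
    _, labels = connected_components(forest, directed=False)
    return forest, labels
```

Calling `spanning_forest(X, n_trees=2)` on a matrix of cells returns the remaining edges as a sparse matrix and one subtree label per cell. In DSF the number and shape of subtrees come from the dendrogram rather than from a fixed `n_trees`.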
A key finding from our study is that the normalized marginal expression of genes during a given biological process exhibits a very high proportion of non-uniform distribution shapes, most of them skewed to the right, and that multimodal genes characterize major steady states during development. This finding challenges the inferences from most current statistical methods used in single-cell analysis, which are driven by averages and unimodality. Furthermore, DSF can visualize and characterize complex relationships between spanning trees or forests and the underlying unknown clusters in weighted directed biological networks derived from longitudinal or staging data. No other current approach properly leverages the power of such study designs. Our work so far motivates the continued need for better clustering and ordering algorithms for biological systems that take into account heterogeneous and dynamic tree topologies. We applied the DSF framework to visualize, test and characterize nested, intermediate and simultaneous dynamic lineages in a number of important real biological applications, including EMT, spermatogenesis, induced pluripotent stem cell (iPSC) reprogramming, early hormonal transcriptional response and COVID-19 immune response. For example, in the EMT study, DSF identified additional time-dependent markers (e.g., phosphorylated retinoblastoma (pRb), FAP, TROP2, Keratin-7 and CD45) that were not included in the original EMT map by Karacosta et al. (2019), and automatically identified a forest with EMT and MET trees. In spermatogenesis, forcing several asynchronous cell types into one continuous tree has made it challenging to identify, visualize and characterize individual early, intermediate and late lineages. DSF identified a complex microenvironment involving communication between different germ and somatic cell types, associated with diverse and nested differentiation trajectories during spermatogenesis.
DSF also identified a global dynamic branching trajectory during chemically induced pluripotent stem cell (CiPSC) reprogramming with 3 terminal branching differentiation lineages, instead of the 2 captured in Zhao et al. (2018) by Monocle, the gold standard for scRNA-seq data. The additional terminal branch is enriched with genes such as CrxOS, which has been shown to maintain the self-renewal capacity of murine embryonic stem cells. Finally, an application of DSF to the immune response to coronavirus disease (COVID-19) produced a clear visualization of patient-specific healthy-to-disease cellular progression as individual trees in a forest. This project involves research on human coronavirus, novel coronavirus, COVID-19, Severe Acute Respiratory Syndrome coronavirus disease, SARS coronavirus, SARS-coronavirus-2, SARS-cov-2, SARS-cov2, SARS-related coronavirus 2, Severe acute respiratory syndrome coronavirus 2, SARS-Associated Coronavirus, SARS-cov, or SARS-Related Coronavirus.