Jeremy Epstein
$99,925
University of Texas at Dallas
Texas
Computer and Information Science and Engineering (CISE)
The COVID-19 pandemic has demonstrated that sharing data is critical to building better statistical epidemiological models, enabling policy decisions (in both the public and private sectors), and assuring the health of the public. Moreover, the situation has evolved quickly, indicating that data sharing needs to take place repeatedly and in a timely manner. To date, much of the data sharing that has taken place has focused on aggregate statistics (e.g., counts of events), yet some of the most important data is at the person level, which is critical to understanding how comorbidities influence health outcomes and to modeling the trajectory of the disease across time and space. This data is captured by a large number of service providers who wish to support these endeavors but are concerned that doing so will infringe upon the privacy rights of the corresponding individuals, particularly their anonymity. To enable timely, useful, and privacy-preserving releases of patient-specific COVID-19 data, this project aims to develop and disseminate novel privacy-risk assessment techniques, implemented in working software, to help data managers and public health officials reason about the tradeoffs between privacy risks (with a focus on re-identification, according to current law) and public data utility. The project will provide the best practices and tools needed for sharing patient-specific data about individuals diagnosed with, or suspected of having, COVID-19. This project will develop novel, dynamic privacy risk assessment models for disclosing data in support of epidemiological investigations (particularly pandemics) by considering evolving privacy risks and data utility.
In doing so, the proposed models will be tailored to enable the disclosure of geographically, demographically, and clinically relevant phenomena (e.g., health indications based on pharmaceutical prescriptions or purchases) by modeling a much richer data attribute space, specifically one that is important for modeling epidemiologic risk factors associated with biological agents, such as COVID-19. To model evolving privacy risks, the project will develop privacy risk estimation models that consider multiple types of potential re-identification attacks and the data redactions used to release multiple versions of the same data. Furthermore, the proposed models will be oriented to support utility functions that are specific to bio-surveillance efforts, including those which have emerged for COVID-19 modeling and response. Finally, to ensure that the proposed approach is widely accessible and reusable, the project will release an open source software tool that enables data custodians, particularly public health authorities, to make informed decisions that appropriately balance public health goals with personal privacy when sharing data. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.