Scientists use machine learning techniques to better characterize long COVID


A research team supported by the National Institutes of Health has identified characteristics of people with long COVID and of those likely to have it. Using machine learning techniques, the scientists analyzed an unprecedented collection of electronic health records (EHRs) available for COVID-19 research to better identify who has long COVID. Exploring de-identified EHR data in the National COVID Cohort Collaborative (N3C), a centralized national public database run by NIH’s National Center for Advancing Translational Sciences (NCATS), the team found more than 100,000 likely long COVID cases as of October 2021 (as of May 2022, the count is more than 200,000). The findings appear in The Lancet Digital Health.

Long COVID is marked by many symptoms, including shortness of breath, fatigue, fever, headaches, “brain fog” and other neurological problems. These symptoms can last for several months or longer after an initial COVID-19 diagnosis. One reason long COVID is difficult to identify is that many of its symptoms are similar to those of other diseases and conditions. A better characterization of long COVID could lead to improved diagnoses and new therapeutic approaches.

“It made sense to leverage modern data analytics tools and a unique big data resource like N3C, where many features of long COVID can be represented.”

Emily Pfaff, Ph.D., co-author, clinical informatician, University of North Carolina at Chapel Hill

The N3C data enclave currently includes information representing more than 13 million people nationwide, including nearly 5 million COVID-19 positive cases. The resource allows for rapid research on emerging questions regarding COVID-19 vaccines, therapies, risk factors and health outcomes.

The new research is part of a larger, related trans-NIH initiative, Researching COVID to Enhance Recovery (RECOVER), which aims to improve understanding of the long-term effects of COVID-19, called post-acute sequelae of SARS-CoV-2 infection (PASC). RECOVER will accurately identify people with PASC and develop approaches for its prevention and treatment. The program will also answer critical research questions about the long-term effects of COVID through clinical trials, longitudinal observational studies, and more.

In the Lancet study, Pfaff, Melissa Haendel, Ph.D., of the University of Colorado Anschutz Medical Campus, and their colleagues examined patient demographics, health care use, diagnoses, and medications in the health records of 97,995 adult COVID-19 patients in N3C. They used this information, along with data on nearly 600 long COVID patients from three long COVID clinics, to create three machine learning models to identify long COVID patients.

In machine learning, scientists “train” computational methods to rapidly sift through large amounts of data to reveal new insights, in this case about long COVID. The models looked for patterns in the data that could help researchers both understand patient characteristics and better identify people with the condition.

The models focused on identifying potential long COVID patients among three groups in the N3C database: all COVID-19 patients, patients hospitalized with COVID-19, and patients who had COVID-19 but were not hospitalized. The models were found to be accurate, as people identified as at risk for long COVID were similar to patients seen at long COVID clinics. The machine learning systems classified about 100,000 patients in the N3C database whose profiles closely matched those of people with long COVID.
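As a rough illustration of the three-group setup described above, the following Python sketch partitions patient records into the three cohorts and fits one model per cohort. The record fields (`hospitalized`, `label`) and the helper names are hypothetical, and the majority-class baseline is only a stand-in for illustration; it is not the method used in the study.

```python
from collections import Counter


def split_cohorts(patients):
    """Partition records into the three groups named in the article."""
    return {
        "all": list(patients),
        "hospitalized": [p for p in patients if p["hospitalized"]],
        "non_hospitalized": [p for p in patients if not p["hospitalized"]],
    }


def fit_majority_baseline(cohort):
    """Stand-in 'model' (an assumption, not the study's method):
    always predicts the most common label seen in the cohort."""
    counts = Counter(p["label"] for p in cohort)
    majority = counts.most_common(1)[0][0]
    return lambda patient: majority


def fit_per_cohort(patients, fit=fit_majority_baseline):
    """Fit one model per non-empty cohort; `fit` can be any training function."""
    cohorts = split_cohorts(patients)
    return {name: fit(cohort) for name, cohort in cohorts.items() if cohort}


# Tiny synthetic example; label 1 = long COVID clinic patient, 0 = other.
example_patients = [
    {"hospitalized": True, "label": 1},
    {"hospitalized": True, "label": 1},
    {"hospitalized": False, "label": 0},
    {"hospitalized": False, "label": 0},
    {"hospitalized": False, "label": 0},
]
models = fit_per_cohort(example_patients)
```

Keeping the training function as a parameter means a real classifier could be swapped in for the baseline without changing how the cohorts are built.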

“Once you are able to determine who has long COVID from a large database of people, you can start asking questions about those people,” said Josh Fessel, M.D., Ph.D., senior clinical advisor at NCATS and a scientific program lead in RECOVER. “Was there something different about these people before they developed long COVID? Did they have certain risk factors? Was there something about the way they were being treated during acute COVID that might have increased or decreased their risk for long COVID?”

The models looked for common characteristics, including new medications, doctor visits and new symptoms, in patients with a positive COVID diagnosis who were at least 90 days out from their acute infection. The models identified patients as having long COVID if they went to a long COVID clinic, or if they showed long COVID symptoms and likely had the condition but had not been diagnosed.
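The 90-day window and the clinic-based labeling described above could be expressed along these lines. This is a minimal sketch under stated assumptions: the `Patient` fields and function names are hypothetical and do not correspond to N3C column names, and the labeling rule is a simplification of what the article describes.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# The 90-day threshold comes from the article; everything else is illustrative.
MIN_DAYS_SINCE_INFECTION = 90


@dataclass
class Patient:
    patient_id: str
    covid_index_date: date           # date of the positive COVID diagnosis
    visited_long_covid_clinic: bool  # hypothetical flag


def is_eligible(patient, as_of):
    """True if the patient is at least 90 days out from acute infection."""
    elapsed = as_of - patient.covid_index_date
    return elapsed >= timedelta(days=MIN_DAYS_SINCE_INFECTION)


def training_label(patient):
    """1 for known long COVID clinic patients, 0 otherwise (simplified)."""
    return 1 if patient.visited_long_covid_clinic else 0


def eligible_training_rows(patients, as_of):
    """Pair each eligible patient's ID with its training label."""
    return [
        (p.patient_id, training_label(p))
        for p in patients
        if is_eligible(p, as_of)
    ]
```

In this sketch, patients who are fewer than 90 days past their index date are simply left out, so they count neither as long COVID cases nor as comparison patients.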

“We want to incorporate the new patterns that we are seeing with the diagnosis code for COVID and include them in our models to try to improve their performance,” said Haendel of the University of Colorado. “The models can learn from a wider variety of patients and become more accurate. We hope we can use our long COVID patient classifier for clinical trial recruitment.”


Journal reference:

Pfaff, E.R., et al. (2022) Identifying Who Has Long COVID in the United States: A Machine Learning Approach Using N3C Data. The Lancet Digital Health.

