A Novel Machine Learning Pipeline for Accurate COVID-19 Prediction and Risk Factor Identification using Longitudinal Electronic Health Records
Alice Feng, Harker School, San Jose, California
COVID-19 is a respiratory disease that has caused a worldwide pandemic and strained healthcare systems around the globe. To better protect and treat higher-risk individuals and to relieve the burden on healthcare systems, it is crucial to identify risk factors and build accurate predictive models for COVID-19-related health outcomes. However, current predictive models focus only on the risk of mortality and rely on COVID-19-specific medical data, such as CT scans and lab test results, that become available only after a COVID-19 diagnosis.
To address these issues, we developed a supervised machine learning pipeline that uses veterans' Electronic Health Records (EHR) to accurately predict COVID-19-related health outcomes, including mortality, ventilation, and days in hospital or ICU. In particular, we developed a series of data processing algorithms covering data cleaning, vector representation, and initial feature screening. We then trained models using state-of-the-art machine learning strategies under different parameter settings. Our pipeline not only consistently outperformed those developed by other research groups using the same set of EHR, but also achieved accuracy similar to that of models trained on medical data available only after a COVID-19 diagnosis. In addition, we identified top risk factors for COVID-19, including age, diabetes, metabolic syndromes, heart disease, and kidney disease, which are consistent with epidemiologic findings. Because the pipeline is built on veterans' EHR, our results are especially relevant to veterans and help fill the gap in COVID-19 research for this population, which is at much higher risk of severe COVID-19 illness than the general population.
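As a rough illustration of what the vector representation and initial feature screening steps could look like, the sketch below turns toy longitudinal records into per-patient count vectors and drops rarely seen diagnosis codes. This summary does not describe the actual algorithms, so everything here is an assumption made for illustration: the record layout, the code names, the `min_patients` threshold, and the helper functions `build_vocabulary` and `to_count_vector` are all hypothetical.

```python
from collections import Counter

# Hypothetical toy data: patient id -> list of (date, diagnosis code) events.
# Real EHR data would be far larger and would need cleaning first.
records = {
    "p1": [("2019-01-03", "diabetes"), ("2019-06-11", "heart_disease")],
    "p2": [("2018-02-20", "diabetes"), ("2019-09-01", "diabetes")],
    "p3": [("2019-03-15", "asthma"), ("2019-04-02", "heart_disease")],
    "p4": [],  # a patient with no recorded diagnoses
}


def build_vocabulary(records, min_patients=2):
    """Assumed screening rule: keep codes seen in at least `min_patients` patients."""
    patient_counts = Counter()
    for events in records.values():
        # Use a set so each patient contributes at most once per code.
        patient_counts.update({code for _, code in events})
    return sorted(code for code, n in patient_counts.items() if n >= min_patients)


def to_count_vector(events, vocabulary):
    """Assumed representation: how often each retained code occurs in a history."""
    counts = Counter(code for _, code in events)
    return [counts[code] for code in vocabulary]


vocabulary = build_vocabulary(records)
features = {
    pid: to_count_vector(events, vocabulary) for pid, events in records.items()
}

print(vocabulary)
for pid, vector in features.items():
    print(pid, vector)
```

Vectors of this kind could then be passed to any supervised learner. Counting occurrences instead of recording only presence is one simple way to retain some of the longitudinal signal, although it discards the timing of events.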
This project demonstrated that longitudinal EHR data can be successfully employed to provide a holistic prediction of an individual’s health risk based on past health records, which is critical for controlling emerging infectious diseases such as COVID-19.