The incremental value of unstructured data via natural language processing in machine learning-based COVID-19 mortality prediction: a comparative study.

BMC Medical Informatics and Decision Making

Research Authors: Rildo Pinto da Silva, Antonio Pazin-Filho

AIIM Authors: Abby Welker, Shiv Patel

Approved by President Reda Riffi

Publication Date: Sep 26, 2025

Comprehensive Summary

This study, conducted by da Silva and Pazin-Filho, tested whether combining unstructured data extracted from medical records with structured quantitative data improves machine learning models that predict in-hospital mortality among COVID-19 patients. The authors used two data sources: structured data, comprising 21 features from lab tests, monitoring data, and demographic variables, and unstructured data, taken from the “history of present illness” section of physician notes. From these notes, they used NLP models to extract clinical assertions (CAs), for example, “has_symptom affirmed dyspnea”. Combining the CAs with the structured features produced a hybrid dataset, and several machine learning algorithms were trained on both the structured-only and the hybrid data. Of 844 hospitalizations, 244 (about 29%) ended in death. Models trained only on structured data already performed strongly; for instance, random forest achieved a test AUC ROC of approximately 0.917. Adding unstructured features improved results slightly in some cases (random forest hybrid AUC ROC ≈ 0.926). However, the authors concluded that including NLP-extracted unstructured data did not significantly increase predictive power for COVID-19 in-hospital mortality compared with structured data alone. Although the difference was not statistically significant, they argue that hybrid models hold promise as NLP methods and model calibration improve.
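To make the idea of a hybrid dataset concrete, the sketch below shows one plausible way to append clinical assertions to structured features as 0/1 indicator columns, and how an AUC ROC value can be computed from labels and predicted risks. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the two structured features, the toy values, and the "negated" assertion string are all hypothetical, and the summary does not say how the study encoded CAs.

```python
from typing import Dict, List, Sequence


def build_vocabulary(assertions_per_stay: Sequence[Sequence[str]]) -> List[str]:
    """Collect every distinct clinical assertion, sorted for a stable column order."""
    return sorted({ca for stay in assertions_per_stay for ca in stay})


def hybrid_row(structured: Dict[str, float], assertions: Sequence[str],
               structured_keys: Sequence[str], vocabulary: Sequence[str]) -> List[float]:
    """Structured values first, then one 0/1 indicator per clinical assertion."""
    present = set(assertions)
    indicators = [1.0 if ca in present else 0.0 for ca in vocabulary]
    return [structured[key] for key in structured_keys] + indicators


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC ROC as the chance a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy stays; the feature names and values are illustrative only.
structured_keys = ["age", "spo2"]
stays = [
    ({"age": 71.0, "spo2": 88.0}, ["has_symptom affirmed dyspnea"]),
    ({"age": 45.0, "spo2": 97.0}, ["has_symptom negated dyspnea"]),
]
vocabulary = build_vocabulary([cas for _, cas in stays])
rows = [hybrid_row(s, cas, structured_keys, vocabulary) for s, cas in stays]
```

Comparing a structured-only model with a hybrid one then amounts to training the same algorithm on the first columns alone and on the full rows, and comparing the resulting `auc_roc` values on held-out data.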

Outcomes and Implications

This research is important because it combines structured clinical data with physicians' notes to build a more holistic picture of each patient, and it tests whether those unstructured notes add predictive value in a clinical setting. Although the hybrid approach did not significantly outperform the structured-only models, the same approach of assessing all available data could be extended to mortality risk prediction for other acute conditions. The work can also serve as a starting point for developing better methods of extracting information from unstructured data. In the future, implementing hybrid models on medical record data could help establish standardized methods for capturing and managing medical information.

Our mission is to

Connect medicine with AI innovation.

No spam. Only the latest AI breakthroughs, simplified and relevant to your field.
