Assessing the Impact of the Quality of Textual Data on Feature Representation and Machine Learning Models
Journal: arXiv
Published Date: Feb 12, 2025
Abstract
Background: Data collected in controlled settings typically results in
high-quality datasets. However, in real-world applications, the quality of data
collection is often compromised. It is well established that the quality of a
dataset significantly impacts the performance of machine learning models.
Methods: A rudimentary error rate metric was developed to evaluate the quality
of textual datasets at the token level. The Mixtral large language model (LLM)
was used to quantify and correct errors in low-quality datasets. The study
analyzed two healthcare datasets: the high-quality MIMIC-III public hospital
dataset and a lower-quality private dataset from Australian aged care homes
(ACH). Errors were systematically introduced into MIMIC at varying rates, while
the quality of the ACH dataset was improved using the LLM.
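The abstract does not give the formula for the error rate metric, so the following is only a minimal sketch under stated assumptions: the rate is taken to be the fraction of tokens flagged as erroneous, and errors are injected by swapping two adjacent characters in a chosen share of tokens. The function names `token_error_rate` and `inject_errors`, the `is_error` callback, and the character-swap corruption are all hypothetical stand-ins; the study itself used Mixtral both to detect and to introduce errors.

```python
import random


def token_error_rate(tokens, is_error):
    """Fraction of tokens flagged as erroneous (assumed definition, not the paper's)."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if is_error(t)) / len(tokens)


def inject_errors(tokens, rate, seed=0):
    """Corrupt about `rate` of the tokens by swapping two adjacent characters.

    Hypothetical stand-in for the LLM-based error injection described above.
    """
    rng = random.Random(seed)
    out = list(tokens)
    # Only tokens with at least two distinct characters can change under a swap.
    candidates = [i for i, t in enumerate(out) if len(set(t)) > 1]
    n_corrupt = min(len(candidates), round(rate * len(out)))
    for i in rng.sample(candidates, n_corrupt):
        t = out[i]
        # Choose a position whose neighbour differs, so the token really changes.
        swappable = [j for j in range(len(t) - 1) if t[j] != t[j + 1]]
        j = rng.choice(swappable)
        out[i] = t[:j] + t[j + 1] + t[j] + t[j + 2:]
    return out


# Illustrative note text and a toy vocabulary check; both are made up.
note = "patient denies chest pain and remains stable today after review".split()
vocabulary = set(note)


def not_in_vocabulary(token):
    return token not in vocabulary


noisy_note = inject_errors(note, rate=0.2)
observed_rate = token_error_rate(noisy_note, not_in_vocabulary)
```

A vocabulary lookup is a deliberately crude detector: it would flag valid medical terms that are missing from the vocabulary, which is the same kind of misclassification the Results attribute to medical terminology.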
Results: Mixtral was used to introduce errors into the MIMIC sample (35,774
patients) and to correct errors in the ACH sample (6,336 patients).
Mixtral correctly detected errors in 63% of progress notes, with
17% containing a single token misclassified due to medical terminology. LLMs
demonstrated potential for improving progress note quality by addressing
various errors. Under varying error rates, feature representation performance
was tolerant to lower error rates (<10%) but declined significantly at higher
rates.
Conclusions: The study revealed that models performed relatively well on
datasets with lower error rates (<10%), but their performance declined
significantly at error rates of 10% or higher. Therefore, it is crucial to
evaluate the quality of a dataset before utilizing it for machine learning
tasks. For datasets with higher error rates, implementing corrective measures
is essential to ensure the reliability and effectiveness of machine learning
models.