Features extraction based on Naive Bayes algorithm and TF-IDF for news classification.

Journal: PloS one
Published Date:

Abstract

The rapid proliferation of online news demands robust automated classification systems to enhance information organization and personalized recommendation. Although traditional methods like TF-IDF with Naive Bayes provide foundational solutions, their limitations in capturing semantic nuances and handling real-time demands hinder practical applications. This study proposes a hybrid news classification framework that integrates classical machine learning with modern advances in NLP to address these challenges. Our methodology introduces three key innovations: (1) Domain-Specific Feature Engineering, combining tailored n-grams and entity-aware TF-IDF weighting to amplify discriminative terms; (2) BERT-Guided Feature Selection, leveraging distilled BERT to identify contextually important words and resolve rare-term ambiguities; and (3) Computationally Efficient Deployment, achieving 95.2% of the accuracy of BERT at 1/52.4th of the inference cost. Evaluated on a balanced corpus of Sina News articles in 11 categories, the system demonstrates a test precision of 95.12% (vs. 84.43% for SVM+TF-IDF baseline), with statistically significant improvements confirmed by 5-fold cross-validation(p < 0.01). The critical findings reveal strong performance in distinguishing semantically distinct categories, while exposing challenges in fine-grained differentiation. The efficiency of the framework (2.1 inference latency) and scalability (linear utilization of CPU resources) validate its practicality for real-world deployment. This work bridges the gap between traditional feature engineering and transformer-based models, offering a cost-effective solution for news platforms. Future research will explore hierarchical classification and the adaptation of dynamic topics to further refine semantic boundaries.

Authors

  • Li Zhang
    Department of Animal Nutrition and Feed Science, College of Animal Science and Technology, Huazhong Agricultural University, Wuhan 430070, China.