Topic modeling-based prediction of software defects and root cause using BERTopic, and multioutput classifier.
Journal:
Scientific Reports
Published Date:
Jul 14, 2025
Abstract
Software defects remain a major obstacle in software engineering, resulting in costly debugging and maintenance efforts. This study introduces a new angle on software defect prediction (SDP) that draws on advanced natural language processing (NLP) and machine learning (ML) techniques. The proposed methodology, BERT-MOC, combines BERTopic, a transformer-based topic modeling technique, with a multi-output classifier to predict both whether a defect is present and the root cause (reason) of the defect. BERTopic extracts the root cause from the textual descriptions of software defects, creating a meaningful representation of the software artifacts. These topic representations are then combined with the defect log dataset. A multi-output classifier is trained on the combined dataset to predict two outputs simultaneously: defect/not defect and defect root cause. Logistic Regression, Decision Tree, K-Neighbors, Random Forest, and a voting ensemble are each used as the base estimator of the multi-output classifier. The proposed model is evaluated using Hamming loss, accuracy, F1-score, precision, recall, and Jaccard similarity. The multi-output classifier with the voting ensemble as its estimator achieved the highest performance, with 97% accuracy and F1-score for predicting the root cause of the defect and 94% accuracy and F1-score for predicting whether a defect is present.
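The classification stage described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn, uses synthetic data in place of the defect log dataset, and omits the BERTopic step entirely, so the `root_cause` labels here merely stand in for topic ids. The `is_defect` rule, the feature counts, and all hyperparameters are hypothetical choices made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, jaccard_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the combined dataset (assumption: real features
# would come from the defect log plus BERTopic topic representations).
X, root_cause = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=3, random_state=0
)
# Hypothetical binary target derived from two features, for illustration only.
is_defect = (X[:, 0] + X[:, 1] > 0).astype(int)

# Two targets per sample: column 0 = defect / not defect, column 1 = root cause.
Y = np.column_stack([is_defect, root_cause])
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

# Hard-voting ensemble over the four base learners named in the abstract.
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="hard",
)

# MultiOutputClassifier fits one clone of the estimator per target column.
model = MultiOutputClassifier(voter)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

# scikit-learn's classification metrics do not accept multiclass-multioutput
# targets directly, so each output column is scored separately.
scores = {}
for col, name in enumerate(["is_defect", "root_cause"]):
    y_true, y_hat = Y_test[:, col], Y_pred[:, col]
    scores[name] = {
        "accuracy": accuracy_score(y_true, y_hat),
        "f1": f1_score(y_true, y_hat, average="weighted"),
        "jaccard": jaccard_score(y_true, y_hat, average="weighted"),
    }

# Hamming loss over both outputs: the fraction of individual labels predicted wrongly.
hamming = float(np.mean(Y_test != Y_pred))

for name, vals in scores.items():
    print(name, {k: round(v, 3) for k, v in vals.items()})
print("hamming loss:", round(hamming, 3))
```

Swapping `voter` for any single base learner gives the other configurations the abstract lists. The scores printed on synthetic data say nothing about the figures reported in the paper.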