optRF: Optimising random forest stability by determining the optimal number of trees.

Journal: BMC Bioinformatics

PMID: 40165065

Abstract

Machine learning is frequently used to make decisions based on big data. Among machine learning techniques, random forest is particularly prominent. Although random forest is known to have many advantages, one aspect that is often overlooked is that it is a non-deterministic method: it can produce different models from the same input data. This can have severe consequences for decision-making processes. In this study, we introduce a method to quantify the impact of non-determinism on predictions, on variable importance estimates, and on decisions based on either of them. Our findings demonstrate that increasing the number of trees in a random forest improves stability non-linearly, while computation time increases linearly. Consequently, we conclude that for any given data set there is an optimal number of trees that maximises stability without unnecessarily increasing computation time. Based on these findings, we have developed the R package optRF, which models the relationship between the number of trees and the stability of random forest and recommends the optimal number of trees for any given data set.
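The non-linear gain in stability described above can be illustrated with a small toy simulation. This is only a sketch under stated assumptions and is not the optRF method or its API: each "tree" is modelled as the true signal plus independent Gaussian noise, and stability is taken to be the mean pairwise Pearson correlation between the predictions of repeated runs. The function names, the noise level, and the number of repeats are all hypothetical choices made for illustration.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def toy_forest_predictions(signal, num_trees, rng, tree_noise=2.0):
    """Average num_trees noisy toy-tree predictions for every sample.

    Assumption: each toy tree predicts the true signal plus independent
    Gaussian noise, a stand-in for bootstrap and feature-sampling randomness.
    """
    return [
        sum(s + rng.gauss(0.0, tree_noise) for _ in range(num_trees)) / num_trees
        for s in signal
    ]


def stability(signal, num_trees, repeats=5, seed=0):
    """Mean pairwise correlation between predictions of repeated toy forests."""
    rng = random.Random(seed)
    runs = [toy_forest_predictions(signal, num_trees, rng) for _ in range(repeats)]
    pairs = [(i, j) for i in range(repeats) for j in range(i + 1, repeats)]
    return sum(pearson(runs[i], runs[j]) for i, j in pairs) / len(pairs)


# Simulated "true" values for 200 samples (hypothetical data).
data_rng = random.Random(42)
signal = [data_rng.gauss(0.0, 1.0) for _ in range(200)]

# The work per run grows linearly with the number of trees,
# while the stability gain flattens out.
results = {t: stability(signal, t) for t in (1, 10, 100, 1000)}
for t, s in results.items():
    print(f"trees={t:5d}  stability={s:.3f}")
```

In this toy model the expected correlation between two runs is roughly T / (T + 4) for T trees (signal variance 1, per-tree noise variance 4), so each tenfold increase in trees buys a smaller gain in stability while the amount of work grows tenfold. Real random forests have correlated trees and data-dependent noise, so the actual curve has to be estimated on the data set at hand.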

Authors

Thomas M Lange
Breeding Informatics Group, Georg-August University, Margarethe von Wrangell-Weg 7, 37075 Göttingen, Germany. thomas.lange@uni-goettingen.de

Mehmet Gültas
Center for Integrated Breeding Research (CiBreed), Georg-August University, Albrecht-Thaer-Weg 3, 37075 Göttingen, Germany.

Armin O Schmitt
Breeding Informatics Group, Georg-August University, Margarethe von Wrangell-Weg 7, 37075 Göttingen, Germany.

Felix Heinrich
Breeding Informatics Group, Department of Animal Sciences, Georg-August University, Margarethe von Wrangell-Weg 7, 37075 Göttingen, Germany.

Keywords

Algorithms; Big Data; Computational Biology; Machine Learning; Random Forest; Software

