MolPipeline: A Python Package for Processing Molecules with RDKit in Scikit-learn.

Journal: Journal of Chemical Information and Modeling

PMID: 39288001

Abstract

The open-source package scikit-learn provides various machine learning algorithms and data processing tools, including the Pipeline class, which allows users to prepend custom data transformation steps to a machine learning model. We introduce the MolPipeline package, which extends this concept to cheminformatics by wrapping standard RDKit functionality, such as reading and writing SMILES strings or calculating molecular descriptors from a molecule object. We aimed to build an easy-to-use Python package for creating fully automated end-to-end pipelines that scale to large data sets. Particular emphasis was put on handling erroneous instances, whose resolution would require manual intervention in default pipelines. MolPipeline provides the building blocks for integrating common cheminformatics tasks, such as scaffold splits and molecular standardization, into scikit-learn's pipeline framework, making pipeline building easily adaptable to diverse project requirements.
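The error-handling idea in the abstract can be illustrated with a minimal sketch. It does not use MolPipeline, RDKit, or scikit-learn, and it is not the package's API: the names `Invalid`, `toy_parse`, `toy_descriptor`, and `run_pipeline` are hypothetical, and the toy parser only checks parentheses as a stand-in for real SMILES parsing. The sketch assumes one plausible strategy: a failed instance is replaced by a placeholder, skipped by later steps, and filled with a default value at the end so that outputs stay aligned with inputs.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Invalid:
    """Placeholder for an instance that failed at some step (hypothetical)."""
    step: str
    reason: str


def toy_parse(smiles: str) -> str:
    """Stand-in for a real SMILES parser: only checks parentheses balance."""
    if not smiles:
        raise ValueError("empty string")
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced parentheses")
    if depth != 0:
        raise ValueError("unbalanced parentheses")
    return smiles


def toy_descriptor(mol: str) -> float:
    """Toy descriptor: counts the letter C (not a chemically exact atom count)."""
    return float(mol.upper().count("C"))


def run_pipeline(
    inputs: Sequence[Any],
    steps: Sequence[Tuple[str, Callable[[Any], Any]]],
    fill_value: Any = None,
) -> Tuple[List[Any], Dict[int, Invalid]]:
    """Apply steps in order; failed instances become placeholders, not crashes."""
    values = list(inputs)
    for name, func in steps:
        next_values = []
        for value in values:
            if isinstance(value, Invalid):
                # Already failed earlier: pass the placeholder through untouched.
                next_values.append(value)
                continue
            try:
                next_values.append(func(value))
            except Exception as exc:
                next_values.append(Invalid(step=name, reason=str(exc)))
        values = next_values
    # Record which inputs failed, then fill their positions with a default.
    errors = {i: v for i, v in enumerate(values) if isinstance(v, Invalid)}
    outputs = [fill_value if isinstance(v, Invalid) else v for v in values]
    return outputs, errors


outputs, errors = run_pipeline(
    ["CCO", "C(C", "c1ccccc1"],
    [("parse", toy_parse), ("descriptor", toy_descriptor)],
)
print(outputs)       # one output per input, in the original order
print(errors[1])     # which step rejected the second input, and why
```

Keeping the output the same length as the input means labels and predictions stay aligned without manual filtering, which is the kind of intervention the abstract says default pipelines would otherwise require.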

Authors

Jochen Sieg

Universität Hamburg, ZBH - Center for Bioinformatics, Research Group for Computational Molecular Design, Bundesstraße 43, 20146 Hamburg, Germany.

Christian W Feldmann

BASF SE, Ludwigshafen, 67056, Germany.

Jennifer Hemmerich

University of Vienna, Department of Pharmaceutical Chemistry, Althanstr. 14, 1090, Vienna, Austria.

Conrad Stork

Faculty of Mathematics, Informatics and Natural Sciences, Department of Computer Science, Center for Bioinformatics, Universität Hamburg, Hamburg, 20146, Germany.

Frederik Sandfort

Westfälische Wilhelms-Universität Münster, Organisch-Chemisches Institut, Corrensstr. 40, 48149, Münster, Germany.

Philipp Eiden

BASF SE, Ludwigshafen 67063, Germany.

Miriam Mathea

BASF SE, Ludwigshafen 67063, Germany.

Keywords

Algorithms; Cheminformatics; Machine Learning; Programming Languages; Software

