Cost-Efficient Domain-Adaptive Pretraining of Language Models for Optoelectronics Applications.

Journal: Journal of Chemical Information and Modeling
PMID:

Abstract

Pretrained language models have demonstrated strong capability and versatility in natural language processing (NLP) tasks, and they have important applications in optoelectronics research, such as data mining and topic modeling. Many language models have also been developed for other scientific domains, among which Bidirectional Encoder Representations from Transformers (BERT) is one of the most widely used architectures. We present three "optoelectronics-aware" BERT models, OE-BERT, OE-ALBERT, and OE-RoBERTa, that outperform both their general-English counterparts and larger models on a variety of optoelectronics-related NLP tasks. Our work also demonstrates the efficacy of a cost-effective domain-adaptive pretraining (DAPT) method applied to RoBERTa, which reduces the computational resources required for pretraining by more than 80% while maintaining or improving performance. All models and data sets are available to the optoelectronics-research community.
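To make the DAPT idea concrete, the sketch below shows continued masked-language-model training of a general RoBERTa checkpoint on a domain text corpus using the Hugging Face Transformers and Datasets libraries. This is an illustrative workflow consistent with standard DAPT practice, not the authors' exact pipeline; the corpus file name, batch size, step count, and learning rate are placeholder assumptions.

```python
# Minimal DAPT sketch: continue masked-language-model (MLM) pretraining of a
# general RoBERTa checkpoint on a domain corpus (here, optoelectronics text).
# Hyperparameters and the corpus path are illustrative, not from the paper.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Hypothetical plain-text corpus, one document (e.g., abstract or paragraph) per line.
corpus = load_dataset("text", data_files={"train": "optoelectronics_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic token masking (15%), as in standard RoBERTa-style MLM pretraining.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="oe-roberta-dapt",
    per_device_train_batch_size=16,  # illustrative; scale to available hardware
    max_steps=10_000,                # illustrative; far fewer steps than pretraining from scratch
    learning_rate=5e-5,
    save_steps=2_000,
    logging_steps=500,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```

Because training starts from an existing general-English checkpoint rather than from random weights, only a comparatively short continued-pretraining run on domain text is needed, which is the source of the cost savings that DAPT targets.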

Authors

  • Dingyun Huang
    Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.
  • Jacqueline M Cole
Cavendish Laboratory, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.