Regularizing transformers with deep probabilistic layers.

Journal: Neural networks : the official journal of the International Neural Network Society

Published Date: Apr 1, 2023

Abstract

Language models (LMs) have grown steadily over the last decade, from sequence-to-sequence architectures to attention-based Transformers. However, regularization has not been deeply studied in these architectures. In this work, we use a Gaussian Mixture Variational Autoencoder (GMVAE) as a regularizer layer. We study its advantages with respect to the depth at which it is placed and demonstrate its effectiveness in several scenarios. Experimental results show that including deep generative models within Transformer-based architectures such as BERT, RoBERTa, or XLM-R yields more versatile models that generalize better, achieve improved imputation scores on tasks such as SST-2 and TREC, and can even impute missing or noisy words with richer text.

Authors

Aurora Cobo Aguilera

Department of Signal Theory and Communications, Universidad Carlos III de Madrid, Avda. de la Universidad 30, 28911, Leganés, Madrid, Spain. Electronic address: acobo@tsc.uc3m.es.
Pablo M Olmos
Antonio Artes-Rodriguez
Fernando Pérez-Cruz

Swiss Data Science Institute (ETHZ/EPFL), Universitatstrasse 25, 8006, Zurich, Switzerland. Electronic address: fernando.perezcruz@sdsc.ethz.ch.

Keywords

Language; Natural Language Processing; Normal Distribution

External Resources

PubMed (PMID: 36812832)
