The interplay between domain specialization and model size
Journal: arXiv
Published Date: Jan 3, 2025
Abstract
Scaling laws for language models have often focused on finding the optimal
model size and token count for training from scratch. However, achieving this
optimal balance requires significant compute resources due to the extensive
data demands of training models from randomly initialized weights. Continued
pretraining offers a cost-effective alternative, leveraging the compute
investment from pretrained models to incorporate new knowledge without
requiring extensive new data. Recent findings suggest that data quality
influences constants in scaling laws, thereby altering the optimal
parameter-token allocation ratio. Building on this insight, we investigate the
interplay between domain specialization and model size during continued
pretraining in compute-constrained scenarios. Our goal is to identify an
optimal training regime for this setting and to detect patterns in this
interplay that generalize across model sizes and domains. To compare
general and specialized training, we filtered a web-based dataset to extract
data from three domains: legal, medical, and accounting. We pretrained models
with 1.5B, 3B, 7B, and 14B parameters on both the unfiltered and filtered
datasets, then evaluated their performance on domain-specific exams. Results
show that as model size increases, specialized models outperform general models
while requiring less training compute. Additionally, their growing compute
efficiency leads to reduced forgetting of previously learned knowledge.
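
To make concrete the statement that scaling-law constants alter the optimal parameter-token allocation, the following is a minimal sketch. It is not the fitting procedure used in this work. It assumes a Chinchilla-style parametric loss L(N, D) = E + A / N^alpha + B / D^beta and the common approximation C = 6 * N * D for training compute. The function names and all numeric constants are illustrative assumptions rather than values reported here.

```python
def parametric_loss(n_params, n_tokens, E, A, B, alpha, beta):
    """Assumed loss form: irreducible term plus parameter and data terms."""
    return E + A / n_params**alpha + B / n_tokens**beta


def compute_optimal_allocation(compute_flops, A, B, alpha, beta):
    """Minimize A/N**alpha + B/D**beta subject to 6*N*D = compute_flops.

    Setting the derivative along the constraint to zero gives
    N_opt = G * (C/6)**(beta/(alpha+beta)) and
    D_opt = (C/6)**(alpha/(alpha+beta)) / G,
    with G = (alpha*A / (beta*B))**(1/(alpha+beta)).
    """
    g = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    scaled = compute_flops / 6.0
    n_opt = g * scaled ** (beta / (alpha + beta))
    d_opt = scaled ** (alpha / (alpha + beta)) / g
    return n_opt, d_opt


if __name__ == "__main__":
    budget = 1e21  # hypothetical compute budget in FLOPs
    # Illustrative constants only; B is varied to mimic a change in the data.
    for label, B in [("baseline data term", 410.7), ("smaller data term", 205.0)]:
        n, d = compute_optimal_allocation(budget, A=406.4, B=B, alpha=0.34, beta=0.28)
        print(f"{label}: N={n:.3e}, D={d:.3e}, tokens per parameter={d / n:.1f}")
```

Under these assumptions, lowering the data-term coefficient B shifts the optimum toward more parameters and fewer tokens at the same compute budget, which is the kind of shift in allocation ratio the abstract refers to.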