EpiCoder: Encompassing Diversity and Complexity in Code Generation
Journal: arXiv
Published Date: Jan 8, 2025
Abstract
Existing methods for code generation use code snippets as seed data,
restricting the complexity and diversity of the synthesized data. In this
paper, we introduce a novel feature tree-based synthesis framework, which
revolves around hierarchical code features derived from high-level abstractions
of code. The feature tree is constructed from raw data and refined iteratively
to increase the quantity and diversity of the extracted features, so that the
tree captures more complex patterns and relationships within the
code. By adjusting the depth and breadth of the sampled subtrees, our framework
provides precise control over the complexity of the generated code, enabling
functionalities that range from function-level operations to multi-file
scenarios. We fine-tuned widely used base models to obtain the EpiCoder series,
achieving state-of-the-art performance on multiple benchmarks at both the
function and file levels. In particular, empirical evidence indicates that our
approach shows significant potential for synthesizing repository-level
code data. Our code and data are publicly available at
https://github.com/microsoft/EpiCoder.
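
To make the idea of depth- and breadth-controlled subtree sampling concrete, here is a minimal Python sketch. It is not the EpiCoder implementation: the abstract does not specify the sampling procedure, so the `FeatureNode` class, the `sample_subtree` function, its `max_depth` and `max_breadth` parameters, and the feature names in the toy tree are all hypothetical and chosen only for illustration. The sketch assumes that a smaller sampled subtree corresponds to a simpler generation task and a larger one to a more complex task.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class FeatureNode:
    """One node of a hypothetical feature tree: a feature name plus sub-features."""
    name: str
    children: List["FeatureNode"] = field(default_factory=list)


def sample_subtree(node, max_depth, max_breadth, rng):
    """Return a copy of `node` that keeps at most `max_breadth` randomly chosen
    children per node and descends at most `max_depth` levels below `node`."""
    if max_depth == 0 or not node.children:
        return FeatureNode(node.name)
    k = min(max_breadth, len(node.children))
    chosen = rng.sample(node.children, k)
    return FeatureNode(
        node.name,
        [sample_subtree(child, max_depth - 1, max_breadth, rng) for child in chosen],
    )


def depth(node):
    """Number of edges on the longest path from `node` down to a leaf."""
    return 0 if not node.children else 1 + max(depth(c) for c in node.children)


def count_nodes(node):
    """Total number of features in the tree rooted at `node`."""
    return 1 + sum(count_nodes(c) for c in node.children)


# Toy feature tree; the feature names are invented for this example.
tree = FeatureNode("code", [
    FeatureNode("data structures", [
        FeatureNode("hash map"), FeatureNode("heap"), FeatureNode("trie"),
    ]),
    FeatureNode("file operations", [FeatureNode("read"), FeatureNode("write")]),
    FeatureNode("networking", [
        FeatureNode("http client", [FeatureNode("retry logic")]),
    ]),
])

rng = random.Random(0)  # fixed seed so the sampled features are reproducible
simple = sample_subtree(tree, max_depth=1, max_breadth=1, rng=rng)
rich = sample_subtree(tree, max_depth=3, max_breadth=3, rng=rng)
```

Under these assumptions, `simple` contains the root and a single top-level feature, which might seed a function-level task, while `rich` retains the whole toy tree and could seed a broader, multi-component task. In a real pipeline the sampled features would then be passed to a model as the specification for the code to generate; that step is outside the scope of this sketch.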