OpenCoder is an open and reproducible code LLM family that includes 1.5B and 8B base and chat models, supporting both English and Chinese. Trained from scratch on 2.5 trillion tokens composed of 90% raw code and 10% code-related web data, OpenCoder reaches the performance of top-tier code LLMs. We provide not only model weights and inference code, but also the reproducible training data, the complete data processing pipeline, rigorous ablation results, and detailed training protocols. By empowering researchers to build and innovate, OpenCoder serves as an open foundation for advancing code AI.
- OpenCoder: A fully open-source code LLM built on a transparent data processing pipeline and a reproducible dataset, achieving top-tier performance on multiple code LLM evaluation benchmarks.
- RefineCode: A high-quality, reproducible code pretraining corpus comprising 960 billion tokens across 607 programming languages.
- Instructive Ablation Studies: A series of ablation experiments that provide practical insight into the design choices and training strategies of code LLMs.
- Released Resources: Final model weights, complete data processing pipeline, efficient evaluation pipeline, reproducible pretraining dataset, large-scale SFT dataset, and intermediate checkpoints.
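
The released chat models can be used with standard Hugging Face Transformers tooling. Below is a minimal inference sketch, assuming the 8B chat model is published under a Hub ID such as `infly/OpenCoder-8B-Instruct`; substitute the actual checkpoint name from the official release.

```python
# Minimal inference sketch (assumed Hub ID; replace with the released checkpoint).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "infly/OpenCoder-8B-Instruct"  # assumed model name, not confirmed here
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Format a single-turn chat prompt with the model's chat template.
messages = [{"role": "user", "content": "Write a quicksort function in Python."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate a completion and print only the newly generated tokens.
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```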