DeepScaleR Project Description
What is the project about?
DeepScaleR is an open-source project focused on democratizing reinforcement learning (RL) for large language models (LLMs). It aims to reproduce and improve upon the performance of models like DeepSeek-R1 and OpenAI's o1/o3, specifically in the context of mathematical reasoning. The project emphasizes full transparency by open-sourcing training scripts, models, datasets, and logs.
What problem does it solve?
The project addresses the challenge of applying and scaling RL to improve LLM performance, particularly on complex tasks like mathematical problem-solving, and aims to make these advanced techniques more accessible and reproducible. It also tackles the limited context length of LLMs by iteratively scaling the context length during RL training.
What are the features of the project?
- Open-Source RL for LLMs: Provides a complete, open-source implementation of RL training for LLMs.
- Reproducibility: Offers training scripts (with hyperparameters), models, datasets, and training/evaluation logs to reproduce the results.
- Scaled Context Length: Demonstrates a method for iteratively increasing the context length (8K -> 16K -> 24K) during RL training, allowing the model to consider more information (a sketch of this schedule follows this list).
- High Performance: The released DeepScaleR-1.5B-Preview model achieves state-of-the-art results on mathematical reasoning benchmarks, surpassing larger models in some cases.
- Detailed Documentation: Includes a blog post detailing the training recipe and insights.
- Easy to Use: Provides installation and usage instructions.
- Ablation Studies: Includes scripts for conducting ablation studies to analyze the impact of different parameters.
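The context-length schedule above can be pictured as a sequence of RL stages, each resuming from the previous stage's checkpoint. The sketch below is illustrative only: `train_rl` is a hypothetical placeholder for one training run (in practice, a verl training script with the given maximum length set via hyperparameters).

```python
# A minimal sketch of the staged context-length schedule (assumptions noted above).

CONTEXT_SCHEDULE = [8192, 16384, 24576]  # 8K -> 16K -> 24K tokens

def train_rl(checkpoint: str, max_response_length: int) -> str:
    """Hypothetical placeholder: run one RL stage, return the new checkpoint path."""
    print(f"Training {checkpoint} with max_response_length={max_response_length}")
    return f"{checkpoint}-len{max_response_length}"

checkpoint = "DeepSeek-R1-Distill-Qwen-1.5B"
for max_len in CONTEXT_SCHEDULE:
    # Each stage resumes from the previous one, so the model first learns to
    # reason concisely at 8K before longer generations are allowed.
    checkpoint = train_rl(checkpoint, max_len)
```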
What are the technologies used in the project?
- Python: The primary programming language.
- Reinforcement Learning (RL): Specifically, a modified version of DeepSeek's GRPO algorithm (a sketch of its group-relative advantage step follows this list).
- Large Language Models (LLMs): Builds upon the DeepSeek-R1-Distill-Qwen-1.5B model.
- Verl: A heavily modified fork of the Verl RLHF library is used for training.
- Hugging Face Transformers: Used for model and dataset hosting.
- vLLM: Used for efficient model inference during evaluation (an inference sketch follows this list).
- Ray: Used for distributed training across multiple GPUs and nodes.
- Wandb (Weights & Biases): Used for experiment tracking and logging.
- XFormers: Used as the attention backend.
- Parquet: Used as the file format for the dataset.
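To make the GRPO reference above concrete, here is a minimal sketch of the algorithm's core idea: sample a group of responses per prompt and normalize each response's reward against the group's own statistics, so no learned value network is needed. It shows only the advantage computation; the clipped policy-gradient loss, KL penalty, and DeepScaleR's modifications are omitted.

```python
import numpy as np

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: normalize rewards within one prompt's group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four sampled solutions to one math problem, scored 1 if the
# final answer is correct and 0 otherwise (an outcome-reward assumption).
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
# Correct solutions get positive advantage, incorrect ones negative.
```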
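And here is a minimal vLLM inference sketch for the evaluation setting, assuming the released checkpoint is pulled from Hugging Face under the agentica-org/DeepScaleR-1.5B-Preview id; the prompt and sampling settings are illustrative only.

```python
from vllm import LLM, SamplingParams

# Load the released checkpoint (model id assumed from the Hugging Face release).
llm = LLM(model="agentica-org/DeepScaleR-1.5B-Preview")
params = SamplingParams(temperature=0.6, max_tokens=8192)

prompts = ["Solve for x: 3x + 5 = 20. Show your reasoning step by step."]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```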
What are the benefits of the project?
- Democratizes RL for LLMs: Makes advanced RL techniques more accessible to the research community.
- Improved LLM Performance: Demonstrates significant improvements in LLM performance on mathematical reasoning tasks.
- Reproducible Research: Facilitates reproducible research and further development in the field.
- Scalability: Shows how to scale RL training to larger context lengths.
- Open Source: All resources are publicly available, fostering collaboration and innovation.
What are the use cases of the project?
- Mathematical Reasoning: Improving the ability of LLMs to solve complex mathematical problems.
- Automated Theorem Proving: Potentially contributing to automated theorem proving systems.
- Scientific Discovery: Assisting in scientific research by providing tools for reasoning and problem-solving.
- Educational Tools: Developing more powerful educational tools for mathematics and related fields.
- General RL Research: Serving as a platform for further research and development in RL for LLMs.
- Long-Context Tasks: Any task that benefits from LLMs being able to process and reason over longer contexts.
