Scaling All-Sync RL with DAPO and LoRA

Introduction

As Large Language Models (LLMs) evolve beyond pre-training toward experiential learning, Reinforcement Learning (RL) has emerged as the key to unlocking advanced reasoning capabilities, as demonstrated by models like DeepSeek-R1. However, the computational barriers are immense: RL training of a 671B-parameter model typically requires up to 512 H800 GPUs, putting cutting-edge RL research out of reach for most teams.

This work breaks through these limitations by combining Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) with Low-Rank Adaptation (LoRA) in a novel All-Sync RL architecture. Our methodology achieves stable, efficient training of the 671B DeepSeek model using only 48 H800 GPUs, a more than 10× reduction in hardware requirements, while maintaining state-of-the-art performance. We demonstrate this efficiency across three critical applications: balanced reasoning, agentic memory, and human-agent interaction, opening new possibilities for accessible large-scale RL research.

Efficient All-Sync RL on 671B DeepSeek with 6 Nodes and 48 H800 GPUs

We adapted Coati as the trainer backend and SGLang as the inference backend. We built pipeline-parallelism support for the trainer on top of Coati and used the SGLang router for asynchronous distributed inference at scale. To connect the two, we designed a hybrid communication mechanism built on Ray, Gloo, and shared-disk infrastructure.
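As an illustration, here is a minimal sketch of how such a hybrid channel can be wired, with Ray providing the control plane, a Gloo process group carrying small CPU tensors, and a shared filesystem moving the bulk weights. The names SHARED_DIR, InferenceWorker, and trainer_publish are hypothetical and are not the actual Coati/SGLang interfaces.

```python
import os

import ray
import torch
import torch.distributed as dist

SHARED_DIR = "/mnt/shared/lora_ckpt"  # shared filesystem visible to all nodes


@ray.remote
class InferenceWorker:
    """Placeholder for one SGLang-router-managed inference replica."""

    def reload_weights(self, step: int) -> bool:
        path = os.path.join(SHARED_DIR, f"step_{step}.pt")
        state = torch.load(path, map_location="cpu")  # bulk weights via shared disk
        # ... push `state` into the local inference engine here ...
        return True


def trainer_publish(step: int, lora_state: dict, workers: list) -> None:
    # 1) Bulk transfer: write the merged LoRA weights once to shared storage.
    torch.save(lora_state, os.path.join(SHARED_DIR, f"step_{step}.pt"))
    # 2) Small metadata (here just the step counter) over a Gloo CPU group.
    if dist.is_initialized():
        dist.broadcast(torch.tensor([step]), src=0)
    # 3) Control plane: tell every inference replica to reload, via Ray RPC.
    ray.get([w.reload_weights.remote(step) for w in workers])
```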

Figure 1: H800 GPU Usage Optimization Path

Through LoRA and All-Sync RL optimizations, we reduced GPU requirements from 512 to 48 H800s.

We identified the main bottleneck as severe underutilization of compute in standard on-policy RL algorithms such as Group Relative Policy Optimization (GRPO), commonly known as "GPU bubbles." The inefficiency arises because the training step must wait for the inference step to finish generating trajectories (and vice versa), whether or not a one-step offset is used.

To address the training-inference compute imbalance, we introduced a fully synchronous, or All-Sync RL, architecture. By colocating the trainer and the inference engine on the same GPUs and alternating between generation and training within each step, this framework eliminates GPU bubbles while remaining strictly on-policy.
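A minimal sketch of one All-Sync step is shown below. The generate/train/publish callables are hypothetical stand-ins for the real Coati and SGLang backend calls; the point is only that the same devices switch roles within a step rather than idling.

```python
from typing import Callable, Sequence


def all_sync_step(
    prompts: Sequence[str],
    generate: Callable,  # rollout via the inference engine (e.g., SGLang)
    train: Callable,     # one DAPO/GRPO update on the trainer backend
    publish: Callable,   # merge + broadcast the LoRA delta for the next step
):
    # Phase 1: strictly on-policy rollout with the current weights.
    trajectories = generate(prompts)
    # Phase 2: the same GPUs immediately switch to training; no idle bubble.
    metrics = train(trajectories)
    # Phase 3: publish updated weights so step t+1 samples from the new policy.
    publish()
    return metrics
```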

With LoRA-based RL, we significantly reduced the cost of communicating model weights between the trainer and the inference engine. We also designed and implemented an accelerated, parallel, distributed LoRA merge and quantization path on GPUs for 671B DeepSeek-architecture models.
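Conceptually, the merge applies the standard LoRA update W' = W + (alpha / r) · B A to each locally owned weight shard on GPU before the result is quantized and handed to the inference backend. The sketch below shows this per-shard merge under those assumptions; the actual parallel scheduling and quantization kernels are not shown.

```python
import torch


@torch.no_grad()
def merge_lora_shard(
    base_weight: torch.Tensor,  # (out_features, in_features), locally owned shard
    lora_A: torch.Tensor,       # (r, in_features)
    lora_B: torch.Tensor,       # (out_features, r)
    alpha: float = 256.0,
    rank: int = 128,
) -> torch.Tensor:
    """Merge one LoRA pair into its base weight on GPU: W' = W + (alpha / r) * B @ A."""
    base_weight.add_(lora_B @ lora_A, alpha=alpha / rank)  # in-place, no extra copy of W
    return base_weight
```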

A comprehensive end-to-end analysis of the RL pipeline confirmed that this combination reduces the wall-clock time of a single step from 9 hours to 1.5 hours, a 6× speedup.

We also found that with large batch sizes (e.g., 512) and large num_generations (e.g., 16 or 64), thousands of samples are generated per step, so the benefit of pausing and resuming overlength samples across steps is marginal.

Figure 2: Comparison between On-policy RL, All Async RL, All Sync RL

On-policy RL (GPU bubble): the trainer idles while the inferencer generates, and the inferencer waits while the trainer updates.

All Async RL (one-step offset): trainer step t overlaps with inferencer step t+1, removing bubbles at the cost of training on slightly stale trajectories.

All Sync RL (on-policy): the trainer and inferencer share the same GPUs and alternate within each step, so every step is fully utilized while remaining strictly on-policy.

We successfully trained 671B DeepSeek with only 6 nodes and 48 H800 GPUs, achieving substantial improvements in inference efficiency and cost-effectiveness. Our research relies on support from the open-source community. To contribute back, we plan to make the full training results and open model weights publicly available within the next few weeks.

Implementation and Validation of RL with LoRA and DAPO

We implemented LoRA training and Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) and validated their effectiveness for RL at the 671B scale.

Adapting LoRA to a 671B MoE model required considerable engineering effort. We added support for pipeline parallelism, parallel distributed merging, and sharded saving and loading of high-rank LoRA adapters for 671B DeepSeek-architecture models. Experiments suggested that rank=128 and alpha=256 effectively balance performance and GPU memory cost.
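For reference, the sketch below expresses this rank/alpha setting with the Hugging Face peft interface purely for readability; the target module names are placeholders, and the in-house trainer uses its own pipeline-parallel LoRA implementation rather than peft directly.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=128,            # LoRA rank reported above
    lora_alpha=256,   # scaling factor: alpha / r = 2.0
    lora_dropout=0.0,
    # Target modules are illustrative; the real adapter placement for the
    # 671B DeepSeek MoE architecture is handled inside the trainer.
    target_modules=["q_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```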

Figure 3: LoRA Rank Experiments

LoRA rank comparison showing accuracy metrics with standard deviation bands across different rank configurations.

We implemented the DAPO framework: raising the upper clip bound (clip-higher) to prevent entropy collapse, dynamically filtering out prompts whose rollouts are all correct or all wrong to stabilize training, using a token-level loss for long reasoning tasks, and shaping rewards for overlong outputs to reduce noise. Integrating this framework substantially improves GRPO stability. We found that a larger num_generations increases cost on easy tasks but reduces cost on harder tasks. We also proposed and implemented a fix for an incompatibility between cutting-edge inference engines and RL training:

Equation 1: Importance Sampling Fix

Expected (Theoretical)

$$\small{ \mathbb{E}_{a\sim\textcolor{blue}{\pi_{\text{fsdp}}}(\theta_{\mathrm{old}})} \Bigl[ \nabla_\theta \min\Bigl( \frac{\textcolor{blue}{\pi_{\text{fsdp}}}(a, \theta)}{\textcolor{blue}{\pi_{\text{fsdp}}}(a, \theta_{\mathrm{old}})}\,\hat A, \;\mathrm{clip}\bigl(\frac{\textcolor{blue}{\pi_{\text{fsdp}}}(a, \theta)}{\textcolor{blue}{\pi_{\text{fsdp}}}(a, \theta_{\mathrm{old}})},\,1-\epsilon,\,1+\epsilon\bigr)\,\hat A \Bigr) \Bigr] }$$

Rollout with FSDP is too slow, so we sample from an inference backend instead, which creates a distribution gap:

$$\small{ {a\sim\textcolor{blue}{\pi_{\text{fsdp}}}(\theta_{\mathrm{old}})} \Rightarrow {a\sim\textcolor{red}{\pi_{\text{vllm}}}(\theta_{\mathrm{old}})} }$$

VeRL Approach

$$\small{ \mathbb{E}_{a\sim\textcolor{red}{\pi_{\text{vllm}}}(\theta_{\mathrm{old}})} \Bigl[ \nabla_\theta \min\Bigl( \frac{\textcolor{blue}{\pi_{\text{fsdp}}}(a, \theta)}{\textcolor{blue}{\pi_{\text{fsdp}}}(a, \theta_{\mathrm{old}})}\,\hat A, \;\mathrm{clip}\bigl(\frac{\textcolor{blue}{\pi_{\text{fsdp}}}(a, \theta)}{\textcolor{blue}{\pi_{\text{fsdp}}}(a, \theta_{\mathrm{old}})},\,1-\epsilon,\,1+\epsilon\bigr)\,\hat A \Bigr) \Bigr] }$$

Coati Approach

$$\small{ \mathbb{E}_{a\sim\textcolor{red}{\pi_{\text{vllm}}}(\theta_{\mathrm{old}})} \Bigl[ \nabla_\theta \min\Bigl( \frac{\textcolor{blue}{\pi_{\text{fsdp}}}(a, \theta)}{\textcolor{red}{\pi_{\text{vllm}}}(a, \theta_{\mathrm{old}})}\,\hat A, \;\mathrm{clip}\bigl(\frac{\textcolor{blue}{\pi_{\text{fsdp}}}(a, \theta)}{\textcolor{red}{\pi_{\text{vllm}}}(a, \theta_{\mathrm{old}})},\,1-\epsilon,\,1+\epsilon\bigr)\,\hat A \Bigr) \Bigr] }$$

Handling the Mismatch

$$\small{\mathbb{E}_{a\sim\textcolor{red}{\pi_{\mathrm{vllm}}}(\theta_{\mathrm{old}})}\Bigl[\underbrace{\min\Bigl( \frac{\textcolor{blue}{\pi_{\mathrm{fsdp}}}(a,\theta_{\mathrm{old}})}{\textcolor{red}{\pi_{\mathrm{vllm}}}(a,\theta_{\mathrm{old}})}, C\Bigr)}_{\text{truncated importance ratio}}\cdot\nabla_{\theta}\,\min\Bigl( \frac{\textcolor{blue}{\pi_{\mathrm{fsdp}}}(a,\;\theta)}{\textcolor{blue}{\pi_{\mathrm{fsdp}}}(a,\;\theta_{\mathrm{old}})}\,\hat{A}, \mathrm{clip}\Bigl( \frac{\textcolor{blue}{\pi_{\mathrm{fsdp}}}(a,\;\theta)}{\textcolor{blue}{\pi_{\mathrm{fsdp}}}(a,\;\theta_{\mathrm{old}})}, 1-\epsilon,\;1+\epsilon \Bigr)\,\hat{A}\Bigr)\Bigr]}$$

A truncated importance ratio corrects for the distribution mismatch between the training and inference backends.
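A minimal sketch of this corrected, token-level objective is given below: the truncated importance weight between the trainer's and the inference engine's old-policy log-probabilities rescales the standard clipped surrogate. Tensor names and shapes are illustrative, not the actual trainer code.

```python
import torch


def corrected_policy_loss(
    logp_new,        # log pi_train(a | theta),      shape [B, T]
    logp_old_train,  # log pi_train(a | theta_old),  shape [B, T]
    logp_old_infer,  # log pi_infer(a | theta_old),  shape [B, T]
    advantages,      # [B, T], or broadcastable per-sequence advantages
    mask,            # 1 for response tokens, 0 for prompt/padding
    eps: float = 0.2,
    C: float = 2.0,
):
    # Clipped surrogate computed entirely under the trainer's distribution.
    ratio = torch.exp(logp_new - logp_old_train)
    surrogate = torch.minimum(
        ratio * advantages,
        torch.clamp(ratio, 1 - eps, 1 + eps) * advantages,
    )
    # Truncated importance weight corrects for sampling from the inference engine.
    iw = torch.exp(logp_old_train - logp_old_infer).clamp(max=C).detach()
    # Token-level loss (DAPO-style): average over all valid tokens in the batch.
    return -(iw * surrogate * mask).sum() / mask.sum().clamp(min=1)
```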

RL Applications: Balanced Reasoning, Agentic Memory, and Human-Agent Interaction

Reinforcement learning with verifiable rewards has proven effective but has largely been limited to reasoning capabilities and math/coding tasks. With our highly efficient RL framework, we extended the boundaries of reinforcement learning with verifiable rewards, focusing on three pillars that are mission-critical from product and user experience perspectives.

Balanced Reasoning

We defined a "Balanced Reasoning" task where the objective was to solve complex problems requiring multi-step reasoning. We penalized longer reasoning trajectories in the reward function even when the final answer was correct, thereby optimizing for both accuracy and token efficiency. We achieved 90% of the baseline R1 model's performance while consuming only 45% of the reasoning tokens.
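The reward shaping can be sketched as follows, assuming a hypothetical reward function in which correct answers earn a base reward and unnecessarily long traces are penalized; the coefficient and token budget below are placeholders, not the tuned values.

```python
def balanced_reasoning_reward(
    is_correct: bool,
    num_reasoning_tokens: int,
    token_budget: int = 4096,   # placeholder budget
    length_coef: float = 0.5,   # placeholder penalty coefficient
) -> float:
    if not is_correct:
        return 0.0
    # Linear penalty on tokens beyond the budget, capped so the reward for a
    # correct answer always stays positive.
    overflow = max(0, num_reasoning_tokens - token_budget)
    penalty = min(length_coef, length_coef * overflow / token_budget)
    return 1.0 - penalty
```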

Figure 4: Balanced Reasoning

Our Balanced Reasoning approach delivers 90% of top-tier performance using only 45% of the tokens.

Samples 1: Balanced Reasoning

Agentic Memory

We designed a task requiring the model to generate tens of thousands of tokens while maintaining consistency. Using our Multi-conv DAPO framework to foster what we term Agentic Memory, we trained the model to maintain context and consistency over extremely long generation sequences.

Figure 5: Multi-conv DAPO

Multi-conv DAPO: Query (q) → Policy Model → Group of Conversations (o₁,₁, o₁,₂, …, o_G,k) → Rule-Based Verifier → Final Reward & Advantage.

Standard GRPO: Query (q) → Policy Model → Outputs (o₁, o₂, …) → Rule-Based Verifier → Reward (r) & Advantage (A).
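Both pipelines in Figure 5 end with a group-relative advantage in the GRPO style, and the dynamic-sampling filter mentioned earlier drops groups that carry no learning signal. A minimal sketch of these two pieces, under those assumptions, is shown below.

```python
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape [G], verifier scores for one query's group of G conversations."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def keep_group(rewards: torch.Tensor) -> bool:
    # DAPO-style dynamic sampling: discard groups whose rewards are all equal
    # (all-correct or all-wrong), since their normalized advantages carry no signal.
    return bool(rewards.max() > rewards.min())
```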

Figure 6: Agentic Memory

Traditional method (long-context LLM): chunks 1 … N plus the question Q go to a single long-context LLM, which produces the answer A.

Agentic Memory via RL (constrained by the single-conversation max length): Chunk 1 + Q → LLM; Chunk 2 + Q + M₁ → LLM; …; the final call produces the answer A.

Approach for Long-Output Generation

Agentic Memory via RL (constrained by the single-conversation max length): Q + M₀ → LLM → T₁ + M₁; Q + M₁ → LLM → T₂ + M₂; …; Q + Mₙ → LLM → final answer A, scored by a verifiable reward.

Legend: numbered boxes are input chunks; Q = question, M = memory, T = thought, A = answer.
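The long-output loop in Figure 6 can be sketched as below: each call stays under the single-conversation length limit, emits a thought plus an updated memory, and only the final answer is scored by the verifiable reward. The llm_step and verify callables are hypothetical interfaces, not the actual rollout code.

```python
from typing import Callable, List, Optional, Tuple


def agentic_memory_rollout(
    question: str,
    llm_step: Callable,  # one bounded-length call: (question, memory) -> (thought, memory, answer | None)
    verify: Callable,    # verifiable reward on the final answer
    max_turns: int = 32,
) -> Tuple[List[str], Optional[str], float]:
    memory = ""          # M0: empty working memory
    thoughts: List[str] = []
    for _ in range(max_turns):
        # Each call sees only the question plus the compact memory, never the
        # full generation history, so it stays under the single-conv max length.
        thought, memory, answer = llm_step(question, memory)
        thoughts.append(thought)  # T_{t+1}
        if answer is not None:
            return thoughts, answer, verify(question, answer)  # reward only on A
    return thoughts, None, 0.0   # ran out of turns: no reward
```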

Samples 2: Agentic Memory

Human-Agent Interaction

Leveraging a proprietary dataset of human-agent interactions, we investigated how RL can train models to be more engaging and to better infer implicit user needs. We built questions that probe emotional intelligence (EQ) rather than logical reasoning and assessed them against verifiable ground truth. By curating a high-quality subset representing the top 5% of these interactions, we aimed to enhance the model's capabilities in memory retention, tool use, and sophisticated conversational skills.

Samples 3: Human-Agent Interaction

Limitations

The models can inherit biases, generate harmful content, and struggle with factual accuracy in long outputs. They excel at familiar tasks but perform poorly on novel ones. Furthermore, the method still requires significant compute and is currently tailored to a specific architecture, which limits broader adoption.

What's Next

We plan to use more diverse, high-quality data and connect the models to more useful tools. We will also work on making training more efficient through better algorithms. Finally, we will expand our methods to multimodal foundation models to handle images and video.

Author

MIND LABS

Core Contributors

Qihan Liu, Rio Yang, Alex Yin, Andrew Chen

Team

Kaijie Chen, Huan Feng, Hao Fu, Peng Guan, Yuyi Jiang, Yongfang Jiang, Alex Jin, Qiuyu Jin, Nora Lam, Boxu Li, Scott Liu, Xiaoteng Ma, Guian Qiu, Haofeng Wu, Chuyan Zhang, Xinyue Zhu

Acknowledgement

Tianle Cai, Junhong Chen, Rui Li, Zisen Lin, Renjing Xu, Shunyu Yao

Names are listed alphabetically within team and acknowledgement.