DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (Guo et al.)
May 3, 2026
supervised fine-tuning (SFT) - analogous to behavior cloning on ground-truth demonstrations
reinforcement fine-tuning (RFT) - rewards good behavior and suppresses bad behavior
-e.g. for a math problem solution
-SFT: minimize the difference between the wording/steps of the generated solution and the accepted solution
-teaches: what the structure of a mathematical solution looks like
-RFT: given a verifier/reward signal, find the best way to get to the solution
-teaches: which reasoning trajectories are most likely to reach the correct solution
-"Furthermore, by constraining models to replicate human thought processes, their performance is inherently capped by the human-provided exemplars, which prevents the exploration of superior, non-human-like reasoning pathways"
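The SFT/RFT contrast above can be sketched as two toy loss functions. This is a minimal illustration, not the paper's training code: the function names, the mean-reward baseline, and the use of plain floats in place of differentiable log-probabilities are all assumptions made for the example.

```python
import math

def sft_loss(reference_token_log_probs):
    """Behavior cloning: average negative log-likelihood the model assigns
    to the tokens of the accepted (ground-truth) solution."""
    return -sum(reference_token_log_probs) / len(reference_token_log_probs)

def rft_loss(sample_log_probs, rewards):
    """REINFORCE-style surrogate (assumed form): weight the log-probability
    of each sampled trajectory by its reward minus a mean-reward baseline.
    Minimizing this raises the probability of above-average trajectories."""
    baseline = sum(rewards) / len(rewards)
    terms = [-(r - baseline) * lp for lp, r in zip(sample_log_probs, rewards)]
    return sum(terms) / len(terms)

# SFT needs a reference solution; RFT only needs a score per sampled attempt.
print(sft_loss([math.log(0.5), math.log(0.25)]))
print(rft_loss([-1.0, -2.0], [1.0, 0.0]))
```

Note that `rft_loss` never sees a reference solution, only rewards, which is why it is free to favor trajectories no human wrote.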
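GRPO (Group Relative Policy Optimization)
-simplifies the training process and reduces resource consumption relative to Proximal Policy Optimization (PPO)
-drops PPO's separate value (critic) model; the baseline is estimated from the rewards of a group of sampled outputs for the same prompt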
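A rough sketch of the group-relative idea behind GRPO: sample several outputs for one prompt, score each, and normalize the rewards within the group instead of querying a learned value model. The function name, the `eps` term, and the choice of population standard deviation are assumptions for illustration; the full GRPO objective also includes a clipped policy ratio and a KL penalty, which are omitted here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt:
    advantage_i = (r_i - mean(r)) / (std(r) + eps).
    eps (an assumption) avoids division by zero when all rewards are equal."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt, two judged correct by the verifier.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct samples get a positive advantage and incorrect ones a negative advantage; if every sample in the group gets the same reward, all advantages are zero and that prompt contributes no learning signal.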
Post-Training Pipeline
1.RL (R1-Zero)
-RL applied directly to the base model (no SFT first), trained jointly on verifiable tasks
Problems:
-outputs become messy
-formatting degrades
-language quality/consistency drops
This happens because the reward signal mainly checks correctness (plus basic output format), not readability or language consistency
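To make the problem above concrete, here is a toy rule-based reward that looks only at the final answer. The `<think>`/`<answer>` tag layout follows the R1-Zero prompt template; the function name, the exact-match comparison, and the example strings are assumptions for illustration.

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def correctness_reward(output: str, reference: str) -> float:
    """Return 1.0 if the text inside <answer>...</answer> matches the
    reference exactly (after stripping whitespace), else 0.0.
    Nothing inside <think>...</think> is inspected."""
    match = ANSWER_RE.search(output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0

clean = "<think>2 + 2 = 4</think><answer>4</answer>"
messy = "<think>hmm wait wait, deux plus deux... ok = 4 ??</think><answer>4</answer>"

print(correctness_reward(clean, "4"), correctness_reward(messy, "4"))
```

Both outputs earn the same reward, so nothing in this signal pushes the policy toward the readable, single-language reasoning of `clean`.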
2.SFT (R1)
-Small SFT phase to fix structure, language consistency, etc.
3.Full mixed-task RL
-Reasoning tasks mixed with helpfulness, chat/alignment, safety, etc.
-uses both verifiable (rule-based) rewards and learned reward-model signals, depending on the task
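One way to picture the mixed reward setup in stage 3: route each sample to a rule-based check when a reference answer exists, and to a learned preference scorer otherwise. Everything here is hypothetical scaffolding (`mixed_reward`, the `preference_model` callable, the stub scorer); it is not the paper's implementation.

```python
from typing import Callable, Optional

def mixed_reward(
    output: str,
    reference: Optional[str],
    preference_model: Callable[[str], float],
) -> float:
    """Hypothetical dispatcher: verifiable tasks (reference available) get a
    0/1 exact-match reward; open-ended tasks get the preference model's score."""
    if reference is not None:
        return 1.0 if output.strip() == reference.strip() else 0.0
    return preference_model(output)

def stub_preference_model(text: str) -> float:
    """Stand-in for a learned reward model: favors polite, non-empty replies."""
    return 0.9 if "please" in text.lower() else 0.2

print(mixed_reward("42", "42", stub_preference_model))               # verifiable task
print(mixed_reward("Please see below.", None, stub_preference_model))  # chat task
```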