DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (Guo et al.)

May 3, 2026

supervised fine-tuning (SFT) - analogous to behavior cloning on ground-truth demonstrations

reinforcement fine-tuning (RFT) - rewards good behavior and suppresses bad behavior

-e.g. for a math problem solution
-SFT: minimize the difference between the wording/steps of the generated solution and the accepted solution
-teaches: what does the structure of a mathematical solution look like?
-RFT: given a verifier/reward signal, find the best way to get to the solution
-teaches: which reasoning trajectories are most likely to reach the correct solution
-"Furthermore, by constraining models to replicate human thought processes, their performance is inherently capped by the human-provided exemplars, which prevents the exploration of superior, non-human-like reasoning pathways"
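A minimal sketch of the SFT/RFT contrast above, under stated assumptions: the function names are hypothetical, the per-token probabilities are made-up inputs, and the RFT side is a plain REINFORCE-style surrogate, not the paper's actual objective. SFT only scores how likely the accepted solution's tokens are, while RFT scales the log-probability of whatever the model sampled by a reward-derived advantage.

```python
import math

def sft_loss(ref_token_probs):
    """Mean negative log-likelihood of the accepted solution's tokens.

    ref_token_probs: probability the model assigns to each reference token.
    """
    return -sum(math.log(p) for p in ref_token_probs) / len(ref_token_probs)

def rft_loss(sampled_token_probs, advantage):
    """REINFORCE-style surrogate for one sampled solution (illustrative only).

    Minimizing this raises the sequence log-probability when advantage > 0
    and lowers it when advantage < 0.
    """
    log_prob = sum(math.log(p) for p in sampled_token_probs)
    return -advantage * log_prob

# Hypothetical per-token probabilities for a two-token solution.
probs = [0.5, 0.5]
print(sft_loss(probs))          # imitation: pull the model toward the reference
print(rft_loss(probs, 1.0))     # good sample: gradient pushes its probability up
print(rft_loss(probs, -1.0))    # bad sample: gradient pushes its probability down
```

The sign of the advantage is what makes "reward good behavior, suppress bad behavior" concrete: the same sampled tokens are reinforced or discouraged depending only on the reward signal, with no reference wording involved.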

GRPO

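-Group Relative Policy Optimization (GRPO): simplifies the training process and reduces the resource consumption of Proximal Policy Optimization (PPO)
-drops PPO's separate value (critic) model; the baseline comes from a group of outputs sampled for the same prompt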
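A small sketch of the group-relative advantage that GRPO uses in place of a learned value model: sample several outputs for one prompt, score each, and normalize the rewards within the group. The function name, the `eps` term, and the choice of population standard deviation are assumptions for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group: (r_i - mean) / std.

    rewards: scores for a group of outputs sampled from the same prompt.
    eps: assumed small constant to avoid division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption in this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt: two correct (1.0), two incorrect (0.0).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))

# If every sample gets the same reward, no sample is preferred.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

Because the baseline is just the group mean, no second network has to be trained or kept in memory, which is where the resource savings over PPO come from.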

Post-Training Pipeline

1.RL (R1-Zero)
-Trained jointly on verifiable tasks

Problems:

-outputs become messy
-formatting degrades
-language quality/consistency drops

This happens because the reward signal only checks whether the final answer is correct, so nothing penalizes unreadable or inconsistent reasoning text

2.SFT (R1)
-Small SFT phase to fix structure, language consistency, etc.
3.Full RL on mixed tasks
-Tasks mix reasoning with helpfulness, chat/alignment, safety, etc.
-uses both verifiable (rule-based) rewards and learned reward-model signals
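The reward setup across the pipeline can be sketched as follows. This is a toy illustration, not the paper's implementation: `correctness_reward`, `mixed_reward`, the `\boxed{...}` answer convention, and the `preference_model` callable are all hypothetical names and assumptions. It shows why a correctness-only reward is blind to messy output, and how a later RL stage might route open-ended tasks to a learned scorer.

```python
import re

def correctness_reward(output, ground_truth):
    # Rule-based check: compare the last \boxed{...} answer to the ground truth.
    # Assumes answers are wrapped in \boxed{...}; everything else is ignored.
    answers = re.findall(r"\\boxed\{([^{}]*)\}", output)
    if not answers:
        return 0.0
    return 1.0 if answers[-1].strip() == ground_truth.strip() else 0.0

def mixed_reward(output, ground_truth=None, preference_model=None):
    # Verifiable task: use the rule-based check.
    if ground_truth is not None:
        return correctness_reward(output, ground_truth)
    # Open-ended task: fall back to a learned scorer (hypothetical callable).
    if preference_model is None:
        raise ValueError("open-ended tasks need a preference model")
    return preference_model(output)

clean = "Dividing both sides by 2 gives x = 5. \\boxed{5}"
messy = "2x=10 hmm wait 所以 x=5 ok ok \\boxed{5}"

# Both outputs earn the same reward: formatting and language never enter the score.
print(correctness_reward(clean, "5"), correctness_reward(messy, "5"))  # prints: 1.0 1.0

# Stand-in scorer for a chat-style task with no ground truth.
print(mixed_reward("Happy to help with that.", preference_model=lambda text: 0.7))
```

Since the clean and messy outputs are indistinguishable to the rule-based check, fixing readability needs either an SFT phase or an extra reward term, which matches the motivation for steps 2 and 3.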