DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (Guo et al.)

May 3, 2026

supervised fine-tuning (SFT) - analogous to behavior cloning from ground-truth demonstrations

reinforcement fine-tuning (RFT) - rewards good behavior and suppresses bad behavior

-e.g. for a math problem solution

-SFT: minimize the difference between the wording/steps of the generated solution and those of the accepted solution

-teaches: what does the structure of a mathematical solution look like?

-RFT: given verifier/reward signal, find the best way to get to the solution

-teaches: which reasoning trajectories are most likely to reach the correct solution

-"Furthermore, by constraining models to replicate human thought processes, their performance is inherently capped by the human-provided exemplars, which prevents the exploration of superior, non-human-like reasoning pathways"
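A toy sketch of the SFT/RFT contrast above. The three candidate solutions, their probabilities, and the verifier are all made up for illustration; real training works on token-level log-probabilities, not on a lookup table. The point is only that SFT credits the single accepted solution, while the RFT objective credits any trajectory the verifier accepts.

```python
import math

# Toy "policy": probability the model assigns to each candidate solution.
# (Hypothetical numbers, for illustration only.)
policy = {"solution_a": 0.5, "solution_b": 0.3, "solution_c": 0.2}


def sft_loss(policy, reference):
    """SFT view: negative log-likelihood of the one accepted solution."""
    return -math.log(policy[reference])


def verifier(solution):
    """Hypothetical verifier: a and c both reach the correct answer."""
    return 1.0 if solution in {"solution_a", "solution_c"} else 0.0


def expected_reward(policy, verifier):
    """RFT view: expected verifier reward over the policy's own outputs."""
    return sum(prob * verifier(sol) for sol, prob in policy.items())


# SFT only "sees" solution_a, the accepted write-up.
loss = sft_loss(policy, "solution_a")

# RFT also credits solution_c, a different route to the same answer.
reward = expected_reward(policy, verifier)
```

Under SFT, probability mass on solution_c counts against the model even though it is correct; under RFT it counts in the model's favor, which is the room for exploration the quote refers to.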

GRPO

-simplifies the training process and reduces resource consumption relative to Proximal Policy Optimization (PPO)
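GRPO (Group Relative Policy Optimization) saves resources mainly by dropping PPO's learned value model: several outputs are sampled for the same prompt, and each output's reward is scored relative to its own group. A minimal sketch of that group-relative advantage follows. The function name, the `eps` term, and the use of the population standard deviation are assumptions for illustration; the clipped policy-ratio objective and KL penalty are left out.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per output sampled for a single prompt.
    `eps` (an assumption here) avoids division by zero when all rewards match.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt: two verified correct, two incorrect.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct samples end up with positive advantages and incorrect ones with negative advantages, so no separate critic network is needed to supply a baseline.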

Post-Training Pipeline

1.RL (R1-Zero)

-Trained jointly on verifiable tasks

Problems:

-outputs become messy

-formatting degrades

-language quality/consistency drops

This happens because the reward signal essentially only cares about correctness, so nothing penalizes messy or inconsistent output

2.SFT (R1)

-Small SFT phase to fix structure, language consistency, etc.

3.Full miscellaneous RL

-Task mix covers reasoning, helpfulness, chat/alignment, safety, etc.

-both verifiable and mixed rewards
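A small sketch of the problem noted under step 1: a reward that only checks the final answer cannot tell a clean solution from a messy one. The `correctness_reward` function and both sample outputs are hypothetical, not the verifier used in the paper.

```python
import re


def correctness_reward(output: str, expected: str) -> float:
    """Hypothetical verifier: 1.0 if the last number in the output matches."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", output)
    return 1.0 if numbers and numbers[-1] == expected else 0.0


# Same final answer, very different readability and language consistency.
clean = "First add 2 and 3 to get 5. Then double it. The answer is 10"
messy = "2+3 =5 嗯 wait wait double...so so 10"

clean_reward = correctness_reward(clean, "10")
messy_reward = correctness_reward(messy, "10")
```

Both outputs get the same reward, so RL alone has no pressure toward the clean one. That is the gap the small SFT phase in step 2 is meant to close.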