Deep Dive into LLMs like ChatGPT (Karpathy)

March 24, 2025

LLM base model - given a sequence of input tokens, give the most likely next token (chat completion)

-"internet document simulator"
-vocabulary size = 100,277 tokens
-bits -> words, and then commonly occurring words/phrases are further combined

How to get from base model to a chat assistant?

in-context learning - learn from context in prompts

few-shot prompt - prompt with examples to enable in-context learning

-one way to get from a base model to a chat assistant is to give it a few-shot prompt with a chat history, and prompt with the next question
-still slightly hallucinates

instruct models - question-answering assistants

-in post-training, human labelers create a dataset of detailed, multi-turn conversations
-base model will learn how to generate tokens in a conversational flow
-supervised fine-tuning (SFT)
-learn system prompt tokens like \\| im_start \\|, \\| im_sep \\|, and \\| im_end \\| to indicate the parts of the prompt and generate structured output

How to prevent hallucinations?

hallucinations - incorrect/made up information from LLMs

-occur because LLMs don't/didn't know how to say "I don't know"
-training data only has patterns with confident answers, so it replicates that format even without the necessary information

Mitigations

Mitigation 1: Use model interrogation to determine knowledge boundaries, augment training dataset with knowledge-based refusals when the model doesn't know

How to interrogate the model to figure out the boundaries of its knowledge?

1.Take a section from a random document in the training set. Provide it as context to an LLM, and ask for questions/answer pairs for that section
2.Interrogate the model in question using the question/answer pairs
3.Check if the answer is correct by comparing the original answer (with context) with the interrogation answer
4.If the question is not answered correctly, use that question to create a new entry in the training dataset that has the answer "I don't know"

Mitigation 2: Allow the model to search

1.Learn to emit \\<SEARCH_START\\> and \\<SEARCH_END\\> tokens when a web search would be useful/necessary
2.Use the query to make a web search, paste the results as context for the model
3.Generate a response using the web search results as context

Vague recollection vs. Working Memory

-knowledge in parameters -> vague recollection
-knowledge in the context window -> working memory
-models are not good at spelling/counting because they see tokens (groups of chars/words), not letters

Reinforcement Learning (verifiable domains)

verifiable domain - concrete answer that can be checked easily

-i.e. "What is 2+2" -> "4"
-as humans, we "think" differently than LLMs
-training data provides methods to get from a prompt to an accurate completion
-we don't know that this is the best way for LLMs to "think"
-We want the system to learn from efficient/correct response processes:
1.Generate multiple different completions for the same prompt
2.Use the correct completions to train the model and encourage that "thought process"/completion pattern
3.Model learns the reliably correct sequences on its own -> emergent reasoning

DeepSeek-R1 Paper: publicly describes their RL pipeline in detail, show emergent capabilities from RL

-model learns how to imitate chains of thought to increase accuracy of problem solving
-cognitive strategies that can't explicitly be taught
-leads to longer responses + higher accuracy

RLHF: Reinforcement Learning from Human Feedback (unverifiable domains)

unverifiable domains - hard to create a concrete heuristic to evaluate output

-i.e. "Write a joke about pelicans"

Process:

1.Create a dataset: Take prompts, various outputs per prompt, and have a human order them from best to worst
2.Train a reward model to simulate human preferences/scores
3.Run Reinforcement Learning with the reward model

Benefits:

-discriminator-generator gap - much easier to discriminate than to generate
-hard for human labelers to create good examples for unverifiable domains, easy to discriminate
-model gets better based on human discrimination with RLHF

Drawbacks:

-we are doing RL with a "lossy" representation of humans -> misleading
-find a way to "game" the reward models -> adversarial examples
-will always happen if RLHF is run for too long, need to cutoff