Models as Probabilistic Infrastructure

The gap between what you assume and what's true

You send the same prompt twice to GPT-4. Same system prompt. Same user message. You get back two meaningfully different responses — one cautious and hedged, one direct and confident.

A few weeks later, the model starts refusing a perfectly reasonable request it had handled without issue for months. Nothing in your code changed. No API version bump. No error. Just different behavior.

These aren't bugs. They're direct consequences of specific design decisions made long before you wrote your first prompt. Most developers treat the model as a black box and work around its behavior empirically. That works until it doesn't.

This chapter opens the box. Not to explain the mathematics — Synapse has four modules on transformer internals if you want that depth. Here the goal is practical: understand the decisions that shaped the model you're calling, what they imply for how it behaves, and how you engineer reliably on top of something that's fundamentally probabilistic.

Two phases, almost nothing in common

Every foundation model you call has been through two training phases with different goals, different data, and different costs.

Pre-training comes first. The model is trained on hundreds of billions of tokens from the internet: books, Reddit, Wikipedia, code repositories, news articles, academic papers. GPT-3's filtered Common Crawl data alone came to 570GB of text, and the model saw roughly 300 billion tokens during training. The sole objective: given a sequence of tokens, predict the next one. After enough iterations across enough data, the model becomes extraordinarily good at text completion.
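To make the objective concrete, here is a minimal sketch of the per-token loss that next-token prediction minimizes (cross-entropy). The probabilities are made up for illustration; a real model produces them over its full vocabulary.

```python
import math


def next_token_loss(predicted_probs: dict[str, float], actual_next: str) -> float:
    """Cross-entropy at one position: -log of the probability given to the true next token."""
    return -math.log(predicted_probs[actual_next])


# Hypothetical model output for the context "The cat sat on the"
probs = {"mat": 0.6, "floor": 0.3, "moon": 0.1}

loss_likely = next_token_loss(probs, "mat")     # the model expected this token: low loss
loss_unlikely = next_token_loss(probs, "moon")  # the model was surprised: high loss
print(f"{loss_likely:.3f} vs {loss_unlikely:.3f}")
```

Training nudges the weights so that this loss, averaged over every position in the corpus, goes down.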

This produces a capable but not useful artifact. Ask a pre-trained model a question and it might complete your sentence, ask follow-up questions, or add context — because that's what usually follows a question in internet text. It doesn't know it's supposed to answer. It's predicting what comes next, not responding to you.

A pre-trained model talks like a web page. Not a person.

Post-training transforms it. It uses a tiny fraction of resources compared to pre-training — InstructGPT, OpenAI's first instruction-following model, used 98% of its compute budget on pre-training and 2% on post-training. That 2% is what makes the model behave like an assistant instead of an autocomplete engine.

Post-training step 1: learning to have a conversation

The first post-training step is supervised finetuning (SFT). The model is shown thousands of (prompt, ideal response) pairs — called demonstration data — and finetuned to produce that kind of output.

For InstructGPT, OpenAI collected 13,000 (prompt, response) pairs from human labelers, at an estimated cost of roughly $130,000. The labelers weren't trivia annotators. Around 90% had at least a college degree, and more than a third had a master's. They were making judgment calls about what a good, complete, helpful response to a complex prompt looks like.

The task distribution matters: 45.6% of InstructGPT's SFT data was open-ended generation, 12.4% was open Q&A, 11.2% was brainstorming, 8.4% was chat. The model learned what a good response looks like across that specific distribution of tasks — which is why it feels more fluent on some tasks than others.

After SFT, the model can hold a conversation. It still doesn't know which responses are good — only which ones look like responses.

Post-training step 2: aligning with human preference

The second step is preference finetuning — teaching the model that some responses are better than others.

The classic approach is RLHF (Reinforcement Learning from Human Feedback), used by GPT-3.5 and Llama 2. It has two stages:

Stage 1 — Train a reward model. Labelers see two responses to the same prompt and pick which is better. This is comparison data — cheaper to collect than ideal responses, but still expensive. LMSYS (Large Model Systems Organization) found labelers take 3–5 minutes per comparison. Anthropic paid $3.50 per comparison. OpenAI measured inter-labeler agreement at approximately 73% — meaning labelers disagreed on 27% of comparisons. Human preference is genuinely inconsistent. Any model trained on it inherits that inconsistency.
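The reward model is typically trained with a pairwise loss: it should score the response the labeler preferred higher than the one the labeler rejected. A minimal sketch of that loss for one comparison follows; the function name and the scores are illustrative, and a real reward model produces the scores with a neural network.

```python
import math


def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """-log(sigmoid(score_chosen - score_rejected)), written as log(1 + exp(-margin))."""
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))


# Reward model agrees with the labeler: small loss
agrees = pairwise_reward_loss(score_chosen=2.0, score_rejected=0.0)
# Reward model disagrees with the labeler: large loss
disagrees = pairwise_reward_loss(score_chosen=0.0, score_rejected=2.0)
# Reward model cannot tell the two apart: loss is log(2)
undecided = pairwise_reward_loss(score_chosen=1.0, score_rejected=1.0)
print(agrees, disagrees, undecided)
```

When labelers themselves disagree on a pair, the training signal for that pair points in both directions, which is one way the inconsistency carries into the model.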

Stage 2 — Optimize the model against the reward model. The SFT model is further trained to generate responses that score highly. This uses reinforcement learning — specifically PPO (Proximal Policy Optimization), released by OpenAI in 2017. The result: a model shaped not just to produce text that looks like responses, but to produce responses that humans prefer.

A newer approach, DPO (Direct Preference Optimization), achieves similar results with less complexity. Meta switched from RLHF to DPO for Llama 3. The field hasn't settled on one method — and as Chip Huyen notes, the debate about why either of them works is still unresolved.
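DPO skips the separate reward model and trains directly on the comparison pairs. The sketch below computes the DPO loss from Rafailov et al. for a single pair, assuming you already have the log-probability of each full response under the model being trained (the policy) and under a frozen reference copy. The argument names and numbers are illustrative.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair of responses."""
    # How far the policy has moved away from the reference on each response
    chosen_shift = policy_logp_chosen - ref_logp_chosen
    rejected_shift = policy_logp_rejected - ref_logp_rejected
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    return math.log1p(math.exp(-beta * (chosen_shift - rejected_shift)))


# Policy identical to the reference: no preference learned yet, loss is log(2)
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has raised the chosen response and lowered the rejected one: loss drops
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(start, improved)
```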

Pre-training
  │ Internet-scale data → self-supervised completion → capable but raw model
  ↓
Supervised Finetuning (SFT)
  │ 13K demonstration pairs → learns conversation shape and task formats
  ↓
Preference Finetuning (RLHF or DPO)
  │ Human comparison data → learns which responses humans prefer
  ↓
The model you call via API

What post-training means for you

These decisions have direct engineering consequences.

The model has baked-in values. Preference finetuning embeds thousands of human judgments about good, safe, appropriate responses. This creates systematic defaults: the model hedges more than the facts warrant, avoids certain topics, imposes its own formatting preferences, and refuses requests that pattern-match to low-scoring examples from training. You can override many of these defaults in your system prompt, but you're pushing against encoded behavior, not writing on a blank slate.

Refusals are pattern-matching, not reasoning. When the model declines a request, it isn't analyzing that request from first principles. It's activating patterns from preference finetuning. That's why refusals can be inconsistent and context-sensitive — the same request phrased differently, or in a different system prompt context, activates different patterns. A clear system prompt that establishes legitimate context often unlocks behavior that a bare prompt can't.

Model updates silently change behavior. Post-training is an ongoing process. When a provider releases a new model version, the preference data has changed. Behavior you depended on can shift — prompts that worked cleanly start producing inconsistent output, and there's no error to catch it. This isn't a bug in your code. It's the model's learned defaults drifting.

The only engineering response: version-pin your production models and run evaluation before upgrading. We'll build that infrastructure in Chapter 3.
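As a rough illustration of what pinning and gating can look like in code, the sketch below keeps the model snapshot in one config object and approves an upgrade only if the candidate's measured pass rate does not regress past a tolerance. The names `ModelConfig`, `approve_upgrade`, and `max_regression` are hypothetical, and the pass rates are assumed to come from your own evaluation run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    model: str          # a dated snapshot, not a floating alias
    temperature: float


# Single source of truth for what production calls
PINNED = ModelConfig(model="gpt-4o-2024-08-06", temperature=0.0)


def approve_upgrade(
    pinned_pass_rate: float,
    candidate_pass_rate: float,
    max_regression: float = 0.01,
) -> bool:
    """Approve only if the candidate is no more than max_regression worse than the pinned model."""
    return candidate_pass_rate >= pinned_pass_rate - max_regression


ok = approve_upgrade(pinned_pass_rate=0.94, candidate_pass_rate=0.95)
blocked = approve_upgrade(pinned_pass_rate=0.94, candidate_pass_rate=0.90)
print(ok, blocked)
```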

How outputs are generated: sampling

Understanding the model's training explains what it knows. Sampling explains how it produces output.

After processing your input, the model doesn't produce a word. It produces a logit vector: one raw, unnormalized score for every token in its vocabulary (typically around 100,000 tokens for modern models). A higher logit means the model considers that token a more likely continuation.

A softmax function converts logits to probabilities. Every token gets a probability, and they sum to 1. Typically a handful of tokens hold most of the probability mass, while thousands of others share the remainder in a long tail.

The model samples from that distribution to pick the next token. Then it appends that token to the context and repeats the process. A 200-token response is 200 sequential sampling decisions, each conditioned on everything that came before.
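The loop described above can be sketched in a few lines. Here `toy_logits` is a hypothetical stand-in for the model's forward pass, and the four-token vocabulary is invented for illustration; only the softmax and the sample-append-repeat structure reflect how real decoding works.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "."]


def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(context: list[str]) -> list[float]:
    """Hypothetical stand-in for a model: favors the token after the last one in VOCAB order."""
    last = VOCAB.index(context[-1])
    return [3.0 if i == (last + 1) % len(VOCAB) else 0.0 for i in range(len(VOCAB))]


def generate(prompt: list[str], n_tokens: int, rng: random.Random) -> list[str]:
    context = list(prompt)
    for _ in range(n_tokens):
        probs = softmax(toy_logits(context))
        next_token = rng.choices(VOCAB, weights=probs, k=1)[0]  # sample, don't take the max
        context.append(next_token)  # the choice becomes part of the next step's input
    return context


print(generate(["the"], 5, random.Random()))
```

Run it a few times: the favored continuation usually wins, but not always, and one unusual token changes everything that follows it.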

This is why the same prompt returns different outputs. There's no determinism here. Each call is a new walk through a probabilistic space.

The knobs you control: temperature and top-p

Temperature is the most important sampling parameter for builders. It rescales the logit vector before softmax is applied — specifically, it divides each logit by the temperature value.

Low temperature sharpens the distribution. The highest-probability tokens dominate. The model becomes more consistent, more predictable, more likely to produce the same output for the same input.

High temperature flattens the distribution. Low-probability tokens become more likely. The model becomes more creative, more varied, less predictable.

Temperature 0.1  → spiky distribution → consistent, sometimes repetitive
Temperature 0.7  → balanced distribution → coherent but varied (creative default)
Temperature 1.5  → flat distribution → creative but increasingly incoherent
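A small numeric sketch of the effect, using three invented logits. The function divides each logit by the temperature before softmax, as described above; the specific logit values are assumptions chosen only to make the shift visible.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Divide each logit by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 is handled as greedy decoding instead")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # illustrative scores for three candidate tokens

cold = softmax_with_temperature(logits, 0.1)
neutral = softmax_with_temperature(logits, 1.0)
hot = softmax_with_temperature(logits, 1.5)

for name, probs in [("0.1", cold), ("1.0", neutral), ("1.5", hot)]:
    print(name, [round(p, 3) for p in probs])
```

At 0.1 the top token takes nearly all of the probability; at 1.5 the other two tokens gain a noticeably larger share.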

A practical mapping by task type:

Task                               Temperature
Data extraction, classification    0 – 0.3
Factual Q&A, summarization         0.3 – 0.7
Creative writing, brainstorming    0.7 – 1.2

Top-p (nucleus sampling) limits the sampling pool to the smallest set of tokens whose cumulative probability exceeds p. At top_p=0.9, the model only samples from tokens that together account for 90% of the probability mass, ignoring the long tail. This prevents very low-probability tokens from being selected even at high temperatures.
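A minimal sketch of the nucleus filter on an invented five-token distribution. It keeps tokens in descending order of probability until the running total reaches p, then renormalizes what is left. The stopping rule here uses "reaches or exceeds"; implementations differ slightly on that boundary.

```python
def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept: list[tuple[str, float]] = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    # Renormalize so the surviving tokens sum to 1 again
    return {token: prob / total for token, prob in kept}


# Illustrative distribution with a short tail
probs = {"mat": 0.5, "floor": 0.3, "sofa": 0.15, "moon": 0.04, "qux": 0.01}
nucleus = top_p_filter(probs, p=0.9)
print(nucleus)
```

With p=0.9, "moon" and "qux" are dropped and can never be sampled, no matter how the remaining probabilities are reshaped.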

In practice, adjust temperature or top-p, not both at once. Most providers default top-p to 1.0 and recommend leaving it there unless you have a specific reason to change it.

One crucial clarification: temperature 0 does not guarantee determinism. Providers typically treat it as greedy decoding (always pick the top token), yet the same prompt can still return different outputs across model versions, across hardware configurations, and occasionally within the same session due to floating-point nondeterminism in parallel computation. If you need consistent outputs, you need evaluation, not just a lower temperature.
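The floating-point point is easy to demonstrate on a small scale. Addition of floats is not associative, so the same numbers summed in a different order (which is what happens when parallel hardware schedules work differently) can give a slightly different result. The sketch below is a toy illustration with two invented tokens "A" and "B", not a reproduction of what happens inside a GPU kernel.

```python
# The same three numbers, added in two different orders
logit_a_order1 = (0.1 + 0.2) + 0.3
logit_a_order2 = 0.1 + (0.2 + 0.3)
print(logit_a_order1 == logit_a_order2)  # False

logit_b = 0.6  # a competing token whose score is nearly tied with token A


def greedy_pick(logit_a: float, logit_b: float) -> str:
    """Greedy decoding between two tokens: take A only if it is strictly ahead."""
    return "A" if logit_a > logit_b else "B"


pick_order1 = greedy_pick(logit_a_order1, logit_b)
pick_order2 = greedy_pick(logit_a_order2, logit_b)
print(pick_order1, pick_order2)
```

When two tokens are nearly tied, a difference in the last bit is enough to flip the greedy choice, and every later token is conditioned on that choice.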

Structured outputs: constraining what the model can say

For applications that need machine-readable output, structured outputs (JSON mode, tool use schemas, constrained generation) are the most practical reliability improvement available.

Without constraints, the model might return your JSON in a markdown code block, add a preamble, or use a slightly different field name than you specified. Every parser handles this differently. Most fail silently.

Structured output works by constraining the sampling process: at each step, any token that would violate the specified schema is masked out and given zero probability before sampling. The model can only produce tokens that keep the output valid.
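A toy sketch of the masking step. Suppose the output so far is `{"sentiment": "` and the schema allows only three values at this position. The logits and token strings below are invented, and real implementations work on sub-word tokens with a grammar that tracks the schema state, but the principle is the same: disallowed tokens get a score of negative infinity, which softmax turns into exactly zero.

```python
import math


def mask_logits(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Set the score of every disallowed token to -inf."""
    return {tok: (score if tok in allowed else -math.inf) for tok, score in logits.items()}


def softmax(logits: dict[str, float]) -> dict[str, float]:
    finite = [score for score in logits.values() if score != -math.inf]
    if not finite:
        raise ValueError("schema allows no token at this position")
    peak = max(finite)
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}  # exp(-inf) == 0.0
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Hypothetical raw scores: the model would prefer a preamble or an off-schema label
raw = {"Sure": 3.0, "angry": 2.5, "frustrated": 1.0, "neutral": 0.5, "satisfied": -1.0}
allowed = {"frustrated", "neutral", "satisfied"}

probs = softmax(mask_logits(raw, allowed))
print({tok: round(p, 3) for tok, p in probs.items()})
```

The model's preferences among the valid options are preserved; only the invalid options are removed.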

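from typing import Literal

from openai import OpenAI
from pydantic import BaseModel

class SupportSummary(BaseModel):
    issue: str
    # Literal types become JSON-schema enums, so the allowed values are enforced
    sentiment: Literal["frustrated", "neutral", "satisfied"]
    resolution_status: Literal["open", "resolved", "escalated"]

# Example input; in production this comes from your ticketing system
ticket_text = "My invoice was charged twice and nobody has replied in three days."

client = OpenAI()  # reads OPENAI_API_KEY from the environment
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",  # pinned snapshot, not a floating alias
    messages=[
        {"role": "system", "content": "Extract support ticket fields."},
        {"role": "user", "content": ticket_text},
    ],
    response_format=SupportSummary,
)
# A SupportSummary instance, or None if the model refused
summary = completion.choices[0].message.parsed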

OpenAI's structured outputs guarantee schema conformance: a completed response cannot contain malformed JSON. Anthropic's tool use with strict schemas achieves the same. Two cases still need handling, a refusal and a response cut off by the token limit, so check for both before using the parsed result. The model also keeps its probabilistic behavior within the schema (it can still get a field value wrong), but a completed response is always something your parser can handle.

For any pipeline that processes LLM output programmatically: use structured outputs. The alternative is defensive parsing code that will fail on an edge case you didn't anticipate.

The engineering consequence

Pull all of this together:

Every response is a chain of token-sampling decisions. Each token is shaped by a pre-training distribution over hundreds of billions of tokens and adjusted by thousands of human preference judgments. The model you call today was trained on different comparison data than the model you called three months ago. Sampling introduces randomness by design; temperature and top-p only control how much.

This has a single engineering consequence: you cannot test an AI system the way you test deterministic software.

A unit test that passes today will pass again tomorrow, because the logic didn't change. An LLM eval that passes today might fail next week on the same model, because a single run is one sample from a distribution: your test cases happened to pass this time, and nothing guarantees they will on the next run.

What reliable AI engineering looks like instead:

Statistical pass rates, not binary pass/fail. "94.3% of test cases produce valid, correctly structured output" is a meaningful measurement. "It works" is not.
Evaluation sets that represent real production inputs. Test cases you write in development cover the cases you anticipated. They won't cover what your users actually send.
Version pinning in production. Pin to specific model snapshots. Test before upgrading.
Production sampling. Route a fraction of live traffic through evaluation to catch distribution shift you didn't anticipate.
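A statistical pass rate is more useful with an error bar attached. The sketch below computes a pass rate and a 95% Wilson score interval from a list of pass/fail results. The results here are synthetic, built to match the 94.3% figure above, and the 0.92 release threshold is an arbitrary example value.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95% confidence)."""
    if total == 0:
        raise ValueError("need at least one test case")
    rate = passes / total
    denom = 1 + z * z / total
    center = (rate + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(rate * (1 - rate) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Synthetic eval results: 943 passes out of 1,000 cases
results = [True] * 943 + [False] * 57
passes, total = sum(results), len(results)

low, high = wilson_interval(passes, total)
print(f"pass rate {passes / total:.1%}, 95% interval [{low:.1%}, {high:.1%}]")

# Gate on the lower bound rather than the point estimate (threshold is an example value)
release_ok = low >= 0.92
```

Gating on the lower bound means a small evaluation set has to score higher to clear the same bar, which is the behavior you want when the measurement is noisy.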

Chapter 3 builds this infrastructure from scratch: how to design evaluation sets, choose evaluation methods, and make quality measurement a first-class engineering practice.

Primary sources: Chip Huyen, AI Engineering (O'Reilly, 2025), Chapter 2. Ouyang et al., "Training language models to follow instructions with human feedback" (InstructGPT, OpenAI 2022). Rafailov et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (2023).