Prompt Engineering
You can’t change the model’s weights, but you can change what it sees. Prompt engineering is the art of steering an LLM’s probabilities — by example, by instruction, and by inviting it to think out loud.
The five-bullet version
- The prompt is the only handle you have on the model. Everything you write is steering the output distribution.
- Zero-shot: tell the model what to do. Works for clear tasks, but no examples means no anchor.
- Few-shot: show 2–8 worked examples. Often the cheapest way to get a meaningful quality jump.
- Chain-of-thought: ask for reasoning before the answer. Big gains on math, logic, multi-step tasks.
- The 2026 baseline: structured instructions + role + 2–3 examples + step-by-step on hard cases. Skip the magic incantations.
§ 00 · PROMPTS ARE PROBABILITY NUDGES · What “prompt engineering” really is
Every token a model writes comes from a probability distribution over its vocabulary. The prompt is the only thing influencing that distribution. There’s no “hidden setting” you can flip to change behavior — once a model is deployed, weights are frozen and your only knob is what you put in front of it.
That’s the entire game. Prompt engineering is the practice of arranging the context so the next-token distribution lands where you want it. Every technique in this lesson is a specific way to shape that distribution.
Three coarse ways to nudge the distribution, in increasing sophistication:
- Tell the model what to do. Plain instructions. “Translate the following to French.” This is zero-shot.
- Show the model what you want. Worked examples in the prompt. The model picks up the pattern. This is few-shot.
- Make the model reason. Ask for the work shown before the answer. The model uses its own tokens as scratch paper. This is chain-of-thought.
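To make the three styles concrete, here is the same toy task (sentiment labeling, an illustrative example) phrased each way, as prompt text only:

```python
# The same task, nudged three ways. Prompt text only, no API calls.

zero_shot = (
    "Classify the sentiment of this review as positive or negative: "
    "'The battery died in a day.'"
)

few_shot = """Review: 'Arrived early, works great.' -> positive
Review: 'Screen cracked within a week.' -> negative
Review: 'The battery died in a day.' ->"""

chain_of_thought = (
    "Classify the sentiment of this review as positive or negative. "
    "Think step by step about what the reviewer is saying, then give the label.\n"
    "Review: 'The battery died in a day.'"
)
```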
§ 01 · ZERO-SHOT — JUST ASK · The shortest path to an answer
Zero-shot prompting (performing a task from an instruction alone, with no worked examples in the prompt) is the baseline: the model relies on its pretraining to know how to do the task. You write the instruction, the model executes. Most of what people do with ChatGPT is zero-shot:
- “Summarize this article in three bullets.”
- “Translate to Spanish.”
- “Write a polite decline to this meeting invite.”
Modern instruction-tuned models are remarkably good at zero-shot for common tasks. The training corpus included millions of (instruction, response) pairs, so the model has learned to recognize and execute a wide variety of instructions immediately. Zero-shot fails when:
- The task is unusual or domain-specific (legal redlines in a niche contract type, code style for an obscure framework).
- The output format is precise and matters (specific JSON schema, specific bullet structure, specific opening line).
- The task has hidden traps the model doesn’t know to look for (the bat-and-ball problem, ambiguous pronouns in a long passage).
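For reference, a zero-shot call is nearly a one-liner. A minimal sketch using the OpenAI Python SDK; the model name is a placeholder, not a recommendation:

```python
# Minimal zero-shot call, sketched with the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

article_text = "…paste the article here…"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whatever model you have access to
    messages=[{
        "role": "user",
        "content": f"Summarize this article in three bullets:\n{article_text}",
    }],
)
print(response.choices[0].message.content)
```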
§ 02 · FEW-SHOT — SHOW, DON’T TELL · Examples beat explanations
Few-shot prompting was the original breakthrough of GPT-3: you could teach a model a new task simply by including 2–8 worked examples of input/output pairs in the prompt. No fine-tuning, no gradient updates. The examples define the task by demonstration; the model picks up the pattern from the examples alone.
This is more useful than it sounds. The examples carry several signals the instruction can’t:
- Exact output format. Examples show the model whether you want JSON, bullets, a sentence, or a single word.
- Style and tone. Formal? Terse? Apologetic? Examples set the voice.
- Edge cases. If your examples include a tricky case handled correctly, the model is far more likely to handle similar tricky cases in new inputs.
Without examples there is no scaffolding: the model takes the most fluent guess. On a question like the bat-and-ball problem, that guess is wrong. “The ball costs 10¢” sounds right, but the math says 5¢.
Three rules for picking few-shot examples:
- Diverse. Examples that all look the same teach the model only one pattern. Include variation in length, in input shape, in the kind of answer.
- Representative. Examples should look like the queries you actually expect. Toy examples teach the model toy behavior.
- Correct. Wrong examples actively hurt. The model will learn the pattern you demonstrated, including the mistake.
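Mechanically, a few-shot prompt is just worked examples concatenated ahead of the new input. A minimal sketch; the routing task and output format are illustrative:

```python
# Build a few-shot prompt by prepending worked examples to the new input.
# The task (support-ticket routing) and the separator format are illustrative.

examples = [
    ("Refund request, item arrived broken", "category: refund"),
    ("Where is my order? It has been 3 weeks", "category: shipping"),
    ("How do I reset my password?", "category: account"),
]

def few_shot_prompt(query: str) -> str:
    shots = "\n\n".join(f"Message: {inp}\n{out}" for inp, out in examples)
    return f"{shots}\n\nMessage: {query}\n"

print(few_shot_prompt("My package shows delivered but never arrived"))
```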
§ 03 · CHAIN-OF-THOUGHT — MAKE IT THINK · Giving the model space to reason
The single biggest discovery in prompting history is chain-of-thought prompting: asking the model to produce step-by-step reasoning before committing to an answer. It is famously triggered by “Let’s think step by step”, though any framing that asks for intermediate steps works. The phrase became famous in 2022 because adding it to math word problems jumped accuracy from ~17% to ~78% on a standard benchmark, on the same model, with no other change.
Why this works is subtle. The model has fixed compute per token. When you force a short answer (“answer in one word”), it has to do all the reasoning inside a single forward pass, which is a fixed amount of computation no matter how hard the question is. When you let the model write reasoning steps, each step gets its own forward pass, and the model can use its own tokens as scratch paper. The total compute available grows with the length of the reasoning.
Three CoT variants worth knowing:
- Zero-shot CoT. Just append “Let’s think step by step” (or any phrase inviting reasoning). Cheap, often enough.
- Few-shot CoT. Include 2–3 examples where the answer shows its work, then ask your real question. The model imitates the structure of the example reasoning.
- Self-consistency. Sample 5–10 CoT reasonings (with temperature > 0), each producing an answer. Take the majority vote. More expensive, but more reliable on hard math/logic.
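Self-consistency is a few lines of code on top of any chat API. A sketch using the OpenAI Python SDK, with a placeholder model name and a hypothetical `extract_answer` parser:

```python
# Self-consistency: sample several chain-of-thought completions at
# temperature > 0, extract each final answer, and take the majority vote.
from collections import Counter
from openai import OpenAI

client = OpenAI()

def extract_answer(text: str) -> str:
    # Hypothetical parser: here we just take the last line of the reasoning
    # trace; a real one would match whatever answer format you asked for.
    return text.strip().splitlines()[-1].strip()

def self_consistent_answer(question: str, n: int = 7) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user",
                   "content": f"{question}\nLet's think step by step."}],
        temperature=0.8,      # nonzero so the samples actually differ
        n=n,                  # draw n independent completions
    )
    answers = [extract_answer(c.message.content) for c in resp.choices]
    return Counter(answers).most_common(1)[0][0]  # majority vote
```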
Where CoT helps most: arithmetic, logic puzzles, multi-step planning, anything where a single token isn’t enough compute. Where it helps least: factual recall, translation, summarization. Don’t reflexively bolt it onto every prompt — it costs tokens, latency, and sometimes adds errors when the task doesn’t need it.
§ 04 · WHAT ACTUALLY WORKS IN 2026 · Filtering signal from noise
Prompt engineering accumulated a lot of folk wisdom — some of it real, some of it cargo cult. After a few years of evidence, here’s what consistently moves the needle:
- Be specific. “Write a polite refusal” vs. “Write a 2-sentence polite refusal that doesn’t say sorry and proposes an alternative date.” The second one works in one shot.
- Show the format. Especially for structured output. Either give an example or describe the exact schema. Don’t leave format up to the model unless you genuinely don’t care.
- Put instructions first or last. Both work; the middle of a long prompt is where instructions get lost (“lost in the middle” — a real and well-replicated phenomenon for long contexts).
- Use structure. Numbered lists, XML tags, headings — structure helps the model parse a complex prompt the same way it helps a human reader. <context>…</context><question>…</question> is a common pattern (a sketch follows this list).
- Few-shot for unusual tasks, CoT for hard reasoning. Match the technique to the failure. Don’t use all of them everywhere.
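The promised sketch of a structured prompt: a small template that puts instructions first and wraps the variable parts in tags. The wording and tag names are illustrative conventions, not magic:

```python
def build_prompt(context: str, question: str) -> str:
    # Instructions first, then tagged sections: mirrors the
    # <context>/<question> pattern above.
    return (
        "Answer the question using only the provided context. "
        "If the context does not contain the answer, say so.\n\n"
        f"<context>\n{context}\n</context>\n"
        f"<question>\n{question}\n</question>"
    )
```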
What largely doesn’t help (or actively hurts), despite being widely cargo-culted:
- Threatening or bribing the model. “You will be fired if you get this wrong.” Marginal at best, often neutral, sometimes worse — it pushes the model into adversarial mode rather than helpful mode.
- Pleading or being polite. “Please please please this is very important.” No detectable effect on most modern models.
- Naming famous experts. “You are Albert Einstein.” Cute, mostly doesn’t help. Naming a role with relevant expertise does help — “you are an experienced incident responder” — but the famous-name version is mostly vibes.
- Long unrelated context. Long prompts with irrelevant material hurt rather than help. Lost-in-the-middle is real; relevant tokens win.
§ 05 · TAKING THIS FORWARD · Beyond hand-crafted prompts
Prompt engineering is the bottom rung of a ladder. As tasks get more complex, you climb to:
- Prompt chains. Multi-step pipelines where the output of one prompt feeds the next. (“Extract claims” → “Verify each claim against retrieved docs” → “Write summary”.) Each step is simple; the combination handles things a single prompt can’t. A sketch follows this list.
- RAG. Augment prompts with retrieved relevant text at query time. Covered in the RAG lesson. The right context is more powerful than any prompt phrasing trick.
- Tool use / agents. Let the model call functions, observe results, decide what to do next. Covered in the agentic patterns lesson. The prompt becomes the system instruction for a loop, not a one-shot.
- Fine-tuning. When a prompt would have to be long and intricate every time, bake the behavior into weights instead. Use LoRA and a small dataset; the prompt gets simpler at inference.
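The chain sketch promised above: the claims pipeline as three calls to an assumed `call_llm` helper (wire it to any client you like):

```python
# A three-step prompt chain: extract -> verify -> summarize.
# call_llm is an assumed helper that sends one prompt and returns text.

def call_llm(prompt: str) -> str:
    """Assumed helper: send one prompt to your model, return its text."""
    raise NotImplementedError("wrap your preferred client here")

def summarize_with_verification(document: str, sources: str) -> str:
    # Step 1: extract the factual claims from the document.
    claims = call_llm(
        f"List the factual claims in this document, one per line:\n{document}"
    )
    # Step 2: verify each claim against the retrieved sources.
    verdicts = call_llm(
        "For each claim below, answer SUPPORTED or UNSUPPORTED "
        "based only on the sources.\n"
        f"Claims:\n{claims}\n\nSources:\n{sources}"
    )
    # Step 3: summarize using only the claims that survived verification.
    return call_llm(
        "Write a three-bullet summary using only the claims marked SUPPORTED.\n"
        f"Claims with verdicts:\n{verdicts}"
    )
```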
The honest summary: prompt engineering matters, but it’s diminishing as models get smarter. Five years ago the difference between a great and mediocre prompt was 30 points on a benchmark. Today it’s often 5. What that means in practice: don’t obsess over phrasing. Spend the effort on structure, examples, and the layers above — RAG, tool use, evals.
§ · GOING DEEPER · Chain-of-thought and the techniques that actually move evals
Chain-of-thought (Wei et al. 2022) was the breakthrough. Asking the model to “think step by step” before answering unlocks reasoning capabilities that disappear under direct prompting. Kojima et al. (2022) showed the magic phrase alone suffices; you don’t need few-shot examples. For math and logic, CoT is essentially mandatory.
Three follow-ups improve on CoT. Self-consistency (Wang et al. 2022): sample several CoT trajectories and majority-vote. Tree-of-thought (Yao et al. 2023): explore a branching search over intermediate steps. Least-to-most prompting (Zhou et al. 2022): decompose hard problems into easier sub-problems first. For most application work, explicit CoT plus instruction-tuned models (Ouyang et al. 2022) is enough. Reasoning models (o1, R1, Claude with extended thinking) automate the trick — you no longer need to prompt for it.
§ · FURTHER READING · References & deeper sources
- Wei et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models · NeurIPS
- Kojima et al. (2022). Large Language Models are Zero-Shot Reasoners · NeurIPS
- Wang et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models · ICLR
- Ouyang et al. (2022). Training language models to follow instructions with human feedback (InstructGPT) · NeurIPS
- Yao et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models · NeurIPS
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.