One-Line Summary: Trajectory learning is the family of techniques that learn from full agent rollouts (state-action-outcome sequences) rather than from isolated examples — it includes simple replay (store trajectories, retrieve at runtime) and stronger forms (parametric updates via fine-tuning or LoRA on successful trajectories).

Prerequisites: ReasoningBank, SONA self-learning neural patterns, micro-LoRA adapters

What Is Trajectory Learning?

Standard supervised learning takes (input, label) pairs. Trajectory learning takes (sequence-of-states, sequence-of-actions, outcome) — the temporal structure is the unit of learning. The intuition: an agent is good not because it makes good single decisions but because it makes good sequences of decisions, so the learning signal should preserve that sequence structure.
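
To make the data shape concrete, here is a minimal sketch of a trajectory record in Python; the `Step`/`Trajectory` names and fields are illustrative, not drawn from any particular harness:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Step:
    state: dict[str, Any]   # observation / context snapshot before the action
    action: dict[str, Any]  # tool call or message the agent emitted

@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)
    outcome: str = "unknown"  # "succeeded" | "failed" | "partial" once labeled
    metadata: dict[str, Any] = field(default_factory=dict)  # task, domain, timestamps
```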

Trajectory learning has two main forms in 2026 harnesses:

  1. Non-parametric: Trajectories stored as memory, retrieved at runtime, used to bias prompts (ReasoningBank pattern).
  2. Parametric: Trajectories used as training data for a small adapter that biases the model. Micro-LoRA is the canonical 2026 form (see micro-lora-adapters-at-the-harness-layer.md).

Most production deployments use non-parametric (cheaper, more controllable) with parametric reserved for stable, high-volume contexts.

How It Works

For non-parametric learning (a code sketch follows the list):

  • Log every trajectory with sufficient fidelity to reconstruct the agent's decisions.
  • Label outcomes (succeeded / failed / partial). The labeling step is where most engineering effort lands.
  • Index trajectories for retrieval (vector embedding of initial state + metadata).
  • At runtime, retrieve relevant trajectories and inject summaries into context.
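
A minimal sketch of that loop, reusing the `Trajectory` record from above. The in-memory store and the hash-based `embed()` are stand-ins for a real vector database and embedding model; all names here are illustrative:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model; deterministic only per process."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(256)
    return v / np.linalg.norm(v)

class TrajectoryStore:
    """In-memory stand-in for a persistent, access-controlled trajectory store."""

    def __init__(self) -> None:
        self._records: list[tuple[np.ndarray, Trajectory]] = []

    def log(self, trajectory: Trajectory, outcome: str) -> None:
        # Steps 1-3: persist the rollout, attach the outcome label,
        # and index it by an embedding of its initial state.
        trajectory.outcome = outcome
        key = str(trajectory.steps[0].state) if trajectory.steps else ""
        self._records.append((embed(key), trajectory))

    def retrieve(self, current_state: dict, k: int = 3) -> list[Trajectory]:
        # Step 4a: nearest-neighbour lookup over initial-state embeddings.
        q = embed(str(current_state))
        scored = sorted(self._records, key=lambda rec: -float(q @ rec[0]))
        return [traj for _, traj in scored[:k]]

def inject_into_prompt(prompt: str, examples: list[Trajectory]) -> str:
    # Step 4b: bias the prompt with summaries of similar past rollouts.
    summaries = "\n".join(
        f"- prior task ({t.outcome}, {len(t.steps)} steps): {t.metadata.get('task', '?')}"
        for t in examples
    )
    return f"{prompt}\n\nRelevant prior trajectories:\n{summaries}"
```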

For parametric learning (also sketched below):

  • Filter trajectories to high-confidence successes.
  • Fine-tune (full or LoRA) on the (state → action) mapping from those trajectories.
  • Deploy the adapter as a runtime bias on the base model.
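
A sketch of the parametric path using Hugging Face peft, assuming the filtered trajectories have already been serialized to (state → action) text pairs. The base-model name and hyperparameters are placeholders, not recommendations:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-3.1-8B"  # placeholder; use whatever the harness serves

model = AutoModelForCausalLM.from_pretrained(BASE)
tokenizer = AutoTokenizer.from_pretrained(BASE)

# LoRA keeps the update additive and removable, which is why it is
# preferred here over full fine-tuning (see "catastrophic forgetting" below).
config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of base weights

# Train on the filtered (state -> action) pairs serialized as text, e.g.
# with transformers.Trainer or trl's SFTTrainer, then ship the adapter alone:
# model.save_pretrained("adapters/trajectory-lora-v1")
```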

Why It Matters

The compounding-improvement property of agent systems comes from trajectory learning. Without it, an agent is exactly as smart on day 100 as on day 1. With it, the agent gets noticeably better at the patterns the user encounters most.

For research-grade systems, this is the path to autonomous self-improvement. For production systems, it's a way to capture institutional knowledge without manual prompting.

Key Technical Details

  • Outcome labels are the hard part: Most trajectories are not cleanly labeled. The usual fallbacks are LLM-as-judge verdicts, explicit user signals (thumbs up/down), and downstream metrics (tests passing); one way to combine them is sketched after this list.
  • Negative trajectories are valuable: Knowing a strategy failed is often as useful as knowing what worked.
  • Distribution shift: A model trained on trajectories from a Python codebase doesn't necessarily transfer to JavaScript. Per-domain trajectory pools are common.
  • Privacy boundaries are critical: Trajectories may contain proprietary code, customer data, secrets. Trajectory stores need the same access controls as the underlying data.
  • Catastrophic forgetting (in parametric learning): Fine-tuning on trajectories can degrade base-model capabilities. LoRA adapters are preferred for this reason — additive, removable.
  • Sample efficiency varies wildly: Some agent tasks need 10 high-quality trajectories to show learning; others need 10,000. Domain matters more than algorithm.
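
As a sketch of how the labeling fallbacks from the first bullet might be combined, here is one possible precedence policy; the signal names, types, and ordering are assumptions, not a standard:

```python
def label_outcome(
    judge_verdict: str | None,  # "pass" / "fail" from an LLM-as-judge run
    user_feedback: int | None,  # +1 / -1 from explicit thumbs up/down
    tests_passed: bool | None,  # downstream metric, when one exists
) -> str:
    # Precedence (metrics > user > judge) is a policy choice, not a standard.
    if tests_passed is not None:
        return "succeeded" if tests_passed else "failed"
    if user_feedback is not None:
        return "succeeded" if user_feedback > 0 else "failed"
    if judge_verdict is not None:
        return "succeeded" if judge_verdict == "pass" else "failed"
    return "partial"  # no clean signal: keep the trajectory, don't train on it
```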

How Harnesses & Frameworks Implement This

Trajectory-learning support by harness / framework:

  • Claude Code: None natively (transcripts persist; no learning loop)
  • Claude Agent SDK: DIY
  • ruflo: First-class, with both non-parametric (ReasoningBank, SONA) and parametric (micro-LoRA) forms
  • LangGraph: Checkpointers preserve trajectories; learning is DIY
  • AutoGen: DIY
  • CrewAI: Limited
  • OpenAI Agents SDK: Tracing captures trajectories; learning is DIY
  • Codex CLI / Cursor

Connections to Other Concepts

  • reasoning-bank.md — The non-parametric storage layer.
  • sona-self-learning-neural-patterns.md — A pattern-extraction layer over trajectories.
  • micro-lora-adapters-at-the-harness-layer.md — The parametric form.
  • multi-step-plan-evaluation.md — Overlaps with trajectory learning on outcome labeling.

Further Reading

  • "ReAct" and follow-up papers (Yao et al.) — Foundational trajectory framing.
  • ruvnet, ruflo USERGUIDE — Production trajectory-learning pipeline.