Production Harness Architecture for AI Agents
Why a Harness Fails in Production
What is harness durability?
Harness durability is the property of an agent harness that lets it survive process kills, deploys, network failures, and partial writes — without corrupting agent state, replaying tool calls that have side effects, or double-billing the user. A durable harness is not one that never crashes. It is one where crashing is boring: a new process picks up the task where the old one left off, the in-flight tool either resumes or is safely re-issued, and the user never sees two refunds for the same return.
Track A's Module 1 introduced the harness as the runtime around the model — state, tools, control loop, policy, runner, checkpoint. The Foundations capstone closed Track A by asking when this whole machine even belongs in your system — workflow vs RAG vs agent — and arrived at a decision rule that gates the rest. Track B picks up immediately afterwards, on the assumption that you have already decided "yes, this should be an agent." Foundations stopped there because at prototype scale all six harness responsibilities can live in memory and a crash is a footnote. Production adds one requirement on top: all six responsibilities have to keep holding while a real fleet of processes goes up and down underneath them.
What changes when you move from prototype to production?
The shape of a prototype agent loop is something like this: a long-lived Python process holds the state object in memory, drives the tick loop, calls tools synchronously, and exits when the task finishes. If the process dies mid-run, you re-run the whole task from the prompt. At prototype scale that is fine — you control when the process runs, and the cost of replay is your own time.
Production breaks every assumption in that paragraph. A real fleet looks like this:
- Processes are short-lived. Containers get killed by deploys, OOM killers, autoscaler scale-downs, spot interruptions, kubelet evictions, and the occasional SIGKILL. In most production fleets pods turn over often enough that you cannot assume any single process is alive long enough to finish a multi-tick task. Anything you cared about that lived only in memory is gone with the pod.
- Tools call out to the world. Stripe charges, S3 puts, Postgres writes, Slack posts, transactional emails, webhooks. These have side effects on other systems that you do not own.
- Many tasks run concurrently. A failure in one task should not poison the others, and the harness has to schedule, resume, and retry across the whole fleet without losing tasks or running them twice.
The harness has to assume any of its processes can disappear at any moment, and the agent's job still has to finish.
What are the three classes of failure?
Three failure modes recur in every production agent post-mortem. Understanding the shape of each one is what lets you reason about the primitives that fix them.
1. Process kill mid-tick
The process is in the middle of a tick — it has called the model, gotten back a tool call, and is about to dispatch it. SIGKILL arrives (deploy, OOM, eviction, spot interrupt). The process is gone. What does the new process know about where the old one was?
If the only state was in memory: nothing. The new process starts from the prompt and re-runs every prior tick, including any tool calls those ticks made. If any of those tool calls had side effects, you just did them again.
2. Network partition
The process is alive, but the network call it made is in an unknown state. The model API took the request and stopped responding — did it generate? The tool API received the POST — did it commit? Timeouts are the worst case, because they tell you neither "succeeded" nor "failed." They tell you "I don't know."
A naive retry on a timeout doubles your write rate every time the downstream system is slow. A naive give-up loses the write on every transient blip.
3. Partial write
The most common production failure mode, and the hardest to spot. The tool succeeded externally — Stripe charged the card, S3 stored the object, Postgres committed the row — but the harness did not durably record the result before it crashed. The next process has no idea the tool ran. Replay re-runs it. Stripe charges the card again.
This one is structural. It is the bug that idempotency exists to prevent.
A concrete example: the double-refund
A customer-service agent is processing a return request. The tick sequence:
- Tick 4: model decides to call `issue_refund(order_id="A-1234", amount_cents=8900)`.
- The harness dispatches the call. Stripe accepts it, queues the refund, returns `re_3PaZ...` 1.2 seconds later.
- The harness has the response in a local variable. Before it can write `tick_4_result = "re_3PaZ..."` to durable storage, the pod gets evicted by a rolling deploy.
- Kubernetes schedules a new pod. The harness sees task `T-77` is unfinished. It loads the last durable checkpoint, which is from tick 3 (before the refund). It replays from tick 4.
- Tick 4, again: model decides to call `issue_refund(order_id="A-1234", amount_cents=8900)`.
- The harness dispatches the call. Stripe accepts it — as a fresh request, because nothing in it signals "this is a retry" — and refunds the customer a second $89, debiting the merchant a second time.
This is not an edge case. It is the default behavior of a harness that has not solved partial writes. The fix is structural; the next three steps walk through it.
The three primitives that make a harness durable
Every fix to the failure modes above comes from one of three primitives, working together.
- Idempotency. Make every irreversible tool call safe to repeat. A retry that lands on the same call ID is a no-op, not a second charge. Step 2.
- Checkpoints. Durably record state at every tick boundary so a new process can pick up where the old one left off, instead of from the beginning. Step 3.
- Retry policy. When a call fails, decide which failures to retry, when to retry them, and when to give up. Naive retries amplify outages; no retries lose transient work. Step 4.
These three compose. Idempotency lets retries be safe. Checkpoints let retries happen on a new process. Retry policy decides when to invoke either. Take away any one and the other two stop being useful — a checkpoint that resumes a non-idempotent call is a corruption machine; a retry without idempotency is a duplication machine; idempotency without checkpointing means every crash starts from zero.
Step 5 closes the module by asking the practical question: do you build these three from scratch on top of Postgres and a job queue, or do you adopt a durable execution platform — Temporal, Inngest, Restate, Trigger.dev — that packages them into a runtime?
Where the framing comes from
Anthropic's Building Effective Agents makes the case that "agent–computer interface" design — the shape of the tools the model gets — is production-load-bearing on its own. This module sits one layer down: how the harness around those tools keeps them from corrupting state when the process they are running in disappears. Track A taught the loop. Track B teaches how to keep the loop running under load.
The vocabulary — idempotency, checkpointing, retry policy, durable execution — is older than agents. It comes from distributed systems and workflow orchestration: Temporal, AWS Step Functions, Apache Airflow. Agents are a new application of the same primitives, with one twist: the model's non-determinism makes replay subtler, which Step 3 takes head-on.
Try it: Stage 1 of the right pane is a short 5-tick agent run playing through the three failure classes. Inject Process kill, Network partition, or Partial write at different ticks and watch what survives into the next process. The differences become obvious once you can see the state the new process starts from in each case.
Idempotency — Safe to Re-run
What is idempotency?
A tool call is idempotent if running it once or N times produces the same observable outcome on the downstream system. Two refund calls with the same key refund the customer once. Five S3 puts with the same content hash leave one object. Three database upserts on the same unique row leave one row. The mathematical word ("idem-potent" — same-powered) is doing real work here: the result of the operation is invariant under repetition.
Every primitive in the rest of this module assumes idempotency. A retry on a failed call assumes the second attempt is safe. A resumed task that re-issues an in-flight call assumes the second issue does not duplicate the side effect. Without idempotency, retry and checkpointing become corruption machines instead of recovery machines.
Why does the harness need this?
Step 1's double-refund example is the canonical reason. The harness cannot tell, after a crash, whether the refund tool actually charged Stripe. It can only ask Stripe — and the safe way to ask is to send the same request again with a key that lets Stripe recognize "I have already done this; here is the result I gave you last time."
This is not a bug in the harness. It is a property of distributed systems: when the network is between you and the downstream service, success and failure are not the only two outcomes. There is a third one — unknown — that the harness has to handle, and the only fix for unknown is the ability to safely repeat the request.
How does the idempotency key pattern work?
The harness alone cannot make a tool idempotent. Idempotency is a contract with the downstream system the tool calls. The key has to be:
- Generated by the caller (the harness), not the callee.
- Stable across retries of the same logical event.
- Recognized by the downstream system, which uses it to deduplicate.
Stripe's idempotency keys are the canonical implementation. You send `Idempotency-Key: <your-key>` as a header on a `POST /v1/charges`. Stripe stores the response keyed by that header for 24 hours. Replaying the same request with the same key returns the cached response and does not charge the card again. Stripe also compares the parameters of the second request against the first — if they differ, the API returns an error instead of cached-but-stale data. The dedupe contract is strict: same key, same body, same result.
Other downstream systems implement the same idea with different vocabulary:
- S3: the object key is the idempotency key for `PutObject`. Two PUTs to `s3://bucket/key.txt` with the same body leave one object. (Versioned buckets are the exception — there, each PUT writes a new version.)
- Postgres: `INSERT ... ON CONFLICT (id) DO NOTHING` (or `DO UPDATE` for upserts) makes the insert idempotent on the unique key.
- Twilio, SendGrid, Resend: explicit idempotency-key headers, same shape as Stripe.
- Webhooks you build: add an `Idempotency-Key` header and a "seen this key" table on the receiver (a minimal receiver-side sketch follows below). The pattern is the same on both sides of the wire.
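To make that last bullet concrete, here is a minimal sketch of the receiver side — a "seen this key" table plus a handler that short-circuits replays. It uses sqlite3 purely to keep the sketch self-contained; the table name, handler shape, and `apply_side_effect` stub are illustrative, not from any particular framework:

```python
import sqlite3

db = sqlite3.connect("webhook_dedupe.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS seen_keys (
        idempotency_key TEXT PRIMARY KEY,  -- the caller's Idempotency-Key header
        response        TEXT NOT NULL      -- what we answered the first time
    )
""")

def apply_side_effect(payload: dict) -> str:
    ...  # the webhook's actual work goes here
    return "ok"

def handle_webhook(idempotency_key: str, payload: dict) -> str:
    # Replay? Return the stored response without re-running the side effect.
    row = db.execute(
        "SELECT response FROM seen_keys WHERE idempotency_key = ?",
        (idempotency_key,),
    ).fetchone()
    if row:
        return row[0]

    response = apply_side_effect(payload)

    # Record key + response so the next delivery of the same event short-circuits.
    db.execute(
        "INSERT OR IGNORE INTO seen_keys (idempotency_key, response) VALUES (?, ?)",
        (idempotency_key, response),
    )
    db.commit()
    return response
```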
What should you make idempotent?
Every irreversible tool call. The shortlist that bites teams in practice:
- Money. Charges, refunds, payouts, subscription changes.
- Messages. Outbound emails, SMS, push notifications, Slack messages. The user only wants one.
- Writes that the user can see. Posts, comments, tickets, calendar events.
- Database writes. Especially any insert that allocates an ID, or any update that increments a counter.
Reads are usually idempotent for free — fetching a customer record twice gives you the same record twice. You do not need a key for them, though caching them under one is a separate, useful pattern (see Track B Module 4, Cost & Latency).
How should you name the key?
Tie the key to the logical event, not the HTTP request. A logical event is something like "issue the refund decided on tick 4 of task T-77." That event has the same identity on every replay, every retry, every resumed process. Its key should too.
A workable scheme:
key = sha256(task_id || tick_number || tool_name || canonical(args))
For the double-refund case: task_id = "T-77", tick_number = 4, tool_name = "issue_refund", args = {order_id: "A-1234", amount_cents: 8900}. Same key on every replay of tick 4. Stripe deduplicates the second call against the first.
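A sketch of that scheme in Python. `canonical(args)` here is just sorted-key JSON; the commented-out Stripe call follows the same `idempotency_key=` keyword the anti-pattern example below uses, and is illustrative rather than a drop-in snippet:

```python
import hashlib
import json

def idempotency_key(task_id: str, tick: int, tool_name: str, args: dict) -> str:
    # Canonicalize args so the same logical call always hashes to the same key.
    canonical_args = json.dumps(args, sort_keys=True, separators=(",", ":"))
    material = f"{task_id}|{tick}|{tool_name}|{canonical_args}"
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

# Tick 4 of task T-77 — identical on every retry and every replayed process.
key = idempotency_key("T-77", 4, "issue_refund",
                      {"order_id": "A-1234", "amount_cents": 8900})

# stripe.Refund.create(
#     charge="ch_...",          # whatever identifies the charge being refunded
#     amount=8900,
#     idempotency_key=key,      # same key on every replay of tick 4 → Stripe dedupes
# )
```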
The anti-pattern
The most common idempotency bug is generating a random UUID per call:
```python
import uuid
import stripe

# Defeats the purpose — every retry gets a fresh key
key = str(uuid.uuid4())
stripe.Charge.create(amount=8900, idempotency_key=key)
```
The key changes on every retry, so Stripe sees each call as new. You get duplicate charges. This pattern shows up in real code reviews because frameworks sometimes generate request IDs automatically and the engineer assumes that is the idempotency key. It is not — the request ID changes on retry; the idempotency key must not.
What if the downstream system does not support keys?
Two strategies, in order of preference.
Wrap with a local idempotency table. Before calling the tool, the harness writes the key to its own DB with status pending. After the call completes, it updates the row to done with the response. On replay, the harness checks the table first: if the key is done, return the stored response; if pending, the previous attempt was interrupted in the unknown state — surface this to the operator or use a side-channel to reconcile. This adds a Postgres write per tool call but covers any downstream system.
Refactor to a different downstream. If the tool is "send Slack message via the legacy internal webhook" and that webhook is not idempotent, consider switching to one that is. Many internal services can be made idempotent with one schema change (add a unique constraint, accept a key header) and that is often a smaller investment than living with the corruption risk.
The harness can also fall back to check-then-write: before issuing the refund, ask Stripe "did we already refund this charge?" and only issue if not. This works for some APIs but is racy under concurrency — two harness processes can both see "not refunded" and both issue. Real idempotency keys avoid the race.
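A sketch of the first strategy — the local idempotency table — assuming a sqlite3-style connection for brevity (in production this would be the same Postgres that holds your checkpoints); table and column names are illustrative:

```python
import json

def call_tool_once(db, key: str, call_fn):
    row = db.execute(
        "SELECT status, response FROM tool_calls WHERE idempotency_key = ?", (key,)
    ).fetchone()

    if row and row[0] == "done":
        return json.loads(row[1])      # already ran — return the stored result
    if row and row[0] == "pending":
        # Previous attempt died in the unknown state: reconcile via a side channel
        # instead of blindly re-issuing a non-idempotent call.
        raise RuntimeError(f"tool call {key} is in an unknown state; needs reconciliation")

    # Durably record intent *before* the side effect happens.
    db.execute(
        "INSERT INTO tool_calls (idempotency_key, status) VALUES (?, 'pending')", (key,)
    )
    db.commit()

    result = call_fn()                 # the actual side-effecting call

    db.execute(
        "UPDATE tool_calls SET status = 'done', response = ? WHERE idempotency_key = ?",
        (json.dumps(result), key),
    )
    db.commit()
    return result
```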
How does this fit with the tool-design discipline?
Anthropic's Building Effective Agents frames the agent–computer interface as a load-bearing design surface — tool names, parameter shapes, and return contracts matter as much as model choice. Idempotency belongs in that interface, not bolted on afterwards. A tool whose contract is "POST this payload, get back this response, side effect once per Idempotency-Key" is a tool the harness can use safely. A tool whose contract is "POST this payload and pray" is a corruption vector.
The practical rule: when you add a tool to the registry, ask which downstream side effects it has, and write down which header / parameter / field carries the idempotency key. If the answer is "none," that is the first thing to fix.
Try it: Stage 2 of the right pane runs the same issue_refund tick twice — once with idempotency keys turned off, once with them on. Run to issue the first refund, then Crash and replay to simulate the deploy-mid-tick from Step 1. With keys off, the customer is refunded a second $89. With keys on, the downstream dedupes against the key and the customer is refunded once. The fix is a single header.
Checkpoints & Resumption
What is checkpointing?
Checkpointing is durably persisting agent state to storage outside the process, so a new process can pick up the task where the old one left off. The previous step showed why retries have to be safe to repeat; this step shows where the harness records "what tick we are on" so that the retry happens from the right tick, not from tick zero.
The shape is mundane: at chosen points in the loop, the harness writes the state object — messages, tool results, todos, the next planned action — to a database or object store. On startup, the harness looks for unfinished tasks, loads each one's most recent checkpoint, and resumes the loop from there.
What should you persist?
The state object the agent loop uses to decide its next move, and nothing else.
Persist:
- The conversation history — system prompt, user messages, model responses, tool messages so far. This is what the next model call needs as context.
- The last completed tool result — what the tool returned, with the idempotency key used. This is what the next tick reads from.
- The next planned action — if the model decided "call `lookup_order` next" before the crash, persist that decision so the new process does not re-roll the dice on the model and risk a different decision.
- Todos and intermediate scratch state — anything the agent has committed to that should survive the new process.
Do not persist:
- In-memory caches. They rehydrate on the new process. Pinning them to the checkpoint just bloats the row.
- Draft tokens, partial model outputs. Either the model call completed (persist the full response) or it did not (re-issue). There is no middle state worth saving.
- The Python stack frame. Process-internal call stacks are not portable. The state object has to be data, not code.
A reasonable rule: if you cannot serialize it to JSON and rebuild the agent from it, it does not belong in the checkpoint.
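To make the rule concrete, here is a hypothetical checkpoint shape — not a prescribed schema — that serializes cleanly to JSON and is enough to rebuild the loop:

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class Checkpoint:
    task_id: str
    tick: int
    messages: list[dict] = field(default_factory=list)  # full conversation so far
    last_tool_result: dict | None = None                 # includes the idempotency key used
    next_action: dict | None = None                      # the decision the model committed to
    todos: list[str] = field(default_factory=list)       # scratch state that must survive

def save_checkpoint(db, cp: Checkpoint) -> None:
    # One row per task (task_id assumed to be the primary key); latest tick wins.
    db.execute(
        "INSERT OR REPLACE INTO checkpoints (task_id, tick, state) VALUES (?, ?, ?)",
        (cp.task_id, cp.tick, json.dumps(asdict(cp))),
    )
    db.commit()
```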
Where should you persist it?
Durable storage outside the process. Three workable choices:
- Postgres or another transactional database — strongest consistency, easiest to query, ties cleanly into application data. Most agent harnesses built in-house end up here.
- Object storage (S3, GCS, R2) — cheaper for large state blobs, but lacks the per-row update semantics you want for "update the checkpoint after every tick." Usually used as a side store for the conversation history while a transactional DB holds the pointer.
- A durable execution platform's event log — Temporal, Inngest, and similar systems handle persistence for you and replay your code against the event history. Covered in Step 5.
Local disk is not durable across machines. If your fleet has more than one host, a checkpoint on /tmp/agent_state.json is gone the moment the pod schedules onto a different node.
When should you checkpoint?
This is the policy decision, and it is a real cost-vs-recovery tradeoff. Three common policies, with the failure recovery each one gives you:
| Policy | Writes per task | Work lost on crash | When to pick it |
|---|---|---|---|
| Per tick (default) | ~1 write per tick | At most one tick | Most agents. The write is cheap; the recovery is clean. |
| Per tool-call boundary | ~2 writes per tick (before and after each tool call) | At most one half-tick | Tools with long-running side effects (file uploads, model calls inside tools) where re-issuing is expensive. |
| Every N ticks | ~1 write per N ticks | Up to N-1 ticks | High-throughput, low-cost-per-tick agents where DB writes are the bottleneck. Rare in practice. |
For most production agents, per-tick is the right default. Tool calls dominate latency and cost; one extra DB write per tick is in the noise.
How does resumption actually work?
On startup the harness looks for unfinished tasks, loads each one's checkpoint, and decides what to do with the in-flight call (if any):
- Load the checkpoint. Read the row keyed by `task_id`, deserialize the state, see which tick the task got to.
- Inspect the in-flight call, if there is one. If the checkpoint shows "tick 4: about to call `issue_refund` with key `K-77-4-refund-...`", the harness has to figure out whether that call already happened on the downstream side.
- Re-issue with the same idempotency key. This is why Step 2 came first. The harness re-POSTs to Stripe with the same key. Stripe either does the refund (it had not yet) or returns the cached response (it had). Either way, the result is correct, and the new process now knows.
- Persist the result, advance the tick, continue the loop. From here it is a normal loop again until the next crash or completion.
The whole pattern is structural: durable checkpoint + idempotent tool call = safe resumption. Neither half works alone.
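Sketched as code, under the same assumptions as the checkpoint sketch above; `dispatch_tool`, `save_state`, and `run_agent_loop` are hypothetical helpers standing in for your own harness internals:

```python
import json

def resume_unfinished_tasks(db):
    rows = db.execute(
        "SELECT task_id, state FROM checkpoints WHERE finished = 0"
    ).fetchall()

    for task_id, state_json in rows:
        cp = json.loads(state_json)

        # Was a tool call in flight when the old process died?
        call = cp.get("in_flight_call")  # e.g. {"tool": "issue_refund", "args": {...}, "key": "..."}
        if call:
            # Re-issue with the SAME idempotency key: the downstream either performs it
            # (the first attempt never landed) or returns the cached result (it did).
            result = dispatch_tool(call["tool"], call["args"],
                                   idempotency_key=call["key"])
            cp["last_tool_result"] = result
            cp["in_flight_call"] = None
            cp["tick"] += 1
            save_state(db, task_id, cp)  # persist before doing anything else

        run_agent_loop(task_id, cp)      # back to the normal tick loop
```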
What about the model's non-determinism?
There is one subtle bug that every from-scratch harness hits the first time. The model is non-deterministic — even at temperature 0, different SDK versions, different routing inside the provider, and different prompt cache states can produce different tokens. If you "resume" by re-running the model call that the old process made, you may get a different decision and end up calling a different tool.
That is a corruption mode the harness has to design around. Two workable answers:
- Cache the model output as part of the checkpoint. Before dispatching a tool call, write `tick_4.model_response = "<full response>"` to the checkpoint. On resume, the new process reads the cached response instead of re-calling the model. This is what most production harnesses do.
- Commit to whatever the model said last time, even if it has to re-call. Less common; only works if you have an external "last decision" log that is durable on its own.
Skip both and you get the pathological case: process crashes mid-tool-call, new process re-runs the model, model decides to do something different, the previous tool's side effect is now orphaned and the new tool's side effect is uncoordinated with it. This is the agent equivalent of a torn write — and the model's stochasticity makes it more likely than the equivalent failure in a deterministic workflow.
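A sketch of the first option — record the model's decision before dispatching the tool, so a resumed process replays the recorded decision instead of re-rolling the model. Helper names (`call_model`, `save_state`, `dispatch_tool_from`) are again hypothetical:

```python
def run_tick(db, cp: dict):
    tick = str(cp["tick"])
    cached = cp.get("model_responses", {}).get(tick)

    if cached is not None:
        response = cached                      # resumed process: replay the recorded decision
    else:
        response = call_model(cp["messages"])  # the non-deterministic step
        cp.setdefault("model_responses", {})[tick] = response
        save_state(db, cp["task_id"], cp)      # persist the decision BEFORE acting on it

    # Only now dispatch whatever tool the (recorded) response asked for.
    return dispatch_tool_from(response)
```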
Where this comes from
The "checkpoint at every step, replay from the event history" pattern is the same one durable workflow systems have used for years. Temporal's design is the reference — its full-history replay model records every step's input and output, and on recovery replays your workflow function against that history to reach the same point. Your function code is treated as deterministic glue between durable events; non-determinism (like a model call) has to be wrapped in an "activity" whose result is recorded.
Agent harnesses inherit the same constraint. Treat the model call as the activity whose output you record. Treat the tool call as the activity whose idempotency key you control. Treat your tick loop as the deterministic shell that replays cleanly.
Try it: Stage 3 of the right pane lets you set the checkpoint policy and the tick at which the process crashes. Watch how much work is lost — and where the resumed run picks up — as you slide between per tick, per tool-call, and every N ticks. The default per-tick policy is the cheapest correct answer for most agents.
Retry Policy — When to Try Again
What is a retry policy?
A retry policy is the rules the harness follows when a tool call fails: which errors to retry, how long to wait between attempts, how many attempts to make, and when to give up. It is the third primitive — after idempotency and checkpointing — that makes a harness durable. Idempotency makes the retry safe; checkpointing makes the retry possible after a crash; the retry policy decides when the retry should happen at all.
Bad retry policy is the most common source of self-inflicted outages in agent systems. A loop that retries forever burns money. A loop that does not retry at all loses transient writes. A loop that retries in lockstep across many concurrent tasks turns a small downstream slowdown into a self-DDoS.
Which errors should you retry?
Classify the error before deciding. Two categories, with very different handling.
Transient errors — retry. These come from temporary conditions and usually succeed on a second try.
- Rate-limit responses (HTTP 429).
- Timeouts and connection resets.
- Server-side 5xx (502, 503, 504 especially).
- Sometimes 408 (request timeout, server-side).
- DNS or TCP-level failures.
Permanent errors — do not retry. These will fail the same way every time. Retrying them only burns the budget you will need for the next transient failure.
- 4xx validation errors (400, 422). The payload is wrong; sending it again will not fix the payload.
- Authentication or authorization failures (401, 403). The credential is wrong or the action is not allowed.
- Policy refusals from the model API.
- "Not found" responses (404) on a resource lookup. The resource still does not exist on the second call.
- Semantic failures inside a tool — "user not found", "order already refunded", "amount exceeds policy."
The hardest case is the ambiguous one. A timeout might be transient (network blip, retry will succeed) or permanent-ish (downstream is down for an hour). The right move is to retry with a bounded policy — not infinite — so an extended outage trips the give-up branch instead of looping.
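A sketch of the classification as code. It assumes tool failures surface as exceptions carrying an optional HTTP status code; adapt to however your tool layer actually reports errors:

```python
TRANSIENT_STATUSES = {408, 429, 500, 502, 503, 504}

def is_transient(error: Exception) -> bool:
    # Timeouts and connection resets: retry, but with a bounded policy.
    if isinstance(error, (TimeoutError, ConnectionError)):
        return True
    # Rate limits and server-side 5xx: retry.
    status = getattr(error, "status_code", None)
    if status in TRANSIENT_STATUSES:
        return True
    # Everything else — validation, auth, not-found, semantic failures — is permanent.
    return False
```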
What is the standard retry recipe?
Exponential backoff with jitter. The recipe is short:
delay = random(0, min(max_delay, base * 2^attempt))
This is the full-jitter variant from the AWS Architecture Blog's exponential backoff and jitter post, which is still the standard reference and worth reading in full. For a base = 200ms and max_delay = 30s, the first retry waits a uniformly-random value in 0–400ms, the second 0–800ms, the third 0–1600ms, and so on, capped at the max_delay. The randomization is across the whole window — not just a small wobble around an exponential value — which is what spreads concurrent retries.
A reasonable starting point for most tools: base = 200ms, max_attempts = 4, max_delay = 30s, full jitter. Tune from there.
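The recipe and those defaults as a small Python sketch. `TransientError` stands in for whatever your classifier (see the sketch above) maps retryable failures to:

```python
import random
import time

class TransientError(Exception):
    """Raised (or mapped) by the tool layer for retryable failures."""

def full_jitter_delay(attempt: int, base: float = 0.2, max_delay: float = 30.0) -> float:
    # attempt starts at 1 for the first retry: windows of 0–400ms, 0–800ms, 0–1600ms, ...
    window = min(max_delay, base * (2 ** attempt))
    return random.uniform(0, window)

def call_with_retries(fn, max_attempts: int = 4):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise                      # out of budget — hand off to the give-up path
            time.sleep(full_jitter_delay(attempt))
```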
Why exponential?
A linear backoff (wait 1s, 2s, 3s) does not back off fast enough during a real outage. The downstream system needs space to recover; doubling the wait gives it room. After 4 retries with base = 200ms and exponential, you have waited roughly 3 seconds in total. After 4 retries with linear base = 1s, you have waited 10 seconds — and the downstream is no less recovered.
Why jitter?
Without jitter, every task that hit the failure at roughly the same time retries at roughly the same time. Imagine 200 agent tasks all timed out at T = 0 because the Stripe API was slow. Without jitter, all 200 retry at T = 400ms, all 200 retry again at T = 1.2s, all 200 retry again at T = 2.8s. The herd hits the recovering downstream in lockstep, knocks it back over, and amplifies the outage.
Jitter spreads the herd. The retries arrive uniformly across the wait window, so the downstream sees a smooth ramp of traffic instead of a wall. The AWS post above includes a graph that makes this vivid.
How many attempts is enough?
In practice, retry caps tend to land in the low single digits — a small handful of attempts. The reasoning: if four attempts spread across ~30 seconds did not succeed, the downstream is not having a blip — it is having an incident. Continuing to retry from the harness will not fix the incident; it will only make the harness's logs harder to read and burn budget.
There are two important refinements.
Per-tool overrides. Different downstream systems deserve different policies. Stripe charges might cap at 5 attempts over a minute because the cost of failing the user is high. A best-effort cache lookup might cap at 1 retry over 200ms because the cost of waiting is higher than the value of the result.
Retry budget per task. Even with per-call caps, a pathological task can chain many failed calls and burn through your fleet's capacity. Set a per-task cap on total retry time or total retries, so a poisoned task fails cleanly instead of starving the rest of the fleet. Stripe's own SDKs and most cloud SDKs apply this implicitly; doing it explicitly in the harness lets you reason about it.
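One way those two refinements can be written down — a per-tool policy table plus a per-task budget. The numbers and names are illustrative defaults, not recommendations for your specific downstreams:

```python
from dataclasses import dataclass

@dataclass
class RetryPolicy:
    max_attempts: int = 4
    base: float = 0.2        # seconds
    max_delay: float = 30.0  # seconds

DEFAULT_POLICY = RetryPolicy()

# Per-tool overrides: the cost of failing vs the cost of waiting differs per downstream.
RETRY_POLICIES = {
    "issue_refund": RetryPolicy(max_attempts=5, base=0.5, max_delay=60.0),
    "cache_lookup": RetryPolicy(max_attempts=2, base=0.2, max_delay=0.2),  # i.e. at most one retry
}

# Per-task budget: a poisoned task fails cleanly instead of starving the fleet.
MAX_RETRIES_PER_TASK = 20
MAX_RETRY_SECONDS_PER_TASK = 120.0
```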
When should you NOT retry the model call itself?
If the tool failed for a transient reason, retry. If the model's decision was wrong — it called the wrong tool, it asked the user the wrong question, it produced malformed output — retrying with the same input will not help. The model is deterministic enough on the same input that the second call will produce roughly the same wrong decision.
The right move for a semantic failure: pass the error back into the model's context as a tool result, and let the next tick decide. That is what the agent loop is for. The model sees "Tool returned error: user A-1234 not found" and decides whether to ask the user, try a different lookup, or give up. This is the difference between a transport-level retry (harness's job) and a semantic recovery (model's job).
Mixing the two — having the harness retry the whole tick in a tight loop because the model produced a bad tool call — leads to runaway loops. Foundations Module 6 (planning and reflection) is where the model learns to recover from semantic failures; this step is only about transport.
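What "pass the error back into the model's context as a tool result" looks like in code — the message shape below follows the common OpenAI-style tool-message convention; field names vary by provider:

```python
def record_semantic_failure(messages: list[dict], tool_call_id: str, error_text: str) -> None:
    # No transport retry: the harness records the failure as the tool's result,
    # and the NEXT tick's model call decides how to recover.
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": f"Tool returned error: {error_text}",
    })

# e.g. record_semantic_failure(messages, call_id, "user A-1234 not found")
```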
What happens when retries run out?
The harness gives up on that tick, marks the task as needing attention, and stops. Do not loop forever, and do not silently drop. The pattern that works:
- Dead-letter the task. Move it to a `failed_tasks` queue or table with the full state, the error, and the retry history.
- Page the right human. A failed agent task is a real incident if the task is user-facing. Page someone with enough context to look at the trace.
- Surface to the user, gracefully. If the task was for an end user, return a "we ran into a problem, we'll get back to you" rather than a stack trace.
- Make the failure recoverable. Idempotency keys are still in the state — when an operator is ready to retry, the same key can be re-issued safely.
The shape is the same as classic distributed-system dead-letter queues. Agent harnesses inherit the pattern wholesale.
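A sketch of that give-up path, using the same illustrative names as the earlier sketches; `page_oncall` and `notify_user` are placeholders for whatever alerting and user-facing channels you already have:

```python
import json

def dead_letter(db, task_id: str, cp: dict, error: Exception, retry_history: list) -> None:
    # 1. Dead-letter the task with everything an operator needs to retry it later.
    db.execute(
        "INSERT INTO failed_tasks (task_id, state, error, retries) VALUES (?, ?, ?, ?)",
        (task_id, json.dumps(cp), repr(error), json.dumps(retry_history)),
    )
    db.commit()

    # 2. Page a human with enough context to look at the trace.
    page_oncall(f"agent task {task_id} failed after retries: {error!r}")

    # 3. Surface a graceful message to the end user, not a stack trace.
    notify_user(task_id, "We ran into a problem finishing this; we'll follow up shortly.")

    # 4. The idempotency keys are still inside `cp`, so an operator-initiated retry
    #    can re-issue the in-flight call safely later.
```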
Try it: Stage 4 of the right pane has three sliders — max attempts, base delay, jitter — and two presets, Sensible defaults and Aggressive. A fixed stream of 12 deterministic calls mixes transient, permanent, and successful outcomes. Watch how max-attempts changes the count of successful recoveries (good) vs wasted retries on chronic transient failures that never recover (bad), and how the expected backoff total stretches as you crank base-delay. The Aggressive preset (max=10, base=50ms, jitter=0) is what burns the budget on those chronic-transient calls; the sim's classifier short-circuits the actually-permanent failures up front, which is the production move when you have the classifier.
Durable Execution Platforms
What is a durable execution platform?
A durable execution platform is a runtime that records every step of a workflow to durable storage, replays from the recorded history after a crash, and treats your function code as deterministic glue between durable events. The three primitives from the rest of this module — idempotency, checkpointing, retry policy — are not features on the side; they are the runtime's job.
You write something that looks like normal code:
```typescript
const order = await ctx.run("lookup_order", () => lookupOrder(id));
const refund = await ctx.run("issue_refund", () => stripe.refunds.create({...}));
await ctx.run("notify_user", () => sendEmail(refund));
```
The runtime intercepts each ctx.run call, persists its inputs before the call and its outputs after, and on resume after a crash, replays the function and skips past every step whose output is already on the durable log. Your code never has to think about "did this already happen?" because the runtime does.
When should you build vs buy?
The previous three steps showed the primitives. You can implement them by hand on top of Postgres + a job queue (idempotency table, per-task state row, exponential backoff in your worker code). Many production agent harnesses are built exactly that way. The tradeoff is operational surface — you own the schema, the worker fleet, the DLQ, the monitoring.
Or you can adopt a platform. You give up some control over the runtime (your code has to obey the platform's determinism rules) in exchange for getting the primitives packaged, tested, and observable out of the box. As fleet sizes and concurrency grow, more teams have been moving from roll-your-own harnesses to one of the durable execution platforms below — usually because the operational savings dominate at scale, not because they have to.
The honest decision rule:
- Roll your own if your fleet is small (single-digit concurrent agents), your team has strong DB and queue ops experience, your throughput is low, and the cost of a platform outweighs the cost of running the Postgres-plus-workers stack yourself.
- Adopt a platform if you have more than a handful of concurrent agents, your team would rather not page on the harness layer, or you specifically want replay-from-history and step-level observability built in.
The four main platforms
A short, opinionated tour. Pick the one whose SDK matches your stack; the underlying ideas are the same across all four.
Temporal
Temporal is the reference design. Workflow code in Go, Java, TypeScript, or Python. Event-sourced replay — every workflow step's input and output is appended to a history, and on recovery the workflow function replays against the history until it reaches the latest event. SDKs are mature, the documentation is excellent, and the self-hosted open-source server is real (not a teaser for the SaaS).
Most production agent harnesses, if you read their architecture posts carefully, end up looking like Temporal one way or another — the same event-sourced replay model, the same workflow/activity split, the same determinism constraints. Temporal is the version with the most mileage.
The cost: operational footprint. Self-hosting Temporal is more work than self-hosting Postgres. Temporal Cloud removes that cost but adds dollar cost.
Inngest
Inngest is TypeScript-first and SaaS-first. The model is step functions: you write a function that calls `step.run(...)` for each durable step, and Inngest persists each step's result. Strong DX for Node.js teams, an opinionated event-driven model, and reportedly the fastest of the four to get a first durable workflow running — many teams report going end-to-end within a single working session.
The cost: less control than Temporal. A self-hosted option exists, but the hosted product is still the main path and the operational story is simplest there.
Restate
Restate is the newest of the four. Single-binary self-host, TypeScript and Java SDKs, virtual-object model that bundles state and behavior together. Lighter operational footprint than Temporal, and the SDK ergonomics are closer to writing normal application code than Temporal's.
The cost: smaller community than Temporal and Inngest, smaller library of integrations, and fewer production case studies to crib from. If you are evaluating, weigh how much you value the ecosystem around the platform.
Trigger.dev
Trigger.dev is TypeScript-only, SaaS, with the strongest DX for "background jobs with retries and observability" of any of the four. v3 added durable runs and queues. If your agent harness is going to live alongside a Next.js or other TS app and you want a single place for background work and durable agent runs, it is a strong contender.
The cost: TypeScript only, SaaS-only for the polished experience (the self-hosted story is workable but not the main path).
What about platform-native alternatives?
Three you should consider if you are already on the platform.
- Vercel Workflow DevKit — durable workflows that run on Vercel's Fluid Compute, integrated with the rest of the platform. Cheap to adopt if your app is already Vercel-hosted. Narrower than Temporal — fewer SDKs, smaller feature surface — but the "it's already here" tax is zero.
- AWS Step Functions — JSON state machines that orchestrate Lambda and other AWS services. Older than the other options, more enterprise, less developer-friendly. Wins when your fleet is already deep in AWS and your durable logic is mostly stitching managed services together.
- Cloudflare Workflows — Cloudflare's durable workflow primitive. Newer; aimed at the Workers ecosystem. Same shape as the others, narrower SDK surface today.
The pattern with platform-native options: cheaper adoption, narrower scope, more lock-in. Fine tradeoff if you are committed to the platform; expensive to migrate off if your needs grow past what the platform offers.
How does this map onto the three primitives?
A quick reconciliation with Steps 2–4. Every platform above bakes in:
- Idempotency — usually implicit. Steps re-run on replay are skipped against the recorded history; if the step calls a downstream API, you still need an idempotency key for that external call (the platform's history protects you within the platform; the key protects you outside it).
- Checkpointing — the history is the checkpoint. Every recorded step is a durable record the next process can resume from.
- Retry policy — first-class. Each step or activity has its own retry config; the platform handles backoff and jitter; permanent-vs-transient classification is usually a function of which exception type your step throws.
The platforms do not invent the primitives. They package them into a runtime so you write your agent loop without thinking about the plumbing.
A timing note
This market is moving fast. The four-platform list above is current as of early 2026; six months from now there will likely be new entrants and repositioning. Treat any specific platform recommendation — including this one — as a snapshot. Re-evaluate every ~6 months, and weigh the platform's roadmap as much as its current feature set. The choice is sticky enough to be worth thinking about carefully (migrations cost real time), but not so sticky that a wrong choice is fatal.
Where Module 1 ends and Module 2 begins
The three primitives plus a runtime that packages them is what makes a harness durable. But durable is not enough on its own — a durable harness that you cannot see inside is a black box that survives crashes but does not let you debug them. The next module is observability: span-per-tick traces, what to log and what to redact, the metrics that actually matter for an agent, and how to replay a failing trace until the root cause comes out.
The shape of the rest of Track B is: durability (this module) → visibility (Module 2) → safety (Module 3, guardrails) → economics (Module 4, cost and latency) → continuous evaluation (Module 5) → release process (Module 6, deployment and rollout) → incident response (Module 7) → multi-agent coordination (Module 8) → the operational capstone (Module 9). Each module builds on the durable harness this one defined.
Try it: Stage 5 of the right pane is a small platform-picker. Toggle the filter chips — Self-hosted only, TypeScript-first, Multi-language, Minimal ops — and watch which of the four platform cards stay highlighted and which dim. The exercise is not to memorize the platforms; it is to see how the constraints actually drive the choice.