Long-running agents
Build agents that persist for days, weeks, or months — surviving restarts, waking on demand, and managing work that spans far longer than any single request.
Agents spend most of their time waiting. Waiting for user input (seconds to days), LLM responses (seconds to minutes), tool results (seconds to hours), human approvals (hours to days), or scheduled wake-ups (minutes to months). On a traditional VM or container, you pay for all that idle time. An agent that is 99% dormant and 1% active still costs you 100% of a server.
Durable Objects invert this model. An agent exists as an addressable entity with persistent state, but consumes zero compute when hibernated. When something happens — an HTTP request, a WebSocket message, a scheduled alarm, an inbound email — the platform wakes the agent, loads its state from SQLite, and hands it the event. The agent does its work, then goes back to sleep.
This is the actor model ↗: each agent has an identity, durable state, and wakes on message. You do not manage servers, routing, health checks, or restart logic. The platform handles placement, scaling, and recovery.
The economics follow directly:
| VMs / Containers | Durable Objects | |
|---|---|---|
| Idle cost | Full compute cost, always | Zero (hibernated) |
| Scaling | Provision and manage capacity | Automatic, per-agent |
| State | External database required | Built-in SQLite |
| Recovery | You build it (process managers, health checks) | Platform restarts, state survives |
| Identity / routing | You build it (load balancers, sticky sessions) | Built-in (name to agent) |
| 10,000 agents, each active 1% of the time | 10,000 always-on instances | ~100 active at any moment |
For agents — which are inherently bursty, stateful, and long-lived — this is a natural fit.
A long-running agent is not a process that runs continuously. It is an entity that exists continuously but runs intermittently. Understanding the lifecycle is key to building agents that work reliably over long timelines.
Wake → onStart() → handle events → idle (~2 min) → hibernation ▲ │ └──────────────── alarm or request wakes agent ────────┘
Eviction (crash / redeploy) can happen at any point.State persists in SQLite. Agent restarts on next event.this.state— persisted to SQLite on everysetState()callthis.sqldata — all SQLite tables you create- Scheduled tasks — stored in SQLite, trigger alarms to wake the agent
- Connection state —
connection.setState()data for each WebSocket client - Fiber checkpoints —
stash()data fromrunFiber()
Any higher-level abstractions built on SQLite also survive, since they share the same durable storage.
- In-memory variables — class fields not stored via
setState()orthis.sql - Running timers —
setTimeout,setIntervalare lost on hibernation/eviction - Open fetch requests — in-flight HTTP calls are abandoned
- Local closures — callbacks and promise chains are lost
The implication: any work that matters must be persisted or recoverable. The SDK provides primitives for this — schedules, fibers, queues — but understanding the boundary between "in-memory" and "durable" is essential.
Throughout this doc, we build up a project manager agent that:
- Lives for the duration of a project (weeks or months)
- Tracks tasks, assigns work to sub-agents, and reports progress
- Wakes up on schedule to check deadlines and send reminders
- Reacts to external events (webhooks from GitHub, emails from team members)
- Handles long-running operations (CI pipelines, code reviews, deployments)
- Survives any number of restarts and evictions along the way
import { Agent } from "agents";
type ProjectState = { name: string; status: "planning" | "active" | "review" | "complete"; tasks: Task[]; plan: Plan | null;};
type Task = { id: string; title: string; status: "pending" | "in_progress" | "blocked" | "complete"; assignee?: string; dueDate?: string; completedAt?: number; externalJobId?: string;};
export class ProjectManager extends Agent<ProjectState> { initialState: ProjectState = { name: "", status: "planning", tasks: [], plan: null };}The Plan type is introduced in Planning as a durability strategy. We add capabilities to this agent section by section.
A hibernated agent can be woken by any of these sources:
| Wake source | How it works | Example |
|---|---|---|
| HTTP request | Any request to the agent's URL triggers onRequest() | A webhook from GitHub |
| WebSocket connection | A client connects, triggering onConnect() | A team member opens the dashboard |
| RPC call | Another Worker or agent calls a method via service binding or @callable | A coordinator agent delegates a task |
| Scheduled alarm | A stored schedule fires, triggering your callback | Daily standup reminder at 9am |
An inbound email triggers onEmail() | A team member replies to a status email |
The pattern extends naturally to any event source that can reach a Worker — anything from telephony webhooks to chat platform bots. An external signal arrives, the platform wakes the agent, and the agent handles it.
The agent does not need to be "started" or "deployed" separately for each wake source — they all route to the same Durable Object instance. The agent's identity (its name) is the routing key.
export class ProjectManager extends Agent<ProjectState> { async onStart() { // Daily deadline check at 9am UTC — idempotent, safe across restarts await this.schedule( "0 9 * * *", "checkDeadlines", {}, { idempotent: true } );
// Progress sync every 30 minutes await this.scheduleEvery(1800, "syncProgress"); }
async onRequest(request: Request): Promise<Response> { const url = new URL(request.url);
if (url.pathname.endsWith("/github-webhook")) { const event = await request.json(); await this.handleGitHubEvent(event); return new Response("OK"); }
return Response.json({ project: this.state.name, status: this.state.status }); }
async checkDeadlines() { /* ... find overdue tasks, broadcast alerts ... */ } async syncProgress() { /* ... check on sub-agents, update task statuses ... */ }}Sometimes an agent needs to do work that takes longer than the idle eviction window (~70–140 seconds). Streaming an LLM response, orchestrating a multi-step tool chain, or waiting on a slow API all risk the agent being evicted mid-flight.
keepAlive() prevents this by creating a heartbeat that resets the inactivity timer:
export class ProjectManager extends Agent<ProjectState> { async generateProjectPlan(goal: string) { const result = await this.keepAliveWhile(async () => { const plan = await this.callLLM(`Create a project plan for: ${goal}`); const tasks = await this.callLLM( `Break this into tasks: ${JSON.stringify(plan)}` ); return { plan, tasks }; });
this.setState({ ...this.state, status: "active", plan: result.plan, tasks: result.tasks }); }}keepAliveWhile() is the recommended approach — it guarantees the heartbeat is cleaned up when the work finishes (or throws). For manual control, keepAlive() returns a disposer:
const dispose = await this.keepAlive();try { await longWork();} finally { dispose();}keepAlive is for work measured in minutes, not hours. For truly long-running operations, use a different strategy:
| Duration | Strategy |
|---|---|
| Seconds | Normal request handling |
| Minutes | keepAlive() / keepAliveWhile() |
| Minutes to hours | Workflows |
| Hours to days | Async pattern: start job, hibernate, wake on completion |
An agent can be evicted at any time — a deploy, a platform restart, or hitting resource limits. If the agent was mid-task, that work is lost unless it was checkpointed.
runFiber() provides crash-recoverable execution. It persists a row in SQLite for the duration of the work, and lets you stash() intermediate state. If the agent is evicted, the fiber row survives, and onFiberRecovered() is called on the next activation.
export class ProjectManager extends Agent<ProjectState> { async executeTask(task: Task) { await this.runFiber(`task:${task.id}`, async (ctx) => { const resources = await this.gatherResources(task); ctx.stash({ phase: "prepared", resources, task });
const result = await this.runSubAgent(task, resources); ctx.stash({ phase: "executed", result, task });
await this.updateTaskStatus(task.id, "complete", result); }); }
async onFiberRecovered(ctx: FiberRecoveryContext) { if (!ctx.name.startsWith("task:")) return; const { phase, task } = ctx.snapshot as { phase: string; task: Task };
if (phase === "prepared") { await this.executeTask(task); } else if (phase === "executed") { await this.updateTaskStatus( task.id, "complete", (ctx.snapshot as { result: unknown }).result ); } }}The pattern is: checkpoint before expensive work, recover from the last checkpoint. This is not automatic replay — you decide what recovery means for your domain.
For the full API reference — FiberContext, FiberRecoveryContext, concurrent fibers, inline vs fire-and-forget patterns — refer to Durable Execution.
The project manager frequently kicks off work that takes far longer than any single activation — a CI pipeline runs for 20 minutes, a design review takes a day, a video asset takes hours to generate. The agent should not stay alive for any of this. Instead, it starts the work, persists the job ID in state, and hibernates. When the result arrives — via a callback, a poll, or a workflow completion — the agent wakes, correlates the result, and moves on.
The project manager starts a CI pipeline for a task. The pipeline takes 20 minutes. Rather than holding a connection open, the agent registers its own URL as the callback and goes to sleep:
export class ProjectManager extends Agent<ProjectState> { async startCIPipeline(task: Task) { const response = await fetch("https://ci.example.com/api/pipelines", { method: "POST", body: JSON.stringify({ repo: "org/project", branch: "main", callback_url: `${this.url}/ci-callback?taskId=${task.id}` }) });
const { pipelineId } = await response.json(); this.updateTask(task.id, { status: "in_progress", externalJobId: pipelineId }); }
async onRequest(request: Request): Promise<Response> { const url = new URL(request.url); if (url.pathname.endsWith("/ci-callback")) { const taskId = url.searchParams.get("taskId"); const result = await request.json(); this.updateTask(taskId, { status: result.status === "success" ? "complete" : "blocked" }); return new Response("OK"); } // ... other routes }}Not every external service supports callbacks. When the project manager submits a video asset for generation, it needs to check back periodically until the job completes:
export class ProjectManager extends Agent<ProjectState> { async startVideoGeneration(task: Task) { const response = await fetch("https://video-api.example.com/generate", { method: "POST", body: JSON.stringify({ prompt: task.title }) }); const { jobId } = await response.json(); this.updateTask(task.id, { status: "in_progress", externalJobId: jobId }); await this.schedule(60, "pollExternalJob", { taskId: task.id, jobId, attempt: 1 }); }
async pollExternalJob(payload: { taskId: string; jobId: string; attempt: number; }) { const response = await fetch( `https://video-api.example.com/status/${payload.jobId}` ); const status = await response.json();
if (status.state === "complete" || status.state === "failed") { this.updateTask(payload.taskId, { status: status.state === "complete" ? "complete" : "blocked" }); return; }
const nextDelay = Math.min(60 * payload.attempt, 600); await this.schedule(nextDelay, "pollExternalJob", { ...payload, attempt: payload.attempt + 1 }); }}A production deployment involves multiple steps that must each retry independently — build, test, stage, promote. The project manager should not manage these steps internally; it delegates to a Workflow that handles retries and step sequencing:
export class ProjectManager extends Agent<ProjectState> { async startDeployment(task: Task) { const instanceId = await this.runWorkflow("DEPLOY_WORKFLOW", { taskId: task.id, environment: "production" }); this.updateTask(task.id, { status: "in_progress", externalJobId: instanceId }); }
async onWorkflowComplete( workflowName: string, instanceId: string, result?: unknown ) { const task = this.state.tasks.find((t) => t.externalJobId === instanceId); if (task) this.updateTask(task.id, { status: "complete" }); }}The CI pipeline finishes 20 minutes later. The webhook wakes the project manager. The task status is updated. But now what? If the agent was using an LLM to orchestrate work — deciding which task to run next, drafting a status report, reasoning about blockers — it needs to pick up that reasoning thread. The original prompt, the in-flight tool call, the chain of thought — all gone from memory.
This is the fundamental challenge of long-running AI agents. Most frameworks assume tool calls complete within the LLM's timeout and do not address this directly.
Three approaches work today:
Replay the full conversation history. AIChatAgent persists all messages in SQLite. When the result arrives, append it to the history and re-invoke the LLM. This is the simplest approach but re-processes the entire context window.
Stash a continuation summary. Before hibernating, persist a compact description of what the agent was doing and what to do with the result:
ctx.stash({ task: "Waiting for CI results", onSuccess: "Mark task complete, move to next step in plan", onFailure: "Notify team, schedule retry in 1 hour", relevantContext: { taskId, planStep: 3 }});On recovery, use the stash to construct a focused prompt rather than replaying everything.
Use the plan as context. If the agent has a structured plan, the plan itself provides sufficient context: "I am on step 3 of 7, the step was 'run CI pipeline', the result just arrived." This is the most robust approach for long-running agents — the plan is both a recovery mechanism and a context reconstruction strategy. Refer to the next section.
A structured plan is not just useful for showing progress to users — it is a durability mechanism. An agent with a plan can recover from any interruption by looking at where it left off.
type Plan = { goal: string; steps: PlanStep[]; currentStep: number; createdAt: string; updatedAt: string;};
type PlanStep = { id: string; description: string; status: "pending" | "in_progress" | "complete" | "failed" | "skipped"; result?: unknown;};
export class ProjectManager extends Agent<ProjectState> { async createPlan(goal: string) { const steps = await this.keepAliveWhile(async () => { return this.callLLM(` Break down this project goal into concrete steps. Return a JSON array of { id, description } objects. Goal: ${goal} `); });
this.setState({ ...this.state, plan: { goal, steps: steps.map((s: { id: string; description: string }) => ({ ...s, status: "pending" as const })), currentStep: 0, createdAt: new Date().toISOString(), updatedAt: new Date().toISOString() } });
await this.schedule(0, "executeNextStep"); }
async executeNextStep() { const { plan } = this.state; if (!plan || plan.currentStep >= plan.steps.length) { this.setState({ ...this.state, status: "complete" }); return; }
const step = plan.steps[plan.currentStep];
try { const result = await this.keepAliveWhile(() => this.executeStep(step));
const updatedSteps = plan.steps.map((s) => s.id === step.id ? { ...s, status: "complete" as const, result } : s ); this.setState({ ...this.state, plan: { ...plan, steps: updatedSteps, currentStep: plan.currentStep + 1, updatedAt: new Date().toISOString() } });
await this.schedule(0, "executeNextStep"); } catch (error) { const updatedSteps = plan.steps.map((s) => s.id === step.id ? { ...s, status: "failed" as const } : s ); this.setState({ ...this.state, plan: { ...plan, steps: updatedSteps, updatedAt: new Date().toISOString() } }); } }}This pattern has several advantages for long-running agents:
- Recovery is trivial — on restart, check
plan.currentStepand resume - Progress is visible — clients see which steps are done and what is next
- Re-planning is possible — if a step fails or requirements change, the agent can revise the remaining steps without losing completed work
- Human oversight — the plan is a natural approval checkpoint ("here is what I am going to do — proceed?")
- Context reconstruction — the plan tells the LLM where it is, what happened, and what to do next, without replaying the full conversation
A project manager does not do everything itself. It delegates specialized work to sub-agents — each with their own identity, state, and lifecycle.
export class ProjectManager extends Agent<ProjectState> { async delegateTask(task: Task) { const researcher = await this.subAgent( ResearchAgent, `research-${task.id}` );
const findings = await researcher.research(task.title);
this.updateTask(task.id, { status: "complete" }); return findings; }}Sub-agents are independent Durable Objects. They have their own state, their own schedules, and their own lifecycle. The parent does not need to stay alive while the sub-agent works — it can start the work, hibernate, and be woken by a callback or scheduled check.
The patterns above handle the project manager's coordination work — scheduling, delegating, polling. But the project manager also uses an LLM directly: generating plans, summarizing progress, drafting status emails. Those LLM calls stream tokens over a connection that cannot be resumed if the agent is evicted mid-response.
For chat-oriented agents built on AIChatAgent, this is an even sharper problem — the user is watching the response stream in real time and sees it stop mid-sentence. unstable_chatRecovery wraps each chat turn in a runFiber, providing automatic keepAlive during streaming and a recovery hook when the agent restarts:
import { AIChatAgent } from "@cloudflare/ai-chat";import type { ChatRecoveryContext, ChatRecoveryOptions} from "@cloudflare/ai-chat";
class ProjectChat extends AIChatAgent<Env> { override unstable_chatRecovery = true;
override async onChatRecovery( ctx: ChatRecoveryContext ): Promise<ChatRecoveryOptions> { // ctx.partialText — text generated before eviction // ctx.recoveryData — whatever you stashed via this.stash() // ctx.messages — full conversation history return {}; }}The right recovery strategy depends on the LLM provider:
| Provider | Strategy | How it works | Token cost |
|---|---|---|---|
| Workers AI | Continue from partial | continueLastTurn() — model continues via assistant prefill | Low |
| OpenAI (Responses API) | Retrieve completed response | Stash responseId during streaming, retrieve on recovery | Zero |
| Anthropic | Synthetic continuation | Persist partial, send a synthetic user message asking the model to continue | Medium |
| Other | Try prefill, fall back to synthetic | continueLastTurn() if the provider supports it, synthetic message otherwise | Varies |
An agent that runs for months accumulates data: conversation history, timeline events, completed tasks, schedule records. Without management, this grows unbounded.
Schedule periodic cleanup to prune old data and archive completed work:
export class ProjectManager extends Agent<ProjectState> { async onStart() { await this.schedule("0 0 * * *", "housekeeping", {}, { idempotent: true }); }
async housekeeping() { const cutoff = Date.now() - 30 * 24 * 60 * 60 * 1000; const toArchive = this.state.tasks.filter( (t) => t.status === "complete" && (t.completedAt ?? 0) < cutoff ); for (const task of toArchive) { this .sql`INSERT INTO archived_tasks (id, data) VALUES (${task.id}, ${JSON.stringify(task)})`; } this.setState({ ...this.state, tasks: this.state.tasks.filter( (t) => !toArchive.some((a) => a.id === t.id) ) });
this.deleteWorkflows({ status: ["complete", "errored"], createdBefore: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000) }); }}For agents that use AIChatAgent, conversation history can grow large over extended lifespans. Without management, a 3-month conversation will exhaust the LLM's context window long before the project ends.
Strategies for managing conversation size:
- Sliding window — keep only the last N messages in the active context. Simple and predictable.
- Summarization — periodically summarize older messages and replace them with a compact summary. Original messages can remain in SQLite for audit.
- Selective retention — retain messages that contain decisions, approvals, and key context while pruning routine exchanges.
A long-running agent eventually completes its purpose. The project ships, the investigation concludes, the monitoring window closes. Clean up explicitly:
export class ProjectManager extends Agent<ProjectState> { async completeProject() { const schedules = this.getSchedules(); for (const schedule of schedules) { await this.cancelSchedule(schedule.id); }
this.setState({ ...this.state, status: "complete" });
// All SQLite data, schedules, and state are permanently deleted await this.destroy(); }}this.destroy() is permanent. If you may need the agent's data later, archive it to an external store (R2, D1, or an API call) before destroying. For agents that might be reactivated, simply mark them as complete and let them hibernate — they cost nothing when idle.
Both Workflows and agent-internal primitives (schedules, fibers, queues) support long-running work. The right choice depends on the nature of the work:
| Agent-internal | Workflows | |
|---|---|---|
| Best for | Agent-centric work: scheduling, polling, state updates | Independent multi-step pipelines |
| Durability | SQLite (survives eviction) | Workflow engine (survives everything) |
| Retries | this.retry(), schedule-level retries | Per-step retries with backoff |
| Max duration | Minutes per activation (with keepAlive) | 30 minutes per step, unlimited steps |
| Human approval | Build it yourself (state + WebSocket) | Built-in waitForApproval() |
| Complexity | Lower — everything is in the agent | Higher — separate class, wrangler config |
A pragmatic rule: if the work is about the agent managing its own lifecycle (checking deadlines, syncing state, sending reminders), use schedules and fibers. If the work is a discrete pipeline that could fail and retry independently (deploy, data processing, report generation), use a Workflow.
The project manager agent uses both: schedules for its own rhythms (daily standups, progress syncs), and Workflows for heavyweight operations (deployments, CI pipelines).
Long-running agents on Cloudflare are not long-running processes. They are durable entities that wake, work, and sleep — potentially over weeks or months. The key primitives:
| Primitive | Purpose |
|---|---|
setState() / this.sql | Persist state across activations |
schedule() / scheduleEvery() | Wake the agent at future times |
keepAlive() / keepAliveWhile() | Prevent eviction during active work |
runFiber() / stash() | Checkpoint and recover long tasks |
unstable_chatRecovery | Recover interrupted LLM streams |
onRequest() / onEmail() / RPC | Wake on external events |
runWorkflow() | Delegate heavyweight multi-step work |
subAgent() | Delegate specialized work to child agents |
| Structured plans in state | Enable recovery, visibility, and re-planning |
For the project manager agent, these compose into an agent that:
- Plans — breaks goals into steps, persists the plan in state
- Executes — runs steps one at a time, hibernating between them
- Reacts — wakes on webhooks, emails, and schedules
- Recovers — resumes from the last checkpoint after any interruption
- Delegates — hands off work to sub-agents and Workflows
- Maintains — prunes old data, archives completed work, manages its own lifecycle
- Ends — cleans up and destroys itself when the project is done
The agent does not need to run continuously to do any of this. It just needs to exist.
- Durable Execution —
runFiber(),stash(), and crash recovery - Schedule tasks — delayed, cron, and interval tasks
- Retries — retry options and patterns
- Workflows — durable multi-step processing
- Store and sync state —
setState()and persistence - WebSockets — lifecycle hooks and hibernation
- Callable methods — RPC via
@callableand service bindings - Email routing — receiving inbound email
- Webhooks — receiving external events
- Human in the loop — approval flows