Skip to content

Long-running agents

Build agents that persist for days, weeks, or months — surviving restarts, waking on demand, and managing work that spans far longer than any single request.

Why Cloudflare for long-running agents

Agents spend most of their time waiting. Waiting for user input (seconds to days), LLM responses (seconds to minutes), tool results (seconds to hours), human approvals (hours to days), or scheduled wake-ups (minutes to months). On a traditional VM or container, you pay for all that idle time. An agent that is 99% dormant and 1% active still costs you 100% of a server.

Durable Objects invert this model. An agent exists as an addressable entity with persistent state, but consumes zero compute when hibernated. When something happens — an HTTP request, a WebSocket message, a scheduled alarm, an inbound email — the platform wakes the agent, loads its state from SQLite, and hands it the event. The agent does its work, then goes back to sleep.

This is the actor model: each agent has an identity, durable state, and wakes on message. You do not manage servers, routing, health checks, or restart logic. The platform handles placement, scaling, and recovery.

The economics follow directly:

VMs / ContainersDurable Objects
Idle costFull compute cost, alwaysZero (hibernated)
ScalingProvision and manage capacityAutomatic, per-agent
StateExternal database requiredBuilt-in SQLite
RecoveryYou build it (process managers, health checks)Platform restarts, state survives
Identity / routingYou build it (load balancers, sticky sessions)Built-in (name to agent)
10,000 agents, each active 1% of the time10,000 always-on instances~100 active at any moment

For agents — which are inherently bursty, stateful, and long-lived — this is a natural fit.

The lifecycle of a long-running agent

A long-running agent is not a process that runs continuously. It is an entity that exists continuously but runs intermittently. Understanding the lifecycle is key to building agents that work reliably over long timelines.

Wake → onStart() → handle events → idle (~2 min) → hibernation
▲ │
└──────────────── alarm or request wakes agent ────────┘
Eviction (crash / redeploy) can happen at any point.
State persists in SQLite. Agent restarts on next event.

What survives

  • this.state — persisted to SQLite on every setState() call
  • this.sql data — all SQLite tables you create
  • Scheduled tasks — stored in SQLite, trigger alarms to wake the agent
  • Connection stateconnection.setState() data for each WebSocket client
  • Fiber checkpointsstash() data from runFiber()

Any higher-level abstractions built on SQLite also survive, since they share the same durable storage.

What does not survive

  • In-memory variables — class fields not stored via setState() or this.sql
  • Running timerssetTimeout, setInterval are lost on hibernation/eviction
  • Open fetch requests — in-flight HTTP calls are abandoned
  • Local closures — callbacks and promise chains are lost

The implication: any work that matters must be persisted or recoverable. The SDK provides primitives for this — schedules, fibers, queues — but understanding the boundary between "in-memory" and "durable" is essential.

Running example: a project manager agent

Throughout this doc, we build up a project manager agent that:

  • Lives for the duration of a project (weeks or months)
  • Tracks tasks, assigns work to sub-agents, and reports progress
  • Wakes up on schedule to check deadlines and send reminders
  • Reacts to external events (webhooks from GitHub, emails from team members)
  • Handles long-running operations (CI pipelines, code reviews, deployments)
  • Survives any number of restarts and evictions along the way
TypeScript
import { Agent } from "agents";
type ProjectState = {
name: string;
status: "planning" | "active" | "review" | "complete";
tasks: Task[];
plan: Plan | null;
};
type Task = {
id: string;
title: string;
status: "pending" | "in_progress" | "blocked" | "complete";
assignee?: string;
dueDate?: string;
completedAt?: number;
externalJobId?: string;
};
export class ProjectManager extends Agent<ProjectState> {
initialState: ProjectState = {
name: "",
status: "planning",
tasks: [],
plan: null
};
}

The Plan type is introduced in Planning as a durability strategy. We add capabilities to this agent section by section.

Waking up: how agents get activated

A hibernated agent can be woken by any of these sources:

Wake sourceHow it worksExample
HTTP requestAny request to the agent's URL triggers onRequest()A webhook from GitHub
WebSocket connectionA client connects, triggering onConnect()A team member opens the dashboard
RPC callAnother Worker or agent calls a method via service binding or @callableA coordinator agent delegates a task
Scheduled alarmA stored schedule fires, triggering your callbackDaily standup reminder at 9am
EmailAn inbound email triggers onEmail()A team member replies to a status email

The pattern extends naturally to any event source that can reach a Worker — anything from telephony webhooks to chat platform bots. An external signal arrives, the platform wakes the agent, and the agent handles it.

The agent does not need to be "started" or "deployed" separately for each wake source — they all route to the same Durable Object instance. The agent's identity (its name) is the routing key.

TypeScript
export class ProjectManager extends Agent<ProjectState> {
async onStart() {
// Daily deadline check at 9am UTC — idempotent, safe across restarts
await this.schedule(
"0 9 * * *",
"checkDeadlines",
{},
{
idempotent: true
}
);
// Progress sync every 30 minutes
await this.scheduleEvery(1800, "syncProgress");
}
async onRequest(request: Request): Promise<Response> {
const url = new URL(request.url);
if (url.pathname.endsWith("/github-webhook")) {
const event = await request.json();
await this.handleGitHubEvent(event);
return new Response("OK");
}
return Response.json({
project: this.state.name,
status: this.state.status
});
}
async checkDeadlines() {
/* ... find overdue tasks, broadcast alerts ... */
}
async syncProgress() {
/* ... check on sub-agents, update task statuses ... */
}
}

Staying alive during long work

Sometimes an agent needs to do work that takes longer than the idle eviction window (~70–140 seconds). Streaming an LLM response, orchestrating a multi-step tool chain, or waiting on a slow API all risk the agent being evicted mid-flight.

keepAlive() prevents this by creating a heartbeat that resets the inactivity timer:

TypeScript
export class ProjectManager extends Agent<ProjectState> {
async generateProjectPlan(goal: string) {
const result = await this.keepAliveWhile(async () => {
const plan = await this.callLLM(`Create a project plan for: ${goal}`);
const tasks = await this.callLLM(
`Break this into tasks: ${JSON.stringify(plan)}`
);
return { plan, tasks };
});
this.setState({
...this.state,
status: "active",
plan: result.plan,
tasks: result.tasks
});
}
}

keepAliveWhile() is the recommended approach — it guarantees the heartbeat is cleaned up when the work finishes (or throws). For manual control, keepAlive() returns a disposer:

TypeScript
const dispose = await this.keepAlive();
try {
await longWork();
} finally {
dispose();
}

When keepAlive is not enough

keepAlive is for work measured in minutes, not hours. For truly long-running operations, use a different strategy:

DurationStrategy
SecondsNormal request handling
MinuteskeepAlive() / keepAliveWhile()
Minutes to hoursWorkflows
Hours to daysAsync pattern: start job, hibernate, wake on completion

Surviving crashes: fibers and recovery

An agent can be evicted at any time — a deploy, a platform restart, or hitting resource limits. If the agent was mid-task, that work is lost unless it was checkpointed.

runFiber() provides crash-recoverable execution. It persists a row in SQLite for the duration of the work, and lets you stash() intermediate state. If the agent is evicted, the fiber row survives, and onFiberRecovered() is called on the next activation.

TypeScript
export class ProjectManager extends Agent<ProjectState> {
async executeTask(task: Task) {
await this.runFiber(`task:${task.id}`, async (ctx) => {
const resources = await this.gatherResources(task);
ctx.stash({ phase: "prepared", resources, task });
const result = await this.runSubAgent(task, resources);
ctx.stash({ phase: "executed", result, task });
await this.updateTaskStatus(task.id, "complete", result);
});
}
async onFiberRecovered(ctx: FiberRecoveryContext) {
if (!ctx.name.startsWith("task:")) return;
const { phase, task } = ctx.snapshot as { phase: string; task: Task };
if (phase === "prepared") {
await this.executeTask(task);
} else if (phase === "executed") {
await this.updateTaskStatus(
task.id,
"complete",
(ctx.snapshot as { result: unknown }).result
);
}
}
}

The pattern is: checkpoint before expensive work, recover from the last checkpoint. This is not automatic replay — you decide what recovery means for your domain.

For the full API reference — FiberContext, FiberRecoveryContext, concurrent fibers, inline vs fire-and-forget patterns — refer to Durable Execution.

Handling long async operations

The project manager frequently kicks off work that takes far longer than any single activation — a CI pipeline runs for 20 minutes, a design review takes a day, a video asset takes hours to generate. The agent should not stay alive for any of this. Instead, it starts the work, persists the job ID in state, and hibernates. When the result arrives — via a callback, a poll, or a workflow completion — the agent wakes, correlates the result, and moves on.

Pattern: webhook callback

The project manager starts a CI pipeline for a task. The pipeline takes 20 minutes. Rather than holding a connection open, the agent registers its own URL as the callback and goes to sleep:

TypeScript
export class ProjectManager extends Agent<ProjectState> {
async startCIPipeline(task: Task) {
const response = await fetch("https://ci.example.com/api/pipelines", {
method: "POST",
body: JSON.stringify({
repo: "org/project",
branch: "main",
callback_url: `${this.url}/ci-callback?taskId=${task.id}`
})
});
const { pipelineId } = await response.json();
this.updateTask(task.id, {
status: "in_progress",
externalJobId: pipelineId
});
}
async onRequest(request: Request): Promise<Response> {
const url = new URL(request.url);
if (url.pathname.endsWith("/ci-callback")) {
const taskId = url.searchParams.get("taskId");
const result = await request.json();
this.updateTask(taskId, {
status: result.status === "success" ? "complete" : "blocked"
});
return new Response("OK");
}
// ... other routes
}
}

Pattern: polling with schedule

Not every external service supports callbacks. When the project manager submits a video asset for generation, it needs to check back periodically until the job completes:

TypeScript
export class ProjectManager extends Agent<ProjectState> {
async startVideoGeneration(task: Task) {
const response = await fetch("https://video-api.example.com/generate", {
method: "POST",
body: JSON.stringify({ prompt: task.title })
});
const { jobId } = await response.json();
this.updateTask(task.id, { status: "in_progress", externalJobId: jobId });
await this.schedule(60, "pollExternalJob", {
taskId: task.id,
jobId,
attempt: 1
});
}
async pollExternalJob(payload: {
taskId: string;
jobId: string;
attempt: number;
}) {
const response = await fetch(
`https://video-api.example.com/status/${payload.jobId}`
);
const status = await response.json();
if (status.state === "complete" || status.state === "failed") {
this.updateTask(payload.taskId, {
status: status.state === "complete" ? "complete" : "blocked"
});
return;
}
const nextDelay = Math.min(60 * payload.attempt, 600);
await this.schedule(nextDelay, "pollExternalJob", {
...payload,
attempt: payload.attempt + 1
});
}
}

Pattern: workflow delegation

A production deployment involves multiple steps that must each retry independently — build, test, stage, promote. The project manager should not manage these steps internally; it delegates to a Workflow that handles retries and step sequencing:

TypeScript
export class ProjectManager extends Agent<ProjectState> {
async startDeployment(task: Task) {
const instanceId = await this.runWorkflow("DEPLOY_WORKFLOW", {
taskId: task.id,
environment: "production"
});
this.updateTask(task.id, {
status: "in_progress",
externalJobId: instanceId
});
}
async onWorkflowComplete(
workflowName: string,
instanceId: string,
result?: unknown
) {
const task = this.state.tasks.find((t) => t.externalJobId === instanceId);
if (task) this.updateTask(task.id, { status: "complete" });
}
}

Reconstructing context after a long wait

The CI pipeline finishes 20 minutes later. The webhook wakes the project manager. The task status is updated. But now what? If the agent was using an LLM to orchestrate work — deciding which task to run next, drafting a status report, reasoning about blockers — it needs to pick up that reasoning thread. The original prompt, the in-flight tool call, the chain of thought — all gone from memory.

This is the fundamental challenge of long-running AI agents. Most frameworks assume tool calls complete within the LLM's timeout and do not address this directly.

Three approaches work today:

Replay the full conversation history. AIChatAgent persists all messages in SQLite. When the result arrives, append it to the history and re-invoke the LLM. This is the simplest approach but re-processes the entire context window.

Stash a continuation summary. Before hibernating, persist a compact description of what the agent was doing and what to do with the result:

TypeScript
ctx.stash({
task: "Waiting for CI results",
onSuccess: "Mark task complete, move to next step in plan",
onFailure: "Notify team, schedule retry in 1 hour",
relevantContext: { taskId, planStep: 3 }
});

On recovery, use the stash to construct a focused prompt rather than replaying everything.

Use the plan as context. If the agent has a structured plan, the plan itself provides sufficient context: "I am on step 3 of 7, the step was 'run CI pipeline', the result just arrived." This is the most robust approach for long-running agents — the plan is both a recovery mechanism and a context reconstruction strategy. Refer to the next section.

Planning as a durability strategy

A structured plan is not just useful for showing progress to users — it is a durability mechanism. An agent with a plan can recover from any interruption by looking at where it left off.

TypeScript
type Plan = {
goal: string;
steps: PlanStep[];
currentStep: number;
createdAt: string;
updatedAt: string;
};
type PlanStep = {
id: string;
description: string;
status: "pending" | "in_progress" | "complete" | "failed" | "skipped";
result?: unknown;
};
export class ProjectManager extends Agent<ProjectState> {
async createPlan(goal: string) {
const steps = await this.keepAliveWhile(async () => {
return this.callLLM(`
Break down this project goal into concrete steps.
Return a JSON array of { id, description } objects.
Goal: ${goal}
`);
});
this.setState({
...this.state,
plan: {
goal,
steps: steps.map((s: { id: string; description: string }) => ({
...s,
status: "pending" as const
})),
currentStep: 0,
createdAt: new Date().toISOString(),
updatedAt: new Date().toISOString()
}
});
await this.schedule(0, "executeNextStep");
}
async executeNextStep() {
const { plan } = this.state;
if (!plan || plan.currentStep >= plan.steps.length) {
this.setState({ ...this.state, status: "complete" });
return;
}
const step = plan.steps[plan.currentStep];
try {
const result = await this.keepAliveWhile(() => this.executeStep(step));
const updatedSteps = plan.steps.map((s) =>
s.id === step.id ? { ...s, status: "complete" as const, result } : s
);
this.setState({
...this.state,
plan: {
...plan,
steps: updatedSteps,
currentStep: plan.currentStep + 1,
updatedAt: new Date().toISOString()
}
});
await this.schedule(0, "executeNextStep");
} catch (error) {
const updatedSteps = plan.steps.map((s) =>
s.id === step.id ? { ...s, status: "failed" as const } : s
);
this.setState({
...this.state,
plan: {
...plan,
steps: updatedSteps,
updatedAt: new Date().toISOString()
}
});
}
}
}

This pattern has several advantages for long-running agents:

  • Recovery is trivial — on restart, check plan.currentStep and resume
  • Progress is visible — clients see which steps are done and what is next
  • Re-planning is possible — if a step fails or requirements change, the agent can revise the remaining steps without losing completed work
  • Human oversight — the plan is a natural approval checkpoint ("here is what I am going to do — proceed?")
  • Context reconstruction — the plan tells the LLM where it is, what happened, and what to do next, without replaying the full conversation

Delegating to sub-agents

A project manager does not do everything itself. It delegates specialized work to sub-agents — each with their own identity, state, and lifecycle.

TypeScript
export class ProjectManager extends Agent<ProjectState> {
async delegateTask(task: Task) {
const researcher = await this.subAgent(
ResearchAgent,
`research-${task.id}`
);
const findings = await researcher.research(task.title);
this.updateTask(task.id, { status: "complete" });
return findings;
}
}

Sub-agents are independent Durable Objects. They have their own state, their own schedules, and their own lifecycle. The parent does not need to stay alive while the sub-agent works — it can start the work, hibernate, and be woken by a callback or scheduled check.

Recovering interrupted LLM streams

The patterns above handle the project manager's coordination work — scheduling, delegating, polling. But the project manager also uses an LLM directly: generating plans, summarizing progress, drafting status emails. Those LLM calls stream tokens over a connection that cannot be resumed if the agent is evicted mid-response.

For chat-oriented agents built on AIChatAgent, this is an even sharper problem — the user is watching the response stream in real time and sees it stop mid-sentence. unstable_chatRecovery wraps each chat turn in a runFiber, providing automatic keepAlive during streaming and a recovery hook when the agent restarts:

TypeScript
import { AIChatAgent } from "@cloudflare/ai-chat";
import type {
ChatRecoveryContext,
ChatRecoveryOptions
} from "@cloudflare/ai-chat";
class ProjectChat extends AIChatAgent<Env> {
override unstable_chatRecovery = true;
override async onChatRecovery(
ctx: ChatRecoveryContext
): Promise<ChatRecoveryOptions> {
// ctx.partialText — text generated before eviction
// ctx.recoveryData — whatever you stashed via this.stash()
// ctx.messages — full conversation history
return {};
}
}

The right recovery strategy depends on the LLM provider:

ProviderStrategyHow it worksToken cost
Workers AIContinue from partialcontinueLastTurn() — model continues via assistant prefillLow
OpenAI (Responses API)Retrieve completed responseStash responseId during streaming, retrieve on recoveryZero
AnthropicSynthetic continuationPersist partial, send a synthetic user message asking the model to continueMedium
OtherTry prefill, fall back to syntheticcontinueLastTurn() if the provider supports it, synthetic message otherwiseVaries

Managing state over time

An agent that runs for months accumulates data: conversation history, timeline events, completed tasks, schedule records. Without management, this grows unbounded.

Housekeeping

Schedule periodic cleanup to prune old data and archive completed work:

TypeScript
export class ProjectManager extends Agent<ProjectState> {
async onStart() {
await this.schedule("0 0 * * *", "housekeeping", {}, { idempotent: true });
}
async housekeeping() {
const cutoff = Date.now() - 30 * 24 * 60 * 60 * 1000;
const toArchive = this.state.tasks.filter(
(t) => t.status === "complete" && (t.completedAt ?? 0) < cutoff
);
for (const task of toArchive) {
this
.sql`INSERT INTO archived_tasks (id, data) VALUES (${task.id}, ${JSON.stringify(task)})`;
}
this.setState({
...this.state,
tasks: this.state.tasks.filter(
(t) => !toArchive.some((a) => a.id === t.id)
)
});
this.deleteWorkflows({
status: ["complete", "errored"],
createdBefore: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000)
});
}
}

Conversation history management

For agents that use AIChatAgent, conversation history can grow large over extended lifespans. Without management, a 3-month conversation will exhaust the LLM's context window long before the project ends.

Strategies for managing conversation size:

  • Sliding window — keep only the last N messages in the active context. Simple and predictable.
  • Summarization — periodically summarize older messages and replace them with a compact summary. Original messages can remain in SQLite for audit.
  • Selective retention — retain messages that contain decisions, approvals, and key context while pruning routine exchanges.

End of life

A long-running agent eventually completes its purpose. The project ships, the investigation concludes, the monitoring window closes. Clean up explicitly:

TypeScript
export class ProjectManager extends Agent<ProjectState> {
async completeProject() {
const schedules = this.getSchedules();
for (const schedule of schedules) {
await this.cancelSchedule(schedule.id);
}
this.setState({ ...this.state, status: "complete" });
// All SQLite data, schedules, and state are permanently deleted
await this.destroy();
}
}

this.destroy() is permanent. If you may need the agent's data later, archive it to an external store (R2, D1, or an API call) before destroying. For agents that might be reactivated, simply mark them as complete and let them hibernate — they cost nothing when idle.

When to use Workflows vs agent-internal patterns

Both Workflows and agent-internal primitives (schedules, fibers, queues) support long-running work. The right choice depends on the nature of the work:

Agent-internalWorkflows
Best forAgent-centric work: scheduling, polling, state updatesIndependent multi-step pipelines
DurabilitySQLite (survives eviction)Workflow engine (survives everything)
Retriesthis.retry(), schedule-level retriesPer-step retries with backoff
Max durationMinutes per activation (with keepAlive)30 minutes per step, unlimited steps
Human approvalBuild it yourself (state + WebSocket)Built-in waitForApproval()
ComplexityLower — everything is in the agentHigher — separate class, wrangler config

A pragmatic rule: if the work is about the agent managing its own lifecycle (checking deadlines, syncing state, sending reminders), use schedules and fibers. If the work is a discrete pipeline that could fail and retry independently (deploy, data processing, report generation), use a Workflow.

The project manager agent uses both: schedules for its own rhythms (daily standups, progress syncs), and Workflows for heavyweight operations (deployments, CI pipelines).

Summary

Long-running agents on Cloudflare are not long-running processes. They are durable entities that wake, work, and sleep — potentially over weeks or months. The key primitives:

PrimitivePurpose
setState() / this.sqlPersist state across activations
schedule() / scheduleEvery()Wake the agent at future times
keepAlive() / keepAliveWhile()Prevent eviction during active work
runFiber() / stash()Checkpoint and recover long tasks
unstable_chatRecoveryRecover interrupted LLM streams
onRequest() / onEmail() / RPCWake on external events
runWorkflow()Delegate heavyweight multi-step work
subAgent()Delegate specialized work to child agents
Structured plans in stateEnable recovery, visibility, and re-planning

For the project manager agent, these compose into an agent that:

  1. Plans — breaks goals into steps, persists the plan in state
  2. Executes — runs steps one at a time, hibernating between them
  3. Reacts — wakes on webhooks, emails, and schedules
  4. Recovers — resumes from the last checkpoint after any interruption
  5. Delegates — hands off work to sub-agents and Workflows
  6. Maintains — prunes old data, archives completed work, manages its own lifecycle
  7. Ends — cleans up and destroys itself when the project is done

The agent does not need to run continuously to do any of this. It just needs to exist.