
Long-running workflow explained: durable execution

Long-running workflow explained — what durable execution means, how it handles failures, retries, and state, and why it matters for AI automation.

Osman Ramadan · June 9, 2026 · 4 min read

A long-running workflow is any automated process that takes more than a few seconds to complete and must survive infrastructure hiccups along the way. We're talking minutes, hours, sometimes days. A web scraping job that crawls 10,000 pages. A data pipeline that processes overnight batch uploads. An AI research workflow that queries multiple LLMs, waits for human approval, and then generates a final report.

The defining trait isn't duration alone. It's that the workflow must not lose progress if something crashes halfway through. This guide ties the concept to how CodeWords workflows handle it, not just theory.

What makes a workflow "long-running"?

Short-lived workflows — webhook receives data, transforms it, posts to Slack — finish in under a second. If they fail, you retry the whole thing. No harm done.

Long-running workflows can't afford full restarts. Consider a workflow that:

1. Fetches 500 leads from a CRM
2. Enriches each lead with web scraping via Firecrawl
3. Scores each lead using an LLM
4. Updates the CRM with scores
5. Sends a summary to Slack

If step 3 fails on lead #347, restarting from step 1 wastes the work already done on leads 1–346. That's the core problem durable execution solves.
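
To make the lead #347 scenario concrete, here is a minimal sketch in Python. It assumes a plain dictionary stands in for durable storage, and `score_leads` and `flaky_score` are hypothetical names invented for illustration, not part of any platform API. The scorer fails once on lead #347 to simulate a crash.

```python
from typing import Callable, Dict, List, Set


def score_leads(
    leads: List[str],
    score: Callable[[str], int],
    done: Dict[str, int],
) -> Dict[str, int]:
    """Score every lead, skipping those already recorded in `done`."""
    for lead in leads:
        if lead in done:           # finished in an earlier attempt
            continue
        done[lead] = score(lead)   # record progress as soon as the item succeeds
    return done


calls: List[str] = []      # every successful scoring call, for inspection
crashed: Set[str] = set()  # leads that have already failed once


def flaky_score(lead: str) -> int:
    """Hypothetical scorer that fails the first time it sees lead #347."""
    if lead == "lead-347" and lead not in crashed:
        crashed.add(lead)
        raise RuntimeError("LLM call timed out")
    calls.append(lead)
    return len(lead)


leads = [f"lead-{i}" for i in range(1, 501)]
progress: Dict[str, int] = {}  # assumption: stand-in for a persistent store

try:
    score_leads(leads, flaky_score, progress)
except RuntimeError:
    pass  # the first attempt dies here; `progress` still holds leads 1-346

completed_before_crash = len(progress)
score_leads(leads, flaky_score, progress)  # second attempt resumes at lead 347
```

After the second call, every lead has been scored exactly once. In a real deployment the dictionary would need to live somewhere that survives a process crash.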

Microsoft's Azure Durable Functions documentation describes this pattern as "orchestrator functions" — code that checkpoints its state and can resume from the last successful step after a failure.

How durable execution works

Durable execution frameworks persist workflow state at each checkpoint. The mechanics vary by platform, but the pattern is consistent:

Checkpointing. After each significant step, the workflow saves its current state — which steps completed, what data they produced, where in the sequence the workflow stands. CodeWords uses Redis for state persistence across workflow runs, making checkpoints fast and reliable.

Replay. When a workflow resumes after failure, the framework replays the execution log. Completed steps return their saved results without re-executing. Execution picks up from the first incomplete step.
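
Checkpointing and replay can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not how any particular framework implements it: `Workflow`, `step`, and `pipeline` are hypothetical names, and an in-memory dictionary stands in for the durable execution log.

```python
from typing import Any, Callable, Dict, List


class Workflow:
    """Runs named steps in order, recording each result in an execution log."""

    def __init__(self, log: Dict[str, Any]) -> None:
        self.log = log                 # step name -> saved result
        self.executed: List[str] = []  # steps that really ran in this attempt

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if name in self.log:           # replay: return the checkpointed result
            return self.log[name]
        result = fn()                  # first execution of this step
        self.log[name] = result        # checkpoint before moving on
        self.executed.append(name)
        return result


def pipeline(wf: Workflow, fail_scoring: bool) -> str:
    leads = wf.step("fetch", lambda: ["ada", "bob"])
    enriched = wf.step("enrich", lambda: [name.upper() for name in leads])

    def score() -> Dict[str, int]:
        if fail_scoring:
            raise RuntimeError("scoring service unavailable")
        return {name: len(name) for name in enriched}

    scores = wf.step("score", score)
    return wf.step("summarize", lambda: f"scored {len(scores)} leads")


log: Dict[str, Any] = {}  # assumption: stand-in for durable storage

first = Workflow(log)
try:
    pipeline(first, fail_scoring=True)  # crashes in the "score" step
except RuntimeError:
    pass

second = Workflow(log)                  # new attempt, same log
summary = pipeline(second, fail_scoring=False)
```

The first attempt executes only "fetch" and "enrich" before failing. The second attempt replays both from the log and executes only "score" and "summarize". The sketch relies on the workflow code requesting the same steps in the same order on every attempt, which is why replay-based frameworks generally require orchestration code to be deterministic.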

Idempotency. Steps must be safe to retry. Sending an email twice is a problem. Writing to a database with an upsert isn't. Well-designed long-running workflows make every step idempotent or track which side effects have already fired.
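
One common way to track fired side effects is an idempotency key. The sketch below assumes a hypothetical helper `run_once` and an in-memory set standing in for a persistent record of keys; the key format is also made up for illustration.

```python
from typing import Callable, Dict, List, Set


def run_once(key: str, fired: Set[str], effect: Callable[[], None]) -> bool:
    """Run `effect` unless `key` is already recorded. Returns True if it ran."""
    if key in fired:
        return False
    effect()
    fired.add(key)  # record only after the effect succeeds
    return True


outbox: List[str] = []
fired: Set[str] = set()  # assumption: stand-in for a persistent set of keys

key = "run-42:send-summary"  # hypothetical key: workflow run id plus step name
first_attempt = run_once(key, fired, lambda: outbox.append("summary email"))
retry = run_once(key, fired, lambda: outbox.append("summary email"))

# An upsert is naturally idempotent: repeating it leaves the same state.
crm: Dict[str, int] = {}
crm["lead-1"] = 87
crm["lead-1"] = 87
```

The retry is skipped, so only one email lands in `outbox`. Note the remaining gap: a crash between `effect()` and `fired.add(key)` would still allow a duplicate, which is why production systems often pass the key to the downstream service so it can deduplicate on its side.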

Timeouts and heartbeats. Steps that wait on a human approval or an external API callback carry a timeout so they cannot hang forever. Steps that do long stretches of work send periodic heartbeat signals to confirm they are still alive. If a timeout expires or heartbeats stop, the orchestrator marks the step as failed and triggers recovery logic.

Temporal.io's engineering blog provides a solid technical deep dive into how replay-based durable execution eliminates the need for manual state management.

Why long-running workflows matter for AI automation

AI workflows tend to be long-running for three reasons:

LLM calls are slow. A single call to GPT-4 or Claude can take 5–30 seconds. A workflow that chains multiple LLM calls — summarize, then analyze, then generate — easily hits minutes. CodeWords provides native access to OpenAI, Anthropic, and Google Gemini without API key management, but the latency of model inference still adds up.

Research patterns are sequential. Deep research workflows scrape multiple sources, feed results into LLMs, and iterate. Each iteration depends on the previous one's output. These workflows routinely run for 10–30 minutes.

Batch processing at scale. Processing 1,000 items through an AI pipeline — automated lead management, content generation, competitor monitoring — takes time. Losing progress on item 800 because of a transient network error is unacceptable.
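
Transient network errors like the one described above are usually absorbed with retries before they ever threaten batch progress. The following is a minimal sketch of retry with exponential backoff. `with_retries` and its parameters are hypothetical, and treating `ConnectionError` as the transient failure is an assumption made for illustration.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on ConnectionError with exponentially growing delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise                         # out of attempts: surface the error
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise AssertionError("unreachable")


# Simulated call that fails twice with a transient error, then succeeds.
outcomes: List[object] = [ConnectionError("reset"), ConnectionError("reset"), "ok"]


def flaky_call() -> str:
    outcome = outcomes.pop(0)
    if isinstance(outcome, Exception):
        raise outcome
    return str(outcome)


delays: List[float] = []  # records requested delays instead of really sleeping
result = with_retries(flaky_call, sleep=delays.append)
```

Injecting `sleep` keeps the example fast and testable. Retries handle short blips; checkpointing covers the case where retries run out and the whole run has to resume later.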

A 2025 McKinsey report on AI in operations noted that the most impactful AI automation use cases involve multi-step processes that traditional automation couldn't handle — exactly the scenarios where long-running workflows shine.

How CodeWords handles long-running workflows

CodeWords runs workflows in ephemeral E2B sandboxes — isolated Python environments with full library access. For long-running patterns, the platform provides:

Redis state persistence for checkpointing and cross-run memory
Scheduling via cron for recurring long-running jobs
Serverless execution that eliminates timeout concerns from traditional cloud functions
500+ integrations via Composio and Pipedream for connecting external systems at any checkpoint

The combination means you can build workflows that run for hours — scraping, processing, reasoning — without managing infrastructure or worrying about state loss. Describe the workflow to Cody, and the platform handles execution durability.

When do you need durable execution?

Not every workflow needs it. Quick rules:

Under 30 seconds, deterministic steps? Standard serverless functions work fine.
Over 30 seconds, multiple external calls? You need checkpointing.
Involves human-in-the-loop or waiting for callbacks? You definitely need durable execution.
Processes batches where partial progress matters? Non-negotiable.

Platforms like Zapier and Make handle simple trigger-action flows well but hit ceilings on truly long-running patterns. n8n offers more flexibility with self-hosted execution, but you manage the infrastructure. CodeWords abstracts the infrastructure while giving you Python-level control over the logic — see CodeWords pricing for execution cost details.

Long-running workflow support is one of the clearest dividing lines between automation tools built for demos and automation tools built for production.

Start building durable workflows on CodeWords →
