The Durable Execution Problem: A Familiar Story
Imagine a multi-step business process: validate a claim, run safety checks, process a payout, send notifications. Halfway through, the server crashes. What happens next?
In traditional architectures, the answer is often "it depends." Maybe the operation times out and triggers duplicate processing. Maybe partial state corrupts what comes next. For workflows spanning minutes, hours, or even days, interruptions are all too likely.
This is the durable execution problem: ensuring that a multi-step process completes reliably, even through failures, restarts, and deployments. The industry has produced several solutions:
- External orchestration engines (e.g., Temporal, AWS Step Functions) — battle-tested but require dedicated infrastructure and introduce a critical external dependency.
- Cloud-managed workflow services — eliminate operational overhead but introduce vendor lock-in and regulatory concerns.
- Homegrown queue-based systems — avoid external dependencies but trade them for bespoke complexity: each team implements its own retry logic, state management, and compensation flows.
Airbnb faced all of these tradeoffs across multiple teams. Instead of having every service re-invent the wheel, they built Skipper — an embedded workflow engine that runs inside each service, using the database the service already depends on.
Source: Airbnb Engineering Blog
Skipper's Core Design: Workflows and Actions
Skipper centers on two abstractions:
- Workflows — define the orchestration logic: what happens in what order, and under what conditions.
- Actions — encapsulate individual operations (API calls, database updates, notifications). Each action is automatically checkpointed, so its result survives crashes and restarts.
A Concrete Example
Here's a durable workflow for processing a listing's photo review:
class ListingPublicationWorkflow : Workflow() {
    private val actions = actions()

    // Nullable state: null means "review still pending"
    @StateField var photosApproved: Boolean? = null

    @WorkflowMethod
    suspend fun publishListing(submission: ListingSubmission): PublicationResult {
        // Submit photos for review
        val reviewId = actions.submitPhotosForReview(submission.listingId)

        // Wait for photo review completion (manual or automated)
        val reviewTimedOut = waitUntil({ photosApproved != null }, Duration.ofHours(24))
        if (reviewTimedOut || photosApproved != true) {
            actions.notifyHost(submission.hostId, "Photos require updates")
            return PublicationResult.rejected("Photo review failed")
        }

        // Publish the listing
        actions.activateListing(submission.listingId)
        actions.notifyHost(submission.hostId, "Your listing is now live!")
        return PublicationResult.success(submission.listingId)
    }

    @SignalMethod
    fun completePhotoReview(approved: Boolean) {
        photosApproved = approved
    }
}
This code reads naturally: submit photos, wait for photo review, publish. There's no retry logic, queue management, or async coordination visible in the workflow itself.
How Durability Works
When a workflow starts, Skipper checkpoints each action's result to the database. If the workflow needs to wait (via waitUntil), Skipper persists the current state and the workflow hibernates, consuming no compute resources.
When conditions change — a signal arrives, a timer expires, or the service restarts — Skipper replays the workflow method from the beginning. Previously executed actions don't re-execute; they return their checkpointed results instantly. The workflow picks up from where it left off.
This is fundamentally different from event-sourced orchestration systems. Skipper persists state fields directly — there's no event log to replay. This makes execution leaner, especially for workflows with many signals or long histories, though it trades some auditability for that efficiency.
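The replay mechanism described above can be sketched in a few lines. This is an illustrative model, not Skipper's actual API: `CheckpointedActions`, the string keys, and `realRuns` are all invented names for exposition. The key idea is that an action's result is stored under a deterministic key, so a replay returns the stored result instead of re-running the side effect.

```kotlin
// Minimal sketch of checkpoint-and-replay (illustrative names, not Skipper's API).
class CheckpointedActions {
    private val store = mutableMapOf<String, Any?>()
    var realRuns = 0  // counts non-replayed executions, for demonstration only

    @Suppress("UNCHECKED_CAST")
    fun <T> run(key: String, action: () -> T): T {
        if (key in store) return store[key] as T   // replay: return checkpointed result
        realRuns++
        val result = action()                      // first run: execute the side effect
        store[key] = result                        // checkpoint before returning
        return result
    }
}

// A workflow body re-executed from the top on every replay.
fun publishFlow(actions: CheckpointedActions): String {
    val reviewId = actions.run("submitPhotosForReview") { "review-42" }
    actions.run("activateListing") { }
    return reviewId
}
```

Running `publishFlow` a second time against the same checkpoint store simulates a crash-and-replay: both actions return instantly from their checkpoints, and no side effect runs twice.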

The Happy Path: Getting Out of the Way
Most workflow engines impose overhead on every execution, even when nothing goes wrong. External orchestration engines require network round-trips to a central cluster for every activity invocation.
Skipper takes a different approach. When a workflow starts, two things happen at the database level:
- The workflow instance is created
- A delayed timeout task is scheduled as a durability guarantee
Then the workflow executes entirely in-process. Actions run as normal method calls on an in-memory execution queue, checkpoints are batched, and the workflow can run to completion without any further coordination.
The delayed task acts as a safety net: if the process crashes mid-execution, the persistent scheduler picks up the workflow after a lease period expires and replays it. If the workflow completes normally, the timeout task fires harmlessly and is discarded.
Result: In the happy case (no crashes), Skipper adds very little overhead — just a few database writes. The workflow executes almost as if there were no workflow engine at all.
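The happy-path protocol can be modeled with in-memory stand-ins for the database tables. `MiniEngine`, `WorkflowRow`, and the status strings are assumptions for illustration, not Skipper's schema; the point is the sequencing: write the instance and its timeout task, run in-process, and let completion turn the later timeout check into a no-op.

```kotlin
// In-memory model of the happy-path protocol (names are assumptions, not Skipper's schema).
data class WorkflowRow(val id: String, var status: String = "RUNNING")

class MiniEngine {
    val instances = mutableMapOf<String, WorkflowRow>()
    val timeoutTasks = mutableListOf<String>()   // workflow ids with a pending safety net

    fun start(id: String, body: () -> Unit) {
        instances[id] = WorkflowRow(id)          // 1. create the workflow instance
        timeoutTasks += id                       // 2. schedule the delayed timeout task
        body()                                   // 3. execute entirely in-process
        instances.getValue(id).status = "DONE"   // 4. normal completion
    }

    // What the persistent scheduler does when a timeout task finally fires.
    fun fireTimeout(id: String): String {
        val row = instances.getValue(id)
        return if (row.status == "DONE") "discarded" else "replay"
    }
}
```

A workflow that finished normally yields a harmless, discarded timeout; a row left in `RUNNING` (the crash case) triggers a replay instead.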
Key Tradeoffs
| What You Gain | What You Trade |
|---|---|
| No infrastructure to manage (no separate cluster) | Workflow methods must be deterministic (no side effects, no randomness, no time-dependent logic) |
| Uses existing database (MySQL, DynamoDB) | At-least-once execution — actions may execute more than once in edge cases; actions should be idempotent |
| Simple programming model (Java/Kotlin classes) | Evolution complexity — changing a workflow's structure can break in-flight workflows; versioning strategies needed |
| Independent scaling per service | Not suitable for cross-language or cross-service orchestration |
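The at-least-once row in the table above is worth making concrete. A common pattern, sketched here with invented names (`PayoutService` and the idempotency key are illustrative, not from the source), is to key each action on a stable identifier so a re-execution after a crash becomes a no-op:

```kotlin
// Illustrative idempotent action: a retried call with the same key does nothing.
class PayoutService {
    private val processed = mutableSetOf<String>()
    var totalCents = 0L  // stand-in for the real side effect (API call, DB write)

    fun processPayout(idempotencyKey: String, amountCents: Long) {
        if (!processed.add(idempotencyKey)) return  // already paid out: no-op
        totalCents += amountCents
    }
}
```

With this shape, the engine is free to re-run the action after a crash without double-charging.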
Limitations and Cautions
- Determinism is hard. Developers new to the pattern often accidentally introduce non-determinism (e.g., calling System.currentTimeMillis() inside a workflow method). This breaks replay.
- Idempotency is mandatory. Because actions may execute more than once, every action must be safe to run multiple times.
- Workflow versioning is painful. There's no built-in migration tooling. Teams must create new method versions, migrate traffic, and deprecate old versions.
- Debugging is harder. Log timestamps and call sequences reflect replays, not original execution. Better observability tooling (replay visualization) would help.
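The determinism pitfall above has a standard fix: route the clock through a checkpointed value so every replay sees the timestamp of the original run. `FakeCheckpoints` and the key name below are hypothetical stand-ins for whatever checkpointing facility the engine provides:

```kotlin
// Hypothetical checkpoint store: returns the wall-clock time on first call,
// and the same stored value on every subsequent (replayed) call.
class FakeCheckpoints {
    private val saved = mutableMapOf<String, Long>()
    fun checkpointedTimeMillis(key: String): Long =
        saved.getOrPut(key) { System.currentTimeMillis() }
}

// BAD: each replay reads a fresh clock value, so the computed deadline drifts.
fun deadlineNonDeterministic(): Long =
    System.currentTimeMillis() + 86_400_000L  // now + 24h

// GOOD: the base timestamp is checkpointed, so replays compute the same deadline.
fun deadlineDeterministic(cp: FakeCheckpoints): Long =
    cp.checkpointedTimeMillis("deadline-base") + 86_400_000L
```

The same trick applies to random numbers and UUIDs: generate once inside an action or checkpoint, never inline in the workflow method.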
Next Steps for Learning
If this pattern interests you, consider exploring:
- Temporal — the most popular external orchestration engine for durable execution
- AWS Step Functions — a cloud-managed alternative
- Apache Airflow — for data pipeline orchestration
For a practical deep dive into building distributed systems on AWS, check out our guide on scaling Python to a cloud cluster with Ray.

Conclusion: When to Use an Embedded Workflow Engine
Durable workflow execution is a fundamental capability for reliable distributed systems. Skipper represents a specific point in the design space: an embedded engine that trades centralized orchestration for operational simplicity.
This approach won't fit every situation. But for services seeking durable execution without infrastructure overhead — particularly where minimizing dependencies is paramount — the embedded model offers compelling advantages.
The core insight generalizes beyond Airbnb's implementation: replay-based execution with checkpointed actions can provide durability without coordination services.