The Durable Execution Problem: A Familiar Story

Imagine a multi-step business process: validate a claim, run safety checks, process a payout, send notifications. Halfway through, the server crashes. What happens next?

In traditional architectures, the answer is often "it depends." Maybe the operation times out and triggers duplicate processing. Maybe partial state corrupts what comes next. For workflows spanning minutes, hours, or even days, interruptions are all too likely.

This is the durable execution problem: ensuring that a multi-step process completes reliably, even through failures, restarts, and deployments. The industry has produced several solutions:

  • External orchestration engines (e.g., Temporal, AWS Step Functions) — battle-tested but require dedicated infrastructure and introduce a critical external dependency.
  • Cloud-managed workflow services — eliminate operational overhead but introduce vendor lock-in and regulatory concerns.
  • Homegrown queue-based systems — avoid external dependencies but trade them for bespoke complexity: each team implements its own retry logic, state management, and compensation flows.

Airbnb faced all of these tradeoffs across multiple teams. Instead of having every service re-invent the wheel, they built Skipper — an embedded workflow engine that runs inside each service, using the database the service already depends on.

Source: Airbnb Engineering Blog

[Figure: Skipper's embedded workflow engine architecture, showing the interaction between a service and its database]

Skipper's Core Design: Workflows and Actions

Skipper centers on two abstractions:

  1. Workflows — define the orchestration logic: what happens in what order, and under what conditions.
  2. Actions — encapsulate individual operations (API calls, database updates, notifications). Each action is automatically checkpointed, so its result survives crashes and restarts.

A Concrete Example

Here's a durable workflow for processing a listing's photo review:

class ListingPublicationWorkflow : Workflow() {
    private val actions = actions()

    // Null until the review completes; set by the signal method below
    @StateField var photosApproved: Boolean? = null

    @WorkflowMethod
    suspend fun publishListing(submission: ListingSubmission): PublicationResult {
        // Submit photos for review
        val reviewId = actions.submitPhotosForReview(submission.listingId)

        // Wait for photo review completion (manual or automated);
        // returns true if the 24-hour deadline passes first
        val reviewTimedOut = waitUntil({ photosApproved != null }, Duration.ofHours(24))

        if (reviewTimedOut || photosApproved != true) {
            actions.notifyHost(submission.hostId, "Photos require updates")
            return PublicationResult.rejected("Photo review failed")
        }

        // Publish the listing
        actions.activateListing(submission.listingId)
        actions.notifyHost(submission.hostId, "Your listing is now live!")
        return PublicationResult.success(submission.listingId)
    }

    @SignalMethod
    fun completePhotoReview(approved: Boolean) {
        photosApproved = approved
    }
}

This code reads naturally: submit photos, wait for photo review, publish. There's no retry logic, queue management, or async coordination visible in the workflow itself.

How Durability Works

When a workflow starts, Skipper checkpoints each action's result to the database. If the workflow needs to wait (via waitUntil), Skipper persists the current state and the workflow hibernates, consuming no compute resources.

When conditions change — a signal arrives, a timer expires, or the service restarts — Skipper replays the workflow method from the beginning. Previously executed actions don't re-execute; they return their checkpointed results instantly. The workflow picks up from where it left off.
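The replay mechanism can be sketched in a few lines. This is an illustrative model with invented names, not Skipper's actual API: action results are appended to a durable checkpoint list, and on replay the workflow method runs again from the top while already-checkpointed actions return their stored results instead of re-executing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Minimal sketch of replay-based checkpointing (hypothetical names, not Skipper's API).
class CheckpointedExecutor {
    private final List<Object> checkpoints;  // stands in for rows in the service's database
    private int cursor = 0;                  // position in the checkpoint list during replay

    CheckpointedExecutor(List<Object> persisted) {
        this.checkpoints = persisted;
    }

    @SuppressWarnings("unchecked")
    <T> T runAction(Supplier<T> action) {
        if (cursor < checkpoints.size()) {
            // Replay path: a checkpoint exists, so skip the side effect entirely
            return (T) checkpoints.get(cursor++);
        }
        T result = action.get();   // first execution: actually do the work
        checkpoints.add(result);   // persist the result before moving on
        cursor++;
        return result;
    }
}
```

Running the same workflow method against the same persisted checkpoints after a "crash" performs the side effect only once, which is exactly why the workflow code itself can stay free of retry logic.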

This is fundamentally different from event-sourced orchestration systems. Skipper persists state fields directly — there's no event log to replay. This makes execution leaner, especially for workflows with many signals or long histories, though it trades some auditability for that efficiency.
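The difference is easy to see in miniature. In this sketch (a hypothetical schema, not Skipper's), each signal overwrites the state field's current value in a single snapshot, so resuming a workflow reads one row regardless of how many signals have arrived; an event-sourced engine would instead append every signal to a growing log.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of direct state-field persistence (hypothetical schema, not Skipper's).
class StateFieldStore {
    private final Map<String, Object> snapshot = new LinkedHashMap<>();

    void signal(String field, Object value) {
        snapshot.put(field, value);  // overwrite in place; history is not kept
    }

    Object read(String field) {
        return snapshot.get(field);  // resume path: constant-size read
    }

    int storedEntries() {
        return snapshot.size();      // one entry per field, however many signals arrive
    }
}
```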


The Happy Path: Getting Out of the Way

Most workflow engines impose overhead on every execution, even when nothing goes wrong. External orchestration engines require network round-trips to a central cluster for every activity invocation.

Skipper takes a different approach. When a workflow starts, two things happen at the database level:

  1. The workflow instance is created
  2. A delayed timeout task is scheduled as a durability guarantee

Then the workflow executes entirely in-process. Actions run as normal method calls on an in-memory execution queue, checkpoints are batched, and the workflow can run to completion without any further coordination.

The delayed task acts as a safety net: if the process crashes mid-execution, the persistent scheduler picks up the workflow after a lease period expires and replays it. If the workflow completes normally, the timeout task fires harmlessly and is discarded.

Result: In the happy case (no crashes), Skipper adds very little overhead — just a few database writes. The workflow executes almost as if there were no workflow engine at all.
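The bookkeeping behind this safety net can be sketched as follows. The names are illustrative, not Skipper's API: starting a workflow creates the instance and schedules a timeout task, and the persistent scheduler replays a workflow only if the task fires before the workflow has completed.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Sketch of the happy-path bookkeeping (hypothetical names, not Skipper's API).
class WorkflowStore {
    record TimeoutTask(String workflowId, Instant fireAt) {}

    final List<TimeoutTask> scheduledTasks = new ArrayList<>();
    final List<String> completed = new ArrayList<>();

    void start(String workflowId, Duration lease) {
        // Two writes up front: create the instance row (elided here)
        // and schedule the delayed timeout task as the durability guarantee
        scheduledTasks.add(new TimeoutTask(workflowId, Instant.now().plus(lease)));
    }

    void complete(String workflowId) {
        completed.add(workflowId);
    }

    // Called by the persistent scheduler once a task's fireAt has passed
    boolean shouldReplay(TimeoutTask task) {
        // A completed workflow makes the timeout a harmless no-op
        return !completed.contains(task.workflowId());
    }
}
```

Everything between `start` and `complete` runs in-process with no further coordination, which is where the low happy-path overhead comes from.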

Key Tradeoffs

  • Gain: no infrastructure to manage (no separate cluster). Trade: workflow methods must be deterministic (no side effects, no randomness, no time-dependent logic).
  • Gain: uses the service's existing database (MySQL, DynamoDB). Trade: at-least-once execution; actions may run more than once in edge cases, so they should be idempotent.
  • Gain: simple programming model (Java/Kotlin classes). Trade: evolution complexity; changing a workflow's structure can break in-flight workflows, so versioning strategies are needed.
  • Gain: independent scaling per service. Trade: not suitable for cross-language or cross-service orchestration.

Limitations and Cautions

  • Determinism is hard. Developers new to the pattern often accidentally introduce non-determinism (e.g., calling System.currentTimeMillis() inside a workflow method). This breaks replay.
  • Idempotency is mandatory. Because actions may execute more than once, every action must be safe to run multiple times.
  • Workflow versioning is painful. There's no built-in migration tooling. Teams must create new method versions, migrate traffic, and deprecate old versions.
  • Debugging is harder. Log timestamps and call sequences reflect replays, not original execution. Better observability tooling (replay visualization) would help.
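The idempotency requirement is worth making concrete. One common pattern, sketched here with hypothetical names (this is not Skipper's API), is to key each action invocation on a deduplication token so that a replayed invocation returns the stored result instead of repeating the side effect:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of an idempotent action using a deduplication key (hypothetical names).
class PayoutAction {
    private final Map<String, String> processed = new HashMap<>();  // key -> payout id
    int sideEffects = 0;  // counts actual payouts, for illustration only

    String process(String idempotencyKey, long amountCents) {
        String existing = processed.get(idempotencyKey);
        if (existing != null) {
            return existing;   // duplicate invocation: no second payout
        }
        sideEffects++;         // the real money movement would happen here
        String payoutId = "payout-" + idempotencyKey;
        processed.put(idempotencyKey, payoutId);
        return payoutId;
    }
}
```

In production the `processed` map would live in the same database as the workflow state, so the dedup check survives restarts along with everything else.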

Next Steps for Learning

If this pattern interests you, consider exploring:

  • Temporal — the most popular external orchestration engine for durable execution
  • AWS Step Functions — a cloud-managed alternative
  • Apache Airflow — for data pipeline orchestration

For a practical deep dive into building distributed systems on AWS, check out our guide on scaling Python to a cloud cluster with Ray.


Conclusion: When to Use an Embedded Workflow Engine

Durable workflow execution is a fundamental capability for reliable distributed systems. Skipper represents a specific point in the design space: an embedded engine that trades centralized orchestration for operational simplicity.

This approach won't fit every situation. But for services seeking durable execution without infrastructure overhead — particularly where minimizing dependencies is paramount — the embedded model offers compelling advantages.

The core insight generalizes beyond Airbnb's implementation: replay-based execution with checkpointed actions can provide durability without coordination services.
