Building Stateful Serverless: Durable Lambda
I built Durable Lambda as a learning project to explore actor-based state management on AWS serverless infrastructure. It brings Cloudflare Durable Objects-style guarantees to Lambda—single-instance execution, distributed locking, and automatic persistence—using only DynamoDB, SQS, and EventBridge.
The Problem
Serverless scales horizontally, but stateful applications create headaches:
- Race conditions in concurrent Lambda executions
- Distributed state consistency issues
- Loss of single-instance execution guarantees
- Manual coordination of state mutations
I needed a system that could guarantee only one Lambda ever mutates an actor's state, while keeping everything distributed and scalable.
The Architecture
Durable Lambda uses a layered approach:
1. Distributed Locking Layer
At the core is DynamoDB-based distributed locking. Before any Lambda executes actor code, it must acquire a lock:
```typescript
try {
  await ddb.send(
    new PutCommand({
      TableName: "DurableLocks",
      Item: {
        actorId,
        lockHolder: lambdaInstanceId,
        expiresAt: now + ttlSeconds * 1000, // epoch millis, checked in the condition
        ttl: Math.floor((now + ttlSeconds * 1000) / 1000), // DynamoDB TTL uses epoch seconds
      },
      ConditionExpression:
        "attribute_not_exists(actorId) OR expiresAt < :now",
      ExpressionAttributeValues: { ":now": now },
    }),
  )
  return true
} catch (error) {
  if (error.name === "ConditionalCheckFailedException") return false
  throw error
}
```
The key is the conditional expression: the put only succeeds if the lock doesn't exist or has already expired. Because the check and the write happen atomically inside DynamoDB, two Lambdas can never both hold the same actor's lock.
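The acquire-if-absent-or-expired semantics can be sketched without AWS at all, with an in-memory map standing in for the DynamoDB table (`tryAcquire` and `LockRow` are illustrative names, not part of the library):

```typescript
// In-memory stand-in for the DurableLocks table.
interface LockRow {
  lockHolder: string
  expiresAt: number // epoch millis, mirrors the DynamoDB attribute
}

// Mirrors the ConditionExpression: succeed only if no lock row
// exists for the actor OR the existing one has already expired.
function tryAcquire(
  locks: Map<string, LockRow>,
  actorId: string,
  holder: string,
  now: number,
  ttlMs = 30_000,
): boolean {
  const existing = locks.get(actorId)
  if (existing && existing.expiresAt >= now) return false // live lock held elsewhere
  locks.set(actorId, { lockHolder: holder, expiresAt: now + ttlMs })
  return true
}
```

The difference from the real thing: here the check and the write are only safe because the map is single-threaded, whereas DynamoDB makes the pair atomic via the conditional `PutItem`.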
2. State Persistence with Versioning
Actor state lives in DynamoDB with optimistic locking via versions:
```typescript
async function save<T>(state: DurableState<T>) {
  const nextVersion = state.version + 1
  await ddb.send(
    new PutCommand({
      TableName: tableName,
      Item: {
        ...state,
        version: nextVersion,
        updatedAt: new Date().toISOString(),
      },
      ConditionExpression: "attribute_not_exists(version) OR version = :v",
      ExpressionAttributeValues: { ":v": state.version },
    }),
  )
  state.version = nextVersion
}
```
If the version doesn't match, the write fails—preventing concurrent modifications from different Lambda instances.
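The standard recovery when such a write fails is reload-and-reapply: fetch the fresh state, rerun the mutation, and try the save again. A minimal in-memory sketch of that loop (all names here are illustrative, not the project's API):

```typescript
interface Versioned<T> { data: T; version: number }

// In-memory stand-in for the state table: one row per actor.
class StateStore<T> {
  private rows = new Map<string, Versioned<T>>()

  load(actorId: string, initial: T): Versioned<T> {
    return this.rows.get(actorId) ?? { data: initial, version: 0 }
  }

  // Mirrors the ConditionExpression "version = :v": the write only
  // lands if nobody bumped the version since this state was loaded.
  save(actorId: string, state: Versioned<T>): boolean {
    const current = this.rows.get(actorId)
    if (current && current.version !== state.version) return false // conflict
    this.rows.set(actorId, { data: state.data, version: state.version + 1 })
    return true
  }
}

// Optimistic-locking retry loop: reload, reapply the mutation, re-save.
function updateWithRetry<T>(
  store: StateStore<T>,
  actorId: string,
  initial: T,
  mutate: (data: T) => T,
  maxAttempts = 5,
): T {
  for (let i = 0; i < maxAttempts; i++) {
    const state = store.load(actorId, initial)
    state.data = mutate(state.data)
    if (store.save(actorId, state)) return state.data
  }
  throw new Error(`gave up after ${maxAttempts} conflicting writes`)
}
```

In the real system the conflict signal is a `ConditionalCheckFailedException` rather than a boolean, but the retry shape is the same.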
3. Message Ordering with SQS FIFO
For async operations, messages queue in SQS FIFO with per-actor ordering:
```typescript
await sqs.send(
  new SendMessageCommand({
    QueueUrl: queueUrl,
    MessageGroupId: actorId, // Ensures ordering per actor
    MessageDeduplicationId: eventId,
    MessageBody: JSON.stringify({ actorId, eventId, payload }),
  }),
)
```
Messages group by actorId, guaranteeing order. The Lambda function polls this queue and processes messages sequentially per actor.
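The per-actor serialization that FIFO groups provide can be sketched in memory: chain each actor's work on its own promise, so same-actor messages run strictly one at a time and in order while different actors proceed concurrently. `PerActorSerializer` is an illustrative name, not part of the library:

```typescript
// Serializes async work per actor: each actor gets its own promise
// chain, so tasks for one actor never overlap, while distinct actors
// are free to interleave -- the same guarantee MessageGroupId gives.
class PerActorSerializer {
  private tails = new Map<string, Promise<void>>()

  enqueue(actorId: string, task: () => Promise<void>): Promise<void> {
    const tail = this.tails.get(actorId) ?? Promise.resolve()
    const next = tail.then(task)
    // Swallow errors in the stored chain so one failed task
    // doesn't wedge every later task for that actor.
    this.tails.set(actorId, next.catch(() => {}))
    return next
  }
}
```

In production the ordering comes from SQS itself; a sketch like this is only useful for reasoning about (or unit-testing) the guarantee.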
4. Event Bus for Signals and Timers
EventBridge handles inter-actor communication and scheduled operations:
```typescript
await bus.send(
  new PutEventsCommand({
    Entries: [
      {
        EventBusName: busName,
        Source: "durable.lambda",
        DetailType: "TimerFired",
        Detail: JSON.stringify({ actorId, delayMs }),
        Time: new Date(Date.now() + delayMs),
      },
    ],
  }),
)
```
EventBridge rules forward timer events and signals back to the Lambda, triggering actor execution.
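The forwarding rule boils down to an EventBridge event pattern matched against the `Source` and `DetailType` fields from the snippet above. A sketch of such a pattern (the `"Signal"` detail type is an assumption on my part; the project's exact rule definitions may differ):

```json
{
  "source": ["durable.lambda"],
  "detail-type": ["TimerFired", "Signal"]
}
```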
Invocation Modes
Synchronous Calls
For sync calls, the handler loads state immediately, executes the user function, and returns the result:
```typescript
if (sync) {
  const state = await runtime.state.load<T>(actorId)
  const ctx = {
    actorId,
    state: state.data,
    version: state.version,
    save: async (ns?: T) => {
      state.data = ns ?? ctx.state
      await runtime.state.save(state)
    },
  }
  const result = await fn(ctx, userPayload)
  // ?? rather than ||, so falsy-but-valid results like 0 or "" pass through
  return result ?? { ok: true, state: ctx.state }
}
```
Latency: 50-100ms (lock acquisition + state load + execution + save).
Asynchronous Calls
For async calls, the message is enqueued to SQS (via the send shown earlier) and the caller returns immediately. When SQS later triggers the Lambda, the consumer side invokes the actor function:

```typescript
await fn(runtime as any, event)
return { ok: true, processed: true, actorId }
```
SQS triggers the Lambda later via event source mapping, processing messages in order per actor.
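A common shape for that consumer (treat this as a sketch, not the project's actual handler) is partial batch failure reporting: process records in order and hand the failed message IDs back to SQS, so only those messages are redelivered. This relies on the real `ReportBatchItemFailures` setting on the event source mapping:

```typescript
interface SqsRecord { messageId: string; body: string }
interface BatchResponse { batchItemFailures: { itemIdentifier: string }[] }

// Processes a batch in order; any record whose handler throws is
// reported back so SQS redelivers only the failed messages.
async function handleBatch(
  records: SqsRecord[],
  process: (body: string) => Promise<void>,
): Promise<BatchResponse> {
  const failures: { itemIdentifier: string }[] = []
  for (const record of records) {
    try {
      await process(record.body)
    } catch {
      failures.push({ itemIdentifier: record.messageId })
    }
  }
  return { batchItemFailures: failures }
}
```

One FIFO-specific caveat: to preserve ordering, a production handler should also report every *later* record in the same message group as failed once one record in that group fails, rather than skipping past it as this sketch does.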
The Runtime
I implemented a minimal but complete runtime with core services:
- StateService - Load/save state with versioning
- LockService - DynamoDB-based distributed locks with TTL
- QueueService - SQS message batching and coalescing
- EventService - EventBridge timers and signals
- WorkflowService - Multi-step workflows with state tracking
Each service is strictly typed with TypeScript interfaces to prevent type-unsafe operations.
Infrastructure with CDK
The DurableFabric construct provisions everything:
```typescript
const fabric = new DurableFabric(this, "DurableFabric", {
  lambdas: [handler1, handler2],
  prefix: "MyApp",
})
```
This creates:
- 3 DynamoDB tables (State, Locks, Workflows)
- 1 SQS FIFO queue with dead-letter queue
- 1 EventBridge bus with rules for timers and signals
- Proper IAM roles and permissions
- Automatic environment variable injection
Key Technical Decisions
Why DynamoDB locks over Redis? - No external service dependency. Conditional puts are atomic. TTL auto-cleanup prevents lock leaks.
Why SQS FIFO over Kinesis? - Built-in ordering per partition key (actorId). Lower cost. Dead-letter queue handling included.
Why EventBridge over SNS/custom scheduler? - Time-based delivery native. Rules-based filtering. Clean separation between sync and async paths.
Why version numbers over timestamps? - Prevents ABA problem. Deterministic conflict detection. Works even with clock skew.
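The ABA hazard behind that last decision is easy to show concretely: if conflict detection compares values (or coarse timestamps), a sequence that changes A to B and back to A looks like "no change", while a monotonically increasing version still catches it. A minimal sketch with illustrative names:

```typescript
interface Cell { value: string; version: number }

// Value-based compare-and-swap: vulnerable to ABA.
function casByValue(cell: Cell, expected: string, next: string): boolean {
  if (cell.value !== expected) return false
  cell.value = next
  return true
}

// Version-based compare-and-swap: every write bumps the counter, so an
// intervening A -> B -> A is detected even though the value looks unchanged.
function casByVersion(cell: Cell, expectedVersion: number, next: string): boolean {
  if (cell.version !== expectedVersion) return false
  cell.value = next
  cell.version += 1
  return true
}
```

A writer that loaded the state at version 0 is correctly rejected after the A-B-A dance, whereas a value-only check would let its stale write through.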
Type Safety Throughout
I stripped all `any` types from the core library:
```typescript
interface DurableEvent extends Record<string, unknown> {
  actorId: string
  payload: Record<string, unknown>
  sync?: boolean
}

export interface WorkflowService {
  create(): Promise<string>
  resolve(id: string, output: Record<string, unknown>): Promise<void>
  get(id: string): Promise<Record<string, unknown> | undefined>
}
```
This prevents runtime errors at compile time.
CLI Tool with Bun
I built the CLI using Bun's shell API:
```typescript
import { $ } from "bun"

await $`git clone https://github.com/0xdsqr/durable-lambda.git --depth 1 ${projectPath}/.tmp`
await $`cp -r ${projectPath}/.tmp/examples/basic/. ${projectPath}/`
```
Bun executes shell commands with TypeScript interpolation, no subprocess library needed.
What You Get
- 6 Working Examples - Counter, rate limiter, leaderboard, order processing, distributed cron, circuit breaker
- Nix Environment - Reproducible dev setup with Bun, TypeScript, AWS CDK, Biome
- Monitoring Guide - CloudWatch log queries, DynamoDB metrics, SQS depth tracking
- CDK Construct - Deploy with one line of code
- Full Type Safety - Zero `any` types in core library
Performance Characteristics
| Operation | Typical value |
|---|---|
| Lock acquisition | 5-10ms |
| State load/save | 10-20ms |
| Sync call end-to-end | 50-100ms |
| Async message processing | 100-200ms |
| Max state size | 400KB (DynamoDB limit) |
Learning Outcomes
Building this taught me:
- How distributed locking prevents race conditions
- Actor model semantics and guarantees
- DynamoDB condition expressions for atomic operations
- SQS FIFO ordering guarantees
- Bun's performance and developer experience
- CDK construct composition
This is v1.0.0—a learning project that actually works for stateful serverless applications.
Check out the GitHub repo for full documentation and deployment guide.

