Scheduled Prompts, From First Principles

How do you keep a perfectly reliable promise using machines that fail all the time?

Here's a feature that sounds trivial and turns out to be a distributed-systems problem in disguise: let a user save a prompt and a schedule ("every Monday at 9am, summarize last week's portfolio news and email me") and have the backend run it, on time, forever.

Most teams reach for a cron library and ship something that demos fine and quietly loses work in production. This post takes the opposite path. We'll build the feature the dumbest obvious way, watch it break with concrete crashes, and fix exactly one thing at a time, until the design makes sense on its own terms instead of needing to be memorized.

Almost the entire design is one idea applied five times.

TL;DR: Split "deciding when" from "doing the work." Make the handoff between them a durable database row, not a function call. Then everything follows from taking that one rule seriously: durability, retries, scaling, exactly the right number of runs.

Here's the whole journey: a ladder of attempts, where each rung breaks and forces a fix.

The promise, and why it's secretly hard

Sarah is an analyst. She sets this up once and never thinks about it again:

"Every Monday at 9:00 am, run: summarize last week's portfolio news and email me a brief."

The promise is a brief in her inbox, every Monday, forever. The trouble is what that promise sits on top of:

We deploy the backend several times a week, and every deploy kills every running process.
Pods get evicted, OOM-killed, and rescheduled by Kubernetes without warning.
The AI run is slow (minutes) and fails often (LLM timeouts, rate limits, flaky tools).
At 9:00 Monday, hundreds of users may fire at once.

The question isn't just "how do I run a prompt on a timer." It's how you keep that promise when everything around you keeps breaking. Five truths will wreck each attempt:

1. A process can die between any two instructions. Assume it will, at the worst moment.
2. When a process dies, its memory dies with it. Anything not written to the database is gone.
3. You can't atomically "change the database and call an external service." One can happen without the other. (This is the dual-write problem, which bites hard in Attempt 7.)
4. The AI is non-deterministic. Run the same prompt twice and it may do different things.
5. Clocks on different machines disagree.

Attempt 0: the timer in memory

The first thing everyone writes: when Sarah saves her schedule, start a background task that sleeps until 9am and then runs the prompt.

async def on_schedule_created(sched):
    await sleep_until(sched.next_time)   # wait until Monday 9am
    await run_the_prompt(sched.prompt)   # then run it

It works in a demo. Then you deploy on Friday. The deploy restarts every pod, and every sleep_until was living in memory (truth #2). They all vanish. Monday comes and nothing happens, and nothing is even logged, because nothing remembers Sarah had a schedule at all.

Aha #1: A schedule that lives in memory dies on the next deploy. It has to be written down somewhere that outlives any process: a database row.

So schedules go in a Postgres table, each row remembering the prompt and the next time it's due.

Attempt 1: a checker loop in the API

Now we add a loop that wakes every few seconds and asks "anything due?" We put it in the API server, since it's already running.

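async def checker_loop():
    while True:
        for s in db.query("SELECT * FROM schedules WHERE next_run_at <= now()"):
            await run_the_prompt(s.prompt)     # ← runs RIGHT HERE, in the API process
            s.next_run_at = compute_next(s)    # advance to the next slot (assumes the ORM saves it)
        await sleep(5)                         # poll again in 5 seconds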

Schedules now survive restarts. But run_the_prompt is a 5-minute AI job, and it's running inside the API process, on the same event loop that serves the website. At 9am Monday, 300 due schedules mean 300 five-minute AI jobs fighting Sarah's portal clicks for the same CPU. The website falls over.

Aha #2: The thing that notices work and the thing that does the (slow, heavy) work must be separate processes. Noticing is cheap; doing is expensive. Share a process and the expensive one starves everything.

We split into a tiny scheduler (watches the clock) and a pool of workers (run prompts). Now we can scale workers freely without touching the website. But how does the scheduler hand work to a worker? That handoff is the whole ballgame.

Attempt 2: the handoff (the most important step)

The obvious handoff: the scheduler advances "next Monday" and kicks off the run on a worker.

for s in due:
    s.next_run_at = compute_next(s)         # advance the schedule   → saved to disk
    worker.start_in_background(s.prompt)     # tell a worker to run it → lives in memory

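Now watch truth #1 (die anytime) meet truth #2 (memory dies). The scheduler saves the advanced next_run_at and hands the prompt to a worker, and then the worker's process dies before the run finishes (or before it even starts). The in-memory task is gone, and the schedule already points at next Monday, so nothing will ever notice that this Monday's run was lost.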

The root cause: the handoff was a function call, and function calls live in memory. The fix is to make the work item a row, written before anyone acts on it, and to advance the schedule in the same transaction, so there's no gap to crash into:

with one_transaction():
    db.insert("prompt_runs", status="PENDING", schedule=s.id, slot="2026-06-01T09:00")
    s.next_run_at = compute_next(s)
# commit: either BOTH happen, or NEITHER.

Aha #3: The handoff itself must be a durable row, not a function call. The scheduler's whole job becomes "write a row that says a run is owed." A worker's whole job becomes "find an owed row and do it." Nothing is ever in-flight purely in memory.

We now have two tables: the plan (scheduled_prompts) and the to-do list (prompt_runs), which is simultaneously the work queue and the history.
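To make the "both or neither" claim concrete, here is a minimal sketch of the handoff. It uses an in-memory SQLite database as a stand-in for Postgres so it runs anywhere; the column names, the hand_off helper, and the crash flag are assumptions made for illustration, not the production schema.

```python
import sqlite3

def make_db():
    db = sqlite3.connect(":memory:")   # stand-in for Postgres
    db.executescript("""
        CREATE TABLE scheduled_prompts (
            id INTEGER PRIMARY KEY, prompt TEXT, next_run_at TEXT
        );
        CREATE TABLE prompt_runs (
            id INTEGER PRIMARY KEY,
            schedule_id INTEGER, slot TEXT, status TEXT,
            UNIQUE (schedule_id, slot)   -- at most one run per schedule per slot
        );
    """)
    return db

def hand_off(db, schedule_id, slot, next_slot, crash=False):
    """Record 'a run is owed' and advance the schedule in ONE transaction."""
    with db:  # commits on success, rolls back if an exception escapes
        db.execute(
            "INSERT INTO prompt_runs (schedule_id, slot, status) "
            "VALUES (?, ?, 'PENDING')",
            (schedule_id, slot),
        )
        if crash:
            # Simulates dying between the two writes.
            raise RuntimeError("simulated crash between the two writes")
        db.execute(
            "UPDATE scheduled_prompts SET next_run_at = ? WHERE id = ?",
            (next_slot, schedule_id),
        )
```

If the simulated crash fires, the inserted run row is rolled back along with everything else, so the schedule is still due and the next scheduler tick simply tries again. The UNIQUE constraint is an extra guard I've assumed here: even a buggy scheduler can't owe the same slot twice.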

You might ask: why does the scheduler advance the schedule, not the worker? Because advancing the clock is about time passing, not work finishing. If the worker owned that step, then a run that's slow, stuck, or failing would freeze the schedule: a single failed Monday would leave next_run_at parked in the past and next Monday would never be scheduled. Keeping the cursor with the scheduler means last week's failure has zero effect on next week's tick. Scheduler owns time; workers own work.

Attempt 3: many workers grab the same row

We need several workers for the 9am rush. But three of them run "give me a PENDING run" at the same instant and all grab Sarah's row, so it runs three times, three emails, triple cost. You might reach for a separate lock service. You don't need one; Postgres does it natively:

SELECT * FROM prompt_runs
WHERE status = 'PENDING'
LIMIT 1
FOR UPDATE SKIP LOCKED;         -- lock my row; skip rows already locked by others

Aha #4: The database is already a perfect traffic cop. "Claim a job" is just "lock a row and skip the locked ones."
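A rough sketch of what a claim could look like from a worker. The POSTGRES_CLAIM string shows one common way to combine the lock with the status flip in Postgres (it is not executed here, and the columns are assumptions). Because SQLite has no SKIP LOCKED, the runnable part swaps in a different technique, a conditional UPDATE, to show the same guarantee: of several claimants, exactly one wins.

```python
import sqlite3

# One possible Postgres form (assumed column names, shown for reference only).
POSTGRES_CLAIM = """
UPDATE prompt_runs SET status = 'RUNNING'
WHERE id = (
    SELECT id FROM prompt_runs
    WHERE status = 'PENDING'
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id;
"""

def claim_one(db, worker):
    """Try to claim one PENDING run. Returns its id, or None if none was won."""
    row = db.execute(
        "SELECT id FROM prompt_runs WHERE status = 'PENDING' LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    with db:
        # The WHERE clause re-checks the status, so if another worker got
        # there first this UPDATE matches zero rows and we lose the race.
        cur = db.execute(
            "UPDATE prompt_runs SET status = 'RUNNING', claimed_by = ? "
            "WHERE id = ? AND status = 'PENDING'",
            (worker, row[0]),
        )
    return row[0] if cur.rowcount == 1 else None
```

The difference matters under load: with the conditional UPDATE, losers have to loop and try another row, while SKIP LOCKED lets each worker land on a different row in a single statement.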

Attempt 4: a worker dies mid-run

A worker claims Sarah's run (marks it RUNNING) and starts the 5-minute job. Two minutes in, it's OOM-killed. The row is now stuck at RUNNING forever, and no other worker will touch it, because it isn't PENDING anymore. Lost work again, different fingerprint.

The mistake was treating "claimed" as permanent ownership. A claim must be able to expire.

The worker holds a short lease (say 3 minutes) and must renew it every 30s (a heartbeat) to prove it's alive. A healthy worker feeds it and keeps the run, even a 45-minute one. A dead worker stops, the lease lapses, and a small reaper process flips the run back to PENDING.

Aha #5: A claim is a lease, not a deed. The lease (3 min) is unrelated to how long the work takes (up to 60 min). It measures "is the worker breathing?", renewed constantly, so a crash is caught in about 3 minutes no matter how long the job would have run.
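Here is a small in-memory sketch of the lease mechanics, assuming time is passed in explicitly as seconds so the behavior is easy to follow. The Run class and the function names are hypothetical; in the real system these would be columns on the run row and UPDATE statements.

```python
from dataclasses import dataclass
from typing import List, Optional

LEASE_SECONDS = 180   # the "say 3 minutes" lease from the text

@dataclass
class Run:
    status: str = "PENDING"
    lease_expires_at: Optional[float] = None

def claim(run: Run, now: float) -> None:
    """Take the run and start its lease."""
    run.status = "RUNNING"
    run.lease_expires_at = now + LEASE_SECONDS

def heartbeat(run: Run, now: float) -> None:
    """Called by a live worker every ~30s: push the expiry forward."""
    run.lease_expires_at = now + LEASE_SECONDS

def reap(runs: List[Run], now: float) -> int:
    """Flip RUNNING runs whose lease has lapsed back to PENDING."""
    freed = 0
    for run in runs:
        expired = run.lease_expires_at is not None and run.lease_expires_at < now
        if run.status == "RUNNING" and expired:
            run.status, run.lease_expires_at = "PENDING", None
            freed += 1
    return freed
```

Note that nothing here knows how long the job takes. A worker that heartbeats at 30s intervals always has an expiry roughly 3 minutes ahead, whether the job is 5 minutes or 45.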

Attempt 5: the frozen worker wakes up

The lease introduces a sneaky bug. Worker A claims the run, then freezes (a long GC pause), not dead, just stuck. Its lease lapses, the reaper frees the run, Worker B claims it and starts fresh. Then A unfreezes and writes "done, here's my result," right on top of B's live run.

Aha #6: Each attempt gets a one-time stamp (attempt_id), and only the current stamp may write. The zombie's write is tagged #A1, the system expects #A2, so it's rejected harmlessly. This is a fencing token: what makes lease-based recovery safe instead of merely hopeful.
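The check itself is tiny. A minimal sketch, assuming attempt_id is a random token minted at claim time (the Run class and the finish function are hypothetical names for illustration):

```python
import uuid

class Run:
    def __init__(self):
        self.status = "PENDING"
        self.attempt_id = None
        self.result = None

def claim(run):
    """Start a new attempt; the fresh attempt_id invalidates every older one."""
    run.status = "RUNNING"
    run.attempt_id = uuid.uuid4().hex
    return run.attempt_id

def finish(run, attempt_id, result):
    """Accept the write only if it carries the current attempt's stamp.

    In SQL this would be a single guarded statement along the lines of
    UPDATE ... WHERE id = ? AND attempt_id = ?  (an assumed shape).
    """
    if run.status != "RUNNING" or run.attempt_id != attempt_id:
        return False   # stale (zombie) writer: reject and change nothing
    run.status, run.result = "SUCCEEDED", result
    return True
```

Worker A's late write carries the old stamp, so finish returns False and B's attempt is untouched. The guard has to live in the same statement as the write; checking first and writing second would reopen the gap.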

Attempt 6: the AI call just fails

LLM timeouts, rate limits, a flaky tool. The run throws. Drop it and Sarah silently gets nothing. Retry instantly forever and you hammer a struggling service while a broken prompt loops, burning money.

Aha #7: Retry, but back off (wait longer each time, with a little randomness), and give up into a visible pile, never into silence.

attempt 1 fails → wait ~2 min
attempt 2 fails → wait ~4 min
attempt 3 fails → DEAD_LETTER → alert + tell Sarah it didn't work

The dead-letter pile is the parking lot for runs that failed even after all retries, visible, alertable, replayable. Failure becomes information, not a void.
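As a sketch, the retry decision can be one small function. The base delay and attempt limit follow the schedule above; the ±10% jitter and the on_failure name are assumptions for illustration.

```python
import random

BASE_DELAY_S = 120   # ~2 min after the first failure
MAX_ATTEMPTS = 3     # the third failure goes to the dead-letter pile
JITTER = 0.1         # +/-10% randomness (an assumed value)

def on_failure(attempt, rng=random):
    """Decide what happens after attempt number `attempt` (1-based) fails.

    Returns ("RETRY", delay_seconds) or ("DEAD_LETTER", None).
    """
    if attempt >= MAX_ATTEMPTS:
        return ("DEAD_LETTER", None)
    delay = BASE_DELAY_S * 2 ** (attempt - 1)    # 120s, then 240s
    delay *= 1 + rng.uniform(-JITTER, JITTER)    # spread retries apart
    return ("RETRY", delay)
```

The jitter is there for the 9am rush: if hundreds of runs fail together because a provider is rate-limiting, you don't want them all retrying in the same second.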

Attempt 7: a retry does things twice

This is where truth #4 (non-determinism) and truth #3 (the dual-write problem) come for us. A retry re-runs the whole prompt from scratch, so anything visible the first attempt did before dying happens again: a duplicate email, duplicated chat history, a tool action fired twice.

The first two have the same shape of fix: don't do the side effect inside the retryable work. Do it once, at the moment of success. (The AI loop never sends the email; each attempt gets a fresh conversation, and only the winner is kept.) The third is avoided by giving the retryable loop only read-only tools, so re-running it repeats no action.

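But "send the email once, after success" hides the deepest trap. When exactly?

Option A: send the email, then mark the run SUCCEEDED. Crash in between and the run still looks unfinished, so the retry sends a second email.
Option B: mark the run SUCCEEDED, then send the email. Crash in between and the run looks finished, so the email is never sent.

You can't win with A or B, because "update a row" and "call an email service" can't be made atomic. So: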

Aha #8: If you can't atomically change a row and call a service, turn the service call into a row too. In the same transaction that marks the run SUCCEEDED, write an "email owed" row into an outbox table. A separate notifier drains it and sends, retrying until it sticks.

Notice this is the exact same move as Aha #3: back then we made "a run is owed" durable; now we make "an email is owed" durable. Same problem, same solution.
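A minimal sketch of the outbox, again with in-memory SQLite standing in for Postgres. The table layout, mark_succeeded, drain_outbox, and the send callback are all assumptions for illustration.

```python
import sqlite3

def make_db():
    db = sqlite3.connect(":memory:")   # stand-in for Postgres
    db.executescript("""
        CREATE TABLE prompt_runs (id INTEGER PRIMARY KEY, status TEXT, result TEXT);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY,
            run_id INTEGER UNIQUE,     -- one owed email per run
            body TEXT,
            sent INTEGER DEFAULT 0
        );
    """)
    return db

def mark_succeeded(db, run_id, result):
    """Mark the run done AND record the owed email in one transaction."""
    with db:
        db.execute(
            "UPDATE prompt_runs SET status = 'SUCCEEDED', result = ? WHERE id = ?",
            (result, run_id),
        )
        db.execute(
            "INSERT INTO outbox (run_id, body) VALUES (?, ?)", (run_id, result)
        )

def drain_outbox(db, send):
    """Send every unsent row; mark a row sent only after `send` returns."""
    rows = db.execute("SELECT id, body FROM outbox WHERE sent = 0").fetchall()
    delivered = 0
    for row_id, body in rows:
        try:
            send(body)
        except Exception:
            continue   # leave it unsent; the next drain retries it
        with db:
            db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
        delivered += 1
    return delivered
```

One caveat worth stating plainly: the notifier can still crash after send returns and before the row is marked sent, so delivery is at least once, not exactly once. Closing that last gap would need the email provider to deduplicate, for example on the outbox row id, which is outside what this sketch shows.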

Attempt 8: the calendar edge cases

A few quicker breaks, all about correct scheduling rather than reliability:

The platform was down over Sarah's 9am slot. On recovery, do we run it once, skip it, or run every missed slot? That's a misfire policy (default: run once).
"9am" in whose time zone (truth #5)? We store Sarah's zone and compute the next fire time in it, so 9am stays 9am across daylight-saving changes.
The previous weekly run is still going when the next is due. An overlap policy decides: skip, run anyway, or queue.

Aha #9: "Scheduling" isn't one cron calculation; it's a small set of policies for the messy edges. Exposing them is what makes it customizable.
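To show what "a small set of policies" can mean in code, here is a sketch of the misfire policy alone. It assumes a fixed weekly interval in UTC to stay short; a real version would compute each slot in the user's stored time zone. The policy names and the choice that "run once" means the most recent missed slot are my assumptions.

```python
from datetime import datetime, timedelta, timezone

WEEK = timedelta(weeks=1)

def slots_to_run(next_run_at, now, policy="run_once"):
    """Return (slots to enqueue, new next_run_at) after possible downtime.

    Policy names are illustrative: "run_once", "skip", "run_all".
    """
    missed = []
    while next_run_at <= now:        # walk forward over every slot we slept through
        missed.append(next_run_at)
        next_run_at += WEEK
    if policy == "skip":
        return [], next_run_at
    if policy == "run_once":
        return missed[-1:], next_run_at   # only the most recent missed slot
    if policy == "run_all":
        return missed, next_run_at
    raise ValueError(f"unknown misfire policy: {policy}")
```

Whatever the policy, the schedule's cursor always ends up in the future, so a long outage can't leave next_run_at parked in the past.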

The punchline: it was one idea the whole time

We added a lot of machinery: a to-do list, claiming, leases, heartbeats, a reaper, fencing tokens, retries, an outbox. But nearly all of it is the same single idea:

Anything that exists only in memory can vanish in a crash. So turn it into a durable row, and make the transition atomic.

The second, smaller principle that keeps recurring: because we deliver at least once, every repeatable step must be safe to repeat (one email not three, a fresh conversation per try, read-only tools).

The finished picture

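Now the architecture isn't a list to memorize. Every piece is an old friend, labeled with the failure it prevents:

Scheduler: writes a PENDING run row and advances next_run_at in one transaction, so a crash can't lose the handoff.
Workers: claim rows with FOR UPDATE SKIP LOCKED, so the 9am rush never runs one row three times.
Lease, heartbeat, and reaper: a dead worker's run returns to PENDING instead of sticking at RUNNING.
Fencing token (attempt_id): a frozen worker that wakes up can't overwrite the live attempt.
Backoff and dead-letter: failures retry politely, then land in a visible pile instead of silence.
Outbox and notifier: the email is owed as a row, so it can't be lost between the database and the email service.
Misfire, time-zone, and overlap policies: the calendar's messy edges have explicit answers.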

And we didn't over-engineer the queue: this all lives in Postgres, because the work is rare and slow. The bottleneck is the AI run, never the queue. One store means the plan, the queue, the history, the retries, and the outbox are all in one place you can inspect with a single SELECT. Redis is used only for the live "thinking… searching…" progress view, which is fine to be fast-but-forgettable.

One mental model to keep

Here it is in one sentence:

It's a shared to-do list that no single machine can lose: every "I'll do it" is written down before it's attempted, every claim expires so nothing gets stuck, and every external action is itself a written-down to-do so it can't fall through the cracks.

Durability, retries, scaling, and correctness all fall out of that one commitment: write it down before you act, and make acting safe to repeat. What looked like a calendar feature was a lesson in distributed systems all along.