The Queue Was in the Database the Whole Time

Two months ago I published Local-First Cron Architecture and recommended GitHub as the task queue between a Cloudflare Worker scheduler and a local daemon. The design worked. Then it grew, and last week I caught myself writing a function called reconcile_inbox_rows() whose entire job was healing Postgres rows that my own queue design had stranded.

When you control both ends of a pipeline and you’re still writing a reconciler for it, the architecture is telling you something. This is the honest sequel.

What the April Design Became

In production, the flow looked like this:

Postgres row (task created)
    -> dispatch.py commits the task JSON to a GitHub repo (Contents API)
    -> local daemon polls the repo every 2 seconds
    -> daemon executes, writes results back to Postgres

push_task_to_github lived at dispatch.py:150. POLL_INTERVAL = 2 lived at config.py:94. And here’s the part I avoided looking at for weeks: the producer and the consumer were processes on the same machine. Every task started as a Postgres row, traveled to GitHub’s servers, sat in a repo until a poll cycle noticed it, then came home to be executed three feet from where it was born. The April design solved a real problem (a remote scheduler that can’t reach local tools), but the queue half of it had quietly become a round trip to another continent between two siblings.

The Patches That Accumulated

Queues built on version control grow the failure modes they deserve. Mine grew four:

A personal access token in a flat file, read at dispatch time. Auth for a queue between two local processes now depended on a credential on disk and GitHub’s API being up.
Retries disabled entirely. MAX_RETRIES = 0, with a comment that read “Failures report once and die.” Next to it, a dead RETRY_DELAYS = [300, 900] kept “for future re-enable.” That future never came, because retrying safely through the GitHub lane meant solving duplicate delivery first.
A completed-guard dict in task-state.json. GitHub file deletes occasionally failed after a task succeeded, and the daemon would re-run the task on the next poll. The fix was a local dict remembering completed task IDs: a deduplication hack layered on at-least-once delivery, persisted to a flat file, papering over the lack of an atomic claim.
reconcile_inbox_rows(). The Postgres side and the flat-file side disagreed about reality often enough that I wrote a function to walk the rows and repair the split-brain.

Each patch was reasonable in isolation. Together they were a confession: I was reimplementing, badly and in JSON files, guarantees that Postgres has shipped for a decade.
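For a sense of what the completed-guard amounts to, here is a minimal sketch. It is a reconstruction for illustration, not the original code: the CompletedGuard class, the handle function, and the JSON layout are all assumptions.

```python
import json
import os
import tempfile


class CompletedGuard:
    """Remembers finished task IDs in a JSON file (hypothetical layout)."""

    def __init__(self, path):
        self.path = path
        self.done = {}
        if os.path.exists(path):
            with open(path) as f:
                self.done = json.load(f)

    def should_run(self, task_id):
        return task_id not in self.done

    def mark_done(self, task_id):
        self.done[task_id] = True
        # Write to a temp file and rename, so a crash cannot leave
        # a half-written state file behind.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(self.done, f)
        os.replace(tmp, self.path)


def handle(task_id, run, guard):
    """Run a task unless the guard has seen it; return True if it ran."""
    if not guard.should_run(task_id):
        return False
    run(task_id)  # a crash after this line, before mark_done, still re-runs the task
    guard.mark_done(task_id)
    return True
```

Even in this tidy form the guard only narrows the window. Executing the task and recording that it ran are two separate writes to two separate places, which is exactly the gap an atomic claim closes.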

The Replacement

Shipped 2026-06-10: the queue is the tasks table. Eight claim columns (claimed_by, claimed_at, lease_expires_at, attempts, max_attempts, not_before, last_error, last_exit_code), a trigger that fires NOTIFY tasks_inbox_available on insert, and a worker that claims atomically:

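WITH next AS (
  -- Pick the oldest open task that is due to run.
  SELECT id FROM tasks
  WHERE status = 'open'
    AND (not_before IS NULL OR not_before <= now())
  ORDER BY created_at
  LIMIT 1
  -- Lock the row; skip rows another worker has already locked.
  FOR UPDATE SKIP LOCKED
)
-- Mark the locked row as claimed and return it to the worker.
UPDATE tasks t
SET status = 'claimed', claimed_by = :worker, claimed_at = now()
FROM next WHERE t.id = next.id
RETURNING t.*;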

FOR UPDATE SKIP LOCKED is the whole trick. Two workers can run this concurrently and Postgres hands each a different row, or no row, with no coordination code on my side. Failures got a real contract too: a task exiting with code 75 (EX_TEMPFAIL, sitting in sysexits.h since the 1980s) marks a transient failure and reschedules with 60/300/900-second backoff via not_before. Terminal failures escalate to a human gate instead of dying silently.
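The exit-code contract can be sketched as a small pure function. This is a sketch under stated assumptions, not the production code: the function name, the status names 'done' and 'needs_human', and the mapping from attempt count to delay are all hypothetical. Only exit code 75 and the 60/300/900-second delays come from the description above.

```python
from datetime import datetime, timedelta, timezone

EX_TEMPFAIL = 75                  # transient failure, from sysexits.h
BACKOFF_SECONDS = [60, 300, 900]  # delays quoted above


def after_exit(exit_code, attempts, max_attempts, now):
    """Decide a task's next state from its exit code.

    `attempts` counts runs so far, including the one that just finished.
    Returns (status, not_before). Status names other than 'open' are
    hypothetical.
    """
    if exit_code == 0:
        return ("done", None)
    if exit_code == EX_TEMPFAIL and attempts < max_attempts:
        # First failure waits 60s, second 300s, later ones stay at 900s.
        index = min(max(attempts, 1), len(BACKOFF_SECONDS)) - 1
        return ("open", now + timedelta(seconds=BACKOFF_SECONDS[index]))
    # Non-transient failure, or retries exhausted: escalate to a human.
    return ("needs_human", None)


# Usage: a first transient failure reopens the task 60 seconds out.
now = datetime.now(timezone.utc)
status, not_before = after_exit(EX_TEMPFAIL, attempts=1, max_attempts=4, now=now)
```

Writing the result back as status and not_before is what lets the claim query do the scheduling: a rescheduled task is just an open row that is not due yet.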

The worker sleeps on LISTEN and wakes in under 1 second when a task arrives. If the LISTEN connection drops, it degrades to a 30-second poll, which is the old design’s steady state demoted to a fallback.
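The wake path might look roughly like the sketch below. It assumes psycopg2 as the driver, which is not stated above, and the trigger function name, trigger name, and wait_for_work are hypothetical. The channel name and the 30-second fallback are the ones described above.

```python
import select
import time

CHANNEL = "tasks_inbox_available"   # channel name described above
FALLBACK_POLL_SECONDS = 30          # fallback interval described above

# Hypothetical trigger; EXECUTE FUNCTION needs PostgreSQL 11 or later.
TRIGGER_DDL = """
CREATE OR REPLACE FUNCTION notify_tasks_inbox() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify('tasks_inbox_available', NEW.id::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tasks_inbox_notify
AFTER INSERT ON tasks
FOR EACH ROW EXECUTE FUNCTION notify_tasks_inbox();
"""


def wait_for_work(conn, timeout=FALLBACK_POLL_SECONDS):
    """Block until a notification arrives or `timeout` seconds pass.

    `conn` is assumed to be a psycopg2 connection in autocommit mode on
    which `LISTEN tasks_inbox_available` has already been executed.
    Returns True if woken by NOTIFY, False on timeout or a broken
    connection. Either way the caller should try to claim a task next.
    """
    try:
        readable, _, _ = select.select([conn], [], [], timeout)
        if readable:
            conn.poll()  # read pending messages off the socket
    except Exception:
        # Connection unusable: behave like a plain poll. Real code would
        # catch the driver's error type, reconnect, and LISTEN again.
        time.sleep(timeout)
        return False
    woke = bool(conn.notifies)
    conn.notifies.clear()  # one wake-up covers any number of inserts
    return woke
```

A worker loop would call wait_for_work and then run the claim query until it returns no row, whatever the return value was. That way a missed notification costs at most one fallback interval, never a stranded task.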

Concern	GitHub lane	Postgres lane
Wake latency	2-second poll	under 1s on NOTIFY
Atomic claim	completed-guard dict	FOR UPDATE SKIP LOCKED
Retries	MAX_RETRIES = 0	exit 75 + 60/300/900s backoff
Auth	PAT in a flat file	the DB connection already open
Consistency	reconcile_inbox_rows()	one system of record

What gets deleted: the PAT file, the 2-second poll, the completed-guard, the reconciler, and the entire Postgres -> GitHub -> poll -> Postgres round trip.

The Honest Status

“Gets deleted,” not “got deleted.” Fifty proof tasks went through the new lane with zero stranded rows, but that was proof mode, without real executors attached. The GitHub lane stays in place, dormant, until a real-executor cutover gate passes. The old code dies only after that. I’ve been burned enough by my own April optimism to keep the parachute on.

The April piece even had a section titled “Why not a database?” My answer then was that the volume was too low to justify one. The volume argument was correct and beside the point: the database was already in the system, holding the tasks at both ends.

The Broader Lesson

Before you add a queue, a cache, or a coordination layer, check whether the database you already trust has the primitive. SKIP LOCKED has been in Postgres since 9.5. LISTEN/NOTIFY is older than most of the message brokers people reach for. Every component you add is a component you’ll eventually write a reconciler for.

The best queue is the one already holding your data.