Long-running jobs break every assumption you built for synchronous APIs. Your load balancer times out after 30s. Your mobile client doesn't know whether to retry. Your retry logic re-runs a job that already half-completed.
Here's the real scenario:
You're processing a video upload. The job takes 2–8 minutes. Millions of users.
What do you expose to the client?
A) Polling endpoint — client hits /jobs/:id/status every 5s until done
B) Webhook — job fires a POST to client's callback URL on completion
C) SSE / WebSocket — server pushes progress updates in real time
D) Synchronous wait — keep the HTTP connection open until the job finishes
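To make option A concrete, here is a minimal Python sketch of the client side of polling. It assumes the status endpoint returns one of a few state strings; `poll_until_done`, `fetch_status`, and the terminal state names are hypothetical, and the HTTP call is left to the caller so the loop stays testable.

```python
import time
from typing import Callable

# Assumed terminal states; a real API would define its own.
TERMINAL_STATES = {"complete", "failed"}


def poll_until_done(
    fetch_status: Callable[[], str],
    interval_s: float = 5.0,
    max_wait_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_status every interval_s seconds until the job reaches
    a terminal state or the wait budget is used up."""
    waited = 0.0  # counts sleep time only, not time spent in requests
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if waited + interval_s > max_wait_s:
            raise TimeoutError(f"job still '{status}' after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

In practice `fetch_status` would wrap a GET to /jobs/:id/status. The deadline matters as much as the interval: without it, a job that never finishes keeps the client polling forever.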
One scales to millions without coupling your infrastructure to client uptime.
The others have hard production failure modes most teams don't discover until 3 AM.
The deeper problem isn't transport — it's these 4 things nobody gets right the first time:
→ Idempotency. Every job must be safe to re-run. If a retry can double-charge, double-send, or double-process, you don't have retries, you have bugs waiting to happen.
→ Progress granularity. "0% → 100%" is useless for a 6-minute job. You need intermediate states: queued, processing, transcoding, uploading, complete. Clients need something to show users.
→ Timeout vs failure. A job that stops responding isn't the same as a job that failed. Dead workers, OOM kills, and spot instance evictions never reach your error handler, so your queue needs a heartbeat or a visibility timeout, not just a try/catch.
→ Deduplication. The client will retry. Your queue will redeliver. You need a dedup key scoped to the original request — not the job run.
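Three of these four points can be sketched in a few lines of Python: a dedup key scoped to the original request, named intermediate states, and a heartbeat check that tells a silent job apart from a failed one. This is only an illustration. `JobStore`, its methods, and the 30-second timeout are assumptions, and the dict stands in for a durable store. It does not make the job body itself idempotent; that still has to be designed per job.

```python
import time
import uuid
from dataclasses import dataclass

# Intermediate states taken from the list above, plus an assumed "failed".
STATES = ("queued", "processing", "transcoding", "uploading", "complete", "failed")
IN_FLIGHT = {"processing", "transcoding", "uploading"}


@dataclass
class Job:
    job_id: str
    state: str = "queued"
    last_heartbeat: float = 0.0


class JobStore:
    """In-memory stand-in for a durable job table (hypothetical)."""

    def __init__(self, heartbeat_timeout_s: float = 30.0, clock=time.monotonic):
        self._jobs: dict[str, Job] = {}
        self._by_request_key: dict[str, str] = {}
        self._timeout = heartbeat_timeout_s
        self._clock = clock

    def submit(self, request_key: str) -> Job:
        # The key identifies the original request, not a job run,
        # so a client retry gets the existing job back.
        existing_id = self._by_request_key.get(request_key)
        if existing_id is not None:
            return self._jobs[existing_id]
        job = Job(job_id=str(uuid.uuid4()), last_heartbeat=self._clock())
        self._jobs[job.job_id] = job
        self._by_request_key[request_key] = job.job_id
        return job

    def heartbeat(self, job_id: str, state: str) -> None:
        # Workers report their current stage; this doubles as a liveness signal.
        if state not in STATES:
            raise ValueError(f"unknown state: {state}")
        job = self._jobs[job_id]
        job.state = state
        job.last_heartbeat = self._clock()

    def stalled_jobs(self) -> list[Job]:
        # In flight but silent for too long: likely a dead worker,
        # which is different from a job that reported "failed".
        now = self._clock()
        return [
            job for job in self._jobs.values()
            if job.state in IN_FLIGHT and now - job.last_heartbeat > self._timeout
        ]
```

A separate sweeper would call `stalled_jobs()` on a schedule and decide whether to requeue each one, which is exactly where the idempotency requirement comes back in.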
Pick one — A, B, C, or D — and tell me why. Full breakdown in the comments.
#30DaysOfSystemDesign #SystemDesign #BackendEngineering #DistributedSystems