A weekly job retried blindly after a transient gateway error, reissuing payments three times. Idempotency keys were based on request timestamps, not business identity. We rebuilt around stable order references and upserts, added jittered retries, and the backlog cleared without refunds, chargebacks, or weekend heroics ever returning.
An integration started dropping bursts during provider deploys. Instead of paging, alerts summarized rising duplicate suppression and open circuits. A timed redrive restored delivery after stability returned. Users never noticed; we celebrated with a tiny note in release logs and a dashboard annotation for future investigators.
What safeguards saved you from a long night, and which surprises still sting? Share patterns, screenshots, and dashboards that worked, plus questions where you want a second set of eyes. Subscribe for deep dives, office hours, and templates, and help refine a shared, practical reliability playbook.