Okay, so the debugging habit that actually scales for me is whichever one gets me from “something feels off” to a concrete repro fast. Anything that keeps me poking around on vibes for 20 minutes usually turns out to be a waste of time.
I’ve noticed the best teams I’ve been on treat that gap like latency — they care way more about how quickly they can prove a suspicion than about being clever. what habits do people actually lean on when the bug is intermittent and the usual logs are useless?
Intermittent + useless logs usually means you’re missing a correlation key, so I start by forcing one into the flow and refusing to debug without it. Generate a request/trace ID at the edge, carry it through every hop, and when the bug hits, you can stitch the story together instead of guessing on vibes. The part that gets me is that teams will add more logs before they add the one identifier that lets you tie any given log line to a specific request. What’s your stack? Do you already have something like trace IDs in place, or is everything still “timestamp and hope”?
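Rough sketch of the request ID idea above, assuming Python and nothing but the stdlib. The names (`request_id_var`, `RequestIdFilter`, `handle_request`, `charge_card`) are made up for illustration, and a real service would pull the incoming ID from a header or message field rather than a function argument.

```python
import contextvars
import logging
import uuid

# Hypothetical name: holds the ID of whatever request is currently being handled.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Stamp every log record with the current request ID."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


def make_logger(stream):
    """Build a logger whose every line starts with the request ID."""
    logger = logging.getLogger("app")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(request_id)s %(levelname)s %(message)s")
    )
    handler.addFilter(RequestIdFilter())
    logger.handlers.clear()
    logger.addHandler(handler)
    return logger


def charge_card(logger):
    # No ID argument needed: the filter reads it from the context.
    logger.info("charging card")


def handle_request(logger, incoming_id=None):
    # Reuse the caller's ID if one arrived, otherwise mint one at the edge.
    rid = incoming_id or uuid.uuid4().hex
    token = request_id_var.set(rid)
    try:
        logger.info("received")
        charge_card(logger)
        return rid
    finally:
        # Restore the previous value so IDs never leak between requests.
        request_id_var.reset(token)
```

The point of the context variable is that nothing downstream has to remember to pass the ID along, so a forgotten argument can't silently break the chain. Crossing a process boundary (queue, HTTP call) still needs the ID copied into the message or header by hand.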
Intermittent bugs with useless logs usually mean you can’t tie one step to the next.
I push a request or operation ID through the whole path — client, API, queue, DB — so when it flakes I can at least see where the chain breaks. The other habit that saves me is writing one dumb hypothesis and one thing that would prove it wrong, then only instrumenting that spot for a little while. Turning on noisy debug everywhere just turns the whole thing into soup.
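To make "see where the chain breaks" concrete, here's a small sketch in Python. It assumes you've already parsed your logs into `(op_id, hop)` pairs; the hop names and the `first_missing_hop` helper are hypothetical, not from any library.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical hop names, mirroring the client -> API -> queue -> DB path.
EXPECTED_HOPS = ("client", "api", "queue", "db")


def first_missing_hop(
    events: Iterable[Tuple[str, str]],
    op_id: str,
    expected: Tuple[str, ...] = EXPECTED_HOPS,
) -> Optional[str]:
    """Return the first expected hop that never logged op_id, or None if all did."""
    seen = {hop for event_id, hop in events if event_id == op_id}
    for hop in expected:
        if hop not in seen:
            return hop
    return None


# Example data: op-1 made it all the way, op-2 vanished after the API.
events = [
    ("op-1", "client"), ("op-1", "api"), ("op-1", "queue"), ("op-1", "db"),
    ("op-2", "client"), ("op-2", "api"),
]
```

With that data, `first_missing_hop(events, "op-2")` comes back as `"queue"`, which is exactly the one spot worth instrumenting for the next hypothesis.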
When it’s really heisenbuggy, do you capture extra context only on the failure path, or do you end up logging the whole firehose?
I usually bias hard toward “extra context only on the failure path,” because the firehose almost always gets ignored (or worse, it drowns out the one line you needed). If you’ve got an operation/request ID already, the trick that’s saved me is failure-triggered sampling. Keep normal logs lean, but when you detect a failure condition (timeout, retry threshold, invariant break), temporarily crank verbosity just for that ID for the next N seconds/steps. You get a tight, correlated story without turning prod into soup.
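Here's roughly what the per-ID verbosity bump can look like with Python's stdlib `logging`. Treat it as a sketch under assumptions: `EscalatingFilter`, `escalate`, and the `op_id` record attribute are names I made up, it only covers a single process, and it can't recover debug lines from before the failure was detected.

```python
import logging
import time


class EscalatingFilter(logging.Filter):
    """Pass DEBUG records only for operation IDs recently flagged as failing."""

    def __init__(self, window_seconds=30.0, clock=time.monotonic):
        super().__init__()
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the window can be tested
        self._hot_until = {}  # op_id -> time at which verbosity drops back

    def escalate(self, op_id):
        # Call when a failure condition fires
        # (timeout, retry threshold, invariant break).
        self._hot_until[op_id] = self.clock() + self.window_seconds

    def filter(self, record):
        if record.levelno > logging.DEBUG:
            return True  # INFO and above always pass
        op_id = getattr(record, "op_id", None)
        deadline = self._hot_until.get(op_id)
        if deadline is None:
            return False  # this ID was never escalated: stay lean
        if self.clock() >= deadline:
            del self._hot_until[op_id]  # window expired
            return False
        return True
```

To use it, the logger itself has to be set to `DEBUG` so the records reach the filter, and each call attaches the ID with something like `logger.debug("retrying", extra={"op_id": op_id})`. Attach the filter to a handler, then call `escalate(op_id)` from wherever the failure is detected.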
If you haven’t tried it yet, kirupa has a solid walkthrough on correlating async work with IDs that maps pretty well to your client→API→queue→DB chain: https://www.kirupa.com/ (search “correlation id” / “request id logging”).