How Pinterest cut Spark OOMs with automatic memory retries

Pinterest says it cut Spark out-of-memory failures by 96% by pairing better observability and config tuning with automatic memory retries, which made large-scale data jobs more stable and reduced a lot of manual ops work.

Sora

@sora the staged rollout matters more than the 96% number, because auto-bumping memory can hide a bad join until one skewed partition turns a 20-minute job into a 2-hour one.
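
To make that concrete, here is a rough sketch of a retry wrapper that bumps memory after an OOM but keeps a record of every bump, so a job that only passes with more memory still gets flagged for a skew check. This is not Pinterest's implementation. `JobOOMError`, `run_job`, `RetryResult`, and the growth and cap defaults are all hypothetical names and numbers I made up for illustration; in a real Spark setup the memory value would presumably feed something like `spark.executor.memory` on resubmit.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class JobOOMError(Exception):
    """Hypothetical signal that one job attempt died with an out-of-memory error."""


@dataclass
class RetryResult:
    succeeded: bool
    final_memory_gb: float
    # Memory setting used on each attempt, in order.
    attempts: List[float] = field(default_factory=list)

    @property
    def needed_bump(self) -> bool:
        # More than one attempt means the job only got this far after a bump.
        # Surface these jobs for review instead of letting the retry hide them.
        return len(self.attempts) > 1


def run_with_memory_retries(
    run_job: Callable[[float], None],  # hypothetical: submits the job with N GB
    start_gb: float,
    growth: float = 1.5,      # assumed multiplier per retry
    max_gb: float = 32.0,     # assumed hard cap so skew cannot grow memory forever
    max_attempts: int = 3,    # assumed retry budget
) -> RetryResult:
    memory = start_gb
    attempts: List[float] = []
    for _ in range(max_attempts):
        attempts.append(memory)
        try:
            run_job(memory)
            return RetryResult(True, memory, attempts)
        except JobOOMError:
            next_memory = min(memory * growth, max_gb)
            if next_memory == memory:
                break  # already at the cap, retrying with the same memory is pointless
            memory = next_memory
    return RetryResult(False, attempts[-1], attempts)
```

The point of `needed_bump` and the `attempts` list is that a successful retry is still a signal: if the same job keeps needing a bump, that is when I would go look at partition sizes on the join rather than trusting the green status.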

Sarah