I’m trying to frame a strong conceptual answer about a distributed system that serves read-heavy traffic across regions. The tricky part is partial failure: one region is slow or intermittently unavailable, but not fully down. If I prioritize low tail latency, I can route around that region or serve slightly stale data; if I prioritize consistency, I may amplify latency or reduce availability. In an interview, how would you structure the tradeoff discussion beyond CAP buzzwords, especially around SLOs, failure detection, read repair, and user-visible correctness?
Sarah