Abstract
As our organization became very good at protecting server SLOs through reliability best practices such as scaling globally distributed architectures, mitigating toil, and making continuous reliability improvements, we noticed that a majority of the incidents affecting our end users did not show up as SLO misses.
In many cases, these outages were not even observable from the server side. For example, rolling out a new version of the consumer mobile application (which our services power) to an app store could break one or more critical features because of bugs in client code. This reality has changed the way we approach reliability: we’re shifting our focus from server reliability to product reliability.
We’re not yet finished with the transition, but we’re starting to see very positive results. Our talk shares the challenges we’ve solved so far, the lessons we’ve learned, and our vision for the future.