How Google Cloud Intelligently Manages Service Level

Executive Summary

  • Google has a nuanced view of service level. In this article, we cover why Google states that the highest service level is not desirable.

Introduction to the Google Philosophy

In many cases, extraordinary service levels are promoted. However, in the book Site Reliability Engineering: How Google Runs Production Systems, Google explains that this is not an appropriate way to think about service level.

Our References for This Article

If you want to see our references for this article and related Brightwork articles, see this link.

“You might expect Google to try to build 100% reliable services—ones that never fail. It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better! Extreme reliability comes at a cost: maximizing stability limits how fast new features can be developed and how quickly products can be delivered to users, and dramatically increases their cost, which in turn reduces the numbers of features a team can afford to offer.”

More Reliable Than the Consuming Device?

Google also observes that the service need not be more reliable than the device or network through which the user consumes it.

“Further, users typically don’t notice the difference between high reliability and extreme reliability in a service, because the user experience is dominated by less reliable components like the cellular network or the device they are working with. Put simply, a user on a 99% reliable smartphone cannot tell the difference between 99.99% and 99.999% service reliability! With this in mind, rather than simply maximizing uptime, Site Reliability Engineering seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness—with features, service, and performance—is optimized.”
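The smartphone arithmetic in the quote can be checked with a short calculation. The sketch below is our illustration, not something taken from the book. It assumes that the device and the service fail independently and that the user needs both to work, so the availabilities multiply. The function name end_to_end_availability is hypothetical.

```python
def end_to_end_availability(*component_availabilities: float) -> float:
    """Availability seen by the user when every component must work.

    Assumption (ours, not the book's): components fail independently,
    so the availabilities of a serial chain simply multiply.
    """
    result = 1.0
    for availability in component_availabilities:
        if not 0.0 <= availability <= 1.0:
            raise ValueError("availability must be between 0 and 1")
        result *= availability
    return result


DEVICE = 0.99  # the 99% reliable smartphone from the quote

high = end_to_end_availability(DEVICE, 0.9999)      # "high" service reliability
extreme = end_to_end_availability(DEVICE, 0.99999)  # "extreme" service reliability

print(f"99.99% service behind a 99% device:  {high:.5%}")
print(f"99.999% service behind a 99% device: {extreme:.5%}")
print(f"Gain visible to the user:            {extreme - high:.5%}")
```

Under these assumptions, the user experiences roughly 98.990% availability in the first case and 98.999% in the second. The extra "nine" at the server moves the user's experience by less than one hundredth of a percentage point, while the device itself accounts for a full percentage point of unavailability.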

Interpreting Risk as a Continuum

“In SRE, we manage service reliability largely by managing risk. We conceptualize risk as a continuum. We give equal importance to figuring out how to engineer greater reliability into Google systems and identifying the appropriate level of tolerance for the services we run. Doing so allows us to perform a cost/benefit analysis to determine, for example, where on the (nonlinear) risk continuum we should place Search, Ads, Gmail, or Photos. Our goal is to explicitly align the risk taken by a given service with the risk the business is willing to bear. We strive to make a service reliable enough, but no more reliable than it needs to be. That is, when we set an availability target of 99.99%, we want to exceed it, but not by much: that would waste opportunities to add features to the system, clean up technical debt, or reduce its operational costs. In a sense, we view the availability target as both a minimum and a maximum. The key advantage of this framing is that it unlocks explicit, thoughtful risk-taking.”
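To make the idea of a target that is "both a minimum and a maximum" concrete, an availability target can be converted into the amount of downtime it permits. The sketch below is our own illustration under simple assumptions: availability is measured purely as uptime over a 30-day window, and the function names allowed_downtime_minutes and budget_remaining_minutes are hypothetical. It is not Google's implementation.

```python
def allowed_downtime_minutes(target: float, period_days: int = 30) -> float:
    """Downtime that an availability target permits over a period.

    Assumption (ours): availability is time-based, i.e. uptime divided
    by total time in the measurement window.
    """
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be between 0 and 1")
    total_minutes = period_days * 24 * 60
    return (1.0 - target) * total_minutes


def budget_remaining_minutes(
    target: float, downtime_so_far: float, period_days: int = 30
) -> float:
    """Unused downtime allowance. A negative value means the target was missed."""
    return allowed_downtime_minutes(target, period_days) - downtime_so_far


TARGET = 0.9999  # the 99.99% availability target from the quote

budget = allowed_downtime_minutes(TARGET)
remaining = budget_remaining_minutes(TARGET, downtime_so_far=1.0)

print(f"Downtime permitted over 30 days: {budget:.2f} minutes")
print(f"Remaining after 1 minute down:   {remaining:.2f} minutes")
```

With a 99.99% target, the service may be unavailable for about 4.32 minutes in 30 days. A large unused allowance at the end of the period suggests, in the terms of the quote, that the team was more cautious than the business required and could have spent that margin on new features, technical debt, or lower operating costs.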

Conclusion

It is interesting to compare how Google interprets and sets service level with the categorical explanations of service level offered by vendors like SAP and Oracle. What goes unobserved is that, because of SAP and Oracle’s degenerative support, their service level proposals are limited to sales presentations.