
Google Cloud outage caused by 'large backlog of queued mutations'

Ad giant added memory to servers, restarted, watched things get worse ... is on top of things again now

A 14-hour Google Cloud Platform outage that we missed in the shadow of last week's G Suite outage was caused by a failure to scale, an internal investigation has shown.

The outage, which occurred on 26 March, brought down Google's cloud services in multiple regions, including Dataflow, BigQuery, Dialogflow, Kubernetes Engine, Cloud Firestore, App Engine, and Cloud Console. The systems were affected for a total of 14 hours.

The outage was caused by a lack of memory in Google's cache servers, according to the company's internal investigation, published today. "The trigger of the incident was a bulk update of group memberships that expanded to an unexpectedly high number of modified permissions, which generated a large backlog of queued mutations to be applied in real-time," the investigation said.

"The processing of the backlog was degraded by a latent issue with the cache servers, which led to them running out of memory; this in turn resulted in requests to IAM timing out. The problem was temporarily exacerbated in various regions by emergency rollouts performed to mitigate the high memory usage."

Google resolved the immediate issue by installing more memory in the cache servers and restarting them. But by that point a heap of stale data had built up, leading to further problems that its engineers had to battle for several more hours. The systems were back up and operating normally at 05:55 UTC the following morning.

In response to the issues, Google said that it is "ensuring that the cache servers can handle bulk updates of the kind which triggered this incident" and that "efforts are underway to optimize the memory usage and protections on the cache servers, and allow emergency configuration changes without requiring restarts."
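The report doesn't say how those restart-free configuration changes will be implemented. One common pattern is to keep the live configuration behind an atomic value that operators can swap at runtime, as in the Go sketch below; the Config fields and numbers are assumptions made for the example, not anything taken from Google's codebase.

// Illustrative sketch of one way to apply emergency configuration changes
// without a restart: keep the live config behind an atomic value and swap
// it in place. The Config fields are assumptions, not Google's.
package main

import (
	"fmt"
	"sync/atomic"
)

// Config holds tunables that operators may need to change in an emergency.
type Config struct {
	CacheMemoryLimitMB int
	BulkUpdateAllowed  bool
}

var current atomic.Value // holds *Config

// Reload swaps in a new config; readers pick it up on their next request,
// with no process restart required.
func Reload(c *Config) { current.Store(c) }

// Current returns the config in effect right now.
func Current() *Config { return current.Load().(*Config) }

func main() {
	Reload(&Config{CacheMemoryLimitMB: 4096, BulkUpdateAllowed: true})
	fmt.Printf("before: %+v\n", *Current())

	// Emergency change: raise the memory cap and pause bulk updates.
	Reload(&Config{CacheMemoryLimitMB: 8192, BulkUpdateAllowed: false})
	fmt.Printf("after:  %+v\n", *Current())
}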

"To allow us to mitigate data staleness issues more quickly in future, we will also be sharding out the database batch processing to allow for parallelization and more frequent runs. We understand how important regional reliability is for our users and apologize for this incident." ®
