Facebook, Instagram, WhatsApp outage: Cause of crash revealed
The tech juggernaut has opened up about what really caused yesterday’s massive, six-hour crash that cost Mark Zuckerberg billions.
Facebook’s six-hour outage on Tuesday not only left social media addicts frustrated – it also cost CEO Mark Zuckerberg a staggering AU$9.6 billion.
The severe outage – which also saw sister platforms Instagram and WhatsApp crash – spooked shareholders, who started ditching their shares in a mass sell-off, sending the company’s share price down by about five per cent.
The platforms re-emerged just before 9am AEDT yesterday – and now that things are back online, Facebook has reached out to the public to explain the massive “failure”.
In a note, Facebook’s vice president of infrastructure, Santosh Janardhan, said it was an “error of our own making” caused when engineers tried to conduct “routine maintenance”.
He explained that “a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centres globally”.
Meanwhile, an audit system that was supposed to catch “mistakes” like the one seen yesterday failed due to a “bug in that audit tool”.
“This change caused a complete disconnection of our server connections between our data centres and the internet. And that total loss of connection caused a second issue that made things worse.”
The first problem then interfered with Facebook’s DNS, or Domain Name System, the system that maps domain names to the right IP addresses so that users can reach websites.
“The end result was that our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find our servers,” Janardhan continued.
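The distinction Janardhan draws, between servers that are running and servers that can be found, can be illustrated with a toy model. This is only a sketch and not a description of Facebook’s systems: the `ToyResolver` class, the `visit` function, the domain `example.test` and the address `192.0.2.10` are all made up for illustration.

```python
# Toy model: looking up a name and serving a page are separate steps.
# Every name and address here is hypothetical.

class ToyResolver:
    """Maps domain names to addresses, like a very small DNS."""

    def __init__(self, records):
        self.records = dict(records)
        self.reachable = True  # can the rest of the internet reach this resolver?

    def resolve(self, name):
        if not self.reachable:
            raise LookupError(f"cannot resolve {name}: DNS servers unreachable")
        return self.records[name]


def visit(resolver, servers, name):
    """Return the page served for `name`, or an error message."""
    try:
        address = resolver.resolve(name)
    except LookupError as exc:
        return f"error: {exc}"
    return servers[address]


resolver = ToyResolver({"example.test": "192.0.2.10"})
servers = {"192.0.2.10": "hello from the web server"}

before = visit(resolver, servers, "example.test")

# The web server is untouched, but the resolver can no longer be reached.
resolver.reachable = False
after = visit(resolver, servers, "example.test")

print(before)
print(after)
```

In the second call the web server entry is still present and unchanged, yet the visit fails because the name can no longer be turned into an address.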
“All of this happened very fast. And as our engineers worked to figure out what was happening and why, they faced two large obstacles: first, it was not possible to access our data centres through our normal means because their networks were down, and second, the total loss of DNS broke many of the internal tools we’d normally use to investigate and resolve outages like this.”
To make matters worse, staff then faced an uphill battle as they tried to respond, because the outage also hit Facebook’s internal security systems, blocking their access.
Even once the issues were fixed, Facebook faced an extra delay: it could not simply bring everything back online simultaneously, because that would cause a huge surge which could see the system crash again.
The company is now reviewing the disaster in the hope of preventing it from happening again.
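The idea of restoring services in stages rather than all at once can be sketched in a few lines. This is an illustration under assumed numbers, not Facebook’s actual recovery procedure: the function name `staged_restore` and the parameters `total_units` and `capacity_per_step` are hypothetical.

```python
def staged_restore(total_units, capacity_per_step):
    """Yield the cumulative number of units back online after each step.

    `capacity_per_step` is an assumed limit on how much can safely be
    switched back on at once without causing a surge.
    """
    if capacity_per_step <= 0:
        raise ValueError("capacity_per_step must be positive")
    restored = 0
    while restored < total_units:
        # Never bring back more than the safe limit in a single step.
        restored = min(total_units, restored + capacity_per_step)
        yield restored


steps = list(staged_restore(total_units=10, capacity_per_step=4))
print(steps)  # [4, 8, 10]
```

With a limit of four units per step, ten units come back in three steps instead of one, which is the trade-off the company describes: a slower return in exchange for a lower risk of a second crash.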
“Every failure like this is an opportunity to learn and get better, and there’s plenty for us to learn from this one,” the post continued.
“After every issue, small and large, we do an extensive review process to understand how we can make our systems more resilient. That process is already underway.
“We’ve done extensive work hardening our systems to prevent unauthorised access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making.”