You probably noticed that Facebook, WhatsApp, Instagram, Oculus VR, and Messenger were down on October 4, 2021. Naturally, this led to wild speculation about what actually happened. Was Facebook hacked? Was it some sort of government cover-up? Facebook has now answered these questions.
As it turns out, the problem originated in the backbone network Facebook built to connect all of its computing facilities.
In a lengthy blog post, Facebook’s Santosh Janardhan traced the outage to a routine maintenance job. “During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally,” the post said.
Facebook does have an audit tool meant to stop a command like this from being executed, but a bug in that tool allowed it to slip through.
From there, the company’s DNS servers became unreachable, making it impossible for the rest of the internet to find Facebook’s servers. Not only was the website down, but the domain was even showing up, erroneously, as for sale on some domain marketplaces.
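To see why losing DNS is enough to take a site off the map, here is a minimal toy sketch in Python. It is not Facebook's setup or a real resolver: the NameServer class, the resolve function, the example.com hostname, and the documentation-range IP addresses are all made up for illustration. The point it shows is that a web server can be perfectly healthy and still be unfindable once every name server that knows its address is unreachable.

```python
from __future__ import annotations

from dataclasses import dataclass


class ResolutionError(Exception):
    """Raised when no reachable name server can answer a lookup."""


@dataclass
class NameServer:
    """Hypothetical authoritative name server in this toy model."""
    address: str
    reachable: bool
    records: dict[str, str]  # hostname -> IP address


def resolve(hostname: str, name_servers: list[NameServer]) -> str:
    """Return the IP for hostname from the first reachable server that knows it."""
    for ns in name_servers:
        if not ns.reachable:
            continue  # the rest of the internet cannot talk to this server
        if hostname in ns.records:
            return ns.records[hostname]
    raise ResolutionError(f"no reachable name server could answer for {hostname}")


# Illustrative data only: example.com and the 192.0.2.x / 198.51.100.x
# ranges are reserved for documentation.
records = {"www.example.com": "192.0.2.10"}
servers = [
    NameServer("198.51.100.1", True, records),
    NameServer("198.51.100.2", True, records),
]

print("before:", resolve("www.example.com", servers))

# Simulate the name servers dropping off the network. The web server at
# 192.0.2.10 is untouched, but nobody can look up its address anymore.
for ns in servers:
    ns.reachable = False

try:
    resolve("www.example.com", servers)
except ResolutionError as exc:
    print("after:", exc)
```

The same failure would hit any internal tool that finds its own servers by name, which is part of why an outage like this is hard to debug from the inside.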
Facebook also explained why the outage lasted so long. The company’s engineers were unable to access the data centers remotely because the network itself was down. Additionally, the loss of DNS broke many of the internal tools the company would normally use to investigate outages like this one.
Finally, Facebook’s own security caused it to take longer to get things up and running again. Here’s how Janardhan explained that:
Our primary and out-of-band network access was down, so we sent engineers onsite to the data centers to have them debug the issue and restart the systems. But this took time, because these facilities are designed with high levels of physical and system security in mind. They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them. So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online.
Essentially, the same security that keeps intruders out made it harder for Facebook’s own engineers to physically reach and work on the equipment that needed fixing, which slowed everything down.
In the blog post, Facebook summarized the situation by saying, “We’ve done extensive work hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making.”
To put it simply, Facebook wasn’t hacked. There wasn’t a grand conspiracy to keep people quiet. A mistake made by the company itself caused everything to crash, and its security measures made it more difficult for its engineers to repair the problem. That’s all it was.