Facebook says bug in software audit tool triggered mega outage yesterday
Facebook has provided more information on the cause of yesterday’s global outage, which lasted six hours and likely cost the company tens of millions of dollars in lost revenue. The root cause: a bug in a software program that was supposed to catch and block commands that could accidentally take systems offline.
In a blog post, Santosh Janardan, vice president of infrastructure at Facebook, said the outage was triggered by engineers performing maintenance on its global backbone, which is made up of tens of thousands of miles of fiber-optic cables and many routers connecting the company’s data centers around the world.
Some of these centers are responsible for connecting the backbone to the wider Internet. When users open one of the company’s applications, content is delivered to them over the backbone from Facebook’s largest data centers, which house millions of servers. The backbone requires frequent maintenance, such as testing routers or replacing fiber cables.
It was during one of these routine sessions that disaster struck. According to Janardan, a command was issued to assess the availability of capacity on the global backbone. Facebook developed special auditing software to verify that such commands won’t cause chaos, but this time it failed to detect that the instruction was faulty. (Facebook has yet to say exactly what was wrong with it.)
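Facebook has not published how its audit tool works, but the idea of such a safeguard can be sketched as a pre-flight check on maintenance commands. The function and link names below are hypothetical illustrations, not Facebook’s actual tooling:

```python
# Hypothetical sketch of a pre-flight audit for backbone maintenance
# commands. Names and logic are illustrative; Facebook has not
# disclosed its real tool.

def audit_command(links_to_disable, all_links):
    """Reject a command that would leave no backbone links running."""
    remaining = set(all_links) - set(links_to_disable)
    if not remaining:
        raise ValueError("refusing command: it would take down the entire backbone")
    return True

backbone = ["dc1-dc2", "dc1-dc3", "dc2-dc3"]  # hypothetical inter-datacenter links

audit_command(["dc1-dc2"], backbone)  # a safe command passes the check
# A capacity-assessment command that accidentally targets every link should
# be blocked; the bug Facebook describes is a check like this failing to fire.
try:
    audit_command(backbone, backbone)
except ValueError as err:
    print(err)
```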
The result was a cascade of failures. The errant command took down all backbone connections, effectively isolating the company’s data centers from one another. This, in turn, triggered an issue with Facebook’s Domain Name System (DNS) servers. DNS is like the Internet’s directory: it translates the website names typed into a browser into dot-separated strings of numbers, called IP addresses, that computers can recognize.
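What DNS does can be illustrated with a toy lookup table. Real DNS is a distributed hierarchy of servers, and the addresses below are for illustration only:

```python
# Toy illustration of DNS resolution: a name-to-address lookup.
# Real DNS is a distributed hierarchy; this table and its addresses
# are for illustration only.
DNS_TABLE = {
    "example.com": "93.184.216.34",
    "facebook.com": "157.240.0.35",
}

def resolve(name):
    ip = DNS_TABLE.get(name)
    if ip is None:
        # During the outage, queries for Facebook's domains failed like this.
        raise LookupError(f"cannot resolve {name}")
    return ip

print(resolve("example.com"))  # 93.184.216.34
```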
DNS servers for large businesses are typically associated with a set of IP addresses advertised to other computers through a system known as the Border Gateway Protocol, or BGP. BGP is like an electronic postal system that chooses the most efficient route for messages across the many different networks that make up the Internet.
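A heavily simplified sketch of BGP’s core idea: among the routes advertised for a destination prefix, prefer the one that crosses the fewest networks. Real BGP weighs many more attributes; in the sketch below, AS numbers 64500–64502 are reserved for documentation, while 32934 is Facebook’s actual autonomous system number.

```python
# Simplified sketch of BGP route selection: prefer the advertisement
# whose AS path crosses the fewest networks. Real BGP considers many
# more attributes. AS numbers 64500-64502 are reserved for
# documentation; 32934 is Facebook's real autonomous system number.
ROUTES = {
    "157.240.0.0/16": [
        {"next_hop": "isp-a", "as_path": [64500, 64501, 32934]},
        {"next_hop": "isp-b", "as_path": [64502, 32934]},
    ],
}

def best_route(prefix):
    candidates = ROUTES.get(prefix, [])
    if not candidates:
        return None  # no advertisement anywhere: the prefix is unreachable
    return min(candidates, key=lambda route: len(route["as_path"]))

print(best_route("157.240.0.0/16")["next_hop"])  # isp-b: shorter AS path
```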
If the BGP advertisements disappear, other computers cannot reach the DNS servers, and therefore cannot reach the networks behind them. Facebook had programmed its DNS servers to stop their BGP advertisements if the servers were cut off from its data centers, which is what happened after the errant command was issued. This effectively prevented the rest of the Internet from finding Facebook’s servers, and clients from reaching its services.
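The withdrawal behavior described above can be sketched in a few lines. This is an illustration of the mechanism, not Facebook’s code, and the prefix shown is a placeholder:

```python
# Illustrative sketch of the failure mode: DNS servers withdraw their
# BGP advertisement when they cannot reach the data centers behind
# them. The prefix is a placeholder; the logic is an assumption, not
# Facebook's actual code.
ANNOUNCED = {"129.134.30.0/24"}  # prefix covering the DNS servers (illustrative)

def update_announcements(backbone_reachable):
    """Withdraw the DNS prefix when the backbone is unreachable."""
    if not backbone_reachable:
        # The servers assume they would serve stale answers and go silent,
        # which makes them unreachable from the rest of the Internet.
        ANNOUNCED.discard("129.134.30.0/24")
    return ANNOUNCED

print(update_announcements(False))  # set(): the prefix vanishes from BGP
```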
Facebook engineers trying to fix the problem were unable to connect to the remote data centers over the network, because the backbone was no longer functioning, and the outage had also knocked out the software tools normally used to handle such emergencies. This meant engineers had to travel to the data centers in person and work on the servers directly. Because the centers and their servers are deliberately hard to access for security reasons, getting in took time, which partly explains why the outage lasted so long.
What it doesn’t explain is how the software audit tool missed the issue in the first place, or why Facebook’s networking strategy apparently didn’t involve segmenting at least some of its data centers onto a backup backbone so they wouldn’t all disappear at once. There are plenty of other questions to answer, including whether members of the engineering team could have been stationed in a way that gave them faster access to the data centers in a crisis.
Managing a global network of Facebook’s size and complexity is arguably one of the hardest technical challenges any business has ever faced. That is why any further findings and lessons that emerge from yesterday’s events will be of immense value to CIOs and businesses around the world.