Facebook and OVH outages: a non-technical explanation

On 4 October 2021, Facebook suffered an outage lasting almost six hours, with none of the company’s services accessible. On 13 October, it was OVH’s turn to suffer a one-hour blackout. In both cases, human error was to blame.

In this article, we will explain in a non-technical way the reasons for these two blackouts and, for those who wish to go into more detail, we will provide more technical articles on these two incidents at the end.

The 6 degrees of separation

Do you know the theory of the 6 degrees of separation?

Established in 1929 by the Hungarian Frigyes Karinthy, it is based on the hypothesis that any person in the world can be linked to any other person by means of at most 5 intermediaries (and therefore at most 6 links between these two people).

In 1967, the social psychologist Stanley Milgram tested this theory in the United States: he sent letters to 60 people living in Nebraska, to be forwarded to a particular person living at a known address in Massachusetts. Each letter could only be handed to a personal acquaintance, who then had to repeat the process until the letter reached its destination. Of course, at each stage the idea was to give the letter to the person who seemed best suited to the task.

For letters that reached their destination, the figure of 6 seemed to be confirmed.

In practice, this study suffered from several biases, but the idea continues to be explored. In 2011, Facebook calculated that, among its then 721 million users, each person was on average connected to any other by 4.74 relationships (and 3.57 in 2016). In the US alone, the 2016 figure drops to 3.46.

What’s the link with Facebook and OVH, you might ask? We’ll get to that!

When the people in Milgram’s experiment were choosing the next person to give the letter to, they used the information they knew to choose the “best” person: thus, it seems more relevant to give the letter to a sales representative who will be travelling than to someone who has never left his or her small town… unless, for example, he or she has a brother who lives in the target town and is currently on holiday in his or her small hometown!

We can see that decisions depend on several factors and can vary over time (if this brother has already left, the decision is no longer the same).

These people were “routing” without knowing it, and it is precisely routing that caused the problems at Facebook and OVH. Before we can say more, we need to introduce a few concepts in a non-technical way.

IP address and DNS

Viewing a website actually means downloading a web page hosted on the site’s remote server. When our computer talks to this site, it sends messages to the IP address of that other computer, which is a sort of postal address. An IP address is a sequence of 4 numbers associated with each machine on the internet, for example 51.91.139.53 for the server hosting this article.

(Here and hereafter we will be deliberately imprecise, the objective being to present concepts, so we will not talk about IPv4 / IPv6, local network IPs, NAT…)

But how does your computer know that, to obtain the content of the page https://www.netanswer.fr/des-releases-toutes-les-3-semaines, it must talk to the server at address 51.91.139.53?

Imagine that (in a world without internet!) an American living in the United States needs the telephone number of the marriage department of the town hall of Trifouilly-les-Oies in France. He will:

1. Call US directory enquiries.

2. US directory enquiries will direct him to French directory enquiries.

3. French directory enquiries will direct him to the town hall switchboard, which can give him the requested telephone number.

For our website the principle is exactly the same!

1. A central server is queried (the equivalent of the post office’s white and yellow pages for addresses).

2. It answers that, for .fr sites, you need to ask a particular server in France.

3. That server points to ns0.cusae.com / ns1.cusae.com (our DNS servers).

4. Our DNS servers can then answer with the IP address of the server: 51.91.139.53.

This is therefore a pyramid structure.
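For readers who like to see things in code, the four steps above can be sketched as a chain of lookups. This is only a toy model in Python: the tables are hard-coded, the name "fr-registry" is a made-up label for the server in charge of .fr, and a real resolver would send network queries and cache the answers.

```python
# Toy model of hierarchical DNS resolution. Every table below is illustrative:
# a real resolver queries remote servers instead of reading local dictionaries.

# Central (root) server: who handles each top-level domain?
ROOT = {"fr": "fr-registry"}  # "fr-registry" is a hypothetical name

# Server in charge of .fr: which DNS server handles each domain?
TLD_SERVERS = {
    "fr-registry": {"netanswer.fr": "ns0.cusae.com"},
}

# The site's own DNS servers: which IP address serves each host name?
NAME_SERVERS = {
    "ns0.cusae.com": {"www.netanswer.fr": "51.91.139.53"},
}


def resolve(hostname: str) -> list[str]:
    """Return each answer obtained while walking down the pyramid."""
    labels = hostname.split(".")
    tld = labels[-1]                 # "fr"
    domain = ".".join(labels[-2:])   # "netanswer.fr"

    tld_server = ROOT[tld]                             # steps 1 and 2
    name_server = TLD_SERVERS[tld_server][domain]      # step 3
    ip_address = NAME_SERVERS[name_server][hostname]   # step 4
    return [tld_server, name_server, ip_address]


print(" -> ".join(resolve("www.netanswer.fr")))
# prints: fr-registry -> ns0.cusae.com -> 51.91.139.53
```

Each level only knows who to ask next, which is what keeps the top of the pyramid small.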

Routing

So we know that we have to exchange with 51.91.139.53… But how? Who do we address and how do we determine the shortest path from server to server to get from our computer to the destination server?

We must imagine all the machines in the world as a gigantic graph of billions of interconnected machines, but each machine is only directly connected to a small number of other machines.

A first idea would be to say to all your neighbours: “can you tell 51.91.139.53 that I would like the contents of https://www.netanswer.fr/des-releases-toutes-les-3-semaines ?”, with each neighbour then passing this message on to all their neighbours, who pass it on to theirs, and so on.

The message would eventually reach its destination, but it wouldn’t be very efficient: imagine if all the world’s traffic went through your computer every time!
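To get a feel for the waste, here is a minimal Python sketch that compares this "tell everyone" approach with a direct route. The six-machine network is invented for the example; the real internet is of course vastly larger.

```python
from collections import deque

# Hypothetical mini-network: each machine lists its direct neighbours.
NETWORK = {
    "you": ["a", "b"],
    "a": ["you", "c"],
    "b": ["you", "c", "d"],
    "c": ["a", "b", "server"],
    "d": ["b"],
    "server": ["c"],
}


def flood(network, source):
    """Every machine forwards the message once to all of its neighbours.

    Returns the total number of copies sent over the links.
    """
    seen = {source}
    queue = deque([source])
    messages_sent = 0
    while queue:
        machine = queue.popleft()
        for neighbour in network[machine]:
            messages_sent += 1  # one copy per link, useful or not
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return messages_sent


def shortest_hops(network, source, target):
    """Number of hops on the shortest path (breadth-first search)."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        machine = queue.popleft()
        if machine == target:
            return distance[machine]
        for neighbour in network[machine]:
            if neighbour not in distance:
                distance[neighbour] = distance[machine] + 1
                queue.append(neighbour)
    return None  # target cannot be reached


print(flood(NETWORK, "you"), "copies sent by flooding")            # 12
print(shortest_hops(NETWORK, "you", "server"), "hops if routed")   # 3
```

Even in this tiny network, flooding sends four times as many messages as a direct route needs, and the gap grows with the size of the network.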

Small aside:

This is, however, roughly what happens on a local network (that of your company or school, for example) or on a Wi-Fi network: when a message arrives for one of the machines on the network (for example, the response from a website), it is sent to all the machines on the network, and it is up to each of them to ignore the messages that do not concern it.

This is why it is important to use secure sites (in https): without encryption, anyone listening to the network can read your entire exchange with a website, including passwords and card numbers.

This is also why public Wi-Fi networks are considered risky.

But let’s get back to our topic and return to Milgram’s experiment. This time, you have to send a letter each day to the same recipient (and you have very understanding friends – and friends of friends – ready to help you each day!). In addition, each person in the chain tells the previous person how long it took them to pass the letter on to the next person.

So if on the first day you give the letter to “Intermediary A”, we have:

You => Intermediary A
Intermediary A => Intermediary B in 15 days
Intermediary B => Intermediary C in 1 day
Intermediary C => Intermediary D in 4 hours
Intermediary D => recipient in 1 day

So you have all this information at your disposal.

On the 18th day (because before that you don’t have all this feedback) you realise that “Intermediary A” may not be very efficient (15 days out of 18 in total is a lot), so you decide to go through another friend, and this time the letter only takes 5 days to arrive at its destination.

As all the intermediaries apply the same reasoning, the same type of optimisation, the route used by the letter becomes more and more efficient over time!

Of course, you (or your intermediaries) may make new friends who tell you they have good contacts in certain cities or countries, or have friends who go on holiday, so the optimal route will vary over time. And if you have letters to send to different places, your optimal route will be different for each one (maybe “Intermediary A” is much more efficient for another destination city).

The set of rules you apply depending on the final recipient is your “routing table” and this is how the machines in the world communicate!
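As a rough illustration, such a table can be sketched in a few lines of Python. The delays are taken from the letter example above (about 17 days via "Intermediary A", 5 days via another friend); the second destination, "Paris", and its delays are invented for the example, and real routers use far richer criteria than a single measured delay.

```python
# Observed delivery times, in days, per destination and per first friend.
# The Massachusetts figures come from the letter story; "Paris" is made up.
observed_delays = {
    "Massachusetts": {
        "Intermediary A": 15 + 1 + 4 / 24 + 1,  # 15 days + 1 day + 4 hours + 1 day
        "Another friend": 5.0,
    },
    "Paris": {"Intermediary A": 2.0, "Another friend": 9.0},
}


def build_routing_table(delays):
    """For every destination, keep the friend with the shortest observed delay."""
    return {
        destination: min(per_friend, key=per_friend.get)
        for destination, per_friend in delays.items()
    }


routing_table = build_routing_table(observed_delays)
print(routing_table)
# {'Massachusetts': 'Another friend', 'Paris': 'Intermediary A'}

# New feedback arrives: the other friend is on holiday and now needs 30 days.
observed_delays["Massachusetts"]["Another friend"] = 30.0
routing_table = build_routing_table(observed_delays)
print(routing_table["Massachusetts"])  # Intermediary A
```

The table is simply rebuilt whenever new feedback comes in, which is how the best route can change over time and differ from one destination to another.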

Autonomous Systems (AS) and Border Gateway Protocol (BGP)

In practice, several remarks can be made:

Your home computer has far fewer choices of recipients (usually you will only have your ISP, but if it is down you could go through your phone’s shared connection, thus changing your “routing table” manually).
The internet is divided into a set of sub-networks: for example, there are only a limited number of direct links between French and Japanese machines. In the same way, the servers of Facebook or those of your internet service provider can be seen as independent sub-networks communicating with other sub-networks.

Routing rules can thus be broken down into two types:

Those that allow you to move from one sub-network (the so-called Autonomous Systems, or ASes) to another.
Those that handle routing within an Autonomous System, under its own responsibility.

The protocol that allows different ASes to communicate with each other is called BGP (Border Gateway Protocol). It allows them to announce themselves and exchange routing rules. A typical message amounts to saying “I’m Facebook and I manage this set of IPs, so send me anything addressed to them”.

We are now ready to explain the troubles of Facebook and OVH!

In the case of Facebook, they stopped announcing themselves to the other ASes via BGP, so nobody knew how to communicate with “Facebook” anymore!

In the case of OVH, an erroneous update was applied to an internal router, and the error propagated to all the routers of this AS, which prevented any routing within their network: OVH was still visible to the other ASes, but once a message arrived inside the OVH network, it was impossible to route it to the right machine!

In the case of Facebook, it’s as if they had become Atlantis: we’ve heard about them but it’s complicated to send them mail…
And in the case of OVH, it’s as if all the French postmen had lost their memory: “Hmm, Marseille, yes, that’s one of ours, but how do I get your mail there?”
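The two failure modes can be mimicked with a small Python sketch built on the standard `ipaddress` module. Everything here is a simplification made for illustration: the AS names are invented, the IP blocks are ranges reserved for documentation, and real BGP involves much more than a shared dictionary.

```python
import ipaddress

# What the rest of the internet knows: which AS announced which block of IPs.
# AS names and prefixes are hypothetical.
bgp_table = {
    ipaddress.ip_network("192.0.2.0/24"): "AS-SOCIAL",
    ipaddress.ip_network("198.51.100.0/24"): "AS-HOSTER",
}

# What each AS knows internally: which machine handles which address.
internal_routes = {
    "AS-SOCIAL": {"192.0.2.10": "web server"},
    "AS-HOSTER": {"198.51.100.20": "customer server"},
}


def deliver(ip: str) -> str:
    """Try to deliver a message: first between ASes, then inside the AS."""
    address = ipaddress.ip_address(ip)
    # Step 1 (between ASes): find an AS that announced a block containing the IP.
    owner = next((asn for net, asn in bgp_table.items() if address in net), None)
    if owner is None:
        return "no AS announces this address"
    # Step 2 (inside the AS): find the machine.
    machine = internal_routes[owner].get(ip)
    if machine is None:
        return f"reached {owner}, but no internal route"
    return f"delivered to {machine} in {owner}"


print(deliver("192.0.2.10"))     # delivered to web server in AS-SOCIAL
print(deliver("198.51.100.20"))  # delivered to customer server in AS-HOSTER

# Facebook-like failure: the AS stops announcing its block of IPs.
del bgp_table[ipaddress.ip_network("192.0.2.0/24")]
# OVH-like failure: a bad update wipes the routes inside the AS.
internal_routes["AS-HOSTER"].clear()

print(deliver("192.0.2.10"))     # no AS announces this address
print(deliver("198.51.100.20"))  # reached AS-HOSTER, but no internal route
```

In the first case the message never finds the right country; in the second it reaches the country but no postman knows the street.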

In both cases the error was human and is unfortunately bound to happen again: the internet is very powerful, very decentralised and able to adapt continuously but it remains a fragile system at the mercy of a configuration error that spreads.

It is even suspected that some countries might abuse the BGP system (based on trust) to route traffic between certain countries through their infrastructure and thus more easily spy on them…

For further information:

https://www.numerama.com/tech/747034-de-nombreux-sites-ne-fonctionnent-plus-ovh-semble-avoir-des-problemes.html

https://blog.cloudflare.com/october-2021-facebook-outage/

https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/

https://www.techtarget.com/searchnetworking/definition/BGP-Border-Gateway-Protocol
