Daily Productive Sharing 300 - What Crashed Facebook?


Facebook suffered a serious outage this week, and so far it has published no clear progress report or postmortem on the incident. Third parties, however, have released some very detailed analyses. This one from Cloudflare, a content delivery network (CDN) provider, is a textbook example.

  1. After Cloudflare noticed that Facebook's services were behaving abnormally, its first reaction was to check itself, assuming the fault lay in its own services.
  2. Facebook looked as if someone had unplugged its network cable (readers familiar with the history of the Chinese Internet will probably know what this means): it disappeared from the entire Internet.
  3. The Internet is essentially a network of networks, assembled from many sub-networks, and those networks are linked to one another by BGP.
  4. Facebook's BGP announcements went offline at the time, so other networks no longer knew Facebook existed, and neither did the Internet as a whole (although the GFW still knew).
  5. Interestingly, in the first moments after Facebook went offline, the number of requests to Facebook actually surged, because people who could not reach Facebook kept resubmitting requests to figure out what was going on.
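
To make points 3 and 4 more concrete, here is a rough sketch of how a network "disappears" once its routes are withdrawn. It is a toy model, not real BGP: the `ToyInternet` class, the network names, and the prefix `203.0.113.0/24` (a range reserved for documentation) are all made up for illustration, and real BGP adds path attributes, policies, and timers that are ignored here.

```python
from collections import defaultdict


class ToyInternet:
    """Toy network of networks; every name here is hypothetical."""

    def __init__(self):
        # network name -> set of directly connected networks
        self.links = defaultdict(set)
        # network name -> {prefix: network that originated it}
        self.tables = defaultdict(dict)

    def connect(self, a, b):
        self.links[a].add(b)
        self.links[b].add(a)

    def _flood(self, start, update):
        # Apply `update` to the routing table of every network reachable
        # from `start`, mimicking how an update spreads between peers.
        seen, stack = {start}, [start]
        while stack:
            node = stack.pop()
            update(self.tables[node])
            for peer in self.links[node]:
                if peer not in seen:
                    seen.add(peer)
                    stack.append(peer)

    def announce(self, origin, prefix):
        self._flood(origin, lambda table: table.__setitem__(prefix, origin))

    def withdraw(self, origin, prefix):
        self._flood(origin, lambda table: table.pop(prefix, None))

    def can_reach(self, source, prefix):
        # A network can only send traffic to prefixes it has heard about.
        return prefix in self.tables[source]


net = ToyInternet()
net.connect("isp-a", "transit")
net.connect("transit", "social-net")

net.announce("social-net", "203.0.113.0/24")
reachable_before = net.can_reach("isp-a", "203.0.113.0/24")

net.withdraw("social-net", "203.0.113.0/24")
reachable_after = net.can_reach("isp-a", "203.0.113.0/24")

print(reachable_before, reachable_after)
```

In this sketch the servers behind `social-net` never stop running. The other networks simply forget how to get there, which is why the outage looked like a cable being pulled.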


