Yesterday GitHub tweeted about last week's outage, apologizing for the real impact it had on its many loyal users, and published a post-incident analysis report, with the link attached under the tweet.

Where to read the report: October 21 post-incident analysis (via the "read the original" link at the bottom of this post).
Open the report on your computer and follow along with this walkthrough to trace the incident from start to finish. OK, let's begin!
Skip the first two paragraphs of throat-clearing and go straight to the first section, 【Background】, and skim its second paragraph:
At 22:52 UTC on October 21, routine maintenance work to replace failing 100G optical equipment resulted in the loss of connectivity between our US East Coast network hub and our primary US East Coast data center. Connectivity between these locations was restored in 43 seconds, but this brief outage triggered a chain of events that led to 24 hours and 11 minutes of service degradation.
GitHub attributes the incident to this: at 22:52 UTC on October 21, work to replace failing 100G optical equipment severed the connection between its US East Coast network hub and its US East Coast data center. Connectivity between the two sites was restored within 43 seconds, but that brief outage triggered a chain of events that led to 24 hours and 11 minutes of service degradation. (Note: service degradation, not an "outage"!)
In other words, a routine maintenance job briefly cut cross-site communication, and then a "butterfly effect" kicked off a day-and-night firefight...
Let's keep reading.
In the past, we’ve discussed how we use MySQL to store GitHub metadata as well as our approach to MySQL High Availability. GitHub operates multiple MySQL clusters varying in size from hundreds of gigabytes to nearly five terabytes, each with up to dozens of read replicas per cluster to store non-Git metadata, so our applications can provide pull requests and issues, manage authentication, coordinate background processing, and serve additional functionality beyond raw Git object storage. Different data across different parts of the application is stored on various clusters through functional sharding.
GitHub operates multiple MySQL clusters, ranging in size from a few hundred GB to nearly 5 TB, each with up to dozens of read replicas storing non-Git metadata, so the applications can serve pull requests and issues, manage authentication, coordinate background processing, and provide functionality beyond raw Git object storage. Different data across different parts of the application is stored on different clusters through functional sharding.
Before getting to the specific cause, the report first lays out how GitHub uses MySQL clusters to store its non-Git metadata. Notice anything? MySQL, clusters, storage, data: anyone with a decent nose for this stuff can probably guess part of what's coming...
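To make "functional sharding" concrete, here is a minimal sketch. The cluster names and the routing table are hypothetical, not GitHub's actual layout; the idea is simply that each functional slice of the application is pinned to its own MySQL cluster:

```python
# Hypothetical illustration of functional sharding: each functional area
# of the application is pinned to its own MySQL cluster. The area names
# and cluster names below are invented for illustration.
FUNCTIONAL_SHARDS = {
    "pull_requests": "mysql-cluster-prs",
    "issues":        "mysql-cluster-issues",
    "auth":          "mysql-cluster-auth",
    "background":    "mysql-cluster-jobs",
}

def cluster_for(functional_area: str) -> str:
    """Route a query to the cluster that owns this slice of the data."""
    return FUNCTIONAL_SHARDS[functional_area]

print(cluster_for("issues"))  # -> mysql-cluster-issues
```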
OK, keep reading.
To improve performance at scale, our applications will direct writes to the relevant primary for each cluster, but delegate read requests to a subset of replica servers in the vast majority of cases. We use Orchestrator to manage our MySQL cluster topologies and handle automated failover. Orchestrator considers a number of variables during this process and is built on top of Raft for consensus. It’s possible for Orchestrator to implement topologies that applications are unable to support, therefore care must be taken to align Orchestrator’s configuration with application-level expectations.
To improve performance at scale, GitHub's applications write directly to the relevant primary for each cluster, but in the vast majority of cases delegate read requests to a subset of replica servers. GitHub uses Orchestrator to manage its MySQL cluster topologies and handle automated failover; Orchestrator weighs a number of variables in this process and is built on top of Raft for consensus. Orchestrator can implement topologies that the applications cannot support, so care must be taken to keep Orchestrator's configuration aligned with application-level expectations.
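Here is a minimal sketch of that read/write split. The hostnames and the replica-selection policy (plain random choice, naive keyword sniffing) are assumptions for illustration, not GitHub's code:

```python
import random

# Hypothetical topology for one cluster; hostnames are made up.
PRIMARY  = "mysql-primary.east.example.com"
REPLICAS = ["mysql-replica-1.east.example.com",
            "mysql-replica-2.east.example.com",
            "mysql-replica-3.east.example.com"]

def endpoint_for(statement: str) -> str:
    """Send writes to the primary; fan reads out over the replicas."""
    is_write = statement.lstrip().upper().startswith(
        ("INSERT", "UPDATE", "DELETE", "REPLACE"))
    return PRIMARY if is_write else random.choice(REPLICAS)

print(endpoint_for("UPDATE issues SET state='closed' WHERE id=42"))  # primary
print(endpoint_for("SELECT * FROM issues WHERE id=42"))              # a replica
```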
2018 October 21 22:52 UTC
During the network partition described above, Orchestrator, which had been active in our primary data center, began a process of leadership deselection, according to Raft consensus. The US West Coast data center and US East Coast public cloud Orchestrator nodes were able to establish a quorum and start failing over clusters to direct writes to the US West Coast data center. Orchestrator proceeded to organize the US West Coast database cluster topologies. When connectivity was restored, our application tier immediately began directing write traffic to the new primaries in the West Coast site.
So on the 21st, during the network partition, Orchestrator, which had been active in the primary data center, began leadership deselection per Raft consensus. The US West Coast data center and US East Coast public cloud Orchestrator nodes were able to establish a quorum and started failing clusters over, directing writes to the US West Coast data center. Orchestrator proceeded to organize the US West Coast database cluster topologies, and when connectivity was restored, the application tier immediately began directing write traffic to the newly elected primaries at the West Coast site.
In other words: during the network partition, as the majority side elected new primaries per the Raft protocol, a split-brain-like situation arose; both sides had accepted writes, and the data diverged.
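A toy model of the quorum math (node names are mine, not from the report): with three Orchestrator/Raft voters spread across the East Coast data center, the West Coast data center, and East Coast public cloud, cutting off the East Coast data center still leaves 2 of 3, so the remaining pair can elect a leader and fail clusters over:

```python
# Toy model of Raft quorum during the partition; node names are illustrative.
NODES = {"us-east-dc", "us-west-dc", "us-east-cloud"}

def has_quorum(reachable: set) -> bool:
    """A side of the partition can elect a leader iff it holds a majority."""
    return len(reachable) > len(NODES) // 2

# The 43-second partition cut off the East Coast data center:
print(has_quorum({"us-west-dc", "us-east-cloud"}))  # True  -> failover proceeds
print(has_quorum({"us-east-dc"}))                   # False -> old side loses leadership
```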
The database servers in the US East Coast data center contained a brief period of writes that had not been replicated to the US West Coast facility. Because the database clusters in both data centers now contained writes that were not present in the other data center, we were unable to fail the primary back over to the US East Coast data center safely.
The database servers in the US East Coast data center held a brief window of writes that had not yet been replicated to the US West Coast facility. Because the database clusters in both data centers now contained writes absent from the other, it was not safe to fail the primaries back over to the US East Coast data center.
Heard of cross-region active-active? Once the active-active link breaks, the writes accumulating on each end conflict with each other; if, when the network recovers, the active-active scheme doesn't resolve that conflict, you get the service degradation that followed.
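One way to see why failing back was unsafe: in GTID terms, each side had executed transactions the other never received. Here's a toy sketch (invented GTID labels, not MySQL's real uuid:interval syntax) checking whether either side's executed set is a subset of the other's; here neither is, so neither server can simply resume replicating from the other:

```python
# Invented GTID sets for illustration: after the failover, each side holds
# transactions the other never received.
east_executed = {f"east-uuid:{n}" for n in range(1, 101)}    # includes txns 95-100
west_executed = {f"east-uuid:{n}" for n in range(1, 95)} | {
                 f"west-uuid:{n}" for n in range(1, 51)}     # new primary's own writes

east_only = east_executed - west_executed   # writes stranded on the East Coast
west_only = west_executed - east_executed   # writes taken after failover

# A safe fail-back requires one history to be a strict subset of the other.
print(len(east_only), len(west_only))   # -> 6 50
print(not east_only or not west_only)   # -> False: histories diverged
```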
While MySQL data backups occur every four hours and are retained for many years, the backups are stored remotely in a public cloud blob storage service. The time required to restore multiple terabytes of backup data caused the process to take hours. A significant portion of the time was consumed transferring the data from the remote backup service.
Although MySQL data backups are taken every four hours and retained for many years, they are stored remotely in a public cloud blob storage service, so restoring multiple terabytes of backup data took hours. A significant portion of the time went to transferring the data from the remote backup service; decompressing, checksumming, preparing, and loading the large backup files onto newly provisioned MySQL servers consumed the majority of it.
Once GitHub's engineers spotted the problem, they launched a series of rescue measures. This passage explains why reviving the "patient" took so long: essentially all the time went into transferring, decompressing, checksumming, and reloading the MySQL backup data.
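A back-of-envelope calculation (the link speed is my assumption, not a figure from the report) shows that just pulling one multi-terabyte backup out of remote blob storage eats serious time, before any decompression or loading even starts:

```python
# Back-of-envelope: time to pull one ~5 TB backup over an assumed 10 Gbit/s
# effective link, before decompression, checksumming, or loading begins.
backup_bytes    = 5 * 10**12    # ~5 TB cluster, per the report
link_bits_per_s = 10 * 10**9    # assumed effective throughput

transfer_s = backup_bytes * 8 / link_bits_per_s
print(f"{transfer_s / 3600:.1f} hours")  # -> 1.1 hours per cluster, best case
```

And that is the optimistic case for a single cluster; GitHub runs multiple clusters, and at lower effective throughput this stretches much further.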
But that seems only loosely related to GitHub pages failing to load; couldn't they bring the pages back first and sort out the data afterwards? Hold on, let's see how they explain it...
Guarding the confidentiality and integrity of user data is GitHub’s highest priority. In an effort to preserve this data, we decided that the 30+ minutes of data written to the US West Coast data center prevented us from considering options other than failing-forward in order to keep user data safe. However, applications running in the East Coast that depend on writing information to a West Coast MySQL cluster are currently unable to cope with the additional latency introduced by a cross-country round trip for the majority of their database calls. This decision would result in our service being unusable for many users. We believe that the extended degradation of service was worth ensuring the consistency of our users’ data.
Guarding the confidentiality and integrity of user data is GitHub's highest priority. To preserve that data, they decided that the 30+ minutes of data already written to the US West Coast data center ruled out any option other than failing forward. However, applications running on the East Coast that depend on writing to a West Coast MySQL cluster cannot cope with the extra latency of a cross-country round trip on the majority of their database calls, so this decision would leave the service unusable for many users. GitHub judged the extended service degradation worth it to ensure the consistency of users' data.
So the degradation dragged on for 24 hours and 11 minutes because GitHub's strategy in this incident was to prioritize the integrity of user data over site availability and time to recovery.
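Why cross-country writes were unbearable is simple arithmetic. The round-trip time and call count below are my assumptions for illustration: coast-to-coast RTTs run on the order of tens of milliseconds, and a single page view can issue many sequential database calls:

```python
# Assumed numbers for illustration: coast-to-coast RTT and calls per request.
rtt_ms           = 60   # rough US East <-> US West round trip
db_calls_per_req = 20   # sequential database calls behind one page view

added_latency_ms = rtt_ms * db_calls_per_req
print(f"+{added_latency_ms} ms per request")  # -> +1200 ms: unusable for many users
```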
…we can explain the events that led to this incident, the lessons we’ve learned, and the steps we’re taking as a company to better ensure this doesn’t happen again.
GitHub apologized to all affected users, saying in effect: "we've learned our lessons and are taking a series of steps, and we hope to better ensure that nothing like this happens again."
During our recovery, we captured the MySQL binary logs containing the writes we took in our primary site that were not replicated to our West Coast site from each affected cluster. The total number of writes that were not replicated to the West Coast was relatively small.
GitHub also said that during recovery it captured the MySQL binary logs containing the writes taken at the primary site that were never replicated to the West Coast; it is now analyzing those logs and will go on to reconcile the data inconsistencies they caused.
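For a sense of what "capturing the binary logs" can look like in practice, here's a hedged sketch using MySQL's stock mysqlbinlog tool. The binlog file name and the time window are hypothetical, and the report doesn't describe GitHub's exact tooling; this just dumps the row events written during the partition window so they can be reviewed and reconciled:

```python
import subprocess

# Sketch: render the row events from one binlog file for the window in which
# unreplicated writes landed. File name and timestamps are hypothetical.
subprocess.run([
    "mysqlbinlog",
    "--base64-output=DECODE-ROWS",  # decode row events into readable form
    "--verbose",
    "--start-datetime", "2018-10-21 22:52:00",
    "--stop-datetime",  "2018-10-21 23:00:00",
    "mysql-bin.000123",
], check=True)
```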
(The End)