March 25, 2022

GitHub Outages

 

GitHub has admitted that its services have been only intermittently available to developers this past week because of problems with its database cluster “MySQL 1”. The Microsoft-owned code-sharing website confirmed that the outages affected many of its 73 million users and degraded their experience of the platform. The company says it is aware that the outages have hurt its customers’ productivity and that it is taking the matter very seriously.

The outages stem from resource contention on the database cluster, which degraded the performance of many of GitHub’s features and services during peak load times. Problems spiked on March 23rd, many of them relating to push and pull requests for projects. Similar incidents occurred on March 16th, 17th, and 22nd, each lasting between 2 and 5 hours.

GitHub is critical to keeping enterprise applications running, as a large number of software products are hosted through the service. The repeated outages have prevented some developers from maintaining their products, because git operations, pull requests, webhooks, API requests, Codespaces, Packages, and Actions were all affected by poor query performance under peak load.

Despite having failover options at hand, GitHub was not able to pinpoint the query performance issues before they escalated. After a failover, a new load pattern introduced connectivity issues on the new primary, so applications were again unable to connect to the database cluster as they tried to reset their connections.
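To illustrate why a wave of connection resets can be a problem after a failover, here is a minimal Python sketch of a client that reconnects with jittered exponential backoff, so that many clients do not retry at the same instant. This is a generic pattern and not GitHub's code: the function `connect_with_backoff`, its parameters, and the `connect` callable are all hypothetical names chosen for the example.

```python
import random
import time


def connect_with_backoff(connect, max_attempts=5, base_delay=0.1, max_delay=5.0,
                         sleep=time.sleep, rng=random.random):
    """Call connect() until it succeeds, backing off between failed attempts.

    `connect` is a hypothetical callable that returns a connection or raises
    ConnectionError. `sleep` and `rng` are injectable so the logic can be tested.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The delay cap doubles on every failed attempt, up to max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Sleep a random fraction of the cap so clients spread their retries.
            sleep(cap * rng())
```

A caller would pass its own connection function, for example `connect_with_backoff(lambda: open_db_connection())`, where `open_db_connection` stands in for whatever the application uses to reach its database.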

In the more recent outages, GitHub enabled memory profiling on the database proxy so it could examine performance characteristics under peak load more closely. However, client connections to the database cluster started to fail again and a primary failover had to be performed in order to recover. Webhook traffic has since been throttled, which is currently mitigating the database issues during peak periods.
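GitHub has not said how its webhook throttling works. As a rough idea of what throttling means in practice, the sketch below uses a token bucket, one common rate-limiting technique: each delivery spends a token, and tokens refill at a fixed rate. The class `TokenBucket` and its parameters are assumptions made for illustration only.

```python
import time


class TokenBucket:
    """Allow at most `rate` events per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum number of stored tokens
        self.clock = clock          # injectable time source, for testing
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True if one event may proceed now, spending one token."""
        now = self.clock()
        # Refill in proportion to the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A sender would call `allow()` before each webhook delivery and queue or delay the delivery when it returns `False`, which smooths out load spikes on the systems behind it.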

The code-sharing service is taking steps to prevent further outages on this database cluster during peak load. GitHub is auditing load patterns, moving traffic to other databases, rolling out various performance fixes, and trying to reduce its failover times. Its senior vice president of engineering, Keith Ballinger, has apologised for the disruptions and has committed to ensuring that outages are dealt with and downtime is minimised. The company will disclose more details in its availability report in a few weeks.
