date: Tue Nov 26 07:14:59 PST 2013 author: Charlie Robbins title: Keeping The npm Registry Awesome slug: npm-post-mortem category: npm
We know the availability and overall health of The npm Registry is paramount to everyone using Node.js as well as the larger JavaScript community and those of your using it for some awesome projects and ideas. Between November 4th and November 15th 2013 The npm Registry had several hours of downtime over three distinct time periods:
The root cause of these downtime was insufficient resources: both hardware and human. This is a full post-mortem where we will be look at how npmjs.org works, what went wrong, how we changed the previous architecture of The npm Registry to fix it, as well next steps we are taking to prevent this from happening again.
All of the next steps require additional expenditure from Nodejitsu: both servers and labor. This is why along with this post-mortem we are announcing our crowdfunding campaign: scalenpm.org! Our goal is to raise enough funds so that Nodejitsu can continue to run The npm Registry as a free service for you, the community.
Please take a minute now to donate at https://scalenpm.org!
There are two distinct components that make up npmjs.org operated by different people:
Here is a high-level summary of the old architecture:
As illustrated above, before November 13th, 2013, npm operated as a single CouchDB server with regular daily backups. We briefly ran a multi-master CouchDB setup after downtime back in August, but after reports that npm login
no longer worked correctly we rolled back to a single CouchDB server. On both November 13th and November 15th CouchDB became unresponsive on requests to the /registry
database while requests to all other databases (e.g. /public_users
) remained responsive. Although the root cause of the CouchDB failures have yet to be determined given that only requests to /registry
were slow and/or timed out we suspect it is related to the massive number of attachments stored in the registry.
The incident on November 4th was ultimately resolved by a reboot and resize of the host machine, but when the same symptoms reoccured less than 10 days later additional steps were taken:
When neither of these yielded a solution Jason Smith and I decided to move to a multi-master architecture with continuous replication illustrated below:
This should have been the end of our story but unfortunately our supervision logic did not function properly to restart the secondary master on the morning of November 15th. During this time we moved briefly back to a single master architecture. Since then the secondary master has been closely monitored by the entire Nodejitsu operations team to ensure it's continued stability.
The public npm registry simply cannot go down. Ever. We gained a lot of operational knowledge about The npm Registry and about CouchDB as a result of these outages. This new knowledge has made clear several steps that we need to take to prevent future downtime:
When these new infrastructure components are in-place The npm Registry will look like this:
The npm Registry has had a 10x year. In November 2012 there were 13.5 million downloads. In October 2013 there were 114.6 million package downloads. We're honored to have been a part of sustaining this growth for the community and we want to see it continue to grow to a billion package downloads a month and beyond.
But we need your help! All of these necessary improvements require more servers, more time from Nodejitsu staff and an overall increase to what we spend maintaining the public npm registry as a free service for the Node.js community.
Please take a minute now to donate at https://scalenpm.org!