On Friday, November 30th, one of our databases became partially corrupted. To be as clear as possible: we did not lose any data that we know of and a full recovery is in progress. It will take a couple of days for things to return to normal but we are confident that the issue will be resolved completely.
Right now, a symptom of the issue is that you'll randomly see empty results appear. This can happen when you look at any page that shows multiple clips, boards, or people, including your personal clips or boards. If you reload the page, those objects will probably reappear because this issue is happening on about 20% of our queries.
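To put a rough number on why reloading usually helps, here is a small sketch. It assumes each page load fails independently with the roughly 20% rate quoted above. That independence is our simplifying assumption for illustration, not something we have measured, and the function name is made up for this example.

```python
def chance_all_loads_empty(failure_rate: float, loads: int) -> float:
    """Probability that every one of `loads` page loads comes back empty.

    Assumes each load fails independently with probability `failure_rate`
    (an illustrative assumption, not a measured property of the system).
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if loads < 1:
        raise ValueError("loads must be at least 1")
    return failure_rate ** loads


# With a ~20% failure rate, compare one load against a load plus reloads.
for loads in (1, 2, 3):
    print(loads, round(chance_all_loads_empty(0.2, loads), 4))
```

Under that assumption, the chance of seeing an empty page twice in a row is about 4%, and three times in a row is under 1%.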
The specific problem is that one of our five search indexes (which is implemented on Riak) became corrupted, causing queries to fail, throw errors, and return empty results to the front end. The affected machine also became less stable, so, working with friends from Basho (the good folks behind Riak), we decided that the best path was a slow rebuild of the search index.
This process is slow but very safe. Everyone's data will be protected. We're going to keep the site running during this time, but performance may be degraded while the reindexing is underway.
We are very sorry for the inconvenience. Your data is extremely important to us and protecting its integrity is our highest priority. We don't take this obligation lightly.
So that you know just how seriously we take this responsibility, the rest of this post describes the different ways that we secure your data.
As I mentioned above, we store our data in Riak, which is our favorite type of NoSQL database. Riak allows you to easily manage redundant copies of your data. In our case, every object is stored in triplicate, spread across five different machines. Moreover, the data is distributed in such a way that if any two of our five Riak machines died, there would still be at least one copy of each object left.
So, the first level of protecting your data is that it's in triplicate right from the start.
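The triplicate guarantee above can be checked with a short sketch. This is a simplified stand-in, not Riak's actual placement algorithm: it assumes each object is hashed to a starting machine and then copied to the next two machines in a fixed ring. The machine names, key names, and function names are all hypothetical.

```python
import hashlib
from itertools import combinations

# Hypothetical machine names; five nodes and three replicas, as in the post.
NODES = ["riak1", "riak2", "riak3", "riak4", "riak5"]
REPLICAS = 3


def replica_nodes(key, nodes=NODES, n=REPLICAS):
    """Pick n distinct nodes for a key by walking a fixed ring of nodes.

    Simplified illustration only; Riak's real placement logic differs.
    """
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    start = int.from_bytes(digest, "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(n)]


def survives_any_failures(keys, failures, nodes=NODES):
    """Return True if every key keeps at least one replica no matter
    which `failures` nodes go down at the same time."""
    for dead in combinations(nodes, failures):
        for key in keys:
            if not set(replica_nodes(key, nodes)) - set(dead):
                return False
    return True


keys = [f"clip-{i}" for i in range(1000)]
# Every key lives on three distinct machines, so losing any two
# machines always leaves at least one copy.
print(survives_any_failures(keys, failures=2))
```

The check passes for two simultaneous failures because each object sits on three distinct machines. It is the "distinct machines" part that matters: three copies on the same machine would not help.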
We also do full backups of those Riak machines on a nightly basis. If something catastrophic happened to all of our machines at once, we would be able to recover the state of Clipboard from less than 24 hours earlier.
Because each nightly backup covers all of the nodes, it captures all three replicas, which brings the total to six copies of your data.
As a final precaution, we take regular snapshots of the entire production environment and copy them to our staging environment. We do this primarily so that we can test new features on a full copy without bringing down the main service. These snapshots are performed on about a weekly basis.
So, altogether, we may have as many as 12 copies of every clip that is created, with no fewer than three different ways to recover the data if something goes wrong. I hope this gives you some comfort. It certainly does for us.
