The primary backend store of Clipboard is built on top of Riak, one of the lesser known NoSQLs solutions. We love Riak and are really happy with our experiences with it -- both in terms of development and operations -- but to get to where we are, we had to use some tricks. In this post I want to share with you why we chose Riak and also arm you with some of the best tricks that we learned along the way. Individually, these tricks gave us better than a 100x performance boost, so they may make a big difference for you too.
If you don't know what Clipboard is, you should try it out. We're in private beta now, but here's a backdoor that will bypass the invitation system: Register at Clipboard.
There are now a ton of great open source data stores to choose from, ranging from the more traditional relational models, like MySQL and PostgreSQL, to the newer NoSQL approaches that have been popularized by CouchDB, MongoDB, Cassandra, and others. For me, the two most important considerations are (1) how easy it is to write effective code and (2) how bulletproof the system is operationally. Others may argue that other attributes -- like performance or the particulars of the data model -- are more important, but I’ll pick simplicity and robustness every time. A simple and robust store can usually be finessed to map to any data model and can be scaled outward to make up for performance.
Another nuance that greatly influenced the choice was our need to support search. Over a year ago I tried almost every NoSQL + Lucene combination and, in virtually every case, I had a hard time getting the two systems to work the way I wanted. Clearly, others have made this path work, but at the high end it is genuinely hard to make a data store and a search index scale out together gracefully.
It was those concerns that pushed me in the direction of Riak, and the more I learned about Riak, the more I liked it. I won’t reproduce the entire investigation here, but this is the short list of things that resonated with me:
- Robustness - multiple copies of everything are stored, and if individual nodes die, the cluster knows how to find the surviving copies.
- Graceful resizing - Nodes can be added and removed from a cluster without interrupting services, and data will be automatically redistributed as appropriate, all without sharding at the application layer.
- Tuning of eventual consistency - you get to pick the number of copies of each stored item and the read and write quorums (the number of nodes that must report back for an I/O operation to complete).
- Text search - built in text search with SOLR-like parameters that can be scoped over document fields.
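The quorum tuning in that list is per-request in Riak's HTTP interface: reads can carry an `r` parameter and writes a `w` parameter. Here's a minimal sketch of building such requests; the host and bucket names are made up for illustration:

```python
from urllib.parse import urlencode

def riak_object_url(host, bucket, key, **quorum):
    """Build a Riak HTTP URL for a single object, with optional
    per-request quorum overrides (e.g. r=2 for reads, w=2 for writes)."""
    base = "http://%s/riak/%s/%s" % (host, bucket, key)
    return base + ("?" + urlencode(quorum) if quorum else "")

# Read that must hear back from 2 of the replicas:
print(riak_object_url("localhost:8098", "users", "bob", r=2))
# Write that must be acknowledged by 2 replicas:
print(riak_object_url("localhost:8098", "users", "bob", w=2))
```

Issuing a GET or PUT against those URLs with any HTTP client completes only once the stated number of nodes has reported back.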
I could go on and wax poetic about map-reduce, link walking, and Erlang extensibility, but you get the idea.
The Data Model
In Riak, you can have a bunch of buckets, each of which acts like a scoped namespace. A bucket has multiple objects inside of it, each of which has a key unique to that bucket. In other words, each bucket acts like a big hash table, and most of your work is in manipulating key/value pairs.
At Clipboard, we have buckets for users, clips, HTML blobs, comments, transactions, contacts, likes, statistics, and other things. We’ve designed our application to have minimal write contention, so usually our application code writes an object and moves on. We’ll usually also know the key for an item that we need; for example, users have a GUID in the frontend code which serves as the primary key in the users bucket.
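Conceptually, the bucket/key layout above is just a hash table of hash tables. This toy in-memory model (illustration only; real Riak stores opaque values and replicates them across nodes) shows the access pattern our application code relies on -- the GUID and bucket names here are placeholders:

```python
# bucket name -> {key -> value}
store = {}

def put(bucket, key, value):
    """Write an object under its bucket-scoped key."""
    store.setdefault(bucket, {})[key] = value

def get(bucket, key):
    """Fetch an object by the key we already know."""
    return store[bucket][key]

# Keys are unique per bucket, so different buckets can reuse key spaces:
put("users", "7f3c-guid", {"name": "bob"})                    # GUID as primary key
put("clips", "clip-123", {"owner": "7f3c-guid", "tag": "funny"})
print(get("users", "7f3c-guid"))
```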
However, almost every interesting page on Clipboard has a multitude of clips. In most cases, those clips are retrieved through a search query. Here are some examples:
- "user:bob" - All of bob’s clips
- "user:bob public:true" - bob’s public clips
- "tag:funny public:true" - public clips tagged #funny
- "user:bob tag:funny" - bob’s clips tagged #funny
We also tend to show results ordered by recentness, which has other implications that will come up later.
Clearly, this is a simple model with a trivial way of querying Riak to generate various views. But if you implement all of this in the most natural manner with Riak, your search performance will be horrible for the last two query types. The good news is that there are simple ways to mitigate the underlying issues. In our case, we saw more than a 100x performance improvement for the queries that are typical of Clipboard.
The Underlying Issues
Riak Search is a great piece of engineering and I think it is a testament to the quality of the code that we were able to grok the search core despite having no prior experience with Erlang. On investigation of the performance issues, we discovered that Riak search was optimized for indexing patterns that were unlike our application. In a nutshell, Riak uses term-based partitioning for its search index instead of document-based partitioning. Both methods have their place, but term-based partitioning has some pathological side effects that I want to talk about.
In document-based partitioning, your text index lives with the documents that are indexed. If you split a corpus into two parts that live on two machines, say A and B, then A will have an index for all of its documents (and just its documents) and B will have an index for just its documents. To query over all documents, you have to query both indices then merge the results.
In term-based partitioning, your index is partitioned by the term that you are querying on. So the term "foo" will appear on machine A even if the document in which it appears is stored on machine B.
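The difference comes down to what gets hashed to pick a machine. Here's a toy stand-in for the partitioning decision (real Riak uses a consistent-hashing ring, which is more sophisticated than a simple modulus):

```python
import hashlib

NODES = ["A", "B", "C"]

def node_for(s):
    """Deterministically map a string to a node."""
    h = int(hashlib.sha1(s.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

# Term-based partitioning: the posting list for a term lives on one node,
# regardless of where the documents containing it are stored.
print(node_for("clips/tag/funny"))      # every #funny posting goes here

# Document-based partitioning: a document's index entries live with it.
print(node_for("clips/doc/clip-123"))   # this clip's index lives here
```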
The good thing about term-based partitioning is that if you are querying on a single term, you only have to ask one machine to get your answer. This is, in fact, the primary reason that people use term-based partitioning -- short queries and queries with disjunctions perform relatively well.
But term-based partitioning also has some really bad traits. The first is that an individual index entry for a term can be unbounded in size. Let me restate that a different way. If you have a billion documents with the word "foo" then there is going to be one machine in your cluster with an index entry that has a billion document IDs in it. As a result, your index now has really spiky distributions.
The second major problem with term-based partitioning is that you can’t perform an AND (intersection or inner join) without bringing all of the keys to a single machine. Referring back to the last query example above -- "user:bob tag:funny" -- if bob has only 100 clips, but there are a million clips tagged #funny, then records for all 1M funny clips will have to be transferred and compared to bob’s 100 clips. But if we had used document-partitioning, the AND could have been performed on the local machines, resulting in no more than 100 records being transferred in total.
Put more succinctly, if you query for "x AND y" in a term-partitioned index, your performance will be proportional to the larger of x or y. But in a document-partitioned index, it’s proportional to the smaller (which is often much better).
This brings up the third problem. If you can’t do localized AND operations -- because we have to transfer records to a single machine to do the AND -- then that means that any sort and pagination operations have to wait as well. On document-partitioned indices you can do the sort and pagination in part on the local machines, bounding the amount of I/O again.
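Back-of-the-envelope, here's what the "user:bob tag:funny" join costs under each scheme, using the numbers from the example above:

```python
# bob has 100 clips; one million clips are tagged #funny.
bob, funny = 100, 1_000_000

# Term-partitioned: both full posting lists must meet on one machine,
# so the records transferred are proportional to the larger list.
term_partitioned_transfer = bob + funny

# Document-partitioned: each node intersects locally and ships only its
# matches, which in total can never exceed the smaller list.
doc_partitioned_transfer = min(bob, funny)

print(term_partitioned_transfer, doc_partitioned_transfer)  # 1000100 100
```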
Putting this all together, we have one inescapable conclusion: without employing some tricks, Riak search can’t scale to collections of millions of documents that need to be queried in ways that can return many matches. For example, you couldn’t build Twitter on top of Riak Search. Nor could you index Wikipedia effectively. You could even have a hard time using it for a big email box.
Getting More for Less
We get around all of these issues with two basic hacks. The first requires nothing more than massaging your data. The second uses a code change we made to Riak Search that has been part of the official branch for almost a year.
The first hack takes advantage of a nifty feature of Riak search known as dynamic fields. In Riak, most fields that you search over will probably be declared in a schema file that describes the field’s type (number, date, text, etc.). But Riak also allows you to specify a wild card pattern, like "*_text", to mean that any field with the "_text" suffix should be treated as an indexed text field.
We also noted that the term-based partitioning scheme in the Riak internals hashes a query to a machine using the concatenation of the bucket, the field name, and the term being queried.
We now have all of the necessary setup in place to explain the first trick. Originally, our clip records had a field named "tag" and another named "text" for searching. We still have those fields (and they are used for broader searches over all public clips). However, we also now add the dynamic fields "tag_USER" and "text_USER" where USER is the username (e.g., bob). This means that the query that had looked like "user:bob tag:funny" -- and that choked because of the 1M #funny clips -- now looks like "tag_bob:funny". This second query can be performed on one node and can never return more clips than what bob has.
With this trick, we’ve roughly doubled the size of the index, but we’ve effectively precomputed all feasible inner joins with usernames, yielding an orders-of-magnitude performance gain for very typical queries. It’s as if each user now has their own localized text index of their clips, yielding all sorts of good properties, and doubling the storage cost is a small price to pay, IMHO. We do a variation of this trick over time ranges to make pagination faster as well.
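A sketch of the trick at indexing and query time -- the function names and field layout here are illustrative, not Clipboard's actual code:

```python
def index_fields(clip, owner):
    """Produce the fields to index for a clip, duplicating 'tag' and
    'text' into per-user dynamic fields ('tag_USER', 'text_USER') so a
    user-scoped query touches only that user's postings."""
    fields = {"tag": clip["tag"], "text": clip["text"]}
    fields["tag_" + owner] = clip["tag"]      # matches a "tag_*" dynamic
    fields["text_" + owner] = clip["text"]    # field pattern in the schema
    return fields

def user_scoped_query(owner, tag):
    """Rewrite 'user:OWNER tag:TAG' into a single-term query."""
    return "tag_%s:%s" % (owner, tag)

print(user_scoped_query("bob", "funny"))  # tag_bob:funny
```

Because "tag_bob" hashes as its own field name, the whole query lands on one node, and the posting list is bounded by bob's clip count.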
The second trick is in how sorting is handled. As I mentioned, we tend to show results in time order. We have a time stamp in each clip object, and it would be natural to ask Riak to sort on this key with the appropriate SOLR parameter. However, to sort on a key within a document, a document has to be read from disk.
For the query "user:bob tag:funny" this wouldn’t be that bad. But if we were seeking, for example, the 100 most recent clips, Riak search would naively want to read the documents for every public clip. Things are much worse when it comes to pagination, because Riak doesn't implement sort-then-slice in the correct order (it sorts after it slices), leaving you with the obvious alternative of doing it in a map/reduce job. You could speed this up by using something called search filters and range queries, but it is still a big problem.
Prior to the release of Riak 1.0, we added a new search parameter to the search core, "presort", which takes the arguments "key" or "relevance". We use the first form, "presort=key", which tells the search system to not read documents for sorting purposes and to, instead, base the sort on the primary key. As a result, the search internals do not need to retrieve documents to do a sort. Moreover, we do the sort at the right time (before the slice of a pagination).
This makes a profound difference in performance, as the following example will show. Suppose you do a query that has 100k results but that you only want the first 100 results. In default Riak search, the search core needs to read all 100k documents to send 100 results. But if you can sort on the primary key instead, then you only need to read the 100 documents that end up being returned.
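In practice, using the parameter is just one more query-string entry. A sketch of a paginated, presorted search request (host and index name are placeholders):

```python
from urllib.parse import urlencode

def presorted_query_url(host, query, start=0, rows=100):
    """Build a Riak Search query that passes 'presort=key', so the sort
    happens on primary keys -- before the pagination slice -- without
    reading the underlying documents."""
    params = urlencode({"q": query, "presort": "key",
                        "start": start, "rows": rows, "wt": "json"})
    return "http://%s/solr/clips/select?%s" % (host, params)

print(presorted_query_url("localhost:8098", "tag_bob:funny"))
```

With keys chosen so that primary-key order matches time order, this gives recency-sorted pages while reading only the documents that are actually returned.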
This second sort hack yields between a 10-100x performance boost alone, so it can really make the difference between your product working or failing. A more detailed description of the problems of sorting results with Riak search can be found on the riak-user mailing list from last year.
Putting it all Together
I want to reiterate my point from the opening of this blog post: We love Riak and have been really delighted with its performance, usability, and durability. But I want you to love Riak too. If you approach Riak with an open mind and a bit of creativity, I think you’ll discover it to be a kick-ass data tier. But like anything worth its salt, you’ll need to work at it a bit.
One more great thing about Riak is that its parent company, Basho, is a tremendous partner and great open source citizen. Not only have they open-sourced Riak, but they’ve also been incredibly open and generous in working with us and others. When we came up with the presort hack, they scrutinized and accepted the pull request in a couple of days, making it part of the 1.0 branch. For us, that alone is priceless.