Tuesday, February 22, 2011

Storing Part of Riak Object Value in Memory

One lesson I learned from SQL is that using surrogate (primary) keys has a lot of advantages. I know that not all SQL wisdom holds true for Riak, but with Riak, too, I prefer not to use natural keys.

For “Key Filters”, Basho lists the following natural keys as examples:

basho-20101215
google-20110103
yahoo-20090613

Here, the keys contain two pieces of domain data to enable Key Filters: Company name and date. Key Filters’ main advantage is that they work on the keys only, which – at least for Bitcask – are stored in memory. So querying can be done faster than if the values were to be loaded from disk.
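To make the idea concrete, here is a minimal sketch in plain Python of what filtering on keys alone amounts to. It does not talk to Riak at all; `parse_key` and `keys_in_year` are names I made up for illustration, and the only real data is the three example keys above.

```python
# Toy model of a key-only filter: no Riak client involved, just strings.
KEYS = ["basho-20101215", "google-20110103", "yahoo-20090613"]


def parse_key(key):
    """Split a natural key into (company, date) on the last hyphen."""
    company, date = key.rsplit("-", 1)
    return company, date


def keys_in_year(keys, year):
    """Return the keys whose date part starts with the given year."""
    return [k for k in keys if parse_key(k)[1].startswith(str(year))]


# Only the key strings are inspected; no object value is ever loaded.
matches = keys_in_year(KEYS, 2010)
```

The point is that the whole query is answered from the key strings, which is why it can stay in memory.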

The disadvantage of Riak’s Key Filter approach is that you end up with highly domain-specific keys, which can be hard to reference, especially if you need to change the keys to allow querying new aspects of the data. If you change your existing keys, every reference to those keys needs to be updated too, and that is hard to do atomically in a key-value store like Riak. Even worse, whenever the data encoded in a key changes, you need to update the key and, again, any pointers to it.
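A rough sketch of why this hurts, using a plain Python dict as a stand-in for a bucket. The `items` field and the `rename_key` helper are assumptions of mine, not anything Riak provides.

```python
# A dict standing in for a bucket; "report-1" holds a pointer to another key.
store = {
    "basho-20101215": {"amount": 100},
    "report-1": {"items": ["basho-20101215"]},
}


def rename_key(store, old, new):
    """Move a value to a new key, then patch every object that points to it."""
    store[new] = store.pop(old)      # step 1: one write plus one delete
    for value in store.values():     # step 2: one more write per referrer
        items = value.get("items")
        if items and old in items:
            value["items"] = [new if k == old else k for k in items]


# Hypothetical key change: a document type is appended to the key.
rename_key(store, "basho-20101215", "basho-20101215-invoice")
```

In a real cluster each of those steps is a separate request, so a reader can see the store between step 1 and step 2, when the report still points at a key that no longer exists.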

Riak’s natural keys also demand the use of transform functions, which get more complex as the amount of data stored in the key increases. In the example of filtering data for “3rd of June” (see the bottom of this page), the predicate function “ends_with” is used. If the key is later extended with more data after the date, that query will no longer match anything.
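Here is a small Python emulation of that failure. The `ends_with` function below only mimics the predicate locally, and the keys and the appended `-eu` region suffix are invented for the example.

```python
def ends_with(key, suffix):
    """Local stand-in for an ends_with predicate applied to a key."""
    return key.endswith(suffix)


# Made-up keys in the company-date format.
original = ["basho-20100603", "google-20110103"]

# Hypothetical schema change: a region code is appended to every key.
extended = [k + "-eu" for k in original]

before = [k for k in original if ends_with(k, "0603")]
after = [k for k in extended if ends_with(k, "0603")]

# The query has to be rewritten to split the key first and test only the date token.
fixed = [k for k in extended if k.split("-")[1].endswith("0603")]
```

Here `before` finds the 3rd of June key, `after` comes back empty without any error, and `fixed` only works again because the query now knows about the new key layout.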

Using natural keys the way Riak currently does is a cumbersome way to store part of the object’s value in memory, forced into a single string.

You could of course ease the situation by using Key Filter-friendly natural keys only for objects that act as indexes. But wouldn't it be good to have the advantages of Key Filters and still be able to use surrogate keys?

What if… not only the key, but also part of the object’s value could be stored in memory? Then you could write queries that use only the object’s in-memory part and get good performance. For the REST API, maybe an X-Riak-Memory header could be supported. Its content could be JSON, and the Key Filter could work on this memory data. Enabling such functionality would let the application developer tune memory/disk storage and keep keys stable as the application evolves.
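To show what I have in mind, here is a toy in-process model in Python. Everything in it is hypothetical: Riak has no X-Riak-Memory header, and `ToyStore`, `put` and `filter` are names invented for this sketch. The `memory` dict plays the role of the JSON that would travel in the header, and the `disk` dict stands in for the full value.

```python
import json
import uuid


class ToyStore:
    """In-process stand-in; 'memory' mimics the proposed X-Riak-Memory data."""

    def __init__(self):
        self.memory = {}  # key -> small dict kept in RAM
        self.disk = {}    # key -> full value (pretend this lives on disk)

    def put(self, value, memory_json):
        key = uuid.uuid4().hex  # surrogate key, carries no domain data
        self.memory[key] = json.loads(memory_json)
        self.disk[key] = value
        return key

    def filter(self, predicate):
        """Return keys whose in-memory part satisfies predicate; never reads disk."""
        return [k for k, m in self.memory.items() if predicate(m)]


store = ToyStore()
k1 = store.put("full invoice body", '{"company": "basho", "date": "20101215"}')
k2 = store.put("another body", '{"company": "google", "date": "20110103"}')

# Same kind of query as a key filter, but the key itself stays opaque.
hits = store.filter(lambda m: m["date"].endswith("1215"))
```

Adding a new field to the memory JSON later changes what you can query without touching any key, so references to `k1` and `k2` stay valid.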

I fully understand that such a change would be complex. Riak uses multiple backends, and maybe this idea does not fit all of them. Still, I think having part of the object in memory has advantages that cannot be ignored: Key Filters could be replaced by simply querying the memory part of the value. And maybe secondary indices would become less important? Using memory could potentially enable Riak to do range scans on the data too.

1 comment:

KevBurnsJr said...

I'm pretty sure this is exactly why Riak Search was born, and I'm pretty sure it can be used to address all of the concerns you've laid out here.

http://wiki.basho.com/Riak-Search---Indexing-and-Querying-Riak-KV-Data.html