Right, I promised our awesome marketing lead Lucy that I would write a post every day for a month, so here is the first one!
Right at the start of my Spark journey, somewhere between versions 0.4 and 0.7, everything felt disconnected and unpackaged from one another. All you could really use were RDDs; SQL support lived apart from the core and MLlib was in tiny pieces. The Spark we know today was just a bunch of bits back then.
However, Tachyon was an ever-popular, fairly enticing idea. With most Spark applications you had to pull all of your teeth out and kiss goodbye to your weekends, because deployment was full of mishaps and took ages (and the software was buggy).
Fast forward some 3-4 years: Tachyon has become Alluxio and gained a bunch of new features, and we'll cover one of them in this post. Ordinarily you can only store files in the cache, but an experimental feature of Alluxio lets you store keys and values for much faster lookups, using cluster memory with durability.
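Before any of the code below will work, the key-value service has to be switched on, since it is experimental and disabled by default. A minimal sketch of the setting, as I remember it from the Alluxio 1.x docs (double-check the property name against your release):

# alluxio-site.properties
# enable the experimental key-value store service
alluxio.keyvalue.enabled=true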
To create a store, you use something like the following code.
// grab a client for the key-value system, then create a new store
KeyValueSystem kvs = KeyValueSystem.Factory.create();
KeyValueStoreWriter writer = kvs.createStore(new AlluxioURI("alluxio://path/my-kvstore"));
// the API works on raw bytes, so Strings need encoding
writer.put("1".getBytes(), "sorry I missed your party Lucy".getBytes());
writer.put("2".getBytes(), "going to write a post a day as an apology".getBytes());
writer.close();
As you can imagine, this is a nice way of sharing state across a Spark cluster. Any process can then read this back in.
KeyValueStoreReader reader = kvs.openStore(new AlluxioURI("alluxio://path/my-kvstore"));
// values come back as bytes; a missing key returns null
byte[] apologyToLucy = reader.get("1".getBytes());
reader.close();
Very simple. You can also get an iterator and read back every key and value if you want; there's a sketch of that just below. The store sits on top of a backing under store, and with a little configuration that can be Azure Blob Storage through the wasb protocol, as shown after the iterator example.
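Here is a minimal sketch of reading everything back. It assumes the reader hands out an iterator of key-value pairs whose keys and values are ByteBuffers, which is how I remember the 1.x client API; the Javadoc for your version is the final word.

import java.nio.charset.StandardCharsets;

// walk every pair in the store and print it
KeyValueIterator it = reader.iterator();
while (it.hasNext()) {
    KeyValuePair pair = it.next();
    // keys and values are ByteBuffers, so decode them back to Strings
    String key = StandardCharsets.UTF_8.decode(pair.getKey()).toString();
    String value = StandardCharsets.UTF_8.decode(pair.getValue()).toString();
    System.out.println(key + " -> " + value);
}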
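And for the Azure piece, the usual recipe (again from the 1.x docs, so verify the property names for your release) is to point the under storage address at a wasb URI and hand over the account key. The container, account, and key below are placeholders.

# alluxio-site.properties (placeholder container, account, and key)
alluxio.underfs.address=wasb://MY_CONTAINER@MY_ACCOUNT.blob.core.windows.net/
fs.azure.account.key.MY_ACCOUNT.blob.core.windows.net=MY_ACCESS_KEY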
Happy trails!