
© Copyright IBM Corporation 2012 Trademarks


Introducing Riak, Part 1: The language-independent HTTP API

Store and retrieve data using Riak's HTTP interface

Simon Buckle ([email protected]), Independent Consultant, Freelance

Skill Level: Intermediate

Date: 13 Mar 2012

This is Part 1 of a two-part series about Riak, a highly scalable, distributed data store written in Erlang and based on Dynamo, Amazon's high availability key-value store. Learn the basics about Riak and how to store and retrieve items using its HTTP API. Explore how to use its Map/Reduce framework for doing distributed queries, how links allow relationships to be defined between objects, and how to query those relationships using link walking.

Introduction

Typical modern relational databases perform poorly on certain types of applications and struggle to cope with the performance and scalability demands of today's Internet applications. A different approach is needed. In the last few years, a new type of data store, commonly referred to as NoSQL, has become popular as it directly addresses some of the deficiencies of relational databases. Riak is one example of this type of data store.

Riak is not the only NoSQL data store out there. Two other popular data stores are MongoDB and Cassandra. Although similar in many ways, there are also some significant differences. For example, Riak is a distributed system whereas MongoDB is a single system database. Riak has no concept of a master node, making it more resilient to failure. Though also based on Amazon's description of Dynamo, Cassandra omits features such as vector clocks and consistent hashing for organizing its data. Riak's data model is more flexible. In Riak, buckets are created on the fly when they are first accessed; Cassandra's data model is defined in an XML file, so changing it requires rebooting the entire cluster.

Another strength of Riak is that it is written in Erlang. MongoDB and Cassandra are written in what can be referred to as general-purpose languages (C++ and Java, respectively), whereas Erlang was designed from the ground up to support distributed, fault-tolerant applications. As such, it is more suited to developing applications such as NoSQL data stores, which share some characteristics with the applications that Erlang was originally created for.

Map/Reduce jobs can be written only in Erlang or JavaScript. For this article, we have chosen to write the map and reduce functions in JavaScript. While Erlang code may be slightly quicker to execute, JavaScript is accessible to a larger audience. See Resources for links to learn more about Erlang.

Getting started

If you want to try out some of the examples in this article, you need to install Riak (see Resources) and Erlang on your system.

You also need to build a cluster containing three nodes running on your local machine. All data stored in Riak is replicated to a number of nodes in the cluster. A property (n_val) on the bucket the data is stored in determines the number of nodes to replicate to. The default value of this property is three; therefore, we need a cluster with at least three nodes (beyond that you can create as many as you like) for replication to be effective.

After you download the source code, you need to build it. The basic steps are as follows:

1. Unpack the source: $ tar xzvf riak-1.0.1.tar.gz
2. Change directory: $ cd riak-1.0.1
3. Build: $ make all rel

This will build Riak (./rel/riak). To run multiple nodes locally you need to make copies of ./rel/riak, one copy for each additional node. Copy ./rel/riak to ./rel/riak2, ./rel/riak3 and so on, then make the following changes to each copy:

• In riakN/etc/app.config, change the following values to something unique: the port specified in the http{} section, handoff_port, and pb_port

• Open up riakN/etc/vm.args and change the name, again to something unique, for example, -name [email protected]
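The per-node changes above come down to giving every copy its own set of ports. A minimal sketch of one way to derive them, written in JavaScript to match the rest of the article: the helper name nodeSettings, the base port values, and the offset of 100 per node are all assumptions made for illustration. Any values work as long as they are unique and match what you put in each app.config.

```javascript
// Hypothetical helper: derive unique port settings for the Nth local copy of Riak.
// The base ports are assumed starting values; use whatever your first node's
// app.config actually contains.
function nodeSettings(n, base = { http: 8098, handoff_port: 8099, pb_port: 8087 }) {
  if (!Number.isInteger(n) || n < 1) {
    throw new RangeError("n must be a positive integer");
  }
  // An offset of 100 per node keeps the three ports of one node
  // from colliding with those of another.
  const offset = (n - 1) * 100;
  return {
    dir: n === 1 ? "riak" : "riak" + n, // ./rel/riak, ./rel/riak2, ./rel/riak3, ...
    http: base.http + offset,
    handoff_port: base.handoff_port + offset,
    pb_port: base.pb_port + offset,
  };
}

// Settings for a three-node local cluster.
const cluster = [1, 2, 3].map(function (n) { return nodeSettings(n); });
```

The node name in vm.args still has to be edited by hand; it is left out of the sketch on purpose.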

Now start each node in turn, as shown in Listing 1.

Listing 1. Starting each node

$ cd rel
$ ./riak/bin/riak start
$ ./riak2/bin/riak start
$ ./riak3/bin/riak start

Finally, join the nodes together to make a cluster, as shown in Listing 2.


Listing 2. Making a cluster

$ ./riak2/bin/riak-admin join [email protected]
$ ./riak3/bin/riak-admin join [email protected]

You should now have a 3-node cluster running locally. To test it, run the following command: $ ./riak/bin/riak-admin status | grep ring_members

You should see each node that is part of the cluster you just created, for example, ring_members : ['[email protected]','[email protected]','[email protected]'].

The Riak API

There are currently three ways of accessing Riak: an HTTP API (RESTful interface), Protocol Buffers, and a native Erlang interface. Having more than one interface gives you the benefit of being able to choose how to integrate your application. If you have an application written in Erlang, then it would make sense to use the native Erlang interface so you have tight integration between the two. There are also other factors, such as performance, that may play a part in deciding which interface to use. For example, a client that uses the Protocol Buffers interface will perform better than one that interacts with the HTTP API; less data is communicated, and parsing all those HTTP headers can be (relatively) costly in terms of performance. However, the benefits of having an HTTP API are that most developers today, particularly Web developers, are familiar with RESTful interfaces, and most programming languages have built-in primitives for requesting resources over HTTP (for example, opening a URL), so no additional software is needed. In this article, we will focus on the HTTP API.

All the examples will use curl to interact with Riak through its HTTP interface. This is just to get a better understanding of the underlying API. There are a number of client libraries available in various languages, and you should consider using one of those when developing an application that uses Riak as the data store. The client libraries provide an API to Riak that makes it easy to integrate into your application; you won't have to write code yourself to handle the kind of responses you will see when using curl.

The API supports the usual HTTP methods: GET, PUT, POST, and DELETE, which are used for retrieving, updating, creating, and deleting objects, respectively. Each one will be covered in turn.

Storing objects

You can think of Riak as implementing a distributed map from keys (strings) to values (objects). Riak stores values in buckets. There is no need to explicitly create a bucket before storing an object in one; if an object is stored in a bucket that doesn't exist, the bucket will be created automatically.

Buckets are a virtual concept in Riak and exist primarily as a means of grouping related objects. Buckets also have properties, and the values of these properties define what Riak does with the objects that are stored in the bucket. Here are some examples of bucket properties:

• n_val: The number of times an object should be replicated across the cluster
• allow_mult: Whether to allow concurrent updates

You can view a bucket's properties (and their current values) by making a GET request on the bucket itself.
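As a rough illustration of reading those properties from a client, the sketch below builds the bucket URL and picks n_val and allow_mult out of a parsed response body. The function names are hypothetical, and the { "props": { ... } } shape of the body is an assumption about what the server returns; the sample body is hand-written, not captured from a real node.

```javascript
// Build the URL of a bucket, using the /riak/<bucket> path style from this article.
function bucketUrl(host, bucket) {
  return "http://" + host + "/riak/" + encodeURIComponent(bucket);
}

// Pick the two properties discussed above out of a parsed response body.
// Assumption: the properties sit under a top-level "props" object.
function replicationSettings(body) {
  const props = (body && body.props) || {};
  return { n_val: props.n_val, allow_mult: props.allow_mult };
}

// Hand-written sample body, for illustration only.
const sample = JSON.parse('{"props":{"name":"artists","n_val":3,"allow_mult":false}}');
const settings = replicationSettings(sample);
```

A GET on bucketUrl("localhost:8098", "artists") with curl is the equivalent manual check.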

To store an object, we do an HTTP POST to one of the URLs shown in Listing 3.

Listing 3. Storing an object

POST -> /riak/<bucket> (1)
POST -> /riak/<bucket>/<key> (2)

Keys can either be allocated automatically by Riak (1) or defined by the user (2).

When storing an object with a user-defined key, it's also possible to do an HTTP PUT to (2) to create the object.

The latest version of Riak also supports the following URL format: /buckets/<bucket>/keys/<key>, but we will use the older format in this article in order to maintain backwards compatibility with earlier versions of Riak.
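The choice between the two URLs, and between POST and PUT, can be captured in a few lines. This is a sketch only: objectPath and storeRequest are hypothetical names, and the returned object is a plain description of a request rather than a call into any HTTP library.

```javascript
// Path for an object. With no key, Riak allocates one, so only the bucket is named.
// newStyle switches to the /buckets/<bucket>/keys/<key> format mentioned above.
function objectPath(bucket, key, newStyle = false) {
  const b = encodeURIComponent(bucket);
  if (key === undefined) {
    return "/riak/" + b;
  }
  const k = encodeURIComponent(key);
  return newStyle ? "/buckets/" + b + "/keys/" + k : "/riak/" + b + "/" + k;
}

// Describe the request needed to store a value.
// POST lets Riak pick the key; PUT is used here when the caller supplies one.
function storeRequest(bucket, key, body, contentType) {
  return {
    method: key === undefined ? "POST" : "PUT",
    path: objectPath(bucket, key),
    headers: { "Content-Type": contentType },
    body: body,
  };
}

const withoutKey = storeRequest("foo", undefined, "Some text", "text/plain");
const withKey = storeRequest("artists", "Bruce", '{"name":"Bruce"}', "application/json");
```

Keeping the request as data makes it easy to hand to whichever HTTP client your application already uses.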

If no key is specified, Riak will automatically allocate a key for the object. For example, let's store a plain text object in the bucket "foo" without explicitly specifying a key (see Listing 4).

Listing 4. Storing a plain text object without specifying a key

$ curl -i -H "Content-Type: text/plain" -d "Some text" \
http://localhost:8098/riak/foo/

HTTP/1.1 201 Created
Vary: Accept-Encoding
Location: /riak/foo/3vbskqUuCdtLZjX5hx2JHKD2FTK
Content-Type: text/plain
Content-Length: ...

By examining the Location header, you can see the key that Riak allocated to the object. It's not very memorable, so the alternative is to have the user provide a key. Let's create an artists bucket and add an artist who goes by the name of Bruce (see Listing 5).

Listing 5. Creating an artists bucket and adding an artist

$ curl -i -d '{"name":"Bruce"}' -H "Content-Type: application/json" \
http://localhost:8098/riak/artists/Bruce

HTTP/1.1 204 No Content
Vary: Accept-Encoding
Content-Type: application/json
Content-Length: ...




If the object was stored correctly using the key that we specified, we will get a 204 No Content response from the server.

In this example, we are storing the value of the object as JSON, but it could just as easily have been plain text or some other format. When storing an object, it is important that the Content-Type header is set correctly. For example, if you want to store a JPEG image, then you should set the content type to image/jpeg.

Retrieving an object

To retrieve a stored object, do a GET on the bucket using the key of the object you want to retrieve. If the object exists, it will be returned in the body of the response; otherwise, a 404 Object Not Found response will be returned by the server (see Listing 6).

Listing 6. Performing a GET on the bucket

$ curl -i http://localhost:8098/riak/artists/Bruce

HTTP/1.1 200 OK
...
{ "name" : "Bruce" }
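Client code has to tell the two outcomes described above apart. A minimal sketch follows, assuming a hypothetical response shape of { status, headers, body } with lower-cased header names; it is not the interface of any particular HTTP library.

```javascript
// Interpret the result of a GET on /riak/<bucket>/<key>.
// Returns the stored value, or null when the object does not exist.
function readObject(response) {
  if (response.status === 404) {
    return null; // no object under that key
  }
  if (response.status !== 200) {
    throw new Error("Unexpected status " + response.status);
  }
  // Drop any parameters such as "; charset=utf-8" before comparing.
  const type = (response.headers["content-type"] || "").split(";")[0].trim();
  return type === "application/json" ? JSON.parse(response.body) : response.body;
}

const found = readObject({
  status: 200,
  headers: { "content-type": "application/json" },
  body: '{"name":"Bruce"}',
});
const missing = readObject({ status: 404, headers: {}, body: "" });
```

Any other status is treated as an error here, which keeps the sketch short; a real client would likely handle more cases.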

Updating an object

When updating an object, just like when storing one, the Content-Type header is required. For example, let's add Bruce's nickname as shown in Listing 7.

Listing 7. Adding Bruce's nickname

$ curl -i -X PUT -d '{"name":"Bruce", "nickname":"The Boss"}' \
-H "Content-Type: application/json" http://localhost:8098/riak/artists/Bruce

As mentioned earlier, Riak creates buckets automatically, and buckets have properties. One of those properties, allow_mult, determines whether concurrent writes are allowed. By default, it is set to false. If concurrent updates are allowed, the X-Riak-Vclock header should also be sent with each update, set to the value that was seen when the client last read the object.

Riak uses vector clocks to determine the causality of modifications to objects. How vector clocks work is beyond the scope of this article, but suffice it to say that when concurrent writes are allowed, conflicts may occur, and it is left up to the application to resolve them (see Resources).
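The read-then-update pattern described above can be sketched as a small helper that carries the vector clock from the last read into the headers of the next write. The name updateHeaders and the lastRead object are hypothetical, and the vclock string below is a made-up placeholder; clients should treat the real value as opaque and send it back unchanged.

```javascript
// Build the headers for an update.
// lastRead is whatever the client kept from its most recent GET of the object;
// its vclock field (hypothetical name) holds the X-Riak-Vclock value seen then.
function updateHeaders(contentType, lastRead) {
  const headers = { "Content-Type": contentType };
  if (lastRead && lastRead.vclock) {
    headers["X-Riak-Vclock"] = lastRead.vclock;
  }
  return headers;
}

// Placeholder value for illustration only, not a real vector clock.
const lastRead = { vclock: "opaque-vclock-from-last-read" };
const headers = updateHeaders("application/json", lastRead);
```

When nothing has been read yet, the header is simply omitted.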

Removing an object

Removing an object follows a similar pattern to the previous commands: we simply do an HTTP DELETE to the URL that corresponds to the object we want to delete: $ curl -i -X DELETE http://localhost:8098/riak/artists/Bruce




If the object was removed successfully, we will get a 204 No Content response from the server; if the object we are trying to delete does not exist, the server responds with a 404 Object Not Found.

Links

So far, we have seen how to store objects by associating an object with a particular key so it can be retrieved later on. It would be useful to extend this simple model to express how (and if) objects are related to each other. Riak achieves this via links.

So, what are links? Links allow the user to create relationships between objects. If you are familiar with UML class diagrams, you can think of a link as an association between objects with a label describing the relationship; in a relational database, the relationship would be expressed using a foreign key.

Links are "attached" to objects via the "Link" header. Below is an example of what a link header looks like. The target of the relationship, that is, the object we are linking to, is the thing between the angled brackets. The relationship type (in this case "performer") is expressed by the riaktag property: Link: </riak/artists/Bruce>; riaktag="performer".
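To make the header format concrete, here is a small sketch that writes and reads a single link in the form shown above. The function names are hypothetical, and the parser only understands one link with a riaktag parameter; it is not a general Link header parser.

```javascript
// Format one link value: target in angle brackets, then the riaktag parameter.
function formatLink(bucket, key, tag) {
  return "</riak/" + bucket + "/" + key + '>; riaktag="' + tag + '"';
}

// Parse one link value back into its parts; returns null if it does not match.
function parseLink(value) {
  const m = /^<\/riak\/([^/>]+)\/([^>]+)>;\s*riaktag="([^"]*)"$/.exec(value.trim());
  return m ? { bucket: m[1], key: m[2], tag: m[3] } : null;
}

const link = formatLink("artists", "Bruce", "performer");
const parsed = parseLink(link);
```

Note the semicolon between the target and riaktag; the parser above rejects a value without it.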

Let's add some albums and associate them with the artist Bruce who performed on the albums (see Listing 8).

Listing 8. Adding some albums

$ curl -H "Content-Type: text/plain" \
-H 'Link: </riak/artists/Bruce>; riaktag="performer"' \
-d "The River" http://localhost:8098/riak/albums/TheRiver

$ curl -H "Content-Type: text/plain" \
-H 'Link: </riak/artists/Bruce>; riaktag="performer"' \
-d "Born To Run" http://localhost:8098/riak/albums/BornToRun

Now that we have set up some relationships, it's time to query them via link walking, the name given to the process of querying the relationships between objects. For example, to find the artist who performed the album The River, you would do this: $ curl -i http://localhost:8098/riak/albums/TheRiver/artists,performer,1

The bit at the end is the link specification. This is what a link query looks like. The first part (artists) specifies the bucket that we should restrict the query to. The second part (performer) specifies the tag we want to use to limit the results. Finally, the 1 indicates that we want to include the results from this particular phase of the query.

It's also possible to issue transitive queries. Let's assume we have set up the relationships between albums and artists as in Figure 1.




Figure 1. Example relationship between albums and artists

It's now possible to issue queries such as, "Which artists collaborated with the artist who performed The River," by executing the following: $ curl -i http://localhost:8098/riak/albums/TheRiver/artists,_,0/artists,collaborator,1

The underscore in the link specification acts like a wildcard character and indicates that we don't care what the relationship is.
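Link specifications are regular enough to build from data. The sketch below assembles the path for a link-walking query from a list of phases; linkWalkPath and the phase object fields (bucket, tag, keep) are hypothetical names chosen for this example, and an omitted bucket or tag is written as the underscore wildcard.

```javascript
// Build a link-walking path such as
// /riak/albums/TheRiver/artists,performer,1
// Each phase becomes one "bucket,tag,keep" segment.
function linkWalkPath(bucket, key, phases) {
  const specs = phases.map(function (p) {
    const b = p.bucket === undefined ? "_" : p.bucket; // "_" matches anything
    const t = p.tag === undefined ? "_" : p.tag;
    return b + "," + t + "," + (p.keep ? "1" : "0");
  });
  return "/riak/" + bucket + "/" + key + "/" + specs.join("/");
}

// Who performed The River?
const performer = linkWalkPath("albums", "TheRiver", [
  { bucket: "artists", tag: "performer", keep: true },
]);

// Who collaborated with whoever performed The River?
const collaborators = linkWalkPath("albums", "TheRiver", [
  { bucket: "artists", keep: false },
  { bucket: "artists", tag: "collaborator", keep: true },
]);
```

Both results match the URLs used with curl in the text, minus the host and port.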

Running Map/Reduce queries

Map/Reduce is a framework popularized by Google for running distributed computations in parallel over huge datasets. Riak supports Map/Reduce as well, which allows more powerful queries to be performed on the data stored in the cluster.

A Map/Reduce function consists of both a map phase and a reduce phase. The map phase is applied to some data and produces zero or more results; this is equivalent in functional programming terms to mapping a function over each item in a list. The map phases occur in parallel. The reduce phase then takes all of the results from the map phases and combines them together.

For example, consider counting the number of occurrences of each word across a large set of documents. Each map phase would calculate the number of times each word appears in a particular document. These intermediate totals, once calculated, would then be sent to the reduce function, which would tally the totals and emit the result for the whole set of documents. See Resources for a link to Google's Map/Reduce paper.
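The word-count example can be tried out in plain JavaScript without a cluster. The sketch below runs entirely in memory: mapWordCount and reduceWordCount are hypothetical names, the two "documents" are made-up strings, and nothing here uses Riak's own Map/Reduce interface. It only illustrates the shape of the two phases.

```javascript
// Map phase: count the words in a single document.
function mapWordCount(text) {
  const counts = Object.create(null); // no inherited keys to clash with words
  const words = text.toLowerCase().match(/[a-z']+/g) || [];
  for (const w of words) {
    counts[w] = (counts[w] || 0) + 1;
  }
  return counts;
}

// Reduce phase: merge the per-document counts into overall totals.
function reduceWordCount(partials) {
  const totals = Object.create(null);
  for (const partial of partials) {
    for (const word of Object.keys(partial)) {
      totals[word] = (totals[word] || 0) + partial[word];
    }
  }
  return totals;
}

// Two tiny stand-in documents. In a real cluster each map call could run on a different node.
const documents = ["the river runs", "born to run the river"];
const totals = reduceWordCount(documents.map(mapWordCount));
```

Because each map call only looks at its own document, the calls are independent of one another, which is what lets them run in parallel.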

Example: Distributed grep

For this article, we are going to develop a Map/Reduce function that will do a distributed grep over a set of documents stored in Riak. Just like grep, the final output will be a set of lines that match the supplied pattern. In addition, each result will also indicate the line number in the document where the match occurred.

To execute a Map/Reduce query, we do a POST to the /mapred resource. The body of the request is a JSON representation of the query; as in previous cases, the Content-Type header must be present and always be set to application/json. Listing 9 shows the query that we will execute to do the distributed grep. Each part of the query will be discussed in turn.

Listing 9. Example Map/Reduce query

{
  "inputs": [["documents","s1"],["documents","s2"]],
  "query": [
    { "map": {
        "language": "javascript",
        "name": "GrepUtils.map",
        "keep": true,
        "arg": "[s|S]herlock"
    } },
    { "reduce": {
        "language": "javascript",
        "name": "GrepUtils.reduce"
    } }
  ]
}

Each query consists of a number of inputs (that is, the set of documents we want to do some computation on) and the name of a function to run during each of the map and reduce phases. It is also possible to include the source of both the map and reduce functions directly inline in the query by using the source property instead of name, but I have not done that here. To use named functions, however, you will need to make some changes to Riak's default configuration. Save the code in Listing 10 in a directory somewhere. For each node in the cluster, locate the file etc/app.config, open it up, and set the property js_source_dir to the directory where you saved the code. You will need to restart all the nodes in the cluster for the changes to take effect.

The code in Listing 10 contains the functions that will be executed during the map and reduce phases. The map function looks at each line in the document and checks to see if it matches the supplied pattern (the arg parameter). The reduce function in this particular example doesn't do much; it behaves like an identity function and just returns its input.

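Listing 10. GrepUtils.js

var GrepUtils = {
  // Map phase: return every line of the document that matches the pattern
  // in arg, prefixed with its 1-based line number.
  map: function (v, k, arg) {
    var i, len, lines, r = [], re = new RegExp(arg);
    lines = v.values[0].data.split(/\r?\n/);
    for (i = 0, len = lines.length; i < len; i += 1) {
      var match = re.exec(lines[i]);
      if (match) {
        r.push((i+1) + ". " + lines[i]);
      }
    }
    return r;
  },
  // Reduce phase: wrap the collected map results in a list and pass them on.
  reduce: function (v) {
    return [v];
  }
};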
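Before deploying a named function to every node, it can help to run its logic locally. The sketch below repeats the map logic from Listing 10 as a standalone function, grepMap (a hypothetical name), and feeds it a hand-built object with the v.values[0].data shape that the listing relies on. The three-line sample text is invented for the test.

```javascript
// Same logic as the map function in Listing 10, as a standalone function.
function grepMap(v, k, arg) {
  const re = new RegExp(arg);
  const lines = v.values[0].data.split(/\r?\n/);
  const r = [];
  for (let i = 0; i < lines.length; i += 1) {
    if (re.test(lines[i])) {
      r.push((i + 1) + ". " + lines[i]); // 1-based line number, then the line
    }
  }
  return r;
}

// Hand-built stand-in for the object passed to a map function.
const fakeValue = {
  values: [{ data: "Sherlock Holmes\nDr. Watson\nMr. sherlock" }],
};

const matches = grepMap(fakeValue, null, "[s|S]herlock");
// matches is ["1. Sherlock Holmes", "3. Mr. sherlock"]
```

The second argument is unused by this function, so null is passed in its place.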

Before we can run the query, we need some data. I downloaded a couple of Sherlock Holmes e-books from the Project Gutenberg Web site (see Resources). The first text is stored in the "documents" bucket under the key "s1"; the second text is in the same bucket with the key "s2".

Listing 11 is an example of how you would load such a document into Riak.

Listing 11. Loading a document into Riak

$ curl -i -X POST http://localhost:8098/riak/documents/s1 \
-H "Content-Type: text/plain" --data-binary @s1.txt

Once the documents have been loaded, we can then search them. In this case, we want to output any lines that match the regular expression "[s|S]herlock" (see Listing 12).

Listing 12. Searching the documents

$ curl -X POST -H "Content-Type: application/json" \
http://localhost:8098/mapred --data @-<<\EOF
{
  "inputs": [["documents","s1"],["documents","s2"]],
  "query": [
    { "map": {
        "language": "javascript",
        "name": "GrepUtils.map",
        "keep": true,
        "arg": "[s|S]herlock"
    } },
    { "reduce": {
        "language": "javascript",
        "name": "GrepUtils.reduce"
    } }
  ]
}
EOF

The arg property in the query contains the pattern that we want to grep for in the documents; this value is passed in to the map function as the arg parameter.
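When the pattern or the list of documents varies, the query body can be generated rather than typed by hand. A minimal sketch, assuming the same named functions as the query in the text; grepQuery is a hypothetical helper and its includeReduce flag is an addition for queries that only need the map phase.

```javascript
// Build the JSON body for the distributed grep query.
// inputs is a list of [bucket, key] pairs; pattern ends up in the map phase's arg.
function grepQuery(inputs, pattern, includeReduce = true) {
  const phases = [
    { map: { language: "javascript", name: "GrepUtils.map", keep: true, arg: pattern } },
  ];
  if (includeReduce) {
    phases.push({ reduce: { language: "javascript", name: "GrepUtils.reduce" } });
  }
  return { inputs: inputs, query: phases };
}

// Serialized body, ready to POST to /mapred with Content-Type: application/json.
const body = JSON.stringify(
  grepQuery([["documents", "s1"], ["documents", "s2"]], "[s|S]herlock")
);
```

Serializing with JSON.stringify also takes care of escaping any quotes or backslashes in the pattern.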

The output from running the Map/Reduce job over the sample data is in Listing 13.

Listing 13. Sample output from running the Map/Reduce job

[["1. Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle",
"9. Title: The Adventures of Sherlock Holmes",
"62. To Sherlock Holmes she is always THE woman. I have seldom heard",
"819. as I had pictured it from Sherlock Holmes' succinct description,",
"1017. \"Good-night, Mister Sherlock Holmes.\"",
"1034. \"You have really got it!\" he cried, grasping Sherlock Holmes by" …]]

Streaming Map/Reduce

To finish off this section on Map/Reduce, we'll take a brief look at Riak's streaming Map/Reduce feature. It's useful for jobs that have map phases that take a while to complete, since streaming the results allows you to access the results of each map phase as soon as they become available, and before the reduce phase has executed.

We can apply this to good effect to the distributed grep query. The reduce step in the example doesn't actually do much. In fact, we can get rid of the reduce phase altogether and just emit the results from each map phase directly to the client. To achieve this, we need to modify the query by removing the reduce step and adding ?chunked=true to the end of the URL to indicate that we want to stream the results (see Listing 14).

Listing 14. Modifying the query to stream the results

$ curl -X POST -H "Content-Type: application/json" \
http://localhost:8098/mapred?chunked=true --data @-<<\EOF
{
  "inputs": [["documents","s1"],["documents","s2"]],
  "query": [
    { "map": {
        "language": "javascript",
        "name": "GrepUtils.map",
        "keep": true,
        "arg": "[s|S]herlock"
    } }
  ]
}
EOF

The results of each map phase (in this example, lines that match the query string) will now be returned to the client as each map phase completes. This approach would be useful for applications that need to process the intermediary results of a query when they become available.

Conclusion

Riak is an open source, highly scalable key-value store based on principles from Amazon's Dynamo paper. It's easy to deploy and to scale. Additional nodes can be added to the cluster seamlessly. Features such as link walking and support for Map/Reduce allow for queries that are more complex. In addition to the HTTP API, there is also a native Erlang API and support for Protocol Buffers. In Part 2 of this series, we'll explore a number of client libraries available in various languages and show how Riak can be used as a highly scalable cache.




Resources

Learn

• See Basic Cluster Setup and Building a Development Environment for more detailed information on setting up a 3-node cluster.
• Read Google's MapReduce: Simplified Data Processing on Large Clusters.
• Introduction to programming in Erlang (Martin Brown, developerWorks, May 2011) explains how Erlang's functional programming style compares with other programming paradigms such as imperative, procedural, and object-oriented programming.
• Read Amazon's Dynamo paper on which Riak is based. Highly recommended!
• See the article How To Analyze Apache Logs to learn how you can use Riak to process your server logs.
• Get an explanation of vector clocks and why they are easier to understand than you may think.
• Find a good explanation of vector clocks and more detailed information on link walking on the Riak wiki.
• The Project Gutenberg site is a great resource if you need some text resources for experimenting.
• Find extensive how-to information, tools, and project updates to help you develop with open source technologies and use them with IBM products under developerWorks Open source.
• developerWorks Web development specializes in articles covering various web-based solutions.
• To listen to interesting interviews and discussions for software developers, check out developerWorks podcasts.
• Follow developerWorks on Twitter.
• Watch developerWorks demos that range from product installation and setup for beginners to advanced functionality for experienced developers.

Get products and technologies

• Download Riak from basho.com.
• Download the Erlang programming language.
• Innovate your next open source development project using software especially for developers; access IBM trial software, available for download or on DVD.

Discuss

• Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis. Help build the Real world open source group in the developerWorks community.

http://wiki.basho.com/Building-a-Development-Environment.html

http://research.google.com/archive/mapreduce.html

http://www.ibm.com/developerworks/opensource/library/os-erlang1/index.html

http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

http://www.simonbuckle.com/2011/08/27/analyzing-apache-logs-with-riak/

http://basho.com/blog/technical/2010/01/29/why-vector-clocks-are-easy/

http://wiki.basho.com/Vector-Clocks.html

http://wiki.basho.com/Links-and-Link-Walking.html


http://www.gutenberg.org/

http://www.ibm.com/developerworks/opensource/

http://www.ibm.com/developerworks/web/

https://www.ibm.com/developerworks/mydeveloperworks/groups/service/html/communityview?communityUuid=70786d1c-a2d4-4de8-a807-fccfa600bc77

http://twitter.com/#!/developerworks/

http://www.ibm.com/developerworks/demos/

http://basho.com/resources/downloads/

http://www.erlang.org/download.html

http://www.ibm.com/developerworks/downloads/

http://www.ibm.com/developerworks/community

https://www.ibm.com/developerworks/mydeveloperworks/groups/service/html/communityview?communityUuid=6e6f6d1b-95c3-46df-8a26-b7efd8ee4b57




About the author

Simon Buckle

Simon Buckle is an independent consultant. His interests include distributed systems, algorithms, and concurrency. He has a Masters Degree in Computing from Imperial College, London. Check out his website at simonbuckle.com.

© Copyright IBM Corporation 2012 (www.ibm.com/legal/copytrade.shtml)
Trademarks (www.ibm.com/developerworks/ibm/trademarks/)

http://simonbuckle.com
