10/17/12 Hadoop in Practice | Javalobby
java.dzone.com/articles/hadoop-practice
Hadoop in Practice | 02.28.2012 | 12,059 views
Hadoop in Practice
By Alex Holmes
Working with simple data formats such as log files is straightforward and supported in MapReduce. In this article, based on Chapter 3 of Hadoop in Practice, author Alex Holmes shows you how to work with ubiquitous data serialization formats such as XML and JSON.
Processing Common Serialization Formats
XML and JSON are industry-standard data interchange formats. Their ubiquity is evidenced by their heavy adoption for data storage and exchange. XML has existed since 1998 as a mechanism to represent data that is readable by machines and humans alike. It became a universal language for data exchange between systems and is employed by many standards today, such as SOAP and RSS, and used as an open data format for products such as Microsoft Office.
Technique 1: MapReduce and XML
Our goal is to be able to use XML as a data source for a MapReduce job. We're going to assume that the XML documents that need to be processed are large and, as a result, we want to be able to process them in parallel with multiple mappers working on the same input file.
Problem
Working on a single XML file in parallel in MapReduce is tricky because XML does not contain a synchronization marker in its data format. Therefore, how do we work with a file format that's not inherently splittable, like XML?
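To make the problem concrete, here is a small plain-Java sketch (not Hadoop code; the class and helper names are our own) of the only real option when a split boundary lands mid-element: scan forward from the arbitrary offset until a known start tag reappears, because nothing in the byte stream itself marks a record boundary.

```java
// Plain-Java illustration of resynchronizing to a record boundary.
// A split offset can land anywhere, so a reader must scan forward to
// the next "<property>" start tag before it can emit whole records.
public class XmlResync {
    // Returns the index of the first record start at or after 'offset',
    // or -1 if no further record begins in the data.
    public static int findNextRecordStart(String xml, int offset, String startTag) {
        return xml.indexOf(startTag, offset);
    }

    public static void main(String[] args) {
        String xml = "<configuration>"
                + "<property><name>a</name><value>1</value></property>"
                + "<property><name>b</name><value>2</value></property>"
                + "</configuration>";
        // Pretend a split boundary landed at byte 40, in the middle of
        // the first <property> element: skip ahead to the second one.
        int next = findNextRecordStart(xml, 40, "<property>");
        System.out.println("next record starts at index " + next); // index 66
    }
}
```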
Solution
MapReduce doesn't contain built-in support for XML, so we have to turn to another Apache project, Mahout, a machine learning system, which provides an XML InputFormat. To showcase the XML InputFormat, let's write a MapReduce job that uses Mahout's XML InputFormat to read property names and values from Hadoop's configuration files. Our first step is to set up our job configuration.
conf.set("xmlinput.start", "<property>");        #1
conf.set("xmlinput.end", "</property>");         #2
job.setInputFormatClass(XmlInputFormat.class);   #3

#1 Defines the string form of the XML start tag. Our job is to take Hadoop config files as input, where each configuration entry uses the "property" tag.
#2 Defines the string form of the XML end tag.
#3 Sets the Mahout XML input format class.

It quickly becomes apparent by looking at the code that Mahout's XML InputFormat is rudimentary; you need to tell it an exact sequence of start and end XML tags that will be searched in the file. Looking at the source of the InputFormat confirms this:

private boolean next(LongWritable key, Text value)
    throws IOException {
  if (fsin.getPos() < end && readUntilMatch(startTag, false)) {
    try {
      buffer.write(startTag);
      if (readUntilMatch(endTag, true)) {
        key.set(fsin.getPos());
        value.set(buffer.getData(), 0, buffer.getLength());
        return true;
      }
    } finally {
      buffer.reset();
    }
  }
  return false;
}

Next, we need to write a Mapper to consume Mahout's XML input format. We're being supplied the XML element in Text form, so we'll need to use an XML parser to extract content from the XML.
public static class Map extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Mapper.Context context)
      throws IOException, InterruptedException {
    String document = value.toString();
    System.out.println("'" + document + "'");
    try {
      XMLStreamReader reader = XMLInputFactory.newInstance()
          .createXMLStreamReader(new ByteArrayInputStream(document.getBytes()));
      String propertyName = "";
      String propertyValue = "";
      String currentElement = "";
      while (reader.hasNext()) {
        int code = reader.next();
        switch (code) {
          case START_ELEMENT:
            currentElement = reader.getLocalName();
            break;
          case CHARACTERS:
            if (currentElement.equalsIgnoreCase("name")) {
              propertyName += reader.getText();
            } else if (currentElement.equalsIgnoreCase("value")) {
              propertyValue += reader.getText();
            }
            break;
        }
      }
      reader.close();
      context.write(new Text(propertyName.trim()), new Text(propertyValue.trim()));
    } catch (Exception e) {
      log.error("Error processing '" + document + "'", e);
    }
  }
}

Our Map is given a Text instance, which contains a String representation of the data between the start and end tags. In our code we're using Java's built-in Streaming API for XML (StAX) parser to extract the key and value for each property and output them. If we run our MapReduce job against Cloudera's core-site.xml and cat the output, we'll see the output below.

$ hadoop fs -put $HADOOP_HOME/conf/core-site.xml core-site.xml

$ bin/run.sh com.manning.hip.ch3.xml.HadoopPropertyXMLMapReduce \
  core-site.xml output

$ hadoop fs -cat output/part*
fs.default.name hdfs://localhost:8020
hadoop.tmp.dir /var/lib/hadoop-0.20/cache/${user.name}
hadoop.proxyuser.oozie.hosts *
hadoop.proxyuser.oozie.groups *

This output shows that we have successfully worked with XML as an input serialization format with MapReduce! Not only that, we can support huge XML files, since the InputFormat supports splitting XML.

WRITING XML
Having successfully read XML, the next question would be how do we write XML? In our Reducer, we have callbacks that occur before and after our main reduce method is called, which we can use to emit a start and end tag.
public static class Reduce extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void setup(Context context)
      throws IOException, InterruptedException {
    context.write(new Text("<configuration>"), null);    #1
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    context.write(new Text("</configuration>"), null);   #2
  }

  private Text outputKey = new Text();

  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      outputKey.set(constructPropertyXml(key, value));   #3
      context.write(outputKey, null);                    #4
    }
  }

  public static String constructPropertyXml(Text name, Text value) {
    StringBuilder sb = new StringBuilder();
    sb.append("<property><name>").append(name)
      .append("</name><value>").append(value)
      .append("</value></property>");
    return sb.toString();
  }
}

#1 Uses the setup method to write the root element start tag.
#2 Uses the cleanup method to write the root element end tag.
#3 Constructs a child XML element for each key/value combination we get in the Reducer.
#4 Emits the XML element.

This could also be embedded in an OutputFormat.

PIG
If you want to work with XML in Pig, the Piggybank library (a user-contributed library of useful Pig code) contains an XMLLoader. It works in a similar way to our technique and captures all of the content between a start and end tag and supplies it as a single bytearray field in a Pig tuple.

HIVE
Currently, there doesn't seem to be a way to work with XML in Hive. You would have to write a custom SerDe[1].

Discussion
Mahout's XML InputFormat certainly helps you work with XML. However, it's very sensitive to an exact string match of both the start and end element names. If the element tag can contain attributes with variable values, or the generation of the element can't be controlled and could result in XML namespace qualifiers being used, then this approach may not work for you. Also problematic will be situations where the element name you specify is used as a descendant child element.
If you have control over how the XML is laid out in the input, this exercise can be simplified by having a single XML element per line. This will let you use the built-in MapReduce text-based InputFormats (such as TextInputFormat), which treat each line as a record and split accordingly to preserve that demarcation.
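With one element per line, the mapper's job reduces to parsing a single self-contained string. As a rough sketch (the class name, regex, and helper are ours, not the book's code, and the Hadoop wiring is omitted), this is the per-line extraction such a mapper could perform on each Text record TextInputFormat hands it:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the per-line parse a mapper could run when each input line
// holds one complete <property> element. With TextInputFormat, the
// framework delivers exactly one such line per map() call.
public class LineXmlParse {
    private static final Pattern PROPERTY = Pattern.compile(
            "<name>(.*?)</name>\\s*<value>(.*?)</value>");

    // Returns {name, value}, or null if the line isn't a property element.
    public static String[] parseProperty(String line) {
        Matcher m = PROPERTY.matcher(line);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String line = "<property><name>fs.default.name</name>"
                + "<value>hdfs://localhost:8020</value></property>";
        String[] kv = parseProperty(line);
        System.out.println(kv[0] + "\t" + kv[1]);
    }
}
```

A regex is only safe here because the layout is controlled; for arbitrary XML you would still want a real parser such as StAX, as in the Mapper shown earlier.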
Another option worth considering is a preprocessing step, where you could convert the original XML into a separate line per XML element, or convert it into an altogether different data format, such as a SequenceFile or Avro, both of which solve the splitting problem for you.
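The first half of that preprocessing idea can be sketched in a few lines of plain Java (the class and method names are ours; a production pass would stream the file rather than hold it in memory, or write a SequenceFile/Avro container instead):

```java
// Sketch of a preprocessing pass that rewrites pretty-printed XML so
// each <property> element sits on its own line, ready for the
// line-oriented InputFormats such as TextInputFormat.
public class XmlToLines {
    public static String onePropertyPerLine(String xml) {
        // Drop newlines and surrounding indentation inside elements,
        // then start a fresh line at every <property> start tag.
        String flat = xml.replaceAll("\\s*\\n\\s*", "");
        return flat.replace("<property>", "\n<property>").trim();
    }

    public static void main(String[] args) {
        String pretty = "<property>\n  <name>a</name>\n  <value>1</value>\n</property>\n"
                + "<property>\n  <name>b</name>\n  <value>2</value>\n</property>";
        System.out.println(onePropertyPerLine(pretty));
    }
}
```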
There’s a streaming class StreamXmlRecordReader to allow you to work with XML in your streamingcode.
We have a handle on how to work with XML, so let's move on to tackle another popular serialization format, JSON. JSON shares the machine- and human-readable traits of XML and has existed since the early 2000s. It is less verbose than XML and doesn't have the rich typing and validation features available in XML.
Technique 2: MapReduce and JSON
Our technique covers how you can work with JSON in MapReduce. We'll also cover a method by which a JSON file can be partitioned for concurrent reads.
Problem
Figure 1 shows us the problem with using JSON in MapReduce. If you are working with large JSON files, you need to be able to split them. But, given a random offset in a file, how do we determine the start of the next JSON element, especially when working with JSON that has multiple hierarchies, such as in the example below?
Figure 1 Example of issue with JSON and multiple input splits
Solution
JSON is harder to partition into distinct segments than a format such as XML because JSON doesn't have a token (like an end tag in XML) to denote the start or end of a record.
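A brace-depth scan makes this concrete: a '}' only marks a record boundary when nesting depth returns to zero, so a reader dropped at a random offset cannot resynchronize the way our tag-scanning XML reader could. This plain-Java sketch (our own helper, not ElephantBird code; it ignores braces inside quoted strings) finds where the first top-level object ends:

```java
// Sketch: find the end of the first top-level JSON object by tracking
// brace depth. Unlike XML's distinctive end tag, '}' characters nest,
// so only a return to depth zero marks a record boundary.
// (Simplified: braces inside quoted string values are not handled.)
public class JsonBoundary {
    // Returns the index just past the first complete top-level object,
    // or -1 if the text doesn't close one.
    public static int endOfFirstObject(String json) {
        int depth = 0;
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (c == '{') depth++;
            else if (c == '}' && --depth == 0) return i + 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        String two = "{\"a\":{\"b\":1}}{\"a\":2}";
        int end = endOfFirstObject(two);
        // Prints the first record only: the inner '}' did not end it.
        System.out.println("first record: " + two.substring(0, end));
    }
}
```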
ElephantBird[2], an open-source project that contains some useful utilities for working with LZO compression, has an LzoJsonInputFormat, which can read JSON, but it requires that the input file be LZOP compressed. We'll use this code as a template for our own JSON InputFormat, which doesn't have the LZOP compression requirement.
We're cheating with our solution because we're assuming that each JSON record is on a separate line. Our JsonRecordFormat is simple and does nothing other than construct and return a JsonRecordReader, so we'll skip over that code. The JsonRecordReader emits LongWritable, MapWritable key/value pairs to the Mapper, where the MapWritable is a map of JSON element names and their values. Let's take a look at how this RecordReader works. It leverages the LineRecordReader, which is a built-in MapReduce reader that emits a record for each line. To convert the line to a MapWritable, it uses the following method.

public static boolean decodeLineToJson(JSONParser parser, Text line,
                                       MapWritable value) {
  try {
    JSONObject jsonObj = (JSONObject) parser.parse(line.toString());
    for (Object key : jsonObj.keySet()) {
      Text mapKey = new Text(key.toString());
      Text mapValue = new Text();
      if (jsonObj.get(key) != null) {
        mapValue.set(jsonObj.get(key).toString());
      }
      value.put(mapKey, mapValue);
    }
    return true;
  } catch (ParseException e) {
    LOG.warn("Could not json-decode string: " + line, e);
    return false;
  } catch (NumberFormatException e) {
    LOG.warn("Could not parse field into number: " + line, e);
    return false;
  }
}

It uses the json-simple[3] parser to parse the line into a JSON object and then iterates over the keys, putting the keys and values into a MapWritable. The Mapper is given the JSON data in LongWritable, MapWritable pairs and can process the data accordingly. The code for the MapReduce job is very basic. We're going to demonstrate the code using the JSON below.

{
  "results" : [
    {
      "created_at" : "Thu, 29 Dec 2011 21:46:01 +0000",
      "from_user" : "grep_alex",
      "text" : "RT @kevinweil: After a lot of hard work by ..."
    },
    {
      "created_at" : "Mon, 26 Dec 2011 21:18:37 +0000",
      "from_user" : "grep_alex",
      "text" : "@miguno pull request has been merged, thanks again!"
    }
  ]
}

Since our technique assumes a JSON object per line, the actual JSON file we'll work with is shown below.
{"created_at" : "Thu, 29 Dec 2011 21:46:01 +0000","from_user" :...
{"created_at" : "Mon, 26 Dec 2011 21:18:37 +0000","from_user" :...

We'll copy the JSON file into HDFS and run our MapReduce code. Our MapReduce code simply writes each JSON key/value to the output.

$ hadoop fs -put test-data/ch3/singleline-tweets.json \
  singleline-tweets.json

$ bin/run.sh com.manning.hip.ch3.json.JsonMapReduce \
  singleline-tweets.json output

$ hadoop fs -cat output/part*
text RT @kevinweil: After a lot of hard work by ...
from_user grep_alex
created_at Thu, 29 Dec 2011 21:46:01 +0000
text @miguno pull request has been merged, thanks again!
from_user grep_alex
created_at Mon, 26 Dec 2011 21:18:37 +0000

WRITING JSON
An approach similar to what we looked at for writing XML could also be used to write JSON.

PIG
ElephantBird contains a JsonLoader and an LzoJsonLoader, which can be used to work with JSON in Pig. They also work with JSON that is line based. Each Pig tuple contains a field for each JSON element in the line as a chararray.

HIVE
Hive contains a DelimitedJSONSerDe, which can serialize JSON but, unfortunately, not deserialize it, so you can't load data into Hive using this SerDe.

Discussion
Our solution works with the assumption that the JSON input is structured with a line per JSON object. How would we work with JSON objects that span multiple lines? The authors have an experimental project on GitHub[4], which works with multiple input splits over a single JSON file. The key to this approach is searching for a specific JSON member and retrieving the containing object. There's also a Google Code project called hive-json-serde[5], which can support both serialization and deserialization.

Summary
As you can see, using XML and JSON in MapReduce is kludgy and has rigid requirements about how your data is laid out. Supporting them in MapReduce is complex and error prone, as they don't naturally lend themselves to splitting. Alternative file formats, such as Avro and SequenceFiles, have built-in support for splittability.
If you would like to purchase Hadoop in Practice, DZone members can receive a 38% discount by entering the promotional code dzone38 during checkout at Manning.com.
[1] SerDe is a shortened form of Serializer/Deserializer, the mechanism that allows Hive to read and write data in HDFS.
[2] https://github.com/kevinweil/elephant-bird
[3] http://code.google.com/p/json-simple/
[4] A multiline JSON InputFormat: https://github.com/alexholmes/json-mapreduce
[5] http://code.google.com/p/hive-json-serde/
Source: http://www.manning.com/holmes/
Tags: Apache big data Hadoop Open Source