10/17/12 Hadoop in Practice | Javalobby
java.dzone.com/articles/hadoop-practice
Hadoop in Practice | 02.28.2012 | 12,059 views
Hadoop in Practice
By Alex Holmes
Working with simple data formats such as log files is straightforward and supported in MapReduce. In this article, based on Chapter 3 of Hadoop in Practice, author Alex Holmes shows you how to work with ubiquitous data serialization formats such as XML and JSON.
Processing Common Serialization Formats
XML and JSON are industry-standard data interchange formats. Their ubiquity is evidenced by their heavy adoption for data storage and exchange. XML has existed since 1998 as a mechanism to represent data that is readable by machines and humans alike. It became a universal language for data exchange between systems and is employed by many standards today, such as SOAP and RSS, and used as an open data format for products such as Microsoft Office.
Technique 1: MapReduce and XML
Our goal is to be able to use XML as a data source for a MapReduce job. We're going to assume that the XML documents that need to be processed are large and, as a result, we want to be able to process them in parallel with multiple mappers working on the same input file.
Problem
Working on a single XML file in parallel in MapReduce is tricky because XML does not contain a synchronization marker in its data format. Therefore, how do we work with a file format that's not inherently splittable, like XML?
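To make the problem concrete, here is a small plain-Java sketch (not Hadoop code; the class and helper names are our own) of the only real option when a split boundary lands mid-element: scan forward from the arbitrary offset until a known start tag reappears, because nothing in the byte stream itself marks a record boundary.

```java
// Plain-Java illustration of resynchronizing to a record boundary.
// A split offset can land anywhere, so a reader must scan forward to
// the next "<property>" start tag before it can emit whole records.
public class XmlResync {
    // Returns the index of the first record start at or after 'offset',
    // or -1 if no further record begins in the data.
    public static int findNextRecordStart(String xml, int offset, String startTag) {
        return xml.indexOf(startTag, offset);
    }

    public static void main(String[] args) {
        String xml = "<configuration>"
                + "<property><name>a</name><value>1</value></property>"
                + "<property><name>b</name><value>2</value></property>"
                + "</configuration>";
        // Pretend a split boundary landed at byte 40, in the middle of
        // the first <property> element: skip ahead to the second one.
        int next = findNextRecordStart(xml, 40, "<property>");
        System.out.println("next record starts at index " + next); // index 66
    }
}
```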
Solution
MapReduce doesn't contain built-in support for XML, so we have to turn to another Apache project, Mahout, a machine learning system, which provides an XML InputFormat. To showcase the XML InputFormat, let's write a MapReduce job that uses Mahout's XML InputFormat to read property names and values from Hadoop's configuration files. Our first step is to set up our job configuration.
conf.set("xmlinput.start", "<property>");        #1
conf.set("xmlinput.end", "</property>");         #2
job.setInputFormatClass(XmlInputFormat.class);   #3

#1 Defines the string form of the XML start tag. Our job is to take Hadoop config files as input, where each configuration entry uses the "property" tag.
#2 Defines the string form of the XML end tag.
#3 Sets the Mahout XML input format class.

It quickly becomes apparent by looking at the code that Mahout's XML InputFormat is rudimentary; you need to tell it an exact sequence of start and end XML tags that will be searched in the file. Looking at the source of the InputFormat confirms this:

private boolean next(LongWritable key, Text value)
    throws IOException {
  if (fsin.getPos() < end && readUntilMatch(startTag, false)) {
    try {
      buffer.write(startTag);
      if (readUntilMatch(endTag, true)) {
        key.set(fsin.getPos());
        value.set(buffer.getData(), 0, buffer.getLength());
        return true;
      }
    } finally {
      buffer.reset();
    }
  }
  return false;
}

Next, we need to write a Mapper to consume Mahout's XML input format. We're being supplied the XML element in Text form, so we'll need to use an XML parser to extract content from the XML.
public static class Map extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Mapper.Context context)
      throws IOException, InterruptedException {
    String document = value.toString();
    System.out.println("'" + document + "'");
    try {
      XMLStreamReader reader = XMLInputFactory.newInstance()
          .createXMLStreamReader(new ByteArrayInputStream(document.getBytes()));
      String propertyName = "";
      String propertyValue = "";
      String currentElement = "";
      while (reader.hasNext()) {
        int code = reader.next();
        switch (code) {
          case START_ELEMENT:
            currentElement = reader.getLocalName();
            break;
          case CHARACTERS:
            if (currentElement.equalsIgnoreCase("name")) {
              propertyName += reader.getText();
            } else if (currentElement.equalsIgnoreCase("value")) {
              propertyValue += reader.getText();
            }
            break;
        }
      }
      reader.close();
      context.write(new Text(propertyName.trim()), new Text(propertyValue.trim()));
    } catch (Exception e) {
      log.error("Error processing '" + document + "'", e);
    }
  }
}

Our Map is given a Text instance, which contains a String representation of the data between the start and end tags. In our code we're using Java's built-in Streaming API for XML (StAX) parser to extract the key and value for each property and output them. If we run our MapReduce job against Cloudera's core-site.xml and cat the output, we'll see the output below.

$ hadoop fs -put $HADOOP_HOME/conf/core-site.xml core-site.xml

$ bin/run.sh com.manning.hip.ch3.xml.HadoopPropertyXMLMapReduce \
  core-site.xml output

$ hadoop fs -cat output/part*
fs.default.name hdfs://localhost:8020
hadoop.tmp.dir /var/lib/hadoop-0.20/cache/${user.name}
hadoop.proxyuser.oozie.hosts *
hadoop.proxyuser.oozie.groups *

This output shows that we have successfully worked with XML as an input serialization format with MapReduce! Not only that, we can support huge XML files, since the InputFormat supports splitting XML.

WRITING XML
Having successfully read XML, the next question would be how do we write XML? In our Reducer, we have callbacks that occur before and after our main reduce method is called, which we can use to emit a start and end tag.
public static class Reduce extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void setup(Context context)
      throws IOException, InterruptedException {
    context.write(new Text("<configuration>"), null);    #1
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    context.write(new Text("</configuration>"), null);   #2
  }

  private Text outputKey = new Text();

  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      outputKey.set(constructPropertyXml(key, value));   #3
      context.write(outputKey, null);                    #4
    }
  }

  public static String constructPropertyXml(Text name, Text value) {
    StringBuilder sb = new StringBuilder();
    sb.append("<property><name>").append(name)
      .append("</name><value>").append(value)
      .append("</value></property>");
    return sb.toString();
  }
}

#1 Uses the setup method to write the root element start tag.
#2 Uses the cleanup method to write the root element end tag.
#3 Constructs a child XML element for each key/value combination we get in the Reducer.
#4 Emits the XML element.

This could also be embedded in an OutputFormat.

PIG
If you want to work with XML in Pig, the Piggybank library (a user-contributed library of useful Pig code) contains an XMLLoader. It works in a similar way to our technique and captures all of the content between a start and end tag and supplies it as a single bytearray field in a Pig tuple.

HIVE
Currently, there doesn't seem to be a way to work with XML in Hive. You would have to write a custom SerDe[1].

Discussion
Mahout's XML InputFormat certainly helps you work with XML. However, it's very sensitive to an exact string match of both the start and end element names. If the element tag can contain attributes with variable values, or the generation of the element can't be controlled and could result in XML namespace qualifiers being used, then this approach may not work for you. Also problematic will be situations where the element name you specify is used as a descendant child element.
If you have control over how the XML is laid out in the input, this exercise can be simplified by having a single XML element per line. This will let you use the built-in MapReduce text-based InputFormats (such as TextInputFormat), which treat each line as a record and split accordingly to preserve that demarcation.
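With one element per line, the mapper's job reduces to parsing a single self-contained string. As a rough sketch (the class name, regex, and helper are ours, not the book's code, and the Hadoop wiring is omitted), this is the per-line extraction such a mapper could perform on each Text record TextInputFormat hands it:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the per-line parse a mapper could run when each input line
// holds one complete <property> element. With TextInputFormat, the
// framework delivers exactly one such line per map() call.
public class LineXmlParse {
    private static final Pattern PROPERTY = Pattern.compile(
            "<name>(.*?)</name>\\s*<value>(.*?)</value>");

    // Returns {name, value}, or null if the line isn't a property element.
    public static String[] parseProperty(String line) {
        Matcher m = PROPERTY.matcher(line);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String line = "<property><name>fs.default.name</name>"
                + "<value>hdfs://localhost:8020</value></property>";
        String[] kv = parseProperty(line);
        System.out.println(kv[0] + "\t" + kv[1]);
    }
}
```

A regex is only safe here because the layout is controlled; for arbitrary XML you would still want a real parser such as StAX, as in the Mapper shown earlier.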
Another option worth considering is a preprocessing step, where you could convert the original XML into a separate line per XML element, or convert it into an altogether different data format, such as a SequenceFile or Avro, both of which solve the splitting problem for you.
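The first half of that preprocessing idea can be sketched in a few lines of plain Java (the class and method names are ours; a production pass would stream the file rather than hold it in memory, or write a SequenceFile/Avro container instead):

```java
// Sketch of a preprocessing pass that rewrites pretty-printed XML so
// each <property> element sits on its own line, ready for the
// line-oriented InputFormats such as TextInputFormat.
public class XmlToLines {
    public static String onePropertyPerLine(String xml) {
        // Drop newlines and surrounding indentation inside elements,
        // then start a fresh line at every <property> start tag.
        String flat = xml.replaceAll("\\s*\\n\\s*", "");
        return flat.replace("<property>", "\n<property>").trim();
    }

    public static void main(String[] args) {
        String pretty = "<property>\n  <name>a</name>\n  <value>1</value>\n</property>\n"
                + "<property>\n  <name>b</name>\n  <value>2</value>\n</property>";
        System.out.println(onePropertyPerLine(pretty));
    }
}
```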
There’s a streaming class StreamXmlRecordReader to allow you to work with XML in your streamingcode.
We have a handle on how to work with XML, so let's move on to tackle another popular serialization format, JSON. JSON shares the machine- and human-readable traits of XML and has existed since the early 2000s. It is less verbose than XML and doesn't have the rich typing and validation features available in XML.
Technique 2: MapReduce and JSON
Our technique covers how you can work with JSON in MapReduce. We'll also cover a method by which a JSON file can be partitioned for concurrent reads.
Problem
Figure 1 shows us the problem with using JSON in MapReduce. If you are working with large JSON files, you need to be able to split them. But, given a random offset in a file, how do we determine the start of the next JSON element, especially when working with JSON that has multiple hierarchies, such as in the example below?
Figure 1 Example of issue with JSON and multiple input splits
Solution
JSON is harder to partition into distinct segments than a format such as XML because JSON doesn't have a token (like an end tag in XML) to denote the start or end of a record.
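A brace-depth scan makes this concrete: a '}' only marks a record boundary when nesting depth returns to zero, so a reader dropped at a random offset cannot resynchronize the way our tag-scanning XML reader could. This plain-Java sketch (our own helper, not ElephantBird code; it ignores braces inside quoted strings) finds where the first top-level object ends:

```java
// Sketch: find the end of the first top-level JSON object by tracking
// brace depth. Unlike XML's distinctive end tag, '}' characters nest,
// so only a return to depth zero marks a record boundary.
// (Simplified: braces inside quoted string values are not handled.)
public class JsonBoundary {
    // Returns the index just past the first complete top-level object,
    // or -1 if the text doesn't close one.
    public static int endOfFirstObject(String json) {
        int depth = 0;
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (c == '{') depth++;
            else if (c == '}' && --depth == 0) return i + 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        String two = "{\"a\":{\"b\":1}}{\"a\":2}";
        int end = endOfFirstObject(two);
        // Prints the first record only: the inner '}' did not end it.
        System.out.println("first record: " + two.substring(0, end));
    }
}
```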
ElephantBird[2], an open-source project that contains some useful utilities for working with LZO compression, has an LzoJsonInputFormat, which can read JSON, but it requires that the input file be LZOP compressed. We'll use this code as a template for our own JSON InputFormat, which doesn't have the LZOP compression requirement.
We're cheating with our solution because we're assuming that each JSON record is on a separate line. Our JsonRecordFormat is simple and does nothing other than construct and return a JsonRecordReader, so we'll skip over that code. The JsonRecordReader emits LongWritable, MapWritable key/value pairs to the Mapper, where the MapWritable is a map of JSON element names and their values. Let's take a look at how this RecordReader works. It leverages the LineRecordReader, which is a built-in MapReduce reader that emits a record for each line. To convert the line to a MapWritable, it uses the following method.

public static boolean decodeLineToJson(JSONParser parser, Text line,
                                       MapWritable value) {
  try {
    JSONObject jsonObj = (JSONObject) parser.parse(line.toString());
    for (Object key : jsonObj.keySet()) {
      Text mapKey = new Text(key.toString());
      Text mapValue = new Text();
      if (jsonObj.get(key) != null) {
        mapValue.set(jsonObj.get(key).toString());
      }
      value.put(mapKey, mapValue);
    }
    return true;
  } catch (ParseException e) {
    LOG.warn("Could not json-decode string: " + line, e);
    return false;
  } catch (NumberFormatException e) {
    LOG.warn("Could not parse field into number: " + line, e);
    return false;
  }
}

It uses the json-simple[3] parser to parse the line into a JSON object and then iterates over the keys, putting the keys and values into a MapWritable. The Mapper is given the JSON data in LongWritable, MapWritable pairs and can process the data accordingly. The code for the MapReduce job is very basic. We're going to demonstrate the code using the JSON below.

{
  "results" : [
    {
      "created_at" : "Thu, 29 Dec 2011 21:46:01 +0000",
      "from_user" : "grep_alex",
      "text" : "RT @kevinweil: After a lot of hard work by ..."
    },
    {
      "created_at" : "Mon, 26 Dec 2011 21:18:37 +0000",
      "from_user" : "grep_alex",
      "text" : "@miguno pull request has been merged, thanks again!"
    }
  ]
}

Since our technique assumes a JSON object per line, the actual JSON file we'll work with is shown below.
{"created_at" : "Thu, 29 Dec 2011 21:46:01 +0000","from_user" :...
{"created_at" : "Mon, 26 Dec 2011 21:18:37 +0000","from_user" :...

We'll copy the JSON file into HDFS and run our MapReduce code. Our MapReduce code simply writes each JSON key/value to the output.

$ hadoop fs -put test-data/ch3/singleline-tweets.json \
  singleline-tweets.json

$ bin/run.sh com.manning.hip.ch3.json.JsonMapReduce \
  singleline-tweets.json output

$ hadoop fs -cat output/part*
text RT @kevinweil: After a lot of hard work by ...
from_user grep_alex
created_at Thu, 29 Dec 2011 21:46:01 +0000
text @miguno pull request has been merged, thanks again!
from_user grep_alex
created_at Mon, 26 Dec 2011 21:18:37 +0000

WRITING JSON
An approach similar to what we looked at for writing XML could also be used to write JSON.

PIG
ElephantBird contains a JsonLoader and an LzoJsonLoader, which can be used to work with JSON in Pig. They also work with JSON that is line based. Each Pig tuple contains a field for each JSON element in the line as a chararray.

HIVE
Hive contains a DelimitedJSONSerDe, which can serialize JSON but, unfortunately, not deserialize it, so you can't load data into Hive using this SerDe.

Discussion
Our solution works with the assumption that the JSON input is structured with a line per JSON object. How would we work with JSON objects that span multiple lines? The authors have an experimental project on GitHub[4], which works with multiple input splits over a single JSON file. The key to this approach is searching for a specific JSON member and retrieving the containing object. There's also a Google Code project called hive-json-serde[5], which can support both serialization and deserialization.

Summary
As you can see, using XML and JSON in MapReduce is kludgy and has rigid requirements about how your data is laid out. Supporting them in MapReduce is complex and error prone, as they don't naturally lend themselves to splitting. Alternative file formats, such as Avro and SequenceFiles, have built-in support for splittability.
If you would like to purchase Hadoop in Practice, DZone members can receive a 38% discount by entering the promotional code dzone38 during checkout at Manning.com.
[1] SerDe is a shortened form of Serializer/Deserializer, the mechanism that allows Hive to read and write data in HDFS.
[2] https://github.com/kevinweil/elephant-bird
[3] http://code.google.com/p/json-simple/
[4] A multiline JSON InputFormat: https://github.com/alexholmes/json-mapreduce
[5] http://code.google.com/p/hive-json-serde/
Source: http://www.manning.com/holmes/
Tags: Apache big data Hadoop Open Source