
Practice: Process logs with Apache Hadoop
Extract useful data from logs using Hadoop on typical Linux systems

M. Tim Jones
Independent author and consultant

    30 May 2012

Logs are an essential part of any computing system, supporting capabilities from audits to error management. As logs grow and the number of log sources increases (such as in cloud environments), a scalable system is necessary to efficiently process logs. This practice session explores processing logs with Apache Hadoop from a typical Linux system.

Logs come in all shapes, but as applications and infrastructures grow, the result is a massive amount of distributed data that's useful to mine. From web and mail servers to kernel and boot logs, modern servers hold a rich set of information. Massive amounts of distributed data are a perfect application for Apache Hadoop, as are log files: time-ordered, structured textual data.

You can use log processing to extract a variety of information. One of its most common uses is to extract errors or count the occurrence of some event within a system (such as login failures). You can also extract some types of performance data, such as connections or transactions per second. Other useful information includes the extraction (map) and construction of site visits (reduce) from a web log. This analysis can also support detection of unique user visits in addition to file access statistics.
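As a concrete illustration of the counting case, the short sketch below shows what a Hadoop streaming mapper for login failures might look like. It is a minimal sketch, not part of the exercises that follow: the "authentication failure" phrase and the login-failure key are assumptions made for illustration, and a summing reducer (like the one developed in Exercise 3) would total its output.

#!/usr/bin/env python
# Minimal sketch of a streaming mapper that counts login failures.
# Assumes syslog-style lines on stdin and that failed logins contain
# the hypothetical phrase "authentication failure".
import sys

for line in sys.stdin:
    if 'authentication failure' in line:
        # Emit one key/value pair per failure; a summing reducer totals them.
        print('login-failure\t1')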

    Overview

About this article
You may want to read these articles before working through the exercises:

Distributed computing with Linux and Hadoop
Distributed data processing with Hadoop, Part 1: Getting started
Distributed data processing with Hadoop, Part 2: Going further
Distributed data processing with Hadoop, Part 3: Application development
Data processing with Apache Pig

    These exercises give you practice in:


Getting a simple Hadoop environment up and running
Interacting with the Hadoop file system (HDFS)
Writing a simple MapReduce application
Writing a filtering Apache Pig query
Writing an accumulating Pig query

Prerequisites
To get the most from these exercises, you should have a basic working knowledge of Linux. Some knowledge of virtual appliances is also useful for bringing up a simple environment.

Exercise 1. Get a simple Hadoop environment up and running
There are two ways to get Hadoop up and running. The first is to install the Hadoop software, and then configure it for your environment (the simplest case is a single-node instance, in which all daemons run on a single node). See Distributed data processing with Hadoop, Part 1: Getting started for details.

The second and simpler way is to use Cloudera's Hadoop Demo VM (which contains a Linux image plus a preconfigured Hadoop instance). The Cloudera virtual machine (VM) runs on VMware, Kernel-based Virtual Machine (KVM), or VirtualBox.

    Choose a method, and complete the installation. Then, complete the following task:

    Verify that Hadoop is running by issuing an HDFS ls command.

Exercise 2. Interact with the HDFS
The HDFS is a special-purpose file system that manages data and replicas within a Hadoop cluster, distributing them to compute nodes for efficient processing. Even though HDFS is a special-purpose file system, it implements many of the typical file system commands. To retrieve help information for Hadoop, issue the command hadoop dfs. Perform the following tasks:

Create a test subdirectory within the HDFS.
Move a file from the local file system into the HDFS subdirectory using copyFromLocal.
For extra credit, view the file within HDFS using a hadoop dfs command.

Exercise 3. Write a simple MapReduce application
As demonstrated in Distributed data processing with Hadoop, Part 3: Application development, writing a word count map and reduce application is simple. Using the Ruby example demonstrated in that article, develop a Python map and reduce application, and run them on a sample set of data. Recall that Hadoop sorts the output of map so that like words are contiguous, which provides a useful optimization for the reducer.
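To see why that sort matters, consider the intermediate key/value pairs that the mappers emit. The following plain-Python sketch (it does not use Hadoop, and the sample pairs are made up for illustration) simulates the sort between the map and reduce phases:

# Simulate the sort that Hadoop applies to map output ("word\t1" pairs).
pairs = ['the\t1', 'quick\t1', 'the\t1', 'fox\t1', 'the\t1']

for pair in sorted(pairs):
    print(pair)

After sorting, like words are contiguous (fox, quick, the, the, the), so the reducer can keep a single running count and emit it as soon as the word changes.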

Exercise 4. Write a simple Pig query
As you saw in Data processing with Apache Pig, Pig allows you to build simple scripts that are translated into MapReduce applications. In this exercise, you extract all log entries (from /var/log/messages) that contain both the word kernel: and the word terminating.


    Create a script that extracts all log lines with the predefined criteria.

Exercise 5. Write an aggregating Pig query
Log messages are generated by a variety of sources within a Linux system (such as kernel or dhclient). In this example, you want to discover the various sources that generate log messages and the number of log messages per source.

    Create a script that counts the number of log messages for each log source.

Exercise solutions
The specific output depends on your particular Hadoop installation and configuration.

Solution for Exercise 1. Get a simple Hadoop environment up and running
In Exercise 1, you perform an ls command on the HDFS. Listing 1 illustrates the proper solution.

Listing 1. Performing an ls operation on the HDFS
$ hadoop dfs -ls /
drwxrwxrwx   - hue    supergroup          0 2011-12-10 06:56 /tmp
drwxr-xr-x   - hue    supergroup          0 2011-12-08 05:20 /user
drwxr-xr-x   - mapred supergroup          0 2011-12-08 10:06 /var
$

    More or fewer files might be present depending on use.

Solution for Exercise 2. Interact with the HDFS
In Exercise 2, you create a subdirectory within HDFS and copy a file into it. Note that you create test data by dumping the kernel message buffer into a file. For extra credit, view the file within the HDFS using the cat command (see Listing 2).

Listing 2. Manipulating the HDFS
$ dmesg > kerndata
$ hadoop dfs -mkdir /test
$ hadoop dfs -ls /test
$ hadoop dfs -copyFromLocal kerndata /test/mydata
$ hadoop dfs -cat /test/mydata
Linux version 2.6.18-274-7.1.el5 ([email protected])
...
e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
$

Solution for Exercise 3. Write a simple MapReduce application
In Exercise 3, you create a simple word count MapReduce application in Python. Python is actually a great language in which to implement the word count example. You can find a useful writeup on Python MapReduce in Writing a Hadoop MapReduce Program in Python by Michael G. Noll.


This example assumes that you performed the steps of Exercise 2 (to ingest data into the HDFS). Listing 3 provides the map application.

Listing 3. Map application in Python
#!/usr/bin/env python

import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '%s\t1' % word

    Listing 4 provides the reduce application.

Listing 4. The reduce application in Python
#!/usr/bin/env python

from operator import itemgetter
import sys

last_word = None
last_count = 0
cur_word = None

# Hadoop sorts the map output by key, so all counts for a given word
# arrive contiguously and can be totaled with a single running count.
for line in sys.stdin:
    line = line.strip()

    cur_word, count = line.split('\t', 1)

    count = int(count)

    if last_word == cur_word:
        last_count += count
    else:
        if last_word:
            print '%s\t%s' % (last_word, last_count)
        last_count = count
        last_word = cur_word

# Emit the count for the final word.
if last_word == cur_word:
    print '%s\t%s' % (last_word, last_count)

    Listing 5 illustrates the process of invoking the Python MapReduce example in Hadoop.

Listing 5. Testing Python MapReduce with Hadoop
$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
    -file pymap.py -mapper pymap.py -file pyreduce.py -reducer pyreduce.py \
    -input /test/mydata -output /test/output
...
$ hadoop dfs -cat /test/output/part-00000
...
write 3
write-combining 2
wrong. 1
your 2
zone: 2
zonelists. 1
$


    Solution for Exercise 4. Write a simple Pig query

In Exercise 4, you extract /var/log/messages log entries that contain both the word kernel: and the word terminating. In this case, you use Pig in local mode to query the local file (see Listing 6). Load the file into a Pig relation (log), filter its contents to only kernel messages, and then filter that resulting relation for terminating messages.

    Listing 6. Extracting all kernel + terminating log messages

$ pig -x local
grunt> log = LOAD '/var/log/messages';
grunt> logkern = FILTER log BY $0 MATCHES '.*kernel:.*';
grunt> logkernterm = FILTER logkern BY $0 MATCHES '.*terminating.*';
grunt> dump logkernterm
...
(Dec 8 11:08:48 localhost kernel: Kernel log daemon terminating.)
grunt>

    Solution for Exercise 5. Write an aggregating Pig query

In Exercise 5, extract the log sources and log message counts from /var/log/messages. In this case, create a script for the query, and execute it through Pig's local mode. In Listing 7, you load the file and parse the input using a space as a delimiter. You then assign the delimited string fields to your named elements. Use the GROUP operator to group the messages by their source, and then use the FOREACH operator and COUNT to aggregate your data.

    Listing 7. Log sources and counts script for /var/log/messages

log = LOAD '/var/log/messages' USING PigStorage(' ') AS (month:chararray,
        day:int, time:chararray, host:chararray, source:chararray);
sources = GROUP log BY source;
counts = FOREACH sources GENERATE group, COUNT(log);
dump counts;
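If it helps to see what the GROUP and COUNT steps compute, the following plain-Python sketch (not part of the exercise) produces the same per-source tally; like the Pig script, it assumes the source is the fifth space-delimited field of each line:

#!/usr/bin/env python
# Rough Python equivalent of the Pig script above: count log messages
# per source, treating the fifth space-delimited field as the source.
from collections import defaultdict

counts = defaultdict(int)
with open('/var/log/messages') as logfile:
    for line in logfile:
        fields = line.split(' ')
        if len(fields) >= 5:
            counts[fields[4]] += 1

for source, count in counts.items():
    print('(%s,%d)' % (source, count))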

The result of executing the Pig script is shown in Listing 8.

    Listing 8. Executing your log sources script

$ pig -x local logsources.pig
...
(init:,1)
(gconfd,12)
(kernel:,505)
(syslogd,2)
(dhclient:,91)
(localhost,1168)
(gpm[2139]:,2)
(gpm[2168]:,2)
(NetworkManager:,292)
(avahi-daemon[3345]:,37)
(avahi-daemon[3362]:,44)
(nm-system-settings:,8)
$


Resources
Learn

Distributed computing with Linux and Hadoop (Ken Mann and M. Tim Jones, developerWorks, December 2008): Discover Apache's Hadoop, a Linux-based software framework that enables distributed manipulation of vast amounts of data, including parallel indexing of internet web pages.

Distributed data processing with Hadoop, Part 1: Getting started (M. Tim Jones, developerWorks, May 2010): Explore the Hadoop framework, including its fundamental elements, such as the Hadoop file system (HDFS), common node types, and ways to monitor and manage Hadoop using its core web interfaces. Learn to install and configure a single-node Hadoop cluster, and delve into the MapReduce application.

Distributed data processing with Hadoop, Part 2: Going further (M. Tim Jones, developerWorks, June 2010): Configure a more advanced setup with Hadoop in a multi-node cluster for parallel processing. You'll work with MapReduce functionality in a parallel environment and explore command-line and web-based management aspects of Hadoop.

Distributed data processing with Hadoop, Part 3: Application development (M. Tim Jones, developerWorks, July 2010): Explore the Hadoop APIs and data flow and learn to use them with a simple mapper and reducer application.

Data processing with Apache Pig (M. Tim Jones, developerWorks, February 2012): Pigs are known for rooting around and digging out anything they can consume. Apache Pig does the same thing for big data. Learn more about this tool and how to put it to work in your applications.

Writing a Hadoop MapReduce Program in Python (Michael G. Noll, updated October 2011, published September 2007): Learn to write a simple MapReduce program for Hadoop in the Python programming language in this tutorial.

IBM InfoSphere BigInsights Basic Edition offers a highly scalable and powerful analytics platform that can handle incredibly high data throughput rates that can range to millions of events or messages per second.

The Open Source developerWorks zone provides a wealth of information on open source tools and using open source technologies.

developerWorks Web development specializes in articles covering various web-based solutions.

Stay current with developerWorks technical events and webcasts focused on a variety of IBM products and IT industry topics.

Attend a free developerWorks Live! briefing to get up to speed quickly on IBM products and tools, as well as IT industry trends.

Watch developerWorks on-demand demos ranging from product installation and setup demos for beginners, to advanced functionality for experienced developers.

    Follow developerWorks on Twitter, or subscribe to a feed of Linux tweets on developerWorks.

    Get products and technologies

Cloudera's Hadoop Demo VM (May 2012): Start using Apache Hadoop with a set of virtual machines that include a Linux image and a preconfigured Hadoop instance.


IBM InfoSphere BigInsights Basic Edition -- IBM's Hadoop distribution -- is an integrated, tested and pre-configured, no-charge download for anyone who wants to experiment with and learn about Hadoop.

Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.

    Discuss

Check out developerWorks blogs and get involved in the developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.


    About the author

    M. Tim Jones

M. Tim Jones is an embedded firmware architect and the author of Artificial Intelligence: A Systems Approach, GNU/Linux Application Programming (now in its second edition), AI Application Programming (in its second edition), and BSD Sockets Programming from a Multilanguage Perspective. His engineering background ranges from the development of kernels for geosynchronous spacecraft to embedded systems architecture and networking protocols development. Tim is a platform architect with Intel and author in Longmont, Colo.

© Copyright IBM Corporation 2012 (www.ibm.com/legal/copytrade.shtml)
Trademarks (www.ibm.com/developerworks/ibm/trademarks/)

Table of Contents
Overview
Prerequisites
Exercise 1. Get a simple Hadoop environment up and running
Exercise 2. Interact with the HDFS
Exercise 3. Write a simple MapReduce application
Exercise 4. Write a simple Pig query
Exercise 5. Write an aggregating Pig query
Exercise solutions
Solution for Exercise 1. Get a simple Hadoop environment up and running
Solution for Exercise 2. Interact with the HDFS
Solution for Exercise 3. Write a simple MapReduce application
Solution for Exercise 4. Write a simple Pig query
Solution for Exercise 5. Write an aggregating Pig query
Resources
About the author
Trademarks