Big Data is Here – Hadoop to the Rescue!
Shay Sofer, AlphaCSP
On Today's Menu...
Today we will:
» Understand what Big Data is
» Get to know Hadoop
» Experience some MapReduce magic
» Persist very large files
» Learn some nifty tricks
Data is Everywhere
Facts and Numbers
» IDC: "Total data in the universe: 1.2 Zettabytes" (May 2010)
» 1 ZB = 1 trillion gigabytes (1,000,000,000,000,000,000,000 bytes = 10^21)
» 60% growth from 2009
» By 2020 we will reach 35 ZB
Facts and Numbers (chart; source: www.idc.com)
Facts and Numbers
» 234M web sites
» 7M new sites in 2009
» New York Stock Exchange – 1 TB of data per day
» Web 2.0:
  147M blogs (and counting…)
  Twitter – ~12 TB of data per day
Facts and Numbers - Facebook
» 500M users
» 40M photos per day
» More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums etc.) shared each month
Why are you here?
» Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools
» Where and how do we store this information?
» How do we perform analyses on such large datasets?
Scale-up vs. Scale-out
» Scale-up: adding resources to a single node in a system, typically the addition of CPUs or memory to a single computer
» Scale-out: adding more nodes to a system, e.g. adding a new computer with commodity hardware to a distributed software application
Introducing…Hadoop!
What is Hadoop?
» A framework for writing and running distributed applications that process large amounts of data
» Runs on large clusters of commodity hardware; a cluster with hundreds of machines is standard
» Inspired by Google's architecture: MapReduce and GFS
Why Hadoop?
» Robust – handles failures of individual nodes
» Scales linearly
» Open source
» A top-level Apache project
Facebook (Again…)
» Facebook holds the largest known Hadoop storage cluster in the world:
  2000 machines
  12 TB per machine (some have 24 TB)
  32 GB of RAM per machine
» A total of more than 21 Petabytes (1 Petabyte = 1024 Terabytes)
History
2002 – Apache Nutch, an open-source web search engine, founded by Doug Cutting
2004 – Google's GFS & MapReduce papers published
2006 – Cutting joins Yahoo!, forms Hadoop
2008 – Hadoop hits web scale, being used by Yahoo! for web indexing
2009 – Sorting 1 TB in 62 seconds
2010 – Creating the longest Pi yet
The Hadoop Ecosystem (diagram): Hadoop Common and MapReduce over HDFS, with related projects – Pig, Hive, HBase, Chukwa, ZooKeeper
IDE Plugin (screenshot)
Hadoop and MapReduce
Definition
» A programming model for processing and generating large data sets
» Introduced by Google
» Parallel processing of the map/reduce operations
MapReduce – The Story of Sam
Sam believed "An apple a day keeps a doctor away"
(Image: Sam's mother handing him an apple)
Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs
MapReduce – The Story of Sam
Sam thought of "drinking" the apple
He used a (…) to cut the (…) and a (…) to make juice.
Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs
MapReduce – The Story of Sam
Sam applied his invention to all the fruits he could find in the fruit basket:
(map '(…)) → (…)
(reduce '(…)) → (…)
A list of values mapped into another list of values, which gets reduced into a single value
Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs
MapReduce – The Story of Sam
Sam got his first job for his talent in making juice
» Now, it's not just one basket but a whole container of fruits
» Also, they produce a list of juice types separately
» But, Sam had just ONE (…) and ONE (…)
Large data and a list of values for output
Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs
MapReduce – The Story of Sam
Sam implemented a parallel version of his innovation:
Map
  Each map input: list of <key, value> pairs, e.g. (<a, …>, <o, …>, <p, …>, …)
  Each map output: list of <key, value> pairs, e.g. (<a', …>, <o', …>, <p', …>, …)
Grouped by key (shuffle)
Reduce
  Each reduce input: <key, value-list>, e.g. <a', (…)>
  Reduced into a list of values
Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs
First Map, Then Reduce
» Mapper – takes a series of key/value pairs, processes each, and generates output key/value pairs:
  (k1, v1) → list(k2, v2)
» Reducer – iterates through the values that are associated with a specific key and generates output:
  (k2, list(v2)) → list(k3, v3)
» The Mapper takes the input data and filters and transforms it into something the Reducer can aggregate over
MapReduce data flow (diagram): Input → Map (one per split) → Shuffle → Reduce → Output
Hadoop Data Types
» Hadoop comes with a number of predefined classes: BooleanWritable, ByteWritable, LongWritable, Text, etc.
» Custom types implement the Writable interface – see the sketch below
» Supports pluggable serialization frameworks, e.g. Apache Avro
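As an illustration, a minimal sketch of a custom value type. The class name PlayCount and its field are made up for the example; the write/readFields contract is Hadoop's actual Writable interface.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical custom value type, serialized by Hadoop via the Writable contract
public class PlayCount implements Writable {
  private int scrobbles;

  public void write(DataOutput out) throws IOException {
    out.writeInt(scrobbles);   // serialize the single field
  }

  public void readFields(DataInput in) throws IOException {
    scrobbles = in.readInt();  // deserialize fields in the same order
  }
}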
Input / Output Formats
» TextInputFormat / TextOutputFormat
» KeyValueTextInputFormat
» SequenceFile – a Hadoop-specific compressed binary file format, optimized for passing data between 2 MapReduce jobs (see the sketch below)
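A rough sketch of that two-job handoff with the old mapred API; jobA and jobB stand for two hypothetical JobConf instances:

// jobA writes its output as a binary SequenceFile...
jobA.setOutputFormat(SequenceFileOutputFormat.class);

// ...and jobB reads that SequenceFile back, keys and values untouched
jobB.setInputFormat(SequenceFileInputFormat.class);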
Word Count – The Mapper

public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, new IntWritable(1));
    }
  }
}

Input <k1, "Hello World Bye World"> produces:
<Hello, 1> <World, 1> <Bye, 1> <World, 1>
Word Count – The Reducer

public static class ReduceClass extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Input <Hello, (1)> <World, (1, 1)> <Bye, (1)> produces:
<Hello, 1> <World, 2> <Bye, 1>
Word Count – The Driver

public static void main(String[] args) throws IOException {
  JobConf job = new JobConf(WordCount.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  job.setMapperClass(MapClass.class);
  job.setReducerClass(ReduceClass.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  // job.setInputFormat(KeyValueTextInputFormat.class);
  JobClient.runJob(job);
}
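One addition worth knowing about (not on the slide): because summing is associative, the same reducer class can be plugged in as a combiner to pre-aggregate counts on the map side before the shuffle:

job.setCombinerClass(ReduceClass.class);  // local aggregation before the shuffle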
Hadoop @ Last.FM
» Music discovery website
» Scrobbling / streaming via radio
» 40M unique visitors per month
» Over 40M scrobbles per day
» Each scrobble creates a log line
Sample Listening Data
» Goal: create a "Unique listeners per track" chart

UserId | TrackId | Scrobbles | Radio | Skip
55551  | 100     | 5         | 10    | 0
55551  | 900     | 0         | 3     | 3
55552  | 101     | 0         | 5     | 0
55553  | 102     | 5         | 0     | 0
Unique Listens - Mapper

public void map(LongWritable position, Text rawLine,
                OutputCollector<IntWritable, IntWritable> output,
                Reporter reporter) throws IOException {
  int scrobbles, radioListens;  // assume they are initialized
  IntWritable trackId, userId;  // likewise, omitted for brevity
  // if track somehow is marked with zero plays - ignore
  if (scrobbles <= 0 && radioListens <= 0) {
    return;
  }
  // output user id against track id
  output.collect(trackId, userId);
}
Unique Listens - Reducer

public void reduce(IntWritable trackId, Iterator<IntWritable> values,
                   OutputCollector<IntWritable, IntWritable> output,
                   Reporter reporter) throws IOException {
  Set<Integer> usersSet = new HashSet<Integer>();
  // add all userIds to the set, duplicates removed
  while (values.hasNext()) {
    IntWritable userId = values.next();
    usersSet.add(userId.get());
  }
  // output: trackId -> number of unique listeners per track
  output.collect(trackId, new IntWritable(usersSet.size()));
}
Chaining
» Complex tasks sometimes need to be broken down into subtasks
» The output of the previous job goes as input to the next job: job-a | job-b | job-c
» Simply launch the driver of the 2nd job after the 1st finishes – see the sketch below
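A minimal sketch of that pattern with the old JobConf API. FirstStep, SecondStep and the paths are placeholder names; the key point is that JobClient.runJob blocks until its job completes:

Path tmp = new Path("tmp/between-jobs");               // hypothetical intermediate dir

JobConf jobA = new JobConf(FirstStep.class);           // driver of the 1st job
FileInputFormat.addInputPath(jobA, new Path("input"));
FileOutputFormat.setOutputPath(jobA, tmp);
JobClient.runJob(jobA);                                // blocks until job-a finishes

JobConf jobB = new JobConf(SecondStep.class);          // driver of the 2nd job
FileInputFormat.addInputPath(jobB, tmp);               // job-a's output = job-b's input
FileOutputFormat.setOutputPath(jobB, new Path("output"));
JobClient.runJob(jobB);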
Hadoop Streaming
» Hadoop supports other languages via an API called Streaming
» Use UNIX commands as mappers and reducers
» Or use any script that processes line-oriented data streams from STDIN and writes to STDOUT – Python, Perl etc.
Hadoop Streaming

$ hadoop jar hadoop-streaming.jar -input input/myFile.txt -output output.txt \
    -mapper myMapper.py -reducer myReducer.py
HDFS – Hadoop Distributed File System
Distributed FileSystem
» A large dataset can and will outgrow the storage capacity of a single physical machine
» Partition it across separate machines – distributed filesystems
» Network-based, hence complex – what happens when a node fails?
HDFS - Hadoop Distributed FileSystem
» Designed for storing very large files, running on clusters of commodity hardware
» Highly fault-tolerant (via replication)
» A typical file is gigabytes to terabytes in size
» High throughput
Hadoop's Building Blocks
Running Hadoop = running a set of daemons on different servers in your network:
» NameNode
» DataNode
» Secondary NameNode
» JobTracker
» TaskTracker
Topology of a Hadoop Cluster (diagram): the NameNode and JobTracker run on the master, a Secondary NameNode runs on its own node, and each slave runs a DataNode plus a TaskTracker
The NameNode
» HDFS has a master/slave architecture; the NameNode acts as the master
» A single NameNode per HDFS
» Keeps track of:
  How the files are broken into blocks
  Which nodes store those blocks
  The overall health of the filesystem
» Memory and I/O intensive
The DataNode
» Each slave machine hosts a DataNode daemon
» Serves read/write/delete requests from the NameNode
» Manages the storage attached to the nodes
» Sends a periodic Heartbeat to the NameNode
Fault Tolerance - Replication
» Failure is the norm rather than the exception
» Detection of faults and quick, automatic recovery
» Each file is stored as a sequence of blocks (default: 64 MB each)
» The blocks of a file are replicated for fault tolerance
» Block size and replicas are configurable per file – see the sketch below
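A small sketch of those per-file knobs through the HDFS API; the path and values are illustrative. This create overload takes an overwrite flag, buffer size, replication factor and block size:

FileSystem hdfs = FileSystem.get(new Configuration());
Path p = new Path("/user/chuck/hugeFile.txt");

// create with 2 replicas and a 128 MB block size instead of the cluster defaults
FSDataOutputStream out = hdfs.create(p, true, 4096, (short) 2, 128L * 1024 * 1024);

// or change the replication factor of an existing file
hdfs.setReplication(p, (short) 3);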
Secondary NameNode
» An assistant daemon that should run on a dedicated node
» Takes snapshots of the HDFS metadata
» Doesn't receive real-time changes
» Helps minimize downtime in case the NameNode crashes
JobTracker
» One per cluster – runs on the master node
» Receives job requests submitted by the client
» Schedules and monitors MapReduce jobs on TaskTrackers
TaskTracker
» Runs map and reduce tasks
» Sends progress reports to the JobTracker
Working with HDFS
» Via file commands:

$ hadoop fs -mkdir /user/chuck
$ hadoop fs -put hugeFile.txt
$ hadoop fs -get anotherHugeFile.txt

» Programmatically (HDFS API):

FileSystem hdfs = FileSystem.get(new Configuration());
FSDataOutputStream out = hdfs.create(filePath);
while (...) {
  out.write(buffer, 0, bytesRead);
}
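Filling in the ellipsis above, a self-contained sketch that streams a local file into HDFS (the file names are made up):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPut {
  public static void main(String[] args) throws IOException {
    FileSystem hdfs = FileSystem.get(new Configuration());
    InputStream in = new BufferedInputStream(new FileInputStream("hugeFile.txt"));
    FSDataOutputStream out = hdfs.create(new Path("/user/chuck/hugeFile.txt"));

    byte[] buffer = new byte[4096];
    int bytesRead;
    while ((bytesRead = in.read(buffer)) > 0) {  // same write loop as above
      out.write(buffer, 0, bytesRead);
    }
    in.close();
    out.close();
  }
}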
Tips & Tricks
Tip #1: Hadoop Configuration Types

Mode                     | # of Machines              | Daemons
Local mode               | local machine              | no daemons, everything in 1 JVM
Pseudo-distributed mode  | local machine              | daemons running on separate JVMs ("cluster of one")
Fully-distributed mode   | cluster with several nodes | daemons running on separate JVMs
Tip #2: JobTracker UI
» Monitoring events in the cluster can prove to be a bit difficult
» A web interface for our cluster
» Shows a summary of the cluster
» Details about the jobs that are currently running, completed and failed
JobTracker Web UI (screenshot)
Tip #3: IsolationRunner – Hadoop's Time Machine
» Digging through logs, or… re-running the exact same scenario with the same input on the same node?
» IsolationRunner can rerun the failed task to reproduce the problem
» Attach a debugger
» Set keep.failed.task.files = true (see the snippet below)
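The flag is set per job; a one-line sketch on the JobConf (keep.failed.task.files is the old mapred property name):

// keep the intermediate data of failed tasks so IsolationRunner can replay them
conf.setBoolean("keep.failed.task.files", true);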
Tip #4: Compression
» The output of the map phase (which will be shuffled across the network) can be quite large
» Built-in support for compression
» Different codecs: gzip, bzip2 etc.
» Transparent to the developer:

conf.setCompressMapOutput(true);
conf.setMapOutputCompressorClass(GzipCodec.class);
Tip #5: Speculative Execution
» A node can experience a slowdown, thus slowing down the entire job
» If a task is identified as "slow", it will be scheduled to run on another node in parallel
» As soon as one copy finishes successfully, the others will be killed
» An optimization – not a feature (it can be toggled per job, see below)
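A short sketch of those per-job switches on the old JobConf API; turning off reduce-side speculation is common when reducers have side effects:

conf.setMapSpeculativeExecution(true);      // the default: on
conf.setReduceSpeculativeExecution(false);  // e.g. off for reducers with side effects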
Tip #6: DataJoin Package
» Input can come from 2 (or more) different sources
» Hadoop has a contrib package called datajoin
» A generic framework for performing reduce-side joins
Hadoop in the Cloud – Amazon Web Services
Cloud Computing and AWS
» Cloud computing – shared resources and information are provided on demand
» Rent a cluster rather than buy it
» The best-known infrastructure for cloud computing is Amazon Web Services (AWS), launched in July 2002
Hadoop in the Cloud – Core Services
» Elastic Compute Cloud (EC2): a large farm of VMs that users can rent to run their applications; a wide range of instance types to choose from (price varies)
» Simple Storage Service (S3): online storage for persisting MapReduce data for future use
» Hadoop comes with built-in support for EC2 and S3:

$ hadoop-ec2 launch-cluster <cluster-name> <num-of-slaves>
EC2 Data Flow (diagram): our data is loaded into HDFS on EC2, where the MapReduce tasks run
EC2 & S3 Data Flow (diagram): our data is persisted in S3 and loaded into HDFS on EC2 for the MapReduce tasks
Hadoop-Related Projects
Pig
» Thinking at the level of map, reduce and job chaining, instead of simple data-flow operations, is non-trivial
» Pig simplifies Hadoop programming
» Provides a high-level data processing language: Pig Latin
» Being used by Yahoo! (70% of production jobs), Twitter, LinkedIn, eBay etc.
» Problem: a Users file and a Pages file; find the top 5 most visited pages by users aged 18-25
Pig Latin – Data Flow Language

Users = LOAD 'users.csv' AS (name, age);
Fltrd = FILTER Users BY age >= 18 AND age <= 25;
Pages = LOAD 'pages.csv' AS (user, url);
Jnd   = JOIN Fltrd BY name, Pages BY user;
Grpd  = GROUP Jnd BY url;
Smmd  = FOREACH Grpd GENERATE group, COUNT(Jnd) AS clicks;
Srtd  = ORDER Smmd BY clicks DESC;
Top5  = LIMIT Srtd 5;
STORE Top5 INTO 'top5sites.csv';
Hive
» A data warehousing package built on top of Hadoop
» SQL-like queries on large datasets
HBase
» Hadoop database for random read/write access (a short client sketch follows below)
» Uses HDFS as the underlying file system
» Supports billions of rows and millions of columns
» Facebook chose HBase as a framework for their new version of "Messages"
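For flavor, a minimal write/read sketch against a hypothetical "messages" table using the classic HTable client API; the table, family and qualifier names are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "messages");        // hypothetical table

    // random write: one cell under row key "user55551"
    Put put = new Put(Bytes.toBytes("user55551"));
    put.add(Bytes.toBytes("msg"), Bytes.toBytes("body"), Bytes.toBytes("hello"));
    table.put(put);

    // random read of the same row
    Result row = table.get(new Get(Bytes.toBytes("user55551")));
    System.out.println(Bytes.toString(
        row.getValue(Bytes.toBytes("msg"), Bytes.toBytes("body"))));
    table.close();
  }
}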
Cloudera
» A distribution of Hadoop that simplifies deployment by providing the most recent stable version of Apache Hadoop with patches and backports
Mahout
» Machine learning algorithms for Hadoop
» Coming up next… (-:
Last Words
» Big Data can and will cause serious scalability problems for your application
» MapReduce for analysis, a distributed filesystem for storage
» Hadoop = MapReduce + HDFS, and much more
» AWS integration is easy
» Lots of documentation
References
» Hadoop in Action / Chuck Lam (Manning)
» Hadoop: The Definitive Guide, 2nd Edition / Tom White (O'Reilly)
» Apache Hadoop documentation
» Hadoop @ Last.FM presentation
» MapReduce in Simple Terms / Saliya Ekanayake
» Amazon Web Services
Thank you!