Hadoop Big Data Intro
2/16/2013 Hadoop/Big Data Intro
Provided agenda, with additions:
Addition: theory from papers
Addition: demo/code samples
Addition: system architecture
Goal: develop some theory
Agenda
Introduction to Big Data
Basic Concepts
Overview of Hadoop
Working with HDFS / Map Reduce
Architecture
Anatomy of file write / read
Admin and development
Introduce other components of the Hadoop ecosystem
Agenda (2)
Hive / HBase / Pig / Sqoop
Map Reduce features: architecture, working
Job Execution
We can cover this circa-2005 agenda in 3 hours with some additions. A hands-on lab is needed to really understand the content.
Big Data defn.
Big data: data that is too big to run SQL queries on.
Lots of data (we cover the Google approach, which is what Hadoop is based on)
Modifying the Hadoop components: JIRA
Building applications on Hadoop: competitive gap, Astyanax
DevOps: packaging, Chaos Monkey, AWS, ZooKeeper
10x
3-4x
Replacing Legacy Systems
Big Data Basic Concepts
Storing large amounts of data and doing something with them
Some sort of analytics
Easy: Tableau, Datameer
Competitive advantage
Small-scale analytics: R, Stats 202, demographics, weblogs
Large-scale analytics: CS246
You should be able to define domain-specific analytics POCs based on the next slide.
Big Data Analytics
Big Data started around 2000, from 2 design problems at Google, 1998-2000.
There is a separate Big Data product for each use case.
Google design problems / GFS: store internet pages on hard drives
Unstructured data: collect HTML and links; images?
20+ billion web pages x 20KB = 400+ TB
1 computer reads 30-35 MB/sec from disk
~4 months to read the web
~1,000 hard drives to store the web
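A quick sanity check of the read-time estimate, using the numbers above:

400 TB / (35 MB/sec) = 4x10^14 bytes / 3.5x10^7 bytes/sec ~ 1.1x10^7 sec ~ 132 days ~ 4 months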
Source: Jure Leskovec's slides, CS246
Google M/R
Once the data is on 1,000 machines, how do you run an algorithm over 1,000 disk drives?
Traditional method: read the file into memory. You can't fit the web pages into memory, and moving the data would saturate the network.
Solution: Map Reduce. Move the code to the data, via mappers and reducers that are placed on the same computers as the data.
GFS paper / MapReduce paper. Hadoop = GFS + M/R.
Google GFS
The HTML/links/images were stored in BigTable. Store HTML pages in files, many pages per file. Why? Seeks are expensive; store the crawl sequentially.
2 parts. What is a file system? Superblock (SB) = a collection of inodes.
When you create and delete files you are adding/removing inodes from the superblock. When you add contents to a file and save it, like adding text in Word, you are adding data blocks to an inode.
R/W in a file system
Read the contents of foo.txt:
Go to the superblock, find the locations of the data blocks from the superblock entry for foo.txt, and read them into memory.
Write into foo.txt:
Go to the superblock, write the contents into new data blocks, and append the addresses of those data blocks to the superblock entry for foo.txt.
Distribute file system across servers
Superblock => GFS master => Hadoop NameNode
Inodes/data blocks => GFS chunkservers => Hadoop DataNodes
R/W in distributed file system
Read from HDFS foo.txt:
Go to the NameNode, find which data blocks hold the data, and read the data into memory on the client machine. What is the difference?
Write into HDFS foo.txt:
Go to the NameNode, find empty blocks; the NameNode tells the client to send the data to empty blocks on DataNodes, and the addresses of the new blocks are appended to the NameNode entry for foo.txt. What is the difference? The client and the network.
Hadoop HDFS: the NameNode holds the list of files in the system; the DataNodes hold the blocks with the file contents.
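A minimal client-side sketch of these two paths, using the standard org.apache.hadoop.fs.FileSystem API (the /demo/foo.txt path is invented for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml (NameNode address)
    FileSystem fs = FileSystem.get(conf);       // the client talks to the NameNode first

    // Write: the NameNode allocates blocks; the client streams data to DataNodes.
    try (FSDataOutputStream out = fs.create(new Path("/demo/foo.txt"))) {
      out.writeUTF("hello hdfs");
    }

    // Read: the NameNode returns block locations; the client reads from DataNodes.
    try (FSDataInputStream in = fs.open(new Path("/demo/foo.txt"))) {
      System.out.println(in.readUTF());
    }
    fs.close();
  }
}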
HDFS Demo
List of files
NN+DN web UI: http://<host>:50070/
Where is the DN? Its web UI is on port 50075.
Logs demo
Running in single-node pseudo-distributed (PD) mode: the services run as threads in one JVM vs. separate JVM processes for each service.
Global variables in mappers work in PD mode but not on a cluster, where mappers run in separate JVMs on different machines.
Install as services under /etc/init.d; do not download and install the tarball.
File R/W system issue
Cache/Disk Drives
If the power goes out before the data is written from memory to disk, the data is lost.
Write to Memory
Write to Disk
Failures
Commodity servers fail. One server at Google may stay up ~3 years (~1,000 days).
If you have 1,000 servers, expect to lose one per day.
With 1M machines, 1,000 machines fail every day!
Google servers stay up ~3 years vs. elsewhere failing once every 3 weeks? Why? 20 servers?
GFS/MapReduce papers: restart failed M/R tasks. Not in Hadoop.
Most system designs neglect failure, except Netflix's Chaos Monkey.
What is Hadoop?
An implementation of GFS/Map Reduce in Java. Used at Yahoo, LinkedIn, Facebook, Netflix, Twitter.
What did each contribute? Use cases?
Doug Cutting (Cloudera)/Lucene
v1.0 vs v2.0
Hadoop Components, HBase, Flume, Sqoop, Zookeeper, Oozie, Pig, Hive
HDFS
HDFS is a distributed file system: the Hadoop Distributed File System. Effectively unlimited capacity: to add capacity, add more nodes.
A file's superblock (metadata) info is stored on a NameNode server. The data blocks are stored on DataNode servers.
Replicate for data locality and error detection/recovery. Replicate each data block 3x. Why?
HDFS: an append-only file system (copying the Google paper).
HDFS
What is the file system on your laptop? Append-only or random R/W?
When is append-only bad? Digression: read-modify-write (RMW). Editing a Word document is which? Append-only or RMW?
Design exercise: 200 GB in files. How many files are there?
Does this (the metadata) fit in memory?
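A rough worked estimate, assuming the common rule of thumb that each file costs the NameNode on the order of 150 bytes of heap, and an invented 1 KB average file size:

200 GB / 1 KB per file = 2x10^8 files; 2x10^8 x 150 bytes ~ 30 GB of NameNode metadata

That does not fit in the memory of a circa-2000 commodity server, which is why HDFS favors a small number of large files.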
HDFS Design exercise
Many small files are combined into a smaller number of large files. How do you access the smaller files? Reads become slower.
If you RMW, the modifications go into new blocks in HDFS. Finding the new blocks and reading them into memory is slower than sequential access on a single-node file system.
It is faster to delete the old file and create a new file with sequential blocks in its place.
Solns
1) On every write to memory, also write to disk. Why good?
Why bad?
2) Accept losing the data when the power goes out. Why good?
Why bad?
fsck: File System Consistency Check
Agenda: Admin and Development
HDFS/MR administration; HBase, etc. are different.
24x7 SLAs; hot standbys for maintenance.
HDFS: recovery from user error ("restore the file I just deleted").
HDFS/MR: recovery from failures (not automated in Hadoop).
M/R: lagging mappers (stragglers), cascading failures
Development
Apache software development practices: Jenkins, JIRA tickets
Repos
HDFS Schemas
Do you store 20B files on HDFS by file name? What happens with multiple files with the same name, e.g. test.txt?
Create metadata and partitions.
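A hypothetical Hive-style partition layout (paths invented for illustration): rather than millions of identically named files, data is grouped under directories that encode metadata:

/logs/dt=2013-02-16/part-00000
/logs/dt=2013-02-17/part-00000
/logs/dt=2013-02-17/part-00001

A query for one day then reads only that directory instead of scanning everything.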
HDFS Schemas: Avro
Parquet: Dremel-style column store/encoding
Map Reduce Intro (1)
Map Reduce was designed around 2000, when commodity PCs had very little memory (~4 GB or less). These weren't enterprise-class servers.
That isn't the case today. Multi-CPU/multi-core 192 GB machines are much more reliable and have different use cases.
The M/R idiom is being replaced with non-MR systems.
What we don't cover: Google F1.
Map Reduce Intro
There are 3 parts to how Map Reduce works:
Mapper
Shuffle
Reducer
There are 3 parts to a Map Reduce program:
Mappers
Reducers
Driver
These 2 concepts aren't the same. People get these mixed up.
Map Reduce Part 1
A 1,000-node cluster: bring the code to the data to reduce network traffic.
Programming idiom
Divide the task into mappers.
Examples of what can be divided and combined. Try dividing first; assume you can combine anything you can divide.
Divide the input file into single lines, send one line to each server, process each line.
Word Count
I can count the words in a text file with a single program.
I can split the file across mappers and have the mappers count the words in parallel.
[Diagram: input file split into lines, each line fanned out to a mapper]
Word Count
The mappers output K/V pairs onto the network. These are not Java Strings or Java objects!
Keys: Comparable and Writable (WritableComparable); see the compound-key sketch below.
Values: Writable
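A minimal sketch of a custom compound key, assuming the standard org.apache.hadoop.io interfaces (the WordPair class is invented for illustration):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// A compound key: serializes itself (Writable) and defines sort order (Comparable).
public class WordPair implements WritableComparable<WordPair> {
  private String left = "";
  private String right = "";

  public void write(DataOutput out) throws IOException {    // how the key crosses the network
    out.writeUTF(left);
    out.writeUTF(right);
  }

  public void readFields(DataInput in) throws IOException { // how it is rebuilt on the reducer
    left = in.readUTF();
    right = in.readUTF();
  }

  public int compareTo(WordPair o) {                        // shuffle sort order
    int c = left.compareTo(o.left);
    return c != 0 ? c : right.compareTo(o.right);
  }

  @Override
  public int hashCode() {                                   // used by the default HashPartitioner
    return left.hashCode() * 31 + right.hashCode();
  }
}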
Network saturates with multiple M/R jobs.
[Diagram: mapper output K/V pairs shuffled across the network to the reducers]
Shuffle/Reduce Part 2/3
The K/V pairs are sent over the network to destinations chosen by these rules:
1) all pairs with the same key go to the same reducer
2) within each reducer, the keys arrive in sorted order
3) the output appears in 2 forms: a _SUCCESS marker and part-00000 files
Custom Partitioner: send a key to a specific reducer (see the sketch below).
Grouping comparator: controls which keys are grouped into one reduce call.
Sort comparator: can modify the sort order for compound keys.
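A minimal sketch of a custom Partitioner (rule 1 above), using the old org.apache.hadoop.mapred API that the word-count example below also uses; the class name and routing rule are invented for illustration:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Decides which reducer receives each key; the default HashPartitioner
// uses hash(key) % numPartitions.
public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
  public void configure(JobConf job) {}

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String s = key.toString();
    char first = s.isEmpty() ? ' ' : Character.toLowerCase(s.charAt(0));
    return first % numPartitions;   // all words with the same first letter -> same reducer
  }
}

It is wired in from the driver with conf.setPartitionerClass(FirstLetterPartitioner.class).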
Map Reduce Word Count
M/R Program/Mapper
// Old org.apache.hadoop.mapred API, as in the classic WordCount v2 example;
// caseSensitive, patternsToSkip, and the Counters enum are fields of this class,
// set up in configure() (omitted here).
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = caseSensitive ? value.toString() : value.toString().toLowerCase();
    for (String pattern : patternsToSkip) {
      line = line.replaceAll(pattern, "");        // strip the skip patterns
    }
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);                  // emit (word, 1)
      reporter.incrCounter(Counters.INPUT_WORDS, 1);
    }
  }
}
M/R Program Reducer
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();                 // add up the 1s for this word
    }
    output.collect(key, new IntWritable(sum));
  }
}
M/R Program Driver
public class WordCount extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), WordCount.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);   // combiner runs the reduce logic map-side
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
    return 0;
  }
}
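To run it (assuming the job jar has been built and the HDFS paths exist): hadoop jar wordcount.jar WordCount <input> <output>, typically dispatched through ToolRunner.run(new WordCount(), args) in main().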
AVRO/Protocol Buffers
Avro is a serialization format for M/R data.
Splittable, with a human-readable (JSON) schema. Not as small as Protocol Buffers.
Avro object
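A minimal sketch of building an Avro object with the generic API (the WordCount schema here is invented for illustration):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class AvroObjectDemo {
  public static void main(String[] args) {
    // Hypothetical schema for a word-count record (Avro schemas are JSON).
    String json = "{\"type\":\"record\",\"name\":\"WordCount\",\"fields\":["
        + "{\"name\":\"word\",\"type\":\"string\"},"
        + "{\"name\":\"count\",\"type\":\"int\"}]}";
    Schema schema = new Schema.Parser().parse(json);

    // A generic record: no code generation needed, fields accessed by name.
    GenericRecord rec = new GenericData.Record(schema);
    rec.put("word", "hadoop");
    rec.put("count", 42);
    System.out.println(rec);
  }
}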
Protocol Buffers
Used internally at Google; compact serialization. https://code.google.com/p/protobuf/
Protobufs are not just serialization; they are the closest to raw binary. Used internally in Hadoop.
Why do we need Avro/Protobufs?
Binary: no text parsing, fast, small. OK for objects; loosely like what Hibernate does for object persistence.
Thrift
Add a server to send/receive objects and do the serialization/deserialization
Map Reduce References
What can I do with each text line? Easy: ETL patterns:
Match patterns
Count the number of occurrences of tokens
Process files
Harder: machine learning / data mining. What can't be done easily? Iterative algorithms such as k-means clustering, which need a separate M/R job (and a re-read of the data) per iteration.
Ullman book, Mining Massive Datasets: http://infolab.stanford.edu/~ullman/mmds.html
Jimmy Lin book, Data-Intensive Text Processing with MapReduce: http://www.umiacs.umd.edu/~jimmylin/
MRv2
There are 2 versions of M/R:
v1: old API, import org.apache.hadoop.mapred, JobTracker/TaskTracker (JT/TT)
v2: new API, import org.apache.hadoop.mapreduce, ResourceManager/NodeManager/JobHistoryServer (RM/NM/JH)
YARN, in Hadoop 2.x, maintains backward compatibility with M/R v1. Devs should start shifting to Hadoop 2.x YARN to get new bug fixes.
YARN Daemons
hadoop-hdfs-datanode
hadoop-hdfs-namenode
hadoop-yarn-resourcemanager
hadoop-yarn-nodemanager
hadoop-yarn-proxyserver
hadoop-hdfs-secondarynamenode
hadoop-hdfs-journalnode
hadoop-hdfs-zkfc
hadoop-httpfs
hadoop-mapreduce-historyserver
YARN->Enterprise
Encrypted/Pluggable Shuffle/Sort
HttpFS rewrite, or the proxy server
V2 user authentication/permissions: Apache Sentry
Separate authorization policies per database/schema
Users have to customize for shared data structures (tables/metadata, HBase, Search, ZK). Not in any distro!
Schema metadata needs fine-grained authorization.
Web app proxy, part of the RM, to reduce attacks on the exposed RM web server.
Map Reduce Demo
Word Count demo
HDFS DataNode: http://localhost:50075/
HDFS NameNode: http://localhost:50070/
ResourceManager http://localhost:8088
JobHistory Server: http://jhs_host:19888
Logging mistakes: logging added to M/R jobs grows in proportion to the data size and the number of times the program runs. Per-record logging on a 1 TB file means ~1 TB of logs; processing 100 GB 10x is the same problem.
Logs fill up the disk and crash the system.
Zookeeper logs
M/R Pipelines
The successful organizations never write raw mappers/reducers directly; they use higher-level tools like Pig, Hive, etc.
Defn. workflow: a series of M/R jobs
Pipeline: the output of one M/R job is the input to another
Apache Crunch modeled after Google FlumeJava
Google FlumeJava
Introduction of data pipelines based on multiple M/R stages
Define a parallel collection with a set of parallel operations
Much easier to use than raw M/R programming. Contrast with UDFs. Fewer lines of source:
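For comparison, a minimal word-count sketch in the Crunch/FlumeJava style, assuming the Apache Crunch MRPipeline API (class name and paths are placeholders):

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;

public class CrunchWordCount {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(CrunchWordCount.class, new Configuration());
    PCollection<String> lines = pipeline.readTextFile(args[0]);
    PTable<String, Long> counts = lines.parallelDo(
        new DoFn<String, String>() {
          @Override
          public void process(String line, Emitter<String> emitter) {
            for (String word : line.split("\\s+")) {
              emitter.emit(word);          // one record per word
            }
          }
        }, Writables.strings())
      .count();                            // Crunch plans the shuffle/aggregation
    pipeline.writeTextFile(counts, args[1]);
    pipeline.done();                       // compiles the plan into M/R jobs and runs them
  }
}

No explicit mapper, reducer, or driver wiring: the planner decides where the M/R job boundaries fall.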
Apache Crunch
Not just M/R. Faster to specify a customizable data-processing pipeline with an API (MRPipelines) than to write Pig/Hive scripts.
Runs on YARN, the next version of M/R.
Supports Apache Spark: SparkPipelines.
Can keep data in memory instead of spilling to disk: MemPipelines.
Case Study of old systems
Older generation of Hadoop components: Hadoop, Pig, Hive.
Gives insight into the stability/capability of the products.
Hive at LinkedIn (bottom left of the diagram). All 3 architectures are similar.
Pig+DataFu
Hive bottom left corner
Teradata+Hadoop
Netflix, Block Diagram
Yahoo Block Diagram, Pig, Hive, Spark, Storm
Yahoo
Targeting Content, not Search
3k Pig jobs in production
Hive in small-scale use by analysts; Pig in heavy production use. Non-MR systems are in use now. Matches Google's progression.
Mapper Failures
What happens? Google's paper restarts failed tasks. NS
Hadoop's recovery isn't fully automatic.
Hadoop mapper/reducer worker failure: tasks that completed OK and tasks in progress are reset,
and rescheduled on another worker.
Speculative Execution
Run backup copies of slow (straggler) tasks and use whichever copy finishes first. (ADD FROM VIDEO)
Master failure: abort the job and return failure to the client.
M/R Runtime
Balancing cluster capacity: #mappers >> number of nodes
#reducers