A Start to Hadoop
By: Ayush Mittal, Krupa Varughese, Parag Sahu
Major Focus
• What is Hadoop?
• Why Hadoop?
• What is MapReduce?
• Phases in MapReduce
• What is HDFS?
• Setting up Hadoop in pseudo-distributed mode
• A simple pattern-matcher program
The Elementary Item: DATA
• What is data? In the computer era, data is the relevant information that flows from one machine to another. Types can be:
Structured: identifiable because of its organized structure. Example: a database (information stored in columns and rows).
Unstructured: has no predefined data model; generally text-heavy (may contain dates, numbers, logs); irregular and ambiguous.
What is the problem with unstructured DATA?
• Hides important insights
• Hard to store
• Hard to update
• Time-consuming to process
Big Data
• Consists of datasets that grow so large that they become awkward to work with using on-hand database management tools.
• Sizes run to terabytes, exabytes, and zettabytes.
• Difficulties include capture, storage, search, sharing, analytics, and visualization.
• Examples include web logs; RFID; sensor networks; social data (due to the Social Data Revolution); Internet search indexing; call detail records; astronomy, atmospheric science, genomics, and biogeochemical data; military surveillance; medical records; photography archives; video archives; and large-scale e-commerce.
Here comes Hadoop!
• The Apache Hadoop software library is a framework that allows for the “distributed processing of large data sets across clusters of computers using a simple programming model”.
• It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
History of Hadoop
• Hadoop was created by Doug Cutting, the creator of Apache Lucene.
• In 2004, the Nutch developers set about writing an open-source implementation, the Nutch Distributed Filesystem (NDFS).
• In 2004, Google introduced MapReduce to the world. Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all the major Nutch algorithms had been ported to run using MapReduce and NDFS.
• In 2006, Doug Cutting joined Yahoo!, which built its site index using a 10,000-core Hadoop cluster.
• Hadoop can now sort 1 terabyte of data in 62 seconds.
Hadoop Ecosystem
• Core: a set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).
• Avro: a data serialization system for efficient, cross-language RPC and persistent data storage. (At the time of this writing, Avro had only just been created as a new subproject, and no other Hadoop subprojects were using it yet.)
MapReduce
• MapReduce is a distributed data-processing model introduced by Google to support distributed computing on large datasets across clusters of computers.
• Hadoop can run MapReduce programs written in various languages.
HDFS: Reliable Storage
• Hadoop includes a fault-tolerant storage system called the Hadoop Distributed File System, or HDFS.
• HDFS can store huge amounts of information and scale up incrementally.
• It can survive the failure of significant parts of the storage infrastructure without losing data.
• Clusters can be built with inexpensive computers. If one fails, Hadoop continues to operate the cluster without losing data or interrupting work, by shifting work to the remaining machines in the cluster.
• HDFS manages storage on the cluster by breaking incoming files into pieces, called "blocks," and storing each of the blocks redundantly across the pool of servers.
• If the namenode fails, the whole filesystem becomes unavailable.
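The block-and-replica arithmetic above can be sketched in a few lines of Java. The block size and replication factor below are common defaults assumed for illustration, not values read from a real cluster:

```java
// Rough sketch of HDFS bookkeeping: a file is cut into fixed-size
// blocks, and each block is stored on several datanodes.
public class HdfsBlocks {

    // number of blocks a file occupies; a final partial block
    // still gets a block of its own (ceiling division)
    static long blocks(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileBytes = 200 * mb;  // a 200 MB file
        long blockBytes = 64 * mb;  // assumed default block size
        int replication = 3;        // assumed default replication factor
        long b = blocks(fileBytes, blockBytes);
        System.out.println(b + " blocks, " + (b * replication) + " replicas stored");
        // prints "4 blocks, 12 replicas stored"
    }
}
```

Losing one datanode therefore loses only one replica of each affected block; the namenode re-replicates from the surviving copies.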
HDFS example
Zookeeper
• Distributed consensus engine
• Provides well-defined concurrent access semantics:
• Distributed locking / mutual exclusion
• Message board / mailboxes
Pig
• Data-flow oriented language, "Pig Latin"
• Data types include sets, associative arrays, tuples
• High-level language for routing data; allows easy integration of Java for complex tasks
• Developed at Yahoo!
Hive
• SQL-based data warehousing app
• Feature set is similar to Pig's
• Language is more strictly SQL-esque
• Supports SELECT, JOIN, GROUP BY, etc.
HBase
• Column-store database
• Based on the design of Google's Bigtable
• Provides interactive access to information
• Holds extremely large datasets (multi-TB)
• Constrained access model: (key, value) lookup; limited transactions (single row only)
Chukwa
• A distributed data collection and analysis system. Chukwa runs collectors that store data in HDFS, and it uses MapReduce to produce reports.

Fuse-dfs
• Allows mounting of HDFS volumes via the Linux FUSE filesystem
• Does not imply HDFS can be used as a general-purpose filesystem
• Does allow easy integration with other systems for data import/export
Properties of the Hadoop System
• Accessible—Hadoop runs on large clusters of commodity machines or on cloud computing services
• Robust—Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
• Scalable—Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
• Simple—Hadoop allows users to quickly write efficient parallel code.
Comparing SQL Databases and Hadoop
• SCALE-OUT INSTEAD OF SCALE-UP: Scaling commercial relational databases is expensive. Their design favors scaling up: to run a bigger database you buy a bigger machine. In fact, it's not unusual to see server vendors market their expensive high-end machines as "database-class servers." Unfortunately, at some point there won't be a big enough machine available for the larger data sets.
• KEY/VALUE PAIRS INSTEAD OF RELATIONAL TABLES: A fundamental tenet of relational databases is that data resides in tables whose relational structure is defined by a schema. Although the relational model has great formal properties, many modern applications deal with data types that don't fit well into this model. Text documents, images, and XML files are popular examples.
• FUNCTIONAL PROGRAMMING (MAPREDUCE) INSTEAD OF DECLARATIVE QUERIES (SQL): SQL is fundamentally a high-level declarative language: you query data by stating the result you want and let the database engine figure out how to derive it. Under MapReduce you specify the actual steps in processing the data, which is more analogous to an execution plan for a SQL engine. Under SQL you have query statements; under MapReduce you have scripts and code.
• OFFLINE BATCH PROCESSING INSTEAD OF ONLINE TRANSACTIONS: Hadoop is designed for offline processing and analysis of large-scale data. It doesn't work for random reading and writing of a few records, which is the type of load for online transaction processing. In fact, as of this writing (and in the foreseeable future), Hadoop is best used as a write-once, read-many-times type of data store. In this aspect it's similar to data warehouses in the SQL world.
Building Blocks of Hadoop
• NameNode
• DataNode
• Secondary NameNode
• JobTracker
• TaskTracker
Namenode
Datanodes
MapReduce
• A MapReduce program processes data by manipulating (key/value) pairs in the general form:
• map: (K1, V1) → list(K2, V2)
• reduce: (K2, list(V2)) → list(K3, V3)
• With a combiner, the flow becomes:
• map: (K1, V1) → list(K2, V2)
• combine: (K2, list(V2)) → list(K2, V2)
• reduce: (K2, list(V2)) → list(K3, V3)
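The flow of these signatures can be imitated in plain, in-memory Java, with word count as the job. The class below is an illustrative sketch only; its names are not part of the Hadoop API:

```java
import java.util.*;

// A minimal in-memory sketch of the map -> shuffle -> reduce flow.
public class MiniMapReduce {

    // map: (K1, V1) -> list(K2, V2): here, a line -> (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split(" "))
            out.add(new AbstractMap.SimpleEntry<>(word, 1));
        return out;
    }

    // reduce: (K2, list(V2)) -> V3: here, (word, [1,1,...]) -> total count
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        String[] input = {"hadoop maps data", "hadoop reduces data"};
        // "shuffle": group the intermediate values by key, sorted by key
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input)
            for (Map.Entry<String, Integer> kv : map(line))
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            System.out.println(e.getKey() + "\t" + reduce(e.getValue()));
    }
}
```

Hadoop performs the same grouping step between the map and reduce phases, but distributed across machines and spilled to disk.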
MapReduce
Mapper
• The Mapper maps input key/value pairs to a set of intermediate key/value pairs.
• The MapReduce framework spawns one map task for each InputSplit (a logical division of your records, 64 MB by default and customizable) generated by the job's InputFormat.
• The mapping class needs to extend an abstract class called Mapper.
• map: (K1, V1) → list(K2, V2)
InputSplit
• Each map task processes a single split. Each split is divided into records, and the map processes each record, a key/value pair, in turn.
Mapper Example

static class FriendMapper extends
        Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // each input line: a person followed by that person's friends
        String data = value.toString();
        String[] friends = data.split(" ");
        // emit each direct friendship as a consistently ordered pair, marked 0
        String friendpair = friends[0];
        for (int i = 1; i < friends.length; i++) {
            if (friendpair.compareTo(friends[i]) > 0) {
                friendpair = friendpair + "," + friends[i];
            } else {
                friendpair = friends[i] + "," + friendpair;
            }
            context.write(new Text(friendpair), new IntWritable(0));
            friendpair = friends[0];
        }
        // emit every pair of names on the line, marked 1
        // (candidates for common-friend counting)
        for (int j = 0; j < friends.length; j++) {
            friendpair = friends[j];
            for (int i = j + 1; i < friends.length; i++) {
                if (friendpair.compareTo(friends[i]) > 0) {
                    friendpair = friendpair + "," + friends[i];
                } else {
                    friendpair = friends[i] + "," + friendpair;
                }
                context.write(new Text(friendpair), new IntWritable(1));
                friendpair = friends[j];
            }
        }
    }
}
Reducer
• The Reducer reduces a set of intermediate values which share a key to a smaller set of values.
• The Reducer has three primary phases: shuffle, sort, and reduce.
Shuffle: in this phase the framework fetches the relevant partition of the output of all the mappers.
Sort: the framework sorts Reducer inputs by key, lexicographically.
Reduce: the reduce(WritableComparable, Iterator, Context) method is called for each <key, (list of values)> pair in the grouped inputs.
Reducer Example

static class FriendReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        int count = 0;
        boolean mark = false;
        for (IntWritable v : values) {
            // a 0 value means the pair is already a direct friendship
            if (v.get() == 0) {
                mark = true;
            }
            count++;
        }
        // emit only pairs that are not already friends,
        // with their number of common friends
        if (!mark) {
            context.write(key, new IntWritable(count));
        }
    }
}
Combiner: Local Reduce
• Running combiners makes map output more compact, so there is less data to write to local disk and to transfer to the reducer.
• If a combiner is used, the map key/value pairs are not immediately written to the output. Instead they are collected in lists, one list per key. When a certain number of key/value pairs have been written, this buffer is flushed by passing all the values of each key to the combiner's reduce method and outputting the resulting key/value pairs as if they were created by the original map operation.
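What the combiner buys can be shown with a small word-count sketch: a mapper's local (word, 1) pairs collapse to one pair per distinct word before anything is spilled to disk or sent over the network. The names below are illustrative, not Hadoop API:

```java
import java.util.*;

// Sketch of a combiner acting as a local reduce over one
// mapper's buffered output.
public class CombinerDemo {

    // collapse a list of emitted keys into (key, local count) pairs
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> combined = new TreeMap<>();
        for (String key : mapOutputKeys)
            combined.merge(key, 1, Integer::sum);
        return combined;
    }

    public static void main(String[] args) {
        // one mapper emitted six (word, 1) pairs...
        List<String> emitted = Arrays.asList("to", "be", "or", "not", "to", "be");
        Map<String, Integer> combined = combine(emitted);
        // ...but only four pairs cross the network after combining
        System.out.println(combined.size() + " pairs instead of " + emitted.size());
        System.out.println(combined);
    }
}
```

This only works because word-count's reduce is associative and commutative; a combiner must produce output the reducer can still consume.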
A quick view of hadoop
Setting up Hadoop in Pseudo-Distributed Mode
• Prerequisites:
Linux with JDK 1.6 or later
Hadoop jar
• Configuration used here:
Linux RHEL 5.5 running in VMware Player
JDK 1.6
Log in as root
username: root
password: root@123

Add a group
$ groupadd hadoop

Create a user (hduser) and add it to the group
$ useradd -G hadoop hduser

Change the password of hduser
$ passwd hduser

Add hduser to the sudoers list; this gives hduser the privilege to run root commands via sudo
$ visudo
The file looks like this; scroll down to the root entry (root ALL=(ALL) ALL) and add a similar line for hduser.
Save the file with Esc, then :wq!
Run the ifconfig command to get the IP address, and add localhost to the hosts file
$ ifconfig
$ vi /etc/hosts
Change the user
$ su hduser
Generate a public/private SSH key pair; SSH is used by the Hadoop server to log into the system automatically (the standard command is ssh-keygen, e.g. $ ssh-keygen -t rsa -P "" for an empty passphrase)
When it asks where to save the key, save it in your home directory with the file name id_rsa ($HOME/.ssh/id_rsa, which is /home/hduser/.ssh/id_rsa here)
Add the public key to the authorized keys file
$ cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys
This command creates the file authorized_keys (if needed) and appends your public key to it.

Restrict the permissions of /home/hduser/.ssh and /home/hduser/.ssh/authorized_keys; this is done so that key-based login (with the empty passphrase) is used instead of the password
$ chmod 0700 /home/hduser/.ssh
$ chmod 0600 /home/hduser/.ssh/authorized_keys

Test whether SSH uses the key
$ ssh localhost
It should log in without asking for a password.
Extract the Hadoop tar in /usr/local
$ cd /usr/local
$ sudo tar xzf hadoop-0.20.2.tar.gz
$ sudo mv hadoop-0.20.2 hadoop        // rename the folder
$ sudo chown -R hduser hadoop         // grant ownership of the folder to hduser
Configure the environment variables
$ vi /home/hduser/.bashrc
Add the following to the file
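The original slide shows the additions only as a screenshot; a typical set of lines, assuming Hadoop lives in /usr/local/hadoop and that JAVA_HOME is adjusted to wherever your JDK 1.6 is actually installed, is:

```shell
# Illustrative .bashrc additions; the JAVA_HOME path is an assumed example
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export PATH=$PATH:$HADOOP_HOME/bin
```

Putting $HADOOP_HOME/bin on the PATH lets you type hadoop instead of the full /usr/local/hadoop/bin/hadoop in the commands that follow.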
Save the file with Esc, then :wq!
Configure the Hadoop server. Files to be edited:
hadoop-env.sh
core-site.xml
mapred-site.xml
hdfs-site.xml

Change to the conf directory
$ cd /usr/local/hadoop/conf

Edit the file hadoop-env.sh
$ vi hadoop-env.sh
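The edit shown in the original screenshot is typically a single line: uncommenting JAVA_HOME and pointing it at your JDK. The path below is an assumed example; use your actual JDK location:

```shell
# in hadoop-env.sh: point Hadoop at the JDK (illustrative path)
export JAVA_HOME=/usr/lib/jvm/java-6-sun
```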
Save each file with Esc, then :wq!
Create a temp dir for Hadoop
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
Edit the core-site.xml file
$ vi core-site.xml
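The slide's screenshot of the file contents is missing; a typical pseudo-distributed core-site.xml, reusing the /app/hadoop/tmp directory created above (the port number is a commonly used choice, not mandated), looks like:

```xml
<!-- Illustrative pseudo-distributed core-site.xml -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
```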
Edit the mapred-site.xml file
$ vi mapred-site.xml
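Again the screenshot is missing; a typical pseudo-distributed mapred-site.xml points the jobtracker at localhost (the port is a common choice, not required):

```xml
<!-- Illustrative pseudo-distributed mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
</configuration>
```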
Edit the hdfs-site.xml file
$ vi hdfs-site.xml
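For a single-machine pseudo cluster the usual hdfs-site.xml change is to drop the replication factor to 1, since there is only one datanode to hold replicas:

```xml
<!-- Illustrative pseudo-distributed hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```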
Format the namenode
$ /usr/local/hadoop/bin/hadoop namenode -format

Start the cluster
$ /usr/local/hadoop/bin/start-all.sh

Stop the cluster
$ /usr/local/hadoop/bin/stop-all.sh
Program to find a pattern in a text file.
import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PatternFinder {

    static class PatternMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] lineReadArray = line.split(" ");
            // the pattern reaches every task through the job configuration
            Configuration conf = context.getConfiguration();
            String pattern = conf.get("pattern");
            String pat = "[\\w]*" + pattern + "[\\w]*";
            // emit each word containing the pattern, with a count of 1
            for (String val : lineReadArray) {
                if (Pattern.matches(pat, val)) {
                    context.write(new Text(val), new IntWritable(1));
                }
            }
        }
    }

    static class PatternReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> value, Context context)
                throws IOException, InterruptedException {
            // sum the occurrences of each matching word
            int sum = 0;
            for (IntWritable i : value) {
                sum = sum + i.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("Usage: PatternFinder <input path> <output path> <pattern>");
            System.exit(-1);
        }
        Configuration conf = new Configuration();
        conf.set("pattern", args[2]);  // key must match the name read in the mapper
        Job job = new Job(conf);
        job.setJarByClass(PatternFinder.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(PatternMapper.class);
        job.setReducerClass(PatternReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Create a jar of the class file and store it in /home/krupa/inputjar/
Copy a test file to HDFS
Start the cluster if it is not already started
$ /usr/local/hadoop/bin/start-all.sh
Create an input directory
$ /usr/local/hadoop/bin/hadoop fs -mkdir input
Copy a file into the input directory
$ /usr/local/hadoop/bin/hadoop fs -put /home/krupa/abc.txt input

Run the jar (the program takes the pattern as its third argument)
$ $HADOOP_HOME/bin/hadoop jar /home/krupa/inputjar/patternfinder.jar PatternFinder input output <pattern>

Copy the output
$ $HADOOP_HOME/bin/hadoop fs -copyToLocal /user/krupa/output/part-r-00000 /home/krupa/output/out.txt