Hands-On Hadoop Tutorial
Chris Sosa
Wolfgang Richter
May 23, 2008
General Information
Hadoop uses HDFS, a distributed file system based on GFS, as its shared filesystem
The HDFS architecture divides files into large chunks (~64 MB) distributed across data servers
HDFS has a global namespace
General Information (cont’d)
A script is provided for your convenience
– Run source /localtmp/hadoop/setupVars from centurion064
– Changes all uses of {somePath}/command to just command
Go to http://www.cs.virginia.edu/~cbs6n/hadoop for web access. These slides and more information are also available there.
Once you use the DFS (put something in it), relative paths are from /usr/{your usr id}, e.g. if your id is tb28, your “home dir” is /usr/tb28
Master Node
Hadoop is currently configured with centurion064 as the master node
Master node
– Keeps track of the namespace and metadata about items
– Keeps track of MapReduce jobs in the system
Slave Nodes
Centurion064 also acts as a slave node
Slave nodes
– Manage blocks of data sent from the master node
– In terms of GFS, these are the chunkservers
Currently centurion060 is also a slave node
Hadoop Paths
Hadoop is locally “installed” on each machine
– Installed location is /localtmp/hadoop/hadoop-0.15.3
– Slave nodes store their data in /localtmp/hadoop/hadoop-dfs (this is automatically created by the DFS)
– /localtmp/hadoop is owned by group gbg (someone in this group, or a CS admin, must administer it)
Files are divided into 64 MB chunks (this is configurable)
Starting / Stopping Hadoop
For the purposes of this tutorial, we assume you have run the setupVars script from earlier
start-all.sh – starts all slave nodes and the master node
stop-all.sh – stops all slave nodes and the master node
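Assuming setupVars has been sourced, a typical session brackets your work with these two scripts; the sketch below shows the shape of such a session (it only works on the tutorial cluster itself, and the exact set of daemons started depends on your slaves file):

```shell
# Start the HDFS daemons (namenode, datanodes) and the MapReduce
# daemons (jobtracker, tasktrackers) on the master and every slave.
start-all.sh

# ... put files in the DFS, run MapReduce jobs ...

# Shut all daemons down again when you are done.
stop-all.sh
```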
Using HDFS (1/2)
hadoop dfs
– [-ls <path>]
– [-du <path>]
– [-cp <src> <dst>]
– [-rm <path>]
– [-put <localsrc> <dst>]
– [-copyFromLocal <localsrc> <dst>]
– [-moveFromLocal <localsrc> <dst>]
– [-get [-crc] <src> <localdst>]
– [-cat <src>]
– [-copyToLocal [-crc] <src> <localdst>]
– [-moveToLocal [-crc] <src> <localdst>]
– [-mkdir <path>]
– [-touchz <path>]
– [-test -[ezd] <path>]
– [-stat [format] <path>]
– [-help [cmd]]
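Putting a few of these together, a minimal session might look like the sketch below (the file name data.txt is made up for illustration; recall that relative paths resolve under /usr/{your usr id}):

```shell
# Copy a local file into the DFS (lands in /usr/<your-id>/data.txt)
hadoop dfs -put data.txt data.txt

# List your DFS "home dir" and inspect the file's contents
hadoop dfs -ls
hadoop dfs -cat data.txt

# Pull a copy back to the local filesystem, then remove it from the DFS
hadoop dfs -get data.txt /tmp/data.txt
hadoop dfs -rm data.txt
```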
Using HDFS (2/2)
Want to reformat?
Easy
– hadoop namenode -format
Basically we see most commands look similar
– hadoop “some command” options
– If you just type hadoop you get all possible commands (including undocumented ones – hooray)
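A full reformat cycle is sketched below, on the assumption that the daemons should be stopped first so no datanode is still serving the old namespace. Note that formatting destroys everything stored in the DFS:

```shell
stop-all.sh
hadoop namenode -format   # wipes the existing HDFS namespace and data
start-all.sh
```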
To Add Another Slave
This adds another data node / job execution site to the pool
– Hadoop dynamically uses the filesystem underneath it
– If more space is available on the HDD, HDFS will try to use it when it needs to
Modify the slaves file
– In centurion064:/localtmp/hadoop/hadoop-0.15.3/conf
– Copy the code installation dir to newMachine:/localtmp/hadoop/hadoop-0.15.3 (very small)
– Restart Hadoop
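For illustration, after adding a hypothetical new host centurion065, the slaves file would simply list one worker hostname per line:

```
# centurion064:/localtmp/hadoop/hadoop-0.15.3/conf/slaves
centurion064
centurion060
centurion065
```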
Configure Hadoop
Can configure in {$installation dir}/conf
– hadoop-default.xml for global settings
– hadoop-site.xml for site-specific settings (overrides global)
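As an example of a site-specific override, a hadoop-site.xml pointing the master daemons at centurion064 might look like the fragment below. The property names are the Hadoop 0.15-era ones; the port numbers and block-size value are illustrative, not taken from this cluster's actual configuration:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>        <!-- HDFS namenode (master) -->
    <value>centurion064:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>     <!-- MapReduce jobtracker -->
    <value>centurion064:9001</value>
  </property>
  <property>
    <name>dfs.block.size</name>         <!-- 64 MB chunks, the default -->
    <value>67108864</value>
  </property>
</configuration>
```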
That’s it for Configuration!That’s it for Configuration!
Real-time Access