02_Hadoop_Architecture_Exercise_4.0.0[1]

  • 7/25/2019 02_Hadoop_Architecture_Exercise_4.0.0[1]

    1/18

    IBM Software

    Hadoop Fundamentals

    Unit 2: Hadoop Architecture

    Copyright IBM Corporation, 2015

    US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

    Contents

    Lab 1 Hadoop Architecture
    1.1 Getting Started
    1.2 Login to the VM
    1.3 Start all services with the Ambari Web Console
    1.4 Basic HDFS interactions using the Command Line
    1.5 Summary

    Lab 1 Hadoop Architecture

    The overwhelming trend towards digital services, combined with cheap storage, has generated massive amounts of data that enterprises need to effectively gather, process, and analyze. Data analysis techniques from the data warehouse and high-performance computing communities are invaluable for many enterprises, but their cost or complexity of scale-up often discourages the accumulation of data without an immediate need. As valuable knowledge may nevertheless be buried in this data, related scaled-up technologies have been developed. Examples include Google's MapReduce and the open-source implementation, Apache Hadoop.

    Hadoop is an open-source project administered by the Apache Software Foundation. Hadoop's contributors work for some of the world's biggest technology companies. That diverse, motivated community has produced a collaborative platform for consolidating, combining and understanding data.

    After completing this hands-on lab, you'll be able to:

    Use Hadoop commands to explore HDFS on the Hadoop system

    Use the BigInsights Console to explore HDFS on the Hadoop system

    Allow 60 minutes to 90 minutes to complete this lab.

    This version of the lab was designed using the IBM BigInsights 4.0 Quick Start Edition (QSE). Throughout this lab you will be using the following account login information. The passwords for root and virtuser are assumed to be as follows. If you specified different passwords when you set up the Quick Start Edition image, substitute your own passwords throughout the exercise.

                             Username    Password
    VM image setup screen    root        password
    Linux                    virtuser    password (but you choose your own password)

    With the IBM BigInsights v4.0 QSE, you will be working with user virtuser.

    1.1 Getting Started

    First start your Quick Start image. If you are asked whether you moved your VMware image or copied it, select Copied It. The VMware image needs 10.5 GB of memory to work successfully, so you must run it in an environment that supports 64-bit VMs and has at least 12 GB of real, physical memory (some memory is required for the host operating system).

    When you connect, log in as root with the password password at the following screen to complete the setup required to run the virtual machine (VM) on your system.

    Further setup is required to get your VM working. Select English (USA) (or another language, if your keyboard is different), hit Tab, and then Enter.

    You must accept the license. Hit Enter twice.

    Choose a password for the user id that you will be working with (virtuser), e.g., password. You may select any other password, but you have to remember it for later use. Hit Tab and Enter. Enter the password a second time when requested, as confirmation.

    1.2 Login to the VM

    __1. The VM will now start a graphical window (GUI) and present you with the standard login id, i.e., virtuser. Enter the password that you chose.

    Once the password has been entered, you can use the mouse to select Log In.

    __2. You need to start the BigInsights components. With this VM, you have two options. One option is to use the icon that was placed on the desktop.

    But that icon is unique to this VM and may not be available to you on other systems or clusters in the future.

    __3. The other approach is to use the Ambari Web Console, which may already be started (you would see a web page in front of you). If you need to start the Ambari Web Console, use the Firefox icon on the top-left window border (indicated here by the arrow):

    The URL needed is http://localhost:8080 (or substitute your hostname for localhost).

    __4. The following screen will be shown. This is the Ambari Web Console, where you configure the IBM Open Platform for Apache Hadoop v4 software. The default user id and password for Ambari are admin / admin.

    1.3 Start all services with the Ambari Web Console

    __5. When you are signed in, you will see the following Dashboard page. If a service on the left-hand side shows a red triangle with an exclamation point, that service is not currently running; if it shows a green circle with a check mark, the service is running.

    __6. To start all services, click Actions at the bottom of the left-hand side, and then Start All:

    __7. Confirm with OK:

    __8. Once all components have started successfully as shown on the Ambari Web Console, you can minimize this web page as you will not need it further during this lab exercise.

    If you later close down this VM and restart it, you will need to Start All services again.

    1.4 Basic HDFS interactions using the Command Line

    The Hadoop Distributed File System (HDFS) allows user data to be organized in the form of files and directories. It provides a command line interface called FS shell that lets a user interact with the data in HDFS accessible to Hadoop MapReduce programs.

    There are two ways that you can interact with HDFS at the command line:

    1. hadoop fs command options

    2. hdfs dfs command options

    Here, command is the particular command (ls, rm, mkdir, ...) and options are variations on the particular command, which may be followed by a list of files or a list of directories. The command is preceded by a single dash (-), and the options may also be preceded by a single dash.
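
    The command shape described above can be sketched with a small helper. Note that fs_shell is a hypothetical function written for this illustration (it is not part of Hadoop); it only prints the command line it assembles, so it can be tried without a running cluster.

```shell
#!/bin/sh
# fs_shell is a hypothetical helper for illustration only; it is not a Hadoop tool.
# Usage: fs_shell "<front end>" <command> [options and paths ...]
# It prints the assembled command line instead of running it.
fs_shell() {
    front_end=$1      # "hadoop fs" or "hdfs dfs"
    command_name=$2   # ls, rm, mkdir, ... (given without the dash)
    shift 2
    # The command name gets exactly one leading dash; the remaining
    # arguments (options, files, directories) are passed through unchanged.
    printf '%s -%s %s\n' "$front_end" "$command_name" "$*"
}

fs_shell "hadoop fs" ls /
fs_shell "hdfs dfs" ls -R .
fs_shell "hadoop fs" mkdir test
```

    Both front ends take the same dash-prefixed command, which is why the rest of this lab can use hadoop fs throughout.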

    __9. Open a terminal window by right-clicking on your desktop and selecting Open in Terminal. This leaves you at /home/virtuser/Desktop. Type cd and press Enter to move to your home directory in Linux, /home/virtuser (designated by ~ in the prompt):

    __10. Start with the ls command to list files and directories. In your terminal window, type the following three commands and hit Enter after each. Pause after each to review your results.

    hadoop fs -ls

    hadoop fs -ls .

    hadoop fs -ls /

    The first of these lists the files in the current directory (for HDFS, your home directory /user/virtuser); there are none. The second is a little more explicit since it asks for files in dot (.), a synonym for here (again the current directory). The third lists files at the root level within HDFS (and there are eight directories).

    __11. Look at the directory /user; this is where all home directories are kept for HDFS. The equivalent for Linux is /home. Note the spelling /user, as this distinguishes this directory from the /usr directory in Linux that is used for executable binary programs.

    hadoop fs -ls /user

    __12. Create a directory called test in your home directory:

    hadoop fs -mkdir test

    Check the contents of your home directory before and after the command to see that it is created. You can do this by listing the contents of the home directory simply (hadoop fs -ls) or relative to the root directory of HDFS (hadoop fs -ls /user/virtuser).
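
    Relative HDFS paths such as test are resolved against your HDFS home directory, which is why both forms of the listing show the same contents. The following is a minimal sketch of that rule, assuming home directories live under /user/<username>. Note that hdfs_abs is a hypothetical helper, not a Hadoop command, and it handles only the simple cases used in this lab.

```shell
#!/bin/sh
# hdfs_abs is a hypothetical helper for illustration; it is not part of Hadoop.
# It mimics how a relative HDFS path maps to an absolute one, assuming
# home directories live under /user/<username>.
# Usage: hdfs_abs <path> <username>
hdfs_abs() {
    path=$1
    user=$2
    case $path in
        /*)    printf '%s\n' "$path" ;;                    # already absolute
        ""|.)  printf '/user/%s\n' "$user" ;;              # no path or dot: the home directory
        *)     printf '/user/%s/%s\n' "$user" "$path" ;;   # relative to the home directory
    esac
}

hdfs_abs test virtuser
hdfs_abs . virtuser
hdfs_abs /user virtuser
```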

    __13. Create a file in your Linux home directory for virtuser (i.e., /home/virtuser). From a command line, execute the following commands. (Ctrl-c means to press and hold the Ctrl key and then press the c key.) Press Enter after the last line of text so that it is written to the file before you press Ctrl-c.

    cd ~

    cat > myfile.txt

    this is some data

    in my file

    Ctrl-c
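
    If you prefer not to type the file interactively, a here-document produces the same two-line file in one step. This is an alternative sketch rather than part of the original exercise. It writes to a temporary directory (the variable name work_dir is chosen here for illustration) so that it does not overwrite anything in your home directory.

```shell
#!/bin/sh
# Create the same two-line file without interactive input.
# work_dir is a scratch location chosen for this sketch.
work_dir=$(mktemp -d)

cat > "$work_dir/myfile.txt" <<'EOF'
this is some data
in my file
EOF

# Show the result and count its lines (tr strips padding that some wc versions add).
cat "$work_dir/myfile.txt"
line_count=$(wc -l < "$work_dir/myfile.txt" | tr -d ' ')
echo "lines: $line_count"
```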

    __14. Next upload this newly created file to the test directory that you just created.

    hadoop fs -put *.txt test
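
    When matching files exist, *.txt is expanded by your local Linux shell before Hadoop sees the command, so every .txt file in the current local directory is uploaded. Prefixing a command with echo shows what would actually be passed along. The sketch below does this in a scratch directory (scratch_dir is a name chosen for this illustration) and does not contact HDFS.

```shell
#!/bin/sh
# Demonstrate local glob expansion in a scratch directory.
scratch_dir=$(mktemp -d)
printf 'this is some data\nin my file\n' > "$scratch_dir/myfile.txt"

# echo prints the command line after the shell has expanded *.txt;
# nothing is sent to HDFS. The cd happens only inside the subshell.
expanded=$(cd "$scratch_dir" && echo hadoop fs -put *.txt test)
echo "$expanded"
```

    If your home directory holds other .txt files, name the file explicitly (hadoop fs -put myfile.txt test) to upload only that one.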

    __15. Now list your test directory in HDFS. You can use either of the following commands:

    hadoop fs -ls test

    hadoop fs -ls -R .

    Note the number 3 that follows the permissions. This is the replication factor for that data file. Normally, in a cluster this is 3, but in a single-node cluster such as the one that you are running, there might be only one copy of each block (split) of this file.

    The value 3 (or 1, or something else) is the result of an HDFS configuration setting that sets the default number of replicas.
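
    Because the replication factor is the second whitespace-separated column of the -ls output, it can be pulled out with awk. The listing line below is a made-up sample in the usual -ls layout (the owner, group, size, and timestamp are placeholders, not output captured from this VM). On a live system you would pipe hadoop fs -ls test into the same awk command. The default itself comes from the HDFS property dfs.replication.

```shell
#!/bin/sh
# sample_line is illustrative only; the real values on your system will differ.
sample_line='-rw-r--r--   3 virtuser hdfs         29 2015-06-01 10:15 test/myfile.txt'

# Column 2 is the replication factor; the last column is the path.
replication=$(printf '%s\n' "$sample_line" | awk '{print $2}')
path=$(printf '%s\n' "$sample_line" | awk '{print $NF}')
echo "$path has replication factor $replication"
```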

    __16. To view the contents of the uploaded file, execute

    hadoop fs -cat test/myfile.txt

    __17. You can pipe (using the | character) the output of any HDFS command to any Linux command in the Linux shell. For example, you can easily use grep with HDFS by doing the following.

    hadoop fs -cat test/myfile.txt | grep my

    Or,

    hadoop fs -ls -R . | grep test
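
    Given the two lines stored in myfile.txt in step 13, the first pipeline keeps only the line that contains my. The sketch below reproduces that filtering locally, with printf standing in for hadoop fs -cat, so it can be checked without HDFS.

```shell
#!/bin/sh
# printf stands in for "hadoop fs -cat test/myfile.txt" in this local sketch.
matches=$(printf 'this is some data\nin my file\n' | grep my)
echo "$matches"
# prints: in my file
```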

    __18. To find the size of a particular file, like myfile.txt, execute the following:

    hadoop fs -du /user/virtuser/test/myfile.txt

    __19. Or get the size of all files in a directory by using a directory name rather than a file name:

    hadoop fs -du /user/virtuser

    __20. Or get a total file size value for all files in a directory:

    hadoop fs -du -s /user/virtuser

    __21. Remember that you can always use the -help parameter to get more help:

    hadoop fs -help

    hadoop fs -help du

    You can close the command line window.

    1.5 Summary

    Congratulations! You are now familiar with the Hadoop Distributed File System (HDFS). You now know how to manipulate files within HDFS by using the command line.

    Remember that the commands that have been illustrated here with hadoop fs can also be executed with hdfs dfs or other combinations of hadoop | hdfs and fs | dfs.

    You may move on to the next unit.

    Copyright IBM Corporation 2015.

    The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, these materials. Nothing contained in these materials is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. References in these materials to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. This information is based on current IBM product plans and strategy, which are subject to change by IBM without notice. Product release dates and/or capabilities referenced in these materials may change at any time at IBM's sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way.

    IBM, the IBM logo and ibm.com are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.