IBM Software

Big SQL on Hadoop
Working with Big SQL data

Exercise 2

    © Copyright IBM Corporation, 2013

    US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

    Contents

    Working with Big SQL data
      1.1 Getting Started
      1.2 Creating Big SQL schema and tables
      1.3 Loading data into Big SQL tables
      1.4 Querying Big SQL data from tables and views
      1.5 Using additional load operations
      1.6 Working with partitioned tables
    Summary


    Working with Big SQL data

    This exercise shows you how to create Big SQL schemas and tables and then load data into those tables using load operations. It covers writing SQL statements to query the data in tables and views, and then moves on to additional bulk load operations and partitioned tables.

    After completing this hands-on lab, you should be able to:

    - Create Big SQL schemas and tables

    - Load data into Big SQL tables

    - Query data in Big SQL tables and views

    - Use additional load operations

    - Work with partitioned tables

    Allow 60 minutes to complete this section of the lab.

    Throughout this lab you will be using the following account login information:

    When to use                                                              Username    Password
    Log in from the command line to accept the licenses                      root        password
    Log in from the Linux SUSE Desktop to access the BigInsights Desktop     biadmin     biadmin


    1.1 Getting Started

    If you have not already completed Exercise 1, you should do so to prepare for this lab. The following section lists the steps to start up BigInsights.

    __1. Start the VMware image by clicking the Play virtual machine button in the VMware Player if it is not already on.

    __2. Choose the first option to load up the image.


    __3. You'll need to log in to the image initially. Use the credentials listed at the front of this document:


    __4. Go through the VM setup screens. When you get to the screen that asks to input your passwords, use the same passwords as listed at the beginning of this document.

    __5. Log in to the VMware virtual machine using the following credentials.

    Username: biadmin

    Password: biadmin


    __6. After you log in, your screen should look similar to the one below.

    There are two ways to start up BigInsights: through the terminal or by double-clicking an icon. Both methods are shown in the following steps.

    __7. Open the BigInsights Shell folder by double-clicking its icon on the desktop.

    __8. Double-click the Terminal icon.

    __9. Once the terminal has opened, change to the $BIGINSIGHTS_HOME/bin directory ($BIGINSIGHTS_HOME is /opt/ibm/biginsights by default) by issuing one of the following commands:


    cd $BIGINSIGHTS_HOME/bin

    or

    cd /opt/ibm/biginsights/bin

    __10. Go ahead and start the BigInsights environment. Note that the components will take a few minutes to start.

    ./start-all.sh

    __11. If you would like to stop all components, execute the command below. However, for this lab, leave all components started.


    ./stop-all.sh

    Next, let us look at how you would start all the components by double-clicking an icon.

    __12. Double-clicking the Start BigInsights icon executes a script that performs the above-mentioned steps. Once all components are started, the terminal exits and you are set. Simple.

    __13. You can stop the components in a similar manner, by double-clicking the Stop BigInsights icon.

    Now that all components are started, you may move on to the next section.

    Note: Occasionally, you may need to suspend your lab image and resume your work at another time. Doing so may disrupt the BigInsights instance, leaving some components not functioning properly. If you resume a lab image and things do not work properly, restart the BigInsights instance.


    1.2 Creating Big SQL schema and tables

    __1. Start a new terminal.

    __2. Start JSqsh. Type in:

    $JSQSH_HOME/bin/jsqsh bigsql

    __3. Create a schema that you will use for the rest of this lab. Type in:

    use mybigsql;

    This will create the mybigsql schema and set it as the default schema.

    Subsequent examples in this section presume your sample data is in the /opt/ibm/biginsights/bigsql/samples/data directory. This is the location of the data on the BigInsights VMware image, and it is the default location in typical BigInsights installations.

    Furthermore, the /opt/ibm/biginsights/bigsql/samples/queries directory contains SQL scripts that include the CREATE TABLE, LOAD, and SELECT statements used in this lab, as well as other statements.

    This tutorial uses sales data from a fictional company that sells and distributes outdoor products to third-party retailer stores as well as directly to consumers through its online store. It maintains its data in a series of FACT and DIMENSION tables, as is common in relational data warehouse environments. In this lab, you will explore how to create, populate, and query a subset of this star schema database to investigate the company's performance and offerings. Note that BigInsights provides scripts to create and populate the more than 60 tables that comprise the sample GOSALESDW database. You will use fewer than 10 of these tables in this lab.

    __4. In the JSqsh console, copy and paste this table definition to create the dimension table for the region info:

    CREATE HADOOP TABLE IF NOT EXISTS go_region_dim
    ( country_key INT NOT NULL,
      country_code INT NOT NULL,
      flag_image VARCHAR(45),
      iso_three_letter_code VARCHAR(9) NOT NULL,
      iso_two_letter_code VARCHAR(6) NOT NULL,
      iso_three_digit_code VARCHAR(9) NOT NULL,
      region_key INT NOT NULL,
      region_code INT NOT NULL,
      region_en VARCHAR(90) NOT NULL,
      country_en VARCHAR(90) NOT NULL,
      region_de VARCHAR(90), country_de VARCHAR(90), region_fr VARCHAR(90),
      country_fr VARCHAR(90), region_ja VARCHAR(90), country_ja VARCHAR(90),
      region_cs VARCHAR(90), country_cs VARCHAR(90), region_da VARCHAR(90),
      country_da VARCHAR(90), region_el VARCHAR(90), country_el VARCHAR(90),
      region_es VARCHAR(90), country_es VARCHAR(90), region_fi VARCHAR(90),
      country_fi VARCHAR(90), region_hu VARCHAR(90), country_hu VARCHAR(90),
      region_id VARCHAR(90), country_id VARCHAR(90), region_it VARCHAR(90),
      country_it VARCHAR(90), region_ko VARCHAR(90), country_ko VARCHAR(90),
      region_ms VARCHAR(90), country_ms VARCHAR(90), region_nl VARCHAR(90),
      country_nl VARCHAR(90), region_no VARCHAR(90), country_no VARCHAR(90),
      region_pl VARCHAR(90), country_pl VARCHAR(90), region_pt VARCHAR(90),
      country_pt VARCHAR(90), region_ru VARCHAR(90), country_ru VARCHAR(90),
      region_sc VARCHAR(90), country_sc VARCHAR(90), region_sv VARCHAR(90),
      country_sv VARCHAR(90), region_tc VARCHAR(90), country_tc VARCHAR(90),
      region_th VARCHAR(90), country_th VARCHAR(90)
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;

    Before we proceed further, let's look at the statement. Notice the HADOOP keyword, which creates a Hadoop table. You need to specify this keyword in order to create tables for the Hadoop environment. You can change this behavior by enabling SYSHADOOP.COMPATIBILITY_MODE.

    Also notice that you did not explicitly specify the table's schema. The table will be created in the default schema, which we have set to mybigsql.

    If you had not executed the USE command above, the login name (bigsql) would have been used as the default schema.

    If you had qualified the table name in the definition, that schema would be used instead of the default schema.

    The table's data will be row-format delimited, with fields terminated by tabs (\t) and lines terminated by newlines (\n). The data will be stored in TEXTFILE format, making it easy for a wide range of applications to work with.
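    To illustrate those last two points, here is a small sketch showing an explicitly qualified table definition and the compatibility-mode setting. The demo table names are hypothetical, and the exact session-variable syntax is an assumption to verify against the Big SQL documentation for your release:

    -- Qualify the table name explicitly instead of relying on USE
    CREATE HADOOP TABLE IF NOT EXISTS mybigsql.demo_qualified (c1 INT);

    -- Assumed syntax: with compatibility mode enabled, a plain CREATE TABLE
    -- statement is treated as CREATE HADOOP TABLE
    SET SYSHADOOP.COMPATIBILITY_MODE = 1;
    CREATE TABLE IF NOT EXISTS mybigsql.demo_compat (c1 INT);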

    __5. Launch the BigInsights web console.

    __6. Log in with the username bigsql and the password bigsql.

    __7. Go to the Files tab to check out the table definition.

    __8. Drill down to biginsights-->hive-->warehouse-->mybigsql.db to see the go_region_dim table that you created.

    __9. Go back to the JSqsh shell and copy and paste this next statement to create the dimension table for the order tracking method of each sale:

    CREATE HADOOP TABLE IF NOT EXISTS sls_order_method_dim
    ( order_method_key INT NOT NULL,
      order_method_code INT NOT NULL,
      order_method_en VARCHAR(90) NOT NULL,
      order_method_de VARCHAR(90), order_method_fr VARCHAR(90),
      order_method_ja VARCHAR(90), order_method_cs VARCHAR(90),
      order_method_da VARCHAR(90), order_method_el VARCHAR(90),
      order_method_es VARCHAR(90), order_method_fi VARCHAR(90),
      order_method_hu VARCHAR(90), order_method_id VARCHAR(90),
      order_method_it VARCHAR(90), order_method_ko VARCHAR(90),
      order_method_ms VARCHAR(90), order_method_nl VARCHAR(90),
      order_method_no VARCHAR(90), order_method_pl VARCHAR(90),
      order_method_pt VARCHAR(90), order_method_ru VARCHAR(90),
      order_method_sc VARCHAR(90), order_method_sv VARCHAR(90),
      order_method_tc VARCHAR(90), order_method_th VARCHAR(90)
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;

    __10. Create the lookup table with product brand info in various languages. Copy and paste this:

    CREATE HADOOP TABLE IF NOT EXISTS sls_product_brand_lookup
    ( product_brand_code INT NOT NULL,
      product_brand_en VARCHAR(90) NOT NULL,
      product_brand_de VARCHAR(90), product_brand_fr VARCHAR(90),
      product_brand_ja VARCHAR(90), product_brand_cs VARCHAR(90),
      product_brand_da VARCHAR(90), product_brand_el VARCHAR(90),
      product_brand_es VARCHAR(90), product_brand_fi VARCHAR(90),
      product_brand_hu VARCHAR(90), product_brand_id VARCHAR(90),
      product_brand_it VARCHAR(90), product_brand_ko VARCHAR(90),
      product_brand_ms VARCHAR(90), product_brand_nl VARCHAR(90),
      product_brand_no VARCHAR(90), product_brand_pl VARCHAR(90),
      product_brand_pt VARCHAR(90), product_brand_ru VARCHAR(90),
      product_brand_sc VARCHAR(90), product_brand_sv VARCHAR(90),
      product_brand_tc VARCHAR(90), product_brand_th VARCHAR(90)
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;

    __11. Create the product dimension table. Copy and paste this:

    CREATE HADOOP TABLE IF NOT EXISTS sls_product_dim
    ( product_key INT NOT NULL,
      product_line_code INT NOT NULL,
      product_type_key INT NOT NULL,
      product_type_code INT NOT NULL,
      product_number INT NOT NULL,
      base_product_key INT NOT NULL,
      base_product_number INT NOT NULL,
      product_color_code INT,
      product_size_code INT,
      product_brand_key INT NOT NULL,
      product_brand_code INT NOT NULL,
      product_image VARCHAR(60),
      introduction_date TIMESTAMP,
      discontinued_date TIMESTAMP
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;

    __12. Create the lookup table with product line info in various languages. Copy and paste:

    CREATE HADOOP TABLE IF NOT EXISTS sls_product_line_lookup
    ( product_line_code INT NOT NULL,
      product_line_en VARCHAR(90) NOT NULL,
      product_line_de VARCHAR(90), product_line_fr VARCHAR(90),
      product_line_ja VARCHAR(90), product_line_cs VARCHAR(90),
      product_line_da VARCHAR(90), product_line_el VARCHAR(90),
      product_line_es VARCHAR(90), product_line_fi VARCHAR(90),
      product_line_hu VARCHAR(90), product_line_id VARCHAR(90),
      product_line_it VARCHAR(90), product_line_ko VARCHAR(90),
      product_line_ms VARCHAR(90), product_line_nl VARCHAR(90),
      product_line_no VARCHAR(90), product_line_pl VARCHAR(90),
      product_line_pt VARCHAR(90), product_line_ru VARCHAR(90),
      product_line_sc VARCHAR(90), product_line_sv VARCHAR(90),
      product_line_tc VARCHAR(90), product_line_th VARCHAR(90)
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;

    __13. Create the product lookup table. Copy and paste:

    CREATE HADOOP TABLE IF NOT EXISTS sls_product_lookup
    ( product_number INT NOT NULL,
      product_language VARCHAR(30) NOT NULL,
      product_name VARCHAR(150) NOT NULL,
      product_description VARCHAR(765)
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;


    __14. Create the fact table for sales. Copy and paste:

    CREATE HADOOP TABLE IF NOT EXISTS sls_sales_fact
    ( order_day_key INT NOT NULL,
      organization_key INT NOT NULL,
      employee_key INT NOT NULL,
      retailer_key INT NOT NULL,
      retailer_site_key INT NOT NULL,
      product_key INT NOT NULL,
      promotion_key INT NOT NULL,
      order_method_key INT NOT NULL,
      sales_order_key INT NOT NULL,
      ship_day_key INT NOT NULL,
      close_day_key INT NOT NULL,
      quantity INT,
      unit_cost DOUBLE,
      unit_price DOUBLE,
      unit_sale_price DOUBLE,
      gross_margin DOUBLE,
      sale_total DOUBLE,
      gross_profit DOUBLE
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;


    __15. Create the fact table for marketing promotions. Copy and paste:

    CREATE HADOOP TABLE IF NOT EXISTS mrk_promotion_fact
    ( organization_key INT NOT NULL,
      order_day_key INT NOT NULL,
      rtl_country_key INT NOT NULL,
      employee_key INT NOT NULL,
      retailer_key INT NOT NULL,
      product_key INT NOT NULL,
      promotion_key INT NOT NULL,
      sales_order_key INT NOT NULL,
      quantity SMALLINT,
      unit_cost DOUBLE,
      unit_price DOUBLE,
      unit_sale_price DOUBLE,
      gross_margin DOUBLE,
      sale_total DOUBLE,
      gross_profit DOUBLE
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;

    __16. You have created eight tables. Go to the Files tab in the web console and check that eight directories have been created under biginsights-->hive-->warehouse-->mybigsql.db.


    1.3 Loading data into Big SQL tables

    __1. Next, you will load data into the tables using eight LOAD statements. This time, you will use Eclipse for the remaining parts of the exercises. Go ahead and launch Eclipse.

    __2. Accept the default workspace.

    __3. Select the My Big SQL Connection.

    __4. You should have a project myBigSQL and the script aFirstFile.sql from Exercise 1. Go back and do Exercise 1 if you have not completed it.

    __5. Load data into each of these tables using the sample data provided in files. One at a time, issue each of the following LOAD statements and verify that each completed successfully. Remember to change the file path shown (if needed) to the appropriate path for your environment.

    Reminder: You can use F5 to run the highlighted query that is part of the script. You may choose to copy and paste all eight of these LOAD statements and then highlight each one individually and press F5 to run it. You may also run multiple statements at once: just highlight them all and press F5.

    Eclipse may flag certain statements as errors. You can ignore these and continue with the lab.

    Note: Warning messages will be returned by each of these statements providing details on the number of rows loaded, etc.

    These statements will take a while to run.

    Note that each of these load statements includes the table schema as part of the command.

    load hadoop using file url 'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE MYBIGSQL.GO_REGION_DIM overwrite;

    load hadoop using file url 'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_ORDER_METHOD_DIM.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE MYBIGSQL.SLS_ORDER_METHOD_DIM overwrite;

    load hadoop using file url 'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_BRAND_LOOKUP.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE MYBIGSQL.SLS_PRODUCT_BRAND_LOOKUP overwrite;


    load hadoop using file url 'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_DIM.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE MYBIGSQL.SLS_PRODUCT_DIM overwrite;

    load hadoop using file url 'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_LINE_LOOKUP.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE MYBIGSQL.SLS_PRODUCT_LINE_LOOKUP overwrite;

    load hadoop using file url 'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_LOOKUP.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE MYBIGSQL.SLS_PRODUCT_LOOKUP overwrite;

    load hadoop using file url 'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_SALES_FACT.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE MYBIGSQL.SLS_SALES_FACT overwrite;

    load hadoop using file url 'file:///opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.MRK_PROMOTION_FACT.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE MYBIGSQL.MRK_PROMOTION_FACT overwrite;
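    All eight statements above read from the local file system through a file:// URL. If your source files already reside in the distributed file system, the same LOAD syntax can point there instead. The path below is hypothetical, and the exact URL form is an assumption to check against your Big SQL documentation:

    load hadoop using file url '/user/biadmin/GOSALESDW.GO_REGION_DIM.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE MYBIGSQL.GO_REGION_DIM overwrite;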

    __6. Once the tables have been loaded with the sample data, use the following statements to verify that the load was successful:

    -- total rows in GO_REGION_DIM = 21

    select count(*) from MYBIGSQL.GO_REGION_DIM;

    -- total rows in sls_order_method_dim = 7

    select count(*) from MYBIGSQL.sls_order_method_dim;

    -- total rows in SLS_PRODUCT_BRAND_LOOKUP = 28

    select count(*) from MYBIGSQL.SLS_PRODUCT_BRAND_LOOKUP;

    -- total rows in SLS_PRODUCT_DIM = 274

    select count(*) from MYBIGSQL.SLS_PRODUCT_DIM;


    -- total rows in SLS_PRODUCT_LINE_LOOKUP = 5

    select count(*) from MYBIGSQL.SLS_PRODUCT_LINE_LOOKUP;

    -- total rows in SLS_PRODUCT_LOOKUP = 6302

    select count(*) from MYBIGSQL.SLS_PRODUCT_LOOKUP;

    -- total rows in SLS_SALES_FACT = 446023

    select count(*) from MYBIGSQL.SLS_SALES_FACT;

    -- total rows in MRK_PROMOTION_FACT = 11034

    select count(*) from MYBIGSQL.MRK_PROMOTION_FACT;
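    If you prefer a single result set over eight separate queries, the counts can be combined with UNION ALL. This optional sketch is not part of the original lab script; extend the pattern to cover the remaining tables:

    select 'GO_REGION_DIM' as table_name, count(*) as row_count from MYBIGSQL.GO_REGION_DIM
    union all
    select 'SLS_SALES_FACT', count(*) from MYBIGSQL.SLS_SALES_FACT;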


    1.4 Querying Big SQL data from tables and views

    In this section, you will further explore the tables that you have created and loaded with data. You have already seen how to write simple queries. Now you will work with some examples that are more sophisticated.

    You will create and run Big SQL queries that join data from multiple tables as well as perform aggregations and other SQL operations. Note that the queries included in this section are based on queries shipped with BigInsights as samples.

    You will be using Eclipse for this section as well. You can still choose JSqsh or the web console if you prefer, but some of these statements return hundreds of thousands of rows. Eclipse limits the results to 500 rows; you can change that value in the Data Management preferences.

    __ 1. Join data from multiple tables to return the product name, quantity and order method of goods that have been sold. To do so, execute the following query:

    use MYBIGSQL;

    SELECT pnumb.product_name, sales.quantity, meth.order_method_en
    FROM sls_sales_fact sales, sls_product_dim prod,
         sls_product_lookup pnumb, sls_order_method_dim meth
    WHERE pnumb.product_language='EN'
      AND sales.product_key=prod.product_key
      AND prod.product_number=pnumb.product_number
      AND meth.order_method_key=sales.order_method_key;

    Let's take a moment to see what the query does.

    The query will be working within the MYBIGSQL schema.

    The query selects from four different tables as referenced in the FROM clause.

    The predicates in the WHERE clause filter the data from these tables. Three of them are equi-joins that match rows across the tables named in the FROM clause.

    The predicate product_language='EN' also limits the results to English output.

    The SELECT clause uses aliases to improve the readability of the query. For example, pnumb refers to the sls_product_lookup table, as declared in the FROM clause.

    Three columns are being selected from across four tables.
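    For comparison, the same query can be written with explicit JOIN ... ON syntax, which some readers find easier to follow. This equivalent form is a sketch, not part of the original lab script:

    SELECT pnumb.product_name, sales.quantity, meth.order_method_en
    FROM sls_sales_fact sales
    JOIN sls_product_dim prod ON sales.product_key = prod.product_key
    JOIN sls_product_lookup pnumb ON prod.product_number = pnumb.product_number
    JOIN sls_order_method_dim meth ON meth.order_method_key = sales.order_method_key
    WHERE pnumb.product_language = 'EN';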

    Important: For all of the queries, you will need to specify the MYBIGSQL schema, since the tables were created there. You can either execute the USE MYBIGSQL command or explicitly qualify each table name. The queries provided here require one of those two methods to work.

    __2. The query returns 500 results, but there are likely more rows than this. Eclipse limits the number of rows returned to 500, but you can change this setting if you wish. Take a moment to inspect the results to see how the products were sold and in what quantity.
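    If you want to cap the result size in the query itself rather than relying on the Eclipse setting, you can append the FETCH FIRST clause, which this lab uses again later. A minimal sketch:

    SELECT pnumb.product_name, sales.quantity, meth.order_method_en
    FROM sls_sales_fact sales, sls_product_dim prod,
         sls_product_lookup pnumb, sls_order_method_dim meth
    WHERE pnumb.product_language='EN'
      AND sales.product_key=prod.product_key
      AND prod.product_number=pnumb.product_number
      AND meth.order_method_key=sales.order_method_key
    FETCH FIRST 100 ROWS ONLY;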

    __3. Modify the query to restrict the order method to one type: those involving a 'Sales visit'. To do so, add the following query predicate just before the semi-colon: AND order_method_en='Sales visit'

    __4. Run the modified query.

    __5. Inspect the results to see that only products that were sold by the method Sales visit are listed.

    __6. To find out which sales method has the greatest quantity of orders, add a GROUP BY clause (GROUP BY pll.product_line_en, md.order_method_en). In addition, invoke the SUM aggregate function (sum(sf.quantity)) to total the orders by product and method. Finally, this query cleans up the output a bit by using aliases (e.g., AS Product) to substitute more readable column headers.

    use MYBIGSQL;

    SELECT pll.product_line_en AS Product,
           md.order_method_en AS Order_method,
           sum(sf.QUANTITY) AS total
    FROM sls_order_method_dim AS md,
         sls_product_dim AS pd,
         sls_product_line_lookup AS pll,
         sls_product_brand_lookup AS pbl,
         sls_sales_fact AS sf
    WHERE pd.product_key = sf.product_key
      AND md.order_method_key = sf.order_method_key
      AND pll.product_line_code = pd.product_line_code
      AND pbl.product_brand_code = pd.product_brand_code
    GROUP BY pll.product_line_en, md.order_method_en;

    __7. The GROUP BY clause groups the product lines and order methods together so that the SUM aggregate function can total the quantity sold for each combination. Inspect the results to see that a total of 35 rows have been returned.
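    To rank the 35 combinations instead of scanning them by eye, you could sort the aggregated totals and keep only the top rows. This optional variation is a sketch, not part of the original lab script:

    SELECT pll.product_line_en AS Product,
           md.order_method_en AS Order_method,
           sum(sf.quantity) AS total
    FROM sls_order_method_dim AS md, sls_product_dim AS pd,
         sls_product_line_lookup AS pll, sls_product_brand_lookup AS pbl,
         sls_sales_fact AS sf
    WHERE pd.product_key = sf.product_key
      AND md.order_method_key = sf.order_method_key
      AND pll.product_line_code = pd.product_line_code
      AND pbl.product_brand_code = pd.product_brand_code
    GROUP BY pll.product_line_en, md.order_method_en
    ORDER BY total DESC
    FETCH FIRST 5 ROWS ONLY;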

    Big SQL supports views (virtual tables) based on one or more physical tables. In this section, you will create a view that spans multiple tables. Then you will query this view using a simple SELECT statement. In doing so, you'll see that you can work with views in Big SQL much as you can work with views in a relational DBMS.

    __8. Create a view named MYVIEW that extracts information about product sales featured in marketing promotions.

    use MYBIGSQL;

    create view myview as
    select product_name, sales.product_key, mkt.quantity,
           sales.order_day_key, sales.sales_order_key, order_method_en
    from mrk_promotion_fact mkt,
         sls_sales_fact sales,
         sls_product_dim prod,
         sls_product_lookup pnumb,
         sls_order_method_dim meth
    where mkt.order_day_key=sales.order_day_key
      and sales.product_key=prod.product_key
      and prod.product_number=pnumb.product_number
      and pnumb.product_language='EN'
      and meth.order_method_key=sales.order_method_key;


    __9. Query the view:

    select * from mybigsql.myview
    order by product_key asc, order_day_key asc
    fetch first 20 rows only;

    __10. Inspect the results to see the 20 rows returned from the query.


    1.5 Using additional load operations

    This section covers some additional bulk load operations that you can use to get data into Big SQL. Remember that you do not want to use row-at-a-time INSERT statements in a production environment; they are very inefficient.

    With Big SQL, you can populate a table with data based on the results of a query. In this section, you will use an INSERT INTO ... SELECT statement to retrieve data from multiple tables and insert that data into another table. Executing an INSERT INTO ... SELECT exploits the machine resources of your cluster because Big SQL can parallelize both the read (SELECT) and write (INSERT) operations.

    __1. Execute the following statement to create a sample table named sales_report:

    CREATE HADOOP TABLE MYBIGSQL.sales_report
    ( product_key INT NOT NULL,
      product_name VARCHAR(150),
      quantity INT,
      order_method_en VARCHAR(90)
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;

    __2. Now populate the newly created table with results from a query that joins data from multiple tables.

    USE MYBIGSQL;

    INSERT INTO sales_report
    SELECT sales.product_key, pnumb.product_name, sales.quantity,
           meth.order_method_en
    FROM sls_sales_fact sales, sls_product_dim prod,
         sls_product_lookup pnumb, sls_order_method_dim meth
    WHERE pnumb.product_language='EN'
      AND sales.product_key=prod.product_key
      AND prod.product_number=pnumb.product_number
      AND meth.order_method_key=sales.order_method_key
      AND sales.quantity > 1000;

    __3. Verify that the previous statement succeeded by executing the following query:

    -- total number of rows should be 14441
    select count(*) from mybigsql.sales_report;

    Another type of load operation is the CREATE TABLE ... AS SELECT (CTAS) statement, which creates a new table from the results of a query against an existing table.

    __4. Create a new table from sales_report called sales_report_modified. Type in:

    use MYBIGSQL;

    CREATE HADOOP TABLE sales_report_modified as
    select product_key, product_name, quantity
    from sales_report
    where order_method_en = 'E-mail';

    __5. Verify the results. Type in:

    select * from mybigsql.sales_report_modified;

    __6. Now let's add some data in the Parquet file format. Create a new table:

    use MYBIGSQL;

    CREATE HADOOP TABLE IF NOT EXISTS big_sales_parquet
    ( product_key INT NOT NULL,
      product_name VARCHAR(150),
      quantity INT,
      order_method_en VARCHAR(90)
    )
    STORED AS parquetfile;

    With the exception of the last line, all of this statement should be familiar to you by now. The last line tells Big SQL to store the data in Parquet format.
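    As an aside, the create-then-populate pattern in steps __6 and __7 could in principle be collapsed into a single CTAS statement that also names the storage format. Whether your Big SQL release accepts STORED AS in a CTAS is an assumption to verify, so treat this purely as a sketch:

    -- Hypothetical combined form; verify STORED AS support in CTAS first
    CREATE HADOOP TABLE big_sales_parquet_ctas
    STORED AS parquetfile
    as select product_key, product_name, quantity, order_method_en
    from sales_report
    where quantity > 5500;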

    __7. Populate this table with data based on the results of a query. Note that this query joins data from four tables you previously defined in Big SQL using the TEXTFILE format. Big SQL will automatically reformat the result set of this query into Parquet format for storage.

    use MYBIGSQL;

    insert into big_sales_parquet
    SELECT sales.product_key, pnumb.product_name, sales.quantity,
           meth.order_method_en
    FROM sls_sales_fact sales,
         sls_product_dim prod,
         sls_product_lookup pnumb,
         sls_order_method_dim meth
    WHERE pnumb.product_language='EN'
      AND sales.product_key=prod.product_key
      AND prod.product_number=pnumb.product_number
      AND meth.order_method_key=sales.order_method_key
      AND sales.quantity > 5500;

    __8. Inspect the results to see that 471 records are returned:

    select * from mybigsql.big_sales_parquet;
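    A quicker check than scrolling through all of the rows is a simple count; this optional one-liner is not part of the original lab script:

    -- expect 471 rows
    select count(*) from mybigsql.big_sales_parquet;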

    __9. So far, we have not seen how these files are actually stored in the file system. Launch the BigInsights web console.

    __10. Log in with the username bigsql and the password bigsql.

    __11. Click the Files tab.

    __12. Drill down to biginsights-->hive-->warehouse-->mybigsql.db

    __13. Expand the big_sales_parquet table and click on the data file inside. You will see that the file is not human readable.

    __14. Alternatively, click any of the other tables, such as go_region_dim, to see their content. You should be able to read what is inside the file.


    1.6 Working with partitioned tables

    Now let's look at partitioned tables. Partitioning a table can make queries perform much faster if you partition on columns that are commonly used in predicates: a query then only searches the partitions it needs, which saves a lot of time when the data set is large. In our sales report example, we have an order method column. This column can be used to search on a specific ordering method, such as E-mail.

    __ 1. Create a partitioned table called sales_report_part. Copy and paste this into the aFirstFile.sql script in Eclipse:

    use MYBIGSQL;

    CREATE HADOOP TABLE IF NOT EXISTS sales_report_part
    ( product_key INT NOT NULL,
      product_name VARCHAR(150),
      quantity INT
    )
    partitioned by (order_method_en VARCHAR(90))
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;

    Everything here should look familiar to you, with the exception of the PARTITIONED BY clause. You define all your columns as you normally would, except for the column that you wish to partition by; you include that column in the PARTITIONED BY clause instead.

    __ 2. Load the data from the original sales_report table into this new partitioned table. Type in:

    use MYBIGSQL;

    insert into sales_report_part
    select product_key, product_name, quantity, order_method_en
    from sales_report;

    __ 3. Inspect the results by going to Files tab in the BigInsights console. You may need to do a refresh before seeing the new results.

    __ 4. Drill down to the sales_report_part table (biginsights-->hive-->warehouse-->mybigsql.db). Within that directory, you see each partition in its own folder, labeled with the column name equals the column value (for example, order_method_en=E-mail).

    __ 5. Expand each partition to see the data files stored within.
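    To see the benefit of partitioning, query the table with a predicate on the partitioning column; Big SQL then only needs to read the matching partition directory rather than the whole table. A minimal sketch:

    -- Only the order_method_en=E-mail partition needs to be scanned
    select product_key, product_name, quantity
    from mybigsql.sales_report_part
    where order_method_en = 'E-mail';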


    Summary

    Having completed this exercise, you should now be able to:

    - Create Big SQL schemas and tables

    - Load data into Big SQL tables

    - Query data in Big SQL tables and views

    - Use additional load operations

    - Work with partitioned tables


    © Copyright IBM Corporation 2013.

    The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, these materials. Nothing contained in these materials is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. References in these materials to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. This information is based on current IBM product plans and strategy, which are subject to change by IBM without notice. Product release dates and/or capabilities referenced in these materials may change at any time at IBM's sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way.

    IBM, the IBM logo and ibm.com are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.
