Lab - Working With SQL Using Big SQL v3

Hands-on Lab:

Working with SQL using Big SQL

2013 BigDataUniversity.com

Overview

Apache Hadoop and its map-reduce framework have become very popular for their robust, scalable distributed processing. While Hadoop is very good at munching big data, developing map-reduce applications is complex and time-consuming. Scripting languages such as Pig try to solve this problem; however, they require mastering a new language. Big SQL alleviates both of these problems by allowing users to write their queries in the well-understood SQL language. Under the hood, it takes advantage of Hadoop's scalable distributed processing when necessary. While there are other SQL processors for Hadoop, Big SQL is much superior in terms of functionality and performance.

Objective of this lab

The objective of this lab is to get you familiar with the Big SQL server and client. Henceforth, the term bigsql will be used interchangeably with Big SQL.

After completing this lab, you will understand how to:

o Start, stop, restart, get status of bigsql server
o Configure bigsql server
o Perform basic administration of bigsql server
o Connect to and disconnect from bigsql server using a client
o Execute some DDL, DML, queries using bigsql client on bigsql server
o Connect to and run queries from a JDBC client
o Troubleshoot bigsql server

Lab environment details

Your lab environment has been set up so that a Hadoop cluster is shared among all participants of the Big SQL technology preview. Because the cluster is shared, your account has been given Application privilege, not full administration privileges; therefore, some of the commands in this lab will fail with a permission denied message. Nonetheless, you will have the chance to try these commands. The videos in this lesson show you the output you would receive if you had full administration privileges.

Note: Whenever a user ID / password is required, use the following:

User ID: Use your user ID for the IM Demo Cloud. This appears at the top right corner after you log on. For example, in the screenshot below, the user ID is raulchongbiz



Password: Use the password you used when you logged on to my.imdemocloud.com.

For simplicity, this document refers to this user ID/password as IMDC userID/psw.

Prerequisite

Ensure you have followed the instructions to set up your lab environment (first lab in this course). When you gain access to my.imdemocloud.com, enter the appropriate credentials, and start a MindTerm terminal window or a PuTTY session.

Exercise-1 (5 minutes): Explore the server's directory structure

The server's directory structure can be found when you:

o cd $BIGSQL_HOME

This is equivalent to:

o cd $BIGINSIGHTS_HOME/bigsql

and should typically take you to /opt/ibm/biginsights/bigsql. Explore the directory structure. It should look like this:



bin       executables to start/stop bigsql server/client
lib       server's jars
conf      server configuration files
jdbc      JDBC driver
odbc      ODBC driver
jsqsh     bigsql command line client
msg       error messages
security  keystore file for SSL encryption between client and server (if SSL is enabled)

Note:

In this lab environment, most configuration files can be read, but no changes are allowed. Even if you could edit files in those directories, configuration changes will not take effect unless you restart the Big SQL server. You will not have the permissions to restart the Big SQL server.

Exercise-2 (5 minutes): How to stop/start the bigsql server

Start, stop, restart, get status of bigsql server.

Only a user with appropriate admin privileges can start and stop the bigsql-server. Though your user does not have the appropriate privileges, practice executing the commands below. You will receive a permission denied message.

From the terminal window issue:

o cd $BIGSQL_HOME/bin (This has been added to your path, so it's optional)

Get status of the bigsql server (checks if it's running):

o bigsql status

Stop the bigsql server:

o bigsql stop

Force stop the bigsql server:

o bigsql forcestop

Start the bigsql server:

o bigsql start

Restart the bigsql server:

o bigsql restart

Other options:

o bigsql -help

From the BigInsights Web-console you can also stop/start/monitor the Big SQL server process:



From the my.imdemocloud.com dashboard, cluster tab, start the Web Console:

o Click on the link beside BigInsights Web Console

Your browser may display a warning message indicating the security certificate is not trusted as shown below. Proceed anyway. We do not purchase SSL certificates for these demonstration systems.

You will be prompted for credentials: User Name and Password. Use the IMDC UserID/psw as described earlier. Click on Login.



Click on the Cluster Status tab, and look for the Big SQL process. It should show it is running. Click on this process, and on the right side you will see more detail. Note that the Start and Stop buttons are grayed out since you do not have admin privileges, and thus are not allowed to stop/start this process.



Exercise-3 (5 minutes): How to configure bigsql-server

Bootstrap configurations via environment variables:

To set the initial and/or maximum memory available to the bigsql JVM, set these two environment variables from your terminal and restart the bigsql server.

export BIGSQL_CONF_INSTANCE_INITIAL_MEM=4096m # default is 1GB

export BIGSQL_CONF_INSTANCE_MAX_MEM=8192m # default is 2GB
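As a quick sanity check on the two values above, here is a minimal Python sketch that converts them to bytes and confirms the initial size does not exceed the maximum. It assumes the values follow JVM-style size suffixes (k, m, g); the helper name parse_size is hypothetical and not part of Big SQL.

```python
# Assumption: sizes use JVM-style suffixes (k, m, g), as in "4096m".
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text: str) -> int:
    """Convert a size such as '4096m' to a number of bytes."""
    text = text.strip().lower()
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)  # no suffix: treat the value as plain bytes


initial = parse_size("4096m")  # value used for BIGSQL_CONF_INSTANCE_INITIAL_MEM
maximum = parse_size("8192m")  # value used for BIGSQL_CONF_INSTANCE_MAX_MEM

# The initial heap should never be larger than the maximum heap.
if initial > maximum:
    raise ValueError("initial memory exceeds maximum memory")

print(initial // 1024 ** 3, "GiB initial,", maximum // 1024 ** 3, "GiB maximum")
```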

Server wide job-conf setting:

To pass hadoop job-conf settings at the bigsql server level, put them in:

$BIGSQL_HOME/conf/bigsql-site.xml

and restart the bigsql server.

Other server settings:

Explore $BIGSQL_HOME/conf/bigsql-conf.xml to make changes to other settings such as:

o Network interface where bigsql-server should listen (default is 0.0.0.0, i.e. all interfaces)
o Port where bigsql-server should listen (default is 7052)
o Whether to authenticate using web-console or some other mechanism
o Whether to turn on SSL encryption and SSL specific settings
o List of jars/jaql modules to load at server startup time
o Etc.

Changes will take effect after restarting the bigsql server.
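When checking which interface and port the server is listening on, a plain TCP connection test can help. The Python sketch below is only an illustration: the function name is_listening is hypothetical, "localhost" is an assumption about where you run it, and 7052 is the default port listed above.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # 7052 is the default bigsql-server port; "localhost" is an assumption.
    print(is_listening("localhost", 7052))
```

A result of True only shows that something accepts connections on that port; it does not confirm that the listener is the bigsql server.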

Exercise-4 (5 minutes): jsqsh command line client

Start jsqsh client:

jsqsh



o The first time you invoke jsqsh, you will get a Welcome message indicating that some files have been created for you as shown below.

o Press Enter to continue. This will start a series of screens that allow you to define one or more connection aliases. Let's go through the screens one time to set up one connection alias.

o Enter 1 to choose the Big SQL driver in the screen below.



o Next, enter (S) to save the configuration properties (We will take all defaults). Later in the lab there is a section that explains how to configure your Big SQL server.



Next you will be prompted to enter a connection name. Type myconn, then on the next screen enter A to add. This is an alias for a connection. We will explain later how to manually create other aliases, and edit existing ones.

Next, type Q to exit; otherwise you will be prompted again for driver information, and so on, to set up another connection alias.

Exit the jsqsh client (the prompt will be a line number like 1>; however, we are using jsqsh> for ease of readability):

jsqsh> quit

Start it again. This time it should not prompt you for anything:

jsqsh



Get help:

jsqsh> \help

Create a connection alias to database-server:

\connect -U<user> -P<password> -S<server> -p<port> -d<database> -a<alias>

For example, let's say your user ID is rfchong and the password is passw0rd. Say the alias for the connection you want to create is myconn1. Then the statement to use is:

\connect -Urfchong -Ppassw0rd -Slocalhost -p7052 -dbigsql -amyconn1

Note: When you follow the step above, remember to use the IMDC user ID/psw indicated in a previous note. The lab environment has been set up so this user ID/psw is also used for the BigInsights Web console authentication, and the Big SQL environment has been set up to use this same authentication method. Be sure to enclose the password in double quotes when it contains special characters such as &.
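To make the flag layout and the quoting advice concrete, here is a small Python sketch that assembles the connect line from its parts. The helper build_connect_command is hypothetical, and the rule it uses (quote any password that is not purely alphanumeric) is a simplification of the note above, not jsqsh's own behavior.

```python
def build_connect_command(user, password, server, port, database, alias):
    """Assemble a jsqsh connect line in the flag order used in this lab."""
    # Simplified rule: wrap the password in double quotes whenever it
    # contains anything other than letters and digits (for example '&').
    # A password that itself contains a double quote is not handled here.
    if not password.isalnum():
        password = '"' + password + '"'
    return (
        f"\\connect -U{user} -P{password} -S{server} "
        f"-p{port} -d{database} -a{alias}"
    )


# Same values as the example above, then a password with a special character.
print(build_connect_command("rfchong", "passw0rd", "localhost", 7052, "bigsql", "myconn1"))
print(build_connect_command("rfchong", "pass&w0rd", "localhost", 7052, "bigsql", "myconn1"))
```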

If you make a mistake in the above command, simply execute it again with the correct information, using the same alias name. The incorrect information will be overwritten. Alternatively, use the -r flag to remove the connection alias. For more details about the syntax, type:

\help connect

Verify the connection alias was indeed created by listing the connection aliases:

\connect -l

Connect using alias myconn1:

\connect myconn1

Show the currently connected user's user name:

set -v jaql.system.dataSource.userName;

Note: If, for any reason, you keep getting prompts with line numbers after you press Enter, be sure to add a semicolon (;) at the end of the line. If that doesn't work, type go and press Enter.

Show schemas:

\show schemas

Show tables:

\show tables

Show columns:

\show columns

Describe (get schema of) a table:

\describe system.dual;

See the contents of the catalog table syscat.columns:

select * from syscat.columns;

Exercise-5 (5 minutes): Admin commands

You can use administrative commands to list application or connection information (e.g. application-id, client-ip, client-port, and so on). Regular users will only be able to list and cancel their own applications, while users with full administration privileges can list and cancel applications from all users. For example, try the following:

jsqsh> list applications all;

Assume the output of this command (under the applicationID column) returns these application IDs: 1, 3, 77, 92, 95, 1002. To list the first four in this list include the applicationID explicitly in the command as follows:

jsqsh> list applications 1 3 77 92;

To figure out the application-id of your own connection, issue:

jsqsh> set -v jaql.system.session.id;

To cancel specific applications/connections issue:

jsqsh> cancel applications 3 5;

where, in this example, the IDs of the applications to cancel are 3 and 5. Note that you will not be able to cancel the application of your current session. Try opening other terminals and connecting with the myconn1 alias to list more applications and cancel them. If using the MindTerm Java terminal from my.imdemocloud.com, you can quickly open another terminal window by using CTRL + SHIFT + o

Note:

At the time of writing (tech preview code), there is an open defect for the cancel command. It may not work.

If you had admin privileges, you could cancel all applications as follows:



jsqsh> cancel applications all;

Exercise-6 (30 minutes): DDL, DML, Queries

In this lab environment, all participants are sharing the same cluster. To keep your work separate, create your own schema as follows:

jsqsh> set dfs.umaskmode = 077;

jsqsh> CREATE SCHEMA <userID> location '/user/<userID>/<userID>.db';

For example, in the screenshot below, the user ID is raulchongbiz1. The command to execute would be:

jsqsh> set dfs.umaskmode = 077;

jsqsh> CREATE SCHEMA raulchongbiz1 location '/user/raulchongbiz1/raulchongbiz1.db';

In the current Big SQL implementation, objects are created in HDFS when you issue different SQL commands. The mask in the above command sets the permissions on the specified HDFS location so others cannot access the objects. In this lab environment, each participant has their own /user/<userID> directory created ahead of time, so you don't need to create it yourself.
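To see why a mask of 077 keeps other users out, here is a short Python sketch of the usual umask arithmetic: every permission bit set in the mask is cleared from the requested mode. The starting modes 777 (directories) and 666 (files) are the conventional defaults and are assumed here; the function name apply_umask is hypothetical.

```python
def apply_umask(requested_mode: int, umask: int) -> int:
    """Clear every permission bit that is set in the umask."""
    return requested_mode & ~umask


UMASK = 0o077  # the value passed to dfs.umaskmode above

# Assumed conventional starting modes: 0o777 for directories, 0o666 for files.
dir_mode = apply_umask(0o777, UMASK)
file_mode = apply_umask(0o666, UMASK)

print(oct(dir_mode))   # 0o700: owner has rwx, group and others have nothing
print(oct(file_mode))  # 0o600: owner has rw-, group and others have nothing
```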

If at any point you need to drop the schema you created (DON'T DO IT NOW), you can issue:

DROP SCHEMA IF EXISTS <schema> CASCADE;

For example:

jsqsh> DROP SCHEMA raulchongbiz1 CASCADE;

The CASCADE clause is needed in case there are objects created in the schema; it drops those objects too.

Verify the schema was created (i.e. schema exists in catalog?):

jsqsh> \show schemas

Tell the bigsql server that all objects (tables, etc.) we refer to in our DDL, DML, and queries should use this schema.

Tip:

Every time a new bigsql connection is established, this should be the first statement. Otherwise, you will end up creating objects in the default schema.

Set the default schema for the session:

USE <schema>;

For example:

jsqsh> USE raulchongbiz1;

Check the current default schema:

jsqsh> set -v jaql.sql.defaultSchema;

Simple query and local-access hint

Set schema

jsqsh> USE <schema>;

Drop existing table

jsqsh> DROP TABLE IF EXISTS lineitem;

Create a table. Pay attention to how the delimiter | for this csv file is specified in the statement. The STORED AS TEXTFILE clause specifies that the table will be stored as a text file in HDFS.

Note:

The tables lineitem and supplier to be created in this section are the same tables used in the Working with SQL using Hive lab. A few things to keep in mind while going through the steps below:

o When using jsqsh (instead of the Hive CLI), you can copy/paste the entire statements below. jsqsh will know how to paste the lines appropriately. In the Hive CLI, you had to type the entire statement, or copy/paste one line at a time.

o With Big SQL, you have the option to run the queries in local mode or map-reduce mode. In Hive, it's only map-reduce.

o You should notice gains in performance for some of the queries in either map-reduce or local mode.

o If you did not complete the Working with SQL using Hive lab, or don't remember the performance of the queries in Hive, we suggest you open two terminal windows (if using MindTerm, click on File > Clone Terminal) and run the same queries side by side: one in Hive, one in Big SQL. Tables created in Hive or Big SQL are the same, so you don't need to create a table if it was already created in one or the other.

jsqsh> CREATE TABLE lineitem (
L_ORDERKEY INT,
L_PARTKEY INT,
L_SUPPKEY INT,
L_LINENUMBER INT,
L_QUANTITY DOUBLE,
L_EXTENDEDPRICE DOUBLE,
L_DISCOUNT DOUBLE,
L_TAX DOUBLE,
L_RETURNFLAG STRING,
L_LINESTATUS STRING,
L_SHIPDATE STRING,
L_COMMITDATE STRING,
L_RECEIPTDATE STRING,
L_SHIPINSTRUCT STRING,
L_SHIPMODE STRING,
L_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
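To picture what FIELDS TERMINATED BY '|' means for a text file, here is a minimal Python sketch that splits one row on the delimiter and converts each field to the type declared in the CREATE TABLE statement above. The sample row is made up for illustration and is not taken from lineitem.data; the helper parse_lineitem is hypothetical.

```python
# Column names and types copied from the CREATE TABLE statement above
# (INT -> int, DOUBLE -> float, STRING -> str).
LINEITEM_SCHEMA = [
    ("L_ORDERKEY", int), ("L_PARTKEY", int), ("L_SUPPKEY", int),
    ("L_LINENUMBER", int), ("L_QUANTITY", float), ("L_EXTENDEDPRICE", float),
    ("L_DISCOUNT", float), ("L_TAX", float), ("L_RETURNFLAG", str),
    ("L_LINESTATUS", str), ("L_SHIPDATE", str), ("L_COMMITDATE", str),
    ("L_RECEIPTDATE", str), ("L_SHIPINSTRUCT", str), ("L_SHIPMODE", str),
    ("L_COMMENT", str),
]


def parse_lineitem(line: str) -> dict:
    """Split one text row on '|' and convert each field to its column type."""
    fields = line.rstrip("\n").split("|")
    return {
        name: convert(value)
        for (name, convert), value in zip(LINEITEM_SCHEMA, fields)
    }


# Made-up sample row, for illustration only.
sample = (
    "1|155|7|1|17.0|21168.23|0.04|0.02|N|O|"
    "1996-03-13|1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|sample comment"
)
row = parse_lineitem(sample)
print(row["L_ORDERKEY"], row["L_SHIPMODE"], row["L_EXTENDEDPRICE"])
```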

Load 10,000 rows into the newly created table.

Note:

The LOAD command in Big SQL currently passes the command to Hive. At the time of writing, Hive seems to have a problem where, even though a user creates a table, that user cannot LOAD data into it because the user doesn't have the UPDATE privilege; therefore, you need to go to the Hive shell and issue a GRANT statement to yourself (as yourself), granting the UPDATE privilege or, better yet, ALL privileges.

Open another terminal window; let's call it Terminal 2 (if using MindTerm, go to File > Clone Terminal). Start the Hive shell, and execute the GRANT statement:

hive

hive> use <schema>;

hive> grant all on table lineitem to user <userID>;

hive> quit;

Go back to Terminal 1, and issue:

jsqsh> LOAD HIVE DATA LOCAL INPATH '/userdata/lineitem.data' OVERWRITE INTO TABLE lineitem;

In this environment, for this lab, we have copied the input files to /userdata. Note that this directory and the input files must be readable by biadmin (because bigsql-server, which runs as biadmin, will read the files).

Run a simple query

jsqsh> SELECT COUNT (*) FROM lineitem;

Note:

Queries on large amounts of data typically run faster with map-reduce parallelism. On the other hand, queries on small amounts of data, or full table scans, typically run faster with no map-reduce (i.e. local-read), because in these cases the overhead introduced by map-reduce parallelism outweighs its benefits.

bigsql uses map-reduce by default for most cases. For some simpler cases like a select * from t1, bigsql uses local-read by default.

You can force local-read mode using the accessmode hint.



jsqsh> SELECT COUNT (*) FROM lineitem /*+ accessmode='local' +*/;

Similarly you can force map-reduce as follows:

jsqsh> SELECT COUNT (*) FROM lineitem /*+ accessmode='MR' +*/;
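The trade-off between the two access modes can be pictured with a toy cost model: local-read has no startup cost but scans with one process, while map-reduce pays a fixed startup cost and then splits the scan across workers. The Python sketch below is only an illustration of that reasoning; the scan rate, worker count, and startup time are invented numbers, not measurements of Big SQL.

```python
def local_read_seconds(rows: int, rows_per_second: float) -> float:
    """One process scans every row; there is no job startup cost."""
    return rows / rows_per_second


def map_reduce_seconds(rows: int, rows_per_second: float,
                       workers: int, startup_seconds: float) -> float:
    """Fixed job startup cost, then the scan is split evenly across workers."""
    return startup_seconds + rows / (rows_per_second * workers)


# All three constants are invented for illustration; they are not measurements.
RATE = 100_000      # rows scanned per second by one process
WORKERS = 10        # parallel map tasks
STARTUP = 20.0      # seconds of job scheduling overhead

for rows in (10_000, 100_000_000):
    local = local_read_seconds(rows, RATE)
    mr = map_reduce_seconds(rows, RATE, WORKERS, STARTUP)
    faster = "local" if local < mr else "map-reduce"
    print(f"{rows:>11,} rows: local={local:.1f}s map-reduce={mr:.1f}s -> {faster}")
```

Under these assumed numbers, the 10,000-row table used in this lab falls on the local-read side, and only the much larger table benefits from map-reduce.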

Here is how you can set local mode for all queries in the session:

jsqsh> SET FORCE LOCAL ON;

jsqsh> SELECT COUNT (*) FROM lineitem;

jsqsh> SELECT L_SHIPMODE, COUNT (*) FROM lineitem GROUP BY L_SHIPMODE;

jsqsh> SET FORCE LOCAL OFF;

Let's now create another table:

jsqsh> CREATE TABLE supplier (
S_SUPPKEY INT,
S_NAME STRING,
S_ADDRESS STRING,
S_NATIONKEY INT,
S_PHONE STRING,
S_ACCTBAL DOUBLE,
S_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

Load the data:

jsqsh> LOAD HIVE DATA LOCAL INPATH '/userdata/supplier.data' OVERWRITE INTO TABLE supplier;

Verify that there are 10,000 rows in your supplier table:

jsqsh> SELECT COUNT(*) FROM supplier;

In our tests, the above query takes 24.723 seconds in Hive vs. 8.31 seconds in Big SQL.

Next, run the following queries (also run in the Hive lab):

jsqsh> SELECT SUM(L_EXTENDEDPRICE*L_DISCOUNT) AS REVENUE
FROM lineitem
WHERE L_SHIPDATE >= '1994-01-01'
AND L_SHIPDATE < '1995-01-01'
AND L_DISCOUNT >= 0.05
AND L_DISCOUNT ...



WITH REVENUE ( SUPPLIER_NO, TOTAL_REVENUE ) AS
(SELECT L_SUPPKEY, SUM(L_EXTENDEDPRICE * (1-L_DISCOUNT))
FROM LINEITEM
WHERE L_SHIPDATE >= '1996-01-01'
AND L_SHIPDATE < '1996-04-01'
GROUP BY L_SUPPKEY)
SELECT S_SUPPKEY, S_NAME, S_ADDRESS, S_PHONE, TOTAL_REVENUE
FROM SUPPLIER, REVENUE
WHERE S_SUPPKEY = SUPPLIER_NO
AND TOTAL_REVENUE = (SELECT MAX(TOTAL_REVENUE) FROM REVENUE)
ORDER BY S_SUPPKEY;

You should get a result like:

1 row in results(first row: 1m18.7s; total: 1m18.7s)
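To see what the WITH REVENUE query computes, it can help to run the same logic on a handful of rows. The Python sketch below uses the standard-library sqlite3 module as a stand-in for Big SQL (an assumption made purely so the example runs anywhere), with trimmed-down tables and made-up rows. The common table expression totals revenue per supplier for the first quarter of 1996, and the outer query keeps only the supplier with the highest total.

```python
import sqlite3

# SQLite is used here only as a convenient stand-in for Big SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE supplier (S_SUPPKEY INT, S_NAME TEXT, S_ADDRESS TEXT, S_PHONE TEXT);
CREATE TABLE lineitem (L_SUPPKEY INT, L_EXTENDEDPRICE REAL,
                       L_DISCOUNT REAL, L_SHIPDATE TEXT);
-- Made-up rows, for illustration only.
INSERT INTO supplier VALUES (1, 'Supplier#1', 'addr 1', '555-0001'),
                            (2, 'Supplier#2', 'addr 2', '555-0002');
INSERT INTO lineitem VALUES (1, 100.0, 0.25, '1996-01-15'),
                            (1, 200.0, 0.00, '1996-02-01'),
                            (2, 500.0, 0.50, '1996-03-31'),
                            (2, 900.0, 0.00, '1996-04-01');
""")

QUERY = """
WITH REVENUE (SUPPLIER_NO, TOTAL_REVENUE) AS
  (SELECT L_SUPPKEY, SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT))
   FROM lineitem
   WHERE L_SHIPDATE >= '1996-01-01' AND L_SHIPDATE < '1996-04-01'
   GROUP BY L_SUPPKEY)
SELECT S_SUPPKEY, S_NAME, S_ADDRESS, S_PHONE, TOTAL_REVENUE
FROM supplier, REVENUE
WHERE S_SUPPKEY = SUPPLIER_NO
  AND TOTAL_REVENUE = (SELECT MAX(TOTAL_REVENUE) FROM REVENUE)
ORDER BY S_SUPPKEY
"""

rows = conn.execute(QUERY).fetchall()
# Supplier 1 totals 75 + 200 = 275; supplier 2 totals 250, because the
# row shipped on 1996-04-01 falls outside the half-open date range.
print(rows)
```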

Working with CTAS (create table as select), and joins

Set schema for the session:

jsqsh> USE <schema>;

Drop the existing table if it exists:

jsqsh> DROP TABLE IF EXISTS orders1;

Create a regular table (not a CTAS yet):

CREATE TABLE orders1 (
O_ORDERKEY BIGINT,
O_CUSTKEY INTEGER,
O_ORDERSTATUS CHAR(1),
O_TOTALPRICE FLOAT,
O_ORDERDATE TIMESTAMP,
O_ORDERPRIORITY CHAR(15),
O_CLERK CHAR(15),
O_SHIPPRIORITY INTEGER,
O_COMMENT VARCHAR(79) )
row format delimited fields terminated by '|'
stored as textfile;

Load 10k rows.

From the other terminal window, Terminal 2, start the Hive shell, and execute the GRANT statement:

hive

hive> use <schema>;

hive> grant all on table orders1 to user <userID>;

hive> quit;

Return to Terminal 1 and issue:

jsqsh> LOAD HIVE DATA LOCAL INPATH '/userdata/orders.data' OVERWRITE INTO TABLE orders1;

Verify the data was loaded:

jsqsh> SELECT COUNT (*) FROM orders1 /*+ accessmode='local' +*/;



Create a CTAS (create table as select) query to prepare for another query:

CREATE TABLE q4_order_priority_tmp (O_ORDERKEY) as
select DISTINCT l_orderkey as O_ORDERKEY
from lineitem
where l_commitdate < l_receiptdate;

Use a JOIN to join the two previously created tables:

select o_orderpriority, count(1) as order_count
from orders1 o join q4_order_priority_tmp t
on o.o_orderkey = t.o_orderkey
and o.o_orderdate >= cast('1993-07-01' as timestamp)
and o.o_orderdate < cast('1993-10-01' as timestamp)
group by o_orderpriority
order by o_orderpriority;

Tip:

If we know that one table is small, it can be pulled into memory during the join, allowing a cheaper in-memory join (aka hash join, aka map-side join), e.g.

FROM T1, T2 /*+ tablesize=small +*/

Or

WHERE T1.c1 = T2.c1 /*+ joinmethod=mapsidehash, buildtable=T2 +*/
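The idea behind the hints in the tip above can be sketched in a few lines of Python: the small table is loaded into an in-memory hash table once (the build side), and the large table is then streamed past it one row at a time (the probe side). This is a generic illustration of a hash join, not Big SQL's implementation; the function name and sample rows are hypothetical.

```python
def map_side_hash_join(big_rows, small_rows, key):
    """Join two lists of dict rows on `key`, building the hash table
    from the small input and probing it with the large input."""
    # Build phase: index the small table by join key (kept entirely in memory).
    build = {}
    for row in small_rows:
        build.setdefault(row[key], []).append(row)

    # Probe phase: stream the big table and look up matches in O(1) each.
    joined = []
    for row in big_rows:
        for match in build.get(row[key], []):
            joined.append({**row, **match})
    return joined


# Made-up rows: T1 plays the large table, T2 the small build table.
t1 = [{"c1": 1, "a": "x"}, {"c1": 2, "a": "y"}, {"c1": 3, "a": "z"}]
t2 = [{"c1": 1, "b": "p"}, {"c1": 3, "b": "q"}]

result = map_side_hash_join(t1, t2, "c1")
print(result)
```

Because only the small table has to fit in memory, no shuffle or sort of the large table is needed, which is what makes this join cheaper.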

Complex types

Let's create another table that uses complex data types (namely array and struct).

Set the schema for this session:



jsqsh> USE <schema>;

Drop existing table

jsqsh> DROP TABLE IF EXISTS employees;

Create a table with complex data types. The clause COLLECTION ITEMS TERMINATED BY tells how to separate struct/array members.

Jsqsh> CREATE TABLE employees ( name STRING, phones ARRAY, address STRUCT ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' COLLECTION ITEMS TERMINATED BY ':';

Load some data.

From the other terminal window, Terminal 2, start the Hive shell, and execute the GRANT statement:

hive

hive> use <schema>;

hive> grant all on table employees to user <userID>;

hive> quit;

Return to Terminal 1 and issue:

jsqsh> LOAD HIVE DATA LOCAL INPATH '/userdata/employees.data' OVERWRITE INTO TABLE employees;

Let's see how to access complex types. Note that for this query, bigsql uses local mode by default.

jsqsh> SELECT * FROM EMPLOYEES;

Look at names and the first phone number (in the array of phone numbers)

jsqsh> SELECT name, phones[1] FROM EMPLOYEES;

Look at names and city (from the address struct)

jsqsh> SELECT name, address.city FROM EMPLOYEES;
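To visualize how one text row maps onto an array and a struct, here is a minimal Python sketch that applies the two delimiters from the CREATE TABLE statement: '|' between columns and ':' between collection items. The sample row is made up, and the struct layout (street, then city) is an assumption; only the city field is named in this lab.

```python
def parse_employee(line: str) -> dict:
    """Split a row on '|' into columns, then split the collection
    columns on ':' into array elements and struct members."""
    name, phones, address = line.rstrip("\n").split("|")
    # Assumed struct layout: street, then city (only city appears in the lab).
    street, city = address.split(":")
    return {
        "name": name,
        "phones": phones.split(":"),
        "address": {"street": street, "city": city},
    }


# Made-up sample row, for illustration only.
employee = parse_employee("Alice|555-0100:555-0101|12 Main St:Springfield")

# Python lists are 0-based, so index 0 is the first phone number here.
print(employee["name"], employee["phones"][0], employee["address"]["city"])
```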

Set command to get/set session level job-conf settings

You can set job-conf properties at the session level, so that they will be used by future queries in this session, e.g.

Check the # of reducers parameter:

jsqsh> set -v mapred.reduce.tasks;

If we know that a very large amount of data needs to be processed by reducers, then increase the # of reducers:

jsqsh> set mapred.reduce.tasks = 4;

Check back to ensure that the setting is now in effect:

jsqsh> set -v mapred.reduce.tasks;



Print all settings that are applicable to this session

jsqsh> set -v;

Print the settings that you have manually set in this session

jsqsh> set;
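One way to think about the difference between the two listings above is as a session layer stacked on top of server defaults: lookups check the session layer first, then fall back to the defaults. The Python sketch below models that idea with collections.ChainMap. It is a conceptual picture only, not how Big SQL stores settings, and the default values shown are invented.

```python
from collections import ChainMap

# Invented default values, for illustration only.
server_defaults = {"mapred.reduce.tasks": "1", "dfs.umaskmode": "022"}
session_overrides = {}

# Lookups consult session_overrides first, then server_defaults.
effective = ChainMap(session_overrides, server_defaults)

# Comparable to: set mapred.reduce.tasks = 4;
session_overrides["mapred.reduce.tasks"] = "4"

# Comparable to: set -v mapred.reduce.tasks;
print(effective["mapred.reduce.tasks"])

# Comparable to: set -v;  (everything applicable to the session)
print(dict(effective))

# Comparable to: set;  (only what was set manually in this session)
print(session_overrides)
```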

----------- End of lab -------------
