New Data Lake – Work with BDCS-CE (Notebooks, Object Storage/HDFS, Spark, and Spark SQL)
Before You Begin
Purpose
In this tutorial, you learn how to get started with your new Big Data Cloud Service - Compute Edition (BDCS-CE) instance. You first learn how to work with the Notebook, and then use the Notebook to work with Object Storage and HDFS and with Spark and Spark SQL.
There are five sections in this tutorial:
Importing Notes
Tutorial 1 – Notebook Basics
Tutorial 2 - Setting up your BDCSCE Environment
Tutorial 3 – Working with the Object Storage and HDFS
Tutorial 4 – Working with Spark and Spark SQL
Time to Complete
60 minutes
Background
Notebooks are used to explore and visualize data in an iterative fashion. Oracle Big Data Cloud Service -
Compute Edition uses Apache Zeppelin as its notebook interface and coding environment. Information
about Zeppelin can be found here: https://zeppelin.apache.org/ . To see examples of notes created and
shared by other Zeppelin users, see https://www.zeppelinhub.com/viewer .
What Do You Need?
Before starting this tutorial, you should have:
A running BDCS-CE cluster
BDCS-CE account credentials or Big Data Cluster Console direct URL (for example:
https://xxx.xxx.xxx.xxx:1080/)
BDCS-CE cluster login credentials
The Object Store credentials you specified when you created the BDCS-CE instance
The note files to import, downloaded from here. Unzip the file into a folder on your computer
and make a note of the folder path. After the download finishes, you should have the following
note files:
Tutorial 1 Notebook Basics 1721.json
Tutorial 2 Setting up your BDCSCE Environment 1721.json
Tutorial 3 Working with the Object Store and HDFS 1721.json
Tutorial 4 Introduction to Spark and Spark SQL 1721.json
Context
This tutorial is part of the New Data Lake series of the Oracle Big Data Journey. The sequence to follow is:
Module 1: New Data Lake Overview
Module 2: Sign up for an Oracle Cloud Trial, Create Object Storage Instance, and Create Big Data
Cloud Service - Compute Edition (BDCS-CE) Instance
Module 3: Work with BDCS-CE (Notebooks, Object Storage/HDFS, Spark, and Spark SQL)
Module 4: Create Event Hub Cloud Service (OEHCS) Instance
Module 5: Work with OEHCS and Spark Streaming
Importing Notes
Log in to the Oracle Cloud My Services web page and navigate to the Services page for your
BDCS-CE cluster.
On the Services page, click the Manage this Service icon for the cluster and then click Big Data
Cluster Console.
If a window titled Authentication Required appears, enter your BDCS-CE cluster administrator
user name and password and click Log In.
In the Big Data Cloud - Compute Edition Console, click Notebook.
The Notebook page is displayed, listing any notes created for this notebook.
In the Big Data Cloud - Compute Edition Console Notebook page, click Import Note.
The Import Note window is displayed. Click Browse.
Browse for or specify the file you want to import; the file must have a .json extension. You
downloaded these .json files earlier in the Notes.zip archive. Select Tutorial 1 Notebook Basics
1721.json and click OK.
The note is imported and listed in the list of notes.
Import the rest of the .json files you downloaded in the same way.
Click the link for the note, which is named …-Tutorial 1 Notebook Basics, to start the first tutorial.
Tutorial 1 – Notebook Basics
The paragraphs of the note are displayed.
Walk through the paragraphs one by one, reading their content as you get to them. The
paragraphs contain much useful information that is not reproduced in these instructions.
Interpreters enable Zeppelin users to mix and match various languages and data-processing backends in a
single platform. For example, to use Python code in Zeppelin, you can use the %pyspark interpreter. If you
are not familiar with Zeppelin, click the links to learn about it.
Read through the content in this paragraph as it introduces our first interpreter: Markdown.
Also explore the actions you can perform using the note and paragraph icons. You can:
Show and hide the code editor
Show and hide results
Clear results
Export the note
Use keyboard shortcuts
Bind interpreters and change the default interpreter
Click the Show editor icon at the top right corner of the paragraph.
Markdown is a plain text formatting syntax. It can be converted to HTML.
Read through the content in the code editor. Click the links in the paragraph to learn more about
Markdown Interpreter.
Add a new line, new item, at the bottom of the list, then click the Run icon at the top right corner of the
paragraph.
The new item is displayed in the list.
Here is a quick exercise before you continue.
Add your name in the paragraph and run it.
Here are two paragraphs about the shell interpreter (%sh). Read through the quick introduction in the first
one.
Read through the shell script in this paragraph.
Among other things, the script demonstrates a useful, repeatable trick for seeing the full output
when running shell commands. Note that there are two yum command lines; 2>&1 is appended to the
second one so that standard error is captured along with standard output.
Click the Run icon and check the result.
From the output, you can see that the yum command fails because you are not root. The error message is
visible only for the second yum command, where 2>&1 redirects standard error into standard output.
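The redirect is easy to reproduce with any command that writes to standard error; in this sketch, ls on a nonexistent path stands in for the failing yum command:

```shell
# ls on a missing path stands in for the failing yum command.
# Without the merge, the error goes to stderr and the pipe sees nothing:
ls /no/such/dir 2>/dev/null | wc -l    # prints 0
# With 2>&1, stderr is merged into stdout, so the pipe captures the error line:
ls /no/such/dir 2>&1 | wc -l           # prints 1
```

The same pattern works for any shell command run from a Zeppelin %sh paragraph.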
At the end of Tutorial 1, we introduce some tips, such as how to print notebooks. Follow the
steps described in the paragraphs to learn more about them.
After you complete the tutorial, click Notebook on the top to go back to the notebook list page.
Then click the note named …-Tutorial 2 Setting up your BDCSCE Environment to start the second
tutorial.
Tutorial 2 - Setting up your BDCSCE Environment
Read through the introduction in this paragraph.
You will learn how to set up the BDCSCE environment.
Note that you need to run all of the paragraphs one at a time in order from top to bottom because some
steps are performed manually.
First, follow the steps in the Enabling SSH Network Access paragraph.
Then, follow the steps in the following paragraphs:
Connecting to your BDCS-CE Zeppelin Server via SSH
Setting up the zeppelin user with sudo access
Configuring yum and pip
Read through the command script in the Commands to setup yum and pip paragraph.
Yum provides automatic updates and package and dependency management on RPM-based Linux distributions.
(From Wikipedia)
Pip is a package management system used to install and manage software packages written in Python.
(From Wikipedia)
Because you have now set up the zeppelin user with sudo access, the scripts can use sudo.
Click Run. The blue bar in the middle shows the progress of the running paragraph.
Read through the command script in the Installing the swift object store command line utility paragraph.
Oracle Object Storage supports the industry standard OpenStack Object Storage API. This OpenStack
Object Storage API is known as "swift". BDCS-CE includes swift drivers that work with Spark and Hadoop
so that you can use those tools with Object Store data.
Click Run.
Click Notebook on the top to go back to the notebook list page and proceed to the next tutorials.
Tutorial 3 – Working with the Object Storage and HDFS
Read through the quick introduction of Tutorial 3 – Working with the Object Store and HDFS. Ensure
that you have run Tutorial 2 first as this tutorial requires it.
Data can be placed in Object Storage, in HDFS, and on the local file system, so in this tutorial you will
experiment with moving data between them.
Read through the content in the paragraph to learn more about Oracle Storage Cloud Object Store.
Here is a brief picture of why Object Storage is important.
With Object Storage, we can detach compute from storage, allowing the two environments to grow
independently. It gives us a great way to scale compute and storage separately from each other.
We might have massive data yet only need to work on some of it some of the time; Object Storage lets us
keep more data there than we need to keep in the BDCS-CE environment.
We can maintain a core, distribution-based environment while being able to use the latest and greatest
Hadoop projects on demand. We can have different compute clusters all working against the same
object store. We might have data in the object store that is used by three or four groups in the
organization; the groups can have their own independent BDCS-CE environments yet work on the
same data.
We can also persist all the data in a low-cost, globally distributed store, which speeds processes up while
making the data more durable.
Configuring your Object Store Credentials
This paragraph sets up a shell script (swift_env.sh) that stores your Storage Cloud credentials to simplify
running the swift command later. Fill in your object store credentials before you run the
paragraph. Note that your credentials might differ from those shown in the screenshot.
Click Run to test Storage Cloud connectivity by listing the Object Store containers with the swift list
command.
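A credentials script like this typically just exports the environment variables the swift client reads. Here is a minimal, hypothetical sketch; every value is a placeholder, and the exact URL and user formats for your account are the ones shown in the notebook paragraph:

```shell
# Hypothetical sketch of swift_env.sh; all values are placeholders.
# Use the exact formats shown in the tutorial paragraph for your account.
export ST_AUTH="https://<identity-domain>.storage.oraclecloud.com/auth/v1.0"  # auth endpoint
export ST_USER="<identity-domain>:<username>"                                  # storage user
export ST_KEY="<password>"                                                     # storage password
```

With these variables sourced (source swift_env.sh), commands such as swift list work without repeating credentials on every invocation.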
Download some sample data to experiment with. In this example, you are going to download the text of a
handful of United States Presidential Inauguration speeches. Click the link to the Yale Law School Avalon
Project website to see the data source if you like.
Read through the command script in the code editor. You will install the lynx browser to help with the
downloading. Then you will download five speeches from the Yale Law School Avalon Project website.
Click Run.
The five speeches are downloaded and listed in the result. These files are downloaded to the local Linux
file system on the BDCS-CE server.
Enter the name of your Object Storage container, then click Run to upload the five speeches to Object
Storage. If you forget the container name, you can run the How to List Containers in the Object Store
paragraph.
The swift upload … command is used here to copy a file from the local Linux file system of the BDCS-CE server
to Object Storage.
You can list the containers in your Object Store and select one of them for the five speeches.
You can list the speeches in the container after uploading. Make sure you see the presidential speeches in
the container.
Read through the script in the code editor to learn how to download files from the Object Store to
the Linux file system. The swift download … command is used here. Enter the name of the container and click Run.
Read through the script in the code editor to learn how to upload and download files between the
Zeppelin server’s Linux file system and BDCS-CE’s HDFS file system. The hadoop fs -put/-get commands and
their parameters are used here. Click Run.
Read through the script in the code editor to learn how to copy files from the Object Store to
BDCS-CE’s HDFS file system.
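Taken together, Tutorial 3 exercises four basic data movements. As a hedged reference sketch, here are the command shapes involved; the container and path names are placeholders, and the commands are echoed rather than executed so the snippet runs anywhere:

```shell
# Placeholders: replace with your own container name and paths.
CONTAINER=mycontainer
FILE=pres1861_lincoln.txt
# Command shapes used in this tutorial (echoed for reference):
echo "swift upload $CONTAINER $FILE"           # local FS      -> Object Store
echo "swift download $CONTAINER $FILE"         # Object Store  -> local FS
echo "hadoop fs -put $FILE /user/zeppelin/"    # local FS      -> HDFS
echo "hadoop fs -get /user/zeppelin/$FILE ."   # HDFS          -> local FS
```

The exact arguments in the notebook paragraphs may differ; treat this as a map of which tool moves data in which direction, not a runnable recipe.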
Click Run.
Click Notebook on the top to go back to the notebook list page and proceed to the next tutorials.
Tutorial 4 – Working with Spark and Spark SQL
Read through the quick introduction to Tutorial 4 – Working with Spark and Spark SQL. Ensure that you
have run the previous two tutorials first, as this tutorial depends on them.
This BDCS-CE version supplies Zeppelin interpreters for Spark (Scala), Spark (Python), and Spark SQL. This
tutorial gives you examples using all of them.
Check out the links to get basic knowledge about Spark and Spark SQL if you need to.
In Example 1, you run Scala Spark code to manipulate data from HDFS. It defines a Spark RDD (Resilient
Distributed Dataset) against a text file (pres1861_lincoln.txt) stored in HDFS. Then it runs a few actions
against the RDD, such as counting the number of lines, displaying the first line, and counting the number of lines
matching a given term, Constitution.
From the output, you can see that the text file has 364 lines, that 24 lines contain the word Constitution,
and that the first line is First Inaugural Address of Abraham Lincoln.
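For readers without a running cluster, the same three RDD actions can be sketched in plain Python over an in-memory list of lines. The lines below are stand-ins for the speech text (the real example builds the RDD from HDFS with sc.textFile):

```python
# Plain-Python sketch of Example 1's RDD actions; the lines below are
# stand-ins for pres1861_lincoln.txt, not the real speech text.
lines = [
    "First Inaugural Address of Abraham Lincoln",
    "...that the Constitution of the Union be perpetual...",
    "Fellow-citizens of the United States:",
]
print(len(lines))                                    # like rdd.count()
print(lines[0])                                      # like rdd.first()
print(sum(1 for l in lines if "Constitution" in l))  # like rdd.filter(...).count()
```

In Spark, each of these is a separate action that triggers a computation over the distributed data; the plain-Python version only mirrors the logic.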
Python is an important language for data scientists because of its easy-to-understand syntax and rich
ecosystem.
The second example implements the same logic as the first, just in Python instead of Scala.
Note that the %pyspark interpreter is used here.
The third example is a slight variation of the first. The only difference is that it uses data
(pres1933_fdr1.txt) on the Zeppelin server's Linux file system, not in HDFS.
Here is the output of running Example 3.
The fourth example is another variation of the first. The difference is that it works with data
stored in Object Storage, not in HDFS.
Run Example 4.
It works with data named pres1981_reagon1.txt stored in the Object Store.
The fifth example runs the classic word-count algorithm. You can choose whether to operate on HDFS,
Object Storage, or Linux file system data by commenting out the appropriate code. The result of the
word count is an RDD named wordCounts, which is also used in the following example.
Click Run.
From the output, you can see an array of (word, count) pairs.
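The word-count flow is: split each line into words (flatMap), pair each word with 1 (map), and sum the pairs per word (reduceByKey). A plain-Python sketch of that pipeline, with made-up stand-in lines and collections.Counter playing the reduceByKey role:

```python
from collections import Counter

# Stand-in lines; the notebook reads these from HDFS, Object Storage, or
# the local file system depending on which code you leave uncommented.
lines = ["we the people", "the people of the united states"]

# flatMap: split every line into individual words
words = [w for line in lines for w in line.split()]

# map + reduceByKey: pair each word with 1, then sum the counts per word
word_counts = Counter(words)
print(word_counts["the"])     # 3
print(word_counts["people"])  # 2
```

In Spark, the same steps run in parallel across partitions, with reduceByKey shuffling matching words to the same node before summing.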
The sixth example continues from the previous one. Specifically, it takes the wordCounts RDD and
converts it to a Spark DataFrame. Then it registers the new DataFrame as a temporary Spark SQL table
named wordcounts. Click the link in the paragraph if you want to review the Spark SQL programming
guide and learn more about the features of Spark SQL.
The seventh example continues from the previous one. Specifically, it provides two samples of
running Spark SQL against the wordcounts table.
It also demonstrates some of the features of Zeppelin's Spark SQL interpreter and its display and
visualization capabilities.
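The register-then-query pattern of Examples 6 and 7 can be illustrated without a cluster using Python's built-in sqlite3 as a stand-in for Spark SQL. The table name mirrors the tutorial's wordcounts table, but the data is made up and the API is of course not Spark's:

```python
import sqlite3

# sqlite3 stands in for Spark SQL here; the wordcounts table mirrors the
# temporary table registered in Example 6, with placeholder data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wordcounts (word TEXT, count INTEGER)")
conn.executemany(
    "INSERT INTO wordcounts VALUES (?, ?)",
    [("the", 120), ("constitution", 24), ("people", 31)],
)

# Analogous to a %sql paragraph: SELECT word, count FROM wordcounts
#                                ORDER BY count DESC LIMIT 2
top = conn.execute(
    "SELECT word, count FROM wordcounts ORDER BY count DESC LIMIT 2"
).fetchall()
print(top)  # [('the', 120), ('people', 31)]
```

The point of the pattern is the same in both systems: once the data is registered as a table, any SQL query (and, in Zeppelin, any chart built from its result) can be run against it.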
Click the Table icon in the toolbar to see the data in table format.
Click the Bar Chart icon in the toolbar to see the data as a bar chart.
Click settings alongside the toolbar if you want to adjust advanced settings.
The eighth example continues from the previous one. Specifically, it builds an RDD against all of the
speeches we stored in Object Storage. Then it performs a word count against the RDD and converts
the result into a DataFrame, which we register as a Spark SQL temporary table named filewordsraw.
Click Run.
The chart shows the number of words per speech.
Note that we use the explode function in the SQL statement to create a new row for each element in the
words array.
The chart shows, per speech, the counts of selected words (america, constitution, nation, rights, freedom,
and people).
The final example continues from the previous one. Specifically, it defines a new DataFrame based on
one of the sample SQL statements and writes that DataFrame back to the Object Store (as a JSON file
named pres_wordcount.json). Then it reads the JSON data back from the Object Store into a new
DataFrame. The repartition(1) call ensures that we write a single output file, which makes sense since we
know the output is small.
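The write-then-read round trip can be sketched with Python's json module. Spark's DataFrame JSON writer emits one JSON object per line, which is what this sketch mimics; the rows are made-up placeholders for the (file, word count) pairs:

```python
import json

# Made-up rows standing in for the DataFrame of (file, word count) pairs.
rows = [
    {"file": "pres1861_lincoln.txt", "count": 364},
    {"file": "pres1933_fdr1.txt", "count": 190},
]

# Write: one JSON object per line, as Spark's JSON writer does.
text = "\n".join(json.dumps(r) for r in rows)

# Read the line-delimited JSON back into a list of dicts.
back = [json.loads(line) for line in text.splitlines()]
print(back == rows)  # True
```

This line-delimited layout is why Spark can read each partition of a JSON dataset independently; with repartition(1), all the lines end up in a single part file.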
From the output, you can see the array of pairs of file name and word count.
Optional: list the contents of your Object Store container to see the structure of the saved DataFrame.
Note that your container name might differ from that in the screenshot. There is only one output file,
named “part-r-0000-…”.
Optional - Explore the Spark UI
When you use Spark (via Scala, Python, and/or SQL), you start a session with the Spark server. In many
situations it can be helpful to view the Spark UI for your session, and BDCS-CE provides easy access to the
Spark UI for your Zeppelin session.
To view it, follow the steps in the paragraph.
Want to Learn More?
Working with Notebook
Running a Batch Spark Job in a Big Data Cloud Service - Compute Edition Cluster
Get Started with Oracle Big Data Cloud Service - Compute Edition
Get Started with Oracle Storage Cloud Service