Spark SQL: First Steps of an Oracle Expert


2015 © Trivadis

BASEL BERN LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN

Big Data Spark SQL

Why do I need Spark SQL? First steps of an Oracle Expert

Author: Jan Ott, Trivadis AG


With more than 600 IT and subject-matter experts on site with you

12 Trivadis branches with more than 600 employees

200 Service Level Agreements

More than 4,000 training participants

Research and development budget: CHF 5.0 million / EUR 4.0 million

Financially independent and sustainably profitable

Experience from more than 1,900 projects per year with over 800 customers

[Map of locations: Hamburg, Düsseldorf, Frankfurt, Freiburg, München, Wien, Basel, Zürich, Bern, Lausanne, Stuttgart, Brugg]


Agenda

1. Introduction
2. First Steps in the Spark World
3. Project 1
4. Summary


Introduction

§ A few words about Big Data
  § Big Data
  § Hadoop
  § Why Spark, Spark SQL?
§ Spark SQL: my first steps
  § Get some data into Hadoop
  § Tables in Spark - Hive
  § Use SQL
  § Miscellaneous
§ Project: Twitter


Big Data: Introduction

§ Big Data
  § Turning Data into Insights
§ Hadoop and its Zoo
  § HDFS - MapReduce
  § SQL - Impala, HBase, Hive, …
  § Zookeeper
§ Spark and Spark SQL
§ NoSQL Databases
§ Architecture
  § LAMBDA


What is Spark

§ Apache Spark™ (Apache web headline) is a fast and general engine for large-scale data processing.
§ Spark (Wiki)
  § Cluster computing framework
  § Interface for programming entire clusters
  § Implicit data parallelism
  § Fault tolerance
  § An Apache open source project
  § Developed by UC Berkeley
§ Goal
  § Lightning-fast cluster computing
§ Performance
  § Faster: 10x on disk, 100x in memory (the Spark project's own claim, relative to Hadoop MapReduce)


What is Spark (2)

§ Spark parts
  § Core
  § SQL
  § MLlib: machine learning
  § Streaming
  § GraphX
§ Spark SQL and Hive
  § Working with structured data
  § SQL inside Spark programs
  § Hive metadata store
  § JDBC/ODBC


What is Spark (3)

§ Runs everywhere
  § Hadoop HDFS
  § Mesos
  § Cassandra
  § HBase
  § S3
  § …


What is Hadoop

§ A file system: HDFS
§ Based on papers from Google
§ Apache open source project
§ Goal
  § Fast
  § Handles huge amounts of data
  § Handles unstructured to fully structured data
  § Horizontally scalable
  § Reliable


Agenda

1. Introduction
2. First Steps in the Spark World: Spark SQL
3. Project 1
4. Summary


First Steps

§ Keep it simple
§ Get some data into Hadoop
§ Get some data into Spark - Hive
§ Java: keep it to a minimum
§ Keep the data small
§ Get an environment that is already set up
  § Oracle VM: Big Data Lite
  § Pick one way to get the data into Spark - Hive
§ See SQL on an HDFS system with Spark


Prerequisite: Environment

§ Oracle Big Data Lite
  § VM
  § Version 4.4.0
  § http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
§ Contains
  § Oracle Database 12c (12.1.0.2)
  § Cloudera's Distribution including Apache Hadoop (CDH 5.5.1)
  § Hadoop 2.6.0
  § Hive 1.1.0
  § Spark 1.5
  § Oracle SQL Developer 4.1.3
§ Oracle VirtualBox


Information about the VM

§ Login
  § oracle/welcome1
§ Start here
  § file:///home/oracle/GettingStarted/StartHere.html
§ Start
  § Oracle DB
  § Hive
  § Spark
§ You're done preparing


The Steps: simple, focused

[Diagram; surviving labels: SQL Query, Hive Table]

Step 1: Data, 2 files

§ t_10.txt and t_us_cities.lst
  § Comma delimited
  § Flat file
  § Format the dates so they fit the standard date format
    - YYYY-MM-DD HH24:MI:SS.XXXX
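The date clean-up can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the talk: the slides do not show how the dates looked before reformatting, so the input layout DD.MM.YYYY and the helper name to_standard_timestamp are hypothetical. The output follows the YYYY-MM-DD HH24:MI:SS pattern named above, without the fractional part.

```python
from datetime import datetime

# Assumption: the raw files carry dates as DD.MM.YYYY. The slides do not say,
# so adjust SOURCE_FORMAT to whatever the real export uses.
SOURCE_FORMAT = "%d.%m.%Y"

# Target layout: YYYY-MM-DD HH24:MI:SS (24-hour clock, no fractional seconds).
TARGET_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_standard_timestamp(raw: str, source_format: str = SOURCE_FORMAT) -> str:
    """Parse a date string in source_format and return it in the target layout."""
    parsed = datetime.strptime(raw.strip(), source_format)
    return parsed.strftime(TARGET_FORMAT)


print(to_standard_timestamp("02.02.1968"))  # 1968-02-02 00:00:00
```

Doing this conversion once, before the files are loaded, keeps the date columns usable as timestamps rather than plain strings.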


Step 1: Data

t_us_cities.lst

New York      New York
Los Angeles   California
Chicago       Illinois
Houston       Texas
...

t_10

1  Hans     Meier     3000  1968-02-02 00:00:00  2000-01-01 00:00:00  1
2  Stefan   Müller    5000  1970-10-15 00:00:00  2001-07-01 00:00:00  1
3  Susanne  Kieser    3500  1972-03-14 00:00:00  2005-05-01 00:00:00  2
4  Paul     Steiner   4000  1960-07-28 00:00:00  2000-01-01 00:00:00  2
5  Monika   Hausmann  7000  1975-03-29 00:00:00  2000-01-01 00:00:00  3
...
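To give a feel for the kind of SQL that can be run once the data sits in a table, here is a small sketch over the five t_10 sample rows. It uses Python's built-in sqlite3 module purely as a stand-in so that it runs anywhere; it is not Spark SQL or Hive. The slides do not name the columns, so the column names (amount, group_id and so on) are hypothetical.

```python
import sqlite3

# The five sample rows from t_10. Column meanings are guesses; the slides
# show only the values.
ROWS = [
    (1, "Hans", "Meier", 3000, "1968-02-02 00:00:00", "2000-01-01 00:00:00", 1),
    (2, "Stefan", "Müller", 5000, "1970-10-15 00:00:00", "2001-07-01 00:00:00", 1),
    (3, "Susanne", "Kieser", 3500, "1972-03-14 00:00:00", "2005-05-01 00:00:00", 2),
    (4, "Paul", "Steiner", 4000, "1960-07-28 00:00:00", "2000-01-01 00:00:00", 2),
    (5, "Monika", "Hausmann", 7000, "1975-03-29 00:00:00", "2000-01-01 00:00:00", 3),
]


def load_t10(conn: sqlite3.Connection) -> None:
    """Create the table with hypothetical column names and load the sample rows."""
    conn.execute(
        "CREATE TABLE t_10 (id INTEGER, first_name TEXT, last_name TEXT, "
        "amount INTEGER, date_1 TEXT, date_2 TEXT, group_id INTEGER)"
    )
    conn.executemany("INSERT INTO t_10 VALUES (?, ?, ?, ?, ?, ?, ?)", ROWS)


def amount_per_group(conn: sqlite3.Connection) -> list:
    """Plain SQL aggregation: row count and summed amount per group."""
    query = (
        "SELECT group_id, COUNT(*), SUM(amount) "
        "FROM t_10 GROUP BY group_id ORDER BY group_id"
    )
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
load_t10(conn)
for row in amount_per_group(conn):
    print(row)
```

The same SELECT is ordinary SQL, which is the point of the exercise: the query skills of a database specialist carry over, even though the engine underneath is different.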


DEMO

§ Oracle Big Data Lite VM
  § Spark - Scala - sqlContext
§ Google Cloud Big Data
  § Spark SQL CLI
  § Spark/Hive/Hadoop


Agenda

1. Introduction
2. First Steps in the Spark World
3. Project
4. Summary


Project 1: Figures

§ 400-500 million tweets per day
§ 1 tweet contains
  § Around 50 pieces of metadata
    - Geo-location
    - Re-tweets
    - Followers
  § That is about 2 A4 pages
§ Twitter Sample Stream
  § 1%
  § 4-5 million tweets per day
  § 50 tweets per second
§ 20 other streams with defined keywords
§ HDFS
  § 1 TB every 2 months, including replication
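The sample-stream figures follow from the daily totals by simple arithmetic, shown here as a short sketch. The function name sample_stream_rate is hypothetical; the inputs (400-500 million tweets per day, a 1% sample) are the numbers from the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def sample_stream_rate(total_per_day: int, sample_percent: int = 1):
    """Return (tweets per day, tweets per second) for a percentage sample."""
    per_day = total_per_day * sample_percent // 100
    per_second = per_day / SECONDS_PER_DAY
    return per_day, per_second


for total in (400_000_000, 500_000_000):
    per_day, per_second = sample_stream_rate(total)
    print(f"{total:,} tweets/day -> sample: {per_day:,}/day, {per_second:.1f}/s")
```

A 1% sample of 400-500 million tweets is 4-5 million per day, or roughly 46 to 58 per second, which matches the rounded figure of 50 tweets per second.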


The Lambda Architecture - adopted

[Architecture diagram; recoverable labels:
 Consumer layer: Twitter API, Java app, Messaging (Kafka)
 Batch layer: All Data (HDFS), Batch (re)compute (MapReduce), Pre-computed Views
 Speed layer: Process Stream, Realtime Increment, Incremented Views
 Serving layer: Batch views and Realtime views, each holding QFD 1, QFD 2, … QFD n
 Query & Merge, REST, Client Web App
 Technologies shown: Hadoop, Storm, Impala, Cassandra
 QFD = Query Focused Data]
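The "Query & Merge" step of a Lambda architecture can be illustrated with a toy example: a query combines a pre-computed batch view with the increments the speed layer has collected since the last batch run. This is a minimal sketch with made-up hashtag counts and a hypothetical function name merge_views; it does not reproduce the project's actual implementation.

```python
from collections import Counter


def merge_views(batch_view: dict, realtime_view: dict) -> dict:
    """Add the real-time increments on top of the pre-computed batch counts."""
    merged = Counter(batch_view)
    merged.update(realtime_view)  # Counter.update adds counts per key
    return dict(merged)


# Hypothetical query-focused data: tweet counts per hashtag.
batch_view = {"#spark": 1200, "#hadoop": 800}  # result of the last batch run
realtime_view = {"#spark": 15, "#hive": 3}     # arrived since that run

print(merge_views(batch_view, realtime_view))
# {'#spark': 1215, '#hadoop': 800, '#hive': 3}
```

When the next batch run finishes, its view replaces the old one and the real-time increments it now covers can be discarded, so errors in the speed layer do not accumulate.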


Project 3

§ Live streams

§ Batch computation

§ Scalable

§ Open for new sources


Agenda

1. Introduction
2. First Steps in the Spark World
3. Project 1
4. Summary


Summary

§ A new world
§ Spark, Hive, Hadoop and … it's a zoo
  § VM Oracle Big Data Lite - CDH 5.5.1 - Spark 1.5 - Spark SQL CLI does not run
  § VM Cloudera - CDH 5.5.0.2 - Spark 1.5 - Spark SQL CLI not installed
  § Installing it myself into these VMs … not a good idea
§ Google - Version 1 - Spark 1.6 contains Spark SQL CLI
§ Lots can be done with an RDBMS
§ Start to collect now


Why Spark SQL

§ SQL
  § Known
  § Analysts can use it
§ JDBC
  § Various tools can connect and use it
§ No programming needed
§ Speed!
  § Ad hoc
  § Batch
§ It is in memory - no limit - spills to disk


Questions?

THANK YOU.

Trivadis AG

Jan Ott

Sägereistrasse 29
CH-8152 Glattbrugg-Zurich

Tel. +41-44-808 7020 (reception)
Fax +41-44-808 7021

info@trivadis.com
www.trivadis.com


Session Feedback: now

TechEvent February 2016

§ Please use the Trivadis TechEvent mobile app to give session feedback
§ Use "My schedule" if you registered for this session
§ Otherwise use "Agenda" and the search function
§ If the mobile app does not work (or if you have a Windows Phone), use your mobile browser
  § URL: https://trivadis16.quickmobile.center
  § Username: <your_loginname> (like svv)
  § Password: sent by mail...


Sources

§ Spark
  § https://spark.apache.org
§ Oracle VM: Big Data Lite
  § http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
§ Books
  § Big Data (MEAP) by Nathan Marz
  § Spark Cookbook by Rishi Yadav
  § Learning Spark by Matei Zaharia, Patrick Wendell, Andy Konwinski, Holden Karau
§ Pictures
  § Oracle.com
  § Twitter.com
  § Apache.com
  § Cloudera.com
