Big Data: Spark SQL
Why do I need Spark SQL? First steps of an Oracle expert
Author: Jan Ott, Trivadis AG
2015 © Trivadis
BASEL BERN LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN


With over 600 IT and subject-matter experts on site near you

12 Trivadis branches with over 600 employees
200 Service Level Agreements
More than 4,000 training participants
Research and development budget: CHF 5.0 million / EUR 4.0 million
Financially independent and sustainably profitable
Experience from more than 1,900 projects per year at over 800 customers

Locations: Hamburg, Düsseldorf, Frankfurt, Freiburg, München, Wien, Basel, Zürich, Bern, Lausanne, Stuttgart, Brugg

Agenda

1. Introduction
2. First Steps in the Spark World
3. Project 1
4. Summary

Introduction

§ A few words about Big Data
  § Big Data
  § Hadoop
  § Why Spark, Spark SQL?
§ Spark SQL – my first steps
  § Get some data into Hadoop
  § Tables in Spark - Hive
  § Use SQL
  § Miscellaneous
§ Project – Twitter

Big Data: Introduction

§ Big Data
  § Turning Data into Insights
§ Hadoop and its Zoo
  § HDFS – MapReduce
  § SQL – Impala, HBase, Hive, …
  § ZooKeeper
§ Spark and Spark SQL
§ NoSQL Databases
§ Architecture
  § LAMBDA

What is Spark

§ Apache Spark™ (Apache web headline) is a fast and general engine for large-scale data processing.
§ Spark (Wiki)
  § Cluster computing framework
  § Interface for programming entire clusters
  § Implicit data parallelism
  § Fault tolerance
  § An Apache open source project
  § Originally developed at UC Berkeley
§ Goal
  § Lightning-fast cluster computing
§ Performance
  § Up to 10x faster than Hadoop MapReduce on disk – up to 100x in memory (as claimed on the Apache Spark website)

What is Spark (2)

§ Spark parts
  § Core
  § SQL
  § MLlib – machine learning
  § Streaming
  § GraphX
§ Spark SQL and Hive
  § Working with structured data
  § SQL inside Spark programs
  § Hive metadata store
  § JDBC/ODBC

What is Spark (3)

§ Runs everywhere
  § Hadoop HDFS
  § Mesos
  § Cassandra
  § HBase
  § S3
  § …

What is Hadoop

§ A file system – HDFS
§ Based on papers from Google
§ Apache open source project
§ Goal
  § Fast
  § Handles huge amounts of data
  § Handles unstructured to fully structured data
  § Horizontally scalable
  § Reliable

Agenda

1. Introduction
2. First Steps in the Spark World – Spark SQL
3. Project 1
4. Summary

First Steps

§ Keep it simple
§ Get some data into Hadoop
§ Get some data into Spark - Hive
§ Java – keep it to a minimum
§ Keep the data small
§ Get an environment that is already set up
  § Oracle VM – Big Data Lite
  § Pick one way to get the data into Spark - Hive
§ See SQL on an HDFS system with Spark

Prerequisite – Environment

§ Oracle Big Data Lite
  § VM
  § Version 4.4.0
  § http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
§ Contains
  § Oracle Database 12c (12.1.0.2)
  § Cloudera's Distribution including Apache Hadoop (CDH 5.5.1)
  § Hadoop 2.6.0
  § Hive 1.1.0
  § Spark 1.5
  § Oracle SQL Developer 4.1.3
§ Oracle VirtualBox

Information about the VM

§ Login
  § oracle/welcome1
§ Start here
  § file:///home/oracle/GettingStarted/StartHere.html
§ Start
  § Oracle DB
  § Hive
  § Spark
§ You're done preparing

The Steps – simple – focus

[Diagram: SQL Query, Hive Table]

Step 1 – Data – 2 files

§ t_10.txt and t_us_cities.lst
  § Comma delimited
  § Flat file
  § Format the date so it fits the standard date format
    - YYYY-MM-DD HH24:MI:SS.XXXX
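
The date-formatting step could look roughly like the Python sketch below. It rests on assumptions: the slides do not say how the source dates were originally written, so a `DD.MM.YYYY` input is assumed here, and the helper name `to_standard_timestamp` is hypothetical. The output uses `YYYY-MM-DD HH:MM:SS`, the form seen in the sample rows; the fractional-seconds part of the mask is left out.

```python
from datetime import datetime

# Assumed input format (not stated in the slides): day.month.year, e.g. "02.02.1968".
SOURCE_FORMAT = "%d.%m.%Y"
# Target format: YYYY-MM-DD HH:MM:SS on a 24-hour clock, without fractional seconds.
TARGET_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_standard_timestamp(raw: str) -> str:
    """Convert a date string in SOURCE_FORMAT to the target timestamp string."""
    parsed = datetime.strptime(raw.strip(), SOURCE_FORMAT)
    return parsed.strftime(TARGET_FORMAT)


print(to_standard_timestamp("02.02.1968"))
```

A date without a time component ends up at midnight, which matches the `00:00:00` values in the sample data.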

Step 1 – Data

t_us_cities.lst

New York New York
Los Angeles California
Chicago Illinois
Houston Texas
...

t_10

1 Hans Meier 3000 1968-02-02 00:00:00 2000-01-01 00:00:00 1
2 Stefan Müller 5000 1970-10-15 00:00:00 2001-07-01 00:00:00 1
3 Susanne Kieser 3500 1972-03-14 00:00:00 2005-05-01 00:00:00 2
4 Paul Steiner 4000 1960-07-28 00:00:00 2000-01-01 00:00:00 2
5 Monika Hausmann 7000 1975-03-29 00:00:00 2000-01-01 00:00:00 3
...
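
To get a feel for "SQL on a table built from a flat file" without a cluster, the five sample rows can be loaded and queried locally. This is only a sketch: Python's built-in `sqlite3` stands in for Hive and Spark SQL (in the demo the query would go through Spark's `sqlContext`), the rows are written comma-delimited as the slides describe, and all column names are hypothetical because the slides show only values.

```python
import csv
import io
import sqlite3

# The five sample rows of t_10, comma-delimited as described on the slide.
T_10_CSV = """\
1,Hans,Meier,3000,1968-02-02 00:00:00,2000-01-01 00:00:00,1
2,Stefan,Müller,5000,1970-10-15 00:00:00,2001-07-01 00:00:00,1
3,Susanne,Kieser,3500,1972-03-14 00:00:00,2005-05-01 00:00:00,2
4,Paul,Steiner,4000,1960-07-28 00:00:00,2000-01-01 00:00:00,2
5,Monika,Hausmann,7000,1975-03-29 00:00:00,2000-01-01 00:00:00,3
"""


def load_t_10(conn: sqlite3.Connection, csv_text: str) -> None:
    """Create the table and load the comma-delimited rows into it."""
    # Column names are hypothetical; the slides do not name the columns.
    conn.execute(
        """CREATE TABLE t_10 (
               id INTEGER, first_name TEXT, last_name TEXT, salary INTEGER,
               birth_date TEXT, hire_date TEXT, dept_id INTEGER)"""
    )
    rows = list(csv.reader(io.StringIO(csv_text)))
    conn.executemany("INSERT INTO t_10 VALUES (?, ?, ?, ?, ?, ?, ?)", rows)


def salary_by_department(conn: sqlite3.Connection) -> list:
    """Return (dept_id, employee count, average salary) per department."""
    query = """SELECT dept_id, COUNT(*), AVG(salary)
               FROM t_10 GROUP BY dept_id ORDER BY dept_id"""
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
load_t_10(conn, T_10_CSV)
result = salary_by_department(conn)
print(result)
```

The same `SELECT ... GROUP BY` statement is plain SQL, which is the point of the talk: once the file is exposed as a table, an analyst needs nothing beyond SQL.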

DEMO

§ Oracle Big Data Lite VM
  § Spark – Scala - sqlContext
§ Google Cloud Big Data
  § Spark SQL – CLI
  § Spark/Hive/Hadoop

Agenda

1. Introduction
2. First Steps in the Spark World
3. Project
4. Summary

Project 1 – Figures

§ 400–500 million tweets per day
§ 1 tweet contains
  § Around 50 pieces of metadata
    - Geo-location
    - Re-tweets
    - Followers
  § That is about 2 A4 pages
§ Twitter Sample Stream
  § 1%
  § 4–5 million tweets per day
  § 50 tweets per second
§ 20 other streams with defined keywords
§ HDFS
  § 1 TB every 2 months including replication
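
The sample-stream figures follow from simple arithmetic, which the short Python check below reproduces. The function name `sample_stream_volume` is made up for this illustration; the inputs are the numbers from the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def sample_stream_volume(tweets_per_day: int, sample_percent: int = 1):
    """Return (tweets per day, tweets per second) for a sampled stream."""
    sampled_per_day = tweets_per_day * sample_percent // 100
    return sampled_per_day, sampled_per_day / SECONDS_PER_DAY


# Lower and upper bound of the total daily volume quoted on the slide.
low_day, low_sec = sample_stream_volume(400_000_000)
high_day, high_sec = sample_stream_volume(500_000_000)
print(f"{low_day:,} to {high_day:,} tweets per day")
print(f"{low_sec:.1f} to {high_sec:.1f} tweets per second")
```

A 1% sample of 400 to 500 million tweets gives 4 to 5 million per day, or roughly 46 to 58 per second, which is consistent with the rounded figure of 50 tweets per second.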

The Lambda Architecture - adopted

[Diagram: Lambda architecture. Labels recovered from the figure:]
§ Layers: batch layer, speed layer, serving layer, consumer layer
§ Batch path: All Data (HDFS), batch (re)compute (MapReduce), pre-computed views, batch views (QFD 1, QFD 2, … QFD n)
§ Speed path: process stream, realtime increment, incremented views, realtime views (QFD 1, QFD 2, … QFD n)
§ Query & merge, REST, client web app
§ Messaging: Kafka
§ Sources: Twitter API, Java app
§ Technologies: Hadoop, Storm, Impala, Cassandra
§ QFD = Query Focused Data
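
The "query & merge" step of the diagram can be illustrated in a few lines of Python. This is a sketch under assumptions, not the project's implementation: the views are modeled as plain dictionaries of counts, the hashtag numbers are invented, and `merge_views` is a hypothetical helper name.

```python
from collections import Counter


def merge_views(batch_view: dict, realtime_view: dict) -> dict:
    """Combine a pre-computed batch view with the realtime increments."""
    # The batch view covers everything up to the last batch run; the realtime
    # view holds only what arrived since then, so the counts simply add up.
    return dict(Counter(batch_view) + Counter(realtime_view))


# Invented hashtag counts for one query-focused dataset (QFD).
batch_view = {"#spark": 1200, "#hive": 340}
realtime_view = {"#spark": 15, "#kafka": 4}
merged = merge_views(batch_view, realtime_view)
print(merged)
```

When the batch layer recomputes its views, the matching realtime increments can be discarded, which keeps the speed layer small.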

Project 3

§ Live streams
§ Batch computation
§ Scalable
§ Open for new sources

Agenda

1. Introduction
2. First Steps in the Spark World
3. Project 1
4. Summary

Summary

§ A new world
§ Spark, Hive, Hadoop and … it's a zoo
  § VM Oracle Big Data Lite – CDH 5.5.1 – Spark 1.5 – Spark SQL CLI does not run
  § VM Cloudera – CDH 5.5.0.2 – Spark 1.5 – Spark SQL CLI not installed
  § Installing it myself into these VMs … not a good idea
§ Google – Version 1 – Spark 1.6 contains the Spark SQL CLI
§ Lots can be done with an RDBMS
§ Start to collect now

Why Spark SQL

§ SQL
  § Known
  § Analysts can use it
§ JDBC
  § Various tools can connect and use it
§ No programming needed
§ Speed!
  § Ad hoc
  § Batch
§ It is IN MEMORY – no limit – spills to disk

Questions?

THANK YOU.

Trivadis AG
Jan Ott
Sägereistrasse 29
CH-8152 Glattbrugg-Zurich
Tel. +41-44-808 7020 (reception)
Fax +41-44-808 7021
[email protected]

Session Feedback – now

§ Please use the Trivadis TechEvent mobile app to give session feedback
§ Use "My schedule" if you registered for this session
§ Otherwise use "Agenda" and the search function
§ If the mobile app does not work (or if you have a Windows Phone), use your mobile browser
  § URL: https://trivadis16.quickmobile.center
  § Username: <your_loginname> (like svv)
  § Password: sent by mail...

TechEvent February 2016

Sources

§ Spark
  § https://spark.apache.org
§ Oracle VM – Big Data Lite
  § http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
§ Books
  § Big Data – MEAP by Nathan Marz
  § Spark Cookbook by Rishi Yadav
  § Learning Spark by Matei Zaharia, Patrick Wendell, Andy Konwinski, Holden Karau
§ Pictures
  § Oracle.com
  § Twitter.com
  § Apache.org
  § Cloudera.com