2015 © Trivadis
BASEL BERN LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN
Big Data Spark SQL
Why do I need Spark SQL? First steps of an Oracle expert
Author: Jan Ott – Trivadis AG
With over 600 IT and domain experts on site with you
12 Trivadis branch offices with over 600 employees
200 Service Level Agreements
More than 4,000 training participants
Research and development budget: CHF 5.0 million / EUR 4.0 million
Financially independent and sustainably profitable
Experience from more than 1,900 projects per year at over 800 customers
[Figure: map of locations: Hamburg, Düsseldorf, Frankfurt, Freiburg, München, Wien, Basel, Zürich, Bern, Lausanne, Stuttgart, Brugg]
Agenda
1. Introduction
2. First Steps in the Spark World
3. Project 1
4. Summary
Introduction
§ A few words about Big Data
  § Big Data
  § Hadoop
  § Why Spark, Spark SQL?
§ Spark SQL – my first steps
  § Get some data into Hadoop
  § Tables in Spark - Hive
  § Use SQL
  § Miscellaneous
§ Project – Twitter
Big Data: Introduction
§ Big Data
  § Turning Data into Insights
§ Hadoop and its Zoo
  § HDFS – MapReduce
  § SQL – Impala, HBase, Hive, …
  § Zookeeper
§ Spark and Spark SQL
§ NoSQL Databases
§ Architecture
  § LAMBDA
What is Spark
§ Apache Spark™ (Apache web headline) is a fast and general engine for large-scale data processing.
§ Spark (Wiki)
  § Cluster computing framework
  § Interface for programming entire clusters
  § Implicit data parallelism
  § Fault tolerance
  § An Apache open source project
  § Originally developed at UC Berkeley
§ Goal
  § Lightning-fast cluster computing
§ Performance
  § Up to 10x faster on disk, up to 100x faster in memory (project claim, relative to Hadoop MapReduce)
What is Spark (2)
§ Spark Parts
  § Core
  § SQL
  § MLlib – machine learning
  § Streaming
  § GraphX
§ Spark SQL and Hive
  § Working with structured data
  § SQL inside Spark programs
  § Hive metadata store
  § JDBC/ODBC
What is Spark (3)
§ Runs everywhere
  § Hadoop HDFS
  § Mesos
  § Cassandra
  § HBase
  § S3
  § …
What is Hadoop
§ A file system – HDFS
  § Based on papers from Google
  § Apache open source project
§ Goal
  § Fast
  § Handles huge amounts of data
  § Handles unstructured to fully structured data
  § Horizontally scalable
  § Reliable
Agenda
1. Introduction
2. First Steps in the Spark World – Spark SQL
3. Project 1
4. Summary
First Steps
§ Keep it simple
§ Get some data into Hadoop
§ Get some data into Spark - Hive
§ Java – keep it to a minimum
§ Keep the data small
§ Get an environment that is already set up
  § Oracle VM – Big Data Lite
  § Pick one way to get the data into Spark - Hive
§ See SQL on an HDFS system with Spark
Prerequisite – Environment
§ Oracle Big Data Lite
  § VM
  § Version 4.4.0
  § http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
§ Contains
  § Oracle Database 12c (12.1.0.2)
  § Cloudera's Distribution including Apache Hadoop (CDH 5.5.1)
  § Hadoop 2.6.0
  § Hive 1.1.0
  § Spark 1.5
  § Oracle SQL Developer 4.1.3
§ Oracle VirtualBox
Information about the VM
§ Login
  § oracle/welcome1
§ Start here
  § file:///home/oracle/GettingStarted/StartHere.html
§ Start
  § Oracle DB
  § Hive
  § Spark
§ You're done preparing
The Steps – simple – focus
[Figure: diagram of the steps; surviving labels: Hive table, SQL query]
Step 1 – Data – 2 files
§ t_10.txt and t_us_cities.lst
  § Comma delimited
  § Flat file
  § Format the date so it fits the standard date format
    - YYYY-MM-DD HH24:MI:SS.XXXX
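The date formatting step might look like the following minimal sketch. It assumes the source dates arrive as DD.MM.YYYY strings, which the slides do not state; the function name `to_standard_timestamp` is hypothetical. The output omits the fractional seconds, as the sample rows in the deck do.

```python
from datetime import datetime

# Target layout without fractional seconds: YYYY-MM-DD HH24:MI:SS
TARGET_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_standard_timestamp(value, input_format="%d.%m.%Y"):
    """Re-format a date string into the YYYY-MM-DD HH24:MI:SS layout.

    input_format is an assumption about the source file; adjust it to
    whatever layout the exported data really uses.
    """
    parsed = datetime.strptime(value, input_format)
    return parsed.strftime(TARGET_FORMAT)


birth_date = to_standard_timestamp("02.02.1968")
print(birth_date)
```

A date without a time component ends up with 00:00:00 as its time part.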
Step 1 – Data
t_us_cities.lst
New York     New York
Los Angeles  California
Chicago      Illinois
Houston      Texas
...
t_10
1  Hans     Meier     3000  1968-02-02 00:00:00  2000-01-01 00:00:00  1
2  Stefan   Müller    5000  1970-10-15 00:00:00  2001-07-01 00:00:00  1
3  Susanne  Kieser    3500  1972-03-14 00:00:00  2005-05-01 00:00:00  2
4  Paul     Steiner   4000  1960-07-28 00:00:00  2000-01-01 00:00:00  2
5  Monika   Hausmann  7000  1975-03-29 00:00:00  2000-01-01 00:00:00  3
...
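To make the "flat file, then table, then SQL" idea concrete without a cluster, here is a rough sketch that parses the t_10 sample rows and runs an aggregate query. Note the assumptions: Python's built-in sqlite3 stands in for Spark SQL / Hive purely so the example runs anywhere, the column names (id, first_name, last_name, salary, birth_date, hire_date, dept_id) are guesses because the slides do not name the columns, and the helper names are hypothetical.

```python
import csv
import io
import sqlite3

# The five sample rows from the slide, written as comma-delimited text.
SAMPLE = """\
1,Hans,Meier,3000,1968-02-02 00:00:00,2000-01-01 00:00:00,1
2,Stefan,Müller,5000,1970-10-15 00:00:00,2001-07-01 00:00:00,1
3,Susanne,Kieser,3500,1972-03-14 00:00:00,2005-05-01 00:00:00,2
4,Paul,Steiner,4000,1960-07-28 00:00:00,2000-01-01 00:00:00,2
5,Monika,Hausmann,7000,1975-03-29 00:00:00,2000-01-01 00:00:00,3
"""


def load_rows(text):
    """Parse comma-delimited lines into typed tuples."""
    rows = []
    for rec in csv.reader(io.StringIO(text)):
        emp_id, first, last, salary, born, hired, dept = rec
        rows.append((int(emp_id), first, last, int(salary), born, hired, int(dept)))
    return rows


def salary_by_dept(rows):
    """Load the rows into an in-memory table and aggregate with SQL."""
    con = sqlite3.connect(":memory:")
    # Column names are assumed for illustration only.
    con.execute(
        "CREATE TABLE t_10 (id INTEGER, first_name TEXT, last_name TEXT, "
        "salary INTEGER, birth_date TEXT, hire_date TEXT, dept_id INTEGER)"
    )
    con.executemany("INSERT INTO t_10 VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    result = con.execute(
        "SELECT dept_id, SUM(salary) FROM t_10 GROUP BY dept_id ORDER BY dept_id"
    ).fetchall()
    con.close()
    return result


rows = load_rows(SAMPLE)
totals = salary_by_dept(rows)
print(totals)
```

The SELECT statement is plain SQL, which is the point of the exercise: the same kind of query would be the starting point against a Hive table from Spark SQL, although the table definition and loading steps differ there.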
DEMO
§ Oracle Big Data Lite VM
  § Spark – Scala - sqlContext
§ Google Cloud Big Data
  § Spark SQL – CLI
  § Spark / Hive / Hadoop
Agenda
1. Introduction
2. First Steps in the Spark World
3. Project
4. Summary
Project 1 – Figures
§ 400 – 500 million tweets per day
§ 1 tweet contains
  § Around 50 metadata pieces
    - Geo-location
    - Re-tweets
    - Followers
  § That is about 2 A4 pages
§ Twitter Sample Stream
  § 1%
  § 4-5 million tweets per day
  § 50 tweets per second
§ 20 other streams with defined keywords
§ HDFS
  § 1 TB every 2 months including replication
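The sample-stream figures follow from simple arithmetic, sketched here as a quick check. The function name is hypothetical; the inputs are the numbers from the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def sample_stream_rate(tweets_per_day, sample_percent=1):
    """Return (sampled tweets per day, sampled tweets per second)."""
    sampled_per_day = tweets_per_day * sample_percent // 100
    return sampled_per_day, sampled_per_day / SECONDS_PER_DAY


low_day, low_sec = sample_stream_rate(400_000_000)
high_day, high_sec = sample_stream_rate(500_000_000)
print(low_day, round(low_sec, 1))
print(high_day, round(high_sec, 1))
```

A 1% sample of 400 to 500 million tweets gives 4 to 5 million tweets per day, which works out to somewhere between 46 and 58 tweets per second, in line with the rounded figure of 50 on the slide.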
The Lambda Architecture - adopted
[Figure: Lambda architecture diagram. Surviving labels: Twitter API, Java app, Messaging (Kafka); Batch layer: All Data (HDFS), Batch (re)compute (MapReduce), Pre-computed Views; Speed layer: Process Stream, Realtime Increment, Incremented Views; Serving layer: Batch views and Realtime views (QFD 1, QFD 2, …, QFD n), Query & Merge; Consumer layer: REST, Client Web App; technologies: Hadoop, Storm, Impala, Cassandra]
QFD = Query Focused Data
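The "Query & Merge" step of the diagram can be pictured as combining a pre-computed batch view with the increments from the realtime view. The following is a minimal sketch of that idea only, assuming both views hold tweet counts per keyword; the function name, keywords, and numbers are made up for illustration and are not taken from the project.

```python
from collections import Counter


def merge_views(batch_view, realtime_view):
    """Answer a query by adding realtime increments to the batch result."""
    merged = Counter(batch_view)
    merged.update(realtime_view)  # Counter.update adds counts per key
    return dict(merged)


# Hypothetical query-focused data: tweet counts per keyword.
batch_view = {"spark": 1200, "hadoop": 800}   # recomputed by the batch layer
realtime_view = {"spark": 15, "impala": 3}    # arrived since the last batch run

merged = merge_views(batch_view, realtime_view)
print(merged)
```

Once the batch layer recomputes its views, the corresponding realtime increments can be discarded, which is what keeps the speed layer small.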
Project 3
§ Live streams
§ Batch computation
§ Scalable
§ Open for new sources
Agenda
1. Introduction
2. First Steps in the Spark World
3. Project 1
4. Summary
Summary
§ A new world
§ Spark, Hive, Hadoop and … it's a zoo
  § VM Oracle Big Data Lite – CDH 5.5.1 – Spark 1.5 – Spark SQL CLI does not run
  § VM Cloudera – CDH 5.5.0.2 – Spark 1.5 - Spark SQL CLI not installed
  § Installing it myself into these VMs … not a good idea
§ Google – Version 1 – Spark 1.6 contains Spark SQL CLI
§ Lots can be done with RDBMS
§ Start to collect now
Why Spark SQL
§ SQL
  § Known
  § Analysts can use it
§ JDBC
  § Various tools can connect and use it
§ No programming needed
§ Speed!
  § Ad hoc
  § Batch
§ It is IN MEMORY – no limit – spills to disk
Questions?
THANK YOU.
Trivadis AG
Jan Ott
Sägereistrasse 29
CH-8152 Glattbrugg-Zurich
Tel. +41-44-808 7020 (reception)
Fax +41-44-808 7021
Session Feedback – now
TechEvent February 2016
§ Please use the Trivadis TechEvent mobile app to give session feedback
§ Use "My schedule" if you registered for this session
§ Otherwise use "Agenda" and the search function
§ If the mobile app does not work (or if you have a Windows Phone) use your mobile browser
  § URL: https://trivadis16.quickmobile.center
  § Username: <your_loginname> (like svv)
  § Password: sent by mail...
Sources
§ Spark
  § https://spark.apache.org
§ Oracle VM – Big Data Lite
  § http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
§ Books:
  § Big Data – MEAP by Nathan Marz
  § Spark Cookbook by Rishi Yadav
  § Learning Spark by Matei Zaharia, Patrick Wendell, Andy Konwinski, Holden Karau
§ Pictures
  § Oracle.com
  § Twitter.com
  § Apache.com
  § Cloudera.com