spark-summit
Everywhere Defined
• 26B forecasts/day or 250,000/second
– vs 3.5B Google queries daily
• 2.2 billion unique locations
• 200k personal weather stations
• 200M active mobile users
• Petabytes of data generated daily
Our Brands
Over 30 Billion Served
Flight Routing
Energy Trading
Insurance
Weather Alerting
Decisions at Scale
Who Are You?
RDBMS?
Social Weather
RDBMS
SELECT count(*) FROM wx_reports GROUP BY time/300000*300000
[Diagram: Live, ETL, Reporting]
[Diagram: Live, Sqoop M/R, Reporting]
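The GROUP BY expression in the Social Weather query relies on integer arithmetic: dividing a millisecond timestamp by 300000 and multiplying back snaps it to the start of its 5-minute bucket. The sketch below shows that arithmetic in plain Scala, with no database or Spark involved. The object and function names are hypothetical, and it assumes non-negative timestamps and a database that performs integer division on integer columns (SQL dialects differ on this).

```scala
object FiveMinuteBuckets {
  // Bucket width in milliseconds: 5 minutes = 300000 ms, as in the slide's query.
  val BucketMs: Long = 300000L

  // Long division truncates, so multiplying back snaps a non-negative
  // timestamp down to the start of its 5-minute bucket.
  def bucketStart(timeMs: Long): Long = timeMs / BucketMs * BucketMs

  // In-memory analogue of
  //   SELECT count(*) FROM wx_reports GROUP BY time/300000*300000
  // returning bucket start -> number of reports in that bucket.
  def countsPerBucket(timesMs: Seq[Long]): Map[Long, Int] =
    timesMs.groupBy(bucketStart).map { case (bucket, ts) => bucket -> ts.size }
}
```

For example, `FiveMinuteBuckets.countsPerBucket(Seq(0L, 299999L, 300000L, 750000L))` puts the first two timestamps in bucket 0, the third in bucket 300000, and the last in bucket 600000.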
Scaling with Spark
Live
Reporting
Easing the Transition
101001110100101
101001110100101010100101011001101010101011100000011010110010
101001110100101010100101011001101010101011100000011010110010
10100,11101,0010101010,01010,1100110101,01010,1110000001,10101,...
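Batch Aggregation
// wx_reports is an RDD of case-class records; the loading step is elided in the slides
val wx_reports = // load data from database
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._ // enables wx_reports.toDF
wx_reports.toDF.registerTempTable("wx_reports")
// Spark SQL's / is fractional division, so floor() is needed to get 5-minute (300000 ms) buckets
val counts = sqlContext.sql("select count(*) from wx_reports group by floor(timestamp / 300000) * 300000")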
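Streaming Aggregation
import org.apache.spark.sql.SQLContext
val wx_reports = // load from streaming source
// Each micro-batch arrives as an RDD and is queried with the same SQL as in the batch case
wx_reports.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext) // reuse one context across batches
  import sqlContext.implicits._
  rdd.toDF.registerTempTable("wx_reports")
  val count = sqlContext.sql("select count(*) from wx_reports")
}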
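The batch and streaming snippets apply the same aggregation, once over the whole data set and once per micro-batch. The following sketch illustrates that reuse with plain Scala collections standing in for RDDs and streams. It is not Spark code: `WxReport`, its fields, and the function names are hypothetical and chosen only for illustration.

```scala
object MicroBatchCounts {
  // Hypothetical record type standing in for one weather report.
  final case class WxReport(stationId: String, timestampMs: Long)

  // The aggregation itself does not know whether it runs in batch or streaming mode.
  def countReports(reports: Seq[WxReport]): Long = reports.size.toLong

  // Batch mode: run the aggregation once over everything.
  def batch(all: Seq[WxReport]): Long = countReports(all)

  // Streaming mode (simulated): apply the same aggregation to each
  // micro-batch in arrival order, yielding one count per batch.
  def streaming(batches: Seq[Seq[WxReport]]): Seq[Long] = batches.map(countReports)
}
```

The design point is that only the driver (`batch` versus `streaming`) changes, while `countReports` is shared, which mirrors how the slides reuse one SQL query in both settings.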
Data Science Roles
Data Scientist
• Machine learning expert
• Builds pipelines that work on her laptop
Data Engineer
• Scalable algorithms expert
• Rewrites her pipelines to scale better
Collaborative Data Science
The Analytics OS
Notebooks
Stream Analytics
Batch Analytics
But…
The Real World (Enterprise Version)
The Real World (Startup Version)
Application MySQL
Step 1: Pick a Problem to Solve
Step 2: Build a Data Lake
Step 3: Set Up Spark
• Direct download
• Hadoop distribution (Hortonworks, Cloudera, etc.)
• Managed service (Elastic MapReduce, Databricks, BlueMix, etc.)
Step 4: Start Collecting Data
• Options:
– Sqoop to move RDBMS tables
– Flume/FluentD to move logs
– Import from Spark-supported data sources
– Using Spark Streaming attached to a queue
– …
Step 5: Use a Notebook
Final Thoughts
Thank You!
Robbie Strickland
@rs_atl
(we’re hiring!)