The Internet of Everywhere—How IBM The Weather Company Scales

Everywhere Defined

• 26B forecasts/day or 250,000/second
  – vs 3.5B Google queries daily
• 2.2 billion unique locations
• 200k personal weather stations
• 200M active mobile users
• Petabytes of data generated daily

Our Brands

Over 30 Billion Served

Flight Routing

Energy Trading

Insurance

Weather Alerting

Decisions at Scale

Who Are You?

RDBMS?

Social Weather

RDBMS

SELECT count(*) FROM wx_reports GROUP BY time/300000*300000
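The GROUP BY expression in the query above relies on integer division: dividing a timestamp by 300000 and multiplying back truncates it to the start of its window. This holds only in databases where `/` on integers is integer division. If `time` is an epoch timestamp in milliseconds (the slide does not say), each window is 5 minutes. A minimal plain-Scala sketch of the same arithmetic follows; `TimeBuckets`, `bucket`, and `countsPerBucket` are illustrative names, not part of the talk.

```scala
object TimeBuckets {
  // Width of one bucket; 300000 ms is 5 minutes if timestamps are epoch milliseconds (an assumption).
  val WidthMs: Long = 300000L

  // Integer division drops the remainder, so multiplying back yields the bucket's start time.
  // Intended for non-negative timestamps.
  def bucket(ts: Long, widthMs: Long = WidthMs): Long = ts / widthMs * widthMs

  // Counts reports per bucket, mirroring SELECT count(*) ... GROUP BY time/300000*300000.
  def countsPerBucket(timestamps: Seq[Long]): Map[Long, Int] =
    timestamps.groupBy(ts => bucket(ts)).map { case (start, group) => start -> group.size }
}
```

For example, `TimeBuckets.countsPerBucket(Seq(0L, 299999L, 300000L))` places the first two timestamps in the bucket starting at 0 and the third in the bucket starting at 300000.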

Live    Reporting

ETL

Sqoop    M/R

Scaling with Spark

Live    Reporting

Easing the Transition

[Figure: a continuous stream of bits divided into discrete batches]

101001110100101010100101011001101010101011100000011010110010

10100, 11101, 0010101010, 01010, 1100110101, 01010, 1110000001, 10101, ...
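The figure above appears to show one continuous stream cut into a sequence of batches of varying size. This matches how Spark Streaming models a stream, as a series of small batches, one per batch interval. A minimal plain-Scala sketch of that idea follows. It is not Spark code, and `MicroBatch`, `Event`, and `toBatches` are hypothetical names introduced only for illustration.

```scala
object MicroBatch {
  // A hypothetical event: an arrival time in milliseconds plus a payload.
  final case class Event(arrivalMs: Long, payload: String)

  // Groups events by the batch interval they arrived in and returns the batches in time order.
  // Intervals with no events produce no batch in this sketch.
  def toBatches(events: Seq[Event], intervalMs: Long): Seq[Seq[Event]] =
    events
      .groupBy(_.arrivalMs / intervalMs)
      .toSeq
      .sortBy { case (interval, _) => interval }
      .map { case (_, batch) => batch.sortBy(_.arrivalMs) }
}
```

Because batches are cut by time rather than by size, a burst of events yields a large batch and a quiet interval yields a small one. Each batch can then be processed with the same code as a batch job.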

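Batch Aggregation

val wx_reports = // load data from database

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

// Expose the reports to Spark SQL under the table name "wx_reports"
wx_reports.toDF.registerTempTable("wx_reports")

// Count reports per 300000-unit window of timestamp; floor() is needed because "/" in Spark SQL is floating-point division
val counts = sqlContext.sql("select count(*) from wx_reports group by floor(timestamp / 300000) * 300000")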

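Streaming Aggregation

val wx_reports = // load from streaming source

wx_reports.foreachRDD { rdd =>
  // Reuse a single SQLContext across batches instead of creating one per RDD
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._
  rdd.toDF.registerTempTable("wx_reports")
  // Counts only the rows in the current batch
  val count = sqlContext.sql("select count(*) from wx_reports")
}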
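The streaming query above counts the rows of one batch at a time. To reproduce the batch job's time buckets from a stream, the per-batch results would have to be combined across batches. A minimal plain-Scala sketch of that merge step follows, assuming timestamps are epoch milliseconds. `RunningCounts` and `merge` are illustrative names; in Spark Streaming this role is typically played by a stateful operation such as `updateStateByKey`.

```scala
object RunningCounts {
  // Bucket width; 300000 ms is 5 minutes under the epoch-millisecond assumption.
  val WidthMs: Long = 300000L

  // Adds one batch of timestamps to the running per-bucket totals and returns the new totals.
  def merge(totals: Map[Long, Long], batch: Seq[Long]): Map[Long, Long] =
    batch.foldLeft(totals) { (acc, ts) =>
      val start = ts / WidthMs * WidthMs // start of the bucket containing ts
      acc.updated(start, acc.getOrElse(start, 0L) + 1L)
    }
}
```

Folding a sequence of batches with `batches.foldLeft(Map.empty[Long, Long])(RunningCounts.merge)` yields the same per-bucket counts as running the grouped query over all the data at once.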

Data Science Roles

Data Scientist
• Machine learning expert
• Builds pipelines that work on her laptop

Data Engineer
• Scalable algorithms expert
• Rewrites her pipelines to scale better

Collaborative Data Science

The Analytics OS

Notebooks

Stream Analytics

Batch Analytics

But…

The Real World (Enterprise Version)

The Real World (Startup Version)

Application MySQL

Step 1: Pick a Problem to Solve

Step 2: Build a Data Lake

Step 3: Set Up Spark

• Direct download
• Hadoop distribution (Hortonworks, Cloudera, etc.)
• Managed service (Elastic MapReduce, Databricks, BlueMix, etc.)

Step 4: Start Collecting Data

• Options:
  – Sqoop to move RDBMS tables
  – Flume/FluentD to move logs
  – Import from Spark-supported data sources
  – Using Spark Streaming attached to a queue
  – …

Step 5: Use a Notebook

Final Thoughts

Thank You!

Robbie Strickland
@rs_atl

(we’re hiring!)

Data & Analytics
