Big Data: It’s all about the use cases James Serra Big Data Evangelist Microsoft [email protected]


Page 1: Big Data: It’s all about the Use Cases

Big Data: It’s all about the use cases
James Serra
Big Data Evangelist, Microsoft
[email protected]

Page 2: Big Data: It’s all about the Use Cases

About Me
• Business Intelligence Consultant, in IT for 30 years
• Microsoft, Big Data Evangelist
• Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer
• Been perm, contractor, consultant, business owner
• Presenter at PASS Business Analytics Conference and PASS Summit
• MCSE: Data Platform and Business Intelligence
• MS: Architecting Microsoft Azure Solutions
• Blog at JamesSerra.com
• Former SQL Server MVP
• Author of book “Reporting with Microsoft SQL Server 2012”

Page 3: Big Data: It’s all about the Use Cases

Topics
• Popular Technologies
• Use Cases (theory)
• Use Cases (practice)

Page 4: Big Data: It’s all about the Use Cases

Popular Technologies

Topics

Page 5: Big Data: It’s all about the Use Cases

What is Big Data?

Harness the growing and changing nature of data: structured, unstructured, and streaming

The challenge is combining transactional data stored in relational databases with less structured data

Big Data = All Data

Get the right information to the right people at the right time in the right format

Page 6: Big Data: It’s all about the Use Cases

What is the Internet of Things?

Things, Connectivity, Data, Analytics

IoT = sensor-acquired data

Page 7: Big Data: It’s all about the Use Cases

Using a Data Lake Modern Architecture

All data sources are considered

Leverages the power of on-prem technologies and the cloud for storage and capture

Native formats, streaming data, big data

Extract and load, no/minimal transform

Storage of data in near-native format

Orchestration becomes possible

Streaming data accommodation becomes possible

Refineries transform data on read

Produce curated data sets to integrate with traditional warehouses

Users discover published data sets/services using familiar tools

DATA SOURCES: CRM, ERP, OLTP, LOB

NON-RELATIONAL DATA

FUTURE DATA SOURCES

EXTRACT AND LOAD DATA LAKE DATA REFINERY PROCESS (TRANSFORM ON READ)

Transform relevant data into data sets

BI AND ANALYTICS

Discover and consume predictive analytics, data sets and other reports

DATA WAREHOUSE

Star schemas, views, other read-optimized structures
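The "extract and load, transform on read" idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `RAW_LAKE`, `refine`, and the field names are hypothetical and do not come from the deck. Raw records stay in the lake exactly as they arrived, and a schema is applied only when a curated data set is produced.

```python
import json

# Raw events kept in near-native form, exactly as they arrived (hypothetical sample data).
RAW_LAKE = [
    '{"device": "pump-1", "temp_c": "71.5", "ts": "2015-01-01T00:00:00"}',
    '{"device": "pump-2", "temp_c": "not-a-number", "ts": "2015-01-01T00:00:05"}',
    '{"device": "pump-1", "temp_c": "73.0", "ts": "2015-01-01T00:00:10", "extra": 1}',
]

def refine(raw_lines):
    """Apply a schema while reading; skip records that do not fit it."""
    for line in raw_lines:
        record = json.loads(line)
        try:
            yield {"device": record["device"], "temp_c": float(record["temp_c"])}
        except (KeyError, ValueError):
            continue  # the raw line stays in the lake untouched

# The curated data set is what a warehouse or BI tool would consume.
curated = list(refine(RAW_LAKE))
print(curated)
```

Because nothing is discarded at load time, a different refinery can later read the same raw lines with a different schema, for example one that also uses the `extra` field.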

Page 8: Big Data: It’s all about the Use Cases

What is Hadoop?


Distributed, scalable system on commodity HW

Composed of a few parts:
• HDFS – Distributed file system
• MapReduce – Programming model
• Other tools: Hive, Pig, SQOOP, HCatalog, HBase, Flume, Mahout, YARN, Tez, Spark, Stinger, Oozie, ZooKeeper, Storm

Main players are Hortonworks, Cloudera, MapR

WARNING: Hadoop, while ideal for processing huge volumes of data, is inadequate for analyzing that data in real time (companies do batch analytics instead)

[Diagram: Hadoop stack. Groups shown: Core Services, Operational Services, Data Services, Load & Extract. Components shown: HDFS, YARN, MapReduce, Hive & HCatalog, Pig, HBase, Falcon, SQOOP, Flume, NFS, WebHDFS, Oozie, Ambari. The Hadoop cluster is drawn as many compute & storage nodes.]

Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware
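To make the MapReduce programming model mentioned on this slide more concrete, here is a minimal single-process sketch in Python. It is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and only mimic the three stages that a real cluster would spread across many nodes.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data is all data", "data lake"])
print(counts)
```

On a cluster, each document (or file block) would be mapped on the node that stores it, which is why the model pairs naturally with HDFS.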

Page 9: Big Data: It’s all about the Use Cases

Can I use the cloud with my DW?
• Public and private cloud
• Cloud-born data vs on-prem born data
• Transfer cost from/to cloud and on-prem
• Sensitive data on-prem, non-sensitive in cloud
• Look at hybrid solutions

Page 10: Big Data: It’s all about the Use Cases

MPP Logical Architecture

[Diagram: one “Control” node (SQL) and four “Compute” nodes (SQL), each compute node with balanced storage; DMS runs on every node.]

1) User connects to the appliance (control node) and submits query

2) Control node query processor determines best *parallel* query plan

3) DMS distributes sub-queries to each compute node

4) Each compute node executes query on its subset of data

5) Each compute node returns a subset of the response to the control node

6) If necessary, control node does any final aggregation/computation

7) Control node returns results to user

Queries run in parallel, each on a subset of the data, using separate pipes, which effectively makes the pipe larger
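The query steps on this slide follow a scatter-gather pattern, which can be sketched in Python as below. This is an illustration only, not how the appliance is implemented: the sample rows, `COMPUTE_NODES`, `run_subquery`, and `control_node_query` are all hypothetical, and threads stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sales rows, already distributed across four "compute nodes".
COMPUTE_NODES = [
    [("east", 10), ("west", 5)],
    [("east", 7)],
    [("west", 3), ("north", 8)],
    [("north", 2), ("east", 1)],
]

def run_subquery(rows):
    """Steps 4-5: each node aggregates only its own subset of the data."""
    partial = {}
    for region, amount in rows:
        partial[region] = partial.get(region, 0) + amount
    return partial

def control_node_query(nodes):
    """Steps 3, 6, 7: fan out sub-queries, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(run_subquery, nodes))
    total = {}
    for partial in partials:
        for region, amount in partial.items():
            total[region] = total.get(region, 0) + amount
    return total

result = control_node_query(COMPUTE_NODES)
print(result)
```

The final merge in `control_node_query` corresponds to step 6: it is only needed because a region such as "east" can have rows on more than one node.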

Page 11: Big Data: It’s all about the Use Cases

NoSQL databases
• Non-relational databases (semi-structured data)
• Types: Document, Key-value, Column, Graph
• MongoDB, Cassandra, HBase, DocumentDB, Riak
• Large-scale OLTP (e.g. a popular web application)
• Scale-out solution
• High-availability
• JSON data
• Cons: data consistency, joining data, using SQL, quick mass updates, skillset
• Bad solution for a data warehouse, but can have a place in a big data solution
• Polyglot Persistence: use the right tool for the job

Page 12: Big Data: It’s all about the Use Cases

Use Cases (theory)

Topics

Page 13: Big Data: It’s all about the Use Cases

Speed/Real-time

Batch/Traditional

Reporting Needs

Hybrid

Page 14: Big Data: It’s all about the Use Cases

Modern Data Warehouse

The Dream

All Sources

Enterprise Data Warehouse

Page 15: Big Data: It’s all about the Use Cases

The Reality

Page 16: Big Data: It’s all about the Use Cases

Let’s set off light bulbs in your head

Page 17: Big Data: It’s all about the Use Cases

Recommendation engines

Smart meter monitoring

Equipment monitoring

Advertising analysis

Life sciences research

Fraud detection

Healthcare outcomes

Weather forecasting for business planning

Oil & Gas exploration

Social network analysis

Churn analysis

Traffic flow optimization

IT infrastructure & Web App optimization

Legal discovery and document archiving

Data Analytics is needed everywhere

Intelligence Gathering

Location-based tracking & services

Pricing Analysis

Personalized Insurance

Page 18: Big Data: It’s all about the Use Cases

The Internet of Things – Manufacturing

GLOBAL OPERATIONS

I can see my production line status and recommend adjustments to better manage operational cost.

I know when to deploy the right resources for predictive maintenance to minimize equipment failures and reduce service cost.

I gain insight into usage patterns from multiple customers and track equipment deterioration, enabling me to reengineer products for better performance.

MANUFACTURING PLANT

Aggregate product data, customer sentiment, and other third-party syndicated data to identify and correct quality issues.

Manage equipment remotely, using temperature limits and other settings to conserve energy and reduce costs.

Monitor production flow in near-real time to eliminate waste and unnecessary work in process inventory.

GLOBAL FACILITY INSIGHT

Implement condition-based maintenance alerts to eliminate machine down-time and increase throughput.

THIRD-PARTY LOGISTICS

Provide cross-channel visibility into inventories to optimize supply and reduce shared costs in the value chain.

CUSTOMER SITE

Transmits operational information to the partner (e.g. OEM) and to field service engineers for remote process automation and optimization.

Management

R&D

Field Service

Page 19: Big Data: It’s all about the Use Cases

The Internet of Things – Oil & Gas

Utilize advanced 3D and 4D visualizations based on analytic algorithms to model subsurface geology

Production Manager

Onsite personnel

Establish near real-time communication and automatically publish events and alarms to the field to guide and protect onsite personnel and assets

Integrate all upstream data onto a unified platform to facilitate analytics, information sharing, and organizational transition

1. Exploration

2. Development

3. Drilling

4. Production

Geologist

Consolidate data from surveys, drill logs, and external sources to generate advanced reservoir models and production forecasts

Maximize recovery by monitoring near real-time production data and generating alerts for conditional maintenance needs

Combine near real-time drilling and seismic data to optimize drilling trajectories and recovery potential, while minimizing environmental risk

Operations Control Center

Find new hydrocarbon reservoirs quicker with seismic data uploaded to the cloud and prepared for analysis

NORTH SHORE PRODUCTION

Page 20: Big Data: It’s all about the Use Cases

PHARMACY

The Internet of Things – Pharma

Customer Service

Monitor device data to make more timely health decisions, such as adjusting dosages

Enable advanced product tracking and authentication to prevent counterfeits

Develop better products, faster, informed by a much larger data set based on patient outcomes

R&D

Anticipate medical device maintenance needs, and alert patients to schedule a doctor visit for replacement or repair

Healthcare Provider

Monitor medical device functionality for better customer service, reduced risk, and insight to improve product designs

Manage equipment remotely, using appropriate KPIs

Reduce machine downtime with condition-based maintenance alerts

Patient Home

Distribution

Manufacturing

Aggregate and correlate data from disparate medical devices with medications and health outcomes for advanced insight

Page 21: Big Data: It’s all about the Use Cases

Microsoft Azure services for IoT

• Producers: Heterogeneous client agents, External Data Sources
• Event Ingestion: Event Hubs (Service Bus)
• Storage: SQL Database, Table/Blob Storage, DocumentDB, External Data Sources
• Transformation: Machine Learning, HDInsight, Stream Analytics, Cloud Services
• Presentation & action: Azure Websites, Mobile Services, Notification Hubs, Power BI, External Services

Page 22: Big Data: It’s all about the Use Cases

Hybrid

Page 23: Big Data: It’s all about the Use Cases

Use Cases (practice)

Topics

Page 24: Big Data: It’s all about the Use Cases

Manufacturing

Page 25: Big Data: It’s all about the Use Cases

Manufacturer of Automobiles

Manufacturer
One of the leading multinational automobile corporations and one of the largest companies in the world by revenue. They manufacture over 10 million vehicles a year.

Part 1: What They Did | Produces Internet of Things insights for their automobiles

Challenge
• Needed to analyze the telemetry being emitted from their luxury car line in real time
• Wanted to build a scalable, reliable, and highly available solution that can receive and process a large volume of vehicle information and maintenance events

Solution
Use Azure Blob, HDInsight, Storm in HDInsight, HBase in HDInsight, Event Hubs, DocumentDB, Machine Learning, and Power BI
Collect IoT data from automobiles:
• Telemetry data comes in real-time
• Able to process and generate insights around vehicle information and maintenance events

Internet of Things

Toyota

Page 26: Big Data: It’s all about the Use Cases

Manufacturer of Automobiles
Part 2: How They Did It | Produces Internet of Things insights for automobiles

How They Did It
Collect data from automobiles
• Send events in real-time to Event Hubs
• Stored into Azure Blobs

Retrieve reference data and do predictive analytics
• Get reference data stored in HBase
• Run ML algorithms on the telemetry to predict outcomes

Store into queryable store DocumentDB
• Stored in DocumentDB for Power BI to display as a dashboard
• Trigger Apache Storm in HDInsight to process and return results back to the vehicles

Internet of Things

[Diagram: pipeline steps: get data, store in Blob, get reference data, do machine learning, store in queryable store. Components shown: cloud gateways, Event Hubs (queuing service), Azure Blob (HDFS store), HBase, Azure ML, DocumentDB (NoSQL store), Power BI (live dashboard), Apache Storm on HDInsight.]

Page 27: Big Data: It’s all about the Use Cases

Power and Utilities & Oil and Gas

Page 28: Big Data: It’s all about the Use Cases

Industrial automation company partnering with multinational oil company

Oil and Gas
Leading industrial automation company that employs over 20,000 people, partnering with a leading multinational oil and gas company (one of the six oil and gas supermajors) that employs over 90,000 people.

Part 1: What They Did | IoT internet-connected sensors to generate analytics for proactive maintenance

Challenge
• Manage sites used for dispensing liquefied natural gas (clean fuel for commercial customers who do heavy-duty road transportation)
• Built LNG refueling stations across US interstate highways
• Stations are unmanned, so they built 24x7 remote management and monitoring to track diagnostics of each station for maintenance or tuning
• Built internet-connected sensors embedded in 350 dispenser sites worldwide, generating tens of thousands of data points per second (temperature, pressure, vibration, etc.)
• Data needs outgrew the company’s internal datacenter and data warehouse

Solution
• Chose Azure HDInsight, Data Factory, SQL Database, Machine Learning
• Dashboards used to detect anomalies for proactive maintenance:
  • Changes in performance of the components
  • Energy consumption of components
  • Component downtime and reliability
• Future: Goal is to expand program to hundreds of thousands of dispensers

IoT, Analytics

Rockwell Automation

Page 29: Big Data: It’s all about the Use Cases

Industrial automation company partnering with multinational oil company
Part 2: How They Did It | IoT internet-connected sensors to generate analytics for proactive maintenance

How They Did It
Collect data from internet-connected sensors
• Tens of thousands of data points per second
• Interpolate time-series prior to analysis
• Stored raw sensor data in Blobs every 5 minutes

Use Hadoop to execute scripts and Data Factory to orchestrate
• Hive and Pig scripts orchestrated by Data Factory
• Data resulting from scripts loaded in SQL Database
• Queries detect site anomalies to indicate maintenance/tuning

Produced dashboards with role-based reporting
• Azure Machine Learning, SSRS, Power BI for O365
• Provide users with customizable interface
• View current and historical data (day-to-day operations, asset performance over time, etc.)
• Leveraged Azure Mobile Notification Hub for real-time notifications, alarms, or important events

Use Azure ML to predict
• Understand which pumps, run at what speeds, maximize water supply while minimizing energy use
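The "interpolate time-series prior to analysis" step can be illustrated with a small Python sketch. The deck does not say which interpolation method was used; linear interpolation onto a regular grid is assumed here, and the `interpolate` function and sample readings are hypothetical.

```python
def interpolate(samples, step):
    """Linearly interpolate (time, value) samples onto a regular time grid.

    Assumes `samples` is sorted by time with strictly increasing timestamps;
    `step` is the grid spacing in the same time unit.
    """
    if len(samples) < 2:
        return list(samples)
    end = samples[-1][0]
    grid = []
    i = 0
    t = samples[0][0]
    while t <= end:
        # Advance to the segment [samples[i], samples[i + 1]] that contains t.
        while samples[i + 1][0] < t:
            i += 1
        (t0, v0), (t1, v1) = samples[i], samples[i + 1]
        fraction = (t - t0) / (t1 - t0)
        grid.append((t, v0 + fraction * (v1 - v0)))
        t += step
    return grid

# Irregularly spaced pressure readings: (seconds, value).
readings = [(0, 10.0), (4, 18.0), (12, 34.0)]
regular = interpolate(readings, step=2)
print(regular)
```

Putting every sensor on the same time grid is what lets later queries compare sensors row by row when looking for site anomalies.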

IoT, Analytics

[Diagram: sensor data (temperature, pressure, vibration, etc.; tens of thousands of data points per second) is stored every 5 minutes. Components shown: Azure Blobs, Azure HDInsight (Hive, Pig), Data Factory, Azure SQL DB, Azure Machine Learning, Power BI for O365, Mobile Notification Hub (real-time notification to mobile device).]

Page 30: Big Data: It’s all about the Use Cases

Government

Page 31: Big Data: It’s all about the Use Cases

Secretary of Finance and Public Credit - Government

Government
Government organization that handles finances, taxes, budget, income, and national debt for their country.

Part 1: What They Did | Fraud and Money Laundering Detection

Challenge
• The government passed a law requiring all invoice submissions to be in electronic format
• The tax department allows clients to upload their digital documents (pay stubs, expenditure slips) and now has 4 billion documents uploaded
• Want to get insights into the data to do analysis, identify trends and fraud, and ensure compliance with tax obligations

Solution
• Built electronic digital invoicing solution to upload invoices (pay stubs, expenditure slips)
• Use HDInsight to run queries and to process the electronic invoices to gain insights
• Needed to scale to a peak of 150+ million invoices uploaded / day
• Do fraud detection by understanding what people are doing to detect anomalies (e.g. tax fraud, money laundering, etc.)
• Output of the system saved to SQL Server on-premises databases to run ad hoc queries

Fraud Detection

SAT

Page 32: Big Data: It’s all about the Use Cases

Secretary of Finance and Public Credit - Government
Part 2: How They Did It | Fraud and Money Laundering Detection

How They Did It
Store electronic digital invoices as XML documents in Azure Blobs
• Store approximately 4 billion invoices total
• Store 40 million – 180 million files every day
• Data is stored as XML files with metadata information
• Average size of each XML document is 5-10KB

Use Azure HDInsight (>140 node clusters)
• Do batch querying
• Use Hive, Pig, and MapReduce
• Hive external tables to make files queryable
• Run once per day
• Detect anomalies / fraud

Send to SQL Server in IaaS VM and then to SQL Server on-premises
• SQOOP data from Azure Blobs to SQL Server VMs
• ETL to SQL Server on-premises
• Do BI on top of SQL Server as a data mart
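The figures on this slide imply a storage volume that is easy to work out. The short Python sketch below does the arithmetic; it assumes decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes), which the slide does not specify, and the `terabytes` helper is hypothetical.

```python
KB = 1000        # decimal kilobyte; an assumption, the slide does not say which unit is meant
TB = 1000 ** 4   # decimal terabyte

def terabytes(documents, bytes_per_document):
    """Total size in TB for a number of documents of a given average size."""
    return documents * bytes_per_document / TB

# Daily ingest: 40 to 180 million files at 5 to 10 KB each.
daily_low = terabytes(40_000_000, 5 * KB)
daily_high = terabytes(180_000_000, 10 * KB)

# Whole archive: about 4 billion invoices at 5 to 10 KB each.
total_low = terabytes(4_000_000_000, 5 * KB)
total_high = terabytes(4_000_000_000, 10 * KB)

print(f"daily ingest: {daily_low:.1f} to {daily_high:.1f} TB")
print(f"archive size: {total_low:.0f} to {total_high:.0f} TB")
```

Under those assumptions the daily ingest is roughly 0.2 to 1.8 TB and the archive is roughly 20 to 40 TB, before any metadata or replication overhead.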

Fraud Detection

[Diagram: website to submit electronic documents (4 billion invoices total; at peaks, 150M invoices submitted/day) → invoices run through a parser and written to Azure Blobs as XML files → HDInsight 140+ node cluster (Hive, Pig, MapReduce to detect anomalies/fraud; Hive external tables make files queryable) → SQOOP → SQL Server in IaaS VM → ETL → SQL Server on-premises → BI for insights.]

Page 33: Big Data: It’s all about the Use Cases

Entertainment and Gaming

Page 34: Big Data: It’s all about the Use Cases

Game Development Company

Gaming
A predominantly mobile-based game development company. While they are a mid-sized organization, they have partnered with media giants on various gaming projects.

Part 1: What They Did | In-game Analytics

Challenge
As a game development studio, they wanted to do in-game analytics to better understand their players and what they do in the games

Solution
• Chose Azure HDInsight (MapReduce and Storm) and Service Bus, and also use SQL Server for reporting
• Switched from Amazon AWS EMR
• Collects telemetry and logging data to gain in-game analytics:
  • How many players are using the game
  • How many players invited their friends
  • How far players got into the tutorial
  • How many attempts they made on one level/stage

In-game Analytics

Media tonic

Page 35: Big Data: It’s all about the Use Cases

Game Development Company
Part 2: How They Did It | In-game Analytics

How They Did It
Collect data from games in Azure Blobs
• Game sends telemetry/logging data as JSON files
• Contains every action of user in the game
• Data is pushed to Azure Service Bus in real time
• Tens of gigabytes of data captured daily

HDInsight picks up real-time data and processes
• From Service Bus, HDInsight processes using Apache Storm and MapReduce
• Constantly running experiments to determine insight
• A/B testing
• In-game metrics and analytics
• Spin up 32-node cluster nightly for four hours

Output sent to SQL Server for BI
• Transfer data to SQL Server for BI
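The kind of in-game metrics listed for this customer can be computed from JSON telemetry along the following lines. This Python sketch is illustrative only: the event format, field names, and the `summarize` function are hypothetical, since the deck does not show the studio's actual telemetry schema.

```python
import json
from collections import Counter

# Hypothetical telemetry events; the field names are illustrative only.
EVENTS = [
    '{"player": "a", "action": "tutorial_step", "step": 1}',
    '{"player": "a", "action": "tutorial_step", "step": 2}',
    '{"player": "b", "action": "tutorial_step", "step": 1}',
    '{"player": "a", "action": "invite_friend"}',
    '{"player": "b", "action": "level_attempt", "level": 3}',
    '{"player": "b", "action": "level_attempt", "level": 3}',
]

def summarize(lines):
    """Aggregate raw JSON events into a few per-game metrics."""
    players = set()
    inviters = set()
    furthest_step = {}
    attempts = Counter()
    for line in lines:
        event = json.loads(line)
        player = event["player"]
        players.add(player)
        action = event["action"]
        if action == "invite_friend":
            inviters.add(player)
        elif action == "tutorial_step":
            furthest_step[player] = max(furthest_step.get(player, 0), event["step"])
        elif action == "level_attempt":
            attempts[(player, event["level"])] += 1
    return {
        "players": len(players),
        "players_who_invited": len(inviters),
        "furthest_tutorial_step": furthest_step,
        "attempts_per_level": dict(attempts),
    }

summary = summarize(EVENTS)
print(summary)
```

At tens of gigabytes per day the same aggregation would run as a distributed job rather than a single loop, but the per-event logic stays the same.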

In-game Analytics

[Diagram: components shown: Service Bus (real-time event), Azure Blobs, Azure HDInsight, SQL Server on-premises, BI for insights.]

Page 36: Big Data: It’s all about the Use Cases

Non-Profit

Page 37: Big Data: It’s all about the Use Cases

JustGiving, Non-Profit

Non-profit
JustGiving is a global online social platform for giving. It's a financial service (not a charity) that lets you "raise money for a cause you care about" through your network of friends. Their goal is to become the "Facebook of Giving".

Part 1: What They Did | Recommendation Engine

Challenge
• They wanted to identify what was personal and relevant to people and what they cared about, so that they could suggest further causes that may inspire continual involvement
• With 22 million customers, this meant storing and processing huge amounts of data that their existing infrastructure simply couldn’t support

Solution
• Chose SQL Server on-premises, Azure HDInsight, Blobs, Tables, Cache, and Service Bus
• Deployed a network of “social giving” for people to make it a group activity to support a cause:
  • Built a way to inform givers of a charity goal based on a person’s position in their social graph
  • Help identify causes that a user might be interested in (based on demographics and their social graph)
  • Recommend people to add to their social graph as well as other charitable causes

Recommendation

Just Giving

Page 38: Big Data: It’s all about the Use Cases

JustGiving, Non-Profit
Part 2: How They Did It | Recommendation engine

How They Did It
Collect data in Azure Blobs
• Move data from SQL Server through an Agent to Azure Blobs

HDInsight processes data for insights
• Input data is 20-30GB / job
• Use MapReduce jobs to create a graph
• Further job to denormalize activity feeds for all users
• Generates an activity recommendation

Generates a real-time recommendation
• Real-time activity feeds/events coming in from Service Bus (~50 events/second)
• Activity recommendation coming out of daily HDInsight job
• Sent to web-site
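A recommendation based on a social graph, as described for this customer, can be sketched in Python as follows. This is not JustGiving's algorithm, which the deck does not describe; it is a simple "causes my friends support" ranking, and the graph data, cause names, and `recommend_causes` function are all hypothetical.

```python
from collections import Counter

# Hypothetical "give graph": who is connected to whom, and who supports which cause.
FRIENDS = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann"},
    "dan": {"bob"},
}
SUPPORTS = {
    "ann": {"clean-water"},
    "bob": {"clean-water", "food-bank"},
    "cat": {"food-bank", "animal-rescue"},
    "dan": {"animal-rescue"},
}

def recommend_causes(user, limit=3):
    """Rank causes backed by the user's friends that the user does not support yet."""
    already = SUPPORTS.get(user, set())
    votes = Counter()
    for friend in FRIENDS.get(user, ()):
        for cause in SUPPORTS.get(friend, ()):
            if cause not in already:
                votes[cause] += 1
    # Sort by vote count, then by name, so the output is deterministic.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [cause for cause, _ in ranked[:limit]]

recommendations = recommend_causes("ann")
print(recommendations)
```

In the architecture on this slide, a batch job would precompute such rankings daily and the real-time event stream would adjust them between runs.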

Recommendation

[Diagram: recommendation pipeline. Components shown: SQL Server on-premises, Agent, Azure Blobs, Azure HDInsight, Activity Feeds, Give Graph, Azure Tables, Web API, Website + Event store, Service Bus (real-time event), Azure Cache (serves results).]

Page 39: Big Data: It’s all about the Use Cases

Resources
• The Modern Data Warehouse: http://bit.ly/1xuX4Py
• Should you move your data to the cloud? http://bit.ly/1xuXbKU
• Presentation slides for Modern Data Warehousing: http://bit.ly/1xuXcP5
• Presentation slides for Building an Effective Data Warehouse Architecture: http://bit.ly/1xuXeX4
• Hadoop and Data Warehouses: http://bit.ly/1xuXfu9

Page 40: Big Data: It’s all about the Use Cases

Q & A ?
James Serra, Big Data Evangelist
Email me at: [email protected]
Follow me at: @JamesSerra
Link to me at: www.linkedin.com/in/JamesSerra
Visit my blog at: JamesSerra.com (where this slide deck will be posted)