Big Data with Amazon Redshift and ATI - AT Internet
Big Data with Amazon Redshift and ATI
November 27th, 2013
HI, I’M OLE
SOUNDCLOUD IS THE WORLD’S LEADING AUDIO PLATFORM
Every minute, creators upload
12hrs of audio
reaching over
250m
people every month
8% of the internet
FOO FIGHTERS
SNOOP LION
MADONNA
MACKLEMORE
PRESIDENT OBAMA
JOHN OLIVER (DAILY SHOW/BUGLE)
SKRILLEX
How’s the sales funnel performing in Brazil, and what’s the split between products?
• Avoid Silos
• Remove unnecessary restrictions
• Provide simple tools
• Teach People how to use data
DATA DEMOCRATIZATION
In one sentence:
DATA DEMOCRATIZATION
Deliver the right information to the
right person at the right time.
PRODUCTION DB
ANALYTICS DB
DATA ANALYSIS AND REPORTING
2010-2012
AT Internet
DATA ANALYSIS AND REPORTING
Listens, Sounds, Users, Comments, Favorites, Shares, Reposts
Impressions, Clicks, Conversions, Suggestions, Downloads, Taggings, Uploads
DATA ANALYSIS AND REPORTING
Listens
timestamp
duration
sound
owner
listener
API-key
(location)
country
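The listen fields above could be modeled as a simple record. This is a minimal sketch: the field names follow the slide, while the types, units, and the handling of anonymous listeners are assumptions, not SoundCloud's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ListenEvent:
    """One listen, with the fields named on the slide (types are assumed)."""
    timestamp: datetime             # time of the listen (assumed: UTC)
    duration_ms: int                # assumed unit: milliseconds
    sound_id: int
    owner_id: int                   # assumed: the user who owns the sound
    listener_id: Optional[int]      # assumed: None for anonymous listeners
    api_key: str
    country: str                    # assumed: ISO 3166-1 alpha-2 code
    location: Optional[str] = None  # in parentheses on the slide, so treated as optional


event = ListenEvent(
    timestamp=datetime(2013, 11, 27, 12, 0, tzinfo=timezone.utc),
    duration_ms=183_000,
    sound_id=1,
    owner_id=2,
    listener_id=None,
    api_key="example-key",
    country="BR",
)
```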
DATA ANALYSIS AND REPORTING
additional metadata:
• location within sound
• context (location on site)
• segmentation
Listening creates >6000 events/s
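To put that rate in perspective, here is a quick back-of-the-envelope calculation. It assumes the rate is sustained around the clock, which the slide does not state, so the result is only a rough lower bound.

```python
EVENTS_PER_SECOND = 6_000        # lower bound quoted on the slide
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

# Assumption: the rate holds for the whole day.
events_per_day = EVENTS_PER_SECOND * SECONDS_PER_DAY
print(f"{events_per_day:,}")  # 518,400,000
```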
BIG DATA
HADOOP TO THE RESCUE
2 data centers in AMS
200+ Nodes
HADOOP TO THE RESCUE
listen data
listen metadata
search data
recommender data
product testing data
backend production data
backend logs
HADOOP AND DATA DEMOCRATIZATION
Data is siloed on Hadoop
No data governance
Technical hurdles for access
Not realtime
Slow access
AMAZON REDSHIFT
Fast, fully managed DW service
Optimized for datasets of a petabyte or more
Fast query and I/O performance
Columnar storage technology
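Why columnar storage helps analytical queries can be shown with a toy example. This sketch uses plain Python lists and made-up sample rows; it only illustrates the layout idea and says nothing about how Redshift stores data internally.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"sound": 1, "country": "BR", "duration_s": 120},
    {"sound": 2, "country": "DE", "duration_s": 200},
    {"sound": 1, "country": "BR", "duration_s": 60},
]

# Column-oriented layout: one contiguous list per field.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one field reads a single list and skips the other columns.
total_duration_s = sum(columns["duration_s"])
print(total_duration_s)  # 380
```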
Staging Area
Pig/Ruby Scripts
Amazon EMR
COPY
Pig/Ruby Scripts
Job execution powered by:
2013 BI INFRASTRUCTURE
Data Exploration
Source Systems
Hadoop
MySQL
External Systems
MySQL (production db)
Data Warehouse
ETL Scripts
AT Internet
How’s the sales funnel performing in Brazil, and what’s the split between products?
ATI Data Query
Create query:
1. Filter on funnel pages
2. Select metrics and dimension
3. Add REST URL to ETL pipeline
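The three steps could look roughly like this in a pipeline script. The endpoint, the parameter names, and the metric and dimension values are hypothetical placeholders; they are not AT Internet's actual REST API, which the slides do not show.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only (not a real AT Internet URL).
BASE_URL = "https://reporting.example.invalid/v1/data"


def build_query_url(page_filter, metrics, dimensions):
    """Steps 1 and 2: encode the page filter, metrics, and dimensions as a URL."""
    params = {
        "filter": page_filter,               # hypothetical parameter name
        "metrics": ",".join(metrics),        # hypothetical parameter name
        "dimensions": ",".join(dimensions),  # hypothetical parameter name
    }
    return f"{BASE_URL}?{urlencode(params)}"


# Step 3: register the URL as one more source for the ETL pipeline to fetch.
etl_sources = []
etl_sources.append(build_query_url("funnel", ["visits"], ["country", "product"]))
```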
DATA EXPLORATION
Simple and fast access to data
More time for “deep dives” into data
Individualized Reporting
Allows interactivity between users
Integrated with Redshift
• Reports designed by end users
• Central repository for data analysis
• User interaction
• Data from one source only
• Scalable solution
• Data to the people!
DATA DEMOCRATIZATION
QUESTIONS?
THANK YOU!
P.S. WE’RE HIRING. SOUNDCLOUD.COM/JOBS
APPENDIX
First: Gather data from the various source systems into S3
Hadoop
MySQL
External Systems
MySQL (production db)
Full/Daily Imports
MapReduce for:
- Listens
- Plays
- Impressions
- Affiliations
- ...
IMPORT DATA FROM SOURCE SYSTEMS
Second: Rebuild staging area tables for full imports
IMPORT DATA FROM SOURCE SYSTEMS
Staging Area
tracks, users, client applications, ...
Based on configuration files
CREATE statements generated
Re-create DISTKEYs and SORTKEYs
Full control over changes to the data model
yaml config files
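Generating CREATE statements from configuration might look like the sketch below. A plain dictionary stands in for one parsed yaml file, and the table name, columns, and key choices are hypothetical; the slides do not show the real config format.

```python
# Stand-in for one parsed yaml config file (all names are hypothetical).
TRACKS_CONFIG = {
    "table": "staging.tracks",
    "columns": [
        ("id", "BIGINT"),
        ("user_id", "BIGINT"),
        ("created_at", "TIMESTAMP"),
    ],
    "distkey": "user_id",
    "sortkeys": ["created_at"],
}


def create_statement(config):
    """Render a Redshift CREATE TABLE statement with DISTKEY and SORTKEY."""
    columns = ",\n".join(
        f"    {name} {sql_type}" for name, sql_type in config["columns"]
    )
    sortkeys = ", ".join(config["sortkeys"])
    return (
        f"CREATE TABLE {config['table']} (\n{columns}\n)\n"
        f"DISTKEY ({config['distkey']})\n"
        f"SORTKEY ({sortkeys});"
    )


print(create_statement(TRACKS_CONFIG))
```

Keeping the keys in the config means a change to the data model is a config edit followed by a rebuild of the staging table.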
Third: Import the data from S3 to Redshift
Staging Area
tracks, users, client applications, ...
Full import: TRUNCATE & COPY
Daily import: COPY
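The difference between the two import modes can be sketched as a small statement generator. The table name, S3 path, and IAM role are placeholders, and authorizing COPY with IAM_ROLE is an assumption; the slides do not say how the original COPY commands were authorized or formatted.

```python
def import_statements(table, s3_path, iam_role, full_import):
    """Return the SQL to run for one staging table.

    Full import: empty the table first, then load everything again.
    Daily import: only append the new files.
    """
    copy = f"COPY {table} FROM '{s3_path}' IAM_ROLE '{iam_role}';"
    if full_import:
        return [f"TRUNCATE {table};", copy]
    return [copy]


# Placeholder values, for illustration only.
full = import_statements(
    "staging.tracks",
    "s3://example-bucket/tracks/",
    "arn:aws:iam::123456789012:role/example-role",
    full_import=True,
)
daily = import_statements(
    "staging.listens",
    "s3://example-bucket/listens/2013-11-27/",
    "arn:aws:iam::123456789012:role/example-role",
    full_import=False,
)
```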
IMPORT DATA FROM SOURCE SYSTEMS
ETL scripts divided into layers:
- Layer 1: Staging -> DW (dimensions)
- Layer 2: Staging -> DW (fact tables - raw data)
- Layer 3: DW -> DW (aggregated fact tables)
- Layer 4: DW -> Reporting Data Cubes (reporting data)
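As an illustration of Layer 3, the sketch below builds an aggregated fact table from a raw one. SQLite stands in for Redshift so the example runs anywhere, and the table names, columns, and sample rows are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite as a local stand-in for Redshift
conn.executescript("""
    -- Layer 2 output: raw fact table (hypothetical columns, toy rows)
    CREATE TABLE fact_listens (listen_date TEXT, country TEXT, duration_s INTEGER);
    INSERT INTO fact_listens VALUES
        ('2013-11-27', 'BR', 120),
        ('2013-11-27', 'BR', 60),
        ('2013-11-27', 'DE', 200);

    -- Layer 3: DW -> DW aggregation into a daily fact table
    CREATE TABLE agg_listens_daily AS
        SELECT listen_date, country,
               COUNT(*) AS listens,
               SUM(duration_s) AS total_duration_s
        FROM fact_listens
        GROUP BY listen_date, country;
""")

rows = conn.execute(
    "SELECT country, listens, total_duration_s "
    "FROM agg_listens_daily ORDER BY country"
).fetchall()
print(rows)  # [('BR', 2, 180), ('DE', 1, 200)]
```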
ETL AND DW DATAMODEL
DataWarehouse
ETL AND DW DATAMODEL
Staging Area
Data Cleaning
Data Transformation
Ruby/SQL Scripts
ETL Layer 1 & 2
Data Aggregation
Ruby/SQL Scripts
ETL Layer 3
Data Exploration
ETL Layer 4
Data Presentation
SQL
JOB SCHEDULE AND EXECUTION
Job-scheduling tool developed internally
Set dependencies between jobs
Execution on multiple machines
Supports all the ETL layers
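The internal tool is not shown in the slides, but dependency-based scheduling in general can be sketched with Python's standard graphlib module (Python 3.9+). The job names and the dependencies between layers are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# job -> set of jobs it depends on (hypothetical names and dependencies)
JOBS = {
    "import_to_staging": set(),
    "layer1_dimensions": {"import_to_staging"},
    "layer2_raw_facts": {"import_to_staging"},
    "layer3_aggregates": {"layer2_raw_facts"},
    "layer4_reporting_cubes": {"layer1_dimensions", "layer3_aggregates"},
}


def execution_batches(jobs):
    """Group jobs into batches; a batch runs once all its dependencies are done."""
    sorter = TopologicalSorter(jobs)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


for batch in execution_batches(JOBS):
    print(batch)
```

Jobs within one batch do not depend on each other, so they could be distributed across several machines.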
TIMELINE
Week 2, Week 4, Week 6, Week 8, Week 10, Week 12, Week 14, Week 16
Analysis Stage
Requirement Analysis
• Gap Analysis
• Business Exploration (requirements interviews)
• Information Mapping
Design
• Solution Design (Draft)
Milestone: End of Analysis Stage
Design & Build
• Define Infrastructure
• Design Data Model
Milestone: Infrastructure Ready!
• Build ETL
• Build Data Cubes
• Design Reports/Dashboards (Presentation Layer)
Milestone: BI 1.0 is built!
Test & Deploy
• System/Integration Tests
• User Acceptance
Milestone: BI 1.0 is tested!
• User Workshops
• BI 1.0 Evaluation
Milestone: BI 1.0 is ready to use!