Optimize Your Reporting In Less Than 10 Minutes
David Nhim, News Distribution Network, Inc.
June 24th, 2015
Housekeeping
• The recording will be sent to all webinar participants after the event.
• Questions? Type them in the chat box and we will answer.
• Posting to social? Use #AWSandChartio
Today’s Speakers
Matt Train
@Chartio
David Nhim
@Newsinc
Brandon Chavis
@AWScloud
Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year
Amazon Redshift
Common Customer Use Cases
• Reduce costs by extending DW rather than adding HW
• Migrate completely from existing DW systems
• Respond faster to business
• Improve performance by an order of magnitude
• Make more data available for analysis
• Access business data via standard reporting tools
• Add analytic functionality to applications
• Scale DW capacity as demand grows
• Reduce HW & SW costs by an order of magnitude
Customer segments: Traditional Enterprise DW · Companies with Big Data · SaaS Companies
Amazon Redshift is easy to use
• Provision in minutes
• Monitor query performance
• Point and click resize
• Built in security
• Automatic backups
Amazon Redshift is priced to let you analyze all your data
Price is nodes times hourly cost
No charge for leader node
3x data compression on avg
Price includes 3 copies of data
DS2 (HDD)            Price per hour (DW1.XL single node)   Effective annual price per TB compressed
On-Demand            $0.850                                $3,725
1-Year Reservation   $0.500                                $2,190
3-Year Reservation   $0.228                                $999

DC1 (SSD)            Price per hour (DW2.L single node)    Effective annual price per TB compressed
On-Demand            $0.250                                $13,690
1-Year Reservation   $0.161                                $8,795
3-Year Reservation   $0.100                                $5,500
Amazon Redshift Node Types
• Optimized for I/O intensive workloads
• High disk density
• On demand at $0.85/hour
• As low as $1,000/TB/Year
• Scale from 2TB to 2PB
DS2.XL: 31 GB RAM, 2 cores, 2 TB compressed storage, 0.5 GB/sec scan
DS2.8XL: 244 GB RAM, 16 cores, 16 TB compressed storage, 4 GB/sec scan
• High performance at smaller storage size
• High compute and memory density
• On demand at $0.25/hour
• As low as $5,500/TB/Year
• Scale from 160GB to 326TB
DC1.L: 16 GB RAM, 2 Cores 160 GB compressed SSD storage
DC1.8XL: 256 GB RAM, 32 Cores 2.56 TB of compressed SSD storage
Amazon Redshift Architecture
• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3; load from Amazon DynamoDB or SSH
• Two hardware platforms
– Optimized for data processing
– DW1: HDD; scale from 2TB to 2PB
– DW2: SSD; scale from 160GB to 330TB
[Architecture diagram: SQL clients/BI tools connect over JDBC/ODBC to the leader node, which coordinates compute nodes (128 GB RAM, 16 TB disk, 16 cores each) over a 10 GigE (HPC) network; ingestion, backup, and restore flow through Amazon S3 / DynamoDB / SSH]
Amazon Redshift enables end-to-end security
• SSL to secure data in transit; load encrypted from Amazon S3; ECDHE for perfect forward secrecy
• Encryption to secure data at rest
– AES-256; hardware accelerated
– All blocks on disks & in Amazon S3 encrypted
– On-premises HSM & AWS CloudHSM support
• UNLOAD to Amazon S3 supports SSE and client-side encryption
• Audit logging & AWS CloudTrail integration
• Amazon VPC and IAM support
• SOC 1/2/3, PCI-DSS Level 1, FedRAMP, HIPAA
[Architecture diagram: same topology, with the leader node in the customer VPC and compute nodes in an internal VPC; ingestion, backup, and restore via S3 / EMR / DynamoDB / SSH]
Amazon Redshift integrates with multiple data sources
Amazon S3
Amazon EMR
Amazon Redshift
DynamoDB
Amazon RDS
Corporate Datacenter
NDN Introduction, 2015
• Transition Items & Interim Plan
• Marketing Approach & Priorities
• Brand Development Process
• Resourcing
• Next Steps
The Broadest Offering of Video Available Anywhere
400+ Premium Sources
4,000 New Videos Daily
The Digital Media Exchange
400 Premium Content Providers
4,000 High-Traffic Publishers
The Web’s Best Publishers Lead with Video from NDN
Competitive Insight
NDN is a leader in the News/Information category, ranked #2 behind Huffington Post Media Group.
NDN Powers the Full Video Experience for Publishers
NDN Single Video Player & Fixed Placement
Perfect Pixel has Redefined the Video Workflow
NDN Wire Match
NDN Wire Match: automates placement of AP video recommended by AP editors
Powering Video On 44 of the Top 50 Newspaper Sites
[Chart: Top U.S. Newspapers Online]
NDN is the Leader in Local News
• Breaking News Video Available from over 250 Stations in 155 US News Markets
• Coverage for 90% of the US Audience
The Largest Consortium of Digital Local News Video Ever Created
Participating broadcasters:
257 Stations in 155 Markets
BI Initiative
• Needed self-service BI
• Must be user-friendly
• Easy to manage
• Reviewed over a dozen BI vendors
– Build or buy
– Self-hosted vs cloud
– Training/support
– POC process
Tech @ NDN
• Tools
– Kinesis for real-time data collection
– Python / EMR / Pentaho for ETL
– Redshift for data warehousing
– Chartio for visualization
Data Warehouse
Architecture
RDBMS
Logs
ETL
DIMENSIONS
Architecture
• Real-time data collector encodes messages in protocol buffers and sends the payload to Kinesis
• Micro-batching
– ETL process continuously reads from Kinesis, batches the data, and loads it into Redshift
– ~15 minutes behind real-time
Redshift Basics
• Redshift is a distributed column store
– Don't treat it like a traditional row store
– Don't do "SELECT * FROM" queries
• No referential integrity
– Primary/foreign keys are ignored except for query planning
– Enforce uniqueness via ETL
• No UDFs or stored procedures
– Must rely on built-in functions
– Do as much pre-processing as possible outside the cluster
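The column-store point is the one that bites first. As a sketch (the `events` table and its columns are hypothetical), naming only the columns a query needs lets Redshift skip reading every other column's blocks:

```sql
-- On a column store, name only the columns you need; each named
-- column is a separate set of blocks on disk.
SELECT event_type, COUNT(*) AS event_count
FROM events
WHERE event_date >= '2015-06-01'
GROUP BY event_type;

-- By contrast, SELECT * forces a read of every column's blocks:
-- SELECT * FROM events WHERE event_date >= '2015-06-01';
```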
Redshift
• Use the COPY command to bulk load data
– Raw inserts ("INSERT INTO table ... VALUES ...") are slow
• Use deep copies to rebuild tables rather than running a full vacuum
– CREATE TABLE, then "INSERT INTO ... SELECT * FROM ..."
– Vacuum took as long as three days for some tables
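Both patterns can be sketched as below. The table name, S3 path, and credentials are placeholders, not values from the talk:

```sql
-- Bulk load from S3 with COPY instead of row-at-a-time INSERTs
-- (bucket path and credentials are placeholders):
COPY events
FROM 's3://my-bucket/events/2015-06-24/'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
GZIP DELIMITER '|';

-- Deep copy: rebuild the table in sorted order rather than
-- paying for a full VACUUM on a large table.
CREATE TABLE events_new (LIKE events);
INSERT INTO events_new SELECT * FROM events;
DROP TABLE events;
ALTER TABLE events_new RENAME TO events;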
Distribution
• Distribution styles
– Use "All" distribution for dimension tables
– Use "Even" distribution for summary tables
– Use "Key" distribution for fact tables
• Select the most often joined column as the distribution key
• Strive for join data locality
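The three styles map directly onto DDL. A minimal sketch, with hypothetical table and column names:

```sql
-- ALL: small dimension table replicated to every node,
-- so any join to it is local.
CREATE TABLE dim_partner (
    partner_id   INT,
    partner_name VARCHAR(100)
) DISTSTYLE ALL;

-- KEY: fact table distributed on its most-joined column;
-- rows with the same partner_id land on the same node.
CREATE TABLE fact_views (
    partner_id INT,
    view_ts    TIMESTAMP,
    views      BIGINT
) DISTSTYLE KEY DISTKEY (partner_id);

-- EVEN: summary table spread round-robin across slices.
CREATE TABLE summary_daily (
    view_date DATE,
    views     BIGINT
) DISTSTYLE EVEN;
```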
Sort Keys
• Select a timestamp-based column with the lowest grain that makes sense (e.g. a minute-truncated timestamp)
• Insert data in sort-key order to minimize the need for vacuum
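A minimal sketch of that advice (table and column names are hypothetical):

```sql
-- Sort on a minute-truncated timestamp; loading rows in sort-key
-- order keeps new data in the sorted region, so VACUUM stays cheap.
CREATE TABLE fact_events (
    event_minute TIMESTAMP,   -- e.g. DATE_TRUNC('minute', event_ts)
    event_type   VARCHAR(50),
    event_count  BIGINT
) SORTKEY (event_minute);
```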
Compression Encoding
• Use compression to reduce I/O
– Use ANALYZE COMPRESSION to get recommended encodings for your table, or let the COPY bulk loader do it for you
– Use run-length encoding on rollup columns like hour, day, month, year, and booleans (assuming a timestamp for your sort key)
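Both suggestions in SQL form (the table and columns are illustrative):

```sql
-- Ask Redshift to recommend an encoding per column:
ANALYZE COMPRESSION fact_events;

-- Run-length encoding suits low-cardinality rollup columns that
-- repeat in long runs when rows arrive in sort-key order:
CREATE TABLE fact_hourly (
    event_ts   TIMESTAMP SORTKEY,
    event_hour SMALLINT  ENCODE RUNLENGTH,
    event_day  SMALLINT  ENCODE RUNLENGTH,
    is_mobile  BOOLEAN   ENCODE RUNLENGTH,
    views      BIGINT
);
```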
Summary Tables
• Aggregate tables / materialized views
– Pre-build your summaries and complex queries
– Your biggest boost in query performance will come from using summary tables
– Adds ETL complexity, but reduces reporting complexity
– Chartio's Data Store is also an option if your data set is < 1 M rows
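A summary table can be built in one statement during ETL; this is a sketch with hypothetical names, not NDN's actual rollup:

```sql
-- Pre-aggregate the fact table once per ETL run; dashboards then
-- query the small summary instead of scanning raw events.
CREATE TABLE summary_daily_views AS
SELECT
    DATE_TRUNC('day', view_ts) AS view_date,
    partner_id,
    SUM(views) AS total_views
FROM fact_views
GROUP BY 1, 2;
```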
Avoid Updates on fact tables
• Avoid doing updates on your fact tables
– Updates are equivalent to delete-then-insert and will ruin your sort order
– Vacuum will be required after large updates
• Deletes remain in your table
– Rows are marked and hidden, but don't disappear until a vacuum delete or full vacuum is performed
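When a large delete or update is unavoidable, the cleanup options look like this (table name is hypothetical):

```sql
-- Reclaim space from deleted rows without a full re-sort:
VACUUM DELETE ONLY fact_events;

-- Or re-sort and reclaim in one pass after large updates:
VACUUM FULL fact_events;
```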
Caching
• Configure Chartio with the appropriate cache timeout values – 15 min, 1 hour, 8 hours
• Use Chartio’s data store feature– Ideal for storing complex query results or aggregates
Views
• Use views instead of tables
– Easier to update Chartio schemas if using a view
– Can add mandatory filters
– Can change the view without affecting Chartio
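A sketch of a reporting view with a built-in mandatory filter (names and the 90-day window are illustrative assumptions):

```sql
-- The view gives Chartio a stable schema and enforces a time
-- filter; the underlying table can change without breaking charts.
CREATE VIEW reporting_events AS
SELECT event_minute, event_type, event_count
FROM fact_events
WHERE event_minute >= DATEADD(day, -90, GETDATE());
```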
Chartio Filters and Drilldowns
• Encourage use of dashboard filters and variables– Allows for dynamic filtering and focused reporting
• Configure drilldowns on dashboards– Makes exploration more natural
Redshift Workload Manager
• Use the Workload Manager (WLM)
– Prevent long queries from blocking other users
– Create multiple query queues for ETL, BI, machine learning, etc.
– Set separate memory settings and query timeout values for each queue
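Once queues are defined in the cluster's parameter group, a session routes its queries to a queue by query-group label; 'etl' here is a hypothetical queue name:

```sql
-- Route this session's statements to the ETL queue:
SET query_group TO 'etl';
-- ... run long-running ETL statements here ...
RESET query_group;
```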
Quick Stats
• 14 event types
• 300 M – 1 B events / day
• ½ terabyte of uncompressed data / day
• 30 – 50 data points per event type
• 50+ users (about half the company)
• 80+ dashboards, majority user-generated
• Reportable dimensions include:
– Partners, Geo-location, Device, EventType, Playlists, Widgets, Date/Time …
Data At A Glance
Data At A Glance
Chartio Summary
• Easy to deploy
• Easy to manage
• Dead simple to use
• Great performance
• Responsive support
• Continually improving and adding new features
Redshift Summary
• Easy to deploy
• Easy to resize
• Automated backups
• Familiar Postgres-like interface
• High performance
• Can use OLAP/relational tools
Data Sources
Schema/Business Rules
Interactive Mode
SQL Mode
Data Stores
TV Screens
Scheduled Emails
Data Exploration
Dashboards
Embedded
Data Pipeline/Data Blending
Data Caching
Security
Next Steps
Download the Chartio guide: Optimizing Amazon Redshift Query Performance
https://chartio.com/redshift
Questions?
Chartio
Matt Train
chartio.com

News Distribution Network, Inc.
David Nhim
[email protected]
newsinc.com

AWS
Brandon Chavis
[email protected]
aws.amazon.com