1
Y O UR DA TA , NO L I M I T S
Kent Graziano Senior Technical EvangelistSnowflake Computing
Changing the Game with Cloud Data Warehousing
@KentGraziano
2
My Bio
• Senior Technical Evangelist, Snowflake Computing
• Oracle ACE Director (DW/BI)
• OakTable
• Blogger – The Data Warrior
• Certified Data Vault Master and DV 2.0 Practitioner
• Former Member: Boulder BI Brain Trust (#BBBT)
• Member: DAMA Houston & DAMA International
• Data Architecture and Data Warehouse Specialist
• 30+ years in IT
• 25+ years of Oracle-related work
• 20+ years of data warehousing experience
• Author & co-author of a bunch of books (Amazon)
• Past President of ODTUG and the Rocky Mountain Oracle User Group
3
Agenda
• Data Challenges
• What is a Cloud Data Warehouse?
• What can a Cloud DW do for me?
• Cool Features of Snowflake
• Other Cloud DWs – Redshift, Azure, BigQuery
• Real Metrics
Data challenges today
5
Scenarios with affinity for cloud
Gartner 2016 Predictions:
By 2018, six billion connected things will be requesting support.
Connecting applications, devices, and “things”
Reaching employees, business partners, and consumers
Anytime, anywhere mobility
On demand, unlimited scale
Understanding behavior; generating, retaining, and analyzing data
6
40 Zettabytes by 2020
Web, ERP, 3rd-party apps, enterprise apps, IoT, mobile
7
It’s not the data itself – it’s how you take full advantage of the insight it provides.
8
All possible data All possible actions
Most firms don’t consistently turn data into action
73% of firms aspire to be data-driven.
29% of firms are good at turning data into action.
Source: Forrester
9
New possibilities with the cloud
• More & more data “born in the cloud”
• Natural integration point for data
• Capacity on demand
• Low-cost, scalable storage
• Compute nodes
10
Cloud characteristics & attributes
DYNAMIC – scalable, elastic, adaptive
EASY – lower cost, faster implementation
FLEXIBLE – supports many scenarios
SECURE – trust by design
11
The evolution of data platforms
• 1980s – Relational database: Oracle, DB2, SQL Server
• 1990s – Data warehouse appliance: Teradata
• 2000s – Data warehouse & platform software: Vertica, Greenplum, ParAccel, Hadoop, Redshift
• 2010s – Cloud-native data warehouse: Snowflake
12
What is a Cloud-Native DW?
• DW – Data Warehouse
  • Relational database
  • Uses standard SQL
  • Optimized for fast loads and analytic queries
• aaS – As a Service
  • Like SaaS (e.g. SalesForce.com)
  • No infrastructure setup
  • Minimal to no administration
  • Managed for you by the vendor
  • Pay as you go, for what you use
13
Goals of Cloud DW
• Make your life easier
  • So you can load and use your data faster
• Support the business
  • Make data accessible to more people
  • Reduce time to insights
• Handle big data too!
  • Schema-less ingestion
14
Common customer scenarios
Data warehouse for SaaS offerings
Use a Cloud DW as the back-end data warehouse supporting data-driven SaaS products

noSQL replacement
Replace use of a noSQL system (e.g. Hadoop) for transformation and SQL analytics of multi-structured data

Data warehouse modernization
Consolidate legacy datamarts and support new projects
15
Over 250 customers demonstrate what’s possible
Up to 200x faster reports that enable analysts to make decisions in minutes rather than days
Load and update data in near real time by replacing legacy data warehouse + Hadoop clusters
Developing new applications that provide secure (HIPAA-compliant) access to analytics for 11,000+ pharmacies
16
Introducing Snowflake
17
About Snowflake
• Experienced, accomplished leadership team
• Founded in 2012 by industry veterans with over 120 database patents
• Vision: a world with no limits on data
• First data warehouse built for the cloud
• Over 230 enterprise customers in one year since GA
18
The 1st Data Warehouse Built for the Cloud
Data Warehousing…
• SQL relational database
• Optimized storage & processing
• Standard connectivity – BI, ETL, …

…for Everyone
• Existing SQL skills and tools
• “Load and go” ease of use
• Cloud-based elasticity to fit any scale
• From data scientists to SQL users & tools
19
The Snowflake difference

Performance: multi-petabyte scale, up to 200x faster performance, at 1/10th the cost
Concurrency: multiple groups access data simultaneously with no performance degradation
Simplicity: fully managed with a pay-as-you-go model; works on any data
20
The Data Warrior’s Top 10 Cool Things About Snowflake
(A Data Geek’s Guide to DWaaS)
21
#10 – Persistent Result Sets
• No setup – results available in Query History, by Query ID
• Retained for 24 hours
• No re-execution, no compute cost
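Cached results can be queried again without re-executing the original work. A minimal sketch, assuming Snowflake’s RESULT_SCAN table function; the query ID shown is hypothetical:

```sql
-- Read the cached result of the most recent query in this session
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

-- Or pull a specific result from Query History by its ID (hypothetical ID)
SELECT * FROM TABLE(RESULT_SCAN('01a2b3c4-0000-1111-2222-333344445555'));
```

Because the result set is persisted, these reads incur no compute cost within the 24-hour window.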
22
#9 – Connect with JDBC & ODBC
• Interfaces: ODBC, JDBC, Web UI, Java, and scripting
• Connects data sources, custom & packaged applications, reporting & analytics, and data modeling, management & transformation tools (e.g. SDDM)
• Spark too! https://github.com/snowflakedb/
• JDBC driver available via maven.org; Python driver available via PyPI
23
#8 – UNDROP

  UNDROP TABLE <table name>
  UNDROP SCHEMA <schema name>
  UNDROP DATABASE <db name>

Part of the Time Travel feature: AWESOME!
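A minimal sketch of how UNDROP recovers from an accidental drop, using a hypothetical `orders` table (recovery must happen within the Time Travel retention period):

```sql
-- Oops: the table is gone...
DROP TABLE orders;

-- ...and back again, data intact, within the retention window
UNDROP TABLE orders;
```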
24
#7 – Fast Clone (Zero-Copy)
• Instant copy of a table, schema, or database:

  CREATE OR REPLACE TABLE MyTable_V2 CLONE MyTable;

• Combined with Time Travel:

  CREATE SCHEMA mytestschema_clone_restore
    CLONE testschema
    BEFORE (TIMESTAMP => TO_TIMESTAMP(40*365*86400));
25
#6 – JSON Support with SQL

Structured data (e.g. CSV):

  Apple   101.12  250  FIH-2316
  Pear     56.22  202  IHO-6912
  Orange   98.21  600  WHQ-6090

Semi-structured data (e.g. JSON, Avro, XML):

  {
    "firstName": "John",
    "lastName": "Smith",
    "height_cm": 167.64,
    "address": {
      "streetAddress": "21 2nd Street",
      "city": "New York",
      "state": "NY",
      "postalCode": "10021-3100"
    },
    "phoneNumbers": [
      { "type": "home", "number": "212 555-1234" },
      { "type": "office", "number": "646 555-4567" }
    ]
  }

• Optimized storage
• Flexible schema – native
• Relational processing

  select v:lastName::string as last_name
  from json_demo;
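Nested arrays in the same document can be unnested relationally as well; a sketch assuming the `json_demo` table from the slide, with the document stored in a VARIANT column `v`, using Snowflake’s LATERAL FLATTEN:

```sql
-- One output row per entry in the phoneNumbers array
select v:lastName::string     as last_name,
       p.value:type::string   as phone_type,
       p.value:number::string as phone_number
from json_demo,
     lateral flatten(input => v:phoneNumbers) p;
```

For the sample document above this yields two rows, one for the home number and one for the office number.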
26
#5 – Standard SQL w/ Analytic Functions

Complete SQL database:
• Data definition language (DDL)
• Query (SELECT)
• Updates, inserts, and deletes (DML)
• Role-based security
• Multi-statement transactions

  select Nation, Customer, Total
  from (select n.n_name Nation,
               c.c_name Customer,
               sum(o.o_totalprice) Total,
               rank() over (partition by n.n_name
                            order by sum(o.o_totalprice) desc) customer_rank
        from orders o, customer c, nation n
        where o.o_custkey = c.c_custkey
          and c.c_nationkey = n.n_nationkey
        group by 1, 2)
  where customer_rank <= 3
  order by 1, customer_rank;
27
#4 – Separation of Storage & Compute

Snowflake’s multi-cluster, shared-data architecture separates the Service, Compute, and Storage layers:
• Centralized storage
• Instant, automatic scalability & elasticity
28
#3 – Support Multiple Workloads
• Scale compute to support any workload: scale processing horsepower up and down on the fly, with zero downtime or disruption
• Scale concurrency without performance impact: multi-cluster “virtual warehouse” architecture scales concurrent users & workloads without contention
• Accelerate the data pipeline: run loading & analytics at any time, concurrently, to get data to users faster
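Separate workloads map naturally to separate virtual warehouses. A sketch with hypothetical warehouse names; the multi-cluster settings assume an edition that supports them:

```sql
-- Dedicated warehouse for ETL, sized for heavy loads
CREATE WAREHOUSE etl_wh WITH
  WAREHOUSE_SIZE = 'LARGE'
  AUTO_SUSPEND   = 300      -- suspend after 5 idle minutes to save credits
  AUTO_RESUME    = TRUE;

-- Multi-cluster warehouse for BI: adds clusters as concurrency rises
CREATE WAREHOUSE bi_wh WITH
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3
  AUTO_SUSPEND      = 300
  AUTO_RESUME       = TRUE;
```

Because each warehouse draws on the same shared storage, ETL running on `etl_wh` never contends with queries on `bi_wh`.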
29
#2 – Secure by Design with Automatic Encryption of Data!
• Authentication: embedded multi-factor authentication; federated authentication available
• Access control: role-based access control model; granular privileges on all objects & actions
• Data encryption: all data encrypted, always, end-to-end; encryption keys managed automatically
• External validation: certified against enterprise-class requirements; HIPAA certified!
30
#1 – Automatic Query Optimization
• Fully managed, with no knobs or tuning required
• No indexes, distribution keys, partitioning, vacuuming, …
• Zero infrastructure costs
• Zero admin costs
31
Other Cloud Data Warehouse Offerings
32
Amazon Redshift
• Amazon’s data warehousing offering in AWS
  • First announced in fall 2012, GA in early 2013
  • Derived from ParAccel (Postgres-based), moved to the cloud
• Pluses
  • Maturity: based on ParAccel, on the market for almost 10 years
  • Ecosystem: deeper integration with other AWS products
  • Amazon backing
33
Amazon Redshift
• Challenges (vs Snowflake)
  • Semi-structured data: Redshift cannot natively handle flexible-schema data (e.g. JSON, Avro, XML) at scale.
  • Concurrency: the Redshift architecture imposes a hard concurrency limit that cannot be addressed short of creating a second, independent Redshift cluster.
  • Scaling: scaling a cluster means hours of read-only mode while data is redistributed, and every new cluster holds a complete copy of the data, multiplying costs for dev, test, staging, and production, as well as for datamarts created to work around the concurrency limit.
  • Management overhead: customers report spending hours, often every week, on maintenance such as reloading data, vacuuming, and updating metadata.
34
Customer Analysis – Snowflake vs Redshift
• Ability to provision compute and storage separately
  • Storage might grow exponentially while compute needs may not
  • No need to add nodes to the cluster just to accommodate storage
• Compute capacity can be provisioned during business hours and shut down when not required (saving $$$)
• Predictable, exact processing power for user queries via dedicated, separate warehouses
• No concurrency issues between warehouses
• No constraint to finish the ETL run before business hours: analytical workloads and ETL can run in parallel
35
Customer Analysis
• Zero maintenance, 100% managed
• No tuning (distkey, sortkey, vacuum, analyze)
• 100% uptime, no backups required
• Can restore at the transaction level through the Time Travel feature
• 3x–5x better compression compared to Redshift, applied automatically
• Data at rest is encrypted by default; with Redshift, enabling encryption degrades performance by 2x–3x
36
Customer Analysis
• Supports cross-database joins
• With the cloning feature, we can spin up dev/test by cloning the entire prod database in seconds → run tests → drop the clone
• There is no charge for a clone; only incremental updates to storage are charged
• Instant resize (scale up)
  • No 20+ hour read-only mode like Redshift!
  • Resize also allows provisioning higher compute capacity for faster processing when required
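The clone-test-drop cycle described above can be sketched as follows, with hypothetical database names:

```sql
-- Zero-copy clone of production: seconds to create, no extra storage up front
CREATE DATABASE prod_clone_test CLONE prod_db;

-- ...run the test suite against prod_clone_test...

-- Done: drop the clone; only blocks changed during testing were ever billed
DROP DATABASE prod_clone_test;
```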
37
Microsoft Azure DW
• Based on the Analytics Platform System, which grew out of Microsoft’s 2008 acquisition of DATAllegro, an on-premises MPP data warehouse vendor
• Pluses
  • Maturity: very mature (on-premises) database engine, over 20 years old
  • Ecosystem: deep integration with the Azure and SQL Server ecosystem
  • Separation of compute and storage: can scale compute without unloading, loading, or moving the underlying data at the database level
  • Integration with big data & Hadoop: allows querying semi-structured/unstructured data, such as the Hadoop file formats ORC, RC, and Parquet, via the external table concept
38
Microsoft Azure DW
• Challenges (vs Snowflake)
  • Concurrency: the Azure architecture imposes a hard concurrency limit (currently 32 users) per DWU (Data Warehouse Unit) allocation that cannot be addressed short of creating a second warehouse and a second copy of the data.
  • Scaling: cannot scale clusters easily and automatically.
  • Management overhead: Azure DW is difficult to manage, especially with considerations around data distribution, statistics, indexes, encryption, metadata operations, replication (for disaster recovery), and more.
  • Security: lacks end-to-end encryption.
  • Support for modern programmability: lacks support for a wide range of APIs due to its commitment to its own ecosystem.
39
Google BigQuery
• Query processing service offered in the Google Cloud, first launched in 2010
• Follow-up offering based on the Dremel service developed internally at Google
• Pluses
  • Google-scale horsepower: runs jobs across a boatload of servers, so BigQuery can be very fast on many individual queries.
  • Absolutely zero management: you submit your job and wait for it to return – that’s it. No management of infrastructure, database configuration, compute horsepower, etc.
40
Google BigQuery
• Challenges (vs Snowflake)
  • BigQuery is not a data warehouse: it does not implement key features expected of a relational database, so existing database workloads will not run on BigQuery without non-trivial changes.
  • Performance degradation for JOINs: limits on how many tables can be joined.
  • BigQuery is a black box: you submit your job and it finishes when it finishes – users have no ability to control SLAs or performance.
  • Lots of usage limitations: quotas on how many concurrent jobs can run, how many queries can run per day, how much data can be processed at once, etc.
  • Obscure pricing: priced per query, based on the amount of data a query processes, making costs hard to predict and quick to add up.
  • BigQuery only recently introduced full SQL support.
41
Snowflakein Action Today
42
Simplifying the data pipeline

Before: event data → Kafka → Hadoop (import processor, key-value store) → SQL database → analysts & BI tools
After: event data → Kafka → Amazon S3 → Snowflake → analysts & BI tools

Scenario
• Evaluating event data from various sources

Pain Points
• 2+ hours to make new data available for analytics
• Significant management overhead
• Expensive infrastructure

Solution
Send data from Kafka to S3 to Snowflake, with schemaless ingestion and easy querying

Snowflake Value
• Eliminated external pre-processing
• Fewer systems to maintain
• Concurrency without contention or performance impact
43
Simplifying the data pipeline – EDW

Before: game event data, internal data, and third-party data → staging noSQL database (cleanse, normalize, transform) → existing EDW → analysts & BI tools (11–24 hours)
After: the same sources → Kinesis → Snowflake → analysts & BI tools (15 minutes)

Scenario
Complex pipeline slowing down analytics

Pain Points
• Fragile data pipeline
• Delays in getting updated data
• High cost and complexity
• Limited data granularity

Solution
Send data from Kinesis to S3 to Snowflake, with schemaless ingestion and easy querying

Snowflake Value
• >50x faster data updates
• 80% lower costs
• Nearly eliminated pipeline failures
• Able to retain full data granularity
44
Delivering compelling results

Simpler data pipeline: replaced a noSQL database with Snowflake for storing & transforming JSON event data
• noSQL database: 8 hours to prepare data → Snowflake: 1.5 minutes

Faster analytics: replaced an on-premises data warehouse with Snowflake for the analytics workload
• Data warehouse appliance: 20+ hours → Snowflake: 45 minutes

Significantly lower cost: improved performance while adding new workloads, at a fraction of the cost
• Data warehouse appliance: $5M+ to expand → Snowflake: added 2 new workloads for $50K
45
“The fact that we don’t need to do any configuration or tuning is great because we can focus on analyzing data instead of on managing and tuning a data warehouse.”
– Craig Lancaster, CTO, internet access service company serving over 30 million smartphone users

“The combination of Snowflake and Looker gives our business users a powerful, self-service tool to explore and analyze diverse data, like JSON, quickly and with ease.”
– Erika Baske, Head of BI, music & film distribution

“We went from an obstacle and cost-center to a value-added partner providing business intelligence and unified data warehousing for our global web property business lines.”
– JP Lester, CTO, publishers of Ask.com, Dictionary.com, About.com, and other premium websites

• Replaced MySQL; substituted Snowflake’s native JSON handling and SQL queries in place of MapReduce; integrated with a Hadoop repository
• Consolidated global web data pipelining; replaced a 36-node data warehouse; replaced a 100-node Hadoop cluster
• Eliminated bottlenecks and scaling pain points; consolidated multiple cloud data marts; now handling larger datasets and higher concurrency with ease
46
Cloud-scale data warehouse
47
Steady growth in data processing
• Over 20 PB loaded to date!
• Multiple customers with >1 PB
• Multiple customers averaging >1M jobs per week
• >1 PB per day processed
• 4x growth in data processing over the last six months
(Chart: jobs per day)
48
What does a Cloud-native DW enable?
• Cost-effective storage and analysis of GBs, TBs, or even PBs
• Lightning-fast query performance
• Continuous data loading without impacting query performance
• Unlimited user concurrency
• Full SQL relational support for both structured and semi-structured data
• Support for the tools and languages you already use (ODBC, JDBC, Web UI, Java, scripting)
49
Making Data Warehousing Great Again!
50
As easy as 1-2-3!

Discover the performance, concurrency, and simplicity of Snowflake:
1. Visit Snowflake.net
2. Click “Try for Free”
3. Sign up & register
Snowflake is the only data warehouse built for the cloud. You can automatically scale compute up, out, or down—independent of storage. Plus, you have the power of a complete SQL database, with zero management, that can grow with you to support all of your data and all of your users. With Snowflake On Demand™, pay only for what you use.
Sign up and receive $400 worth of free usage for 30 days!
Kent Graziano, Snowflake Computing
[email protected]
Twitter: @KentGraziano
More info at http://snowflake.net
Visit my blog at http://kentgraziano.com
Contact Information
YOUR DATA, NO LIMITS
Thank you