
Performance tuning in the database, SQL, and LookML for optimal data exploration: a case study



Page 1

Connect. Describe. Explore.

Performance Tuning for Optimal Data Exploration... Segah Meer, Data & Analytics

Page 2

1. Signals to watch for (for existing Lookers)
2. Case #1: Very wide reports
3. Case #2: Lots of event data
4. Case #3: Live data with complex analysis
5. When to use PDTs for performance needs

Performance Tuning for Optimal Data Exploration...

Page 3

● Slow response time
○ Queries returning in >1 minute on large* datasets
○ Queries returning in >10-20 seconds on small* datasets
● Bad throughput/resource utilization
○ Concurrent queries are affected
○ Excessive CPU usage
○ Queries are aborted due to limited disk space

Performance Tuning for Existing Lookers / Signals to watch for...

“What’s wrong with the Donkey?”

Page 4

Specs:
❏ Database: MySQL
❏ Size: >100 million rows
❏ Normalization: 3rd Normal Form
❏ Indexing: lots of missing indexes on timestamps
❏ Model stage: Complete

Ask:
❏ Query Excel-like reports with 30-50 columns
❏ Show metrics (by date) associated with different parts of the business in one Look

Three Cases / Case 1: Very Wide Reports

Page 5

Problem:
❏ A single query could generate 1,000+ lines of SQL against the entire dataset (all 100 million rows), leading to slow queries, stalled concurrent queries, and high CPU usage

Solution:
➔ Force date ranges with an always_filter
➔ Heavy use of templated filters
➔ Result caching with persist_for

Three Cases / Case 1: Very Wide Reports

Page 6

Three Cases / Case 1: Very Wide Reports

● Force date ranges with an always_filter
● Cache every result for 8 hours (a sketch of both settings follows below)
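A minimal LookML sketch of both settings (current LookML syntax; the explore name order_items, the field orders.created_date, and the filter values are hypothetical):

explore: order_items {
  # Force a date range: users can change the value but cannot remove the filter.
  always_filter: {
    filters: [orders.created_date: "last 90 days"]
  }
  # Serve repeat queries from Looker's cache for 8 hours instead of re-hitting MySQL.
  persist_for: "8 hours"
}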

Takeaway #1: “most queries do NOT need to run against the entire dataset”

Page 7

Three Cases / Case 1: Very Wide Reports

Caveat #1: "For date ranges to be effective, timestamp columns should be indexed."
Caveat #2: "Normalized schemas require more understanding of the final model for correct set-up of indexes."

Check indexes in the SQL Runner:
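In MySQL, for example, SQL Runner can check which indexes exist, and a missing timestamp index can then be added (the table and column names here are hypothetical):

-- List the indexes defined on the orders table.
SHOW INDEX FROM orders;

-- If the timestamp column behind the forced date range has no index,
-- add one so the date filter can actually prune rows.
CREATE INDEX idx_orders_created_at ON orders (created_at);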

Page 8

Three Cases / Case 1: Very Wide Reports

Heavy use of templated filters

Takeaway #2: “Joins can be optimized by filtering the joined table for a subset of data”
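One way to express this in LookML is to join a derived table whose SQL carries a templated filter, so the user's date filter is applied to the joined table before the join runs. A sketch, with all names hypothetical:

view: user_order_facts {
  derived_table: {
    sql:
      SELECT user_id, COUNT(*) AS order_count
      FROM orders
      -- The user's filter on orders.created_date is injected here, so only
      -- the matching subset of orders is aggregated and then joined.
      WHERE {% condition orders.created_date %} orders.created_at {% endcondition %}
      GROUP BY user_id ;;
  }
  dimension: user_id {
    primary_key: yes
    sql: ${TABLE}.user_id ;;
  }
  dimension: order_count {
    type: number
    sql: ${TABLE}.order_count ;;
  }
}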

Page 9

Three Cases / Case 1: Very Wide Reports

More templated filter optimization (in derived tables)...
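A variant using a dedicated filter-only field, as in Looker's documented templated-filter pattern for derived tables (names hypothetical):

view: channel_summary {
  # Filter-only field: never selected, it only feeds the WHERE clause below.
  filter: channel_filter {
    type: string
  }
  derived_table: {
    sql:
      SELECT channel, COUNT(*) AS session_count
      FROM sessions
      WHERE {% condition channel_filter %} sessions.channel {% endcondition %}
      GROUP BY channel ;;
  }
  dimension: channel {
    sql: ${TABLE}.channel ;;
  }
  dimension: session_count {
    type: number
    sql: ${TABLE}.session_count ;;
  }
}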

Page 10

Specs:
❏ Database: Redshift (3 HDD nodes)
❏ Size: >5 billion rows
❏ Normalization: heavily denormalized
❏ Indexing: some sortkeys/distkeys
❏ Growth: >50 million new events added daily
❏ Model stage at start: partially complete (suitable for general exploration)

Ask:
❏ Query new retention metrics by date
❏ Show many metrics in a dashboard (drill paths are not important)

Three Cases / Case 2: Lots of Event Data

Page 11

Problem:
❏ These metrics require 1) date ranges covering 500+ million events (even for 2-4 week or 30-day intervals) and 2) expensive self-joins
❏ Queries were taking minutes, or had to be scheduled to pre-run at midnight and be cached

Solution:
➔ Roll up events in PDTs into metrics by date, channel, and campaign
➔ Match date filters with dates and time filters with timestamps

Three Cases / Case 2: Lots of Event Data

Page 12

Three Cases / Case 2: Lots of Event Data

Roll up events in PDTs into metrics by date, channel, and campaign (a sketch follows below)
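A sketch of such a roll-up PDT, assuming a hypothetical events table with event_at, channel, campaign, and user_id columns:

view: daily_campaign_facts {
  derived_table: {
    sql:
      SELECT event_at::date AS event_date,
             channel,
             campaign,
             COUNT(*) AS event_count,
             COUNT(DISTINCT user_id) AS unique_users
      FROM events
      GROUP BY 1, 2, 3 ;;
    # Rebuild once a day, after the new events have loaded.
    sql_trigger_value: SELECT GETDATE()::date ;;
    # Redshift-specific PDT options: sort on the date used for filtering,
    # distribute on a column commonly used in joins and grouping.
    sortkeys: ["event_date"]
    distribution: "campaign"
  }
}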

Takeaway: “Event data is rarely looked at on a row level. OK to roll-up!”

Page 13

Three Cases / Case 2: Lots of Event Data

Caveat #1: “Incorrect use of datatype can lead to Redshift not effectively making use of SORTKEYs”
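For example, with events sorted on a timestamp column event_at (hypothetical names), comparing the raw column against timestamp bounds lets Redshift range-restrict the scan, while casting the sortkey column per row generally defeats that pruning:

-- Sortkey used: Redshift can skip blocks outside the range.
SELECT COUNT(*) FROM events
WHERE event_at >= '2015-06-01' AND event_at < '2015-07-01';

-- Sortkey wasted: the per-row cast hides the sorted column from the planner.
SELECT COUNT(*) FROM events
WHERE event_at::date = '2015-06-01';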

Page 14

Three Cases / Case 2: Lots of Event Data

Caveat #2: “Sometimes this even requires us to create a date-timestamp PDT”
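One plausible shape for such a PDT (names hypothetical): materialize the date cast once at build time, next to the raw timestamp, and sort on the date column so date filters compare date to date:

view: events_with_date {
  derived_table: {
    sql:
      SELECT event_at::date AS event_date,  -- cast once at build time, not per query
             event_at, user_id, channel, campaign
      FROM events ;;
    sql_trigger_value: SELECT GETDATE()::date ;;
    sortkeys: ["event_date"]
  }
  dimension_group: event {
    type: time
    timeframes: [date, week, month]
    sql: ${TABLE}.event_date ;;
    datatype: date  # tell Looker to generate date SQL, not timestamp SQL
  }
}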

Page 15

Specs:
❏ Database: Redshift (10 SSD nodes)
❏ Size: >800 million rows
❏ Normalization: 3rd Normal Form
❏ Indexing: correct sortkeys/distkeys
❏ Growth: 0.1-10 million new rows, depending on the table
❏ Model stage at start: Complete

Ask:
❏ Very flexible exploration environment (drill to row level)
❏ Very complex analytical patterns (financial use case)
❏ Nearly live data (ETL running every 10 minutes)

Three Cases / Case 3: Live Data with Complex Analysis

Page 16

Problem:
❏ Complex queries were not returning results fast enough (or at all) to reflect changes in the data
❏ Data caching overloaded the storage

Solution:
➔ Swap sql_trigger_value for persist_for on intermediate PDTs
➔ Wait until the Vacuum and Analyze commands complete before making the table accessible (2 approaches)

Three Cases / Case 3: Live Data with Complex Analysis

Page 17

Three Cases / Case 3: Live Data with Complex Analysis

Swap sql_trigger_value for persist_for on intermediate PDTs to garbage collect scratch tables.
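A sketch of the swap on a hypothetical intermediate PDT; a persist_for table is dropped once its cache expires and it goes unused, rather than being rebuilt and kept around indefinitely on a trigger:

view: account_exposure {
  derived_table: {
    sql:
      SELECT account_id, SUM(quantity * price) AS exposure
      FROM trades
      GROUP BY account_id ;;
    # was: sql_trigger_value: SELECT GETDATE()::date ;;
    persist_for: "10 minutes"  # roughly matches the 10-minute ETL cadence
  }
}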

Takeaway #1: "Choosing Dense Compute nodes over Dense Storage does NOT have to come at a steeper price if done intelligently."

Page 18

Three Cases / Case 3: Live Data with Complex Analysis

Takeaway #2: “Even with appropriate Sortkeys and Distkey, queries can be slow if Redshift never finishes Vacuuming and Sorting the data.”

● UNION ALL the tables in a view, then have Looker query only that Redshift view
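A sketch of this first approach in Redshift SQL (table names hypothetical): keep the bulk of the history in a fully vacuumed, sorted table, land fresh rows in a small table, and point the LookML view's sql_table_name at a view over both:

-- transactions_sorted: large, fully VACUUMed and ANALYZEd.
-- transactions_recent: small, receives the 10-minute ETL loads.
CREATE VIEW analytics.transactions_all AS
SELECT * FROM analytics.transactions_sorted
UNION ALL
SELECT * FROM analytics.transactions_recent;

-- Looker queries only analytics.transactions_all and never touches the
-- half-vacuumed base tables directly.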

Page 19

Three Cases / Case 3: Live Data with Complex Analysis

Takeaway #2: “Even with appropriate Sortkeys and Distkey, queries can be slow if Redshift never finishes Vacuuming and Sorting the data.”

● Swap table schemas upon completion of Vacuum and Analyze commands.
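A sketch of this second approach (hypothetical names), using table renames as the swap: build and fully sort a staging copy, then rename inside one transaction so queries only ever see a completely vacuumed table:

-- Build the new copy off to the side.
CREATE TABLE transactions_staging (LIKE transactions);
INSERT INTO transactions_staging SELECT * FROM transactions_source;

-- Sort the table and refresh planner statistics before exposing it.
-- (VACUUM cannot run inside a transaction, so it runs first.)
VACUUM transactions_staging;
ANALYZE transactions_staging;

-- Atomic swap: readers see either the old table or the new one, never a mix.
BEGIN;
ALTER TABLE transactions RENAME TO transactions_old;
ALTER TABLE transactions_staging RENAME TO transactions;
COMMIT;
DROP TABLE transactions_old;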

Page 20

Major Takeaways

Do NOT use PDTs:
● as a first line of defense
● when they are considered only on the basis of performance; that logic is best suited for the ETL process

Do use PDTs:
● when no exploration is necessary into a complex, slow-running query other than into the summary it produces
● when the joins in the query are too expensive due to the size of the joined tables

When to use PDTs for Performance Improvements

Page 21

Questions?