Upload
others
View
13
Download
0
Embed Size (px)
Citation preview
Nothing is for granted:Making wise decisions using real-time intelligence
Anastasia Ailamaki
Catching up With an Evolving LandscapeOne size fits all:
– Easier administration/development– Easier usage– Single point of truth
Modern requirements:– OLAP, OLTP, Streams, federated data, ML, FaaS, …., mixed– Custom & domain-specific types/operations– Diverse requirements and priorities
2Treat heterogeneity as a first-class citizen
ODBMS
DSMSSpatialDBMS
KV-store
“Useful” time: 9%(mining data for patterns)Source: Forbes
Image: “Three-way tug of war” (https://www.momahler.com/ProArtistManifesto)
Hardware
Data
Workload
Change Means Trouble
Next generation systems must adapt
Change-driven ArchitecturesData & operations hint to optimal architectureAllow operations to shape DBMS architecture to their sizeReduce uncertain, preemptive work and fixed components
4Each workload runs on its own custom DBMS
Capabilities Runtime
ComposabilityModularity
Specialization
Runtime specialization embraces heterogeneity
5
Data
Heterogene
ous
Workload
Hardware
Unstandardized Inconsistencies Skewness
Uncertainty Complexity Volume
Underutilization Boundaries Interference
data growth
From Data to Information Veracity
6Need timely & informed data preparation
TXT
LOAD
LOAD
LOAD
Clean Data
TuneDatabase
Time
Query
predictability
5. Plan execution6. (finally!) run & get answer
Database
Preparation Kills Discovery
QUERY
LOAD
TXT
LOAD
LOAD
1. Load data
2. Clean Data
3. TuneDatabase
4. Ask analytical question
$$$$$
Load30%
Clean50%
Index17%
Cost grows with owned – not used! – dataPlanning is expensive, often even wrong
Heterogeneous Data: Convert the EngineConservative: pre-load & convert data
– Time-consuming data adaptation to engine– Transformation: pre-determined order wrt execution– Wasteful loading of write-only data
Real-time: customize access paths at runtime– Virtualize data; specialize access paths to queries+data– On-the-fly access paths and positional caches– Pay-as-you-go heterogeneous data accesses
8Real-time adaptation to data formats à flexibility
TXT
LOAD
LOAD
LOAD
Query
TXT
Query
CSV JSON.bin
Query
CSV JSON.bin
DBMS
Tran
sform
Query
Proteus: Customized Data Access
Each source as native format, generate special engine/query
Engine adapts to dataPlug-in per data sourceBuild auxiliary structures
PredicateExpression Type
Collection Type
ü Reduce branches
ü Minimize function calls
ü Pipelining
Krikellas [ICDE2010], Neumann [VLDB2011]
Interpretation Overhead Generate
Operators
Full Cleaning: A Dirty SolutionConservative: query cleaned database copy
– Time-consuming transformation: Dirty=>Clean DB– Analysis dependent– Wasteful effort on unnecessary data
Real-time: Sanitize only interesting data– Virtualized clean DB– Query relaxation to guarantee correctness– Clean touched data as-you-go
10Focus on cleaning useful data
Query
Query Result
Cleaning
Offline Cleaning
Query-Driven Data Sanitization
11Clean only useful data with probabilistic fixes
DenialConstraints
DuplicateEliminationIntegrity
Constraints
Common algebra q1
q2Offline Cleaning
Query Result
Cleaning
Indexing Decisions: It’s All About AccessesConservative: parse before tune
– Based on workload-expectations– Data duplication– Coarse-grained access path selection
Real-time: on-the-fly indexing of raw data– Virtualized (logical) partitioning & online index tuning– Reuse raw data– Fine-grained tuning, as-you-go
12Online partial indexing, specialized to access pattern
TXT
LOAD
LOAD
LOAD
TuneDatabase
Query
TXT
Access
TuneQueryAccess
Access
Evolving Indexes
13On-the-fly virtual indexes to invest on important data
B+
Data skippingFine-grained access path selectionChoose what to build & when - Value-Existence (i.e., Bloom filters)- Value-Position (i.e., B+ Trees)Build / drop based on budget
...
…
attr1 attrN costs vs. gainsShould I build or not?
Bf
Qm
TXT
Access
TuneQueryAccess
Access
Runtime specialization embraces heterogeneity
14
Data
Heterogene
ous
Workload
Hardware
Unstandardized Inconsistencies Skewness
Uncertainty Complexity Volume
Underutilization Boundaries Interference
Conservative: offline vs online AQP– Tradeoff between performance and flexibility– Preprocessing cost or reduced gains– Static sampling user-driven tactic
Real-time: Workload-driven data summarization– Maintain sample/sketch set specialized to workload history– Materialize summaries based on gain estimates– Pay-as-you-go summary creation and storage overheads
AQP: From Expectations to Adaptation
15Adapt summaries based on workload patterns
Select Persistent
SummariesQuery
Query
Query
Query
RefineSummaries
Query
Query
Query
QueryRefine
Summaries
Tuning Materialized SummariesWindow-based prediction
Estimate prospective gains vs. storage cost
Adapt window size based on quality of predictions
16Materialize summaries maximizing gain on past workload
Q1 Q2 Q3 Q4 Q5 Q6
S1
S2S2
S3S2
S4S4 S4 S1
S2Summary
warehouse S1 S2
w = 2
S4
materialize materialize use
Useful Summaries
Bypassing Intra-Query DependenciesConservative: Query unnesting
– Staged query execution: waiting for unnecessary details– Sharing prohibited across barriers– Blocking & limited parallelism dues to dependencies
Real-time: Speculate intra-query dependencies– Virtualize query execution by speculating and repairing– Exposed sharing and parallelism across barrier– Verify & repair query results as-you-go
17Expect, predict & proceed: specialize to uncertainty
Query
Sub-query
Sub-query
Query
Sub-querySub-querySub-querySub-query
Chocking Points Reduce Parallelism
18Subquery speculation to create task-independence
Outer query
Subquery 2
Subquery 3
approx2
SVJoin
Δ(Q1)
Guard
Guard
SVJoin
Δ(Q2)
approx3
Apply recursively
Task parallelism and sharing opportunities
Stable repair propagation
Query
Sub-querySub-querySub-querySub-query
Planning in Multi-Query ExecutionConservative: Opportunistic sharing
– Missed sharing opportunities, to avoid optimization time– (Multi-) Query & statistic dependent– Sensitive to query order, plans, correlations
Real-time: Adaptive multi-query optimization– Specialize to running queries- & data-at-hand– Inspect actual statistics, observed by minibatches– Reconfigure execution for pay-as-you-go plans
19
Learn & explore using running Qs à fine granularity reopt
Query Plan Execute
Query Plan Execute
Query Plan Execute
Query
Query
Query
Execute
Execute
Execute & revise plans
Concurrent Queries: From Complexity to Shareability
20Query-interoperability-based, on-the-fly batch-reoptimization
RouLetteQ1Q2Q3
Feedback
Plan efficiencyExecution Efficiency
Reinforcement learningLow-overhead adaptation
Query Processing Time
Query
Query
Query
Execute
Execute
Execute & revise plans
Runtime specialization embraces heterogeneity
21
Data
Heterogene
ous
Workload
Hardware
Unstandardized Inconsistencies Skewness
Uncertainty Complexity Volume
Underutilization Boundaries Interference
Hardware Acceleration: A Balancing GameConservative: Performance – portability tradeoff across devices
– Device specific operator implementations– Limited or expectation-based load-balancing– Inefficient hardware use
Real-time: Synergistic CPU-GPU execution– Virtualize hardware & generate hardware-specific code– Throughput-based data-flow load balancing– Exploit Accelerator-Level Parallelism
22Use devices based on their relative performance
GPU
CPUQuery
GPU
CPUQuery
Hardware Heterogeneity: Let the Data FlowDecouple data- from control-flowEncapsulate trait conversions into operatorsInspect flows to load-balance
23Flow inspection to load balance across heterogeneous devices
filter
unpack
cpu2gpu
gpu2cpu
pack
mem-move
mem-move
aggregate
router
router
unpack
aggregate
GPU
CPUQuery
Hardware Boundaries à Isolation Mechanisms
Conservative: Device-collocated OLAP & OLTP– Wasted parallelism and throughput– Static hardware preferences– Destructive interference across workloads
Real-time: Dynamic task assignment minimizes interference– ALP & hardware boundaries as an isolation mechanism– Fresh-data-rate- & isolation-driven task assignment– Pay-as-you-access-fresh-data interference
24Align isolation requirements with hardware boundaries
GPU
CPUCommand
GPU
CPUCommandSchedule
GPU Accesses Fresh Data from CPU MemoryOLTP generates fresh data
on CPU Memory
Data access protected byconcurrency control
OLAP needs to accessfresh data
25Provide snapshot isolation for OLAP w/o CC overheads
Read / Write
Storage
Fetch Fresh Data
Main Memory
DRAM NVLink
PCIe
OLTP
OLAP
GPU
CPUCommandSchedule
HTAP: Chasing Freshness LocalityConservative: Static OLAP-OLTP assignment
– Unnecessary tradeoff between interference and performance– Pre-determined resource assignment based on workload type– Wasteful data consolidation and synchronization
Real-time: Adaptive scheduling of HTAP workloads– Specialize to requirements and data/freshness-rates– Workload-based resource assignment– Pay-as-you-go snapshot updates
26Task placement based on resource usage
GPU
CPU
GPU
CPU
GPU
CPU
GPU
CPU
Workload Isolation & Fresh Data Throughput
27Freshness-based: from destructive to constructive interference
ETL
Isolated Elastic-ComputeHybrid-Access Colocated
Performance IsolationFresh Data Access Bandwidth
Fresh Data
OLAPOLTP
Runtime specialization embraces heterogeneity
28
Data
Heterogene
ous
Workload
Hardware
Unstandardized Inconsistencies Skewness
Uncertainty Complexity Volume
Underutilization Boundaries Interference
Inspect & Reduce
Relax & Repair
Track & Adjust
Runtime specialization embraces heterogeneity
29Application landscape changes data processing
Data
Heterogene
ous
Workload
Hardware
Unstandardized Inconsistencies Skewness
Uncertainty Complexity Volume
Underutilization Boundaries Interference
Queries
Workload
Complexity of Modern WorkloadsDiverse modern data problems
– IOT, OCR, ML, NLP, Medical, Mathematics etc…
DBMS catch-up for popular functionality– Human effort and big delays– Oblivious to out-of-DBMS workflows
Vast resource of libraries– Authored by domain experts, used by everybody– Loose library-to-data-sources integration and optimization
30Need for systems that can “learn” new functionality
five old friends revisitedData variety à Operational environment variety
– Unpredictable application requirements
Data veracity à Inter-component veracity– Heterogeneous data & variable importance
Data volume à Structural volume– Multi-layered system architectures
Data value à Resource value– Broader, multi-featured analytics
Data velocity à Technological velocity– Hardware heterogeneity & volatility
31Intelligent systems to catch-up with an evolving landscape
Incorporate change into native design.Anticipate change and react, learning from errors.
A solution is only as efficient as its least adaptive component.
Intelligent Real-time Systems