Supporting Streaming Updates in an Active Data Warehouse
Neoklis Polyzotis, Spiros Skiadopoulos,
Panos Vassiliadis, Alkis Simitsis,
Nils-Erik Frantzell
ICDE 2007, Constantinople 18/4/2007 2
Forecast
• Problem in active data warehousing:
– the join between a fast stream of source updates and a disk-based relation under the constraint of limited memory
• Solution:
– the mesh join, a novel join operator that operates under minimum assumptions for the stream and the relation
• Features:
– a cost model and tuning methodology that accurately associates memory consumption with the incoming stream rate
Roadmap
• Motivation & Problem statement
• The Mesh-Join Algorithm
• Cost model & Tuning
• Experiments
• Conclusions
ETL workflows
[Figure: an example ETL workflow spanning the Sources (S1_PARTSUPP, S2_PARTSUPP, transferred via FTP1 and FTP2), the DSA (snapshots DS.PS_NEW1/DS.PS_OLD1 and DS.PS_NEW2/DS.PS_OLD2, compared on PKEY by DIFF1 and DIFF2; staged data DS.PS1 and DS.PS2) and the DW (DW.PARTSUPP and the views V1 and V2). Activities shown: NotNULL, Add_SPK1 (SUPPKEY=1), Add_SPK2 (SUPPKEY=2), SK1 and SK2 (using LOOKUP_PS.SKEY), $2€, A2EDate, AddDate (DATE=SYSDATE), CheckQTY (QTY>0), union (U), Aggregate1 (PKEY, DAY; MIN(COST)) and Aggregate2 (PKEY, MONTH; AVG(COST)). Rejected rows are logged.]
Active Data Warehousing
• Traditionally, data warehouse refreshment has been performed off-line, through Extraction-Transformation-Loading (ETL) software
• Active Data Warehousing refers to a new trend where data warehouses are updated as frequently as possible, to accommodate the high demands of users for fresh data
Issues around Active Warehousing
• Smooth upgrade of the software at the (legacy) source
– minimal modification of the software configuration at the source side
• Minimal overhead of the source system
• No data losses are allowed in the long run
• Maximum freshness of data
– the response time for the transport, cleaning, transformation and loading of a new source record to the DW should be small and predictable
• Scalability at the warehouse side
– the architecture should scale up with respect to the number of sources and data consumers at the DW
– if possible, cover issues like checkpointing, index maintenance
Grand view of an Active DW
[Figure: a real-time stream of updates from source relation S reaches the DSA, where a join module (with a load shedder) joins it with relation R; the results feed the (active) ETL activities for the regular DW load. A second, active ETL workflow serves approximate, real-time reporting, with off-line synchronization. An accompanying chart of total market sales over time (introductory, growth, maturity and decline stages) contrasts real-time DW refreshment with the offline update of reports.]
Problem statement
• Joining a fast stream of updates with a persistent relation within limited memory bounds is of particular importance in the Active Warehousing setting
• Example practical cases:
– Surrogate Key assignment
– Duplicate detection
– …
Example: Surrogate Key
R1 (Sources)
id   descr
10   coke
20   pepsi

R2 (Sources)
id   descr
10   pepsi
20   fanta

Lookup (ETL)
id   source   skey
10   R1       100
20   R1       110
10   R2       110
20   R2       120

RDW (DW)
id    descr
100   coke
110   pepsi
120   fanta
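The surrogate key example can be read as a join between the incoming source rows and the Lookup table on (id, source). The following is a minimal in-memory sketch of that reading; the function name `assign_surrogate_keys` and the dictionary layout are assumptions made for illustration and do not come from the slides.

```python
# Lookup table from the example: (source id, source name) -> surrogate key.
LOOKUP = {
    (10, "R1"): 100,
    (20, "R1"): 110,
    (10, "R2"): 110,
    (20, "R2"): 120,
}


def assign_surrogate_keys(rows, source, lookup=LOOKUP):
    """Replace each row's source id with its surrogate key.

    rows   -- iterable of (id, descr) tuples coming from one source
    source -- name of that source, e.g. "R1"
    Hypothetical helper: rows without a lookup entry are skipped here.
    """
    out = []
    for src_id, descr in rows:
        skey = lookup.get((src_id, source))
        if skey is not None:
            out.append((skey, descr))
    return out


r1 = [(10, "coke"), (20, "pepsi")]
r2 = [(10, "pepsi"), (20, "fanta")]

# Union of both sources after key assignment, with duplicates removed.
rdw = sorted(set(assign_surrogate_keys(r1, "R1") + assign_surrogate_keys(r2, "R2")))
print(rdw)  # [(100, 'coke'), (110, 'pepsi'), (120, 'fanta')]
```

In the active setting the rows arrive as a stream and the Lookup table is a disk-based relation that may not fit in memory, which is the case the mesh join targets.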
Operation of Mesh-Join
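[Figure: three snapshots of the join module for a stream S and a relation R with two pages, p1 and p2.
t = 0: s1 enters the join module and page p1 is read and joined with s1.
t = 1: s2 enters; page p2 is read and joined with s2 and with s1, which has already been joined with p1.
t = 2: s3 enters; the scan resumes from p1, which is joined with s3 and with s2, which has already been joined with p2.]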
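The operation shown above can be sketched as a small in-memory simulation. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes an equality join on the first field of each tuple, a non-empty relation whose page count is divisible by b, and it uses hypothetical names (`mesh_join`, `hash_h`, `queue_q`).

```python
from collections import defaultdict, deque
from itertools import islice


def mesh_join(stream, pages, w=1, b=1):
    """Yield (s, r) for every stream tuple s and relation tuple r with equal keys.

    stream -- iterable of (key, payload) stream tuples
    pages  -- relation R as a non-empty list of pages, each a list of (key, payload)
    w      -- stream tuples admitted per iteration
    b      -- pages of R read per iteration (assumed to divide len(pages))
    """
    if not pages or len(pages) % b != 0:
        raise ValueError("this sketch assumes a non-empty R and b dividing its page count")
    iterations_per_scan = len(pages) // b   # iterations a stream tuple must "see"
    hash_h = defaultdict(list)              # H: join key -> stream tuples in memory
    queue_q = deque()                       # Q: batches of S in arrival order
    stream = iter(stream)
    pending = 0                             # stream tuples currently held in H
    next_page = 0                           # position of the cyclic scan over R
    while True:
        # The oldest batch has met every page of R exactly once: expire it.
        if len(queue_q) == iterations_per_scan:
            for s in queue_q.popleft():
                hash_h[s[0]].remove(s)
                pending -= 1
        batch = list(islice(stream, w))
        if not batch and pending == 0:
            return                          # stream exhausted and H drained
        for s in batch:
            hash_h[s[0]].append(s)
        pending += len(batch)
        queue_q.append(batch)
        # Read the next b pages of R and probe H with each relation tuple.
        for page in pages[next_page:next_page + b]:
            for r in page:
                for s in hash_h.get(r[0], ()):
                    yield (s, r)
        next_page = (next_page + b) % len(pages)


pages = [[(1, "a"), (2, "b")], [(3, "c"), (1, "d")]]   # p1, p2
stream = [(1, "s1"), (3, "s2"), (2, "s3")]
result = sorted(mesh_join(stream, pages))
print(result)
```

With w = 1 and b = 1 the run follows the three snapshots: s1 meets p1 and then p2, s2 meets p2 and then p1, and each stream tuple leaves memory once it has met both pages.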
(Not really any) Assumptions
• No assumption of any order in either the stream or the relation
• No indexes are necessarily present
• Limited memory is available
• The join condition is arbitrary (equality, similarity, range, etc.)
• The join relationship is general (i.e., many-to-many, one-to-many, or many-to-one)
• The result is exact.
… But …
• The relation remains fixed throughout the join
Architecture of Mesh-Join
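[Figure: architecture of Mesh-Join. A buffer holds b pages of relation R and a second buffer holds w tuples of stream S. Stream tuples are placed in hash H through a hash function, while queue Q keeps groups of w pointers to them. The join probes H with the b pages of R, again through the hash function, and writes matches to the output stream.]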
Critical issues
• The important measures are:
– the stream rate λ
– the available memory M
– the service rate μ of the join
• The main challenge is to interrelate these metrics in a cost formula, so as to be able to tune the system
– minimize M, given a desirable rate μ
– maximize μ, given a constraint of available memory M
Cost model: Memory wrt b, w
M = (size of b buffers) + (size of w buffers) + (size of queue Q) + (size of hash H)
N_R / b = # iterations a stream tuple must “see”
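The four memory components can be turned into a rough calculation. The sketch below is one possible reading, not the formula from the slides: it assumes that N_R is the number of pages of R, that Q keeps one pointer per stream tuple held in memory, and that H holds w · N_R / b stream tuples scaled by a hash-table overhead factor. The parameter names (`page_size`, `tuple_size`, `ptr_size`, `fudge`) and the sample numbers are hypothetical.

```python
def mesh_join_memory(w, b, n_pages, page_size, tuple_size, ptr_size=8, fudge=1.0):
    """Estimate memory in bytes as the sum of the four components."""
    iterations = n_pages / b                  # N_R / b iterations a stream tuple must "see"
    r_buffer = b * page_size                  # size of b buffers (pages of R)
    s_buffer = w * tuple_size                 # size of w buffers (tuples of S)
    queue = w * iterations * ptr_size         # size of queue Q (assumed: one pointer per tuple)
    hash_table = fudge * w * iterations * tuple_size   # size of hash H (assumed layout)
    return r_buffer + s_buffer + queue + hash_table


m_small_b = mesh_join_memory(w=100, b=10, n_pages=1000, page_size=4096, tuple_size=20)
m_large_b = mesh_join_memory(w=100, b=100, n_pages=1000, page_size=4096, tuple_size=20)
print(m_small_b, m_large_b)
```

Under these assumptions, a larger b enlarges the buffer for R but shortens the time each stream tuple stays in memory, so Q and H shrink; this is the trade-off the tuning exploits.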
Cost model: cost of an iteration wrt b, w
Cost model
c_loop = function (w, b)
M = function (w, b)
Interrelated M, μ, λ via w, b
Tuning: M,μ as a function of b
Minimize M, given a desirable rate μ
• Minimize w => minimize M
• Minimum w_min = λ · c_loop. In this case λ = μ
• Thus, M is a function only of b, computed by simple calculus
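The tuning step can be illustrated numerically. Since each iteration admits w stream tuples, the service rate is μ = w / c_loop, which matches w_min = λ · c_loop when μ = λ. The sketch below assumes a hypothetical linear iteration cost, c_loop(w, b) = c_fixed + c_page · b + c_tuple · w, and a hypothetical memory formula built from the four components (buffers, queue, hash). All constants are made up for illustration, and a simple search over b stands in for the calculus.

```python
def w_min(lam, b, c_fixed, c_page, c_tuple):
    """Smallest w with w / c_loop(w, b) = lam, for the assumed linear c_loop."""
    if lam * c_tuple >= 1:
        raise ValueError("stream too fast for this (assumed) cost model")
    return lam * (c_fixed + c_page * b) / (1 - lam * c_tuple)


def memory(w, b, n_pages, page_size, tuple_size, ptr_size=8):
    """Assumed memory formula: R buffer + S buffer + queue Q and hash H."""
    return b * page_size + w * tuple_size + w * (n_pages / b) * (ptr_size + tuple_size)


def tune(lam, n_pages, page_size, tuple_size, c_fixed, c_page, c_tuple):
    """Return (b, w, M) minimizing M over integer b in 1..n_pages."""
    best = None
    for b in range(1, n_pages + 1):
        w = w_min(lam, b, c_fixed, c_page, c_tuple)
        m = memory(w, b, n_pages, page_size, tuple_size)
        if best is None or m < best[2]:
            best = (b, w, m)
    return best


b, w, m = tune(lam=2000, n_pages=1000, page_size=4096, tuple_size=20,
               c_fixed=0.005, c_page=0.0001, c_tuple=0.00001)
print(b, round(w), round(m))
```

With w fixed to w_min for every candidate b, M depends on b alone, so a one-dimensional minimization is enough.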
Experimental methodology
• Synthetic data set: Zipf distribution, skew in [0,1], 10% of R as available memory, 3.5M rows, domain of 1.35M values
• Real data set: cloud cover data, 10M rows, domain of 36,000 values
• Indexed nested loops (INL) join as the competitor, based on a clustered B+-tree index in Berkeley DB
• Platform: Pentium IV 3GHz, 1GB main memory, 7200 RPM disk
Predicted and measured performance (synthetic data)
Performance for varying memory (synthetic data)
Performance for varying data skew (synthetic data)
Performance for varying memory (real-life data)
Conclusions
• We have proposed the mesh join, a join operator particularly fit for active data warehousing that operates under minimum assumptions for the stream and the relation
• We have presented a cost model and tuning methodology that accurately associates memory consumption with the incoming stream rate
Other capabilities & Possible extensions
• Approximate processing
• Ordered join output
• Tuning for join conditions other than equality
• Dynamic tuning for changes in the stream rate
• Possible Extensions
– multi-way joins
– other active ETL operators
Thank you for your attention!
… many thanks to our hosts!
This research was co-funded by the European Union in the framework of the program “Pythagoras II” of the “Operational Program for Education and Initial Vocational Training” of the 3rd Community Support Framework of the Hellenic Ministry of Education, funded by 25% from national sources and by 75% from the European Social Fund (ESF).
Figures of the Antikythera mechanism by Rupert Russell, URL: http://www.giant.net.au/users/rupert/kythera/kythera.htm
Questions?
Backup Slides
Related work
• Applications of Symmetric Hash-Joins over windows of streaming inputs that fit in main memory
– Chandrasekaran, Franklin @ VLDBJ, 2003
– Golab, Ozsu @ VLDB 2003
– Hammad, Franklin, Aref, Elmagarmid @ VLDB 2003
– Viglas, Naughton, Burger @ VLDB 2003
• Joins of streamed bounded relations: XJoin variants that flush overflow tuples to disk
– Dittrich, Seeger, Taylor, Widmayer @ VLDB 2002
– Tao, Yiu, Papadias, Hadjieleftheriou, Mamoulis @ SIGMOD 2005
Involved Measures
Cost model
I/O per second
I/O per stream tuple
Loops of Mesh Join
[Figure: two snapshots of the Mesh-Join loop at iteration k. In each, a buffer holds the current part π(k) of R, hash H holds the stream batches in memory (ω(1), …, ω(k) in the first snapshot), and queue Q holds pointers to those batches (ptrs to ω(1) … ptrs to ω(k)), with the newest batch ω(k) entering through the hash function. Some queue slots are marked empty.]