MIDDLEWARE SYSTEMS RESEARCH GROUP
Scaling Construction of Low Fan-out Overlays for Topic-based Publish/Subscribe Systems
Chen Chen (1)
joint work with Roman Vitenberg (3), Hans-Arno Jacobsen (1, 2)
(1) Department of Electrical and Computer Engineering, University of Toronto
(2) Department of Computer Science, University of Toronto
(3) Department of Informatics, University of Oslo
ICDCS 2011
Example: pub/sub
[Figure: three subscribers with interests IBM, IBM, and Microsoft, and two publications, <Microsoft, price = 50> and <IBM, price = 100>]
Pub/Sub
• A communication paradigm
  – Subscribers express their interests
  – Publishers disseminate messages
• Many applications and industry standards
  – Application integration, financial data dissemination, RSS feed distribution, business process management
  – WS-Notification, WS-Eventing, OMG’s Data Distribution Service for Real-time Systems
• Topic-based pub/sub
  – TIBCO RV
  – Google’s GooPS
Two directions for pub/sub
Design of routing protocols
• Design protocols so that publications and subscriptions are sent most efficiently across the overlay network.
• G. Li et al., ICDCS’08
• M. Castro et al., JSAC’02
Construction of overlay
• Construct the overlay topology such that network traffic is minimized.
• Chockler et al., PODC’07
• Onus et al., INFOCOM’09
Desirable properties for overlays
• Low average node degree
• Low maximum node degree
• Low diameter
• Topic-connectivity
• Efficiency to construct
• Adaptability to churn
• Ease of distributed implementation
[Figure: example overlay over nodes V1 {b,c,d}, V2 {a}, V3 {b,d}, V4 {a,b}, V5 {a,c}]
Our contributions
Previous greedy algorithm
• High runtime cost: $O(|V|^4 |T|)$
• Full knowledge requirement
• Centralized operation (difficult to decentralize)
Our divide-and-conquer algorithm
• Low runtime cost
• Partial knowledge requirement
• Centralized operation (easy to decentralize)
Topic-connected overlay (TCO)
[Figure: an overlay G over nodes V1 {b,c,d}, V2 {a}, V3 {b,d}, V4 {a,b}, V5 {a,c}, shown with its suboverlays Ga (V2, V4, V5) and Gb (V1, V3, V4)]
• Suboverlay Ga is topic-connected
• Suboverlay Gb is NOT topic-connected
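An overlay is topic-connected when, for every topic, the nodes interested in that topic form a connected suboverlay. The sketch below checks this property in Python. It is a minimal illustration: the function name and the edge list are assumptions made for the example (the edges of the figure are not recoverable), while the interests are the ones shown on the slide.

```python
def disconnected_topics(interests, edges):
    """Return the topics whose interested nodes do NOT form a connected suboverlay."""
    bad = set()
    for topic in set().union(*interests.values()):
        members = {v for v, topics in interests.items() if topic in topics}
        # Depth-first search restricted to edges between interested nodes.
        seen, stack = set(), [next(iter(members))]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            for a, b in edges:
                if a == node and b in members:
                    stack.append(b)
                elif b == node and a in members:
                    stack.append(a)
        if seen != members:
            bad.add(topic)
    return bad


# Interests as on the slide; the edges are hypothetical.
interests = {
    "v1": {"b", "c", "d"}, "v2": {"a"}, "v3": {"b", "d"},
    "v4": {"a", "b"}, "v5": {"a", "c"},
}
edges = [("v5", "v2"), ("v5", "v4"), ("v1", "v4"), ("v1", "v5")]
print(sorted(disconnected_topics(interests, edges)))
```

With these hypothetical edges, v3 has no link to another subscriber of b or d, so both topics are reported; adding the edge (v1, v3) makes the overlay topic-connected.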
MinMax-TCO
[Figure: two overlays over the same five nodes; in the first, V5 has 3 edges, and in the second, V1 has 4 edges]
MinMax-TCO problem and GM-M algorithm [Onus, 2009]
• Minimum Maximum Degree Topic-Connected Overlay (MinMax-TCO) problem
  – Given a set of nodes V, a set of topics T, and $\mathit{Interest}: V \times T \to \{\mathit{true}, \mathit{false}\}$, construct a topic-connected overlay G with minimum maximum degree.
• Theorem: MinMax-TCO is NP-complete
• GM-M algorithm (MinMax-ODA)
  – Always greedily adds an edge which 1) has the largest edge contribution, and 2) increases the maximum node degree minimally
  – Logarithmic approximation ratio: $O(\log(|V||T|))$
  – Time complexity: $O(|V|^4 |T|)$
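The greedy rule can be sketched in Python as follows. This is an illustrative reading of the slide, not the authors' implementation: it assumes that the edge contribution is the number of topics for which the edge merges two components, and that candidates are ordered first by how little they raise the current maximum degree and then by contribution. The name `gm_m_sketch` is hypothetical.

```python
from itertools import combinations


def gm_m_sketch(interests):
    """Greedily add edges until every topic is connected; returns the edge list."""
    nodes = sorted(interests)
    # component[t][v]: label of v's connected component in the suboverlay of topic t
    component = {}
    for v in nodes:
        for t in interests[v]:
            component.setdefault(t, {})[v] = v
    degree = {v: 0 for v in nodes}
    edges = []

    while True:
        max_degree = max(degree.values(), default=0)
        best = None
        for u, v in combinations(nodes, 2):
            shared = interests[u] & interests[v]
            # Assumed edge contribution: topics whose components (u, v) would merge.
            gain = sum(1 for t in shared if component[t][u] != component[t][v])
            if gain == 0:
                continue
            increase = max(0, max(degree[u], degree[v]) + 1 - max_degree)
            key = (increase, -gain)  # assumed order: degree increase, then contribution
            if best is None or key < best[0]:
                best = (key, u, v)
        if best is None:  # no edge merges anything: every topic is connected
            return edges
        _, u, v = best
        edges.append((u, v))
        degree[u] += 1
        degree[v] += 1
        for t in interests[u] & interests[v]:
            old, new = component[t][v], component[t][u]
            for w in component[t]:
                if component[t][w] == old:
                    component[t][w] = new


interests = {
    "v1": {"b", "c", "d"}, "v2": {"a"}, "v3": {"b", "d"},
    "v4": {"a", "b"}, "v5": {"a", "c"},
}
print(gm_m_sketch(interests))
```

Each iteration scans all node pairs and all shared topics, which is why the cost grows quickly with the number of nodes.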
Why divide-and-conquer
• GM-M’s running time is high
  – Time complexity: $O(|V|^4 |T|)$
  – 487 minutes for |V| = 1000, |T| = 100, uniform distribution*
• The number of nodes is the dominant factor
To improve running time: reduce the size of the node set, i.e. divide-and-conquer based on node set V
* Uniform: every topic is equally likely to be chosen by a node
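A short worked comparison shows why the node count dominates. Taking the $O(|V|^4 |T|)$ bound as a rough cost model (an assumption, since constants and lower-order terms are ignored), doubling the number of nodes multiplies the cost by 16, whereas doubling the number of topics only doubles it:

```latex
\frac{(2|V|)^4 \, |T|}{|V|^4 \, |T|} = 16,
\qquad
\frac{|V|^4 \, (2|T|)}{|V|^4 \, |T|} = 2
```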
Divide-and-conquer (DC)
[Figure: example with 15 nodes V0 to V14, each subscribed to a subset of topics {a,b,c,d}, illustrating the three steps]
- Divide overlay based on V
- Conquer each sub-TCO by GM-M
- Combine via cross-TCO links
Challenges for divide
Divide the MinMax-TCO problem into several sub-overlay construction problems
Node clustering: nodes with similar interests are placed together
• High runtime cost
• Not trivial to decentralize
• Outputs with varying sizes
Random partitioning: each node flips a coin and gets assigned to one of the partitions
• Fast
• Easy to tune
• Straightforward to decentralize
However,
• May lose correlation among nodes due to randomness
• Maximum node degree is very sensitive to random partitioning
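Random partitioning is simple enough to sketch in a few lines of Python. The function name, the `seed` parameter, and the node labels are assumptions for the example; the only behavior taken from the slide is that each node independently picks one of the partitions at random.

```python
import random


def random_partition(nodes, p, seed=None):
    """Assign every node independently and uniformly at random to one of p partitions."""
    rng = random.Random(seed)
    partitions = [[] for _ in range(p)]
    for node in nodes:
        partitions[rng.randrange(p)].append(node)
    return partitions


nodes = [f"v{i}" for i in range(15)]
parts = random_partition(nodes, p=3, seed=1)
print([len(part) for part in parts])
```

Because each node decides on its own, the same coin flip can be made locally by every node, which is what makes this scheme straightforward to decentralize. Partition sizes are only balanced in expectation.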
Bad case for random partitioning
[Figure: a hierarchy of nodes: v_all subscribed to {t1, ..., t8}; Va1 and Va2 subscribed to {t1, t2, t3, t4} and {t5, t6, t7, t8}; Vb1 to Vb4 subscribed to {t1, t2}, {t3, t4}, {t5, t6}, {t7, t8}; V1 to V8 subscribed to {t1}, ..., {t8}]
Random partitioning may increase the degrees of individual nodes by a factor of |T|
Poor performance of DC for MinMax-TCO
Pub/sub workloads
• The number of nodes |V|: from 1000 to 8000
• The number of topics |T|: from 100 to 1000
• The subscription size: from 50 to 150 on average
• Topic popularity
  – Uniform: [Chockler, 2007]
  – Zipf: feed popularity distribution in RSS [Liu, 2005]
  – Exponential: stock popularity in NYSE [Tock, 2005]
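A synthetic workload of this kind could be generated along the following lines. This Python sketch is only one plausible model and is not the generator used in the talk: the function name, the `exponent` parameter, and the rule that a node subscribes to each topic independently with a probability proportional to the topic's Zipf weight are all assumptions.

```python
import random


def zipf_workload(num_nodes, num_topics, mean_size, exponent=1.0, seed=0):
    """Return {node: set of topics}; the topic of rank r has weight 1 / r**exponent."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_topics + 1)]
    total = sum(weights)
    # Per-topic subscription probability, scaled towards mean_size and capped at 1.
    probs = [min(1.0, mean_size * w / total) for w in weights]
    return {
        node: {t for t in range(num_topics) if rng.random() < probs[t]}
        for node in range(num_nodes)
    }


workload = zipf_workload(num_nodes=50, num_topics=100, mean_size=20)
print(sum(len(topics) for topics in workload.values()) / len(workload))
```

Because probabilities are capped at 1, the average subscription size can fall below `mean_size` when a few topics are very popular.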
Learn from workloads
Observations
• Increased maximum node degree occurs when a node subscribes to a large number of topics
• “Pareto 80-20” rule:
  – most nodes subscribe to a relatively small number of topics
  – only a relatively small number of nodes might be interested in a large number of topics
Basic idea
• Special treatment for those nodes interested in many topics
Bulk nodes
Given (V, T, Int), the bulk node set is a subset $B \subseteq V$ such that
$B = \{ v \in V : |T_v| \geq \eta \}$
where $T_v$ is the set of topics subscribed to by node v and η is the bulk subscriber threshold.
The lightweight node set is $L = V \setminus B$.
The bulk subscriber threshold η can be determined based on historical results.
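The split into bulk and lightweight nodes can be sketched as below. The function name is hypothetical, and treating a node whose subscription size equals η as a bulk node (a non-strict comparison) is an assumption of this example.

```python
def split_bulk(interests, eta):
    """Return (bulk, lightweight) node sets for the bulk subscriber threshold eta."""
    # Assumption: a node with exactly eta topics counts as a bulk node.
    bulk = {v for v, topics in interests.items() if len(topics) >= eta}
    lightweight = set(interests) - bulk
    return bulk, lightweight


interests = {
    "v1": {"b", "c", "d"}, "v2": {"a"}, "v3": {"b", "d"},
    "v4": {"a", "b"}, "v5": {"a", "c"},
}
bulk, lightweight = split_bulk(interests, eta=3)
print(sorted(bulk), sorted(lightweight))
```

A lower η moves more nodes into the bulk set, which shrinks the lightweight partitions but enlarges the set of nodes handled in the combine phase.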
Challenges for combine
Combine multiple sub-TCOs into one by adding cross-TCO links as bridges
• Not all nodes need to participate
• How to select node subsets for cross-TCO links?
  – Small subsets: increased node degrees
  – Large subsets: degraded time efficiency
Representative set
Given a TCO (V, T, Int, E), a representative set (rep set) is a subset of V that covers all of V’s topics λ times.
[Figure: a topic-connected overlay over nodes V1 {b,c,d}, V2 {a}, V3 {b,d}, V4 {a,b}, V5 {a,c}, with two rep sets highlighted]
• {v3, v5} is a 1-rep set, which covers all topics {a,b,c,d}
• {v1, v2, v3, v5} is a 2-rep set; {a,b,c,d} is covered twice.
Representative nodes
• Representative nodes (rep-nodes)
  – Represent the interests of all the nodes
  – Can function as bridges to determine cross-TCO links
  – Coverage factor λ: for tuning the size of the rep set
• Observation: for typical pub/sub workloads and sufficiently large partitions, minimal rep sets tend to be several times smaller than the total number of nodes.
• How to find a minimal rep set $R_\lambda$ for (V, T, Int)?
  – Linearly reducible to the classic set cover problem: NP-complete
  – Greedy algorithm: always add a node with the largest number of topics that are not yet λ-covered
    • logarithmic approximation ratio
    • can be implemented efficiently
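The greedy selection of rep-nodes can be sketched in Python as follows. The function name and the tie-breaking by node name are assumptions; so is the handling of topics with fewer than λ subscribers, which here only need to be covered by all of their subscribers.

```python
def greedy_rep_set(interests, lam=1):
    """Greedily pick nodes until every topic is covered lam times (or by all its subscribers)."""
    need = {}  # need[t]: how many more selected subscribers topic t still requires
    for topics in interests.values():
        for t in topics:
            need[t] = min(lam, need.get(t, 0) + 1)
    reps = []
    remaining = sorted(interests)  # sorted for deterministic tie-breaking (assumption)
    while any(need.values()):
        # Pick the node with the most topics that are not yet lam-covered.
        best = max(remaining, key=lambda v: sum(1 for t in interests[v] if need[t] > 0))
        remaining.remove(best)
        reps.append(best)
        for t in interests[best]:
            if need[t] > 0:
                need[t] -= 1
    return reps


interests = {
    "v1": {"b", "c", "d"}, "v2": {"a"}, "v3": {"b", "d"},
    "v4": {"a", "b"}, "v5": {"a", "c"},
}
print(greedy_rep_set(interests, lam=1))
print(greedy_rep_set(interests, lam=2))
```

For λ = 1 this picks v1 first (three uncovered topics) and then a subscriber of a, giving a 1-rep set of size two, the same size as the {v3, v5} example on the previous slide.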
Divide-and-Conquer with Bulk and Lightweight Rep-nodes (DCBR-M)
[Figure: example run of DCBR-M on 21 nodes V0 to V20, each subscribed to a subset of topics {a, ..., h}; three of the nodes subscribe to six or seven topics]
Design of DCBR-M algorithm
• Different parameters for tuning the algorithm:
  – The bulk subscriber threshold η (affects divide, combine): bulk nodes vs. lightweight nodes
  – The coverage factor λ (affects combine): time efficiency vs. the quality of the TCO
  – The number of lightweight partitions p (affects divide, conquer)
    • p = |L| (one node per partition): combine only
    • p = 1 (all nodes in one partition): conquer only
• How to decentralize DCBR-M
  – Nodes autonomously organize themselves into random partitions
  – Different partitions construct inner edges in parallel
  – Different partitions compute rep sets in parallel
  – Bulk nodes and rep-nodes communicate and compute outer edges
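How the three parameters fit together can be sketched end to end in Python. This is a centralized toy version under several assumptions, not the authors' algorithm: `greedy_tco` is a simplified greedy stand-in for GM-M, `rep_set` is a greedy λ-cover, the bulk test uses a non-strict comparison with η, and the combine phase simply builds a fresh overlay over bulk nodes plus rep-nodes. All function names are hypothetical.

```python
import random
from itertools import combinations


def greedy_tco(interests):
    """Simplified greedy stand-in for GM-M; returns a set of edges (u, v) with u < v."""
    comp = {}
    for v, topics in interests.items():
        for t in topics:
            comp.setdefault(t, {})[v] = v
    degree = dict.fromkeys(interests, 0)
    edges = set()
    while True:
        best = None
        for u, v in combinations(sorted(interests), 2):
            gain = sum(comp[t][u] != comp[t][v] for t in interests[u] & interests[v])
            if gain:
                key = (max(degree[u], degree[v]), -gain)  # low degree first, then gain
                if best is None or key < best[0]:
                    best = (key, u, v)
        if best is None:
            return edges
        _, u, v = best
        edges.add((u, v))
        degree[u] += 1
        degree[v] += 1
        for t in interests[u] & interests[v]:
            old, new = comp[t][v], comp[t][u]
            for w in comp[t]:
                if comp[t][w] == old:
                    comp[t][w] = new


def rep_set(interests, lam):
    """Greedy lam-cover of the topics appearing in `interests`."""
    need = {}
    for topics in interests.values():
        for t in topics:
            need[t] = min(lam, need.get(t, 0) + 1)
    reps, rest = [], sorted(interests)
    while any(need.values()):
        best = max(rest, key=lambda v: sum(need[t] > 0 for t in interests[v]))
        rest.remove(best)
        reps.append(best)
        for t in interests[best]:
            need[t] = max(0, need[t] - 1)
    return reps


def dcbr_m_sketch(interests, eta, lam, p, seed=0):
    rng = random.Random(seed)
    bulk = {v for v, topics in interests.items() if len(topics) >= eta}
    parts = [[] for _ in range(p)]
    for v in sorted(set(interests) - bulk):          # divide: random partitions
        parts[rng.randrange(p)].append(v)
    edges, outer = set(), set(bulk)
    for part in parts:
        sub = {v: interests[v] for v in part}
        edges |= greedy_tco(sub)                      # conquer: inner edges
        outer |= set(rep_set(sub, lam))               # rep-nodes of this partition
    edges |= greedy_tco({v: interests[v] for v in outer})  # combine: outer edges
    return edges
```

The result is topic-connected for any λ ≥ 1: within each partition every topic is connected by inner edges, each partition contributes at least one rep-node per topic it contains, and the outer edges connect all bulk nodes and rep-nodes that share a topic.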
Theoretical analysis of DCBR-M
• DCBR-M generates a TCO whose maximum node degree is asymptotically the same as that of the TCO output by GM-M, under a realistic assumption for typical pub/sub workloads.
• The running time of DCBR-M is $O\left(|T| \left( \frac{|L|^4}{p^3} + (|B| + |R|)^4 \right)\right)$
• Considerable speedup when |B| and |R| are small
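A bound of the form $O(|T| (|L|^4 / p^3 + (|B| + |R|)^4))$ can be motivated by a short calculation. The derivation below is a sketch under stated assumptions: GM-M costs $O(n^4 |T|)$ on n nodes, the p lightweight partitions have roughly equal size $|L| / p$, R denotes the set of all rep-nodes, and the combine phase runs GM-M over the bulk nodes and rep-nodes together.

```latex
\underbrace{p \cdot \left( \frac{|L|}{p} \right)^{4} |T|}_{\text{conquer}}
\;+\;
\underbrace{\left( |B| + |R| \right)^{4} |T|}_{\text{combine}}
\;=\;
|T| \left( \frac{|L|^{4}}{p^{3}} + \left( |B| + |R| \right)^{4} \right)
```

The first term shrinks cubically in p, so the combine term dominates unless |B| and |R| stay small.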
Evaluation for DCBR-M (1)
Evaluation for DCBR-M (2)
Evaluation for DCBR-M (3)
Conclusion
Algorithm | Running time   | Max degree | Avg degree | Required information | Potential to decentralize
RingPT    | good           | poor: 168  | poor: 92   | full knowledge       | good
GM-M      | poor: 487 min  | good: 5    | good: 3.88 | full knowledge       | poor
DCBR-M    | good: 13.6 sec | good: 6    | good: 4.29 | partial knowledge    | good
Backup
Related work
• Construction of the overlay
  – MinAvg-TCO, Chockler et al., PODC’07
  – MinMax-TCO, Onus et al., INFOCOM’09
  – Low-TCO, Onus et al., ICDCS’10
  – DC for MinAvg-TCO, Chen et al., ICDCS’10
• Design of routing protocols
  – G. Li et al., ICDCS’08
  – M. Castro et al., JSAC’02
Minimal Number of Links
• A typical pub/sub system combines a number of protocols, many of which maintain per-link state
  – A node must constantly monitor the availability of each of its neighbors (heartbeats and keep-alive state)
  – If the links are maintained using TCP, there is the cost of connection state for each link
  – The more links there are, the fewer topics can be routed over each individual link, thereby diminishing cross-topic aggregation benefits
  – If a sequential-diff-based compression scheme is used, there is an extra cost associated with a history table
DCBR-M vs DC
• MinMax-TCO vs MinAvg-TCO: fundamentally different problems
  – Average node degree is a “global” property; maximum node degree possesses both “global” and “local” properties.
  – DC for MinAvg-TCO does not directly apply to MinMax-TCO.
  – MinMax-TCO is more sensitive to divide, conquer and combine.
  – Different algorithm design, theoretical analysis, and experiments.