1
J. He, G. Kesidis and D.J. Miller – The Pennsylvania State University In collaboration with K. Levitt, J. Rowe, S.F. Wu – The University of California at Davis Query clustering: Bernoulli model involving prior probabilities that an arbitrary query is in cluster j and probabilities that the a th keyword is used in the query generated based on cluster j. These quantities are learned iteratively by Expectation-Maximization with model-order selection (BIC). Experimental Set-Up Benchmark query forwarding algorithms: RW+Cache: next pod of the exactly matched cached query or choose a random one Closest: next pod of closest cached query Performance metrics: Forwarding - Average number of query hops, Query success probability Clustering - IG name accuracy (distance between a learned IG and the nearest true IG), IG number accuracy Network and query configuration: 256-KW dictionary, 50 IGs, 10-15 KWs per IG KW overlap among IGs, and there are a limited number of KWs in queries, leading to query-to-IG ambiguity. 256 pods, 1-2 users per pod, 3-4 neighbors per pod, 3 IGs per user Query rate of IGs follows Zipf distribution Cache size per pod: 300 queries Re-clustering interval: 25 cache touches Emulab experimental set-up: 16 Emulab nodes, each has 16 pods running on it, so there are 256 pods in total. 100Mbps link rate and no packet loss in link layer Each pod has two modules: Future Work Explore different IG removal/identification methods NLP issues: multiple definitions of each KW, KW misspelling, extraneous KWs, etc. When privacy not an issue, clarify interfacing with centralized OSNs or futuristic ICNs/CCNs Different query forwarding algorithms including privacy-preserving primitives Cooperative learning via sharing centroids Integrating our pod module to run on Emulab with Diaspora pod codebase. Latent Interest Group Discovery and Management by P2P OSNs Abstract Recently, specialized P2P architectures have been proposed to alleviate privacy concerns of centralized Online Social Networks (OSNs). To resolve anycast queries in P2P OSNs, clustering techniques are employed to discover emerging latent Interest Groups (IGs). In this poster, we give some performance results for our proposed approach based on Emulab experiments. Motivation Centralized online social network (OSN) architecture, e.g. Facebook: pros - centralized networking; cons - privacy including cookies and stored information in the central server including location, targeted advertising by the central OSN itself, and threats from third parties (governments, hackers). Privacy issues motivate consideration of a peer-to-peer (P2P) OSN architecture, e.g., Diaspora, Friendica. But currently unsophisticated networking among pods/superpeers. In particular, latent interest group (IG) discovery and anycasting. Our approach Assumptions for preliminary study: A standard dictionary used by all with λ uniquely defined keywords (KWs). All IGs are initially latent. Pods know their active local users (simply by checking local login data). A pod does not share “raw” user data for privacy concerns. No pod leaves/joins, i.e., static social graph. Interest groups: An IG is a set of users who actively share a common interest. Queries are coded as bit vectors in standard keyword space, e.g., q {0,1} λ . IG identified as the centroid of a cluster of queries. E.g., Euclidean distance between two queries or IG-centroids. Query forwarding: Upon receiving a query q, a pod takes the following steps (failing one, the next is tried): 1. Centralized lookup, e.g., via CCN; 2. Search local users; 3. Find the centroid, C, closest to q; 4. If q is closer to C than τ (< 1) times the distance between q and its second nearest centroid, then use the forwarding table at C; 5. Random Walk / Limited-Scope Flood So, initially forward by RW/LSF Query forwarding table: Forwarding a query q first matched to closest centroid C2, then to pod p 22 associated with the closest stored query q 22 within C’s cluster. Experimental Results EM-BIC uses 22% less hops than Closest, and 36% less than RW+Cache Success Prob.: EM-BIC and Closest (>96% ) vs. RW+Cache (~80%) Acknowledgements This work supported by NSF CNS grant 1152320. References Details of this work (with a description of related work and other simulated experimental results) was recently published by us in Proc. ASE/IEEE SOCIALCOM, Washington, DC, Sept. 2013. More distant IGs can be discovered when # of cache touches increases Monotonic improvement in clustering performance with cache touches Clustering 300 IGs can take as long as 200s .

J. He, G. Kesidis and D.J. Miller – The Pennsylvania State University In collaboration with K. Levitt, J. Rowe, S.F. Wu – The University of California

Embed Size (px)

Citation preview

Page 1: J. He, G. Kesidis and D.J. Miller – The Pennsylvania State University In collaboration with K. Levitt, J. Rowe, S.F. Wu – The University of California

J. He, G. Kesidis and D.J. Miller – The Pennsylvania State UniversityIn collaboration with K. Levitt, J. Rowe, S.F. Wu – The University of California at Davis

Query clustering: Bernoulli model involving prior probabilities that an arbitrary query is in cluster j and probabilities that the ath keyword is used in the query generated based on cluster j. These quantities are learned iteratively by Expectation-Maximization with model-order selection (BIC).

Experimental Set-Up

Benchmark query forwarding algorithms:• RW+Cache: next pod of the exactly matched

cached query or choose a random one• Closest: next pod of closest cached query

Performance metrics:• Forwarding - Average number of query hops,

Query success probability• Clustering - IG name accuracy (distance between

a learned IG and the nearest true IG), IG number accuracy

Network and query configuration: • 256-KW dictionary, 50 IGs, 10-15 KWs per IG• KW overlap among IGs, and there are a limited

number of KWs in queries, leading to query-to-IG ambiguity.

• 256 pods, 1-2 users per pod, 3-4 neighbors per pod, 3 IGs per user

• Query rate of IGs follows Zipf distribution• Cache size per pod: 300 queries• Re-clustering interval: 25 cache touches

Emulab experimental set-up: • 16 Emulab nodes, each has 16 pods running on it,

so there are 256 pods in total.• 100Mbps link rate and no packet loss in link layer• Each pod has two modules:

Future Work• Explore different IG removal/identification methods• NLP issues: multiple definitions of each KW, KW

misspelling, extraneous KWs, etc.• When privacy not an issue, clarify interfacing with

centralized OSNs or futuristic ICNs/CCNs• Different query forwarding algorithms including

privacy-preserving primitives• Cooperative learning via sharing centroids• Integrating our pod module to run on Emulab with

Diaspora pod codebase.

Latent Interest Group Discovery and Management by P2P OSNs

Abstract

Recently, specialized P2P architectures have been proposed to alleviate privacy concerns of centralized Online Social Networks (OSNs). To resolve anycast queries in P2P OSNs, clustering techniques are employed to discover emerging latent Interest Groups (IGs). In this poster, we give some performance results for our proposed approach based on Emulab experiments.

Motivation

Centralized online social network (OSN) architecture, e.g. Facebook:

• pros - centralized networking; • cons - privacy including cookies and stored

information in the central server including location, targeted advertising by the central OSN itself, and threats from third parties (governments, hackers).

• Privacy issues motivate consideration of a peer-to-peer (P2P) OSN architecture, e.g., Diaspora, Friendica.

• But currently unsophisticated networking among pods/superpeers.

• In particular, latent interest group (IG) discovery and anycasting.

Our approach

Assumptions for preliminary study: A standard dictionary used by all with λ uniquely defined keywords (KWs). All IGs are initially latent. Pods know their active local users (simply by checking local login data). A pod does not share “raw” user data for privacy concerns. No pod leaves/joins, i.e., static social graph.

Interest groups: An IG is a set of users who actively share a common interest. Queries are coded as bit vectors in standard keyword space, e.g., q ∈ {0,1}λ. IG identified as the centroid of a cluster of queries. E.g., Euclidean distance between two queries or IG-centroids.

Query forwarding: Upon receiving a query q, a pod takes the following steps (failing one, the next is tried):

1. Centralized lookup, e.g., via CCN;2. Search local users;3. Find the centroid, C, closest to q;4. If q is closer to C than τ (< 1) times the

distance between q and its second nearest centroid, then use the forwarding table at C;

5. Random Walk / Limited-Scope FloodSo, initially forward by RW/LSF

Query forwarding table: Forwarding a query q first matched to closest centroid C2, then to pod p22 associated with the closest stored query q22 within C’s cluster.

Experimental Results

• EM-BIC uses 22% less hops than Closest, and 36% less than RW+Cache

• Success Prob.: EM-BIC and Closest (>96% ) vs. RW+Cache (~80%)

Acknowledgements

This work supported by NSF CNS grant 1152320.

References

Details of this work (with a description of related work and other simulated experimental results) was recently published by us in Proc. ASE/IEEE SOCIALCOM, Washington, DC, Sept. 2013.

• More distant IGs can be discovered when # of cache touches increases

• Monotonic improvement in clustering performance with cache touches

• Clustering 300 IGs can take as long as 200s.