5
[ exploratory DSP ] Philippe Cudré-Mauroux and Karl Aberer Message Passing in Semantic Peer-to-Peer Overlay Networks P eer-to-peer (P2P) systems rely on machine-to-machine ad- hoc communications to offer services to a community. Contrary to the classical client-server architecture, P2P systems consider all peers, i.e., all nodes partici- pating in the network, as being equal. Hence, peers can at the same time act as clients consuming resources from the system, and as servers providing resources to the community. P2P appli- cations function on top of existing rout- ing infrastructures, typically on top of the IP network, and organize peers into logical and decentralized structures called overlay networks. In this column, we discuss explorato- ry research related to data management in P2P overlay networks. First, we dis- cuss the notions of unstructured and structured P2P overlay networks. Then, we discuss data management in such networks by introducing an additional layer to handle semantic heterogeneity and data integration. Finally, we present a method based on sum-product message passing to detect inconsistent informa- tion in this setting. P2P OVERLAY NETWORKS— ARCHITECTURE Increasingly on the Internet, applications are supported by sets of loosely connected machines operating without any form of central coordination; Internet telephony networks such as Skype [1] and file shar- ing applications like Gnutella [2] are two well-known examples of this trend. Contrary to the client-server setting, where applications are bound to sets of static servers identified by an IP address, these applications need ways of organizing the dynamic sets of machines providing the service. P2P overlay networks address this need and allow the management of virtual and decentralized networks created on top of the IP infrastructure. The virtual structure connecting all the peers operating in an overlay network can vary. In unstructured overlay net- works such as Gnutella, peers establish connections to a fixed number of other peers, creating a random graph of P2P connections. Requests originating from one peer are forwarded by the other peers in a cooperative manner, as depicted in Figure 1(a). This relatively simple and robust mechanism is, however, network- intensive, as it broadcasts all queries to all peers within a certain radius irrespec- tive of the content of the query. Structured overlay networks were introduced to alleviate network traffic while maximizing the probability of a query locating a specific peer. Peers in a structured overlay can for example be organized on a multidimensional torus [3] or into a virtual binary search tree, as promulgated by the P-Grid P2P system [4] and illustrated in Figure 1(b). Such systems provide hash-table functionali- ties on an Internet-like scale and are known as distributed hash tables (DHTs). They typically enable global search on shared data items in a totally decentral- ized way in O (log(N )) messages (i.e., packets sent from one peer to another), where N is the number of peers in the overlay. P2P OVERLAY NETWORKS— DATA MANAGEMENT SEMANTIC HETEROGENEITY P2P overlays originally dealt with very simple data and query models: only file names were shared and queries were composed of a single hash value or a key- word. Rapidly, several research efforts [4] tried to enrich overlay networks with more expressive models to support struc- tured data conforming to schemas. In 1053-5888/07/$25.00©2007IEEE [FIG1] Two P2P architectures: (a) an unstructured P2P overlay a la Gnutella: a query originating from a peer on the left-hand side of the figure is iteratively gossiped up to three times, and (b) a structured distributed hash-table a la P-grid: peers are organized into a virtual binary tree. 0 1 00 01 10 11 1 3 1 1 2 2 2 2 2 2 3 3 3 3 A Structured Distributed Hash Table An Unstructured Overlay (a) (b) Digital Object Identifier 10.1109/MSP.2007.904799 IEEE SIGNAL PROCESSING MAGAZINE [131] SEPTEMBER 2007

Message Passing in Semantic Peer-to-Peer Overlay Networks Ppeople.csail.mit.edu/pcm/papers/SignalProcessing.pdf ·  · 2009-02-09Message Passing in Semantic Peer-to-Peer Overlay

  • Upload
    donhu

  • View
    216

  • Download
    0

Embed Size (px)

Citation preview

[exploratory DSP]Philippe Cudré-Maurouxand Karl Aberer

Message Passing in Semantic Peer-to-PeerOverlay Networks

Peer-to-peer (P2P) systems relyon machine-to-machine ad-hoc communications to offerservices to a community.Contrary to the classical

client-server architecture, P2P systemsconsider all peers, i.e., all nodes partici-pating in the network, as being equal.Hence, peers can at the same time act asclients consuming resources from thesystem, and as servers providingresources to the community. P2P appli-cations function on top of existing rout-ing infrastructures, typically on top ofthe IP network, and organize peers intological and decentralized structurescalled overlay networks.

In this column, we discuss explorato-ry research related to data managementin P2P overlay networks. First, we dis-cuss the notions of unstructured andstructured P2P overlay networks. Then,we discuss data management in suchnetworks by introducing an additionallayer to handle semantic heterogeneityand data integration. Finally, we presenta method based on sum-product messagepassing to detect inconsistent informa-tion in this setting.

P2P OVERLAY NETWORKS—ARCHITECTUREIncreasingly on the Internet, applicationsare supported by sets of loosely connectedmachines operating without any form ofcentral coordination; Internet telephonynetworks such as Skype [1] and file shar-ing applications like Gnutella [2] are twowell-known examples of this trend.Contrary to the client-server setting,where applications are bound to sets ofstatic servers identified by an IP address,these applications need ways of organizing

the dynamic sets of machines providingthe service. P2P overlay networks addressthis need and allow the management ofvirtual and decentralized networks createdon top of the IP infrastructure.

The virtual structure connecting allthe peers operating in an overlay networkcan vary. In unstructured overlay net-works such as Gnutella, peers establishconnections to a fixed number of otherpeers, creating a random graph of P2Pconnections. Requests originating fromone peer are forwarded by the other peersin a cooperative manner, as depicted inFigure 1(a). This relatively simple androbust mechanism is, however, network-intensive, as it broadcasts all queries toall peers within a certain radius irrespec-tive of the content of the query.

Structured overlay networks wereintroduced to alleviate network trafficwhile maximizing the probability of aquery locating a specific peer. Peers in astructured overlay can for example beorganized on a multidimensional torus

[3] or into a virtual binary search tree, aspromulgated by the P-Grid P2P system[4] and illustrated in Figure 1(b). Suchsystems provide hash-table functionali-ties on an Internet-like scale and areknown as distributed hash tables (DHTs).They typically enable global search onshared data items in a totally decentral-ized way in O (log(N )) messages (i.e.,packets sent from one peer to another),where N is the number of peers in theoverlay.

P2P OVERLAY NETWORKS—DATA MANAGEMENT

SEMANTIC HETEROGENEITYP2P overlays originally dealt with verysimple data and query models: only filenames were shared and queries werecomposed of a single hash value or a key-word. Rapidly, several research efforts [4]tried to enrich overlay networks withmore expressive models to support struc-tured data conforming to schemas. In

1053-5888/07/$25.00©2007IEEE

[FIG1] Two P2P architectures: (a) an unstructured P2P overlay a la Gnutella: a queryoriginating from a peer on the left-hand side of the figure is iteratively gossiped up tothree times, and (b) a structured distributed hash-table a la P-grid: peers are organizedinto a virtual binary tree.

0 1

00 01 10 11

1

3

1

1

2

22

2

2

2

3

3

3

3

A Structured Distributed Hash TableAn Unstructured Overlay

(a) (b)

Digital Object Identifier 10.1109/MSP.2007.904799

IEEE SIGNAL PROCESSING MAGAZINE [131] SEPTEMBER 2007

data management, schemas are used asdeclarative models to define the structureof the data. Relational schemas, for exam-ple, organize data in n-ary relations (i.e.,tables), while XML schemas structuredata through hierarchies of elements.

Sharing (semi) structured data, suchas relational tuples or XML instances, ishowever intricate in decentralized andP2P networks. Rapidly, the problem ofsemantic heterogeneity surfaces, as dis-tinct sources might encode similar infor-mation using different schemas tostructure their data.

To tackle the problem of semanticheterogeneity and relate similar pieces of

information encoded using differentschemas, database systems take advan-tage of schema mappings. Mappings pro-vide a bridge between the definitions oftwo schemas. They allow to reformulatea structured query, such as an SQLquery, posed against one schema into anew query posed against a similarschema. A centralized component calleda mediator [5] would typically store aglobal schema that is related to all theschemas of the databases through map-pings. Thus, the mediator provides a cen-tral query interface that allowstransparent access to disparate and het-erogeneous database systems.

PEER DATA MANAGEMENTMaintaining a global schema integratingall data sources is not desirable indecentralized infrastructures such asP2P systems where peers are looselycoupled. Recently, a new breed ofdatabase systems called peer datamanagement systems (PDMSs) [6]emerged as an attempt to decentralizethe mediator architecture and allow thesystems to scale gracefully with thenumber of heterogeneous sources.

Instead of requiring the definition ofa global schema, PDMSs consider loose-ly structured networks of mappingsbetween pairs of schemas to iterativelydisseminate a query from one databaseto all the other related databases. A sim-ple example of a PDMS network is illus-trated in Figure 2, where four peerdatabases p1 . . . p4 are connectedthrough five schema mappings. Themappings, which are usually expressedas queries, can be used to reformulate aquery iteratively to disseminate thequery throughout the network. Once adatabase sends a query to its immediateneighbors through local mapping links,its neighbors (after processing thequery) in turn propagate the query totheir own neighbors, and so on and soforth until the query reaches all (or apredefined number of) databases. Theway the query spreads around the net-work mimics the way messages are rout-ed in an unstructured P2P system suchas that shown in Figure 1(a).

For a new peer database wishing tojoin an existing PDMS, the cost ofentry is minimal as for most P2P sys-tems: the new peer only has to definea few schema mappings between itsschema and the schemas of otherpeers already connected to the sys-tem. The peers can continue to han-dle their data using their ownschemas, but can query all the otherdatabases thanks to iterative queryreformulation through the mappings.In case of an intermediate node fail-ure, peers can reroute their queriesthrough different schema mappingsor create new mappings to circum-vent the offline peer and continue toquery distant sources.

[exploratory DSP] continued

[FIG2] A simple example of a peer data management system; four peer databases p areconnected through five pairwise schema mappings m, used to reformulate queriesoriginating from one database into corresponding queries at the other databases.

m23:/Creator ≡ /Author

m34:/Painting/Painter ≡ /Author[/Art="painting"]

p1 p3

p2

m14: /art/creatDate ≡ /Painting/CreatedOn/art/painter ≡ /Painting/Painter

m24: /Creator ≡ /Painting/CreatedOn

p4

m12: /art/painter ≡ /Creator

[FIG3] A semantic P2P overlay network: in many practical settings, the semanticmediation layer relating the schemas is independent from the organization of the peers,which is itself dissociated from the physical network structure of the machines.

IP Network

subnet"Physical"Network

StructuredOverlayLayer

SemanticMediation

Layer

127.143

127.144

127.145

34.109 35.142 38.143 45.123109.144

112.144117.122125.98

0001

0100

0011 00100101

0101

0110

0111

Schema A Schema C

Schema DSchema H

Schema ZMapping

Update (Key, Value)

Retrieve (Key)

Update (Data)SearchFor (Query)

Update (Schema)Update (Mapping)

IEEE SIGNAL PROCESSING MAGAZINE [132] SEPTEMBER 2007

SEMANTIC P2P OVERLAYNETWORKSIn practice, a single schema can be usedsimultaneously by many independentparties. Furthermore, some peers mightchoose more than one schema to struc-ture their data locally, as they realistical-ly might have to handle very diversepieces of information. Hence, the organi-zation of the schemas and mappings canoften be uncorrelated with the organiza-tion of the peers themselves. A model inwhich physical machines form a P2Poverlay network, which is itself inde-pendent of the semantic schema-to-schema overlay handling dataintegration, is illustrated in Figure 3.

In this figure, the base layer representsan IP network with different machinesidentified by IP addresses. The machinesself-organize into a virtual P2P overlaynetwork (middle layer), with its ownaddressing scheme. The overlay, which istypically structured, only supports verysimple operations (Retrieve, Update) onhash values. The upper layer consists ofthe structured data, schemas and seman-tic mappings relating the schemas. It iscreated using the overlay layer to collabo-ratively share all semantic information,allowing all peers at the overlay layer toquery and update the semantic mediationlayer, and reformulate queries followingdifferent semantic paths.

INCORRECT SEMANTIC MAPPINGSIn semantic P2P overlay networks,consistency or quality of the schemamappings stored in the semantic media-tion layer is one of the main concerns.Incorrect mappings generate incorrectreformulations of the queries, and thusincorrect results. As P2P systems targetlarge-scale, decentralized, and heteroge-neous environments where autonomousparties have full control on the design oftheir own schemas, it is not always possi-ble to create correct mappings betweentwo given schemas. As a result, the cre-ation of correct mappings is oftenprecluded. Therefore, in many cases anapproximate mapping relating two simi-lar but semantically slightly divergentconcepts might be more beneficial thanno mapping at all.

Many different techniques, based onclassification or data mining, can be usedto map schemas automatically and to eval-uate the correctness of the mappings [7].In what follows, we focus on the semanticmediation layer and propose a radicallynew technique that takes advantage of thedistributed nature of the schemas andmappings in the system to detect the map-pings whose semantics diverge from thesemantics of other mappings.

ANALYSIS OF RETURNING QUERIESWhen reformulating a query using seriesof mappings, cycles or redundant pathsof reformulations can occur (for examplegoing from p1 to p2 to p4 and back top1 in the simple semantic mediationlayer depicted in Figure 2). When such acycle is detected, a peer can compare theoriginal query q it received to the refor-mulated query q ′. Two cases can occur:

q ′ ≡ q: this occurs when the query,after having been reformulated ntimes through n mappings, still isequivalent to the original query q.Since this indicates a high level ofsemantic agreement along the cycle,we say that this represents positivefeedback f + on the mappings consti-tuting the cycle.q ′ �= q: this occurs when the query,after having been reformulated n timesthrough n mappings, returns to a peeras a semantically different query. As

this indicates some disagreement onthe semantics of the query along thecycle of mappings, we say that this rep-resents negative feedback f − on themappings constituting the cycle.

ANALYSIS OF THE SEMANTICMEDIATION LAYERSeries of mappings m1, m2 . . . m n canaccidentally compensate their respec-tive errors and actually create a seman-tically correct composite mappingm n ◦ m n−1 . . . ◦ m1 in the end.Assuming a probability � of two or moremapping errors being compensated alonga cycle we can determine the conditionalprobability of a cycle producing positivefeedback f +

� given the correctness of itsconstituting mappings m1, . . . , m n:

P( f +� |m1, . . . , mn) =⎧⎪⎪⎨

⎪⎪⎩

1 if all mappings correct0 if one mapping incorrect� if two or more

mappings incorrect.

Examining again Figure 2 posing aquery q = /art/painter at p1 and send-ing the query to p2, p4 and then back top1 would result in a returning queryq ′ = /art/creatDate. Obviously, thesemantics of the query have changedafter having been reformulated threetimes. The question is which mapping isresponsible for that change. It is difficult

[FIG4] Modeling (a) an undirected network of mappings as (b) a factor graph ; the nodesof the factor graph represent, from top to bottom, probability functions encapsulating apriori information on the schema mappings, variables for the semantic correctness ofschema mappings, probability functions relating schema mapping variables to feedbackvariables, and finally variables for the feedback that can be observed in the network.

p1

p2

p3

p4

m12 m23

m43m14

m24

m12m12 m23 m43m43 m14 m24m24

f1

f( )

f2 f3

(a) (b)

IEEE SIGNAL PROCESSING MAGAZINE [133] SEPTEMBER 2007

to give an answer for a single cycle.Instead, one would need to consider sev-eral cycles and their correlation throughtheir shared mappings.

Feedback can be obtained from threedifferent mapping cycles in the networkdepicted in Figure 2:

f 1� : m12 − m23 − m34 − m41

f 2� : m12 − m24 − m41

f 3� : m23 − m34 − m24.

One might construct a global proba-bilistic graphical model linking themappings and the feedback informationfrom the network. Figure 4 depicts afactor graph [8] for the small networkof Figure 2, containing (from top tobottom): five one-variable factors forthe prior probability functions on themappings, five mappings variables mij,three factors f( ) linking feedback vari-ables to mapping variables throughconditional probability functions(defined earlier), and finally three feed-back variables fk. Note that feedbackvariables are usually not independent(e.g., in Figure 4, all three feedbacksare correlated).

DISCOVERY OF ERRONEOUSMAPPINGSPosterior values on the semantic correct-ness of the mappings given the feedbackobserved in the network can be comput-ed using standard sum-product messagepassing techniques [8]. Such computa-tions can be performed globally, forexample by deriving a junction tree fromthe factor graph, or can be approximatedlocally by the peers using iterative mes-sage passing schemes. Given a local fac-tor graph containing only the mappingsit is responsible for, a peer can locallyupdate the believes on its mappings byreformulating the sum-product algo-rithm. Messages M from mapping vari-ables m to factors f( ) can be sentrepeatedly, both locally to the local factorf( ), and remotely to peers responsiblefor the other mappings connected to f( ).Using f( ) as a function returning theneighboring nodes of a given node in thefactor graph and

∑∼{xi} to denote a

summary over all but one variable xi :

local message from local mapping mi

to factor fj ( ) ∈ n(mi):

Mmi→ f j ( )(mi) =∏

f ( )∈n (mi)\{ fj ( )}× Mf ( )→mi

(mi)

remote message from local mappingmi to factor fk ( ) from peer p0 to peer pj

responsible for a mapping m ∈ n( fk( )):

Mp0→ fk ( )(mi) =∏

f ( )∈n (mi)\{ fk ( )}× Mf ( )→mi

(mi).

Messages for the mapping variables canthen be computed by combining bothlocal messages and remote messagesreceived from distant peers:

local message from factor fj ( ) tomapping variable mi :

Mfj ( )→mi(mi) =

∑∼{mi}

⎛⎝ fj ( )

∏pk∈n( fj ( ))

Mpk → fj ( )(pk)

×∏

ml∈n( fj ( ))\{mi}Mml → fj ( )(ml)

⎞⎠

posterior semantic correctness oflocal mapping mi :

P(mi|f ) = α

⎛⎝ ∏

f( )∈n(mi)

Mf( )→mi(mi)

⎞⎠

where α is a normalizing constant ensur-ing that the probabilities of all eventssum to one (i.e., making sure thatP(mi = correct ) + P(mi = incorrect )

= 1). Examining again the simplesemantic in Figure 2 and taking intoaccount the feedback given by the threecycles appearing in the network, thisanalysis converges after a few iterations.Without any a priori information on themappings, it successfully detects theinconsistencies related to the mappingbetween p2 and p4 (P(m24 = correct )

= 0.3) while assessing the other map-pings as correct (Pcorrect ) = 0.6).

Note that the allocation of responsi-bilities between the peers at the overlaylayer and the semantic information atthe mediation layer might be quite dif-ferent in the general case of a three-layered system such as the one depictedin Figure 3. The calculations pertainingto some variables or factors might beredundant (as above, where each feed-back variable is taken into considerationby several peers). Those computationscan be embedded in the normal behaviorof the network by piggybacking on thequery reformulation traffic [9]. Severallarge-scale experiments were conductedin order to determine the performanceof this technique [10]. The results arevery promising for a totally automatedtechnique, as the precision of the deci-sion process in detecting correct orincorrect mappings ranges 80–100% formost realistic settings.

CONCLUSIONSWe presented a method that exploits aprobabilistic model to assess the degreeof semantic correctness of schema map-pings in a semantic P2P overlay network.The method takes advantage of transitiveclosures of mapping operations to com-pare reformulated queries at the seman-tic mediation layer. Probabilistic valueson the correctness of the mappings arethen computed by iteratively passingmessages between the peers.

Information integration techniqueswere traditionally centered around globalschemas, perfect schema mappings, andstatic query rewritings. Those techniquesare today inappropriate to maximize theperformance of large-scale informationsystems that operate without any form ofcentral coordination, such as semanticP2P overlay networks. Data noise—emerging from uncertainty on distantdata or from a lack of coordinationamong the sources—plays an everincreasing role in structured data man-agement. Correctly handling that newtype of noise is an important challenge,that might well be tackled by techniquesrelating to signal processing and infor-mation theory, as we tried to outlinewith the message passing scheme pre-sented above.

[exploratory DSP] continued

IEEE SIGNAL PROCESSING MAGAZINE [134] SEPTEMBER 2007

IEEE SIGNAL PROCESSING MAGAZINE [135] SEPTEMBER 2007

ACKNOWLEDGMENTSThis work was supported by the SwissNSF National Competence Center inResearch on Mobile Information andCommunication Systems (grant 5005-67322) and by the EPFL Center for GlobalComputing as part of the European proj-ect NEPOMUK No FP6-027705.

AUTHORSPhilippe Cudré-Mauroux ([email protected]) is a senior researcher atEcole Polytechnique Fédérale de Lausanne(EPFL) in Switzerland. His researchinterests are in information managementin large-scale settings with an emphasis onsemantic overlay networks and peer datamanagement. He is a member of the IFIPWorking Group 2.6 on Databases.

Karl Aberer ([email protected]) is aprofessor at EPFL and the director of theSwiss National Research Center forMobile Information and CommunicationSystems. His research interests consistof the foundation, algorithms and infra-structures for distributed informationmanagement including peer-to-peeroverlay networks, semantic interoper-ability, trust management and applica-tions to scientific and sensor datamanagement.

REFERENCES[1] S. Guha, N. Daswani, and R. Jain, “An experimen-tal study of the Skype peer-to-peer VoIP system.”Available: http://saikat.guha.cc/pub/iptps06-skype/

[2] M. Ripeanu, “Peer-to-peer architecture casestudy: Gnutella network,” in Proc. Int. Conf. Peer-to-Peer Comput., 2001.

[3] S. Ratnasamy, P. Francis, M. Handley, R. Karp,and S. Shenker, “A scalable content-addressable net-work,” in Proc. ACM SIGCOMM, 2001.

[4] K. Aberer (ed.), “Special issue on peer to peerdata management,” in Proc. ACM SIGMOD Record,vol. 32, no. 3, 2003.

[5] G. Wiederhold, “Mediators in the architec-ture of future information systems,” IEEEComputer, vol. 25, no. 3, 1992.

[6] A.Y. Halevy, Z.G. Ives, D. Suciu, and I. Tatarinov,“Schema mediation in peer data management sys-tems,” in Proc. Int. Conf. Data Eng., 2003.

[7] E. Rahm and P.A. Bernstein, “A survey ofapproaches to automatic schema matching,” Int.J. Very Large Data Bases, vol. 10, no. 4, 2001.

[8] F.R. Kschischang, B.J. Frey, and H.-A. Loeliger,“Factor graphs and the sum-product algorithm,”IEEE Trans. Inform. Theory, vol. 47, no. 2, 2001.

[9] P. Cudré-Mauroux, K. Aberer, and A. Feher,“Probabilistic message passing in peer data man-agement systems,” in Proc. Int. Conf. Data Eng.,2006.

[10] P. Cudré-Mauroux, “Emergent Semantics:Rethinking Interoperability for Large-ScaleDecentralized Information Systems,” Ph.D. disserta-tion, 3690, EPFL, Lausanne, Switzerland, 2006. [SP]

RANK IN IEEE TOP 100 N TIMES (SEPTEMBER, 2006– IN TOP

FEBRUARY 2007) 100 SINCEABSTRACT FEB JAN DEC NOV OCT SEP JAN 2006

LOSSLESS WATERMARKING FOR This article presents a novel 97 2IMAGE AUTHENTICATION framework for lossless Celik, M.U.; Sharma, G.; Tekalp, A.M. watermarking, which enables IEEE Transactions on Image zero-distortion reconstruction of Processing, vol. 15, no. 4, Apr. 2006, the un-watermarked images pp. 1042–1049 upon verification.

IEEE SPS CONFERENCES

FULLY UTILIZED AND REUSABLE This conference article describes a 35 1ARCHITECTURE FOR FRACTIONAL new VLSI architecture for fractional MOTION ESTIMATION OF H.264/AVC motion estimation in the H.264/AVC Chen, T.C.; Huang, Y.W.; Chen, L.G. video compression standard, which IEEE International Conference on can also be used with other Acoustics, Speech, and Signal standards. Processing, vol. 5, May 2004, pp. 9–12

[SP]

[reader’s CHOICE] continued from page 10

TITLE, AUTHOR, PUBLICATION YEARIEEE SPS JOURNALS