[exploratory DSP] Philippe Cudré-Mauroux and Karl Aberer
Message Passing in Semantic Peer-to-Peer Overlay Networks
Peer-to-peer (P2P) systems rely on machine-to-machine ad hoc communications to offer services to a community. Contrary to the classical client-server architecture, P2P systems consider all peers, i.e., all nodes participating in the network, as being equal. Hence, peers can at the same time act as clients consuming resources from the system, and as servers providing resources to the community. P2P applications function on top of existing routing infrastructures, typically on top of the IP network, and organize peers into logical and decentralized structures called overlay networks.
In this column, we discuss exploratory research related to data management in P2P overlay networks. First, we discuss the notions of unstructured and structured P2P overlay networks. Then, we discuss data management in such networks by introducing an additional layer to handle semantic heterogeneity and data integration. Finally, we present a method based on sum-product message passing to detect inconsistent information in this setting.
P2P OVERLAY NETWORKS: ARCHITECTURE
Increasingly on the Internet, applications are supported by sets of loosely connected machines operating without any form of central coordination; Internet telephony networks such as Skype [1] and file sharing applications like Gnutella [2] are two well-known examples of this trend. Contrary to the client-server setting, where applications are bound to sets of static servers identified by an IP address, these applications need ways of organizing the dynamic sets of machines providing the service. P2P overlay networks address this need and allow the management of virtual and decentralized networks created on top of the IP infrastructure.
The virtual structure connecting all the peers operating in an overlay network can vary. In unstructured overlay networks such as Gnutella, peers establish connections to a fixed number of other peers, creating a random graph of P2P connections. Requests originating from one peer are forwarded by the other peers in a cooperative manner, as depicted in Figure 1(a). This relatively simple and robust mechanism is, however, network-intensive, as it broadcasts all queries to all peers within a certain radius irrespective of the content of the query.
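The flooding behavior described above can be illustrated with a minimal sketch. It is not Gnutella's actual protocol: the overlay size, the degree, the hop limit (`ttl`), and the function names are all assumptions chosen for illustration, and links are made symmetric, so some peers end up with more than `degree` neighbors.

```python
import random
from collections import deque

def build_random_overlay(n_peers, degree, seed=0):
    """Each peer opens `degree` connections to randomly chosen peers (links are symmetric)."""
    rng = random.Random(seed)
    neighbors = {p: set() for p in range(n_peers)}
    for p in range(n_peers):
        others = [x for x in range(n_peers) if x != p]
        for q in rng.sample(others, degree):
            neighbors[p].add(q)
            neighbors[q].add(p)
    return neighbors

def flood(neighbors, origin, ttl):
    """Forward a query hop by hop; return (peers reached, messages sent)."""
    reached = {origin}
    messages = 0
    frontier = deque([(origin, ttl)])
    while frontier:
        peer, hops_left = frontier.popleft()
        if hops_left == 0:
            continue
        for nxt in neighbors[peer]:
            messages += 1  # every forward costs a message, even to a peer already reached
            if nxt not in reached:
                reached.add(nxt)
                frontier.append((nxt, hops_left - 1))
    return reached, messages

overlay = build_random_overlay(n_peers=200, degree=3)
reached, messages = flood(overlay, origin=0, ttl=3)
print(len(reached), "peers reached with", messages, "messages")
```

The message count grows with the number of links inside the radius, not with the number of peers that actually hold relevant data, which is the cost the paragraph above points out.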
Structured overlay networks were introduced to alleviate network traffic while maximizing the probability of a query locating a specific peer. Peers in a structured overlay can, for example, be organized on a multidimensional torus [3] or into a virtual binary search tree, as done in the P-Grid P2P system [4] and illustrated in Figure 1(b). Such systems provide hash-table functionalities on an Internet-like scale and are known as distributed hash tables (DHTs). They typically enable global search on shared data items in a totally decentralized way in O(log N) messages (i.e., packets sent from one peer to another), where N is the number of peers in the overlay.
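To see where the logarithmic message count comes from, here is a simplified sketch of prefix routing over a virtual binary tree. It is only loosely inspired by P-Grid and is not its actual algorithm: it assumes a perfectly balanced tree with one peer per leaf, and `build_peers` and `route` are hypothetical names.

```python
import random

def build_peers(depth, seed=0):
    """Create 2**depth peers, one per leaf of a binary tree, each with one reference per level."""
    rng = random.Random(seed)
    paths = [format(i, "0{}b".format(depth)) for i in range(2 ** depth)]
    refs = {}
    for path in paths:
        table = []
        for level in range(depth):
            # Complementary subtree at this level: same first `level` bits, next bit flipped.
            prefix = path[:level] + ("1" if path[level] == "0" else "0")
            candidates = [p for p in paths if p.startswith(prefix)]
            table.append(rng.choice(candidates))
        refs[path] = table
    return refs

def route(refs, start, key):
    """Forward a lookup for `key` until the responsible peer is reached; return the hop list."""
    current, hops = start, [start]
    while current != key:
        level = next(i for i in range(len(key)) if current[i] != key[i])
        current = refs[current][level]  # next peer agrees with key on at least one more bit
        hops.append(current)
    return hops

refs = build_peers(depth=6)            # N = 64 peers
hops = route(refs, "000000", "101101")
print(len(hops) - 1, "messages:", " -> ".join(hops))
```

Each hop resolves at least one more leading bit of the key, so a lookup needs at most `depth` = log2(N) messages in this idealized setting.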
P2P OVERLAY NETWORKS: DATA MANAGEMENT
SEMANTIC HETEROGENEITY
P2P overlays originally dealt with very simple data and query models: only file names were shared and queries were composed of a single hash value or a keyword. Several research efforts [4] soon tried to enrich overlay networks with more expressive models to support structured data conforming to schemas. In data management, schemas are used as declarative models to define the structure of the data. Relational schemas, for example, organize data in n-ary relations (i.e., tables), while XML schemas structure data through hierarchies of elements.

[FIG1] Two P2P architectures: (a) an unstructured P2P overlay à la Gnutella: a query originating from a peer on the left-hand side of the figure is iteratively gossiped up to three times, and (b) a structured distributed hash table à la P-Grid: peers are organized into a virtual binary tree. Panel titles: (a) An Unstructured Overlay, with links labeled by hop count (1 to 3); (b) A Structured Distributed Hash Table, with tree nodes labeled 0, 1 and 00, 01, 10, 11.

1053-5888/07/$25.00©2007IEEE
Digital Object Identifier 10.1109/MSP.2007.904799
IEEE SIGNAL PROCESSING MAGAZINE [131] SEPTEMBER 2007
Sharing (semi)structured data, such as relational tuples or XML instances, is however intricate in decentralized and P2P networks. The problem of semantic heterogeneity quickly surfaces, as distinct sources might encode similar information using different schemas to structure their data.
To tackle the problem of semantic heterogeneity and relate similar pieces of information encoded using different schemas, database systems take advantage of schema mappings. Mappings provide a bridge between the definitions of two schemas. They allow a structured query, such as an SQL query, posed against one schema to be reformulated into a new query posed against a similar schema. A centralized component called a mediator [5] would typically store a global schema that is related to all the schemas of the databases through mappings. Thus, the mediator provides a central query interface that allows transparent access to disparate and heterogeneous database systems.
PEER DATA MANAGEMENT
Maintaining a global schema integrating all data sources is not desirable in decentralized infrastructures such as P2P systems where peers are loosely coupled. Recently, a new breed of database systems called peer data management systems (PDMSs) [6] emerged as an attempt to decentralize the mediator architecture and allow the systems to scale gracefully with the number of heterogeneous sources.
Instead of requiring the definition of a global schema, PDMSs consider loosely structured networks of mappings between pairs of schemas to iteratively disseminate a query from one database to all the other related databases. A simple example of a PDMS network is illustrated in Figure 2, where four peer databases p1, ..., p4 are connected through five schema mappings. The mappings, which are usually expressed as queries, can be used to reformulate a query step by step and thus disseminate it throughout the network. Once a database sends a query to its immediate neighbors through local mapping links, its neighbors (after processing the query) in turn propagate the query to their own neighbors, and so on until the query reaches all (or a predefined number of) databases. The way the query spreads around the network mimics the way messages are routed in an unstructured P2P system such as that shown in Figure 1(a).
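The dissemination process can be sketched as follows. This is an illustrative toy, not a PDMS implementation: mappings are reduced to attribute-to-attribute correspondences transcribed from Figure 2, the selection predicate of m34 is dropped, and `reformulate` and `disseminate` are hypothetical helper names.

```python
from collections import deque

# Attribute correspondences transcribed from Figure 2; each mapping links two peers.
# Simplification: the selection predicate in m34 (/Author[/Art="painting"]) is dropped.
MAPPINGS = {
    ("p1", "p2"): [("/art/painter", "/Creator")],
    ("p2", "p3"): [("/Creator", "/Author")],
    ("p3", "p4"): [("/Author", "/Painting/Painter")],
    ("p1", "p4"): [("/art/creatDate", "/Painting/CreatedOn"),
                   ("/art/painter", "/Painting/Painter")],
    ("p2", "p4"): [("/Creator", "/Painting/CreatedOn")],
}

def reformulate(query, src, dst):
    """Translate `query` from src's schema to dst's schema, or return None if unmapped."""
    for (a, b), pairs in MAPPINGS.items():
        for left, right in pairs:
            if (a, b) == (src, dst) and query == left:
                return right
            if (a, b) == (dst, src) and query == right:
                return left
    return None

def disseminate(query, origin):
    """Gossip a query through the mapping network; keep the first reformulation each peer sees."""
    received = {origin: query}
    frontier = deque([origin])
    while frontier:
        peer = frontier.popleft()
        for a, b in MAPPINGS:
            if peer not in (a, b):
                continue
            neighbor = b if peer == a else a
            if neighbor in received:
                continue
            translated = reformulate(received[peer], peer, neighbor)
            if translated is not None:
                received[neighbor] = translated
                frontier.append(neighbor)
    return received

print(disseminate("/art/painter", "p1"))
```

Starting from p1, the query reaches p2 as /Creator, p4 as /Painting/Painter, and p3 as /Author, without any peer knowing a global schema.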
For a new peer database wishing to join an existing PDMS, the cost of entry is minimal, as for most P2P systems: the new peer only has to define a few schema mappings between its schema and the schemas of other peers already connected to the system. The peers can continue to handle their data using their own schemas, but can query all the other databases thanks to iterative query reformulation through the mappings. If an intermediate node fails, peers can reroute their queries through different schema mappings or create new mappings to circumvent the offline peer and continue to query distant sources.
[FIG2] A simple example of a peer data management system; four peer databases p are connected through five pairwise schema mappings m, used to reformulate queries originating from one database into corresponding queries at the other databases. The mappings shown in the figure are:
m12 (p1, p2): /art/painter ≡ /Creator
m23 (p2, p3): /Creator ≡ /Author
m34 (p3, p4): /Painting/Painter ≡ /Author[/Art="painting"]
m14 (p1, p4): /art/creatDate ≡ /Painting/CreatedOn and /art/painter ≡ /Painting/Painter
m24 (p2, p4): /Creator ≡ /Painting/CreatedOn
[FIG3] A semantic P2P overlay network: in many practical settings, the semantic mediation layer relating the schemas is independent from the organization of the peers, which is itself dissociated from the physical network structure of the machines. The figure stacks three layers: the "physical" IP network (machines identified by IP addresses), the structured overlay layer (peers identified by binary keys such as 0001 and 0100, with the operations Update(Key, Value) and Retrieve(Key)), and the semantic mediation layer (Schemas A, C, D, H, and Z linked by mappings, with the operations Update(Data), SearchFor(Query), Update(Schema), and Update(Mapping)).
SEMANTIC P2P OVERLAY NETWORKS
In practice, a single schema can be used simultaneously by many independent parties. Furthermore, some peers might choose more than one schema to structure their data locally, as they realistically might have to handle very diverse pieces of information. Hence, the organization of the schemas and mappings can often be uncorrelated with the organization of the peers themselves. A model in which physical machines form a P2P overlay network, which is itself independent of the semantic schema-to-schema overlay handling data integration, is illustrated in Figure 3.
In this figure, the base layer represents an IP network with different machines identified by IP addresses. The machines self-organize into a virtual P2P overlay network (middle layer), with its own addressing scheme. The overlay, which is typically structured, only supports very simple operations (Retrieve, Update) on hash values. The upper layer consists of the structured data, schemas, and semantic mappings relating the schemas. It is created using the overlay layer to collaboratively share all semantic information, allowing all peers at the overlay layer to query and update the semantic mediation layer, and reformulate queries following different semantic paths.
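The layering can be made concrete with a small sketch in which the semantic mediation layer is built using nothing but the overlay's two primitives. Everything here is assumed for illustration: the `Overlay` class is an in-memory stand-in for a real distributed hash table, and the choice of indexing a mapping under the hash of its source schema name is one plausible design, not the one prescribed by the column.

```python
import hashlib

class Overlay:
    """Stand-in for the structured overlay: a key-value store exposing Update and Retrieve."""
    def __init__(self):
        self._store = {}  # in a real DHT this dictionary is partitioned across peers

    def update(self, key, value):
        self._store.setdefault(key, []).append(value)

    def retrieve(self, key):
        return list(self._store.get(key, []))

def overlay_key(name):
    """Hash an application-level name (here, a schema name) into the overlay key space."""
    return hashlib.sha1(name.encode("utf-8")).hexdigest()

class SemanticLayer:
    """Mediation layer that relies only on the overlay's two primitives."""
    def __init__(self, overlay):
        self.overlay = overlay

    def update_mapping(self, src_schema, dst_schema, correspondences):
        # Index the mapping under its source schema so any peer can find outgoing mappings.
        self.overlay.update(overlay_key(src_schema), (dst_schema, correspondences))

    def mappings_from(self, schema):
        return self.overlay.retrieve(overlay_key(schema))

layer = SemanticLayer(Overlay())
layer.update_mapping("Schema A", "Schema C", {"/art/painter": "/Creator"})
print(layer.mappings_from("Schema A"))
```

Because the semantic layer only sees keys and values, which machine ends up storing a given mapping is decided by the overlay, independently of who authored the schemas.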
INCORRECT SEMANTIC MAPPINGS
In semantic P2P overlay networks, the consistency or quality of the schema mappings stored in the semantic mediation layer is one of the main concerns. Incorrect mappings generate incorrect reformulations of the queries, and thus incorrect results. As P2P systems target large-scale, decentralized, and heterogeneous environments where autonomous parties have full control over the design of their own schemas, it is not always possible to create correct mappings between two given schemas. In many cases, an approximate mapping relating two similar but semantically slightly divergent concepts might therefore be more beneficial than no mapping at all.
Many different techniques, based on classification or data mining, can be used to map schemas automatically and to evaluate the correctness of the mappings [7]. In what follows, we focus on the semantic mediation layer and propose a radically new technique that takes advantage of the distributed nature of the schemas and mappings in the system to detect the mappings whose semantics diverge from the semantics of other mappings.
ANALYSIS OF RETURNING QUERIES
When reformulating a query using series of mappings, cycles or redundant paths of reformulations can occur (for example going from p1 to p2 to p4 and back to p1 in the simple semantic mediation layer depicted in Figure 2). When such a cycle is detected, a peer can compare the original query $q$ it received to the reformulated query $q'$. Two cases can occur:
$q' \equiv q$: this occurs when the query, after having been reformulated n times through n mappings, is still equivalent to the original query $q$. Since this indicates a high level of semantic agreement along the cycle, we say that this represents positive feedback $f^{+}$ on the mappings constituting the cycle.
$q' \neq q$: this occurs when the query, after having been reformulated n times through n mappings, returns to a peer as a semantically different query. As this indicates some disagreement on the semantics of the query along the cycle of mappings, we say that this represents negative feedback $f^{-}$ on the mappings constituting the cycle.
ANALYSIS OF THE SEMANTIC MEDIATION LAYER
Series of mappings $m_1, m_2, \ldots, m_n$ can accidentally compensate their respective errors and actually create a semantically correct composite mapping $m_n \circ m_{n-1} \circ \cdots \circ m_1$ in the end. Assuming a probability $\Delta$ of two or more mapping errors being compensated along a cycle, we can determine the conditional probability of a cycle producing positive feedback $f^{+}_{\circlearrowleft}$ given the correctness of its constituting mappings $m_1, \ldots, m_n$:

$$P(f^{+}_{\circlearrowleft} \mid m_1, \ldots, m_n) = \begin{cases} 1 & \text{if all mappings correct} \\ 0 & \text{if one mapping incorrect} \\ \Delta & \text{if two or more mappings incorrect.} \end{cases}$$
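As a small worked illustration of this conditional probability, assume (this is not stated in the column) that the $n$ mappings of a cycle are independently correct with prior probability $1/2$, and write $\Delta$ for the compensation probability. Of the $2^{n}$ equally likely correct/incorrect assignments, one has no incorrect mapping, $n$ have exactly one, and the remaining $2^{n} - n - 1$ have two or more, so the marginal probability of observing positive feedback is

$$P(f^{+}) = \frac{1}{2^{n}} \Big[ 1 \cdot 1 + n \cdot 0 + (2^{n} - n - 1)\,\Delta \Big].$$

For a cycle of $n = 3$ mappings and an assumed $\Delta = 0.1$, this gives $P(f^{+}) = (1 + 4 \cdot 0.1)/8 = 0.175$: under these assumptions, positive feedback on a cycle is fairly rare and therefore informative.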
Examining Figure 2 again, posing a query q = /art/painter at p1 and sending it to p2, then to p4, and then back to p1 results in a returning query q′ = /art/creatDate. Obviously, the semantics of the query have changed after three reformulations. The question is which mapping is responsible for that change. It is difficult to give an answer from a single cycle. Instead, one needs to consider several cycles and their correlation through their shared mappings.

[FIG4] Modeling (a) an undirected network of mappings as (b) a factor graph; the nodes of the factor graph represent, from top to bottom, probability functions encapsulating a priori information on the schema mappings, variables for the semantic correctness of schema mappings, probability functions relating schema mapping variables to feedback variables, and finally variables for the feedback that can be observed in the network. Panel (a) shows peers p1 to p4 linked by mappings m12, m23, m43, m14, and m24; panel (b) shows the corresponding mapping variables connected through factors f( ) to the feedback variables f1, f2, and f3.
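The cycle p1, p2, p4, p1 discussed above can be traced with a few lines of code. This is a sketch under simplifying assumptions: each mapping is encoded as a plain attribute dictionary in the direction it is traversed, query equivalence is reduced to string equality, and `cycle_feedback` is a hypothetical helper.

```python
# Hypothetical encoding of three Figure 2 mappings as attribute dictionaries, keyed by direction.
M12 = {"/art/painter": "/Creator"}                 # p1 -> p2
M24 = {"/Creator": "/Painting/CreatedOn"}          # p2 -> p4
M41 = {"/Painting/CreatedOn": "/art/creatDate",    # p4 -> p1 (m14 read backwards)
       "/Painting/Painter": "/art/painter"}

def cycle_feedback(query, cycle):
    """Reformulate `query` through each mapping of the cycle and compare with the original."""
    current = query
    for mapping in cycle:
        current = mapping.get(current)
        if current is None:
            return None, None  # the query cannot be reformulated further: no feedback
    return current, "+" if current == query else "-"

returned, feedback = cycle_feedback("/art/painter", [M12, M24, M41])
print(returned, feedback)  # /art/creatDate -
```

The negative feedback says that at least one of m12, m24, and m41 is suspect, but not which one, which is why several overlapping cycles have to be combined.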
Feedback can be obtained from three different mapping cycles in the network depicted in Figure 2:

$f^{1}_{\circlearrowleft}$: m12 − m23 − m34 − m41
$f^{2}_{\circlearrowleft}$: m12 − m24 − m41
$f^{3}_{\circlearrowleft}$: m23 − m34 − m24.
One might construct a global probabilistic graphical model linking the mappings and the feedback information from the network. Figure 4 depicts a factor graph [8] for the small network of Figure 2, containing (from top to bottom): five one-variable factors for the prior probability functions on the mappings, five mapping variables mij, three factors f( ) linking feedback variables to mapping variables through conditional probability functions (defined earlier), and finally three feedback variables fk. Note that feedback variables are usually not independent (e.g., in Figure 4, all three feedback variables are correlated).
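For a network this small, the joint model encoded by the factor graph can be evaluated exhaustively. The sketch below does so under assumptions that are not given in the column: a compensation probability of 0.1, uniform priors of 0.5, and observed feedback that is positive on the four-mapping cycle and negative on the two cycles through m24.

```python
from itertools import product

DELTA = 0.1          # assumed probability that two or more errors cancel out along a cycle
PRIOR_CORRECT = 0.5  # assumed uninformative prior on every mapping

MAPPINGS = ["m12", "m23", "m34", "m41", "m24"]
# Cycles of Figure 2 with the feedback assumed to be observed on each (True = positive).
CYCLES = [
    (("m12", "m23", "m34", "m41"), True),
    (("m12", "m24", "m41"), False),
    (("m23", "m34", "m24"), False),
]

def p_positive(n_incorrect):
    """P(f+ | mappings): 1 if none incorrect, 0 if exactly one, DELTA if two or more."""
    if n_incorrect == 0:
        return 1.0
    if n_incorrect == 1:
        return 0.0
    return DELTA

def joint_weight(assignment):
    """Unnormalized joint probability of one assignment together with the observed feedback."""
    weight = 1.0
    for name in MAPPINGS:
        weight *= PRIOR_CORRECT if assignment[name] else 1.0 - PRIOR_CORRECT
    for members, positive in CYCLES:
        p_pos = p_positive(sum(not assignment[m] for m in members))
        weight *= p_pos if positive else 1.0 - p_pos
    return weight

def posteriors():
    """Marginalize the joint over all 2**5 correct/incorrect assignments."""
    total = 0.0
    correct_mass = {name: 0.0 for name in MAPPINGS}
    for values in product([True, False], repeat=len(MAPPINGS)):
        assignment = dict(zip(MAPPINGS, values))
        w = joint_weight(assignment)
        total += w
        for name in MAPPINGS:
            if assignment[name]:
                correct_mass[name] += w
    return {name: correct_mass[name] / total for name in MAPPINGS}

for name, p in posteriors().items():
    print(name, round(p, 3))
```

Under these assumptions the exact posterior of correctness is about 0.31 for m24 and about 0.59 for each of the other four mappings. Exhaustive enumeration grows as 2 to the number of mappings, which is why message passing is used instead on larger networks.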
DISCOVERY OF ERRONEOUS MAPPINGS
Posterior values on the semantic correctness of the mappings given the feedback observed in the network can be computed using standard sum-product message passing techniques [8]. Such computations can be performed globally, for example by deriving a junction tree from the factor graph, or can be approximated locally by the peers using iterative message passing schemes. Given a local factor graph containing only the mappings it is responsible for, a peer can locally update the beliefs on its mappings by reformulating the sum-product algorithm. Messages M from mapping variables m to factors f( ) can be sent repeatedly, both locally to the local factor f( ), and remotely to peers responsible for the other mappings connected to f( ). Using n( ) as a function returning the neighboring nodes of a given node in the factor graph and $\sum_{\sim\{x_i\}}$ to denote a summary for $x_i$, i.e., a sum over all variables except $x_i$, the messages are:
local message from local mapping $m_i$ to factor $f_j(\,) \in n(m_i)$:

$$M_{m_i \to f_j(\,)}(m_i) = \prod_{f(\,) \in n(m_i) \setminus \{f_j(\,)\}} M_{f(\,) \to m_i}(m_i)$$

remote message from local mapping $m_i$ to factor $f_k(\,)$, sent from peer $p_0$ to a peer $p_j$ responsible for a mapping $m \in n(f_k(\,))$:

$$M_{p_0 \to f_k(\,)}(m_i) = \prod_{f(\,) \in n(m_i) \setminus \{f_k(\,)\}} M_{f(\,) \to m_i}(m_i).$$
Messages for the mapping variables can then be computed by combining both local messages and remote messages received from distant peers:

local message from factor $f_j(\,)$ to mapping variable $m_i$:

$$M_{f_j(\,) \to m_i}(m_i) = \sum_{\sim\{m_i\}} \left( f_j(\,) \prod_{p_k \in n(f_j(\,))} M_{p_k \to f_j(\,)}(p_k) \times \prod_{m_l \in n(f_j(\,)) \setminus \{m_i\}} M_{m_l \to f_j(\,)}(m_l) \right)$$

posterior semantic correctness of local mapping $m_i$:

$$P(m_i \mid f) = \alpha \left( \prod_{f(\,) \in n(m_i)} M_{f(\,) \to m_i}(m_i) \right)$$
where α is a normalizing constant ensuring that the probabilities of all events sum to one (i.e., making sure that P(mi = correct) + P(mi = incorrect) = 1). Examining again the simple semantic mediation layer in Figure 2 and taking into account the feedback given by the three cycles appearing in the network, this analysis converges after a few iterations. Without any a priori information on the mappings, it successfully detects the inconsistencies related to the mapping between p2 and p4 (P(m24 = correct) = 0.3) while assessing the other mappings as correct (P(correct) = 0.6).
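The iterative scheme can be sketched in a single process as follows. This is an illustration of plain sum-product on the factor graph of Figure 4, not the authors' distributed implementation: local and remote messages are not distinguished, all messages are updated in lockstep, and the compensation probability (0.1), the uniform priors, and the feedback signs (positive on the four-mapping cycle, negative on the two cycles through m24) are assumptions.

```python
from itertools import product

DELTA = 0.1  # assumed compensation probability
# Factor graph of Figure 4; feedback signs are assumptions (True = positive feedback observed).
FACTORS = {
    "f1": (("m12", "m23", "m34", "m41"), True),
    "f2": (("m12", "m24", "m41"), False),
    "f3": (("m23", "m34", "m24"), False),
}
VARIABLES = ["m12", "m23", "m34", "m41", "m24"]
PRIOR = (0.5, 0.5)  # (P(incorrect), P(correct)); uniform, i.e., no a priori information

def factor_value(positive, states):
    """Probability of the observed feedback; states use 1 = correct, 0 = incorrect."""
    n_incorrect = states.count(0)
    p_pos = 1.0 if n_incorrect == 0 else 0.0 if n_incorrect == 1 else DELTA
    return p_pos if positive else 1.0 - p_pos

def normalize(msg):
    total = msg[0] + msg[1]
    return (msg[0] / total, msg[1] / total)

def sum_product(iterations=20):
    # Factor-to-variable messages start uniform.
    f2v = {(f, v): (0.5, 0.5) for f, (members, _) in FACTORS.items() for v in members}
    for _ in range(iterations):
        # Variable-to-factor: prior times the messages from all other neighboring factors.
        v2f = {}
        for (f, v) in f2v:
            msg = list(PRIOR)
            for (g, w), incoming in f2v.items():
                if w == v and g != f:
                    msg = [msg[0] * incoming[0], msg[1] * incoming[1]]
            v2f[(v, f)] = normalize(msg)
        # Factor-to-variable: sum the factor over the states of all other variables.
        new_f2v = {}
        for f, (members, positive) in FACTORS.items():
            for v in members:
                out = [0.0, 0.0]
                for states in product((0, 1), repeat=len(members)):
                    weight = factor_value(positive, states)
                    for w, s in zip(members, states):
                        if w != v:
                            weight *= v2f[(w, f)][s]
                    out[states[members.index(v)]] += weight
                new_f2v[(f, v)] = normalize(out)
        f2v = new_f2v
    # Belief: prior times all incoming factor messages, normalized (the constant alpha).
    beliefs = {}
    for v in VARIABLES:
        b = list(PRIOR)
        for (f, w), incoming in f2v.items():
            if w == v:
                b = [b[0] * incoming[0], b[1] * incoming[1]]
        beliefs[v] = normalize(b)[1]
    return beliefs

for name, p in sum_product().items():
    print(name, round(p, 2))
```

Because the factor graph contains loops, the beliefs are approximations of the exact posteriors. In this sketch the belief that m24 is correct falls below one half while the beliefs for the other mappings stay above it, which is the qualitative outcome reported above.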
Note that the allocation of responsibilities between the peers at the overlay layer and the semantic information at the mediation layer might be quite different in the general case of a three-layered system such as the one depicted in Figure 3. The calculations pertaining to some variables or factors might be redundant (as above, where each feedback variable is taken into consideration by several peers). Those computations can be embedded in the normal behavior of the network by piggybacking on the query reformulation traffic [9]. Several large-scale experiments were conducted in order to determine the performance of this technique [10]. The results are very promising for a totally automated technique, as the precision of the decision process in detecting correct or incorrect mappings ranges from 80% to 100% for most realistic settings.
CONCLUSIONS
We presented a method that exploits a probabilistic model to assess the degree of semantic correctness of schema mappings in a semantic P2P overlay network. The method takes advantage of transitive closures of mapping operations to compare reformulated queries at the semantic mediation layer. Probabilistic values on the correctness of the mappings are then computed by iteratively passing messages between the peers.
Information integration techniques were traditionally centered around global schemas, perfect schema mappings, and static query rewritings. Those techniques are today inappropriate to maximize the performance of large-scale information systems that operate without any form of central coordination, such as semantic P2P overlay networks. Data noise, emerging from uncertainty on distant data or from a lack of coordination among the sources, plays an ever increasing role in structured data management. Correctly handling that new type of noise is an important challenge that might well be tackled by techniques relating to signal processing and information theory, as we tried to outline with the message passing scheme presented above.
ACKNOWLEDGMENTS
This work was supported by the Swiss NSF National Competence Center in Research on Mobile Information and Communication Systems (grant 5005-67322) and by the EPFL Center for Global Computing as part of the European project NEPOMUK No FP6-027705.
AUTHORS
Philippe Cudré-Mauroux ([email protected]) is a senior researcher at Ecole Polytechnique Fédérale de Lausanne (EPFL) in Switzerland. His research interests are in information management in large-scale settings with an emphasis on semantic overlay networks and peer data management. He is a member of the IFIP Working Group 2.6 on Databases.
Karl Aberer ([email protected]) is a professor at EPFL and the director of the Swiss National Research Center for Mobile Information and Communication Systems. His research interests consist of the foundation, algorithms and infrastructures for distributed information management including peer-to-peer overlay networks, semantic interoperability, trust management and applications to scientific and sensor data management.
REFERENCES
[1] S. Guha, N. Daswani, and R. Jain, “An experimental study of the Skype peer-to-peer VoIP system.” Available: http://saikat.guha.cc/pub/iptps06-skype/
[2] M. Ripeanu, “Peer-to-peer architecture case study: Gnutella network,” in Proc. Int. Conf. Peer-to-Peer Comput., 2001.
[3] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, “A scalable content-addressable network,” in Proc. ACM SIGCOMM, 2001.
[4] K. Aberer (ed.), “Special issue on peer to peer data management,” in Proc. ACM SIGMOD Record, vol. 32, no. 3, 2003.
[5] G. Wiederhold, “Mediators in the architecture of future information systems,” IEEE Computer, vol. 25, no. 3, 1992.
[6] A.Y. Halevy, Z.G. Ives, D. Suciu, and I. Tatarinov, “Schema mediation in peer data management systems,” in Proc. Int. Conf. Data Eng., 2003.
[7] E. Rahm and P.A. Bernstein, “A survey of approaches to automatic schema matching,” Int. J. Very Large Data Bases, vol. 10, no. 4, 2001.
[8] F.R. Kschischang, B.J. Frey, and H.-A. Loeliger, “Factor graphs and the sum-product algorithm,” IEEE Trans. Inform. Theory, vol. 47, no. 2, 2001.
[9] P. Cudré-Mauroux, K. Aberer, and A. Feher, “Probabilistic message passing in peer data management systems,” in Proc. Int. Conf. Data Eng., 2006.
[10] P. Cudré-Mauroux, “Emergent Semantics: Rethinking Interoperability for Large-Scale Decentralized Information Systems,” Ph.D. dissertation, 3690, EPFL, Lausanne, Switzerland, 2006.