Upload
jada-quinlan
View
215
Download
0
Tags:
Embed Size (px)
Citation preview
1
P2P Data Management, 2006, S. Abiteboul 1/89
Information Management in P2PSerge AbiteboulINRIA-Futurs and Univ. Paris 11
2
Introduction
3
P2P Data Management, 2006, S. Abiteboul 3/89
Success stories at the time of the Internet bubble
Google: management of Web pages
Mapquest: management of maps
Amazone: book catalogue
eBay: product catalogue
Napster (emule, bearshare, etc.): music database
Flickr: picture database
Wikipedia: dictionary
del.icio.us: annotations
In France: Meetic: dating databaseKelkoo: comparative shopping
They are all aboutpublishing some database
4
P2P Data Management, 2006, S. Abiteboul 4/89
The trend is towards peer-to-peerand interactivity
P2P: A large and varying number of computers cooperate to solve some particular task without any centralized authority
Goal: build an efficient, robust, scalable system based (typically) on inexpensive, unreliable computers distributed in a wide area network
seti@home; kazaa; cabal
Switch from centralized servers to communities and syndication
Interaction and Web 2.0
Motivations: Social, organizational
5
P2P Data Management, 2006, S. Abiteboul 5/89
Information management in a P2P network
Private terminology: data ring
Information is heterogeneous, distributed, replicated, dynamic
Which info: Data + meta-data + knowledge + services
Peers are heterogeneous, autonomous and possibly mobile• From sensors to PDA to mainframe
Typically very large number of peers
Variety of requirements: QoS, performance, security, etc.
6
P2P Data Management, 2006, S. Abiteboul 6/89
Acknowledgement
Xyleme: Scalable XML warehousing
Sophie Cluet, Guy Ferran (Xyleme) & many others
ActiveXML: Language for P2P data management
Omar Benjelloun (Google), Ioana Manolescu, Tova Milo (Tel Aviv) & many others
KadoP: P2P scalable XML indexing
Ioana Manolescu, Nicoleta Preda & others
Data Ring: Infrastructure for P2P data management
Alkis Polyzotis (UC Santa Cruz)
7
P2P Data Management, 2006, S. Abiteboul 7/89
Outline
1. Introduction – the data ring
2. Calculus for P2P data management (ActiveXML)
3. Algebra for P2P data management (ActiveXML algebra)
4. Indexing in P2P (KadoP)
5. Conclusion
Goal of the tutorial: present issues and technology on p2p information management
Warning: it is very biased – it is not a survey
8
P2P Data Management, 2006, S. Abiteboul 8/89
Outline
1. Introduction – the data ring
2. Calculus for P2P data management (ActiveXML)
3. Algebra for P2P data management (ActiveXML algebra)
4. Indexing in P2P (KadoP)
5. Conclusion
9
1. Introduction – the data ring
10
P2P Data Management, 2006, S. Abiteboul 10/89
The information in a data ring
Data: tuples, collections, documents, relations…
Services: data sources, possibly some processing
Meta-data about resources: attribute/values pairs, annotations
Ontologies to explain data and metadata
View definitions
Data integration information, e.g., mappings between ontologies
Physical data: Indices and materialized views
11
P2P Data Management, 2006, S. Abiteboul 11/89
Functionalities of the data ring
Storage, persistence, replication
Indexing, caching, querying, updating, optimization
Schema management, access control
Fault tolerance, self tuning, monitoring
Resource discovery, history, provenance, annotations, multi-linguism,
Semantic enrichment, uncertain data
Each functionality may be achieved by a peer or by the network
12
P2P Data Management, 2006, S. Abiteboul 12/89
And now, what is a peer?
A mainframe database
A file system
Web server
A PC
A PDA
A telephone
A sensor
A home appliance
A car
A manufacturing tool
A telecom equipment
A toy
Another data ring
Any connected device or software with some information to share
A net address and some names of resources (e.g. document or service)
13
P2P Data Management, 2006, S. Abiteboul 13/89
Advantages and disadvantages of P2P
Scaling
Performance• Optimization of parallelism• Avoid bottleneck• Replication
Availability • Replication
Cost• Avoid the cost of server• Share operational cost
Dynamicity• add/remove new data sources
Complexity
Performance• Cost for complex queries
• Communication cost
Availability• Peers can leave
Consistency maintenance • Difficult to support transaction
Quality• Difficult to guarantee quality
14
P2P Data Management, 2006, S. Abiteboul 14/89
Crash course on Web standards
Data exchange format = XML • Labeled ordered trees
• Its main asset = XML schema
• There is much more
Distributed computing protocol = Web services• SOAP Simple Object Access Protocols
• WSDL Web Service Definition Language
• UDDI Universal Description, Discovery and Integration
• BPEL Business Process Execution Language
Query languages = XPath and XQuery• Declarative query language for XML + full-text + update language
Knowledge representation = Owl or RDF/S
SOAPWSDL
XML
XqueryXpath
OwlRDFS
15
Information used to live in islands but with the Web, this is changing:
uniform access to information… …the dream for distributed data management
16
P2P Data Management, 2006, S. Abiteboul 16/89
Do you like the standards?
It is the wrong question!
Correct questions: What can you do with it? What is missing?
Is Xquery the ultimate query language for the Web? No
• It is a language for querying centralized XML
• We will see what it is missing
We will not talk much about semantics
17
P2P Data Management, 2006, S. Abiteboul 17/89
Automatic and distributed management of the data ring
No centralized server
No information administrator (no info manager)
Most users are non-experts• E.g., scientists
Requirements• Ease of deployment (zero-effort)
• Ease of administration (zero-effort)
• Ease of publication (epsilon-effort)
• Ease of exploitation (epsilon-effort)
• Participation in community building notably via annotations
Happy database admin
18
P2P Data Management, 2006, S. Abiteboul 18/89
What should be made automatic
Self-statistics from the monitoring of the data ring• In particular, define the statistics that are needed*
Self-tuning based on the self-statistics• Choose the most appropriate organization
• Decide to install access structures: indexes, views, etc.
• Control replication of data and services
Self-healing• Recovery from errors
• E.g., replacement of a failing Web service
And automatic file management
19
P2P Data Management, 2006, S. Abiteboul 19/89
Any hope?
Technology exists (database self-tuning, machine learning, etc.)
But self-tuning for databases has advanced very slowly
Why can this work?
1. There is no alternative (for db, this was just a cool gadget)
2. KISS (keep it simple stupid!)
3. The power of parallelism
This is assuming lots of machine have free cycles (true) and bandwidth is generous (not always true)
20
P2P Data Management, 2006, S. Abiteboul 20/89
Distributed access control
Goal: Control access to ring resources
Access to resources is based on access rights (ACL)
Who is controlling ACLs?
A node manages ACLs for a collection of distributed resources• Easy but against the spirit and possible bottleneck
The network manages access control• Anybody can get the data
• The data is published with encryption and signatures; only nodes with proper access rights can perform reads/writes
• Some techniques exist
21
P2P Data Management, 2006, S. Abiteboul 21/89
Monitoring
What is monitored?
Web service calls and database updates
The Web– Web pages– RSS feed
What is produced?
A stream of events– As a continuous service– As a RSS feed– As a Web site/page
Info-surveillance
Self-statistics and tracing
Basis for error diagnosis
22
P2P Data Management, 2006, S. Abiteboul 22/89
Streams are everywhere
In query processing
In indexing (KadoP)
In recursive queries (AXML-QSQ)
In messaging, monitoring and pub/sub
That is why we will use an algebra over streams of trees and not simply an algebra over trees
23
P2P Data Management, 2006, S. Abiteboul 23/89
Example: Edos distribution system
A system for the management of Linux distribution
Joint work with Mandriva Software and U. Tel Aviv
Community of open-source developers: thousands
System releases: about 10 000 software packages + metadata
Functionalities• Query the metadata
• Query subscription
• Retrieve packages
• Publish a new release or update an existing one
24
P2P Data Management, 2006, S. Abiteboul 24/89
Exemple: WebContent
WebContent: an ANR platform for the management of web content
Web surveillance• Business, technical, web watching…
Participation of Gemo• WP3: knowledge
• WP5: P2P content management
Partners: CEA, EADS, Thales, Bongrain, Xyleme, Exalead, many research groups (UVSQ, Grenoble, Paris 6, etc.)
25
P2P Data Management, 2006, S. Abiteboul 25/89
Taxonomy of such applications
Parameters• Number of peers and quantity of data
• How volatile the peers are
• The query/update workload
• The functionalities that are desired
Edos: peers and documents in thousands, mostly append for updates, peers not too volatile
An extreme: Google search engine in P2P for billions of documents using millions of hyper volatile peers
Mostly interested in the first case
26
P2P Data Management, 2006, S. Abiteboul 26/89
Thesis
XQuery is fine for local XML processing and publishing
Not sufficient for distributed data management
The success of the relational model, i.e., of tables on a server :1. A logic for defining tables
2. An algebra for describing query plans over tables
By analogy, we need for trees in a P2P system1. A logic for defining distributed tree data and data services
2. An algebra for describing query plans over these
Proposal: ActiveXML logic and algebra
27
P2P Data Management, 2006, S. Abiteboul 27/89
Outline
1. Introduction – the data ring
2. Calculus for P2P data management (ActiveXML)
3. Algebra for P2P data management (ActiveXML algebra)
4. Indexing in P2P (KadoP)
5. Conclusion
28
2. Active XML:a logic for distributed data management
29
P2P Data Management, 2006, S. Abiteboul 29/89
The basis
AXML is a declarative language for distributed information management and an infrastructure to support the language in a P2P framework
Simple idea: XML documents with embedded service calls
Intensional data• Some of the data is given explicitly whereas for some, its definition (i.e.
the means to acquire it when needed) is given
Dynamic data• If the data sources change, the same document will provide different
information
30
P2P Data Management, 2006, S. Abiteboul 30/89
Example(omitting syntactic details)
<resorts state=‘Colorado’> <resort> <name> Aspen </name> <sc> Unisys.com/snow(“Aspen”) </sc> <depth unit=“meter”>1</depth> <hotels ID=AspHotels > …. Yahoo.com/GetHotels(<city name=“Aspen”/>) </hotels> </resort> …</resorts>
May contain callsto any SOAP web service :
• e-bay.net, google.com…to any AXML web services
• to be defined
31
P2P Data Management, 2006, S. Abiteboul 31/89
Marketing Philosophy
Manon: What’s the capital of Brazil?Dad: Let’s ask Wikipedia.com!
Manon: How do I get a cheap ticket to Galapagos?Dad: Let’s place a subscription on LastMinute.com!
Manon: What are the countries in the EC?Dad: France, Germany, Holland, Belgium, and hum… Let’s ask YouLists.com for more!
Active answer = intensional and dynamic and flexible
Embedding calls in data is an old idea in database
32
P2P Data Management, 2006, S. Abiteboul 32/89
Active XML peer
Peer-to-peer architecture
Each Active XML peer • Repository: manages Active XML data • Web client: calls the services inside a document
• Web server: provides (parameterized) queries/updates over the repository as web services
Exchange of AXML instead of XML
AXMLpeerso
ap
33
P2P Data Management, 2006, S. Abiteboul 33/89
What is an AXML peer?
PC• Now open source ObjectWeb – queries in OQL
Peer on a mass storage system• eXist (open source XML database) queries in XQuery• Xyleme queries in XyQL
PDA or cell phone• Persistence in file system and XPATH
On going: the entire network• Data is stored in a P2P network - KadoP
More: java card, a relational database…
34
P2P Data Management, 2006, S. Abiteboul 34/89
A key issue: call activation
When to activate the call?• Explicit pull mode: active databases
• Implicit pull mode: deductive databases
• Push mode: query subscription
What to do with its result?
How long is the returned data valid? • Mediation and caching
Where to find the arguments?• Under the service call: XML,XPATH or a service call
35
P2P Data Management, 2006, S. Abiteboul 35/89
Hi John, what is the phone number of the Prime Minister of France?• Find his name at whoswho.com then look in the phone dir• Look in the yellow pages for deVillepin’s in phone dir of www.gov.fr• (33) 01 56 00 01
Another key issue: what to send?
Send some AXML tree t• As result of a query or as parameter of a call
The tree t contains calls, do we have to evaluate them? • If I do, I may introduce service calls, do we have to evaluate all
these calls before transmitting the data?
36
P2P Data Management, 2006, S. Abiteboul 36/89
Active XMLcool idea – complex problems
Blasphemous claim:
Active XML is the proper paradigm for data exchange!
Not XML + not XQuery
Brings to a unique setting
distributed db, deductive db, active db, stream data
warehousing, mediation
This is unreasonable? Yes!
Plenty of works ahead… to make it work
But first, the algebra
37
P2P Data Management, 2006, S. Abiteboul 37/89
Outline
1. Introduction – the data ring
2. Calculus for P2P data management (ActiveXML)
3. Algebra for P2P data management (ActiveXML algebra)• Query processing
• Query optimization
4. Indexing in P2P (KadoP)
5. Conclusion
38
3. Active XML algebra
39
P2P Data Management, 2006, S. Abiteboul 39/89
Motivation
Relational model: centralized tables
optimization: algebraic expression and rewriting
Active XML model: distributed trees
optimization: algebraic expression and rewriting
Distributed query optimization based on algebraic rewriting of Active XML trees
Based on experiences with AXML optimization
40
P2P Data Management, 2006, S. Abiteboul 40/89
Active XML peers
We focus on positive AXML • Set-oriented data
• Positive/monotone services
Services = tree-pattern-query-with-join queries
Services produce streams • Optimized by a local query optimizer
• Evaluated by a local query processor
• Out of our scope
π
join
π
outputstream
inputstream
inputstream
Localqueryprocessing
41
P2P Data Management, 2006, S. Abiteboul 41/89
The problem
An AXML system • A set of peers
• For each peer a set of documents and services
• Extensional data is distributed
• Intensional data (knowledge) is distributed
– Defined using query services (TPQJ queries)
– These services are generic: any peer can evaluate a query
A query q to some peer
Evaluate the answer to q with optimal response time
42
P2P Data Management, 2006, S. Abiteboul 42/89
(AXML) algebraic expressions:
AXML algebra
AXML logic
l
E1 E2 En…
s@p
E1 E2 En…d@p
send@p
#n@p’ E1
eval@p
E1
receive@p
E1
Each such expression lives at some peerIncludes the AXML trees
43
P2P Data Management, 2006, S. Abiteboul 43/89
Algebraic expressions annotations
Executing service call:
Terminated service call:
Subtletyq@p(5): definition of intensional data
eval(q@p(5)): request to evaluate it; during query optimization
q@p(5): query is being evaluated; during query processing
q@p(5): query evaluation is complete
44
P2P Data Management, 2006, S. Abiteboul 44/89
Evaluation rules: local rules
for l ≠ sc, s ≠ send, receive
l
t1 t2 tn…
eval@p l
eval@p
t1
eval@p
t2
eval@p
tn…
→
tn
s@p
t1 t2 …
eval@p ●s@p
eval@p
t1
eval@p
t2
eval@p
tn…
→ E1
eval@p
eval@p E1
eval@p→
45
P2P Data Management, 2006, S. Abiteboul 45/89
Evaluation rules: transfer rules
Site p asks p’ to do the work and send the result to p
s@p’
t1 t2 …
eval@p
#x@p
s@p’
t1 t2 …
receive@p
#x@p
s@p’
t1 t2 …
eval@p’
newRoot()@p’
send@p’
#x@p
→ &
46
P2P Data Management, 2006, S. Abiteboul 46/89
Evaluation rules: more transfer rules
When a query is evaluated, results appear
They are sent to the place that requested them
Also some rules for eof
→ &
#x@p
Z
…
send@p’
eval@p’
●s@p’
…
#x@p
Z
…
send@p’
eval@p’
●s@p’
…
#x@p
Z
…
47
P2P Data Management, 2006, S. Abiteboul 47/89
Evaluation
Reminder: setting• An AXML system
• A request to evaluate query q at peer p – eval@p( q )
Rewrite the trees in peer workspaces until termination of the process
Results
For positive XML, this process converges… to a possibly infinite state
This process computes the answer to q
May be fairly inefficient: need for optimization!
48
Optimization
More rewrite rules to evaluate a query more efficiently
49
P2P Data Management, 2006, S. Abiteboul 49/89
Query optimization
Well-known optimization techniques for distributed data management
Pushing selections
Semijoin reducers
Horizontal, vertical, hybrid decomposition
Recursive query processing and query-subquery
Some “specific” AXML optimizations
Pushing queries over service calls
Lazy service call evaluation …
Optimizing subscription management
All are captured by the algebraic framework
50
P2P Data Management, 2006, S. Abiteboul 50/89
Example: pushing selections
Same rule applies if d@p2 is replaced by a continuous query
q
eval@p
d@p2
q1@p
eval@p
@p2(q2@p2(d@p2))
→ q1
eval@p
(q2(d@p2))
→ q1@p
eval@p
eval@p2
(q2(d))
→
Suppose q ≡ q1((q2)):
51
P2P Data Management, 2006, S. Abiteboul 51/89
Example: interleaving of processing and optimization
At peer i: di = ri di+1
Query at p1: (d1)
1. (d1) → (r1) (d2)eval@p1((r1) (d2)) → eval@p1((r1)) eval@p1((d2))eval@p1((r1)) → ●(r1) (starts streaming data)
2. (d2) → (r2) (d3); (r2) starts streaming data
3. (d3) → (r3) (d4) …
(r1) (r2) (r3) (r4)
…
52
P2P Data Management, 2006, S. Abiteboul 52/89
Transfer and load balancing rules
Peer p1 delegates the evaluation of E to p2
E
eval@p1
#x@p1
→
eval@p2
send@p2
#x@p1
newRoot@p2()
eval@p1
#x@p1
send@p1
E
53
P2P Data Management, 2006, S. Abiteboul 53/89
Transfer and load balancing rules
Peer p1 delegates the evaluation of E to p2
eval(E)
#x@p1
→
send@p2
#x@p1
newRoot@p2()
eval@p1
#x@p1
send@p1
eval@p2
E
54
P2P Data Management, 2006, S. Abiteboul 54/89
Transfer and load balancing rules
Peer p1 delegates the evaluation of E to p2
eval(E)
#x@p1
→
newRoot@p2()#x@p1
send@p2
#x@p1 eval@p2
E
&
55
P2P Data Management, 2006, S. Abiteboul 55/89
Transfer and load balancing rules
Peer p1 delegates the evaluation of E to p2
eval(E)
#x@p1
→
newRoot@p2()#x@p1
send@p2
#x@p1eval(E)
&
56
P2P Data Management, 2006, S. Abiteboul 56/89
Transfer and load balancing rules
Peer p1 delegates the evaluation of E to p2
#x@p1
eval(E) →
newRoot@p2()#x@p1
send@p2
#x@p1eval(E)
&eval(E)
57
P2P Data Management, 2006, S. Abiteboul 57/89
Transfer and load balancing rules
Peer p1 delegates the evaluation of E to p2
E
eval@p1
#x@p1
→
eval@p2
send@p2
#x@p1
newRoot@p2()
eval@p1
#x@p1
send@p1
E
←
58
P2P Data Management, 2006, S. Abiteboul 58/89
Back to interleaved execution and optimization
Data transfers reducedMore work for p1: merging all the streams
(r1) (r2) (r3) (r4)
…
(r1) (r2) (r3) (r4) …
Hierarchical stream merging
… …
Repeated transfers
59
P2P Data Management, 2006, S. Abiteboul 59/89
Example: Horizontal and vertical decomposition
A relation d over ABC that is split both horizontally and vertically d = (d1 d2) d3
d1 = B<5 (d') and d2 = B>=5 (d')
d', d1, d2 over AB and d3 over BC; each di is at a peer pi
Consider the query B=0@p(d) → B=0@p( (B<5 (d') B>=5 (d'))) d3@p3 )→ B=0 @p( d1@p1 d3@p3 )→ @p (#x@preceive(d1@p1), #y@preceive(d3@p3)) & send@p1(#x@p; B=0@p1(d1@p1) ) & send@p3(#y@p; d3@p3)
60
P2P Data Management, 2006, S. Abiteboul 60/89
Common sub-expression elimination
eval@p(E), #x@preceive@p(E) →
eval@p(#x@p), #x@preceive@p(E)
eval@p
E
#x@p
E
receive@p
eval@p #x@p
E
receive@p#x@p→
& &
61
P2P Data Management, 2006, S. Abiteboul 61/89
In spite of the two calls to s@p, the function is evaluated only once
Common sub-expression elimination
eval@p’
q@p’
s@p s@p
q@p’
#x@p’
s@preceive@p’
s@p
eval@p’
newRoot()@p
send@p
#x@p’ s@p
→ &
newRoot()@p’
send@p’
#y@p’ #x@p’
q@p’
#x@p’
receive@p’
s@p
#y@p’
receive@p’
s@p
newRoot()@p
send@p
#x@p’ s@p
→& &
62
P2P Data Management, 2006, S. Abiteboul 62/89
Example: recursive query processing
Using a pseudo Datalog syntax
s1@p($x, $y) d2@p'($x, $z), s2@p'($z, $y)
s2@p'($x, $y) d1@p($x, $z), s1@p($y, $z)
After rewriting:
on p : #x@p receive@p(q1@p'(d2@p', s2@p') )
root@p send@p(#y@p', q2@p( d1@p, #x@p) )
on p' : root@p' send@p'(#x@p,
q1@p'(d2@p', #y@p' receive@p'(s2@p') ) )
63
P2P Data Management, 2006, S. Abiteboul 63/89
q@any: where q is a query• Any peer that has some query processor for q can do it
f@any: where f is a processing service call• Example: decryption or gene comparison
q over a P2P collection
Generic and global services
eval@p
q @ coll
eval@p
q @→ index
q
eval@p
q@p1→eval@p
q@p2
64
P2P Data Management, 2006, S. Abiteboul 64/89
The AXML algebra – conclusion
Captures distributed XML query processing/optimization
Based on a communication model a la CCS
Algebraic – stream-oriented
Orthogonal to the local XML query optimizer
Orthogonal to the network support (DHT, small world etc.)
What is not yet available? A cost model and heuristics
65
P2P Data Management, 2006, S. Abiteboul 65/89
Outline
1. Introduction – the data ring
2. Calculus for P2P data management (ActiveXML)
3. Algebra for P2P data management (ActiveXML algebra)
4. Indexing in P2P (KadoP)
5. Conclusion
66
4. P2P XML indexing and query processing
67
P2P Data Management, 2006, S. Abiteboul 67/89
Efficient evaluation of tree-pattern-queries
Many optimization techniques
We are interested here in distributed query evaluation/optimization
1) We consider XML indexing
2) Holistic twig join that is based on indexing
3) P2P indexing
4) P2P query processing
5) Optimizing P2P indexing
68
P2P Data Management, 2006, S. Abiteboul 68/89
XML indexing: structural identifiers
1
2
3 5
6
7
84
6
6
6
8
8
8
X ancestor of Y <=>pre(X) < pre(Y) andpost(X) ≥ post(Y)
X parent of Y <=>X ancestor of Y andlevel(X) = level(Y) - 1Structural IDs = Prefix-Postfix
A
B
D E
C
F
“John” G
0
1 1
22 2
3
443
-Level
69
P2P Data Management, 2006, S. Abiteboul 69/89
Holistic Twig Join
Input a document and a tree pattern query
Find the bindings of the query in the document
Holistic = holistique
(le tout et pas juste les parties)
Twig = brindille
Join = you know…
Sounds like Harry Potter?
70
P2P Data Management, 2006, S. Abiteboul 70/89
Query evaluation over a document
Ids for A(1,8,0)…
Ids for D
Ids for “John”
“John”
Ids for C
Ids are sorted in lexicographical orderGoals is to find “matching Ids”
A
DC
71
P2P Data Management, 2006, S. Abiteboul 71/89
The Holistic Twig Join Algorithma
b
c
a (3,5
c (4,5)
a (5,5)
c (6,8)
c (7,8)
a (8,8)
b (10,11)
c (11,11)
b (12,14)
b (13,14)
c (14,14)
a (16,17)
c (17,17)
b (19,22)
b (20,21)
c (22,22)
c (23,25)
b (24,25)
c (25,25)
a (2,8) c (9,14) a (15,17) a (18,25)
c (21,21)
r (1,25)0
1
2
3
4
level
72
P2P Data Management, 2006, S. Abiteboul 72/89
Legend:
c11c9
a
c8c7
a6
b3
c6c4
The Holistic Twig Join Algorithm
a1
a2
a3
a4
a5 a7
b1 b2 b4
b5
c1 c2
c3
c5c9
b
c
(a7, b4, c8), (a7, b5, c8),
Sc
Sb
Sa
a7
b4
c8
b6
b5
c10
(a7, b4 ,c9)
b6
c11
(a7 ,b6 ,c11)
Head of the streamFind the match for the query sub-tree determined by this node !!!The ID is present also in the stack
Stacks
This is the end
73
P2P XML processing
74
P2P Data Management, 2006, S. Abiteboul 74/89
XML indexing in Xyleme
History• 1999: INRIA research project • 2000: Creation of a spin-off• 2006: About 25 people
Technology• A scalable XML repository• A content warehouse • On a cluster of Linux PC
XML query processing • Twig join• Index is distributed• Keyword-based vs. document
based
LAN
Put(C;[d,p,6,6,1])Put(“John”;[d,p,3,1,2])
hash(C)
hash(“John)
75
P2P Data Management, 2006, S. Abiteboul 75/89
Query processing over a distributed collection
Ids for A(p12,d456, 1,7,0)…
Ids for D
Ids for John
A
DC
“John”
Ids for C
Ids include peerId and docIdIds are sorted in lexicographical orderGoals is to find “matching Ids” in the collection
76
P2P Data Management, 2006, S. Abiteboul 76/89
XML indexing in KadoP
Use structural Ids
Publish them via a DHT• Distributed Hash Table
• Peers come and go
• Locate(k):: log(n) messages to fing the peer in charge of key k
• Put(k,v)
• Get(k): retrieves all the values for k
We use Pastry • We also tried P2PSim and JXTA
DHT
put(C;[d,p,6,6,1])
put(“John”;[d,p,3,1,2])
hash(C)
hash(“John)
posting for C
put(C;[d,p,6,6,1])
77
P2P Data Management, 2006, S. Abiteboul 77/89
XML query processing in KadoP
Given a tree pattern query Q
Evaluate an index query indexQ to locate the peers that can provide some answers
• indexQ is a twig join
Ship Q to these peers and evaluate it there
If indexQ is imprecise, many false positive• Example: ship Q to all peers (maximal parallelism)
• Example: Instead of structural Ids, just use (label/word,peerId,docId)
78
P2P Data Management, 2006, S. Abiteboul 78/89
KadoP architecture
KadoP peerpublish & query
KadoP Engine
DHT locate, put, get & delete Index
Indexing
Queryprocessing
ActiveXMLengine
Web interface
Semantic layerExternalLayer
LogicalLayer
PhysicalLayer
79
P2P Data Management, 2006, S. Abiteboul 79/89
Some technical issues
Our goal: manage millions of documents with a large number of peers
First experiments were a disaster
Replace the index storage of the DHT in a FS by storage in a database (Berkeley DB)
Extend the API of the DHT to have Append and not only Read/Write
Extend the API of the DHT to have a streaming exchange of postings• Useful because the XML algebra works better with streams
Now it scales but there is the issue of long postings
80
P2P Data Management, 2006, S. Abiteboul 80/89
The issue of long postings: Google in P2P
Using keyword distribution
Suppose • Peer for Ullman is in Europe• Peer for XML is in US
we have to ship one long posting between US and Europe
For a large number of users, we absorb all the bandwidth of Internet backbone
Need for replication
Even for thousands of peers, the exchange of long postings is an issue
DHTUllmanxml
Ullmann & xml?
81
P2P Data Management, 2006, S. Abiteboul 81/89
Intensional indexing in KadoP
long postingh(Name)
Distributed B-tree
Long posting = bad response time
1. No long posting
2. get h(name) then parallel fetch
3. Possibility to optimize further
f(docId55..docId75)
may be it does not match
no need to call f
h(Name)
f g h i
82
P2P Data Management, 2006, S. Abiteboul 82/89
More optimization
• Standard for P2P keyword search– Gap compression and adaptive set intersection
• Standard distributed query optimization techniques– Ship smallest list – Load balancing– Caching– Replication
• Semi-join techniques notably Bloom semi-join
83
P2P Data Management, 2006, S. Abiteboul 83/89
Outline
1. Introduction – the data ring
2. Calculus for P2P data management (ActiveXML)
3. Algebra for P2P data management (ActiveXML algebra)
4. Indexing in P2P (KadoP)
5. Conclusion
84
6. Conclusion
85
P2P Data Management, 2006, S. Abiteboul 85/89
Logic for distributed data management• Opinion: XQuery is a language for local XML management
• Proposal: ActiveXML
Algebraic foundation of distributed query optimization• Proposal: ActiveXML algebra
P2P (Active) XML indexing• KadoP is now being tested and we are working on optimization
Software• ActiveXML is open-source – see activexml.net
• KadoP soon will be – already available upon request
• EDOS distribution system as well
Conclusion
86
P2P Data Management, 2006, S. Abiteboul 86/89
Lots of related work and related systems
This is going very fast in system devepments
Structured P2P nets: Pastry, Chord
Content delivery net: Coral, Akamai
XML repositories: Xyleme, DBMonet
Multicas systemst: Avalanche, Bullet
File sharing systems: BitTorrent, Kazaa
Pub/Sub systems: Scribe, Hyper
Distributed storage systems: OceanStore, GoogleFS
Etc.
Fundamental research is somewhat left behind
87
P2P Data Management, 2006, S. Abiteboul 87/89
Issues
P2P query optimization
P2P access control
P2P archiving
P2P self tuning
P2P monitoring
P2P knowledge management [SomeWhere]
Also: analysis and verification of these systems• E.g., termination, error detection, diagnosis
88
P2P Data Management, 2006, S. Abiteboul 88/89
Find your own topic
Pick your favorite problem for data or knowledge management and study it in a P2P setting
with gigabytes of data and thousands of peers
If you find it boring, consider it
with terabytes of data and millions of peers
89
P2P Data Management, 2006, S. Abiteboul 89/89
Merci
Merci