Laboratoire LIP6 The Gedeon Project: Data, Metadata and Databases Yves DENNEULIN LIG laboratory, Grenoble ACI MD


Laboratoire LIP6

The Gedeon Project: Data, Metadata and Databases

Yves DENNEULIN, LIG laboratory, Grenoble

ACI MD

Context and goals
● Heterogeneous metadata management on grids
  Clusters of clusters
● High-level queries using metadata
● Easy and flexible deployment and configuration
● Minimal overhead
● Various interfaces
● Initial target application domains
  Biocomputing (lots of metadata, little data)
  Microscopic imaging (lots of data, little metadata)

[Figure: GEDEON middleware instances on a bioinformatics cluster and on the grid. Labels: query, result, proprietary sequences, swissprot, TrEMBL.]

The Gedeon middleware
Metadata management on lightweight grids
● Records of (attribute, value) pairs stored in files
Flexible requests
● Can be combined through scripting
Various interfaces
● Command line (tools)
● Libraries
● Virtual FS (legacy applications support)
Deployment “à la carte”
● Composition of various data sources
Performance
● Dedicated I/O library
● Semantic caching

Outline

1. General architecture

a. Gedeon internal structure

b. Composition of various data sources

2. Practical use

3. « dual » cache

Conclusion

Example of a deployment
[Figure: a client with its query interface (API, FS, GUI, ...) and a local proxy; interconnect middleware links it to local proxies on servers « close » to the client and on storage sites. Each proxy has a cache.]

Gedeon components
● Gedeon Kernel
  fuple
  ● I/O Library
  ● Evaluate the queries
  lowerG
  ● Operators to compose bases
  ● Remote access
● Interface
  API lowerG
  Virtual FS
● Cache
[Figure: component stack. Labels: application, vSGF, lowerG, fuple, network, cache; local proxy with lowerG, fuple, network.]

What is inside the sources?

● Records of attribute/value pairs

Record
  Id        457
  classifA  Bacteria
  classifB  Clostridia
  taille    26
  ref
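One way to picture such a record is as a mapping from attribute names to values, where two records need not carry the same attributes. The sketch below is only an illustration in Python; it does not reproduce Gedeon's file format, and the helper name `attributes` is hypothetical.

```python
# Illustrative model only: Gedeon's real on-disk format is not shown in the slides.
record_457 = {"Id": 457, "classifA": "Bacteria", "classifB": "Clostridia", "taille": 26}
record_460 = {"Id": 460, "classifA": "Fermicutes"}  # fewer attributes, still a valid record

def attributes(record):
    """Return the attribute names carried by one record, sorted."""
    return sorted(record)

print(attributes(record_457))
print(attributes(record_460))
```

Because each record carries its own attribute names, heterogeneous sources can be mixed without agreeing on a common schema first.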

Example of composition of sources
[Figure: a client queries sites S1, S2 and S3, composed through the operators + (union), J (join) and RR (round robin).]
Metadata can be local or copies

Union (+)
[Figure: records A1, A2, A3, A4, ... + records B1, B2, B3, B4, ... give a single source holding records A1 to A4 and B1 to B4.]
Unify storage space + parallel evaluation
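The union operator can be sketched as follows: the same query is evaluated on every source at once and the partial results are concatenated. This is a minimal Python illustration under stated assumptions, not Gedeon's implementation: the function names are hypothetical, sources are plain lists, and a query is a Python predicate rather than a Gedeon request.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(source, predicate):
    """Evaluate a query (here a Python predicate) on one source."""
    return [rec for rec in source if predicate(rec)]

def union(sources, predicate):
    """Query every source in parallel and concatenate the partial results."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        # pool.map keeps the order of `sources`, so results stay grouped by site.
        parts = pool.map(lambda src: evaluate(src, predicate), sources)
        return [rec for part in parts for rec in part]

site_a = [{"Id": "A1"}, {"Id": "A2"}, {"Id": "A3"}, {"Id": "A4"}]
site_b = [{"Id": "B1"}, {"Id": "B2"}, {"Id": "B3"}, {"Id": "B4"}]
result = union([site_a, site_b], lambda rec: True)
print([rec["Id"] for rec in result])
```

The client sees one logical source, while each site only evaluates the query on its own records.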

Round Robin (RR): fault tolerance
[Figure: one client reaches Source 1 and Source 2 through an RR operator.]

Round Robin (RR): load balancing
[Figure: several clients reach Source 1 and Source 2 through an RR operator.]
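Both uses of the round-robin operator can be illustrated with one small class, assuming the sources hold copies of the same metadata. The class and function names are hypothetical and the sources are plain Python callables; this is a sketch of the idea, not Gedeon code.

```python
class RoundRobin:
    """Rotate over replicated sources; skip a source that fails."""

    def __init__(self, sources):
        self.sources = list(sources)
        self.next_index = 0

    def query(self, request):
        last_error = None
        for _ in range(len(self.sources)):
            source = self.sources[self.next_index]
            # Advance first so the next call starts on another replica.
            self.next_index = (self.next_index + 1) % len(self.sources)
            try:
                return source(request)
            except ConnectionError as err:
                last_error = err  # fault tolerance: try the next replica
        raise RuntimeError("no source could answer") from last_error

def source_1(request):
    return ("source 1", request)

def source_2(request):
    return ("source 2", request)

def broken_source(request):
    raise ConnectionError("site unreachable")

balanced = RoundRobin([source_1, source_2])
print(balanced.query("q1")[0], balanced.query("q2")[0])  # load balancing

tolerant = RoundRobin([broken_source, source_2])
print(tolerant.query("q3")[0])  # fault tolerance
```

Successive requests alternate between replicas, and a request sent to an unreachable replica is retried on the next one.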

Join operator (J)

First source
  Id   A1  A2  A3
  457  v1  v2  v3
  458  v4  v5  v6
  ...

Second source
  Id   An
  457  vAn1
  458  vAn2
  ...

Result of the join on Id
  Id   A1  A2  A3  An
  457  v1  v2  v3  vAn1
  458  v4  v5  v6  vAn2

Enrich a source with another
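The join on Id can be sketched in a few lines of Python. The function name `join` and the choice to keep records that have no match in the second source unchanged are assumptions made for the illustration; the slides do not say how Gedeon treats unmatched records.

```python
def join(left, right, key="Id"):
    """Enrich each record of `left` with the attributes of the `right` record sharing `key`."""
    right_by_key = {rec[key]: rec for rec in right}
    joined = []
    for rec in left:
        # Assumption: a record without a match is kept as it is.
        extra = right_by_key.get(rec[key], {})
        joined.append({**rec, **extra})
    return joined

source_1 = [
    {"Id": 457, "A1": "v1", "A2": "v2", "A3": "v3"},
    {"Id": 458, "A1": "v4", "A2": "v5", "A3": "v6"},
]
source_2 = [{"Id": 457, "An": "vAn1"}, {"Id": 458, "An": "vAn2"}]

for rec in join(source_1, source_2):
    print(rec)
```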


Tools 1/2

● Libraries
● CLI
● Operations
  sort
  projection
  select
  index
  ...

Tools 2/2

● Examples
  sort
    sort(attr='taille')
    $> cat mesmeta.g | fsort 'taille' > trie_taille.g
  index
    create_idx(attr='Id')
    [Figure: one .Id.idx index file per source]
    search_idx('Id', 'P0123')
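To show what the sort and index operations do to a set of records, here is a small Python stand-in. The names `sort_records`, `create_index` and `search_index` are hypothetical and are not the Gedeon library calls. Placing records that lack the sort attribute last is an assumption, as is keeping the index in memory rather than in an `.Id.idx` file.

```python
def sort_records(records, attr):
    """Sort by `attr`; records lacking the attribute go last (an assumption)."""
    return sorted(records, key=lambda rec: (attr not in rec, rec.get(attr, 0)))

def create_index(records, attr):
    """Map each value of `attr` to the positions of the records holding it."""
    index = {}
    for position, rec in enumerate(records):
        if attr in rec:
            index.setdefault(rec[attr], []).append(position)
    return index

def search_index(records, index, value):
    """Fetch records through the index instead of scanning every record."""
    return [records[position] for position in index.get(value, [])]

records = [
    {"Id": 459, "classifB": "Bacteria", "taille": 47},
    {"Id": 460, "classifA": "Fermicutes"},
    {"Id": 457, "classifA": "Bacteria", "classifB": "Clostridia", "taille": 26},
]

print([rec["Id"] for rec in sort_records(records, "taille")])
id_index = create_index(records, "Id")
print(search_index(records, id_index, 457))
```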

Language for the requests

● Simple ($, type control with the operators)

● Regular expressions

● Second order (conditions on attribute names)

Select expression

Input records
  Id 457 | classifA Bacteria | classifB Clostridia | taille 26
  Id 459 | classifB Bacteria | taille 47
  Id 460 | classifA Fermicutes

Select $Id>459

Result
  Id 460 | classifA Fermicutes

Select using regexp

Input records
  Id 457 | classifA Bacteria | classifB Clostridia | taille 26
  Id 459 | classifB Bacteria | taille 47
  Id 460 | classifA Fermicutes

Select $classifB==/.*a$/

Result
  Id 457 | classifA Bacteria | classifB Clostridia | taille 26
  Id 459 | classifB Bacteria | taille 47

Select using 2nd order logic

Input records
  Id 457 | classifA Bacteria | classifB Clostridia | taille 26
  Id 459 | classifB Bacteria | taille 47
  Id 460 | classifA Fermicutes

Select $/classif[AB]/==Bacteria && $taille>=36

Result
  Id 459 | classifB Bacteria | taille 47
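The three kinds of select expression (simple comparison, regular expression on a value, regular expression on an attribute name) can be mimicked in Python to check the results shown on the slides. This sketch does not parse the Gedeon request language: each request is rewritten by hand as a Python predicate, the helper names are hypothetical, and treating a condition on a missing attribute as false is an assumption.

```python
import re

records = [
    {"Id": 457, "classifA": "Bacteria", "classifB": "Clostridia", "taille": 26},
    {"Id": 459, "classifB": "Bacteria", "taille": 47},
    {"Id": 460, "classifA": "Fermicutes"},
]

def select(records, predicate):
    """Keep the records for which the predicate holds."""
    return [rec for rec in records if predicate(rec)]

def value_passes(rec, name, test):
    """Condition on one named attribute; false if the attribute is missing (assumption)."""
    return name in rec and test(rec[name])

def any_attr_passes(rec, name_pattern, test):
    """Second-order condition: some attribute whose name matches the pattern passes the test."""
    return any(re.fullmatch(name_pattern, name) and test(value)
               for name, value in rec.items())

# Select $Id>459
simple = select(records, lambda r: value_passes(r, "Id", lambda v: v > 459))

# Select $classifB==/.*a$/
regexp = select(records, lambda r: value_passes(
    r, "classifB", lambda v: re.search(r".*a$", v) is not None))

# Select $/classif[AB]/==Bacteria && $taille>=36
second_order = select(records, lambda r: (
    any_attr_passes(r, r"classif[AB]", lambda v: v == "Bacteria")
    and value_passes(r, "taille", lambda v: v >= 36)))

for name, result in [("simple", simple), ("regexp", regexp), ("second order", second_order)]:
    print(name, [rec["Id"] for rec in result])
```

Record 457 matches the second-order condition on `classifA` but is rejected by `$taille>=36`, which is why only record 459 remains.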

Virtual FS interface

● Just a specific file-oriented interface
● Data and metadata can be anywhere in the grid
● Definition of logical directories
  Ex: cd '$classifB==|.*a$|'
  « and » between directories
  1 filename = value of a metadata: logical view

/fs_virt/$classifB==|.*a$|> ls
457 459
/fs_virt/$classifB==|.*a$|> cat *>/tmp/mater
/fs_virt/$classifB==|.*a$|>


Dual cache (1)

● 2 cooperative caches
  cache of requests (R, {id,...}) -> save computing power
  cache of data (id, {attr,...}) -> save bandwidth
● Semantic cache
  Can evaluate a query using the data in the cache
  Can generate a remainder to complement the cached data

Example

● Refinement of a request
  1) '$OC==/Eukaryota/'
     -> (R, Lid={id1,id2, ...})
  2) '$OC==/Eukaryota/ && $year>=1998'
     Select(*Lid, '$year>=1998')
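The refinement example can be traced with a toy pair of caches in Python. Everything here is an assumption made for the illustration: the names, the invented records standing in for the remote sources, the use of equality instead of a regular expression on `OC`, and the fact that the caller states which cached request is being refined. A real semantic cache would detect the overlap itself and could also generate a remainder query.

```python
SERVER = {  # stand-in for the remote sources; values are invented for the example
    "id1": {"OC": "Eukaryota", "year": 1995},
    "id2": {"OC": "Eukaryota", "year": 2001},
    "id3": {"OC": "Archaea", "year": 1999},
}
server_scans = 0

def server_select(predicate):
    """Full evaluation on the sources: the expensive path."""
    global server_scans
    server_scans += 1
    return [rid for rid, rec in SERVER.items() if predicate(rec)]

query_cache = {}  # request text -> list of ids, i.e. (R, Lid)
data_cache = {}   # id -> record, i.e. (id, {attr, ...})

def fetch(rid):
    """Return a record, transferring it only on a data-cache miss."""
    if rid not in data_cache:
        data_cache[rid] = SERVER[rid]
    return data_cache[rid]

def select(request, predicate, refines=None, extra=None):
    """Answer `request`; if it refines a cached request, filter the cached ids only."""
    if request in query_cache:
        return query_cache[request]
    if refines in query_cache:
        ids = [rid for rid in query_cache[refines] if extra(fetch(rid))]
    else:
        ids = server_select(predicate)
    query_cache[request] = ids
    return ids

def is_eukaryota(rec):
    return rec["OC"] == "Eukaryota"

def is_recent(rec):
    return rec["year"] >= 1998

first = select("$OC==/Eukaryota/", is_eukaryota)
second = select("$OC==/Eukaryota/ && $year>=1998",
                lambda rec: is_eukaryota(rec) and is_recent(rec),
                refines="$OC==/Eukaryota/", extra=is_recent)
print(first, second, server_scans)
```

The second request is answered from the cached id list, so the sources are scanned only once, and only the records in that list are fetched into the data cache.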

Dual cache (2)

● Distributed semantic cache
  Typically used inside communities
  ● Lots of common requests
  No location constraints
  ● Members of the community can be geographically scattered
● Distributed data cache
  Minimize time and data transfer
  Cooperation between sites that are topologically close

Dual cache (3)

[Figure: servers in Grenoble and Rennes share a dual cache. The query cache follows semantic locality (communities Eukaryota and Archaea); the object cache follows geographic locality.]

Dual cache (4)

● Work in progress on the notion of distance
  Find geographical proximity
  Find common interests between communities
  ● Create hybrid communities based on their requests
● Could be used to change the cache parameters
  Manual and/or automatic

Conclusion

● A data integration middleware
  Handling of metadata
● Distributed and modular
  Deployment can be done according to architectural/organisational constraints
● Definition of a dual cache infrastructure
  Reflect both organisational use
● Prototype in use
  Packaging and documentation needed

Questions?