Accessing Existing Distributed Science Archives as RDF Models
Alasdair J G Gray (1), Norman Gray (2), Iadh Ounis (1)
(1) Computing Science, University of Glasgow
(2) Physics and Astronomy, University of Leicester
Outline
• Motivating science problems
• A data integration approach
• RDF and SPARQL
• Extracting scientific data
• Performance results
• Conclusions
Searching for Brown Dwarfs
• Datasets:
  – Near infrared, 2MASS / UK Infrared Deep Sky Survey
  – Optical, APMCAT / Sloan Digital Sky Survey
• Complex colour/motion selection criteria
• Similar problems
  – Halo white dwarfs
Deep Field Surveys
• Observations in multiple wavelengths
  – Radio to X-ray
• Searching for new objects
  – Galaxies, stars, etc.
• Requires correlations across many catalogues
  – ISO
  – Hubble
  – SCUBA
  – etc.
The Problem
Locate and combine relevant data
• Heterogeneous publishers
  – Archive centres
  – Research labs
• Heterogeneous data
  – Relational
  – XML
  – Files
Virtual Observatory
Generic Data Integration Approach
• Heterogeneous sources
  – Autonomous
  – Local schemas
• Homogeneous view
  – Mediated global schema
• Mapping
  – LAV: local-as-view
  – GAV: global-as-view
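[Figure: queries Query 1 to Query n are posed against the Global Schema; Mappings connect the global schema to Wrapper 1 to Wrapper k, one wrapper per database DB 1 to DB k.]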
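The mediated global schema can be illustrated with a minimal global-as-view (GAV) sketch. The two sources, their field names, and their values are invented for illustration; each wrapper rewrites a local row into the global schema, and the global relation is defined as a view over the wrapped sources.

```python
# Hypothetical local sources with their own schemas (names and values are illustrative).
source_a = [{"obj_id": 1, "ra_deg": 10.5, "dec_deg": -3.2}]
source_b = [{"id": 7, "right_ascension": 200.1, "declination": 45.0}]


def wrap_a(row):
    """Wrapper for source A: rename local fields to the global schema."""
    return {"id": f"A{row['obj_id']}", "ra": row["ra_deg"], "dec": row["dec_deg"]}


def wrap_b(row):
    """Wrapper for source B: rename local fields to the global schema."""
    return {"id": f"B{row['id']}", "ra": row["right_ascension"], "dec": row["declination"]}


def global_objects():
    """GAV mapping: the global relation is the union of the wrapped local relations."""
    yield from (wrap_a(r) for r in source_a)
    yield from (wrap_b(r) for r in source_b)


def query_northern():
    """A query written against the global schema only."""
    return [o["id"] for o in global_objects() if o["dec"] > 0]


print(query_northern())
```

The query never mentions a local field name, which is the point of the homogeneous view. Under LAV the direction is reversed: each source would be described as a view over the global schema.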
P2P Data Integration Approach
• Heterogeneous sources
  – Autonomous
  – Local schemas
• Heterogeneous views
  – Multiple schemas
• Mapping
  – Between pairs of schema
  – Network of links
• Require common integration data model
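[Figure: queries Query 1 to Query n are posed against any of several schemas, Schema 1 to Schema j; Mappings link pairs of schemas, which sit over Wrapper 1 to Wrapper k, one wrapper per database DB 1 to DB k.]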
Resource Description Framework (RDF)
• W3C standard
• Designed as a metadata data model
• Make statements about resources
• Contains semantic details
• Ideal for linking distributed data
[Figure: example RDF graph. #Sun has rdf:type IAU:Star, #name "The Sun", and #foundIn #MilkyWay; #MilkyWay has rdf:type IAU:BarredSpiral and the #name values "Milky Way" and "The Galaxy".]
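The example graph can be written down as a set of subject-predicate-object statements. This is a minimal sketch using plain Python strings in place of IRIs and literals, not an RDF library; how the nodes and edges connect is read off the figure labels and is therefore an assumption.

```python
# The example graph as (subject, predicate, object) triples.
# Plain strings stand in for IRIs and literals.
graph = {
    ("#Sun", "rdf:type", "IAU:Star"),
    ("#Sun", "#name", "The Sun"),
    ("#Sun", "#foundIn", "#MilkyWay"),
    ("#MilkyWay", "rdf:type", "IAU:BarredSpiral"),
    ("#MilkyWay", "#name", "Milky Way"),
    ("#MilkyWay", "#name", "The Galaxy"),
}


def objects(graph, subject, predicate):
    """Return every object of the statements matching (subject, predicate, ?)."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


print(objects(graph, "#MilkyWay", "#name"))
```

A resource may carry several values for the same predicate, as #MilkyWay does for #name; this is one sense in which RDF data is less rigid than a relational row.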
Reasoning
• Infers knowledge from RDF(S) statements
  – Subclassing
• OWL: extends what can be expressed
  – Inverse predicates
  – Equivalence
  – etc.
[Figure: the example graph of #Sun and #MilkyWay extended by reasoning. IAU:BarredSpiral is an rdfs:subClassOf IAU:Galaxy, so #MilkyWay also has rdf:type IAU:Galaxy; #contains is shown as the inverse of #foundIn.]
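The two kinds of inference named above can be sketched as a naive forward-chaining loop. This is an illustration only, not how an RDFS or OWL reasoner is implemented; treating #contains as the declared inverse of #foundIn is an assumption read off the figure.

```python
def infer(graph):
    """Apply two rules repeatedly until no new statement is derived."""
    inverse_of = {"#foundIn": "#contains"}  # assumed inverse-predicate declaration
    closed = set(graph)
    while True:
        new = set()
        for s, p, o in closed:
            # Subclassing: an instance of a subclass is an instance of the superclass.
            if p == "rdf:type":
                new |= {(s, "rdf:type", sup) for c, q, sup in closed
                        if q == "rdfs:subClassOf" and c == o}
            # Inverse predicates: (s p o) implies (o inverse(p) s).
            if p in inverse_of:
                new.add((o, inverse_of[p], s))
        if new <= closed:
            return closed
        closed |= new


graph = {
    ("#Sun", "rdf:type", "IAU:Star"),
    ("#Sun", "#foundIn", "#MilkyWay"),
    ("#MilkyWay", "rdf:type", "IAU:BarredSpiral"),
    ("IAU:BarredSpiral", "rdfs:subClassOf", "IAU:Galaxy"),
}
closed = infer(graph)
print(("#MilkyWay", "rdf:type", "IAU:Galaxy") in closed)
print(("#MilkyWay", "#contains", "#Sun") in closed)
```

Both checks succeed on the closed graph although neither statement was asserted, which is what lets a query for IAU:Galaxy find a resource typed only as IAU:BarredSpiral.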
SPARQL
• Declarative query language
  – Select returned data
    • Graph or tuples
    • Attributes to return
  – Describe structure of desired results
  – Filter data
• W3C standard
• Syntactically similar to SQL
Find the name of the galaxy which contains a star with the name "The Sun"
SELECT ?galName
WHERE {
  ?gal a IAU:Galaxy ;
       <#name> ?galName .
  ?star a IAU:Star ;
        <#name> ?starName ;
        <#foundIn> ?gal .
  FILTER REGEX(?starName, "The Sun")
}
Querying RDF with SPARQL
Find the name of the galaxy which contains a star with the name "The Sun"
[Figure: the example RDF graph, including the statement that IAU:BarredSpiral is an rdfs:subClassOf IAU:Galaxy and the inferred rdf:type IAU:Galaxy for #MilkyWay.]
Result:
?galName
The Galaxy
Milky Way
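How the query arrives at these two rows can be sketched with a tiny pattern matcher over triples. This is a simplified illustration of basic graph pattern matching, not a SPARQL engine; the graph below assumes the inferred IAU:Galaxy type has already been added.

```python
import re

graph = {
    ("#Sun", "rdf:type", "IAU:Star"),
    ("#Sun", "#name", "The Sun"),
    ("#Sun", "#foundIn", "#MilkyWay"),
    ("#MilkyWay", "rdf:type", "IAU:BarredSpiral"),
    ("#MilkyWay", "rdf:type", "IAU:Galaxy"),  # inferred via rdfs:subClassOf
    ("#MilkyWay", "#name", "Milky Way"),
    ("#MilkyWay", "#name", "The Galaxy"),
}


def match(pattern, triple, binding):
    """Extend a binding so that the pattern matches the triple; None on failure."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def solve(patterns, graph):
    """Join the triple patterns one at a time, carrying variable bindings along."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for t in graph
                    if (b2 := match(pattern, t, b)) is not None]
    return bindings


patterns = [
    ("?gal", "rdf:type", "IAU:Galaxy"),
    ("?gal", "#name", "?galName"),
    ("?star", "rdf:type", "IAU:Star"),
    ("?star", "#name", "?starName"),
    ("?star", "#foundIn", "?gal"),
]
rows = [b for b in solve(patterns, graph) if re.search("The Sun", b["?starName"])]
print(sorted(b["?galName"] for b in rows))
```

Without the inferred rdf:type statement the first pattern would match nothing and the result would be empty.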
Integrating Using RDF
• Data resources
  – Expose schema and data as RDF
  – Need a SPARQL endpoint
• Allows multiple
  – Access models
  – Storage models
• Easy to relate data from multiple sources
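[Figure: a SPARQL query is posed against a Common Model (RDF); Mappings link the common model to an RDF/Relational Conversion layer over a Relational DB and an RDF/XML Conversion layer over an XML DB.]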
Accessing Relational Sources as RDF
Data Dump
• Data stored as RDF
  – Original relational source is replicated
  – Data can become stale
• Native SPARQL query support
• Existing RDF stores
  – Jena
  – Sesame
On-the-fly Translation
• Data stored as relations
• Native SQL support
  – Highly optimised access methods
• SPARQL queries must be translated
• Existing translation systems
  – D2RQ / D2R Server
  – SquirrelRDF
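The on-the-fly option can be illustrated with a minimal sketch that translates a single triple pattern into SQL and presents the rows as triples. The table, columns, and mapping format are hypothetical; this is not D2RQ's or SquirrelRDF's mapping language, and real translators handle whole queries rather than one pattern.

```python
import sqlite3

# Hypothetical mapping: predicate -> (table, key column, value column).
MAPPING = {
    "#name": ("star", "id", "name"),
    "#foundIn": ("star", "id", "galaxy"),
}


def translate(predicate):
    """Translate the triple pattern (?s, predicate, ?o) into a SQL query."""
    table, key, column = MAPPING[predicate]
    return f"SELECT {key}, {column} FROM {table} WHERE {column} IS NOT NULL"


# The data stays in the relational store; nothing is replicated as RDF.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE star (id TEXT PRIMARY KEY, name TEXT, galaxy TEXT)")
db.execute("INSERT INTO star VALUES ('#Sun', 'The Sun', '#MilkyWay')")

sql = translate("#foundIn")
triples = [(s, "#foundIn", o) for s, o in db.execute(sql)]
print(sql)
print(triples)
```

Because every answer is computed from the live table, the data cannot become stale, but each SPARQL query pays the cost of translation.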
System Hypothesis
It is viable to perform on-the-fly conversions from existing science archives to RDF to facilitate data access from a data model that a scientist is familiar with.
Test Data
• SuperCOSMOS Science Archive (SSA)
  – Data extracted from scans of Schmidt plates
  – Stored in a relational database
  – About 4 TB of data, detailing 6.4 billion objects
  – Fairly typical of astronomical data archives
• Schema designed using 20 real queries
• Personal version contains
  – Data for a specific region of the sky
  – About 0.1% of the data, ~500 MB
Analysis of Test Data
• About 500 MB in size
• Organised in 14 relations
  – Number of attributes: 2-152
    • 4 relations with more than 20 attributes
  – Number of rows: 3-585,560
  – Two views
    • Complex selection criteria in view
Real Science Queries
Query 5
Find the positions and (B, R, I) magnitudes of all star-like objects within delta mag of 0.2 of the colours of a quasar of redshift 2.5 < z < 3.5
SELECT TOP 30 ra, dec, sCorMagB, sCorMagR2, sCorMagI
FROM ReliableStars
WHERE (sCorMagB - sCorMagR2 BETWEEN 0.05 AND 0.80)
  AND (sCorMagR2 - sCorMagI BETWEEN -0.17 AND 0.64)
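SPARQL has no BETWEEN shorthand, but a range condition can be expanded into two comparisons inside a FILTER. The sketch below builds such a filter for the two colour cuts of Query 5; the variable names ?b, ?r2, and ?i are hypothetical stand-ins for the three magnitude columns.

```python
def between(expr, low, high):
    """Expand `expr BETWEEN low AND high` into two inclusive comparisons."""
    return f"({expr} >= {low} && {expr} <= {high})"


# Bounds are kept as strings so they appear exactly as written in the SQL query.
conditions = [
    between("?b - ?r2", "0.05", "0.80"),
    between("?r2 - ?i", "-0.17", "0.64"),
]
sparql_filter = "FILTER (" + " && ".join(conditions) + ")"
print(sparql_filter)
```

The arithmetic stays in the body of the query, which is why a query like this one remains expressible even though the range shorthand itself is missing.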
Analysis of Test Queries
Query feature                               Query numbers
Arithmetic in body                          1-5, 7, 9, 12, 13, 15-20
Arithmetic in head                          7-9, 12, 13
Ordering                                    1-8, 10-17, 19, 20
Joins (including self-joins)                12-17, 19
Range functions (e.g. Between, ABS)         2, 3, 5, 8, 12, 13, 15, 17-20
Aggregate functions (including Group By)    7-9, 18
Math functions (e.g. power, log, root)      4, 9, 16
Trigonometry functions                      8, 12
Negated sub-query                           18, 20
Type casting (e.g. radians to degrees)      7, 8, 12
Server functions                            10, 11
Expressivity of SPARQL
Features
• Select-project-join
• Arithmetic in body
• Conjunction and disjunction
• Ordering
• String matching
• External function calls (extension mechanism)
Limitations
• Range shorthands
• Arithmetic in head
• Math functions
• Trigonometry functions
• Subqueries
• Aggregate functions
• Casting
Expressible queries: 1, 2, 3, 5, 6, 14, 15, 17, 19
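The list of expressible queries can be reproduced from the feature table by removing every query that uses a feature SPARQL lacks. The sketch below assumes 20 queries in total and treats server functions as blocking alongside the listed limitations; that last choice is an assumption made so that the numbers in the table can be checked against the list.

```python
def expand(spec):
    """Expand a string such as "7-9,18" into the set {7, 8, 9, 18}."""
    numbers = set()
    for part in spec.split(","):
        first, _, last = part.partition("-")
        numbers |= set(range(int(first), int(last or first) + 1))
    return numbers


# Features taken as not expressible (including server functions is an assumption).
blocking = {
    "arithmetic in head": "7-9,12,13",
    "aggregate functions": "7-9,18",
    "math functions": "4,9,16",
    "trigonometry functions": "8,12",
    "negated sub-query": "18,20",
    "type casting": "7,8,12",
    "server functions": "10,11",
}
blocked = set().union(*(expand(s) for s in blocking.values()))
expressible = sorted(set(range(1, 21)) - blocked)
print(expressible)
```

Range functions are left out of the blocking set because a range can be rewritten as two comparisons, which is consistent with queries 2, 3, 5, 15, 17, and 19 appearing in the expressible list.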
Experimental Setup
• Machine
  – Intel Core 2 Duo 2.4 GHz
  – 2 GB RAM
  – Windows XP
  – Java 1.5
• Software
  – MySQL 5.0.51a
  – D2RQ 0.5.1
  – SquirrelRDF 0.1
Performance Results
• Only 4 queries completed within 2 hours
A New Approach
• Exploit query engines and data structure of underlying data sources
• Aid user query generation by explaining source data model in terms of known data model
• Data extracted in native model
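[Figure: an Explicator component, with Mappings and Models, explains the source data models; data is extracted from the Relational DB with SQL and from the XML DB with XQuery.]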
Conclusions
• SPARQL: Not expressive enough for science
• Query converters: Poor performance
• Proposed new approach
  – RDF to understand data models
  – Native query engines for data extraction
RDF                             Relational
Ragged data                     Structured data
Small to medium data volumes    Large data volumes
Reasoning over the data         Extracting specific data