Finding Haystacks with Needles: Ranked Search for Data Using Geospatial and
Temporal Characteristics
V.M. Megler and David Maier
Portland State University
Acknowledgements: This work is supported by NSF award OCE-0424602. We thank the staff of the Center for Coastal Margin Observation and Prediction for their support. We also thank the students and professionals who were willing to take part in the user study.
2
Haystacks
• Many environmental sensors deployed in the last decade
• Each sensor collects environmental observations
– Sometimes many per second
• Each observation has:
  – a time
  – a location
  – observed variables
• Observational data stored in many formats, many datasets
3
Needles
• Scientists at CMOP name “finding data relevant to their research” as one of their biggest problems²
• Example query: – “Any observations near the Astoria bridge in June 2009”
2. Center for Coastal Margin Observation and Prediction RIG Meeting, July 15 2010
[Diagram: original observations and their bounding box; the “needle” lies within the May … June window]
4
Problem: Finding Haystacks that Contain Needles
• Problem: Which datasets contain relevant data?
  – Many scientific datasets have no metadata
  – Many scientific datasets are not indexed
• Potential solution: extract simple dataset bounds, perform Boolean search
  – But: many false positives
Our Approach:
1. Create hierarchical metadata to represent dataset contents
2. Query over metadata
3. Rank query results
5
Current Approaches / Related Work (1)
• Search via data visualization
  – Given a specific dataset and data ranges, display the (large amount of) data
– Most common approach so far
• But: How does the scientist identify relevant datasets and ranges for visualization?
Example of visualization approach [Howe et al. 2009]
6
Current Approaches / Related Work (2)
• Metadata search
  – Text search of manually-added metadata
    • E.g., “Salinity, Columbia River”
  – Boolean search on time and location (rare)
    • Some advanced geoportals provide spatial tests:
      – E.g., dataset intersects or completely contains query area
• But:
  – Boolean search: no matches means no results (1)
  – Search results are not ranked (2)
7
Current Approaches / Related Work (3)
• In Information Retrieval:
  – Ranked retrieval of unstructured text documents
• But text retrieval techniques are not suited to searching the contents of scientific datasets
[Diagram: classic IR architecture. Asynchronous indexing: documents → parsing → feature extraction → indexes, with a document cache. Interactive query: user query interface → scoring and ranking over the indexes → ranked results]
8
Research Questions
How can we rank datasets?
Does the ranking approach resonate with users?
What features should we extract from scientific datasets …
… that would allow us to perform real-time search over the extracted features?
Spatial and temporal features selected for initial case study
9
Research Contributions
Proposed a mental model of how scientists perceive dataset similarity for space and time characteristics
Tested mental model in a user study
Developed hierarchical metadata to represent dataset contents
Extracting features at multiple granularities
Developed a prototype query engine with real-time response
10
Space-Time Ranking: Mental Model
• Example Query: “Observations within ½ km of point ‘P’, in June 2009”
• Each dataset A, B, … is represented by its time extent A(t), B(t), … and its geospatial extent A(g), B(g), …
• Relative “weight” of space to time is given by the “range” of each query term
[Diagram: the query's time window (June) shown on a January–November timeline with dataset time extents A(t) through F(t); the query's spatial range shown as a circle of radius r around point P, with dataset geospatial extents A(g) through K(g) at distances from 1.5 km to 4.5 km, labeled “Here”, “Quite close”, “Close”, “Not close”, “Far”, and “Too far”]
11
Scoring Datasets (1)
• Score each dataset using formulae that quantify the model
• Given a geospatial query G, calculate spatial-relevance score dGs for dataset d
• Spatial relevance is approximated by: – ½ (min distance + max distance) / radius
– Apply scoring function to the result
[Diagram: query G as a circle of radius r around point P; for dataset extents A(g), D(g), and K(g), the minimum and maximum distances from P determine the spatial-relevance score dGs]
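The ½ (min distance + max distance) / radius approximation above can be sketched as follows; the rectangular footprint, planar distance, and function name are illustrative assumptions, not DNH's actual implementation.

```python
import math

def spatial_relevance(query_pt, radius, bbox):
    """Approximate normalized spatial distance of a dataset footprint
    from a query point: 0.5 * (min distance + max distance) / radius.
    query_pt: (x, y); bbox: (xmin, ymin, xmax, ymax).
    Planar (Euclidean) distance is a simplifying assumption."""
    qx, qy = query_pt
    xmin, ymin, xmax, ymax = bbox
    # Minimum distance: 0 if the point lies inside the box.
    dx = max(xmin - qx, 0.0, qx - xmax)
    dy = max(ymin - qy, 0.0, qy - ymax)
    min_dist = math.hypot(dx, dy)
    # Maximum distance: the farthest corner of the box.
    max_dist = max(math.hypot(qx - cx, qy - cy)
                   for cx in (xmin, xmax) for cy in (ymin, ymax))
    return 0.5 * (min_dist + max_dist) / radius
```

The scoring function is then applied to this normalized distance, per the last bullet above.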
12
Scoring Datasets (2)
• Given a time query T, calculate a time-relevance score dTs for dataset d
• Calculated scores can range from 100 for an exact match to query terms to negative numbers for datasets “too far” from query
[Plot: scoring function S, score versus distance; score 100 at zero distance, declining through 50 with breakpoints around 5r, 10r, and 15r, and going negative for distant datasets. Example scores against query Q: A(t) = 100, B(t) = 95, F(t) = 75, D(t) = 25, E(t) = −25]
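A piecewise-linear scoring function matching the plot's general shape might look like the sketch below; the breakpoints (5, 10, 15 query radii), the intermediate values, and the −50 floor are assumptions read off the slide, not the paper's exact constants.

```python
def score(dist):
    """Piecewise-linear scoring function S (a sketch): takes a
    distance already normalized by the query radius r and returns
    100 at distance 0, declining to 0 and then negative for
    datasets "too far" from the query."""
    # Assumed (distance-in-radii, score) control points.
    pts = [(0.0, 100.0), (5.0, 50.0), (10.0, 0.0), (15.0, -50.0)]
    if dist <= 0.0:
        return 100.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if dist <= x1:
            # Linear interpolation within this segment.
            return y0 + (y1 - y0) * (dist - x0) / (x1 - x0)
    return -50.0  # clamp beyond 15 radii
```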
13
Ranking Datasets
• Overall relevance score dscore for each dataset d is composed using the geospatial and temporal scores:
• Datasets are then ranked by decreasing relevance score.
dscore = (dGs + dTs) / 2
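Combining the two scores and ranking takes only a few lines; the input format here (a dict mapping dataset name to its geospatial and temporal scores) is illustrative.

```python
def rank_datasets(scored):
    """Combine per-dataset geospatial and temporal relevance scores
    into an overall score, dscore = (dGs + dTs) / 2, and rank by
    decreasing relevance. `scored` maps name -> (dGs, dTs)."""
    overall = {name: (gs + ts) / 2 for name, (gs, ts) in scored.items()}
    return sorted(overall.items(), key=lambda kv: kv[1], reverse=True)
```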
14
Ranking
• Tested relevance ranking with a user study:
  – Proposed relevance measure appears to approximate user expectations
  – Relevance-measure “tuning” may further improve the match with user expectations
    • “Closest edge” has more weight than “centroid” or “farthest edge”
• Scoring/ranking approach assumes appropriate indexes over which to operate
  – Query terms should relate to indexed features
  – Features represent metadata used to describe dataset content
15
Creating Metadata: Extracting Features for Space and Time
DNH Metadata Table:

Name | Geometry | Mintime | Maxtime | Parent
May 2009, Point Sur | Polygon [bounding box] | 5/19/2009 | 6/10/2009 | <null>
May 2009, Point Sur, 2009-05-19 | Polyline(p1, p2, p3, p4) | 5/19/2009 00:00 | 5/19/2009 23:59 | May 2009, Point Sur
May 2009, Point Sur, 2009-05-19, Segment 1 | Line(p1, p2) | 5/19/2009 00:00 | 5/19/2009 06:14 | May 2009, Point Sur, 2009-05-19
May 2009, Point Sur, 2009-05-19, Segment 2 | Line(p2, p3) | 5/19/2009 06:15 | 5/19/2009 14:23 | May 2009, Point Sur, 2009-05-19
May 2009, Point Sur, 2009-05-19, Segment 3 | Line(p3, p4) | 5/19/2009 14:24 | 5/19/2009 15:01 | May 2009, Point Sur, 2009-05-19
…
• Transform observations into features
  – Extract at multiple granularities
  – Model features as “footprints”
  – E.g.: 1 million observations over 3 weeks
[Diagram: original cruise observations (May … June), summarized at three granularities: a derived bounding box, a derived line per day, and derived individual line segments]
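A minimal sketch of multi-granularity footprint extraction, assuming observations arrive as (time, x, y) tuples; the two levels here (a whole-set bounding box plus per-day boxes) stand in for the box / line-per-day / segment levels above, and the record field names only mirror, not reproduce, the DNH schema.

```python
from collections import defaultdict
from datetime import datetime

def extract_footprints(observations):
    """Summarize timestamped (time, x, y) observations into two
    granularities of footprint records: one bounding box per day,
    plus a parent bounding box over the whole set."""
    def record(name, obs, parent):
        xs = [x for _, x, _ in obs]
        ys = [y for _, _, y in obs]
        ts = [t for t, _, _ in obs]
        return {"name": name, "parent": parent,
                "geometry": (min(xs), min(ys), max(xs), max(ys)),
                "mintime": min(ts), "maxtime": max(ts)}
    parent = record("cruise", observations, None)
    by_day = defaultdict(list)
    for t, x, y in observations:
        by_day[t.date()].append((t, x, y))
    children = [record(f"cruise, {day}", obs, "cruise")
                for day, obs in sorted(by_day.items())]
    return parent, children
```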
16
Metadata: Adaptive Hierarchy
[Diagram: adaptive metadata hierarchy for one fixed station. The parent (lifetime) metadata record spans 2007-10-30 … 2010-08-12. A second level of metadata records covers sub-ranges such as 2008-02-19 – 2008-08-20 and 2009-05-17 – 2009-11-21. The bottom level of the hierarchy is the data files themselves (directly downloadable), e.g. 2007, 2007-11, 2008-02, 2008-05, 2008-08, 2009-05 (part), 2009-06 through 2009-11, 2010-07, and 2010-08 (part)]
• Multiple depths of hierarchy are accommodated simultaneously
• Curation decision(s) made once per kind of data/dataset
Fixed Stations:
• 1 location
• Time: months-decades
• Observations: millions
• Download format: NetCDF
• Hierarchy: 3 levels

Water samples:
• 1-3 observations per location
• Time: minutes
• Download format: CSV
• Hierarchy: 1 level
WS 942-945: 2010-05-25 13:58-14:01
WS 946: 2010-05-25 14:53
WS 947-948: 2010-05-25 15:19-15:22
WS 949-950: 2010-05-26 08:14-08:20
Parent metadata record(s): 1 per water-sample location
17
Scoring using Hierarchical Metadata
• Hierarchical metadata allows fast access to data at multiple scales or granularities
[Diagram: the fixed-station hierarchy from the previous slide, scored against the user query 2009-06-01 – 2009-07-31. The second-level record covering 2009-05-17 – 2009-11-21 scores high (88) while the 2008-02-19 – 2008-08-20 record scores low (22); bottom-level months inside the query window (2009-06, 2009-07) score 100, nearby months score between 66 and 99, and distant months score negative (e.g. −85, −53, −25)]
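One way such a hierarchy speeds search is a score-then-descend traversal: score a metadata record against the query, and recurse into its children only when the record is promising. The pruning threshold below is an assumed strategy for illustration, not the published DNH algorithm.

```python
def search_hierarchy(record, score_fn, threshold=0.0, results=None):
    """Score a metadata record; if it clears the threshold, recurse
    into its children, otherwise prune the whole subtree. Records
    are dicts with an optional "children" list; leaves name
    directly downloadable data files."""
    if results is None:
        results = []
    s = score_fn(record)
    if s < threshold:
        return results  # prune: skip every descendant
    children = record.get("children", [])
    if not children:
        results.append((record["name"], s))  # leaf: a data file
    else:
        for child in children:
            search_hierarchy(child, score_fn, threshold, results)
    return results
```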
18
System Components
[Diagram: components of Data Near Here. New components: Metadata Creation (task: QA, data curation) populates the Metadata Repository from the Observation Repository; Scoring & Ranking and a Google Maps-based User Interface serve the search task. Existing components: Sensor Observation Processing feeds the Observation Repository, and Analysis Programs perform the analysis task]
19
The Prototype: “Data Near Here”
 Extracted metadata for ¼ billion observations into 15,500 metadata records
 Developed an interactive user interface (demo):
    Accepts spatial and temporal query terms
    Ranks datasets by decreasing score
    Provides real-time response
20
Conclusion
Our research demonstrates methods for:
Ranking scientific datasets in response to a spatio-temporal query
Automatically extracting hierarchical metadata from scientific datasets …
… and searching over the extracted features
Providing real-time response times for queries over ¼ billion observations in a multi-terabyte data repository
21
Current Research
Evaluation of metadata scalability
 Add elevation / depth: 4-dimensional search (2+1+1 versus 3+1)
 Add additional search criteria: observational variables … “with oxygen below 3 mg/liter, where Myrionecta rubra are present”
Backup Material
23
References
1. Geospatial One Stop (GOS), http://gos2.geodata.gov/wps/portal/gos.
2. Global Change Master Directory Web Site, http://gcmd.nasa.gov/.
3. The Google Maps Javascript API V3, http://code.google.com/apis/maps/documentation/javascript/.
4. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. ACM Press, New York (1999).
5. Douglas, D.H., Peucker, T.K.: Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica. 10, 2, 112–122 (1973).
6. Egenhofer, M.J.: Toward the semantic geospatial web. Proceedings of the 10th ACM International Symposium on Advances in Geographic Information Systems. pp. 1–4 (2002).
7. Evans, M.P.: Analysing Google rankings through search engine optimization data. Internet Research. 17, 1, 21–37 (2007).
8. Goodchild, M.F., Zhou, J.: Finding geographic information: Collection-level metadata. GeoInformatica. 7, 2, 95–112 (2003).
9. Goodchild, M.F.: The Alexandria Digital Library Project: Review, Assessment, and Prospects, http://www.dlib.org/dlib/may04/goodchild/05goodchild.html, (2004).
10. Goodchild, M.F. et al.: Sharing Geographic Information: An Assessment of the Geospatial One-Stop. Annals of the AAG. 97, 2, 250–266 (2007).
11. Grossner, K.E. et al.: Defining a digital earth system. Transactions in GIS. 12, 1, 145–160 (2008).
12. Herring, J.R. ed: OpenGIS® Implementation Standard for Geographic information - Simple feature access - Part 1: Common architecture, (2010).
13. Hey, T., Trefethen, A.: e-Science and its implications. Philosophical Transactions of the Royal Society of London, Series A: Mathematical, Physical and Engineering Sciences. 361, 1809 (2003).
14. Hey, T., Trefethen, A.E.: The Data Deluge: An e-Science Perspective. In: Grid Computing: Making the Global Infrastructure a Reality (eds F. Berman, G. Fox and T. Hey). pp. 809–824. John Wiley & Sons, Ltd, Chichester, UK (2003).
15. Hill, L.L. et al.: Collection metadata solutions for digital library applications. J. of the American Soc. for Information Science. 50, 13, 1169–1181 (1999).
16. Howe, B. et al.: Scientific Mashups: Runtime-Configurable Data Product Ensembles. Scientific and Statistical Database Management. pp. 19–36 (2009).
17. Kobayashi, M., Takeda, K.: Information retrieval on the web. ACM Comput. Surv. 32, 144–173 (2000).
18. Lewandowski, D.: Web searching, search engines and Information Retrieval. Information Services and Use. 25, 3, 137–147 (2005).
19. Lord, P., Macdonald, A.: e-Science Curation Report, http://www.jisc.ac.uk/uploaded_documents/e-ScienceReportFinal.pdf, (2003).
20. Manning, C.D. et al.: An Introduction to Information Retrieval. Cambridge University Press (2008).
21. Maron, M.E., Kuhns, J.L.: On relevance, probabilistic indexing and information retrieval. Journal of the ACM. 7, 3, 216–244 (1960).
22. Miller, C.C.: A Beast in the Field: The Google Maps mashup as GIS/2. Cartographica. 41, 3, 187–199 (2006).
23. Miller, H.J., Wentz, E.A.: Representation and Spatial Analysis in Geographic Information Systems. Annals of the AAG. 93, 3, 574–594 (2003).
24. Montello, D.: The geometry of environmental knowledge. Theories and Methods of Spatio-Temporal Reasoning in Geographic Space. 136–152 (1992).
25. Perlman, E. et al.: Data Exploration of Turbulence Simulations Using a Database Cluster. Proceedings of the 2007 ACM/IEEE Conference on Supercomputing. pp. 1–11 (2007).
26. Sharifzadeh, M., Shahabi, C.: The spatial skyline queries. Proc. of VLDB. p. 762 (2006).
27. Stolte, E., Alonso, G.: Efficient exploration of large scientific databases. Proc. of VLDB. p. 633 (2002).
24
Example Sensor Types and Associated Data Characteristics
Water Samples
  Time: “point in time”
  Location: x,y,z point
  Quantity: hundreds
  Observations per: 1

Cruises
  Time: weeks
  Location: hundreds of miles
  Quantity: ~4 per year
  Observations per: millions

Gliders, Autonomous Unmanned Vehicles
  Time: hours / days
  Location: miles; x,y,z
  Quantity: 10s per year
  Observations per: million

Fixed Stations
  Time: decades
  Location: fixed x,y; variable z
  Quantity: 20
  Observations per: millions
25
User Study
• Two populations, each of size 20:
  – “Scientists”: CMOP oceanographers and microbiologists
  – “Non-scientists”: primarily Information Technology professionals and graduate students
• Format: paper questionnaire, pair-wise comparisons to query terms
  – 3 choices: A is closer; B is closer; same
• Categories of comparisons:
  – Time
  – Space: points, lines and polygons; comparisons across shapes
  – Time and space combined
• Example cross-shape comparisons:
[Diagram: four example pairs; in each, a query location X is compared against two candidate footprints A and B of different shapes and distances]
26
User Study: Sample Finding
• Finding: Ordinal responses are independent of:
  – Type of question (time only, space only, time and space combined)
  – Shape (point, line, polyline, polygon)
[Scatter plot: % agreement with relevance measure (“very close” cases removed*) on the x-axis versus % population self-agreement on the y-axis, both 0%–100%, plotted separately for Non-Scientists and Scientists]
* “very close”: < 0.2 radius difference in distance
27
Creating Metadata: Time
• Metadata derived via extraction / summarization of source observations
• Example: Station MAB1
  – A single physical location (point)
  – Observations batched into files
    • 1 dataset per month
    • 1 dataset per year
• Example searches:
  – 2006
  – March – August, 2009
  – February, 2009

DNH Metadata Table:

Name | Geometry | Mintime | Maxtime | Parent
Station MAB1 | Point(x,y) | 1/1/2007 | 11/30/2010 | <null>
Station MAB1, 2010 | Point(x,y) | 1/1/2010 | 11/30/2010 | Station MAB1
Station MAB1, 2009 | Point(x,y) | 1/1/2009 | 12/31/2009 | Station MAB1
…
Station MAB1, 2009, January | Point(x,y) | 1/1/2009 | 1/31/2009 | Station MAB1, 2009
Station MAB1, 2009, February | Point(x,y) | 2/1/2009 | 2/28/2009 | Station MAB1, 2009
…
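Generating the time-only hierarchy for a fixed station can be sketched as below, assuming the station is a single point with a known lifetime; the record fields mirror the metadata table above, but the schema and function name are illustrative.

```python
import calendar
from datetime import date

def station_metadata(name, point, first, last):
    """Derive time-hierarchy metadata records for a fixed station:
    a lifetime parent record, one record per year, and one per
    month, each clipped to the station's lifetime [first, last]."""
    records = [{"name": name, "geometry": point, "mintime": first,
                "maxtime": last, "parent": None}]
    for year in range(first.year, last.year + 1):
        records.append({"name": f"{name}, {year}", "geometry": point,
                        "mintime": max(first, date(year, 1, 1)),
                        "maxtime": min(last, date(year, 12, 31)),
                        "parent": name})
        for month in range(1, 13):
            m0 = date(year, month, 1)
            m1 = date(year, month, calendar.monthrange(year, month)[1])
            if m1 < first or m0 > last:
                continue  # month lies outside the station's lifetime
            records.append({"name": f"{name}, {year}, {month:02d}",
                            "geometry": point,
                            "mintime": max(first, m0),
                            "maxtime": min(last, m1),
                            "parent": f"{name}, {year}"})
    return records
```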
28
Scoring Formula: Time
• Given time query Q with extent [Qtmin, Qtmax], midpoint Qtp, and radius RT = (Qtmax − Qtmin) / 2, calculate time-relevance score dTs for dataset d with extent [dtmin, dtmax]
[Diagram: cases (1), (2a), (2b), (3), and (4) show the dataset extent [dtmin, dtmax] positioned inside, overlapping, or entirely outside the query extent]
• Following the “½ (min distance + max distance) / radius” approximation:
  Tmindist = 0 if the extents overlap; otherwise max(Qtmin − dtmax, dtmin − Qtmax)
  Tmaxdist = max(|dtmax − Qtmin|, |Qtmax − dtmin|)
  Tdist = (Tmindist + Tmaxdist) / (2 RT)
  dTs = S(Tdist)
PSU and IBM Confidential
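The temporal distance computation can be sketched directly, assuming extents are given as numbers (e.g. timestamps in seconds); dTs is then obtained by applying the scoring function S to the result.

```python
def time_distance(q_min, q_max, d_min, d_max):
    """Normalized temporal distance of dataset extent [d_min, d_max]
    from query extent [q_min, q_max], following the
    0.5 * (min distance + max distance) / radius approximation."""
    radius = (q_max - q_min) / 2.0
    # Minimum distance: 0 when the intervals overlap,
    # else the gap between them.
    min_dist = max(q_min - d_max, d_min - q_max, 0.0)
    # Maximum distance: the farther of the two endpoint spans.
    max_dist = max(abs(d_max - q_min), abs(q_max - d_min))
    return 0.5 * (min_dist + max_dist) / radius
```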
29
Scoring Formula: Geospace
• Given geospatial query G with center GC and radius GR, calculate geospatial-relevance score dGs for dataset d:
[Diagram: cases (1), (2), and (3) show dataset extents at increasing distances from the query circle, with minimum and maximum distances dgmin and dgmax measured from GC]
  Gdist = (dgmin + dgmax) / (2 GR)
  dGs = S(Gdist)
• Overall relevance score dscore for dataset d:
  dscore = (dGs + dTs) / 2