Finding Haystacks with Needles: Ranked Search for Data Using Geospatial and
Temporal Characteristics
V.M. Megler and David Maier
Portland State University
Acknowledgements: This work is supported by NSF award OCE-0424602. We thank the staff of the Center for Coastal Margin Observation and Prediction for their support. We also thank the students and professionals who were willing to take part in the user study.
2
Haystacks
• Many environmental sensors deployed in the last decade
• Each sensor collects environmental observations
– Sometimes many per second
• Each observation has:
  – a time
  – a location
  – observed variables
• Observational data stored in many formats, many datasets
3
Needles
• Scientists at CMOP name “finding data relevant to their research” as one of their biggest problems²
• Example query: – “Any observations near the Astoria bridge in June 2009”
2. Center for Coastal Margin Observation and Prediction RIG Meeting, July 15 2010
[Diagram: original observations and their bounding box; the “needle” lies within the May … June window]
4
Problem: Finding Haystacks that Contain Needles
• Problem: Which datasets contain relevant data?
  – Many scientific datasets have no metadata
  – Many scientific datasets are not indexed
• Potential solution: extract simple dataset bounds, perform Boolean search
  – But: many false positives
Our Approach:
1. Create hierarchical metadata to represent dataset contents
2. Query over metadata
3. Rank query results
5
Current Approaches / Related Work (1)
• Search via data visualization
  – Given a specific dataset and data ranges, display the (large amount of) data
– Most common approach so far
• But: How does the scientist identify relevant datasets and ranges for visualization?
Example of visualization approach [Howe et al. 2009]
6
Current Approaches / Related Work (2)
• Metadata search
  – Text search of manually-added metadata
    • E.g., “Salinity, Columbia River”
  – Boolean search on time and location (rare)
    • Some advanced geoportals provide spatial tests:
      – E.g., dataset intersects or completely contains query area
• But:
  – Boolean search: no matches means no results (1)
  – Search results are not ranked (2)
7
Current Approaches / Related Work (3)
• In Information Retrieval:
  – Ranked retrieval of unstructured text documents
• But text retrieval techniques are not suited to searching the contents of scientific datasets
[Diagram: classic IR architecture. Asynchronous indexing: documents → parsing → feature extraction → indexes, with a document cache. Interactive query: user query interface → scoring and ranking over the indexes → ranked results]
8
Research Questions
How can we rank datasets?
Does the ranking approach resonate with users?
What features should we extract from scientific datasets …
… that would allow us to perform real-time search over the extracted features?
Spatial and temporal features selected for initial case study
9
Research Contributions
Proposed a mental model of how scientists perceive dataset similarity for space and time characteristics
Tested mental model in a user study
Developed hierarchical metadata to represent dataset contents
Extracting features at multiple granularities
Developed a prototype query engine with real-time response
10
Space-Time Ranking: Mental Model
• Example Query: “Observations within ½ km of point ‘P’, in June 2009”
• Each dataset A, B, … is represented by its time extent A(t), B(t), … and its geospatial extent A(g), B(g), …
• Relative “weight” of space to time is given by the “range” of each query term
[Diagram: the query's time window (June) shown on a January–November timeline with dataset time extents A(t) through F(t); the query's spatial range shown as a circle of radius r around point P, with dataset geospatial extents A(g) through K(g) at distances from 1.5 km to 4.5 km, labeled “Here”, “Quite close”, “Close”, “Not close”, “Far”, and “Too far”]
11
Scoring Datasets (1)
• Score each dataset using formulae that quantify the model
• Given a geospatial query G, calculate spatial-relevance score dGs for dataset d
• Spatial relevance is approximated by: – ½ (min distance + max distance) / radius
– Apply scoring function to the result
[Diagram: query G as a circle of radius r around point P; for dataset extents A(g), D(g), and K(g), the minimum and maximum distances from P determine the spatial-relevance score dGs]
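The ½ (min distance + max distance) / radius approximation above can be sketched as follows; the rectangular footprint, planar distance, and function name are illustrative assumptions, not DNH's actual implementation.

```python
import math

def spatial_relevance(query_pt, radius, bbox):
    """Approximate normalized spatial distance of a dataset footprint
    from a query point: 0.5 * (min distance + max distance) / radius.
    query_pt: (x, y); bbox: (xmin, ymin, xmax, ymax).
    Planar (Euclidean) distance is a simplifying assumption."""
    qx, qy = query_pt
    xmin, ymin, xmax, ymax = bbox
    # Minimum distance: 0 if the point lies inside the box.
    dx = max(xmin - qx, 0.0, qx - xmax)
    dy = max(ymin - qy, 0.0, qy - ymax)
    min_dist = math.hypot(dx, dy)
    # Maximum distance: the farthest corner of the box.
    max_dist = max(math.hypot(qx - cx, qy - cy)
                   for cx in (xmin, xmax) for cy in (ymin, ymax))
    return 0.5 * (min_dist + max_dist) / radius
```

The scoring function is then applied to this normalized distance, per the last bullet above.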
12
Scoring Datasets (2)
• Given a time query T, calculate a time-relevance score dTs for dataset d
• Calculated scores can range from 100 for an exact match to query terms to negative numbers for datasets “too far” from query
[Plot: scoring function S, score versus distance; score 100 at zero distance, declining through 50 with breakpoints around 5r, 10r, and 15r, and going negative for distant datasets. Example scores against query Q: A(t) = 100, B(t) = 95, F(t) = 75, D(t) = 25, E(t) = −25]
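A piecewise-linear scoring function matching the plot's general shape might look like the sketch below; the breakpoints (5, 10, 15 query radii), the intermediate values, and the −50 floor are assumptions read off the slide, not the paper's exact constants.

```python
def score(dist):
    """Piecewise-linear scoring function S (a sketch): takes a
    distance already normalized by the query radius r and returns
    100 at distance 0, declining to 0 and then negative for
    datasets "too far" from the query."""
    # Assumed (distance-in-radii, score) control points.
    pts = [(0.0, 100.0), (5.0, 50.0), (10.0, 0.0), (15.0, -50.0)]
    if dist <= 0.0:
        return 100.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if dist <= x1:
            # Linear interpolation within this segment.
            return y0 + (y1 - y0) * (dist - x0) / (x1 - x0)
    return -50.0  # clamp beyond 15 radii
```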
13
Ranking Datasets
• Overall relevance score dscore for each dataset d is composed using the geospatial and temporal scores:
• Datasets are then ranked by decreasing relevance score.
dscore = (dGs + dTs) / 2
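Combining the two scores and ranking takes only a few lines; the input format here (a dict mapping dataset name to its geospatial and temporal scores) is illustrative.

```python
def rank_datasets(scored):
    """Combine per-dataset geospatial and temporal relevance scores
    into an overall score, dscore = (dGs + dTs) / 2, and rank by
    decreasing relevance. `scored` maps name -> (dGs, dTs)."""
    overall = {name: (gs + ts) / 2 for name, (gs, ts) in scored.items()}
    return sorted(overall.items(), key=lambda kv: kv[1], reverse=True)
```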
14
Ranking
• Tested relevance ranking with a user study:
  – Proposed relevance measure appears to approximate user expectations
  – Relevance-measure “tuning” may further improve the match with user expectations
    • “Closest edge” has more weight than “centroid” or “farthest edge”
• Scoring/ranking approach assumes appropriate indexes over which to operate
  – Query terms should relate to indexed features
  – Features represent metadata used to describe dataset content
15
Creating Metadata: Extracting Features for Space and Time
DNH Metadata Table:

Name | Geometry | Mintime | Maxtime | Parent
May 2009, Point Sur | Polygon [bounding box] | 5/19/2009 | 6/10/2009 | <null>
May 2009, Point Sur, 2009-05-19 | Polyline(p1, p2, p3, p4) | 5/19/2009 00:00 | 5/19/2009 23:59 | May 2009, Point Sur
May 2009, Point Sur, 2009-05-19, Segment 1 | Line(p1, p2) | 5/19/2009 00:00 | 5/19/2009 06:14 | May 2009, Point Sur, 2009-05-19
May 2009, Point Sur, 2009-05-19, Segment 2 | Line(p2, p3) | 5/19/2009 06:15 | 5/19/2009 14:23 | May 2009, Point Sur, 2009-05-19
May 2009, Point Sur, 2009-05-19, Segment 3 | Line(p3, p4) | 5/19/2009 14:24 | 5/19/2009 15:01 | May 2009, Point Sur, 2009-05-19
…
• Transform observations into features
  – Extract at multiple granularities
  – Model features as “footprints”
  – E.g.: 1 million observations over 3 weeks
[Diagram: original cruise observations (May … June), summarized at three granularities: a derived bounding box, a derived line per day, and derived individual line segments]
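A minimal sketch of multi-granularity footprint extraction, assuming observations arrive as (time, x, y) tuples; the two levels here (a whole-set bounding box plus per-day boxes) stand in for the box / line-per-day / segment levels above, and the record field names only mirror, not reproduce, the DNH schema.

```python
from collections import defaultdict
from datetime import datetime

def extract_footprints(observations):
    """Summarize timestamped (time, x, y) observations into two
    granularities of footprint records: one bounding box per day,
    plus a parent bounding box over the whole set."""
    def record(name, obs, parent):
        xs = [x for _, x, _ in obs]
        ys = [y for _, _, y in obs]
        ts = [t for t, _, _ in obs]
        return {"name": name, "parent": parent,
                "geometry": (min(xs), min(ys), max(xs), max(ys)),
                "mintime": min(ts), "maxtime": max(ts)}
    parent = record("cruise", observations, None)
    by_day = defaultdict(list)
    for t, x, y in observations:
        by_day[t.date()].append((t, x, y))
    children = [record(f"cruise, {day}", obs, "cruise")
                for day, obs in sorted(by_day.items())]
    return parent, children
```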
16
Metadata: Adaptive Hierarchy
[Diagram: adaptive metadata hierarchy for one fixed station. The parent (lifetime) metadata record spans 2007-10-30 … 2010-08-12. A second level of metadata records covers sub-ranges such as 2008-02-19 – 2008-08-20 and 2009-05-17 – 2009-11-21. The bottom level of the hierarchy is the data files themselves (directly downloadable), e.g. 2007, 2007-11, 2008-02, 2008-05, 2008-08, 2009-05 (part), 2009-06 through 2009-11, 2010-07, and 2010-08 (part)]
• Multiple depths of hierarchy are accommodated simultaneously
• Curation decision(s) made once per kind of data/dataset
Fixed Stations:
• 1 location
• Time: months-decades
• Observations: millions
• Download format: NetCDF
• Hierarchy: 3 levels

Water samples:
• 1-3 observations per location
• Time: minutes
• Download format: CSV
• Hierarchy: 1 level
WS 942-945: 2010-05-25 13:58-14:01
WS 946: 2010-05-25 14:53
WS 947-948: 2010-05-25 15:19-15:22
WS 949-950: 2010-05-26 08:14-08:20
Parent metadata record(s): 1 per water-sample location
17
Scoring using Hierarchical Metadata
• Hierarchical metadata allows fast access to data at multiple scales or granularities
[Diagram: the fixed-station hierarchy from the previous slide, scored against the user query 2009-06-01 – 2009-07-31. The second-level record covering 2009-05-17 – 2009-11-21 scores high (88) while the 2008-02-19 – 2008-08-20 record scores low (22); bottom-level months inside the query window (2009-06, 2009-07) score 100, nearby months score between 66 and 99, and distant months score negative (e.g. −85, −53, −25)]
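One way such a hierarchy speeds search is a score-then-descend traversal: score a metadata record against the query, and recurse into its children only when the record is promising. The pruning threshold below is an assumed strategy for illustration, not the published DNH algorithm.

```python
def search_hierarchy(record, score_fn, threshold=0.0, results=None):
    """Score a metadata record; if it clears the threshold, recurse
    into its children, otherwise prune the whole subtree. Records
    are dicts with an optional "children" list; leaves name
    directly downloadable data files."""
    if results is None:
        results = []
    s = score_fn(record)
    if s < threshold:
        return results  # prune: skip every descendant
    children = record.get("children", [])
    if not children:
        results.append((record["name"], s))  # leaf: a data file
    else:
        for child in children:
            search_hierarchy(child, score_fn, threshold, results)
    return results
```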
18
System Components
[Diagram: components of Data Near Here. New components: Metadata Creation (task: QA, data curation) populates the Metadata Repository from the Observation Repository; Scoring & Ranking and a Google Maps-based User Interface serve the search task. Existing components: Sensor Observation Processing feeds the Observation Repository, and Analysis Programs perform the analysis task]
19
The Prototype: “Data Near Here”
 Extracted metadata for ¼ billion observations into 15,500 metadata records
 Developed an interactive user interface (demo):
    Accepts spatial and temporal query terms
    Ranks datasets by decreasing score
    Provides real-time response
20
Conclusion
Our research demonstrates methods for:
Ranking scientific datasets in response to a spatio-temporal query
Automatically extracting hierarchical metadata from scientific datasets …
… and searching over the extracted features
Providing real-time response times for queries over ¼ billion observations in a multi-terabyte data repository
21
Current Research
Evaluation of metadata scalability
 Add elevation / depth: 4-dimensional search (2+1+1 versus 3+1)
 Add additional search criteria: observational variables … “with oxygen below 3 mg/liter, where Myrionecta rubra are present”
Backup Material
23
References
1. Geospatial One Stop (GOS), http://gos2.geodata.gov/wps/portal/gos.
2. Global Change Master Directory Web Site, http://gcmd.nasa.gov/.
3. The Google Maps Javascript API V3, http://code.google.com/apis/maps/documentation/javascript/.
4. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. ACM Press, New York (1999).
5. Douglas, D.H., Peucker, T.K.: Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica. 10, 2, 112–122 (1973).
6. Egenhofer, M.J.: Toward the semantic geospatial web. Proceedings of the 10th ACM International Symposium on Advances in Geographic Information Systems. pp. 1–4 (2002).
7. Evans, M.P.: Analysing Google rankings through search engine optimization data. Internet Research. 17, 1, 21–37 (2007).
8. Goodchild, M.F., Zhou, J.: Finding geographic information: Collection-level metadata. GeoInformatica. 7, 2, 95–112 (2003).
9. Goodchild, M.F.: The Alexandria Digital Library Project: Review, Assessment, and Prospects, http://www.dlib.org/dlib/may04/goodchild/05goodchild.html, (2004).
10. Goodchild, M.F. et al.: Sharing Geographic Information: An Assessment of the Geospatial One-Stop. Annals of the AAG. 97, 2, 250–266 (2007).
11. Grossner, K.E. et al.: Defining a digital earth system. Transactions in GIS. 12, 1, 145–160 (2008).
12. Herring, J.R. ed: OpenGIS® Implementation Standard for Geographic information - Simple feature access - Part 1: Common architecture, (2010).
13. Hey, T., Trefethen, A.: e-Science and its implications. Philosophical Transactions of the Royal Society of London, Series A: Mathematical, Physical and Engineering Sciences. 361, 1809 (2003).
14. Hey, T., Trefethen, A.E.: The Data Deluge: An e-Science Perspective. In: Grid Computing: Making the Global Infrastructure a Reality (eds F. Berman, G. Fox and T. Hey). pp. 809–824. John Wiley & Sons, Ltd, Chichester, UK (2003).
15. Hill, L.L. et al.: Collection metadata solutions for digital library applications. J. of the American Soc. for Information Science. 50, 13, 1169–1181 (1999).
16. Howe, B. et al.: Scientific Mashups: Runtime-Configurable Data Product Ensembles. Scientific and Statistical Database Management. pp. 19–36 (2009).
17. Kobayashi, M., Takeda, K.: Information retrieval on the web. ACM Comput. Surv. 32, 144–173 (2000).
18. Lewandowski, D.: Web searching, search engines and Information Retrieval. Information Services and Use. 25, 3, 137–147 (2005).
19. Lord, P., Macdonald, A.: e-Science Curation Report, http://www.jisc.ac.uk/uploaded_documents/e-ScienceReportFinal.pdf, (2003).
20. Manning, C.D. et al.: An Introduction to Information Retrieval. Cambridge University Press (2008).
21. Maron, M.E., Kuhns, J.L.: On relevance, probabilistic indexing and information retrieval. Journal of the ACM. 7, 3, 216–244 (1960).
22. Miller, C.C.: A Beast in the Field: The Google Maps mashup as GIS/2. Cartographica. 41, 3, 187–199 (2006).
23. Miller, H.J., Wentz, E.A.: Representation and Spatial Analysis in Geographic Information Systems. Annals of the AAG. 93, 3, 574–594 (2003).
24. Montello, D.: The geometry of environmental knowledge. Theories and Methods of Spatio-Temporal Reasoning in Geographic Space. 136–152 (1992).
25. Perlman, E. et al.: Data Exploration of Turbulence Simulations Using a Database Cluster. Proceedings of the 2007 ACM/IEEE Conference on Supercomputing. pp. 1–11 (2007).
26. Sharifzadeh, M., Shahabi, C.: The spatial skyline queries. Proc. of VLDB. p. 762 (2006).
27. Stolte, E., Alonso, G.: Efficient exploration of large scientific databases. Proc. of VLDB. p. 633 (2002).
24
Example Sensor Types and Associated Data Characteristics
Water Samples
  Time: “point in time”
  Location: x,y,z point
  Quantity: hundreds
  Observations per: 1

Cruises
  Time: weeks
  Location: hundreds of miles
  Quantity: ~4 per year
  Observations per: millions

Gliders, Autonomous Unmanned Vehicles
  Time: hours / days
  Location: miles; x,y,z
  Quantity: 10s per year
  Observations per: million

Fixed Stations
  Time: decades
  Location: fixed x,y; variable z
  Quantity: 20
  Observations per: millions
25
User Study
• Two populations, each of size 20:
  – “Scientists”: CMOP oceanographers and microbiologists
  – “Non-scientists”: primarily Information Technology professionals and graduate students
• Format: paper questionnaire, pair-wise comparisons to query terms
  – 3 choices: A is closer; B is closer; same
• Categories of comparisons:
  – Time
  – Space: points, lines and polygons; comparisons across shapes
  – Time and space combined
• Example cross-shape comparisons:
[Diagram: four example pairs; in each, a query location X is compared against two candidate footprints A and B of different shapes and distances]
26
User Study: Sample Finding
• Finding: Ordinal responses are independent of:
  – Type of question (time only, space only, time and space combined)
  – Shape (point, line, polyline, polygon)
[Scatter plot: % agreement with relevance measure (“very close” cases removed*) on the x-axis versus % population self-agreement on the y-axis, both 0%–100%, plotted separately for Non-Scientists and Scientists]
* “very close”: < 0.2 radius difference in distance
27
Creating Metadata: Time
• Metadata derived via extraction / summarization of source observations
• Example: Station MAB1
  – A single physical location (point)
  – Observations batched into files
    • 1 dataset per month
    • 1 dataset per year
• Example searches:
  – 2006
  – March – August, 2009
  – February, 2009

DNH Metadata Table:

Name | Geometry | Mintime | Maxtime | Parent
Station MAB1 | Point(x,y) | 1/1/2007 | 11/30/2010 | <null>
Station MAB1, 2010 | Point(x,y) | 1/1/2010 | 11/30/2010 | Station MAB1
Station MAB1, 2009 | Point(x,y) | 1/1/2009 | 12/31/2009 | Station MAB1
…
Station MAB1, 2009, January | Point(x,y) | 1/1/2009 | 1/31/2009 | Station MAB1, 2009
Station MAB1, 2009, February | Point(x,y) | 2/1/2009 | 2/28/2009 | Station MAB1, 2009
…
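Generating the time-only hierarchy for a fixed station can be sketched as below, assuming the station is a single point with a known lifetime; the record fields mirror the metadata table above, but the schema and function name are illustrative.

```python
import calendar
from datetime import date

def station_metadata(name, point, first, last):
    """Derive time-hierarchy metadata records for a fixed station:
    a lifetime parent record, one record per year, and one per
    month, each clipped to the station's lifetime [first, last]."""
    records = [{"name": name, "geometry": point, "mintime": first,
                "maxtime": last, "parent": None}]
    for year in range(first.year, last.year + 1):
        records.append({"name": f"{name}, {year}", "geometry": point,
                        "mintime": max(first, date(year, 1, 1)),
                        "maxtime": min(last, date(year, 12, 31)),
                        "parent": name})
        for month in range(1, 13):
            m0 = date(year, month, 1)
            m1 = date(year, month, calendar.monthrange(year, month)[1])
            if m1 < first or m0 > last:
                continue  # month lies outside the station's lifetime
            records.append({"name": f"{name}, {year}, {month:02d}",
                            "geometry": point,
                            "mintime": max(first, m0),
                            "maxtime": min(last, m1),
                            "parent": f"{name}, {year}"})
    return records
```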
28
Scoring Formula: Time
• Given time query Q with extent [Qtmin, Qtmax], midpoint Qtp, and radius RT = (Qtmax − Qtmin) / 2, calculate time-relevance score dTs for dataset d with extent [dtmin, dtmax]
[Diagram: cases (1), (2a), (2b), (3), and (4) show the dataset extent [dtmin, dtmax] positioned inside, overlapping, or entirely outside the query extent]
• Following the “½ (min distance + max distance) / radius” approximation:
  Tmindist = 0 if the extents overlap; otherwise max(Qtmin − dtmax, dtmin − Qtmax)
  Tmaxdist = max(|dtmax − Qtmin|, |Qtmax − dtmin|)
  Tdist = (Tmindist + Tmaxdist) / (2 RT)
  dTs = S(Tdist)
PSU and IBM Confidential
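The temporal distance computation can be sketched directly, assuming extents are given as numbers (e.g. timestamps in seconds); dTs is then obtained by applying the scoring function S to the result.

```python
def time_distance(q_min, q_max, d_min, d_max):
    """Normalized temporal distance of dataset extent [d_min, d_max]
    from query extent [q_min, q_max], following the
    0.5 * (min distance + max distance) / radius approximation."""
    radius = (q_max - q_min) / 2.0
    # Minimum distance: 0 when the intervals overlap,
    # else the gap between them.
    min_dist = max(q_min - d_max, d_min - q_max, 0.0)
    # Maximum distance: the farther of the two endpoint spans.
    max_dist = max(abs(d_max - q_min), abs(q_max - d_min))
    return 0.5 * (min_dist + max_dist) / radius
```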
29
Scoring Formula: Geospace
• Given geospatial query G with center GC and radius GR, calculate geospatial-relevance score dGs for dataset d:
[Diagram: cases (1), (2), and (3) show dataset extents at increasing distances from the query circle, with minimum and maximum distances dgmin and dgmax measured from GC]
  Gdist = (dgmin + dgmax) / (2 GR)
  dGs = S(Gdist)
• Overall relevance score dscore for dataset d:
  dscore = (dGs + dTs) / 2