Geographic Information Retrieval from Disparate Data Sources
Ian Turton, Anuj Jaiswal, Mark Gahegan
GeoVISTA Center, School of Geography, Pennsylvania State University
ijt1,arj135,[email protected]
Summary
Information Retrieval?
Geographic?
Disparate Data Sources?
Does it work?
Semantics and Ontologies, do they help?
Further work?
Conclusions
Information Retrieval
Information retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata which describe documents, or searching within databases, whether relational stand-alone databases or hypertextually-networked databases such as the World Wide Web.
Wikipedia
OR more simply
Is there some way I can avoid reading all 19,000 of those articles about measles and still sound like I know what I’m talking about at the next conference?
Geography
Well, we all know that geography is important.
Depending on who you ask, more than 80% of all information contains a geographic element.
Explicit: Has a map coordinate
Implicit: Has a place name
Disparate Data Sources
Large collections of text containing implicit geographic references about Avian Flu and Measles:
PubMed abstracts
News Feeds (RSS)
WHO incident reports
Building the System
Acquire data
Extract geographic information
Extract semantic and ontological information
Present in a form that allows easy exploration by users.
Acquire Data
First extract abstracts from PubMed
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/
((avian OR bird) AND (influenza OR flu)) OR H5N1
Returns a structured XML file with citation data and abstract for selected papers.
Process XML into PostGIS database
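The acquisition step could be sketched roughly as follows. This is not the original code: it assumes the standard E-utilities `esearch.fcgi` endpoint under the base URL above and the usual PubMed XML element names (`PubmedArticle`, `PMID`, `ArticleTitle`, `AbstractText`). The function names and the sample record are made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EUTILS = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
QUERY = "((avian OR bird) AND (influenza OR flu)) OR H5N1"


def esearch_url(term, retmax=100):
    """Build an esearch request URL that asks PubMed for matching ids."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return EUTILS + "esearch.fcgi?" + urlencode(params)


def parse_articles(xml_text):
    """Pull id, title and abstract out of an efetch-style XML document."""
    root = ET.fromstring(xml_text)
    articles = []
    for art in root.iter("PubmedArticle"):
        # An abstract may be split into several AbstractText sections.
        abstract = " ".join(
            "".join(node.itertext()) for node in art.iter("AbstractText")
        )
        articles.append({
            "pmid": art.findtext(".//PMID"),
            "title": art.findtext(".//ArticleTitle"),
            "abstract": abstract,
        })
    return articles


# Made-up record in the assumed shape, so the sketch runs offline.
SAMPLE = """<PubmedArticleSet><PubmedArticle><MedlineCitation>
<PMID>0</PMID><Article><ArticleTitle>Example title</ArticleTitle>
<Abstract><AbstractText>H5N1 was detected in Vietnam.</AbstractText></Abstract>
</Article></MedlineCitation></PubmedArticle></PubmedArticleSet>"""

if __name__ == "__main__":
    print(esearch_url(QUERY))
    print(parse_articles(SAMPLE))
```

The parsed dictionaries would then be written to the database keyed by article id.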
Extract Geographic Entities
Use FactXtractor (http://julian.mine.nu/snedemo.html)
Uses GATE to detect and extract Named Entities and Entity Relationships
Usually finds People, Places and Organizations
Returned as an OWL encoded ontology
In this case we just make use of places
<rdf:RDF xml:base="http://ist.psu.edu/sna/ontology#">
  <owl:Class rdf:ID="Location"/>
  <owl:Class rdf:ID="Organization"/>
  <owl:Class rdf:ID="Person"/>
  <owl:DatatypeProperty rdf:ID="counts"/>
  <Location rdf:ID="Africa">
    <counts>1</counts>
    <mentioned_in>
      <_Article rdf:ID="InputString0">
      </_Article>
    </mentioned_in>
  </Location>
  <Location rdf:ID="Asia">
    <counts>1</counts>
    <mentioned_in rdf:resource="#InputString0"/>
  </Location>
  <Location rdf:ID="Vietnam"/>
  <Location rdf:ID="South_East"/>
  <Location rdf:ID="Europe">
    <counts>1</counts>
    <mentioned_in rdf:resource="#InputString0"/>
  </Location>
</rdf:RDF>
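Pulling just the places out of output like this might look like the sketch below. The `xmlns` declarations in the sample are assumptions added so that a standard XML parser accepts it (the excerpt above does not show them). Turning underscores in ids back into spaces is also a guess about how multi-word names are encoded, and the `Person` entry is a made-up placeholder.

```python
import xml.etree.ElementTree as ET

# Sample in the shape of the output above, with assumed namespace declarations.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns="http://ist.psu.edu/sna/ontology#">
  <owl:Class rdf:ID="Location"/>
  <Location rdf:ID="Africa"><counts>1</counts></Location>
  <Location rdf:ID="South_East"/>
  <Person rdf:ID="Someone"/>
</rdf:RDF>"""

RDF_ID = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}ID"


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_locations(owl_text):
    """Return the names of all top-level Location individuals."""
    root = ET.fromstring(owl_text)
    places = []
    for elem in root:
        # The owl:Class declaration has local name "Class", so it is skipped.
        if local_name(elem.tag) == "Location" and RDF_ID in elem.attrib:
            places.append(elem.attrib[RDF_ID].replace("_", " "))
    return places


if __name__ == "__main__":
    print(extract_locations(SAMPLE))
```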
GeoLocation
Converting a place name into a location
State College, PA -> (40.7934, -77.86)
Call the GeoNames web service to carry out a gazetteer lookup on the name.
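A gazetteer lookup along these lines could be sketched as below, assuming the GeoNames `searchJSON` endpoint and its `geonames` result list with string-valued `lat` and `lng` fields. The username is a placeholder for a registered GeoNames account, and the sample response is hand-written from the coordinates on the slide rather than captured from the service.

```python
import json
from urllib.parse import urlencode


def geonames_search_url(name, username, max_rows=10):
    """Build a GeoNames full-text search URL for a place name."""
    params = {"q": name, "maxRows": max_rows, "username": username}
    return "http://api.geonames.org/searchJSON?" + urlencode(params)


def parse_candidates(response_text):
    """Turn a searchJSON response into a list of candidate locations."""
    data = json.loads(response_text)
    return [
        {"name": g["name"], "lat": float(g["lat"]), "lng": float(g["lng"])}
        for g in data.get("geonames", [])
    ]


# Hand-written response in the assumed shape, so the sketch runs offline.
SAMPLE = (
    '{"geonames": [{"name": "State College", '
    '"lat": "40.7934", "lng": "-77.86"}]}'
)

if __name__ == "__main__":
    print(geonames_search_url("State College, PA", "YOUR_USERNAME"))
    print(parse_candidates(SAMPLE))
```

A lookup usually returns several candidates for one name, which is what makes the disambiguation step necessary.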
Disambiguation
Which London did you mean?
Types of Ambiguity
Geo/Geo
  London, UK vs London, Ontario
  South Wales, UK vs New South Wales, Au
  Paris, France vs Paris, Texas
Geo/Non Geo
  Washington, DC vs George Washington
  Van, Turkey vs delivery van
  West Nile, Egypt vs West Nile Virus
Sort of Ambiguous
  avian A/Mallard/Pennsylvania/10218/84 (H5N2) influenza virus strains
Disambiguating Multiple Places
Choose A if A is a Political Entity and B is not,
Choose B if B is a Political Entity and A is not,
Choose A if A is a Region and B is not,
Choose B if B is a Region and A is not,
Choose A if A is an Ocean and B is not,
Choose B if B is an Ocean and A is not,
Choose A if A is a Populated Place and B is not,
Choose B if B is a Populated Place and A is not,
Choose A if A's population is greater than B's,
Choose B if B's population is greater than A's,
Choose A if A is an Administrative Area and B is not,
Choose B if B is an Administrative Area and A is not,
Choose A if A is a Water Feature and B is not,
Choose B if B is a Water Feature and A is not,
Choose A.
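The rule cascade above can be written as a pairwise comparison that is folded over all candidates for a name. The sketch below is one possible reading, not the original implementation: the `kind` labels are hypothetical stand-ins for whatever feature classes the gazetteer returns, and each candidate is assumed to have exactly one kind.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class Candidate:
    name: str
    kind: str            # hypothetical label, e.g. "political", "populated"
    population: int = 0


def choose(a, b):
    """Apply the pairwise rules in order; fall back to A when nothing decides."""

    def by_kind(kind):
        # Prefer whichever candidate is of this kind when only one of them is.
        if a.kind == kind and b.kind != kind:
            return a
        if b.kind == kind and a.kind != kind:
            return b
        return None

    for kind in ("political", "region", "ocean", "populated"):
        winner = by_kind(kind)
        if winner is not None:
            return winner
    if a.population != b.population:
        return a if a.population > b.population else b
    for kind in ("administrative", "water"):
        winner = by_kind(kind)
        if winner is not None:
            return winner
    return a


def disambiguate(candidates):
    """Reduce a non-empty list of candidates to a single best guess."""
    return reduce(choose, candidates)
```

For example, given two populated places both named London, the one with the larger population is returned regardless of the order in which the gazetteer lists them.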
Solving Geo/Non Geo Ambiguity
Stop word lists – hand crafted by experience
Province, valley, way, hill, Children, Children's, new, cross, red, clinic, general, côte, ii, iii, bas, pays, chem, northern region, eastern region, central region, southern region, region, off, square, census, islands, city, district, park, USA, State, Virology, Microbiology, Immunology, Medical, Science, Employee, Surveillance, Disease, Biochemistry, Prevention, for, and, mail, natl, dept, dev, agr, Rural, inst, mil, med, coll, Internal, Publ, Bur, Hosp, Jude, Childrens, Chai, yan, Virol, Dis, Div, Enter, Cent, lab, Univ, res, ist, prevent, roc, prod, Roche, vet, castle, peak, stat, garden, Atl, Anim, mar, queen, central, Director, LAT, AC-EIA, register, north, east, south, west, northern, southern, eastern, western
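Applying such a list is a simple filter over the extracted names. The sketch below uses only a handful of entries from the list above and assumes that matching is case-insensitive and against the whole extracted name, so that a multi-word place containing a stop word (for example one starting with "new") is not discarded. Both choices are assumptions, not details given on the slide.

```python
# A small subset of the stop word list above, lower-cased for comparison.
STOP_WORDS = {
    "province", "valley", "way", "hill", "new", "cross", "red",
    "clinic", "general", "virology", "univ", "hosp", "region",
}


def drop_stop_words(names):
    """Remove candidate place names that are really stop words."""
    return [n for n in names if n.strip().lower() not in STOP_WORDS]


if __name__ == "__main__":
    print(drop_stop_words(["Vietnam", "Virology", "Hill", "New York"]))
```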
Concept Extraction
Automatically extract keywords or tags from article abstracts by:
Selecting keywords which exceed a preset frequency.
Passing text through the Yahoo! tagging service, which returns key phrases using latent semantic indexing.
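The frequency-based option could be sketched as below. The tokenisation, the minimum word length and the default threshold are illustrative choices, not values from the original system.

```python
import re
from collections import Counter


def frequent_keywords(text, threshold=1, min_length=4):
    """Return words that occur more than `threshold` times, sorted A to Z."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Very short words are skipped as a crude stand-in for a stop word list.
    counts = Counter(w for w in words if len(w) >= min_length)
    return sorted(w for w, n in counts.items() if n > threshold)


if __name__ == "__main__":
    abstract = (
        "Avian influenza virus. The virus spreads; "
        "influenza outbreaks follow the virus."
    )
    print(frequent_keywords(abstract))
```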
Store everything in a big database
Open up PostGIS and stuff in all the data keyed by article id.
Article
  Citation data – authors, title, abstract, journal, volume, issue, etc
Places
  Name, Country, Latitude, Longitude, etc
Concepts
  Key phrase or word
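A minimal sketch of the three tables, keyed by article id, is shown below. It uses Python's built-in `sqlite3` as a stand-in so that it runs anywhere. The original system used PostGIS, where the coordinates would more likely live in a geometry column. Table and column names are assumptions based on the fields listed above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE article (
    id INTEGER PRIMARY KEY,
    title TEXT, abstract TEXT, journal TEXT, volume TEXT, issue TEXT
);
CREATE TABLE place (
    article_id INTEGER REFERENCES article(id),
    name TEXT, country TEXT, latitude REAL, longitude REAL
);
CREATE TABLE concept (
    article_id INTEGER REFERENCES article(id),
    phrase TEXT
);
"""


def build_db():
    """Create an empty in-memory database with the three tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def places_for_article(conn, article_id):
    """List the place names recorded for one article."""
    rows = conn.execute(
        "SELECT name FROM place WHERE article_id = ? ORDER BY name",
        (article_id,),
    )
    return [row[0] for row in rows]
```

Keeping places and concepts in separate tables that both point at the article is what lets the front end join tags to places through the articles they share.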
Provide Intuitive Front End for Users
Tag Cloud
Popularized on many Web 2.0 sites such as Flickr, del.icio.us, CiteULike.org etc.
Place Cloud
Author Cloud
Choose a tag
Choose a place
Select a child of the place
Tag limited by place
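Limiting the tag cloud to a chosen place amounts to counting concepts only over articles that mention that place, then scaling the counts to font sizes. The sketch below works on plain dictionaries keyed by article id. The data layout, function names and point sizes are illustrative assumptions.

```python
from collections import Counter


def tag_counts(concepts_by_article, places_by_article, place=None):
    """Count concepts, optionally only over articles mentioning `place`."""
    counts = Counter()
    for article_id, concepts in concepts_by_article.items():
        if place is None or place in places_by_article.get(article_id, ()):
            counts.update(concepts)
    return counts


def font_sizes(counts, smallest=10, largest=32):
    """Map each tag's count linearly onto a font size range."""
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = (high - low) or 1  # avoid dividing by zero when all counts match
    return {
        tag: smallest + (largest - smallest) * (n - low) // span
        for tag, n in counts.items()
    }
```

The same counting function with place names in place of concepts would drive the place cloud.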
Implementation
Initially implemented as a Java servlet using a JDBC link to PostGIS
Reimplemented in the last week using Ruby on Rails, with ActiveRecord talking to PostGIS
In-page mapping: OpenLayers WMS map client to GeoServer over PostGIS.
Semantics and Ontologies
Geographic ontology is provided by the GeoNames semantic web service.
A query allows the lookup of parent, child and nearby features for most features.
Results are cached in the PostGIS database to save processing time and load on the server.
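The caching idea can be sketched as a thin wrapper around whatever function performs the remote lookup. Here the cache is an in-memory dictionary standing in for the PostGIS table, and `fetch_children` is a hypothetical callable supplied by the caller. Neither is taken from the original code.

```python
class CachedHierarchy:
    """Remember child lookups so each place is fetched at most once."""

    def __init__(self, fetch_children):
        self._fetch = fetch_children  # hypothetical remote lookup function
        self._cache = {}              # stand-in for the database cache table

    def children(self, place_id):
        """Return the children of a place, fetching only on a cache miss."""
        if place_id not in self._cache:
            self._cache[place_id] = self._fetch(place_id)
        return self._cache[place_id]
```

Parent and nearby-feature lookups could be cached the same way, with one cache per relationship type.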
WordNet Ontology
Conclusions
It is possible to construct a useful system to ingest arbitrary text and extract place names.
A sufficiently good automated location disambiguation system can be built for a specific domain to process 80-90% of places correctly.
Semantic expansion and narrowing of searches appear useful in early experiments.
Providing users with a familiar, and highly linked, interface seems to aid exploration of the document space.