Extracting Metadata for Spatially-Aware Information Retrieval on the Internet Clough, Paul University of Sheffield, UK Presented By Mayank Singh




Overview:

• The importance of the experiment
• Introduction to SPIRIT and GATE
• Techniques employed: geo-parsing and geo-coding
• Pros
• Cons
• What it leads to


The importance of the experiment:

• A novel system.

• Geospatial information extraction from the Web documents.

• Annotating the retrieved documents with the spatial data.

• Using the annotated documents to power a working GIR system.


How does it work (summary)

Extracting geospatial references from a document involves:

– Identifying geographic references
– Assigning them spatial co-ordinates
– Factors influencing the above: speed, reliability, flexibility and multilingualism.


Introduction to SPIRIT

• Spatially-Aware Information Retrieval on the Internet

• The main aim of the project is to create tools and techniques to help people find information that relates to specified geographical locations.


A 1 TB crawl of about 9 million web documents focused on the UK, Germany, France and Switzerland, supported by an ontology of places.

Relevance ranking of web documents caters to two needs:

• Documents referring to some place of interest
• Digital geospatial resources


GATE

GATE is a Java suite for natural language processing tasks, widely used in the area of information extraction. ANNIE (A Nearly-New Information Extraction system), the GATE component employed by SPIRIT, is the highlight of this experiment.


ANNIE

• Tokenizer
• Gazetteer
• Sentence splitter
• Part-of-speech tagger
• Named-entity transducer


Spatial Markup

Sources of spatial markup:

• OS: Ordnance Survey (UK, point)
• TGN: Getty Thesaurus of Geographic Names (global, point)
• SABE: Seamless Administrative Boundaries of Europe (Europe, polygon)


Geo-Parsing

• Named-entity recognition: lists + rules
• List lookup on its own is inefficient
• Gazetteer lookup comes first; contextual evidence is then used to confirm the candidates.
• JAPE (Java Annotation Patterns Engine): rules defined in terms of the entities identified within GATE.
• Rules are language independent (using the Systran system)
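
The lookup-then-context idea can be illustrated with a small Python sketch. It is not GATE or JAPE code: the toy gazetteer, the locative cue words and the function name `geoparse` are all hypothetical, chosen only to show a gazetteer candidate being accepted when the surrounding words support a geographic reading.

```python
import re

# Hypothetical toy gazetteer; SPIRIT draws on OS, TGN and SABE instead.
GAZETTEER = {"sheffield", "reading", "bath"}

# Hypothetical context cues that suggest a geographic sense.
LOCATIVE_CUES = {"in", "near", "at", "from", "to"}


def geoparse(text):
    """Return (token_index, name) pairs for tokens accepted as place names."""
    tokens = re.findall(r"[A-Za-z]+", text)
    hits = []
    for i, tok in enumerate(tokens):
        # Step 1: gazetteer lookup proposes a candidate.
        if tok.lower() not in GAZETTEER:
            continue
        # Step 2: contextual evidence confirms it (capitalised and
        # preceded by a locative cue word).
        prev = tokens[i - 1].lower() if i > 0 else ""
        if tok[0].isupper() and prev in LOCATIVE_CUES:
            hits.append((i, tok))
    return hits


print(geoparse("She enjoys reading in Bath, near Sheffield."))
```

Here the lowercase "reading" matches the gazetteer but is rejected by the context rule, while "Bath" and "Sheffield" are kept.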


Hurdles faced

• Filtering out commonly used words, especially those used in a non-geographical sense.

• Using a person-name list to resolve ambiguity between place names and person names.
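
A minimal Python sketch of these two filters follows. The word lists, the heuristics and the function name `filter_candidates` are illustrative assumptions, not the lists or rules used in SPIRIT.

```python
# Hypothetical filter lists; a real system would use far larger resources.
COMMON_WORDS = {"bath", "reading", "nice", "over"}
PERSON_FIRST_NAMES = {"paul", "chelsea", "florence"}


def filter_candidates(tokens, candidate_indexes):
    """Drop gazetteer candidates that look like common words or person names."""
    kept = []
    for i in candidate_indexes:
        tok = tokens[i]
        # A lowercase match on a common word is probably non-geographic.
        if tok.lower() in COMMON_WORDS and not tok[0].isupper():
            continue
        # A known first name followed by a capitalised token looks like a
        # person ("Florence Nightingale") rather than a place.
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok.lower() in PERSON_FIRST_NAMES and nxt[:1].isupper():
            continue
        kept.append(i)
    return kept


tokens = "Florence Nightingale took a bath in Florence".split()
print(filter_candidates(tokens, [0, 4, 6]))
```

Only the final "Florence" survives: the first is followed by a surname, and "bath" is a lowercase common word.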


Geo-Coding

• Gazetteer lookup to assign co-ordinates

• Resolving ambiguity in place names using the feature hierarchy and feature type provided by OS.

• Actual grounding done by SABE and OS.

• TGN used to resolve global ambiguity.
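
As a rough illustration of disambiguation by feature hierarchy and feature type, the Python sketch below picks one gazetteer entry for an ambiguous name. Everything in it is an assumption made for the example: the `Entry` fields, the type ranking, the rounded coordinates and the rule order are not taken from the OS, SABE or TGN data models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    feature_type: str  # e.g. "city", "town"
    parent: str        # containing region in the feature hierarchy
    lat: float
    lon: float


# Hypothetical entries with rounded, illustrative coordinates.
GAZETTEER = [
    Entry("Newport", "city", "Wales", 51.58, -3.00),
    Entry("Newport", "town", "Isle of Wight", 50.70, -1.29),
]

# Hypothetical preference order used when the hierarchy gives no evidence.
TYPE_RANK = {"city": 0, "town": 1, "village": 2}


def geocode(name, other_places):
    """Pick one gazetteer entry for `name`, or None if it is unknown."""
    candidates = [e for e in GAZETTEER if e.name == name]
    if not candidates:
        return None
    # Hierarchy: prefer a candidate whose parent is also mentioned.
    for e in candidates:
        if e.parent in other_places:
            return e
    # Feature type: otherwise fall back to the most prominent type.
    return min(candidates, key=lambda e: TYPE_RANK.get(e.feature_type, 99))


print(geocode("Newport", {"Isle of Wight"}))
print(geocode("Newport", set()))
```

A document that also mentions "Isle of Wight" grounds "Newport" to the town there; with no such evidence the sketch falls back to the city.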


Experimental Setup

• Total annotated collection of about 8.8 million pages

• 22 out of the top 50 domains are from Europe

• About 1.6 million documents containing 5-10 unique footprints were selected. A further 10% was chosen from these, and then only those from the UK were kept (130).

• All geographic names (1864) were manually identified and stored as the benchmark


Geo-parsing Results

SPIRIT + SABE + OS:

• Correct: 1340
• Missing: 479
• False hits: 596
• Precision: 0.6966
• Recall: 0.7820
• F1: 0.7184
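
Precision, recall and F1 are conventionally defined from counts like these as in the Python sketch below; the function name is a hypothetical choice. Plugging the slide's counts into these textbook formulas gives roughly 0.69, 0.74 and 0.71, which does not exactly match the reported figures, so the paper may count matches differently. The reported values are left as given.

```python
def precision_recall_f1(correct, missing, false_hits):
    """Textbook definitions; the paper's exact counting may differ."""
    precision = correct / (correct + false_hits)
    recall = correct / (correct + missing)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = precision_recall_f1(correct=1340, missing=479, false_hits=596)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```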


Geo-Coding Results

• TGN ineffective due to global scope – 1021 found, 68% ambiguous.

• UK SABE good – 942 found, 11% ambiguous.

• 1137 places were assigned a UID correctly, that is, not only the correct geographic sense but also the correct resource order.


Conclusions

• Promising, with a success rate of 89%.

• Geo-parsing can be improved by enhancing gazetteer matching methods and the filtering of non-geographic entries.

• Geo-coding can be improved by finding better methods for combining geographic resources.


Pros

• Novel system and high success rate.

• Towards a geospatial search engine.

• Spatial markup resources in abundance.


Cons

• Ambiguity (geographical)
• Matching the correct geographical sense.
• Large overhead required to build such systems.
• Inherent NLP problems.


What it all leads to

• Creating geographical ontology to assist in GIR (Challenges and Resources for Evaluating Geographical IR - Bruno Martins, Mário J. Silva and Marcirio Silveira Chaves Faculdade de Ciências da Universidade de Lisboa 1749-016 Lisboa, Portugal)

• More focused Local and topical search (Urban Web Crawling - Dirk Ahlers OFFIS Institute for Information Technology Oldenburg, Germany; Susanne Boll University of Oldenburg Germany)


References

• Extracting Metadata for Spatially-Aware Information Retrieval on the Internet - Clough, Paul

• GATE - http://gate.ac.uk/overview.html

• SPIRIT - http://www.geo-spirit.org/project_full.html

• Challenges and Resources for Evaluating Geographical IR - Bruno Martins, Mário J. Silva and Marcirio Silveira Chaves Faculdade de Ciências da Universidade de Lisboa 1749-016 Lisboa, Portugal

• Urban Web Crawling - Dirk Ahlers OFFIS Institute for Information Technology Oldenburg, Germany; Susanne Boll University of Oldenburg Germany