Web Crawler Agent (WCA) Presented by Kirk Martinez University of Southampton

Introduction

• WCA searches for missing information (fragments) on the Web

• WCA structures the information it finds into ontology relations, e.g. place_of_birth (Person, Place)

• Techniques used: NLP (Natural Language Processing), information extraction, relation extraction, question answering

Overview

Ontology (class - relation - class):

• Person - work_in - Place

• Person - date_of_birth - Place

• …..

Ontology instances:

• work_in (Alonso, ‘Granada’)

• date_of_birth (Rembrandt, ?) (missing instance)

The Web Crawler Agent takes the missing instance date_of_birth (Rembrandt, ?), searches the Web, and extracts “15-July-1606” as the answer.

Start Ontotriple
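
The ontology instances in the overview can be pictured as relation triples in which a missing value is left empty. The following is only a minimal sketch under that assumption: the `Triple` class and the `missing_instances` and `fill` helpers are hypothetical names introduced for illustration, not part of WCA.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Triple:
    """One ontology instance, e.g. work_in (Alonso, 'Granada')."""
    relation: str
    subject: str
    value: Optional[str] = None  # None stands for the "?" on the slide

    def is_missing(self) -> bool:
        return self.value is None


def missing_instances(triples: List[Triple]) -> List[Triple]:
    """Return the instances whose value still has to be found on the Web."""
    return [t for t in triples if t.is_missing()]


def fill(triple: Triple, answer: str) -> Triple:
    """Return a copy of the instance with the extracted answer filled in."""
    return Triple(triple.relation, triple.subject, answer)


ontology = [
    Triple("work_in", "Alonso", "Granada"),
    Triple("date_of_birth", "Rembrandt"),  # date_of_birth (Rembrandt, ?)
]

todo = missing_instances(ontology)
completed = fill(todo[0], "15-July-1606")
```

Here `todo` holds only the Rembrandt instance, and `completed` is the same instance with the extracted answer as its value.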

Is it something like “Google”?

• Search “date_of_birth” (when Rembrandt was born) with Google

Searching information with Google

• The “old” Web search (e.g. Google) is good for retrieving documents but NOT for extracting concise answers (e.g. “15-July-1606”)

• It performs no analysis to “understand” the documents (e.g. “Rembrandt” can also be the name of a hotel or a bookstore)
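
The difference between returning documents and returning a concise answer can be illustrated with a small sketch. It uses a single regular expression as a stand-in for the NLP and relation-extraction techniques named in the introduction, which the slides do not detail. The pattern, the `extract_birth_date` helper, and the two sample snippets are assumptions made for illustration.

```python
import re
from typing import Optional

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Hypothetical pattern: "<Person> was born on <day> <month> <year>"
BORN_PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+) was born on "
    r"(?P<day>\d{1,2}) (?P<month>" + MONTHS + r") (?P<year>\d{4})"
)


def extract_birth_date(person: str, text: str) -> Optional[str]:
    """Return a concise date string for `person`, or None if no match."""
    for match in BORN_PATTERN.finditer(text):
        if match.group("person") == person:
            day = int(match.group("day"))
            return f"{day}-{match.group('month')}-{match.group('year')}"
    return None


# Illustrative snippets: both mention "Rembrandt", only one answers the question.
snippets = [
    "Hotel Rembrandt offers rooms near the city centre.",
    "Rembrandt was born on 15 July 1606 in Leiden.",
]

answers = [extract_birth_date("Rembrandt", s) for s in snippets]
```

A keyword search would return both snippets. The extraction step discards the hotel page and keeps only the date from the second one.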

Information extraction on the Web

• Data may be low quality and repeated, e.g. Georges Seurat’s date of death:

– 29 March 1891 (http://www.ibiblio.org/wm/paint/auth/seurat/)

– 19 March 1891 (http://www.rickdoble.net/influence/20seurat.htm)

• WCA depends on:

– Well-structured sentences and documents

– Good named-entity recognisers
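
The Seurat example can be made concrete with a short sketch that groups the extracted values by source and flags a disagreement. The slides do not say how WCA handles such conflicts, so this only detects them. `group_by_value` and `is_conflicting` are hypothetical helpers.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Tuple

# Candidate values for Seurat's date of death, with the pages they came from.
candidates: List[Tuple[date, str]] = [
    (date(1891, 3, 29), "http://www.ibiblio.org/wm/paint/auth/seurat/"),
    (date(1891, 3, 19), "http://www.rickdoble.net/influence/20seurat.htm"),
]


def group_by_value(cands: List[Tuple[date, str]]) -> Dict[date, List[str]]:
    """Map each distinct extracted value to the sources that report it."""
    groups: Dict[date, List[str]] = defaultdict(list)
    for value, source in cands:
        groups[value].append(source)
    return dict(groups)


def is_conflicting(cands: List[Tuple[date, str]]) -> bool:
    """True when the sources report more than one distinct value."""
    return len(group_by_value(cands)) > 1


groups = group_by_value(candidates)
conflict = is_conflicting(candidates)
```

With the two sources above, `groups` has two entries and `conflict` is true, so the value cannot be accepted without a further check.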

The Web Crawler Agent searches the Web for the missing value

Future work

• Verification

• Performance

• Autonomous operation

