Human Language Technologies for the Semantic Web
Department of Computer Science, University of Sheffield
Fabio Ciravegna and Yorick Wilks
F. Ciravegna- AKT Town Meeting April 2003
Language Technologies
• Goal
  – Building systems able to process natural language in its written or spoken form
• Methodology
  – Use of language analysis
• Technologies (examples):
  – Information Extraction from Text
  – Question Answering
  – Text Generation
HLT for Knowledge Management
• Use of HLT for Knowledge
  – Acquisition
  – Retrieval
  – Publication
• Main benefits
  – Cost reduction
  – Reduced time needed for KM
  – Improved knowledge accessibility
    • Accessing/Diffusing/Understanding
HLT in AKT for KM
• KM phases: acquisition, retrieval, publishing
• Technologies: Text mining; Information Extraction from Text; Text Generation
HLT for Semantic Web
• Use of HLT for:
  – Document annotation
  – Information integration from different sources
• Benefit
  – Reduce annotation needs
  – Retrieve and integrate dispersed information
Information Extraction
• Textual documents are pervasive (e.g. Web)
  – Contained knowledge cannot be queried, therefore cannot be
    • Used by automatic systems
    • Easily managed by humans
• IE can identify information in documents
  – e.g. to populate a database
  – e.g. to annotate documents
• Method: natural language analysis (Words → Information → Knowledge)
IE tasks
• Named Entities
• Template Elements
• Template Relations
• Scenario Template
Example text:
WASHINGTON, D.C. (October 5, 1999) - nQuest Inc. today announced that Paul Jacobs, former Vice-President of E-Commerce at SRA International, has joined the company's executive management team as president.
Named entities: nQuest Inc.; Paul Jacobs; SRA International
Extracted templates:
  – Company: nQuest Inc.; Date: today; InPerson: Paul Jacobs; InRole: president
  – Company: SRA International; OutPerson: Paul Jacobs; OutRole: Vice-President of E-Commerce
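The extracted fields in this example can be held as simple records once IE has run. The sketch below is only an illustration of that idea: the class name `SuccessionEvent`, the `direction` field and the entity type labels are assumptions made for the example and are not taken from the slides or from any of the tools mentioned.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SuccessionEvent:
    """One filled template: a person entering or leaving a role (hypothetical schema)."""
    company: str
    person: str
    role: str
    direction: str              # "in" for InPerson/InRole, "out" for OutPerson/OutRole
    date: Optional[str] = None  # kept as the literal text span, e.g. "today"


# Named entities found in the sentence (the type labels are assumed).
named_entities = {
    "nQuest Inc.": "Organization",
    "Paul Jacobs": "Person",
    "SRA International": "Organization",
}

# The two templates from the example, as records.
events = [
    SuccessionEvent("nQuest Inc.", "Paul Jacobs", "president", "in", date="today"),
    SuccessionEvent("SRA International", "Paul Jacobs",
                    "Vice-President of E-Commerce", "out"),
]


def roles_of(person: str, records: List[SuccessionEvent]) -> List[Tuple[str, str, str]]:
    """Query the populated 'database': every role change recorded for a person."""
    return [(e.direction, e.role, e.company) for e in records if e.person == person]


history = roles_of("Paul Jacobs", events)
```

Once the text is in this form it can be queried like any other data, which is the point of populating a database from documents.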
IE Tools @ Sheffield
• GATE:
  – General Architecture for Language Engineering
  – Used to integrate HLT modules
• ANNIE:
  – Rule-based Named Entity Recogniser
  – Download at www.gate.ac.uk
• Amilcare:
  – Adaptive IE system
  – Portable using examples
  – www.nlp.shef.ac.uk/amilcare
IE Tools @ Sheffield (2)
• Melita:
  – Annotation tool
  – Supported by adaptive IE (Amilcare)
  – Learns how to annotate
  – www.aktors.org/technologies/melita/
• LaSIE:
  – IE system for complex event extraction
  – Manual rule development
  – www.dcs.shef.ac.uk/research/groups/nlp/funded/lasie.html
GATE is…
• An architecture
  – A macro-level organisational picture for LE software systems.
• A framework
  – For programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment
  – For language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.
• Free software (LGPL). Mature, robust software (in development since 1995).
• Comes with…
  – Some free components... and wrappers for other people's components
  – Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
Some users…
At the time of writing, a representative sample of GATE users includes:
• Longman Pearson publishing, UK
• BT Exact Technologies, UK
• Merck KGaA, Germany
• Canon Europe, UK
• Knight Ridder (the second biggest US news publisher)
• BBN Technologies, US
• Sirma AI Ltd., Bulgaria
• Resco AB, Sweden/Finland/Germany
• Glaxo Smith Kline Plc: drug-based navigation of Medline abstracts
• Master Foods NV: extraction of commodities events from news
• The American National Corpus project, US
• Imperial College, London, the University of Manchester, Queen Mary College, UMIST, the University of Karlsruhe, Vassar College, ISI / the University of Southern California and a large number of other UK, US and EU universities
• The Perseus Digital Library project, Tufts University, US
GATE and Content Extraction
• ANNIE: open-source IE system in GATE, providing modules needed for content extraction
  – Pre-processing
  – Named entity recognition
  – Coreference resolution
• ANNIE handles proper names, pronouns, and nominals
• Easy-to-use pattern-action rule language to enable customisation and postprocessing of the IE results
• Contact Hamish Cunningham ([email protected])
Amilcare: Active annotation for the Semantic Web
• Tool for adaptive IE from Web-related texts
  – Specifically designed for document annotation
  – Trains with a limited number of examples
  – Effective on different text types
    • From free texts to rigid documents (XML, HTML, etc.)
  – Tools for:
    • Normal user: able to annotate a corpus
    • Amilcare expert: able to optimise experiments
    • IE expert: able to edit rules
  – Uses ANNIE for preprocessing up to Named Entity Recognition
[Ciravegna – IJCAI 2001]
Implementation details
• 100% Java
• External interfaces:
  – API for use from other programs
  – GUI for manual training
• Requirements:
  – 10 MB of disk space
  – Up to 300 MB of RAM
• Contact Fabio Ciravegna ([email protected])
Users
• Integrated with SW annotation tools:
  – MnM (Open Univ.)
  – Ontomat (Karlsruhe Univ.)
  – Melita (Sheffield Univ.)
• Users:
  – Merck (D)
  – ISOCO (SP)
  – Quinary (I)
  – Ontoprise (D)
  – University College Dublin (IE)
  – 2 departments of CNRS (F)
  – University of Trier (D)
  – University of Texas (Austin, USA)
Document Annotation
• Many application areas require document annotation (enrichment)
  – Knowledge Management
    • Protocol analysis in industry (Kingston 94)
    • Italian police: 100 annotators/6 pages a day each
  – Semantic Web (Staab00, Motta02, Ciravegna02)
• Annotation is generally manual
  – Expensive
  – Inefficient
  – Difficult
  – Tedious & tiring
  – Error prone (15-30% inter-annotator disagreement)
  – Never ending
Melita
• Document annotation tool
  – Uses an adaptive IE engine to support annotation
• IE system:
  – Trains while users annotate
  – Provides preliminary annotation for new documents
• Advantages
  – Annotates trivial or previously seen cases
  – Focuses slow/expensive user activity on unseen cases
  – Validating extracted information
    • Simpler & less error prone
    • Speeds up corpus annotation
  – Learns how to improve capabilities
Annotation with IE
• The user annotates bare text
• The IE system trains on the annotated corpus
• The system annotates further bare text, and its annotation is compared with the user's
• The system retrains using errors, missing tags and mistakes
Annotation with Suggestions
• The IE system annotates bare text
• The user corrects the annotations
• The system uses the corrections to retrain
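The suggestion cycle on this slide (the system annotates, the user corrects, the system retrains) can be sketched as a small loop. This is a toy under stated assumptions, not Melita's or Amilcare's implementation: `MemorisingTagger` is a hypothetical stand-in learner that only remembers which tag each string received, and `simulated_user` plays the role of the human annotator.

```python
class MemorisingTagger:
    """Toy stand-in for an adaptive IE engine (hypothetical): remembers string -> tag."""

    def __init__(self):
        self.known = {}

    def train(self, annotations):
        # annotations: dict mapping a text span to its tag
        self.known.update(annotations)

    def suggest(self, tokens):
        # Propose a tag for every token already seen during training.
        return {t: self.known[t] for t in tokens if t in self.known}


def annotate_with_suggestions(tagger, documents, user_annotate):
    """Run the suggest / correct / retrain cycle.

    Returns, per document, how many tags the user had to supply or fix by hand.
    """
    manual_counts = []
    for tokens in documents:
        suggested = tagger.suggest(tokens)            # system annotates
        corrected = user_annotate(tokens, suggested)  # user corrects and completes
        manual = sum(1 for span, tag in corrected.items()
                     if suggested.get(span) != tag)
        manual_counts.append(manual)
        tagger.train(corrected)                       # corrections are used to retrain
    return manual_counts


# Simulated user who always produces the correct annotation (assumed gold data).
gold = {"Sheffield": "location", "Ciravegna": "speaker", "Wilks": "speaker"}


def simulated_user(tokens, suggested):
    return {t: gold[t] for t in tokens if t in gold}


docs = [
    ["Ciravegna", "talks", "in", "Sheffield"],
    ["Wilks", "visits", "Sheffield"],
    ["Ciravegna", "and", "Wilks"],
]
counts = annotate_with_suggestions(MemorisingTagger(), docs, simulated_user)
```

In this run the manual effort per document falls as the tagger sees more corrections, which is the behaviour the slide describes: the user's time is spent on unseen cases only.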
Cooperation: is IE a Useful Support?
CMU Seminars task. Test: 250 texts (Amilcare reports the best IE results ever)
[Figure: four plots (Location, Speaker, Stime, Etime), each showing Precision, Recall and F-measure (scale 0 to 100) against the number of training examples (0 to 140).]
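For reference, the three measures plotted here can be computed from a set of predicted annotations and a set of gold annotations as follows. This is the standard definition with a balanced F-measure; the tuple layout `(document, field, value)` and the sample data are assumptions made for the example, and the values are fractions (multiply by 100 for the scale used in the plots).

```python
def precision_recall_f(predicted, gold):
    """Return (precision, recall, F-measure) as fractions in [0, 1]."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Invented sample data: four gold annotations, three system annotations, two correct.
gold_annotations = {
    ("doc1", "speaker", "A. Smith"),
    ("doc1", "location", "Room 101"),
    ("doc1", "stime", "3:30 pm"),
    ("doc1", "etime", "5:00 pm"),
}
system_annotations = {
    ("doc1", "speaker", "A. Smith"),
    ("doc1", "stime", "3:30 pm"),
    ("doc1", "location", "Room 7"),  # wrong value
}

p, r, f = precision_recall_f(system_annotations, gold_annotations)
```

Here two of three predictions are correct and two of four gold items are found, so precision is 2/3, recall is 1/2 and the F-measure is 4/7.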
Integrating Information
• Information is available over the Web
  – Dispersed
  – In textual format
• IE as basis for retrieval and integration of information
  – Unsupervised learning using
    • The redundancy of the Web
    • Available repositories
      – Collections of documents/data
      – Known services (e.g. databases, digital libraries, search engines)
    to bootstrap learning and produce simple, high-precision IE applications
Mining Web Sites
• Extracting knowledge from CS Web sites
  – Person: Name; Position; Email/Telephone; Involvement in projects; Publications; Co-workers
• Information distributed
• Challenges
  – Retrieving information
  – Integrating information
  – Largely unsupervised by user
Mining Web sites
[Diagram labels: HomePageSearch; People and Project names; Basket: Project/People name lists and hyperlinks]
• Mines the site looking for Project and People names
• Uses
  – Generic patterns
  – ANNIE
  – Citeseer for likely bigrams
• Annotates known names
• Trains on annotations to discover the HTML structure of the page
• Recovers all names and hyperlinks
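The step of annotating known names, learning the HTML structure around them and then recovering every name on the page can be illustrated with a very small wrapper-induction sketch. It is an assumption-laden toy, not the method used in AKT: the functions `induce_wrapper` and `extract_names`, the fixed context `width` and the sample page are all invented for the example, and only names (not hyperlinks) are recovered.

```python
import os
import re


def common_suffix(strings):
    """Longest string that every input ends with."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def induce_wrapper(html, known_names, width=8):
    """Learn left/right delimiters shared by all occurrences of the known names."""
    lefts, rights = [], []
    for name in known_names:
        for m in re.finditer(re.escape(name), html):
            lefts.append(html[max(0, m.start() - width):m.start()])
            rights.append(html[m.end():m.end() + width])
    if not lefts:
        return None
    left = common_suffix(lefts)             # markup just before each known name
    right = os.path.commonprefix(rights)    # markup just after each known name
    if not left or not right:
        return None
    # Anything without angle brackets between the two delimiters is a candidate.
    return re.compile(re.escape(left) + r"([^<>]+?)" + re.escape(right))


def extract_names(html, known_names):
    wrapper = induce_wrapper(html, known_names)
    return wrapper.findall(html) if wrapper else []


# Invented staff page; only two of the three names are known in advance.
page = (
    '<ul>'
    '<li><a href="/~ann">Ann Smith</a></li>'
    '<li><a href="/~bob">Bob Jones</a></li>'
    '<li><a href="/~cat">Cat Brown</a></li>'
    '</ul>'
)
names = extract_names(page, ["Ann Smith", "Bob Jones"])
```

Because the list is generated from a regular template, the delimiters learned from two seed names also match the third entry, which is how a few known examples can bootstrap extraction of the whole list.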
Mining Web sites
[Diagram labels: HomePageSearch; Projects/People Web pages]
• Extracts personal data
  – Addresses
  – Tel number
  – Email address
  – …
• Basket: Project/People name lists and hyperlinks
• People and Projects Basket: Name lists and hyperlinks; Personal data
Mining Web sites
[Diagram labels: HomePageSearch; People Publications]
• Annotates known papers
• Trains on annotations to discover the HTML structure
• Recovers co-authoring information
• People and Projects Basket (before): Name lists and hyperlinks; Personal data
• People and Projects Basket (after): Name lists and hyperlinks; Personal data; Co-authoring information
Paper discovery
Focus on people
User Role
• Providing:
  – A URL
  – A list of services (e.g. Google)
    • Training wrappers using examples
  – Some examples of fillers (e.g. projects)
• Correcting intermediate results where necessary
Rationale
• Large collections (e.g. Web) contain redundant information
  – Redundancy can be used to bootstrap learning
• Mining the Web for information
  – Learned patterns
• Integration of information
  – Multiple evidence
    • Different strategies with different reliability
    • Scruffy works!
  – User corrections of data where necessary
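One simple way to combine multiple evidence from strategies of different reliability is a noisy-OR: a candidate is accepted with a confidence that grows as more, and more reliable, strategies propose it. The slides do not say how evidence is combined, so this is only one possible reading; the strategy names, the reliability values and the independence assumption behind noisy-OR are all invented for the illustration.

```python
from math import prod

# Assumed reliability of each extraction strategy (illustrative values only).
STRATEGY_RELIABILITY = {
    "html_wrapper": 0.9,
    "named_entity": 0.6,
    "generic_pattern": 0.5,
}


def combine_evidence(reliabilities):
    """Noisy-OR: probability that at least one independent source is right."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


def score_candidates(sightings):
    """sightings: iterable of (candidate, strategy) pairs from mining the Web."""
    by_candidate = {}
    for candidate, strategy in sightings:
        # A set, so the same strategy is not counted twice for one candidate.
        by_candidate.setdefault(candidate, set()).add(strategy)
    return {
        candidate: combine_evidence(STRATEGY_RELIABILITY[s] for s in strategies)
        for candidate, strategies in by_candidate.items()
    }


sightings = [
    ("Paul Jacobs", "generic_pattern"),
    ("Paul Jacobs", "named_entity"),
    ("E-Commerce", "generic_pattern"),
]
scores = score_candidates(sightings)
```

With these numbers a name proposed by two weak strategies scores 0.8, above either strategy alone, while a candidate seen by only one weak strategy stays at 0.5 and could be left for user correction.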
Conclusion
• In AKT we are using HLT (IE) for:
  – Helping in document annotation
  – Integrating information from different sources
• Benefit:
  – Reduce annotation needs
  – Retrieve and integrate dispersed information
    • Minimum user intervention