
12 July 2002, Colloquium on Applications of Natural Language Corpora, Saarland University. Domain-specific Web Corpora and their Applications, Gregor Erbach



Domain-specific Web Corpora and their Applications

Gregor Erbach

Saarland University

Project COLLATE

(funding: BMBF 01 IN A01 B)


Outline

Part I: Web Corpora

Part II: Applications of Web Corpora

Part III: LT-World Web Corpus

Part IV: Research in COLLATE


Part I: Web Corpora

1. Formal Properties of the Web

2. Web Corpus

3. Document and Hyperlink Database

4. TREC web track


Formal Properties of the Web

• Hypertext/Hypermedia
• Directed graph with cycles
• Edges = hyperlinks
• Nodes = documents ???
• Nodes often have internal tree structure (HTML, XML)


Web Corpus

A web corpus consists of
• a database of documents
• a database of hyperlinks


Document Database

Information for each document:
– URL/URN
– Full Text (possibly with linguistic annotation such as POS, named entities, phrases)
– Full Text Index
– Metadata
• Author, Language, Date, MIME type … (Dublin Core)
• Category, Abstract, Keywords, Type of Page …
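A document record and a full-text index could be sketched roughly as follows. This is only an illustration: the class name `DocumentRecord`, its field names, the example URLs and the whitespace tokenisation are assumptions and do not come from the slides.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DocumentRecord:
    """One entry of the document database (field names are illustrative)."""
    url: str
    full_text: str
    metadata: dict = field(default_factory=dict)  # e.g. Dublin Core-style keys


def build_full_text_index(records):
    """Map each lower-cased token to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for record in records:
        for token in record.full_text.lower().split():
            index[token.strip(".,;:!?()\"'")].add(record.url)
    index.pop("", None)  # drop tokens that consisted only of punctuation
    return dict(index)


docs = [
    DocumentRecord("http://example.org/a", "Web corpora for language technology.",
                   {"language": "en", "format": "text/html"}),
    DocumentRecord("http://example.org/b", "Language corpora and hyperlinks",
                   {"language": "en", "format": "text/html"}),
]
index = build_full_text_index(docs)
print(sorted(index["corpora"]))
```

A real corpus would store linguistic annotation alongside the text and use a proper tokeniser; the sketch only shows how the listed fields fit together.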


Fields of Hyperlink Database

• source anchor URL

• source anchor position on web page (percentage)

• source anchor position in document structure (HTML element path)

• source anchor type (text or image)

• source anchor text and context

• target anchor URL

• target anchor position on web page

• target anchor MIME type


Derived Properties of Hyperlinks

• Same document?
• Same server?
• Same 2nd/3rd level domain?
• Ascending or descending in directory structure
• Source is within a list of links
• Navigation link (up, previous, next …)
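Some of these derived properties can be computed from the source and target URLs alone. The following is a minimal sketch using Python's standard `urllib.parse`; the function names and the returned keys are assumptions, and the "2nd-level domain" test is deliberately naive (it takes the last two host labels, which is wrong for suffixes such as co.uk).

```python
import posixpath
from urllib.parse import urldefrag, urlsplit


def _second_level_domain(host):
    # Naive assumption: the last two host labels form the 2nd-level domain.
    return ".".join(host.split(".")[-2:])


def _directory(path):
    # "/a/b/page.html" -> "/a/b/", "/page.html" or "" -> "/"
    return posixpath.dirname(path).rstrip("/") + "/"


def derived_link_properties(source_url, target_url):
    """Derive a few hyperlink properties from the two URLs (a sketch)."""
    src, dst = urlsplit(source_url), urlsplit(target_url)
    same_server = (src.hostname or "") == (dst.hostname or "")
    direction = None  # only defined for links that stay on one server
    if same_server:
        src_dir, dst_dir = _directory(src.path), _directory(dst.path)
        if src_dir == dst_dir:
            direction = "same directory"
        elif dst_dir.startswith(src_dir):
            direction = "descending"
        elif src_dir.startswith(dst_dir):
            direction = "ascending"
        else:
            direction = "other"
    return {
        # Two URLs that differ only in their fragment address one document.
        "same_document": urldefrag(source_url).url == urldefrag(target_url).url,
        "same_server": same_server,
        "same_2nd_level_domain": _second_level_domain(src.hostname or "")
                                 == _second_level_domain(dst.hostname or ""),
        "direction": direction,
    }


print(derived_link_properties(
    "http://www.example.org/projects/index.html",
    "http://www.example.org/projects/collate/people.html",
))
```

Properties such as "source is within a list of links" or "navigation link" need the page structure and anchor text, so they are not covered here.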


TREC web track

• Construction of a web corpus (WT10g) according to the following criteria:

– Broadly representative of web data in general

– Many inter-server links

– Contains all available pages from a set of servers

– Contains an interesting set of meta-data

– Contains few binary, non-English or duplicate documents

– Size: 10 GB

P. Bailey, N. Craswell and D. Hawking. Engineering a multi-purpose test collection for Web retrieval experiments. IP&M, to appear.


Part II: Applications of Web Corpora

1. Web Mining

2. Information Retrieval

3. Clustering and Categorisation

4. Summarisation

5. Discovery of Relations

6. Terminology Extraction

7. Information Extraction

8. Ontology Learning


Useful Methods

• Machine Learning and Data Mining
• Natural Language Processing
• Information Retrieval
• Ontologies and Semantic Web
• Bibliometrics (citation analysis ~ link analysis)


Web Mining

• Web Content Mining
– Discovery of terminology, acronyms, concepts
• Web Structure Mining
– Discovery of relations, communities …
• Web Usage Mining
– Discovery of navigation patterns


Information Retrieval

• Usage of hyperlinks for determining popularity of web pages

• Hub and authority pages
• Widely used: Google PageRank
• Mixed results in TREC web track

Jon M. Kleinberg (1997) Authoritative Sources in a Hyperlinked Environment. Journal of the ACM

Sergey Brin, Lawrence Page (1998) The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems
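As an illustration of link-based popularity, here is a small power-iteration sketch of a PageRank-style score. It is a common normalised textbook variant, not the production algorithm of any search engine; the function name, the damping value 0.85, the fixed iteration count and the toy graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


# Toy link graph: "c" receives links from three pages, "d" from none.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

Pages that attract many links from well-linked pages end up with the highest scores, which is the "popularity" notion referred to above.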


Clustering

• Standard clustering algorithms form clusters by iteratively grouping documents/clusters, according to a distance measure

• Content-based methods measure distance by counting terms/concepts (often TF/IDF)

• Connectivity-based distance measures make use of hyperlinks
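The two kinds of distance measure can be contrasted in a short sketch. The content-based part uses TF/IDF weights with cosine distance; the connectivity-based part uses the Jaccard distance between sets of link targets. The Jaccard choice is an assumption made for illustration, since the slides do not name a specific link-based measure, and all function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(tokenised_docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n = len(tokenised_docs)
    df = Counter(term for doc in tokenised_docs for term in set(doc))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in tokenised_docs]


def cosine_distance(u, v):
    """Content-based distance: 1 - cosine similarity of two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return 1.0 - dot / norm if norm else 1.0


def link_distance(out_links_a, out_links_b):
    """Connectivity-based distance: Jaccard distance of two link-target sets."""
    a, b = set(out_links_a), set(out_links_b)
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 1.0


docs = [["web", "corpus", "hyperlink"],
        ["web", "corpus", "crawler"],
        ["speech", "synthesis", "prosody"]]
vectors = tfidf_vectors(docs)
print(cosine_distance(vectors[0], vectors[1]) < cosine_distance(vectors[0], vectors[2]))
print(link_distance({"x", "y"}, {"y", "z"}))
```

Either distance can be plugged into a standard iterative clustering algorithm, and a weighted sum of the two is one simple way to combine them.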


Categorisation

• Categorisation algorithms determine the membership of a document in a pre-defined thematic category

• Content-based categorisation methods measure distance from a representative of the category

• Connectivity-based distance measures are based on the assumption that certain types of hyperlinks lead to documents of the same category
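The connectivity-based assumption can be illustrated with a majority vote over the categories of linked documents. This is a sketch only: the function name, the data layout and the example category labels are assumptions, and it treats every hyperlink type alike rather than selecting "certain types" of links.

```python
from collections import Counter


def categorise_by_links(url, links, known_categories):
    """Guess a category for `url` by majority vote among labelled neighbours.

    `links` is an iterable of (source_url, target_url) pairs and
    `known_categories` maps some URLs to category labels.
    Returns None if no linked document carries a label.
    """
    votes = Counter()
    for source, target in links:
        if source == url and target in known_categories:
            votes[known_categories[target]] += 1
        elif target == url and source in known_categories:
            votes[known_categories[source]] += 1
    return votes.most_common(1)[0][0] if votes else None


links = [("p", "a"), ("p", "b"), ("c", "p"), ("p", "d")]
known = {"a": "Projects", "b": "Projects", "c": "People"}
print(categorise_by_links("p", links, known))
```

A combined method could fall back on this vote only when a content-based classifier is uncertain.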


Summarisation / Keyword Extraction

• Source anchor text has been used to generate short summaries of target web pages.
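A very reduced version of this idea is to count the words used in anchors that point at a page and keep the most frequent ones as keywords. The sketch below assumes a list of (target URL, anchor text) pairs; the function name, the small stopword list and the example anchors are made up for illustration.

```python
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "for", "and", "to", "in", "on", "here", "click"}


def anchor_keywords(target_url, links, top_n=3):
    """Return the most frequent non-stopword tokens in anchors of target_url."""
    counts = Counter()
    for url, anchor_text in links:
        if url != target_url:
            continue
        for token in anchor_text.lower().split():
            token = token.strip(".,;:!?()\"'")
            if token and token not in STOPWORDS:
                counts[token] += 1
    return [token for token, _ in counts.most_common(top_n)]


links = [
    ("http://example.org/lt", "Language Technology World"),
    ("http://example.org/lt", "language technology portal"),
    ("http://example.org/lt", "click here for language resources"),
    ("http://example.org/other", "speech corpus"),
]
print(anchor_keywords("http://example.org/lt", links, top_n=2))
```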


Discovery of Relations

• Hyperlink structure reflects relations between web resources (e.g. between personal homepage, project page, organisation page)

• Relations can be discovered by content-based methods and by connectivity-based methods


Terminology Extraction

• Content-based: extraction of domain terminology by statistical analysis (TF/IDF …) and/or phrasal chunking

• Applicability of connectivity-based methods?


Information Extraction

• Automatic extraction of meta-data
• Extraction of named entities for concept-based indexing
• Extraction of templates/relations for relation-based indexing, and question answering


Ontology Learning

• Extraction of candidates by frequency of occurrence in similar contexts

• Usage of textual clues (“such as”, “sogar” …)
• Applicability of connectivity-based methods?

• Definition and acronym mining
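The "such as" clue can be turned into a simple pattern matcher that proposes (hyponym, hypernym) candidates. The sketch below is a crude approximation: it handles only the English clue "such as", treats the single word before the clue as the broader term, and uses plain words where a real system would use phrasal chunks. The pattern and function name are assumptions.

```python
import re

# Separators inside an enumeration: ", ", " and ", " or ", ", and ", ", or ".
SEP = r"(?:,? and |,? or |, )"

# Matches e.g. "resources such as corpora, lexicons and grammars".
SUCH_AS = re.compile(rf"(\w+),? such as ((?:\w+{SEP})*\w+)")


def such_as_pairs(text):
    """Return (hyponym, hypernym) candidates found via the "such as" clue."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        hypernym = match.group(1)
        for hyponym in re.split(SEP, match.group(2)):
            pairs.append((hyponym, hypernym))
    return pairs


print(such_as_pairs("We collect resources such as corpora, lexicons and grammars."))
```

The candidates would still need filtering, for instance by frequency of occurrence in similar contexts.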


Part III: LT-World Web Corpus

1. Content of LT World

2. Ontology

3. Hyperlinking within LT World

4. Construction of the corpus


LT World: Idea and Context

• The virtual information center is a comprehensive WWW-based information and knowledge service for the entire area of language technology.

• LT World is a “virtual” center in the sense that most information will physically remain with its creators or with other service providers.

• The virtual information center has been online since October 2001 under the name “LT World”, short for “Language Technology World” (www.lt-world.org)


Virtual Information Center - LT World

• Information and Knowledge
– Technical and Scientific Information
• Players and Teams
– Persons, Projects, Organisations
• Resources and Results
– Research Systems, Commercial Products
• Communication and Events
– News, Conferences


LT World Ontology

[Diagram: three ontology layers. Layer 1: Dublin Core; Layer 2: Specific Ontologies (Publications, Products, Projects, People, Corpora etc.); Layer 3: Ontology for CL & LT]


LT World Ontology

• Dimensions
– Linguality (monolingual, multilingual, cross-language)

– Application

– Computational/mathematical methods

– Linguistic Models / Theories

– Level of linguistic description/processing

– Technologies

– Language(s)

• Ontology is modelled in RDF with Protégé 2000


LT World: Coverage

• 99 topic nodes
• 300 NLP tools and products
• 1800 people
• 850 organisations
• 500 projects


Data Acquisition Process

• Manual collection, categorization and annotation of URLs by students and staff

• Sources: conference proceedings and journals, lists of links on the web,

• Self-registration and correction of data by users of the service

• Technical/scientific information in topic nodes has been provided by domain experts


LT World: Topic Nodes

Topic nodes are the main information units of the area “Information and Knowledge”. They are organized in a shallow, slightly multidimensional hierarchy that follows the chapter plan of the second edition of the Language Technology Survey.

Example of the shallow hierarchy:
Information Extraction

• Named Entity Recognition

• Terminology Extraction

• Relation Extraction

• Answer Extraction


Information for each Topic

• Name

• Acronyms

• Aliases (a.k.a.), Term Translations

• Short Definition

• Overview Article (from HLT Survey)

• Topic Websites

• R&D Prototypes/Products

• Projects

• People

• Literature


Hyperlinking between Sections


Corpus Construction

• Start from URLs in LT-World collection
• Expand document set by recursively following outgoing hyperlinks using a web spider (e.g., GNU wget)
• Expand document set by following incoming hyperlinks (“link” query to search engine)
• Expand document set by search engine queries with domain terminology
• Construct document database and link database
• (Filter out irrelevant documents)
• Publish Corpus
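The expansion along outgoing hyperlinks can be sketched as a breadth-first traversal with a depth limit. To keep the example self-contained, fetching and parsing pages is left to a caller-supplied function `get_out_links`, which is hypothetical, as are the other names and the in-memory toy web. Expansion via incoming links and search engine queries is not shown.

```python
from collections import deque


def expand_document_set(seed_urls, get_out_links, max_depth=2):
    """Breadth-first expansion of a seed set along outgoing hyperlinks.

    `get_out_links(url)` must return the URLs linked from `url`.
    Returns the set of collected URLs and a list of (source, target) links.
    Pages found at `max_depth` are collected but not expanded further.
    """
    documents = set(seed_urls)
    link_db = []
    queue = deque((url, 0) for url in documents)
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for target in get_out_links(url):
            link_db.append((url, target))
            if target not in documents:
                documents.add(target)
                queue.append((target, depth + 1))
    return documents, link_db


# Toy "web" standing in for real fetching and link extraction.
toy_web = {"seed": ["a", "b"], "a": ["c"], "b": ["seed"], "c": ["d"]}
documents, link_db = expand_document_set(["seed"], lambda u: toy_web.get(u, []))
print(sorted(documents))
print(link_db)
```

The returned link list is the raw material for the link database; filtering out irrelevant documents would be a separate step applied to the collected set.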


Part IV: Research Directions

Categorisation / Information Extraction

Discovery of Relations for Hyperlinking

Other


Categorisation and Information Extraction

• Research objectives
– find method for categorising documents according to LT-World ontology
– find method for extraction of meta-information

• Compare and combine content-based and connectivity-based methods

• If successful, it will contribute to semi-automatic extension of the coverage of LT-World


Discovery of Relations

• Objective: develop method for finding pairs of related documents, e.g. personal page – organisation page.

• Content-based and connectivity-based methods are applicable

• If successful, it will enable a significant improvement of LT-World (resource discovery, resource annotation)


Other

• Objective: compare and combine content-based and connectivity-based clustering methods

• Applications:
1. Information Retrieval

2. Clustering

3. Summarisation

4. Terminology Extraction

5. Ontology Learning


Conclusion

• Main research interest: comparison and combination of content-based and connectivity-based methods

• Main application impact: going from a set of “seed” web pages to a domain-specific information system