JRC-Ispra, 16.09.04, Slide 1 Multilingual text analysis applications based on automatic Eurovoc indexing Ralf Steinberger Addressing the Language Barrier Problem in the Enlarged EU Automatic Eurovoc Descriptor Assignment JRC Workshop, Ispra, 16/17 September 2004 http://www.jrc.cec.eu.int/langtech


JRC-Ispra, 16.09.04, Slide 1

Multilingual text analysis applications based on automatic Eurovoc indexing

Ralf Steinberger

Addressing the Language Barrier Problem in the Enlarged EU Automatic Eurovoc Descriptor Assignment

JRC Workshop, Ispra, 16/17 September 2004

http://www.jrc.cec.eu.int/langtech


JRC-Ispra, 16.09.04, Slide 2

Applications mentioned so far

• Thesaurus indexing (summarise main concepts of document)
– Fully automatic
– Interactive
– Monolingual and cross-lingual

• Document retrieval
– Monolingual and cross-lingual

Eurovoc indexing can be used for MUCH MORE …


JRC-Ispra, 16.09.04, Slide 3

Main goals of JRC’s Language Technology (LT) activity

• Gather potentially user-relevant documents

• Analyse texts in various languages
– extract information from texts (Eurovoc)
– identify similarity between documents (Eurovoc)
– classify documents (Eurovoc)

• Visualise contents
– of individual documents (Eurovoc)
– of whole document collections (Eurovoc)


JRC-Ispra, 16.09.04, Slide 4

Eurovoc indexing as part of a tool set


JRC-Ispra, 16.09.04, Slide 5

(Cross-lingual) document similarity calculation

[Figure: an English text ("Resolution on radioactive waste") and a Spanish text ("Resolución sobre los residuos radioactivos"); numeric labels 6621020304 and 52160104; label "monolingual"]
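
A minimal sketch of how such a similarity calculation could work, assuming each document has already been indexed with weighted Eurovoc descriptors and that similarity is the cosine between the two descriptor vectors. The slide does not name the measure, and the descriptor identifiers and weights below are invented for illustration. Because descriptor identifiers are the same for every language, one function covers both the monolingual and the cross-lingual case.

```python
import math

def cosine_similarity(profile_a, profile_b):
    """Cosine between two sparse descriptor-weight vectors (dicts)."""
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[d] * profile_b[d] for d in shared)
    norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
    norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical descriptor identifiers and weights, not real Eurovoc output
english_doc = {"radioactive_waste": 0.9, "nuclear_safety": 0.6, "environment": 0.3}
spanish_doc = {"radioactive_waste": 0.8, "nuclear_safety": 0.5, "transport": 0.2}
unrelated_doc = {"fisheries": 0.9, "transport": 0.4}

print(cosine_similarity(english_doc, spanish_doc))    # high: shared descriptors
print(cosine_similarity(english_doc, unrelated_doc))  # no shared descriptors
```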


JRC-Ispra, 16.09.04, Slide 6

(Multilingual) text classification

• Most current approaches to text classification are monolingual

[Figure: documents labelled by language (Es, Fr) assigned to Category 1, Category 2 and Category 3]

• Text classification, via Eurovoc, is multilingual
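
One way to picture classification via Eurovoc is a nearest-centroid classifier over descriptor profiles. This is only a sketch under stated assumptions: the slides do not say which classifier is used, and the category names, descriptor identifiers and weights are hypothetical. The point is that training documents and new documents may be in different languages, since only their descriptor profiles are compared.

```python
import math
from collections import defaultdict

def cosine(a, b):
    dot = sum(w * b.get(d, 0.0) for d, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train_centroids(labelled_profiles):
    """Sum the descriptor weights of all training documents per category."""
    centroids = defaultdict(lambda: defaultdict(float))
    for category, profile in labelled_profiles:
        for descriptor, weight in profile.items():
            centroids[category][descriptor] += weight
    return centroids

def classify(profile, centroids):
    """Return the category whose centroid is most similar to the document."""
    return max(centroids, key=lambda c: cosine(profile, centroids[c]))

# Hypothetical training data; the source language of each document is irrelevant
training = [
    ("energy", {"nuclear_energy": 1.0, "radioactive_waste": 0.5}),
    ("energy", {"nuclear_energy": 0.7, "electricity": 0.6}),
    ("fisheries", {"fishing_quota": 1.0, "sea_fish": 0.4}),
]
centroids = train_centroids(training)
new_doc = {"radioactive_waste": 0.9, "electricity": 0.2}
print(classify(new_doc, centroids))
```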


JRC-Ispra, 16.09.04, Slide 7

(Multilingual) document map

© Cartia’s ThemeScape


JRC-Ispra, 16.09.04, Slide 8

‘Translation Spotting’

Why?
• To test document similarity calculation
• To compile a collection of parallel texts (for the training and testing of other multilingual text analysis applications)
• To detect cross-lingual document plagiarism


JRC-Ispra, 16.09.04, Slide 9

‘Translation Spotting’ - Results

Task: find Spanish translations of an English source document in a parallel text collection

[Chart comparing three methods: DS considering the length of documents; DS correcting the monolingual bias (83%); simple document similarity (DS)]
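
The task can be sketched as a ranking problem: score every candidate in the other language against the source document and pick the best one. The sketch below assumes cosine similarity over descriptor profiles and models "considering the length of documents" as a simple length-ratio factor. Both are assumptions made for illustration, not the method behind the reported results, and all document identifiers, profiles and lengths are invented.

```python
import math

def cosine(a, b):
    dot = sum(w * b.get(d, 0.0) for d, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def length_factor(len_a, len_b):
    """Assumed heuristic: penalise pairs whose lengths differ strongly."""
    longest = max(len_a, len_b)
    return min(len_a, len_b) / longest if longest else 0.0

def spot_translation(source, candidates, use_length=True):
    """Return the id of the best-scoring candidate.

    Each document is a (descriptor_profile, length_in_words) pair.
    """
    src_profile, src_len = source

    def score(item):
        _, (profile, length) = item
        similarity = cosine(src_profile, profile)
        if use_length:
            similarity *= length_factor(src_len, length)
        return similarity

    best_id, _ = max(candidates.items(), key=score)
    return best_id

# Hypothetical documents
english = ({"radioactive_waste": 0.9, "nuclear_safety": 0.5}, 1200)
spanish = {
    "es_17": ({"radioactive_waste": 0.8, "nuclear_safety": 0.6}, 1300),
    "es_42": ({"radioactive_waste": 0.9, "nuclear_safety": 0.5}, 150),
    "es_99": ({"fishing_quota": 1.0}, 1250),
}
print(spot_translation(english, spanish))
print(spot_translation(english, spanish, use_length=False))
```

In this toy data the short document es_42 has an identical profile to the source, so plain similarity prefers it; the length factor shifts the choice to es_17, whose length is close to that of the source.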


JRC-Ispra, 16.09.04, Slide 10

(Multilingual) clustering of documents

• To organise unknown document collections
• Algorithm:
– Find pairs of texts that are most similar
– Group them in one cluster, repeat the operation until only one cluster remains

[Figure: clustering example labelled 90%, 80%, 75%, 40%, 10%]
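
The algorithm described on this slide can be sketched as follows. The slide does not say how similarity between documents or between clusters is measured; this sketch assumes cosine similarity over descriptor profiles and represents a merged cluster by the sum of its members' profiles. Document names and weights are hypothetical.

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(w * b.get(d, 0.0) for d, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def merge_profiles(a, b):
    """Represent a merged cluster by the summed weights of both profiles."""
    merged = dict(a)
    for descriptor, weight in b.items():
        merged[descriptor] = merged.get(descriptor, 0.0) + weight
    return merged

def cluster(documents):
    """Bottom-up clustering; returns merges as (members_a, members_b, similarity)."""
    clusters = {(name,): profile for name, profile in documents.items()}
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters that is most similar
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: cosine(clusters[pair[0]], clusters[pair[1]]))
        merges.append((a, b, cosine(clusters[a], clusters[b])))
        # Group the pair into one cluster and repeat
        clusters[a + b] = merge_profiles(clusters.pop(a), clusters.pop(b))
    return merges

# Hypothetical documents in three languages
docs = {
    "en_waste": {"radioactive_waste": 0.9, "nuclear_safety": 0.5},
    "es_waste": {"radioactive_waste": 0.8, "nuclear_safety": 0.6},
    "fr_fish": {"fishing_quota": 1.0, "sea_fish": 0.3},
    "en_fish": {"fishing_quota": 0.9, "environment": 0.2},
}
for members_a, members_b, similarity in cluster(docs):
    print(members_a, "+", members_b, round(similarity, 2))
```

The recorded similarity at each merge is the kind of value a cluster tree would show at its branching points.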


JRC-Ispra, 16.09.04, Slide 11

Building a (multilingual) cluster tree


JRC-Ispra, 16.09.04, Slide 12

Application to (multilingual) news analysis

The EMM system in JRC’s Web Technology sector retrieves about 20,000 news articles per day in ~20 languages (4,000 articles in English) (http://emm.jrc.it)

• Cluster related news stories and identify duplicates (news topic identification)

• Identify keywords, people’s names, place names, main sentences (information extraction)

• Find related news stories over time (news topic tracking)

• Find related news stories in other languages (cross-lingual topic tracking mainly via Eurovoc and place names)
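
A rough sketch of how cross-lingual linking "mainly via Eurovoc and place names" might combine the two kinds of evidence. The weighted sum, the weights, the threshold, and all cluster identifiers and values are assumptions chosen for illustration; the slide does not describe how the components are combined.

```python
import math

def cosine(a, b):
    dot = sum(w * b.get(d, 0.0) for d, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def jaccard(a, b):
    """Overlap of two sets of place identifiers."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cross_lingual_links(cluster, other_clusters, threshold=0.5,
                        w_eurovoc=0.7, w_places=0.3):
    """Return (cluster_id, score) pairs above the threshold, best first."""
    links = []
    for cluster_id, other in other_clusters.items():
        score = (w_eurovoc * cosine(cluster["eurovoc"], other["eurovoc"])
                 + w_places * jaccard(cluster["places"], other["places"]))
        if score >= threshold:
            links.append((cluster_id, score))
    return sorted(links, key=lambda link: link[1], reverse=True)

# Hypothetical news clusters: one English, two French
english_cluster = {"eurovoc": {"election": 0.8, "head_of_state": 0.5},
                   "places": {"US", "IQ"}}
french_clusters = {
    "fr_1": {"eurovoc": {"election": 0.7, "head_of_state": 0.6}, "places": {"US"}},
    "fr_2": {"eurovoc": {"football": 1.0}, "places": {"PT", "GR"}},
}
print(cross_lingual_links(english_cluster, french_clusters))
```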


JRC-Ispra, 16.09.04, Slide 13

Detection of the major news of the day (EMM)


JRC-Ispra, 16.09.04, Slide 14

Establish links to related news over time


JRC-Ispra, 16.09.04, Slide 15

Establish links to related news in other languages


JRC-Ispra, 16.09.04, Slide 16

Subject-specific summarisation (1)

Title: "Resolution on the 10th anniversary of the Chernobyl accident"

Eurovoc descriptors:


JRC-Ispra, 16.09.04, Slide 17

Subject-specific summarisation (2)

Eurovoc descriptors:


JRC-Ispra, 16.09.04, Slide 18

Further JRC LT applications

• Recognition and translation of:

– Place names; + visualisation

– People’s names; + retrieval of images and further information

– Dates

– Products

• Recognition of text language


JRC-Ispra, 16.09.04, Slide 19

Place name recognition / Cross-lingual display


JRC-Ispra, 16.09.04, Slide 20

Place name recognition / Visualisation

18 references (Boston, American, America, New York)

11 references (Vietnam)

5 references (Iraq) + 1 reference to Sweden (Andre Heinz (…) Swedish based environmental consultant)
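
Counts like these can be produced by mapping every recognised place reference to a country and tallying the results. The sketch below is illustrative only: the mini-gazetteer and the sample sentence are made up, and a simple regular-expression lookup stands in for whatever recognition method the actual tool uses.

```python
import re
from collections import Counter

# Hypothetical mini-gazetteer: surface form -> country
GAZETTEER = {
    "Boston": "United States",
    "New York": "United States",
    "America": "United States",
    "American": "United States",
    "Vietnam": "Vietnam",
    "Iraq": "Iraq",
    "Swedish": "Sweden",
}

def count_country_references(text, gazetteer=GAZETTEER):
    """Count place references per country using whole-word matches."""
    # Longest names first so multi-word and longer forms are tried first
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
    return Counter(gazetteer[m.group(1)] for m in pattern.finditer(text))

sample = ("The Boston convention drew American veterans of Vietnam; "
          "speakers from New York discussed Iraq and Vietnam.")
counts = count_country_references(sample)
for country, n in counts.most_common():
    print(n, "references:", country)
```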


JRC-Ispra, 16.09.04, Slide 21

Place name recognition / Disambiguation

Requires disambiguation
• 14 Paris’, 7 Birminghams
• cities called ‘And’, ‘Annan’
• name variants (exonyms)

Zoom on Europe
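
The three problems listed on this slide suggest three processing steps, sketched here with invented heuristics: normalise exonyms, skip names that are more likely ordinary words or person names, and choose among homonymous places using the countries already found in the text and an importance rank. The gazetteer entries, the importance values, the "Country X" placeholder and the heuristics themselves are assumptions for illustration, not the JRC tool's method.

```python
# Hypothetical gazetteer: name -> list of (country, importance) candidates
GAZETTEER = {
    "Paris": [("France", 3), ("United States", 1)],
    "Birmingham": [("United Kingdom", 2), ("United States", 2)],
    "And": [("Country X", 1)],  # placeholder entry for a place called 'And'
}

# Name variants mapped to the gazetteer's main form
EXONYMS = {"Parigi": "Paris", "Parijs": "Paris"}

# Place names that usually are not place references in running text
AMBIGUOUS_WORDS = {"And", "Annan"}

def disambiguate(name, context_countries=()):
    """Return (normalised_name, country), or None if no place is assumed."""
    name = EXONYMS.get(name, name)
    if name in AMBIGUOUS_WORDS:
        return None
    candidates = GAZETTEER.get(name, [])
    if not candidates:
        return None
    # Prefer candidates in countries already mentioned in the same text
    in_context = [c for c in candidates if c[0] in context_countries]
    country, _ = max(in_context or candidates, key=lambda c: c[1])
    return name, country

print(disambiguate("Parigi"))
print(disambiguate("Birmingham", context_countries={"United States"}))
print(disambiguate("And"))
```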


JRC-Ispra, 16.09.04, Slide 22

Recognising names, places, … - News navigation

Top-mentioned personalities En/Fr news

26 July 2004


JRC-Ispra, 16.09.04, Slide 23

Automatic recognition of name variants


JRC-Ispra, 16.09.04, Slide 24

Automatic link to online encyclopaedia


JRC-Ispra, 16.09.04, Slide 25

News clusters mentioning a person


JRC-Ispra, 16.09.04, Slide 26

Persons talked about in same news clusters


JRC-Ispra, 16.09.04, Slide 27

Countries talked about in same news clusters


JRC-Ispra, 16.09.04, Slide 28

Frequent keywords for these news clusters


JRC-Ispra, 16.09.04, Slide 29

Recognising products and product groups

Sample text


JRC-Ispra, 16.09.04, Slide 30

Recognising products and product groups

Identified products


JRC-Ispra, 16.09.04, Slide 31

Recognising products and product groups

Cross-lingual display of products found


JRC-Ispra, 16.09.04, Slide 32


Multilingual Information Extraction
– Language recognition (demo)
– Keywords (monolingual; cross-lingual)
– Geographical place names (intro; new EU languages; demo)
– Products and product groups (slides; demo JRC, demo CIS)
– Names of people (demo news names, demo recognition, related names, Cyrillic/Greek fuzzy name matching, demo fuzzy matching)
– Dates (demo recognition)
– Terminology extraction
– Summarisation (standard sentence extraction; subject-specific summarisation)

Cross-lingual navigation and classification
– Document similarity (monolingual; cross-lingual; translation spotting)
– Bottom-up document clustering; topic detection (demo news analysis)
– Classification (multi-monolingual and cross-lingual; pre-classification clustering)
– Relevance-ranking of documents (slides)
– News topic tracking (monolingual historical; cross-lingual; demo news analysis)
– Navigate text collections via people, countries, keywords, clusters, across languages (slides; demo news names)

Visualisation of textual contents
– Individual documents (document profile)
– Whole document collections (document map)
– Geographical information (maps; animated maps, demo)
– Clustering (ascii, star, tree), key-word-in-context (KWIC), search, …

Further tools
– Document Gathering (Lang-Tech crawler; WT’s EMM system)
– Document format conversion (PDF, MS-Word, PS, HTML, XML)
– Character set conversion (UTF-8, ISO-Latin, HTML, …)

Projects
– IDoRA for OLAF (slides)
– Cross-lingual Indexing (EUROVOC)
– Breaking News – Detection and Visualisation (BNDV / State-of-the-World)
– SVM for Text Classification
– Modus Operandi
– Ad-hoc analyses (REACH, AM, INFSO project proposals, ADMIN job descriptions, ENV Public Consultation Sustainable Development)

JRC Introduction

Multilingual and crosslingual text analysis