JRC-Ispra, 16.09.04, Slide 1

Multilingual text analysis applications based on automatic Eurovoc indexing

Ralf Steinberger

Addressing the Language Barrier Problem in the Enlarged EU: Automatic Eurovoc Descriptor Assignment

JRC Workshop, Ispra, 16/17 September 2004

http://www.jrc.cec.eu.int/langtech


Applications mentioned so far

• Thesaurus indexing (summarise main concepts of document)
  – Fully automatic
  – Interactive
  – Monolingual and cross-lingual

• Document retrieval
  – Monolingual and cross-lingual

Eurovoc indexing can be used for MUCH MORE …


Main goals of JRC’s Language Technology (LT) activity

• Gather potentially user-relevant documents

• Analyse texts in various languages
  – Extract information from texts (Eurovoc)
  – Identify similarity between documents (Eurovoc)
  – Classify documents (Eurovoc)

• Visualise contents
  – of individual documents (Eurovoc)
  – of whole document collections (Eurovoc)


Eurovoc indexing as part of a tool set


(Cross-lingual) document similarity calculation

[Figure: English text ("Resolution on radioactive waste") and Spanish text ("Resolución sobre los residuos radioactivos"); numeric labels 6621020304 and 52160104; label "monolingual"]
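A minimal sketch of how two documents in different languages could be compared once each has been indexed with Eurovoc descriptors. It assumes each document is reduced to a mapping from descriptor identifier to weight and that cosine similarity is the comparison measure; the slides do not state the actual measure, and the identifiers `D1` to `D6` and all weights are made up for illustration.

```python
import math


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two sparse descriptor-to-weight mappings."""
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[d] * profile_b[d] for d in shared)
    norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
    norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty profile is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical descriptor profiles; real ones would come from the indexer.
english_doc = {"D1": 0.8, "D2": 0.5, "D3": 0.2}
spanish_doc = {"D1": 0.7, "D2": 0.6, "D4": 0.1}
unrelated_doc = {"D5": 0.9, "D6": 0.4}

related_score = cosine_similarity(english_doc, spanish_doc)
unrelated_score = cosine_similarity(english_doc, unrelated_doc)
```

Because the descriptors are language-independent, the same function serves the monolingual and the cross-lingual case; only the indexing step depends on the language of the text.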


(Multilingual) text classification

• Most current approaches to text classification are monolingual

[Figure: documents labelled Es and Fr assigned to Category 1, Category 2 and Category 3]

• Text classification, via Eurovoc, is multilingual
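One way to picture why classification via Eurovoc is multilingual: if every document is first mapped to language-independent descriptors, category profiles learned from texts in one language can be applied to texts in another. The sketch below assumes a simple descriptor-count profile per category and an overlap score; this is an illustrative stand-in, not the classifier used at JRC, and the descriptor and category names are hypothetical.

```python
from collections import Counter


def train_category_profiles(labelled_docs):
    """Count how often each descriptor occurs in the training documents of each category."""
    profiles = {}
    for category, descriptors in labelled_docs:
        profiles.setdefault(category, Counter()).update(descriptors)
    return profiles


def classify(descriptors, profiles):
    """Return the category whose profile shares the largest fraction of its mass with the document."""
    def score(category):
        profile = profiles[category]
        total = sum(profile.values())
        if total == 0:
            return 0.0
        return sum(profile[d] for d in descriptors) / total

    return max(profiles, key=score)


# Hypothetical training data: descriptor sets produced by indexing texts in any language.
training = [
    ("energy", {"D_NUCLEAR", "D_WASTE"}),
    ("energy", {"D_NUCLEAR", "D_SAFETY"}),
    ("fisheries", {"D_FISH", "D_QUOTA"}),
]
profiles = train_category_profiles(training)

# A new document, possibly in a language absent from the training set.
label = classify({"D_WASTE", "D_NUCLEAR", "D_POLICY"}, profiles)
```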


(Multilingual) document map (© Cartia’s ThemeScape)


‘Translation Spotting’

Why?
• To test document similarity calculation
• To compile a collection of parallel texts (for the training and testing of other multilingual text analysis applications)
• To detect cross-lingual document plagiarism


‘Translation Spotting’ - Results

Task: find the Spanish translation of an English source document in a parallel text collection

• DS considering the length of documents
• DS correcting the monolingual bias (83%)
• Simple document similarity (DS)
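The task can be sketched as a ranking problem: score every Spanish candidate against the English source and keep the best one. The example below assumes cosine similarity over descriptor profiles and a made-up length factor (ratio of the shorter to the longer document) to illustrate the idea of taking document length into account; the slides do not say how length or the monolingual bias were actually handled, and all identifiers and numbers are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse descriptor-to-weight mappings."""
    dot = sum(a[d] * b[d] for d in set(a) & set(b))
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def length_factor(len_a, len_b):
    """Hypothetical penalty: shorter length divided by longer length, in [0, 1]."""
    if len_a == 0 or len_b == 0:
        return 0.0
    return min(len_a, len_b) / max(len_a, len_b)


def spot_translation(source, candidates, use_length=True):
    """Return (candidate id, score) for the best-scoring candidate translation."""
    best_id, best_score = None, -1.0
    for cand_id, cand in candidates.items():
        score = cosine(source["profile"], cand["profile"])
        if use_length:
            score *= length_factor(source["length"], cand["length"])
        if score > best_score:
            best_id, best_score = cand_id, score
    return best_id, best_score


# Made-up data: "es_a" covers the same topics but is five times longer than the source.
source = {"profile": {"D1": 0.8, "D2": 0.5}, "length": 1000}
candidates = {
    "es_a": {"profile": {"D1": 0.8, "D2": 0.5}, "length": 5000},
    "es_b": {"profile": {"D1": 0.7, "D2": 0.6}, "length": 1100},
}

plain_best, _ = spot_translation(source, candidates, use_length=False)
length_best, _ = spot_translation(source, candidates, use_length=True)
```

In this toy data the plain similarity prefers the much longer document on the same topic, while the length-aware variant prefers the candidate of comparable size, which is the more plausible translation.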


(Multilingual) clustering of documents

• To organise unknown document collections
• Algorithm:
  – Find the pair of texts that are most similar
  – Group them in one cluster; repeat the operation until only one cluster remains

[Figure labels: 90%, 80%, 75%, 40%, 10%]
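The bottom-up procedure described on this slide can be sketched as follows. The slide does not say how the similarity between two multi-document clusters is computed, so the example assumes average pairwise similarity (average linkage); the similarity matrix is made up.

```python
from itertools import combinations


def cluster_similarity(a, b, sim):
    """Average pairwise similarity between members of clusters a and b (assumed linkage)."""
    return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))


def build_cluster_tree(sim):
    """Repeatedly merge the two most similar clusters; return the list of merges."""
    clusters = [(i,) for i in range(len(sim))]  # start with one cluster per document
    merges = []
    while len(clusters) > 1:
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: cluster_similarity(pair[0], pair[1], sim),
        )
        score = cluster_similarity(a, b, sim)
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(a + b)
        merges.append((a, b, score))
    return merges


# Made-up symmetric similarity matrix for four documents (any languages).
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.1],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
merges = build_cluster_tree(sim)
```

Each entry in `merges` records which two clusters were joined and at what similarity, which is the information a cluster tree displays.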


Building a (multilingual) cluster tree


Application to (multilingual) news analysis

EMM system in JRC’s Web Technology sector retrieves about 20,000 news articles per day in ~20 languages (4000 articles in English) (http://emm.jrc.it)

• Cluster related news stories and identify duplicates (news topic identification)

• Identify keywords, people’s names, place names, main sentences (information extraction)

• Find related news stories over time (news topic tracking)

• Find related news stories in other languages (cross-lingual topic tracking mainly via Eurovoc and place names)


Detection of the major news of the day (EMM)


Establish Links to Related News over time


Establish links to related news in other languages


Subject-specific summarisation (1)

Title: "Resolution on the 10th anniversary of the Chernobyl accident"

Eurovoc descriptors:


Subject-specific summarisation (2)

Eurovoc descriptors:


Further JRC LT applications

• Recognition and translation of:

– Place names; + visualisation

– People’s names; + retrieval of images and further information

– Dates

– Products

• Recognition of text language


Place name recognition / Cross-lingual display


Place name recognition / Visualisation

18 references (Boston, American, America, New York)

11 references (Vietnam)

5 references (Iraq) + 1 reference to Sweden (Andre Heinz (…) Swedish based environmental consultant)


Place name recognition / Disambiguation

Requires disambiguation
• 14 Paris’, 7 Birminghams
• cities called ‘And’, ‘Annan’
• name variants (exonyms)

Zoom on Europe


Recognising names, places, … - News navigation

Top-mentioned personalities En/Fr news

26 July 2004


Automatic recognition of name variants


Automatic link to online encyclopaedia


News clusters mentioning a person


Persons talked about in same news clusters


Countries talked about in same news clusters


Frequent keywords for these news clusters


Recognising products and product groups

Sample text



Identified products



Cross-lingual display of products found


Multilingual Information Extraction
– Language recognition (demo)
– Keywords (monolingual; cross-lingual)
– Geographical place names (intro; new EU languages; demo)
– Products and product groups (slides; demo JRC, demo CIS)
– Names of people (demo news names, demo recognition, related names, Cyrillic/Greek fuzzy name matching, demo fuzzy matching)
– Dates (demo recognition)
– Terminology extraction
– Summarisation (standard sentence extraction; subject-specific summarisation)

Cross-lingual navigation and classification
– Document similarity (monolingual; cross-lingual; translation spotting)
– Bottom-up document clustering; topic detection (demo news analysis)
– Classification (multi-monolingual and cross-lingual; pre-classification clustering)
– Relevance-ranking of documents (slides)
– News topic tracking (monolingual historical; cross-lingual; demo news analysis)
– Navigate text collections via people, countries, keywords, clusters, across languages (slides; demo news names)

Visualisation of textual contents
– Individual documents (document profile)
– Whole document collections (document map)
– Geographical information (maps; animated maps, demo)
– Clustering (ascii, star, tree), key-word-in-context (KWIC), search, …

Further tools
– Document Gathering (Lang-Tech crawler; WT’s EMM system)
– Document format conversion (PDF, MS-Word, PS, HTML, XML)
– Character set conversion (UTF-8, ISO-Latin, HTML, …)

Projects
– IDoRA for OLAF (slides)
– Cross-lingual Indexing (EUROVOC)
– Breaking News – Detection and Visualisation (BNDV / State-of-the-World)
– SVM for Text Classification
– Modus Operandi
– Ad-hoc analyses (REACH, AM, INFSO project proposals, ADMIN job descriptions, ENV Public Consultation Sustainable Development)

JRC Introduction

Multilingual and crosslingual text analysis

Documents
