JRC-Ispra, 16.09.04, Slide 1
Multilingual text analysis applications based on automatic Eurovoc indexing
Ralf Steinberger
Addressing the Language Barrier Problem in the Enlarged EU: Automatic Eurovoc Descriptor Assignment
JRC Workshop, Ispra, 16/17 September 2004
http://www.jrc.cec.eu.int/langtech
JRC-Ispra, 16.09.04, Slide 2
Applications mentioned so far
• Thesaurus indexing (summarise main concepts of document)
  – Fully automatic
  – Interactive
  – Monolingual and cross-lingual
• Document retrieval
  – Monolingual and cross-lingual
Eurovoc indexing can be used for MUCH MORE …
JRC-Ispra, 16.09.04, Slide 3
Main goals of JRC’s Language Technology (LT) activity
• Gather potentially user-relevant documents
• Analyse texts in various languages
  – extract information from texts (Eurovoc)
  – identify similarity between documents (Eurovoc)
  – classify documents (Eurovoc)
• Visualise contents
  – of individual documents (Eurovoc)
  – of whole document collections (Eurovoc)
JRC-Ispra, 16.09.04, Slide 4
Eurovoc indexing as part of a tool set
JRC-Ispra, 16.09.04, Slide 5
(Cross-lingual) document similarity calculation
[Figure: an English text ("Resolution on radioactive waste") and a Spanish text ("Resolución sobre los residuos radioactivos") are compared; residual labels: 6621020304, 52160104, monolingual]
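A minimal sketch of how two documents in different languages could be compared once both are indexed with the same language-independent thesaurus descriptors. The slides do not state the similarity measure; cosine similarity is assumed here, and the descriptor identifiers and weights are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse descriptor-weight vectors (dicts)."""
    shared = set(a) & set(b)
    dot = sum(a[d] * b[d] for d in shared)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical descriptor IDs and weights (not real Eurovoc codes).
english_doc = {"d_radioactive_waste": 0.9, "d_nuclear_safety": 0.4}
spanish_doc = {"d_radioactive_waste": 0.8, "d_nuclear_safety": 0.5,
               "d_environment": 0.1}
unrelated_doc = {"d_fisheries": 1.0}

print(cosine_similarity(english_doc, spanish_doc))    # high: shared descriptors
print(cosine_similarity(english_doc, unrelated_doc))  # 0.0: nothing shared
```

Because the vectors are keyed by descriptor rather than by word, the same function serves the monolingual and the cross-lingual case.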
JRC-Ispra, 16.09.04, Slide 6
(Multilingual) text classification
• Most current approaches to text classification are monolingual
[Figure: documents labelled by language (Es, Fr) assigned to Category 1, Category 2 and Category 3]
• Text classification, via Eurovoc, is multilingual
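A rough sketch of why classification over descriptors is language-independent: one set of category profiles can score documents from any language, provided each document has already been indexed. The scoring rule (a dot product), the category names and the descriptor identifiers are all assumptions for illustration, not the JRC method.

```python
def classify(doc_descriptors, category_profiles):
    """Return the category whose descriptor profile best matches the document."""
    def score(profile):
        # Dot product between document weights and profile weights.
        return sum(weight * profile.get(descriptor, 0.0)
                   for descriptor, weight in doc_descriptors.items())
    return max(category_profiles, key=lambda name: score(category_profiles[name]))


# Hypothetical category profiles keyed by hypothetical descriptor IDs.
CATEGORY_PROFILES = {
    "energy": {"d_radioactive_waste": 1.0, "d_nuclear_energy": 0.8},
    "fisheries": {"d_fishing_quota": 1.0, "d_sea_fish": 0.7},
}

# Descriptor vectors as they might come out of indexing a Spanish and a French text.
spanish_doc = {"d_radioactive_waste": 0.9, "d_environment": 0.2}
french_doc = {"d_fishing_quota": 0.6, "d_sea_fish": 0.5}

print(classify(spanish_doc, CATEGORY_PROFILES))
print(classify(french_doc, CATEGORY_PROFILES))
```

If no descriptor matches any profile, this sketch simply returns the first category; a real system would need a rejection threshold.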
JRC-Ispra, 16.09.04, Slide 7
(Multilingual) document map (© Cartia’s ThemeScape)
JRC-Ispra, 16.09.04, Slide 8
‘Translation Spotting’
Why?
• To test document similarity calculation
• To compile a collection of parallel texts (for the training and testing of other multilingual text analysis applications)
• To detect cross-lingual document plagiarism
JRC-Ispra, 16.09.04, Slide 9
‘Translation Spotting’ - Results
Task: find Spanish translations of an English source document in a parallel text collection
Methods compared (chart labels):
• Simple document similarity (DS)
• DS considering the length of documents
• DS correcting the monolingual bias (83%)
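The slides do not say how document length enters the score. As one possible reading, the sketch below multiplies a precomputed similarity by the ratio of the shorter to the longer document length, so that a candidate of very different length is penalised. The function name, document IDs, scores and lengths are hypothetical.

```python
def spot_translation(source_length, candidates):
    """Pick the candidate with the best length-adjusted similarity.

    candidates: list of (doc_id, similarity_to_source, length) tuples.
    Returns the winning doc_id, or None if there are no candidates.
    """
    if not candidates:
        return None

    def adjusted(candidate):
        _, similarity, length = candidate
        longer = max(source_length, length)
        if longer == 0:
            return 0.0
        # Assumed penalty: ratio of shorter to longer document length.
        return similarity * min(source_length, length) / longer

    return max(candidates, key=adjusted)[0]


# Hypothetical Spanish candidates for an English source of length 1000.
candidates = [
    ("es_001", 0.80, 1050),  # similar content, similar length
    ("es_002", 0.82, 300),   # slightly higher similarity, much shorter
    ("es_003", 0.40, 990),   # similar length, different content
]
print(spot_translation(1000, candidates))
```

In this made-up example, plain similarity would prefer es_002, while the length-adjusted score prefers es_001.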
JRC-Ispra, 16.09.04, Slide 10
(Multilingual) clustering of documents
• To organise unknown document collections
• Algorithm:
  – Find pairs of texts that are most similar
  – Group them in one cluster, repeat the operation until only one cluster remains
[Figure labels: 90%, 80%, 75%, 40%, 10%]
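The bottom-up algorithm described on this slide can be sketched as follows. The slide does not specify how the similarity between two multi-document clusters is computed; average pairwise similarity (average linkage) is assumed here, and the document names and similarity values are made up.

```python
from itertools import combinations


def build_cluster_tree(doc_ids, similarity):
    """Merge the two most similar clusters until only one remains.

    similarity: dict mapping frozenset({a, b}) to a float for every pair of docs.
    Returns a nested tuple describing the merge tree (None for no documents).
    """
    if not doc_ids:
        return None
    # Each cluster is (tree, list of member documents).
    clusters = [(doc_id, [doc_id]) for doc_id in doc_ids]

    def cluster_similarity(members_a, members_b):
        # Assumption: average similarity over all cross-cluster document pairs.
        pairs = [(a, b) for a in members_a for b in members_b]
        return sum(similarity[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_similarity(clusters[ij[0]][1],
                                              clusters[ij[1]][1]),
        )
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical pairwise similarities between three documents.
similarity = {
    frozenset({"A", "B"}): 0.9,
    frozenset({"A", "C"}): 0.2,
    frozenset({"B", "C"}): 0.3,
}
print(build_cluster_tree(["A", "B", "C"], similarity))
```

Since the pairwise similarities can come from descriptor vectors, the same procedure applies unchanged to a collection that mixes languages.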
JRC-Ispra, 16.09.04, Slide 11
Building a (multilingual) cluster tree
JRC-Ispra, 16.09.04, Slide 12
Application to (multilingual) news analysis
The EMM system in JRC’s Web Technology sector retrieves about 20,000 news articles per day in ~20 languages, about 4000 of them in English (http://emm.jrc.it)
• Cluster related news stories and identify duplicates (news topic identification)
• Identify keywords, people’s names, place names, main sentences (information extraction)
• Find related news stories over time (news topic tracking)
• Find related news stories in other languages (cross-lingual topic tracking mainly via Eurovoc and place names)
JRC-Ispra, 16.09.04, Slide 13
Detection of the major news of the day (EMM)
JRC-Ispra, 16.09.04, Slide 14
Establish Links to Related News over time
JRC-Ispra, 16.09.04, Slide 15
Establish links to related news in other languages
JRC-Ispra, 16.09.04, Slide 16
Subject-specific summarisation (1)
Title: "Resolution on the 10th anniversary of the Chernobyl accident"
Eurovoc descriptors:
JRC-Ispra, 16.09.04, Slide 17
Subject-specific summarisation (2)
Eurovoc descriptors:
JRC-Ispra, 16.09.04, Slide 18
Further JRC LT applications
• Recognition and translation of:
– Place names; + visualisation
– People’s names; + retrieval of images and further information
– Dates
– Products
• Recognition of text language
JRC-Ispra, 16.09.04, Slide 19
Place name recognition / Cross-lingual display
JRC-Ispra, 16.09.04, Slide 20
Place name recognition / Visualisation
18 references (Boston, American, America, New York)
11 references (Vietnam)
5 references (Iraq)
1 reference to Sweden (Andre Heinz (…) Swedish based environmental consultant)
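Reference counts of this kind could be produced by looking up place-related words in a gazetteer and tallying them per country. The sketch below is only an illustration under that assumption: the tiny gazetteer, the function name and the sample sentence are hypothetical, and it performs no disambiguation.

```python
import re
from collections import Counter

# Hypothetical mini-gazetteer mapping surface forms to countries.
GAZETTEER = {
    "Boston": "United States",
    "New York": "United States",
    "America": "United States",
    "American": "United States",
    "Vietnam": "Vietnam",
    "Iraq": "Iraq",
    "Swedish": "Sweden",
}


def count_country_references(text, gazetteer=GAZETTEER):
    """Count gazetteer matches in the text, grouped by country."""
    # Try longer names first so multi-word names are matched as a whole.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b")
    return Counter(gazetteer[m.group(1)] for m in pattern.finditer(text))


sample = ("He flew from Boston to New York, then wrote about Vietnam "
          "and an American debate on Iraq.")
print(count_country_references(sample))
```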
JRC-Ispra, 16.09.04, Slide 21
Place name recognition / Disambiguation
Requires disambiguation
• 14 places named Paris, 7 named Birmingham
• cities called ‘And’, ‘Annan’
• name variants (exonyms)
Zoom on Europe
JRC-Ispra, 16.09.04, Slide 22
Recognising names, places, … - News navigation
Top-mentioned personalities En/Fr news
26 July 2004
JRC-Ispra, 16.09.04, Slide 23
Automatic recognition of name variants
JRC-Ispra, 16.09.04, Slide 24
Automatic link to online encyclopaedia
JRC-Ispra, 16.09.04, Slide 25
News clusters mentioning a person
JRC-Ispra, 16.09.04, Slide 26
Persons talked about in same news clusters
JRC-Ispra, 16.09.04, Slide 27
Countries talked about in same news clusters
JRC-Ispra, 16.09.04, Slide 28
Frequent keywords for these news clusters
JRC-Ispra, 16.09.04, Slide 29
Recognising products and product groups
Sample text
JRC-Ispra, 16.09.04, Slide 30
Recognising products and product groups
Identified products
JRC-Ispra, 16.09.04, Slide 31
Recognising products and product groups
Cross-lingual display of products found
JRC-Ispra, 16.09.04, Slide 32
Multilingual Information Extraction
– Language recognition (demo)
– Keywords (monolingual; cross-lingual)
– Geographical place names (intro; new EU languages; demo)
– Products and product groups (slides; demo JRC, demo CIS)
– Names of people (demo news names, demo recognition, related names, Cyrillic/Greek fuzzy name matching, demo fuzzy matching)
– Dates (demo recognition)
– Terminology extraction
– Summarisation (standard sentence extraction; subject-specific summarisation)
Cross-lingual navigation and classification
– Document similarity (monolingual; cross-lingual; translation spotting)
– Bottom-up document clustering; topic detection (demo news analysis)
– Classification (multi-monolingual and cross-lingual; pre-classification clustering)
– Relevance-ranking of documents (slides)
– News topic tracking (monolingual historical; cross-lingual; demo news analysis)
– Navigate text collections via people, countries, keywords, clusters, across languages (slides; demo news names)
Visualisation of textual contents
– Individual documents (document profile)
– Whole document collections (document map)
– Geographical information (maps; animated maps, demo)
– Clustering (ascii, star, tree), key-word-in-context (KWIC), search, …
Further tools
– Document Gathering (Lang-Tech crawler; WT’s EMM system)
– Document format conversion (PDF, MS-Word, PS, HTML, XML)
– Character set conversion (UTF-8, ISO-Latin, HTML, …)
Projects
– IDoRA for OLAF (slides)
– Cross-lingual Indexing (EUROVOC)
– Breaking News – Detection and Visualisation (BNDV / State-of-the-World)
– SVM for Text Classification
– Modus Operandi
– Ad-hoc analyses (REACH, AM, INFSO project proposals, ADMIN job descriptions, ENV Public Consultation Sustainable Development)
JRC Introduction
Multilingual and crosslingual text analysis