The impact of standardized terminologies and domain-ontologies in multilingual information processing


Maruf Hasan, D.Eng.
Senior Researcher

Thai Computational Linguistics Laboratory, Thailand
National Institute of Information and Communications Technology, Japan

The 5th AOS Workshop, Beijing, April 27-29, 2004


Outline

Natural Language Processing (NLP) Research
Cross-Language Information Retrieval
Named Entity Extraction
Integrated Knowledge Management Scenario
Terminology and Ontology Initiatives
The Future: Bootstrapping NiCT resources and technologies
Conclusions



NLP Research

Corpus-based statistical NLP has become a popular research theme in recent years
  Many smart applications exist (e.g., the Google search engine, MS Word's grammar checking, etc.)
  Semantics and knowledge still remain obscured behind words (symbols)
  Meaning and concepts are difficult to extract or build with statistics alone
Bootstrapping helps



New Research Trends

While relying heavily on sophisticated NLP techniques, researchers are paying increasing attention to semi-automatically built lexical and knowledge resources
Outcomes
  Increasing number of monolingual lexical resources
  Increasing number of multilingual dictionaries, thesauri, and generalized ontologies
  Increasing number of specialized ontologies
  Increasing number of bootstrapping approaches that get the best from both ends: augmenting statistically extracted knowledge with manually encoded knowledge, and vice versa



Two perspectives of Information/Knowledge

Content Management Perspective
  Metadata (e.g., Dublin Core Metadata, 15 elements)
  Taxonomy/thesauri (augmenting the Keyword field)
  Analogy: HTML (fixed set of tags)
Content Harnessing Perspective
  Machine-understandable content
  Conceptual and associative hierarchy based on content
  Ontology (modeling a domain with concepts and their relationships from the domain expert's perspective)
  Analogy: XML (tags are not fixed)



Interoperability

XML technology revolutionized the computing industry in terms of data interoperability and exchange
Ontology has started bringing new dimensions to modeling information and knowledge in the same way
  Traditional dictionaries and thesauri suffered badly from interoperability problems
  Ontology offers a flexible framework for knowledge modeling (similar to that of XML in data manipulation)



Bootstrapping: How It Helps

Two major pitfalls with ontology
  Developing ontologies (expensive! requires Knowledge Engineers)
  Populating ontologies (labor intensive! semi-automatic means exist)
Bootstrapping: a simple example
  X is identified as a Person in the ontology but Y is not
  Analyzing a piece of text with NLP tools, we find evidence that, for example, X and Y are conducting research in an organization on some projects
  It is easy to infer that Y is a person (and also her affiliation, research interests, etc., through similar analysis)
  NLP techniques help in semi-automatically populating an ontology
  NLP tools and algorithms can be further augmented with the help of ontology-driven knowledge
What if the text offers no such evidence, for example that X is also a family friend of Y?
  How can we possibly deal with such cases? I will show an example later
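The bootstrapping example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the relation triples stand in for the output of an NLP pipeline, and the names `bootstrap_types`, `LabA`, and `conducts_research_at` are hypothetical, not part of any system described in these slides. An untyped entity receives the type most often seen among already-typed entities that fill the same role of the same relation.

```python
from collections import Counter, defaultdict

# Hypothetical toy ontology: entity -> known type.
ontology = {"X": "Person", "LabA": "Organization"}

# Hypothetical (subject, relation, object) triples, standing in for NLP output.
triples = [
    ("X", "conducts_research_at", "LabA"),
    ("Y", "conducts_research_at", "LabA"),
    ("Z", "visited", "W"),  # no typed entity shares this relation
]

def bootstrap_types(ontology, triples):
    """Return a copy of the ontology with types inferred from shared roles."""
    enriched = dict(ontology)
    changed = True
    while changed:
        changed = False
        # Count the known types observed in each (relation, role) slot.
        role_types = defaultdict(Counter)
        for subj, rel, obj in triples:
            if subj in enriched:
                role_types[(rel, "subject")][enriched[subj]] += 1
            if obj in enriched:
                role_types[(rel, "object")][enriched[obj]] += 1
        # Give each untyped entity the majority type of its slot, if any.
        for subj, rel, obj in triples:
            for entity, role in ((subj, "subject"), (obj, "object")):
                votes = role_types[(rel, role)]
                if entity not in enriched and votes:
                    enriched[entity] = votes.most_common(1)[0][0]
                    changed = True
    return enriched

enriched = bootstrap_types(ontology, triples)
print(enriched["Y"])
```

Entities such as Z and W stay untyped because no typed entity shares their relation, which is the "no evidence" case raised above.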



Human factors: Why MT fails but IR wins

So far, Information Retrieval (IR) applications, including search engines such as Google, have been largely successful, but Machine Translation (MT) systems have not. Reasons include:
  Failures in modeling linguistic and extra-linguistic phenomena, context, concepts, etc.
  Human tolerance differs between finding information and judging translation quality
  • Human tolerance: [(low) Written, Audio, Video (high)]
Case study: Telstra voice-operated directory service – a failure from the user's perspective but a successful investment from Telstra's point of view
  Many queries (70%) are repeated, and the system handles them quickly (success from Telstra's perspective). But when a user enquires about rare entities, the system fails (failure from the user's perspective).



Cross-Language Information Retrieval

Cross-Language Information Retrieval (CLIR) is crucial
  Why: Querying in one's native language is comfortable, but every now and then the most valuable information related to our search is probably available in another language
  How: Translating the queries or the document collection (using a simplified MT model) to find information in other languages
  Economic factor: Finding relevant information at a low cost (using noisy translation) is possible. After receiving a list of documents and selecting the relevant ones (as we often do with Google), we can take the (costly) decision of whether or not to translate the information.
  That is, even if our foreign-language proficiency is limited, we can still make sense of information from other cues (tables, graphs, etc.) and take the right decision.
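The "How" step, query translation with a simplified model, can be illustrated with a dictionary lookup followed by term matching. This is a minimal sketch, not a description of any particular CLIR system: the dictionary `EN_JA`, the collection `DOCS`, and both function names are hypothetical, and every translation of a query word is kept, which is exactly the noisy, low-cost translation mentioned above.

```python
# Hypothetical English -> Japanese dictionary; a real system would load a
# bilingual lexicon instead of this hand-written sample.
EN_JA = {
    "rice": ["米", "稲"],
    "disease": ["病気", "病害"],
}

# Hypothetical Japanese document collection: id -> text.
DOCS = {
    "d1": "稲の病害に関する報告",
    "d2": "大学の研究プロジェクト",
    "d3": "米の価格",
}

def translate_query(query, dictionary):
    """Replace each query word with all of its dictionary translations."""
    terms = []
    for word in query.lower().split():
        terms.extend(dictionary.get(word, []))
    return terms

def retrieve(query, dictionary, docs):
    """Rank documents by the number of translated terms they contain."""
    terms = translate_query(query, dictionary)
    scores = {doc_id: sum(term in text for term in terms)
              for doc_id, text in docs.items()}
    # Keep only matching documents; break ties by document id.
    return sorted((d for d in scores if scores[d] > 0),
                  key=lambda d: (-scores[d], d))

results = retrieve("rice disease", EN_JA, DOCS)
print(results)
```

Only the documents that survive this cheap step need to be considered for full (costly) translation.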



Cross-Language Information Retrieval (2)

Multilingual dictionaries or simplistic MT models are typically used
  Although noisy to some extent, language pairs such as Chinese and Japanese can take advantage of Hanzi (Kanji) semantics
  This is also applicable to alphabetic languages if we map words to their root forms
  Further enhancements, for example Latent Semantic Indexing (or other conceptual retrieval techniques), help in mapping symbolic words to abstract concepts
  Statistically built dictionaries (based on statistical correlation) have also proved effective in CLIR
CLIR Demo
  In CLIR, the best effect is achieved if the user is guided through a correlation dictionary (statistically created) and an ontology (manually crafted)
  Associative relationships are better captured by statistical correlation
  Hierarchical relationships are better captured in ontologies or KBs
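One way a statistically built correlation dictionary could be derived is sketched below. The slides do not say which statistic was used in the demo; the Dice coefficient over sentence-aligned pairs is an assumption chosen for illustration, and the corpus `ALIGNED`, the threshold, and the function name are all hypothetical.

```python
from collections import Counter
from itertools import product

# Hypothetical sentence-aligned corpus: (English tokens, Japanese tokens).
ALIGNED = [
    (["rice", "field"], ["稲", "田"]),
    (["rice", "price"], ["稲", "価格"]),
    (["price", "index"], ["価格", "指数"]),
]

def correlation_dictionary(pairs, threshold=0.7):
    """Map each source word to target words whose Dice score meets the threshold."""
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in pairs:
        src_set, tgt_set = set(src_tokens), set(tgt_tokens)
        src_freq.update(src_set)
        tgt_freq.update(tgt_set)
        # Count every source/target word pair seen in the same aligned pair.
        joint.update(product(src_set, tgt_set))
    dictionary = {}
    for (s, t), n in joint.items():
        dice = 2 * n / (src_freq[s] + tgt_freq[t])
        if dice >= threshold:
            dictionary.setdefault(s, []).append((t, round(dice, 2)))
    for s in dictionary:
        # Strongest association first; ties broken alphabetically.
        dictionary[s].sort(key=lambda item: (-item[1], item[0]))
    return dictionary

corr = correlation_dictionary(ALIGNED)
print(corr["rice"])
```

Such a dictionary captures associative links only; hierarchical relations (broader/narrower concepts) would still have to come from a manually crafted ontology.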



Searching Idiosyncrasies (pseudo CLIR)

Experiment with Kanji semantics
  Searching "大学" on Google
    大学 site:cn
    大学 site:jp
  • The word 大学 has the same meaning in both Japanese and Chinese
Experiment with different servers
  Searching "DNA" on different Google local sites
    www.google.co.jp
    www.google.co.th
    www.google.com
  • The retrieved results are quite different
When it comes to information, we prefer to harness it in an integrated fashion.
Communication and connectivity are no longer barriers, but languages are!



Dilemma in Named-Entity Extraction

Named Entities play an important role in harnessing information
Significant research effort has been channeled into automatic Named Entity Extraction, using simple heuristics as well as sophisticated machine learning algorithms.
For some reason, the task has remained restricted to a few entity types
  Organization, Person, Location, Date, Time, Money, Percent
In specific domains such as bio- or agro-informatics, the notion of named entities is broader (and, of course, different from the above)
  Domain-specific entities are important. With carefully designed tools (using NLP techniques), it is possible to identify domain-specific entities
  Event extraction is more difficult but crucial in harnessing information
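The "simple heuristics" end of the spectrum can be illustrated with regular expressions for three of the generic types plus a small gazetteer for a domain-specific type. This is a sketch under assumptions: the patterns cover only a few surface forms, and the gazetteer entry and its label `CropDisease` are hypothetical examples, not taken from any tool mentioned in these slides.

```python
import re

# Deliberately narrow patterns for three generic entity types.
PATTERNS = {
    "Date": re.compile(
        r"\b(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December) \d{1,2}(?:-\d{1,2})?, \d{4}\b"),
    "Money": re.compile(r"(?:USD|\$) ?\d[\d,]*(?:\.\d+)?"),
    "Percent": re.compile(r"\b\d+(?:\.\d+)?%"),
}

# Hypothetical domain gazetteer: term -> domain-specific entity type.
GAZETTEER = {"rice blast": "CropDisease"}

def extract_entities(text):
    """Return (label, surface form) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    for term, label in GAZETTEER.items():
        for match in re.finditer(re.escape(term), text, re.IGNORECASE):
            found.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(found)]

sample = ("Rice blast reduced yields by 12.5% according to the "
          "April 27-29, 2004 workshop.")
entities = extract_entities(sample)
print(entities)
```

Heuristics like these are cheap but brittle; event extraction in particular needs more than surface patterns.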



Integrated Knowledge Management

In an optimal scenario, we need to elicit knowledge from 3 different sources and manage it in an integrated fashion
  Knowledge extracted from symbolic systems (written text, utterances, etc.) – relatively explicit but not so precise!
  More precise knowledge encoded in ontologies and KBs (semi-automatic) – converted from implicit towards explicit forms!
  Experts' tacit knowledge – possible to capture in a system if the experts cooperate.
Ontology-based knowledge representation is the most appropriate representation so far, because it is understood by both humans and machines
Ontologies, if not maintained regularly, can soon become outdated.
There are certain other pitfalls, which can be circumvented through sophisticated NLP techniques, bootstrapping, and an indexing scheme.
  See examples in the following slides



An Integrated KM Scenario

An “academic ontology” about people, projects, organisations, project reports, etc. within an organization (precise knowledge: ontologies are populated semi-automatically, sometimes from databases)

A set of sophisticated “NLP Tools” for tokenizing, parsing, text classification, etc. (non-precise knowledge: extracted from text automatically)

A group of users/experts who are inspired to make things better (Tacit Knowledge) by giving feedback.

A Spreading Activation based indexing scheme is used to capture and propagate changes in a bootstrapped fashion

cf. Hasan, M.M. (2004). Spreading Activation Framework for Ontology-enhanced Effective Information Access within Organisations. In van Elst, L. et al. eds.: "Agent-Mediated Knowledge Management". Springer’s Lecture Notes in Computer Science, Vol. 2926, pp. 288-296. Also published in the proceedings of the AAAI Spring Symposium, AMKM-2003, USA.
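For readers unfamiliar with spreading activation, a generic version of the idea is sketched below. It is not the indexing scheme of the cited paper: the graph, its weights, the decay and threshold values, and the `reinforce` feedback rule are all assumptions made for illustration. Activation starts at seed nodes, flows along weighted links with decay, and a feedback step adjusts a link weight so later searches behave differently.

```python
def spread_activation(graph, seeds, decay=0.5, threshold=0.1, max_steps=3):
    """graph: node -> {neighbor: weight}; seeds: node -> initial activation."""
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(max_steps):
        next_frontier = {}
        for node, energy in frontier.items():
            for neighbor, weight in graph.get(node, {}).items():
                passed = energy * weight * decay
                if passed >= threshold:  # drop signals that are too weak
                    next_frontier[neighbor] = next_frontier.get(neighbor, 0.0) + passed
        if not next_frontier:
            break
        for node, energy in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + energy
        frontier = next_frontier
    return activation

def reinforce(graph, source, target, delta):
    """Hypothetical feedback rule: nudge a link weight, clamped to [0, 1]."""
    links = graph.setdefault(source, {})
    links[target] = min(1.0, max(0.0, links.get(target, 0.0) + delta))

# Hypothetical network mixing ontology instances and indexed documents.
GRAPH = {
    "query:ontology": {"report:R1": 1.0, "person:X": 0.8},
    "report:R1": {"project:P1": 1.0},
    "person:X": {"project:P1": 1.0},
    "project:P1": {"person:Y": 1.0},
}

before = spread_activation(GRAPH, {"query:ontology": 1.0})
reinforce(GRAPH, "project:P1", "person:Y", -0.5)  # an expert downgrades this link
after = spread_activation(GRAPH, {"query:ontology": 1.0})
print(before["person:Y"], after["person:Y"])
```

In this toy run the feedback halves the weight of one link, so person:Y receives less activation in the second search.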



Heterogeneous Sources of Knowledge



But, Integrated Manipulation

Underneath, there is a spreading activation based indexing structure which changes over time

Expert’s feedback is also captured and propagated into the network

Commercial systems have been developed using similar techniques (e.g., TeSSI® from L&C Global in the pharmaceutical domain, using a multilingual pharmaceutical ontology developed under an EU initiative)



Lexical and Ontological Resources

China: HowNet (similar to WordNet, with broader conceptual coverage)
Japan: EDR Dictionary – a set of dictionaries including bilingual E-J dictionaries, a dictionary of technical terms, and a concept dictionary; NTT Goi Taikei, etc.
Korea: KORTERM initiative
Thailand: TCL’s Computational Lexicon



Lexical and Ontological Resources (2)

GENIA Annotated Corpus and GENIA Ontology from the University of Tokyo, for bioinformatics research based on Medline abstracts
Multilingual specialized ontologies are still rare but likely to appear
Similar resources in the agricultural domain, including the AGROVOC thesaurus and related ontologies and resources (corpora)
FAO’s Bio-Safety Ontology:
  Frequent verbs (free-text corpus) → Arguments (NPs) → KAON concepts → Domain experts
  A bootstrapping approach to creating an ontology



NiCT Language Resources

EDR Lexicons
  NiCT acquired all copyrights of the EDR electronic dictionary in 2002 and is able to distribute it for a nominal handling fee.
  Word Dictionaries
    Japanese Word Dictionary (260,000)
    English Word Dictionary (190,000)
  Bilingual Dictionaries
    Jpn.-Eng. Bilingual Dictionary (240,000)
    Eng.-Jpn. Bilingual Dictionary (160,000)
  Concept Dictionary (410,000)
  Co-occurrence Dictionary
    Japanese Co-occurrence Dictionary (930,000)
    • 20,000 Japanese example sentences
    English Co-occurrence Dictionary (460,000)
    • 12,000 English example sentences
  Technical Terminology Dictionary (110,000 Japanese & 70,000 English entries)



NiCT Language Resources (2)

Multilingual Annotated Corpus
  40,000 Japanese sentences from the Mainichi Newspaper (i.e., the Kyoto University Corpus)
    Morphologically and syntactically annotated
  English translation (manually translated); phrase alignment done
    Syntactic annotation (based on Penn Treebank)
    10,000 sentences will be translated and aligned at the phrase level in April 2004 (tentative)
  Chinese translation (manually translated)
    10,000 sentences are already translated
And many other tools and linguistic resources
  Project Gutenberg Corpus (English-Japanese bilingual sentence-aligned corpus)
  SST Learner Corpora (with error annotation)



Conclusions

In this new era of ubiquitous connectivity, integrated processing of information is a necessity.
Language (not physical communication/bandwidth) remains the strongest barrier.
Multilingual resources (dictionaries, thesauri, corpora) are either rare or incomplete
  AGROVOC still doesn’t cover many languages (including Japanese)
Effective processing of multilingual information needs a concerted effort in resource building and standardization
  Especially in terminology and interoperable ontology standards
Multilingual resources, along with an effective bootstrapping strategy, will help us overcome the difficulties in NLP and multilingual information processing
With the resources and technologies we have at NiCT, it could be worthwhile to try extending AGROVOC and related ontologies to cover Japanese
  AGROTERM from AFFRC-Japan contains 57,000 agricultural terms extracted from a corpus using NLP tools.
  Aligning AGROTERM or other similar resources with AGROVOC semi-automatically is a useful challenge.
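What "semi-automatic" alignment might look like is sketched below, under strong simplifying assumptions: both term lists are taken to be already in the same language and script (for instance, via English glosses), the sample terms are invented and not drawn from AGROTERM or AGROVOC, and string similarity from Python's standard `difflib` stands in for whatever matching a real project would use. Exact matches are accepted automatically, near matches are queued for a human expert, and the rest are left unmatched.

```python
from difflib import SequenceMatcher

def align_terms(source_terms, target_terms, accept=1.0, review=0.75):
    """Split source terms into auto-aligned, needs-review, and unmatched."""
    auto, to_review, unmatched = [], [], []
    for term in source_terms:
        # Score the term against every target entry, ignoring case.
        scored = [(SequenceMatcher(None, term.lower(), cand.lower()).ratio(), cand)
                  for cand in target_terms]
        score, best = max(scored)
        if score >= accept:
            auto.append((term, best))
        elif score >= review:
            to_review.append((term, best, round(score, 2)))
        else:
            unmatched.append(term)
    return auto, to_review, unmatched

# Invented sample terms, for illustration only.
source = ["rice", "paddy soils", "irrigation canal"]
target = ["Rice", "Paddy soil", "Wheat"]

auto, to_review, unmatched = align_terms(source, target)
print(auto, to_review, unmatched)
```

The review queue is where domain experts come in, which keeps the process semi-automatic rather than fully automatic.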
