NooJ international Conference, Komotini, May 2010 Portability of Armenian Corpus by Nooj Anaid Donabedian & Victoria Khurshudian Institut National des Langues et Civilisations Orientales (INALCO), Paris


NooJ international Conference, Komotini, May 2010

Portability of Armenian Corpus by Nooj

Anaid Donabedian & Victoria Khurshudian

Institut National des Langues et Civilisations Orientales (INALCO), Paris


Armenian: preliminaries

an Indo-European language

right-branching

of the accusative type

typically SOV

predominantly agglutinative morphology


Historical Armenia


Republic of Armenia


Periodization

prealphabetical

alphabetical (405 A.D. to the present)

1. Old Armenian or Grabar (V-XI);

2. Middle Armenian (XII-XVI);

3. Modern Armenian (XVII – up to present)

Western (based on the Constantinople dialect), with dialects…

Eastern (based on the Ararat dialect), with dialects…


Objective

Provide data compatibility and portability between Nooj and the Eastern Armenian National Corpus (EANC) platform


What is the Eastern Armenian National Corpus?

www.eanc.net

Corpus Technologies

Michael Daniel, Victoria Khurshudian, Dmitri Levonian, Vladimir Plungian, Alexey Polyakov, Sergey Rubakov


Source texts → PARSER (annotation algorithm + grammatical dictionary) → Annotated texts
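As a rough illustration of the pipeline on this slide, the sketch below annotates source text by looking up each token in a grammatical dictionary. It is a minimal sketch under stated assumptions: the dictionary entries, the tag names, and the `tokenize` and `annotate` functions are hypothetical placeholders, not the actual EANC parser, its annotation algorithm, or its tag set.

```python
import re

# Toy grammatical dictionary: wordform -> list of possible analyses.
# Entries and tag names are invented for illustration only.
GRAM_DICT = {
    "տուն": [{"lexeme": "տուն", "gram": ["N", "sg", "nom"]}],
    "տներ": [{"lexeme": "տուն", "gram": ["N", "pl", "nom"]}],
}

def tokenize(text):
    """Split text into word tokens (runs of Unicode letters and digits)."""
    return re.findall(r"\w+", text)

def annotate(text, dictionary=GRAM_DICT):
    """Attach every dictionary analysis to each token.

    Tokens missing from the dictionary get an empty analysis list,
    so unknown words stay visible in the output.
    """
    annotated = []
    for token in tokenize(text):
        analyses = dictionary.get(token.lower(), [])
        annotated.append({"wordform": token, "analyses": analyses})
    return annotated
```

Keeping a list of analyses per token, rather than a single one, leaves room for the homonymy that a real grammatical dictionary would have to represent.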


EANC History

Moscow, Russia

March 2006: Project Launch

July 2007: 1st Release

May 2008: 2nd Release

March 2009: 3rd Release


Eastern Armenian National Corpus (EANC) is:

about 110 million tokens

morphological and other markup

English translations for frequent tokens

covers Standard Eastern Armenian (SEA) from the mid-19th century to the present

both written and oral discourse

full-text view for over 100 Armenian classic titles

open internet access


Written Discourse

over 106 mln. tokens

510 authors (1841-2009)

1039 fiction texts (including 206 translated texts)

7858 press issues

non-fiction (scientific and other) texts


Oral Discourse (3.5 mln. tokens)

Spontaneous discourse

Polylogues

Task-oriented discourse

TV show transcripts

Movies …

The EANC oral corpus has been recorded and transcribed in full by the project.


EANC Functionality


Search Functionality

Token queries

Context queries

Subcorpus selection


Simple token queries:

• lexeme search

• wordform search

• gram search

• translation search

• lexeme + gram search


Advanced options for token queries:

case-sensitivity

punctuation marks

position in the sentence

wildcard (*)

logical operators (e.g. ‘or’: |)

negated features

grammatical/lexical homonymy inclusion/exclusion
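Several of the options above (case sensitivity, the `*` wildcard, the `|` operator, and negated features) can be sketched as a single token-matching function. This is only an illustration: the token layout, the feature labels, and the `match_wordform` and `match_token` names are assumptions made for the example, not the EANC query engine or its query syntax.

```python
from fnmatch import fnmatchcase

def match_wordform(wordform, pattern, case_sensitive=False):
    """Match a wordform against '|'-separated alternatives.

    Each alternative may contain the '*' wildcard. Note that fnmatchcase
    also treats '?' and '[...]' as special characters.
    """
    if not case_sensitive:
        wordform, pattern = wordform.lower(), pattern.lower()
    return any(fnmatchcase(wordform, alt.strip()) for alt in pattern.split("|"))

def match_token(token, pattern="*", required=(), excluded=(), case_sensitive=False):
    """Return True if the token satisfies the wordform and feature constraints.

    required: features the token must carry.
    excluded: features the token must not carry (negated features).
    """
    grams = set(token["gram"])
    return (
        match_wordform(token["wordform"], pattern, case_sensitive)
        and grams.issuperset(required)
        and grams.isdisjoint(excluded)
    )
```

For example, `match_token(tok, "տն*", required=["N"], excluded=["pl"])` would accept only non-plural nouns whose wordform starts with «տն», given a token dictionary with hypothetical `wordform` and `gram` keys.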


Subcorpus selection by:

time

author(s) / title(s)

genres

types of texts (translated vs. original)

superposition of any of the above
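Subcorpus selection by combined criteria amounts to filtering documents on their metadata. The following is a minimal sketch; the metadata field names (`year`, `author`, `genre`, `translated`) and the `select_subcorpus` function are hypothetical and do not reflect how EANC stores its metadata.

```python
def select_subcorpus(documents, years=None, authors=None, genres=None, translated=None):
    """Keep the documents that satisfy every criterion that was given.

    years: (first, last) inclusive range, or None to skip the check.
    authors, genres: collections of accepted values, or None.
    translated: True, False, or None to accept both kinds of text.
    """
    selected = []
    for doc in documents:
        if years is not None and not (years[0] <= doc["year"] <= years[1]):
            continue
        if authors is not None and doc["author"] not in authors:
            continue
        if genres is not None and doc["genre"] not in genres:
            continue
        if translated is not None and doc["translated"] != translated:
            continue
        selected.append(doc)
    return selected
```

Because every criterion is optional and all given criteria must hold at once, any superposition of time, author, genre, and text type can be expressed with one call.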


Display options

context expanding

‘sort by’ (time, lexeme, wordform etc.)

Latin transliteration

glossed display

KWIC (key word in context)
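A KWIC display lists each occurrence of a keyword together with a fixed window of neighbouring tokens. The sketch below shows the general technique on a plain token list; the `kwic` function and its `window` parameter are illustrative assumptions, not the EANC implementation.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every occurrence.

    window is the maximum number of tokens shown on each side.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines
```

Returning the three parts separately lets a display layer align the keyword in a central column and sort the lines by left or right context.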


Transliterated samples:


Glossed samples:


KWIC samples:


Main Current Tasks:

Make Nooj-based Western Armenian morphological annotation compatible with the EANC grammatical dictionary structure

Make the EANC and Nooj Western Armenian platforms mutually portable

Achieve full mutual coverage of Nooj and EANC capabilities (e.g. the syntactic annotation available in Nooj)
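One way to picture the first task is as a mapping between two tag inventories. The sketch below assumes Nooj-style feature strings of the form `CATEGORY+feature+feature`; the mapping table, the EANC-side labels, and the `convert_features` function are all invented for illustration and do not reproduce either project's real tag set.

```python
# Hypothetical correspondence between Nooj-style features and EANC-style tags.
NOOJ_TO_EANC = {
    "N": "N",
    "sg": "sg",
    "pl": "pl",
    "Nom": "nom",
}

def convert_features(nooj_features, mapping=NOOJ_TO_EANC):
    """Convert a '+'-separated feature string into a list of mapped tags.

    Returns (converted, unmapped) so that features with no counterpart
    are reported instead of being silently dropped.
    """
    converted, unmapped = [], []
    for feat in nooj_features.split("+"):
        if feat in mapping:
            converted.append(mapping[feat])
        else:
            unmapped.append(feat)
    return converted, unmapped
```

Reporting unmapped features separately makes coverage gaps between the two annotation schemes easy to list and review.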
