
GSK: Development and Distribution of Resources

Page 1: GSK: Development and Distribution of Resources

GSK: Development and Distribution of Resources

Hitoshi ISAHARA

GSK: Gengo Shigen Kyokai (Language Resource Association)

National Institute of Information and Communications Technology (NICT)

Licensing and Distribution of Resources and Applications

Page 2: GSK: Development and Distribution of Resources

Regional Conference on Localized ICT Development and Dissemination across Asia

Jan. 15, Vientiane, Laos

Organizing Creation & Utilization of Language Corpora

Creating language corpora is costly.

Using them requires a system for distributing corpora.

Some such activities started in the early 1990s:

1992: LDC in the U.S.A.

1995: ELRA in Europe

Page 3: GSK: Development and Distribution of Resources


Japanese Activities

GSK: Gengo Shigen Kyokai (Language Resource Association)

Launched in 1999.

Reformed as an NPO in 2003.

Project accepted in 2005 for 3 years.

Text corpora are its main concern at present.

NII-SRC distributes speech corpora.

Page 4: GSK: Development and Distribution of Resources


GSK and NII-SRC

Language Resource Association (GSK)

A nonprofit organization collecting and distributing text and speech corpora.

http://www.gsk.or.jp/

NII-Speech Resources Consortium (NII-SRC)

Collects and distributes most major speech corpora.

http://research.nii.ac.jp/src/eng/

These two organizations aim to play central roles in collecting and distributing speech and language corpora in Japan.

Page 5: GSK: Development and Distribution of Resources


[Diagram; labels follow]

Knowledge Information Processing Technologies Committee

Language Resource Sub-committee

JEITA (Japan Electronics and Information Technology Industries Association)

Natural Language Processing Portal Site

SHACHI: Language Resource Metadata DB

NICT: National Institute of Information and Communications Technology

GSK

NII-SRC

TCL

NII: National Institute of Informatics

Page 6: GSK: Development and Distribution of Resources


Purpose of GSK

Collection, distribution, investigation, research, and standardization of electronic data and software tools necessary for the promotion of science, technology, education and industry concerning natural language.

Page 7: GSK: Development and Distribution of Resources


GSK Organization

President

Two vice presidents

11 board members

25 steering committee members

All are voluntary workers.

Page 8: GSK: Development and Distribution of Resources


No-fee Distribution

[Diagram: Provider, GSK, and User, connected by flows labeled Agreement, Distribution permission, Corpus, and Payment]

As a rule, the cost of handling corpora falls on the user, though the corpus itself is free of charge.

Page 9: GSK: Development and Distribution of Resources


Agency

[Diagram: Provider, GSK, and User, connected by flows labeled Agency, Commission, Request Form, Payment, and Agreement]

The providers of the corpora entrust GSK with requests received from users. GSK mediates between users and providers.

Page 10: GSK: Development and Distribution of Resources


Advertising

[Diagram: Provider, GSK, and User, connected by flows labeled Ad request, Ad rate, Payment, Agreement, and Publicity]

Corpus providers entrust GSK with advertising useful information about their data or corpora.

Page 11: GSK: Development and Distribution of Resources


Some Examples of GSK Corpora

JEITA Multimodal Corpus

Japanese Web N-gram Version 1

CICC Multilingual Dictionary

IPAL Lexicon of Basic Japanese

Page 12: GSK: Development and Distribution of Resources


JEITA Multimodal Corpus

A corpus of person-to-person task-oriented dialogues.

Includes 80 min. of video for 9 conversations on the topics of “faces” and “travel”.

Speech data is transcribed and provided with annotations indicating morphemes, dialogue structure, and prosody.

Contained in 1 DVD-R (800 MB).

Page 13: GSK: Development and Distribution of Resources


Japanese Web N-gram Version 1

N-grams extracted from publicly available Japanese web pages crawled by Google.

Pages requiring special permission to browse, or marked noarchive/noindex, are not included.

N-grams (1-grams to 7-grams) with frequency greater than 20 were extracted from approximately 20 billion sentences.

Contained in 6 DVD-Rs (26 GB after gzip compression).
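
The extraction described above (count all n-grams up to a maximum order, then keep those above a frequency threshold) can be sketched as follows. This is a minimal illustration, not the pipeline used to build the corpus: the function names count_ngrams and filter_by_frequency are hypothetical, and the input is assumed to be already split into tokens, which for Japanese would normally require a morphological analyzer.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

NGram = Tuple[str, ...]


def count_ngrams(sentences: Iterable[Sequence[str]], max_n: int = 7) -> Counter:
    """Count every n-gram of order 1..max_n in pre-tokenized sentences."""
    counts: Counter = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            # Slide a window of width n over the sentence.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def filter_by_frequency(counts: Counter, threshold: int = 20) -> Dict[NGram, int]:
    """Keep only n-grams whose frequency is strictly greater than threshold."""
    return {gram: c for gram, c in counts.items() if c > threshold}


# Toy data (assumed tokenization) with a low threshold so something survives.
sentences = [["言語", "資源", "協会"], ["言語", "資源"], ["言語", "処理"]]
counts = count_ngrams(sentences, max_n=2)
frequent = filter_by_frequency(counts, threshold=1)
```

On a corpus of billions of sentences the counts would not fit in one in-memory Counter; a real system would shard the counting across machines and merge the results, but the filtering rule stays the same.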

Page 14: GSK: Development and Distribution of Resources


CICC Multilingual Dictionary

A collection of Malay, Indonesian, Chinese, and Thai dictionaries containing 50,000 basic words with POS tags; some contain English translations.

A technical term dictionary for each language is also available.

Contained in 1 CD-ROM for each language.

CICC: Center of the International Cooperation for Computerization

Page 15: GSK: Development and Distribution of Resources


IPAL Lexicon of Basic Japanese

Contains 861 verbs, 136 adjectives, and 1,081 nouns, plus a glossary.

English translations are also provided for nouns contained in the glossary.

Contained in 1 CD-ROM.

Page 16: GSK: Development and Distribution of Resources


Summary

1. There are several distributors of language resources in Japan.

2. GSK is the only language resource consortium qualified as an NPO in Japan.

3. GSK plans to collaborate with the Language Grid Project.