Developing Online Sangam Corpus and Concordance

Dr. A. Kamatchi, CAS in Linguistics, Annamalai University

Introduction

Corpus linguistics, a relatively new method of language study, has emerged in recent years and has generated a number of research methods that attempt to trace a path from data to theory. A corpus is a large collection of written or spoken material in machine-readable form. It provides, as far as possible, authentic data for linguistic studies and for related studies of a language. Most lexical corpora today are part-of-speech tagged (POS-tagged). According to Wikipedia, “a landmark in modern corpus linguistics was the publication by Henry Kucera and W. Nelson Francis of Computational Analysis of Present-Day American English in 1967, a work based on the analysis of the Brown Corpus, a carefully compiled selection of current American English, totalling about a million words drawn from a wide variety of sources.” The Sangam texts contain around forty thousand lines. They comprise the eTTuttokai anthologies (naRRinai, kuRuntokai, aiŋkuRunuuRu, patiRRuppattu, paripaaTal, kalittokai, akanaanuuRu and puRanaanuuRu) and the pattuppaaTTu poems (tirumurukaaRRuppaTai, porunaraaRRuppaTai, ciRupaaNaaRRuppaTai, perumpaanaaRRuppaTai, mullaippaaTTu, maturaikkaañci, neTunalvaaTai, kuRiñcippaaTTu, paTTinappaalai and malaipaTukaTaam). These are the earliest literary texts and are dated from the 3rd century B.C. to the 2nd century A.D.

Developing the online corpus is the need of the hour

As far as Tamil is concerned, work on the first corpus of modern written Tamil began at the Central Institute of Indian Languages (CIIL), Mysore, in 1987. However, it is little used. The reason may be that it is available only on CD and has not been posted on the internet. The other corpus, which is now available on the internet, is the Cre-A: Online Tamil Language Repository posted by Cre-A. These corpora are, of course, concerned only with modern Tamil and not with other periods of the language.

Besides these corpora of living languages, computerized corpora have also been made of collections of texts in ancient languages. According to Wikipedia, “An example is the Andersen-Forbes database of the Hebrew Bible, developed since the 1970s. The Quranic Arabic Corpus is an annotated corpus for the Classical Arabic language of the Quran.” Along these lines, the present study attempts to prepare a Corpus and Concordance of Sangam Tamil, through which one can search for lexical items, rather than words, and their concordance in these texts.

Useful for inter and intra language studies

The completion of this work should lead to the development of software for Sangam Tamil and for other old texts as well. Once the work is completed, it is expected to help in comparing languages and language families, both within and across them. It could also be useful to scholars and researchers working in comparative and historical linguistics, and it may be helpful for quantitative analysis.

A large number of Sangam Tamil words occupy the head entries of the Dravidian Etymological Dictionary (DED), which was prepared five decades ago and has been widely used by scholars worldwide. Posting this material on a website is therefore necessary for scholars working in comparative linguistics in general and comparative Dravidian in particular. In the same way, it would be very useful to scholars of historical linguistics, and it may also be of use in glottochronological studies. This online Sangam Corpus and Concordance would represent the classical language in Tamil. After Tolkaappiyam, in which a few words are briefly explained in the section on uriccol, this would be a major work that applies modern linguistic theories and scientific methods to the collection of lexical items. The work may also be very useful for the school curriculum, because most students and teachers cannot properly understand old Tamil words. In this way, the study tries to make the classical works in the school curriculum easier to understand for teachers and students in the countries where Tamil is spoken.

Morphological Parser

Approximately ten thousand words are attested in these texts. Unless the poetic lines of these texts are parsed, the machine cannot process the material. Every line should therefore be parsed morphologically; only then can the original root forms be retrieved from the database. The parser would cover not only nouns and verbs but also the other grammatical items found in the texts. These words would be collected systematically from the old Tamil Sangam texts and incorporated in the corpus. This is a pioneering attempt to develop a Corpus and Concordance of Sangam Tamil that includes not only lexical items but also the grammatical elements attested in these texts. When a word is selected from the drop-down box in the window, the user gets the number of occurrences of that word, the lines in which it occurs, its meaning in each line, the name of the literary work, and the poem number along with the line number. Once completed, this work can serve as a model for Tamil works of other periods.
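The lookup described above can be illustrated with a minimal Python sketch. It assumes that the parser has already reduced every word of a line to a root form with a gloss. The names `Occurrence`, `build_concordance` and `lookup`, the romanized roots, and the placeholder line texts are hypothetical and are not taken from the project; only the references 8:4 and 14:3 for the keyword glossed "beauty" follow the sample display in this paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Occurrence:
    work: str   # name of the literary work
    poem: int   # poem number within the work
    line: int   # line number within the poem
    text: str   # the poetic line as stored in the corpus
    gloss: str  # meaning of the root in this particular line


def build_concordance(parsed_lines):
    """Map each root form to every place where it occurs.

    parsed_lines yields (work, poem, line, text, analysis) tuples, where
    analysis is a list of (root, gloss) pairs produced by a parser.
    """
    index = defaultdict(list)
    for work, poem, line, text, analysis in parsed_lines:
        for root, gloss in analysis:
            index[root].append(Occurrence(work, poem, line, text, gloss))
    return index


def lookup(index, root):
    """Return (number of occurrences, occurrences ordered by work, poem, line)."""
    hits = sorted(index.get(root, []), key=lambda o: (o.work, o.poem, o.line))
    return len(hits), hits


# Placeholder records: line texts are dummies, and the "uur" entries and the
# kuRuntokai reference are invented purely for illustration.
SAMPLE = [
    ("aiŋkuRunuuRu", 14, 3, "<line text>", [("aNi", "beauty"), ("uur", "village")]),
    ("aiŋkuRunuuRu", 8, 4, "<line text>", [("aNi", "beauty")]),
    ("kuRuntokai", 1, 1, "<line text>", [("uur", "village")]),
]

index = build_concordance(SAMPLE)
count, hits = lookup(index, "aNi")
print(f"Selected keyword: aNi, occurrences: {count}")
for h in hits:
    print(f"{h.work} {h.poem}:{h.line}  {h.text}  ({h.gloss})")
```

Storing the gloss with each occurrence, rather than once per root, matches the requirement that the meaning be given for the particular line in which the word occurs.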

Available materials

Indexes exist for some of the Sangam works, but not for all. Indexes to some of the anthologies have been prepared by scholars and published; the others have been indexed, but those indexes remain unpublished Ph.D. dissertations. These indexes list words by root form plus, for verbs, some of the suffixes, and they give occurrences only by poem and line number. Anyone who wants to check an occurrence therefore needs the original text for reference. With the available published materials, it is difficult for a scholar to locate a given word in Sangam Tamil. Once an online work of this kind is completed, such printed materials will no longer be needed: with a single click on a word, one can get all its occurrences in this literature, with their poem and line numbers, as well as the total number of occurrences in the Sangam works. Accuracy is one of the strengths of the computer, and the tool is user friendly. A researcher working on Sangam Tamil can easily look up a word about which there is doubt. Further, a word that occurs in one text may or may not occur in the other texts, and the tool helps the researcher or reader to check this.
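As a rough illustration of the last point, the following Python sketch counts the occurrences of one word per work and lists the works in which it is absent. The function name `distribution`, the shortened list of works, and the reference tuples are assumptions made for the example; they are not data from the corpus.

```python
from collections import Counter


def distribution(references, works):
    """Count the occurrences of one word per work; works without hits get 0."""
    counts = Counter(work for work, _poem, _line in references)
    return {work: counts.get(work, 0) for work in works}


# Hypothetical subset of works and invented (work, poem, line) references.
WORKS = ["naRRinai", "kuRuntokai", "aiŋkuRunuuRu"]
references = [
    ("aiŋkuRunuuRu", 8, 4),
    ("aiŋkuRunuuRu", 14, 3),
    ("kuRuntokai", 5, 2),
]

table = distribution(references, WORKS)
total = sum(table.values())
absent = [work for work, n in table.items() if n == 0]

for work, n in table.items():
    print(f"{work}: {n}")
print(f"total: {total}, not attested in: {', '.join(absent)}")
```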

Creation of database

The data would be collected from the original texts of Sangam literature, using the index works available in various institutions in the country as well as the commentaries on the literature. Using software such as POS taggers, the Sangam corpus and search engines, the data would be collected from the Sangam texts and analyzed by descriptive methods. The works of eminent scholars of Tamil linguistics such as Prof. Nida, Prof. V.I. Subramoniam, Prof. S. Agesthialingom, Prof. M. Israel, Prof. M. Elayapermal, Prof. A. Kamatchi, Prof. Rm. Sundaram, Prof. S. N. Kandaswamy, Ms. Eva Wilden and others would mostly be utilized for this study.

The proposed online corpus model would be as follows:


Selected Keyword: அணி - beauty

No. of Occurrences: 11

Online Sangam dictionary

[Sample screen: each occurrence is displayed as a pair of Tamil lines followed by its reference (poem:line), most of them prefixed with the abbreviation ஐ. The Tamil text of the sample lines is not recoverable in this copy. References shown: ஐ 8:4, ஐ 14:3, 22:3, ஐ 23:2, ஐ 45:4, ஐ 45:3, 50:5, 50:3, ஐ 55:3, ஐ 81:3.]


The database also makes it possible to build an online dictionary of Sangam Tamil. There are many online dictionaries (online dictionaries for kids, online dictionaries for students, medical online dictionaries, legal online dictionaries, etc.) that have been developed neatly in many languages of the world. Even for Modern Tamil, many online dictionary websites are available, but there is no Sangam Classical Tamil-English dictionary, although a number of indexes are available; these lack head entries. Tamil scholars have prepared indexes for individual works, but these have not been consolidated on one platform. Such a dictionary is needed for translation from one language to another.

Organization of dictionaries

Accuracy is one of the strengths of the computer. The text of the dictionary is organized under head words, which would be listed in alphabetical order. It is estimated that there may be more than 10,000 words in Sangam Tamil, and all of them could be accommodated in the dictionary. Compound words will also be given separate entries in alphabetical order. Certain words have the same spelling but different meanings and different etymologies; such words are called homonyms. They would be treated as separate head words, even when they belong to the same part of speech, and a homonym number would be added to distinguish identical headwords. An alphabetically arranged list of words is provided in the database, from which the data would be retrieved for the online dictionary. A few words from the Sangam texts are given here as a sample.
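The organization described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Entry` fields, the `organise` function, the romanized headwords and their glosses are hypothetical, and the default sort key compares raw code points, which only approximates alphabetical order. For headwords in Tamil script a collation-aware sort key would have to be supplied.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Entry:
    headword: str
    pos: str          # part of speech
    gloss: str
    homonym: int = 0  # 0 means the headword is unique in the dictionary


def organise(entries, sort_key=lambda e: e.headword):
    """Sort entries by headword and number identical headwords 1, 2, ...

    The default key is a plain code-point comparison (an assumption of
    this sketch); pass a different sort_key for proper alphabetical order.
    """
    ordered = sorted(entries, key=sort_key)  # stable: input order kept for ties
    totals = Counter(e.headword for e in ordered)
    seen = Counter()
    for e in ordered:
        if totals[e.headword] > 1:
            seen[e.headword] += 1
            e.homonym = seen[e.headword]
    return ordered


# Illustrative entries only; they are not taken from the project database.
entries = [
    Entry("uur", "noun", "village"),
    Entry("aNi", "noun", "beauty"),
    Entry("aNi", "verb", "to adorn"),
]

for e in organise(entries):
    label = f"{e.headword} {e.homonym}" if e.homonym else e.headword
    print(f"{label}  ({e.pos})  {e.gloss}")
```

Because the sort is stable, homonyms keep the order in which they were entered, so the numbering stays predictable when new entries are appended.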

அணி அ

அ ப

அர

அ ல

அல

அவினி அற

ஆத

இத

இரவல

இவ

உழவ

உளை

ஊர

எ ைத

கஞ

கயலா

References


• Agesthialingom, S. 1979. A Grammar of Old Tamil (With Special Reference to patiRRuppattu), Annamalai University, Annamalainagar.

• Andiappa Pillai, D. 1970. Descriptive Grammar of kalittokai, Ph.D. Dissertation, University of Kerala, Trivandrum.

• Andronow, M. 1959. ‘On the future tense base on Tamil’, Tamil Culture, Vol. 3, Madras.

• Andronow, M. 1978. Tense and Mood in Dravidian: A Comparative Study, in (tamiliyal) Journal of Tamil Studies.

• Elayapermal, M. 1958. ‘The maar suffix in Early Tamil Literature’, Indian Linguistics, Ralph Turner Jubilee, Vol. 1.

• Elayapermal, M. 1975. Grammar of AiŋkuRunuuRu with Index, University of Kerala, Trivandrum.

• Wilden, Eva. 2008. Word Index of NaRRiNai (vol. 3), Tamil mann patippakam, Chennai.

• Israel, M. 1964. ‘The finite verbs of ‘ceyyum’ pattern in Tamil’, Indian Linguistics, Vol. 25, Poona.

• Kandaswamy, S.N. 1962. paripaaTal - A Linguistic Study, M.Litt. Dissertation, Annamalai University, Annamalainagar.

• Krishnambal, S.R. 1974. Grammar of kuRuntokai with Index, University of Kerala, Trivandrum.

• Subrahmanyam, P.S. 1971. Dravidian Verb Morphology (A Comparative Study), Annamalai University, Annamalainagar.

• Subramaniyan, S.V. 1972. Grammar of AkanaanuuRu with Index, University of Kerala, Trivandrum.

• Subramoniyam, V.I. 1962. Index of puRanaanuuRu, University of Kerala, Trivandrum.