Developing Online Sangam Corpus and Concordance
Dr. A. Kamatchi, CAS in Linguistics, Annamalai University
Introduction
Corpus linguistics, a relatively new method of language study, has generated a number of research methods that
attempt to trace a path from data to theory. A corpus is a large collection of written or spoken material in
machine-readable form. It provides, as far as possible, authentic data for linguistic studies as well as for other
related studies of languages. Most lexical corpora today are part-of-speech-tagged (POS-tagged). According to
Wikipedia, “a landmark in modern corpus linguistics was the publication by Henry Kucera and W. Nelson Francis of
Computational Analysis of Present-Day American English in 1967, a work based on the analysis of the Brown Corpus, a
carefully compiled selection of current American English, totalling about a million words drawn from a wide variety
of sources.” The Sangam texts contain around forty thousand lines. They comprise the eTTuttokai (naRRinai,
kuRuntokai, aiŋkuRunuuRu, patiRRuppattu, paripaaTal, kalittokai, akanaanuuRu and puRanaanuuRu) and the pattuppaaTTu
(tirumurukaaRRuppaTai, porunaraaRRuppaTai, ciRupaaNaaRRuppaTai, perumpaanaaRRuppaTai, mullaippaaTTu,
maturaikkaañci, neTunal vaaTai, kuRiñcippaaTTu, paTTinappaalai and malaipaTukaTaam). These are the earliest literary
texts and are dated from the 3rd century B.C. to the 2nd century A.D.
Developing the online corpus is the need of the hour
As far as Tamil is concerned, work on the first corpus of modern written Tamil began at the Central Institute of
Indian Languages (CIIL), Mysore, in 1987. However, it is not widely used, perhaps because it is available only on CD
and has not been posted on the internet. Another resource, now available on the internet, is the Cre-A: Online Tamil
Language Repository posted by Cre-A. These corpora are, of course, concerned only with modern Tamil and not with
other periods of the language.
Besides these corpora of living languages, computerized corpora have also been made of collections of texts in
ancient languages. According to Wikipedia, “An example is the Andersen-Forbes database of the Hebrew Bible, developed
since the 1970s. The Quranic Arabic Corpus is an annotated corpus for the Classical Arabic language of the Quran.”
Along these lines, the present study attempts to prepare a Corpus and Concordance of Sangam Tamil, through which one
can search for lexical items, rather than words, and their concordance in these texts.
Useful for inter and intra language studies
The completion of this work should lead to the development of software for Sangam Tamil and for other old texts.
Once completed, it is expected to help in comparing languages and language families, both within and across them.
It could also be useful to scholars and researchers working in comparative and historical linguistics, and it may be
helpful for quantitative analysis as well.
A large number of Sangam Tamil words occupy the head entries in the Dravidian Etymological Dictionary (DED), which
was prepared five decades ago and has been widely used by scholars worldwide. Posting this material on a website is
therefore necessary for scholars working in comparative linguistics in general and comparative Dravidian in
particular. In the same way, it would be very useful for scholars of historical linguistics, and it may also be used
in glottochronological studies. This online Sangam Corpus and Concordance would represent the classical language of
Tamil. After Tolkaappiyam, in which a few words are simply explained in the section on uriccol, this would be a
potential work that uses modern linguistic theories and scientific methods in preparing the collection of lexical
items. The work may also be useful for the school curriculum, because most students and teachers cannot properly
understand old Tamil words. In this way, the study tries to make the classical works in the school curriculum easier
to understand for teachers and students in Tamil-speaking countries.
Morphological Parser
Approximately ten thousand words are attested in these texts. Unless the poetic lines of these texts are parsed, the
machine cannot process the material. Every line should therefore be parsed morphologically; only then can the
original root forms be retrieved from the database. The parser would cover not only nouns and verbs but also the
grammatical items found in the texts. These words would be collected systematically and incorporated in the corpus.
The data for the online Sangam Corpus and Concordance would be collected from the old Tamil Sangam texts. This is a
pioneering attempt to develop a Corpus and Concordance of Sangam Tamil that contains not only lexical items but also
the grammatical elements attested in these texts. When a word is selected from the drop-down box in the window, the
user gets the number of occurrences of that word, the lines in which it occurs, its meaning in each line, and the
name of the literary work and the poem number along with the line number. If completed successfully, this work will
be a model for Tamil works of other periods.
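The lookup described above can be illustrated with a minimal Python sketch. It assumes that the morphological parse
has already produced, for every line, a list of lexical items with their meanings. The record layout, the function
names and the sample records (given in transliteration) are invented for illustration and are not taken from the
Sangam texts or from the actual system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Occurrence:
    work: str   # name of the literary work
    poem: int   # poem number
    line: int   # line number within the poem
    text: str   # the line in which the item occurs
    gloss: str  # meaning of the item in this line


def build_concordance(parsed_lines):
    """Map each lexical item to the list of places where it occurs.

    parsed_lines yields (work, poem, line, text, items), where items is a
    list of (lexical_item, gloss) pairs from the morphological parse.
    """
    index = defaultdict(list)
    for work, poem, line, text, items in parsed_lines:
        for item, gloss in items:
            index[item].append(Occurrence(work, poem, line, text, gloss))
    return index


def lookup(index, item):
    """Return (number of occurrences, list of occurrences) for one item."""
    hits = index.get(item, [])
    return len(hits), hits


# Invented placeholder records, NOT actual Sangam lines.
SAMPLE = [
    ("work-A", 8, 4, "malar aNi vaayil",
     [("malar", "flower"), ("aNi", "beauty"), ("vaayil", "gate")]),
    ("work-A", 14, 3, "aNi uur",
     [("aNi", "beauty"), ("uur", "village")]),
    ("work-B", 2, 1, "uur malar",
     [("uur", "village"), ("malar", "flower")]),
]

if __name__ == "__main__":
    count, hits = lookup(build_concordance(SAMPLE), "aNi")
    print("No. of occurrences:", count)
    for hit in hits:
        print(f"{hit.text}  [{hit.gloss}]  {hit.work} {hit.poem}:{hit.line}")
```

Keying the index by lexical item rather than by surface word is what allows a search to return every inflected
occurrence of a root, which is why the parse has to come first.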
Available materials
Indexes exist for some of the Sangam works, but not for all. Only for some of the anthologies have indexes been
prepared by scholars and published; the others have been indexed, but the indexes remain unpublished Ph.D.
dissertations. These indexes list the words as root forms plus suffixes, as far as verbs are concerned, and give
their occurrences only by poem and line number. Anyone who wants to check an occurrence needs the original text.
With the available published materials, it is difficult for a scholar to locate a given word in Sangam Tamil. Once
the online work is completed, such printed materials will no longer be needed: with a single click on a word, one can
get all its occurrences in this literature with their poem and line numbers. As we all know, accuracy is one of the
distinctive features of the computer. Through this program, one can obtain the total number of occurrences in the
Sangam works. It is user-friendly material. A researcher working on Sangam Tamil can easily look up a doubtful word.
Further, a word that occurs in one text may or may not occur in the other texts, and the tool helps the researcher or
reader to check this.
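As a rough sketch of how the distribution of a word across the texts might be checked, the Python fragment below
counts occurrences per work and lists the works in which the word is not attested. The list of works is limited to
the eTTuttokai anthologies for brevity, and the references are invented placeholders, not real counts.

```python
from collections import Counter

# The eight eTTuttokai anthologies named in the introduction.
WORKS = [
    "naRRinai", "kuRuntokai", "aiŋkuRunuuRu", "patiRRuppattu",
    "paripaaTal", "kalittokai", "akanaanuuRu", "puRanaanuuRu",
]


def distribution(references, works=WORKS):
    """Count the occurrences of one word in each work.

    references is a list of (work, poem, line) tuples for that word.
    """
    counts = Counter(work for work, _poem, _line in references)
    return {work: counts[work] for work in works}


def unattested_in(references, works=WORKS):
    """List the works in which the word does not occur at all."""
    table = distribution(references, works)
    return [work for work, n in table.items() if n == 0]


# Invented references for a hypothetical word, NOT real data.
refs = [("aiŋkuRunuuRu", 8, 4), ("aiŋkuRunuuRu", 14, 3), ("kuRuntokai", 2, 1)]

if __name__ == "__main__":
    print("Total occurrences:", sum(distribution(refs).values()))
    print("Not attested in:", ", ".join(unattested_in(refs)))
```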
Creation of data base
The data would be collected from the original texts of Sangam literature, using the index works available in various
institutions in the nation as well as the commentaries on the literature. Using computer software such as a POS
tagger, the Sangam corpus and search engines, the data would be collected from the Sangam texts and analyzed by
descriptive methods. The works of eminent scholars of Tamil linguistics such as Prof. Nida, Prof. V.I. Subramoniam,
Prof. S. Agesthialingom, Prof. M. Israel, Prof. M. Elayapermal, Prof. A. Kamatchi, Prof. Rm. Sundaram, Prof. S. N.
Kandaswamy, Ms. Eva Wilden and others would mostly be utilized for this study.
The proposed online corpus model would be as follows:
Selected keyword: அணி - beauty
No. of occurrences: 11
Online Sangam dictionary
[Sample listing: each occurrence is shown as a pair of Tamil lines followed by its poem and line reference. The
references shown are 8:4, 14:3, 22:3, 23:2, 45:4, 45:3, 50:5, 50:3, 55:3 and 81:3.]
The database also makes it possible to program an online dictionary of Sangam Tamil. There are many online
dictionaries (online dictionaries for kids and for students, medical and legal online dictionaries, etc.) which are
well developed in many languages of the world. Even for Modern Tamil, many online dictionary websites are available,
but there is no Sangam Classical Tamil-English dictionary, although a number of indexes are available that lack head
entries. Tamil scholars have prepared indexes for individual works, but these have not been consolidated on one
platform. Such a dictionary is needed for translation from one language to another.
Organization of dictionaries
The text of the dictionary is organized under headwords, which would be listed in alphabetical order. It is estimated
that more than 10,000 words are available in Sangam Tamil, and all of them could be accommodated in the dictionary.
Compound words will also be given separate entries in alphabetical order. Certain words have the same spelling but
different meanings and different etymologies; such words are called homonyms, and they would be treated as separate
headwords even when they belong to the same part of speech. Homonym numbers would be added to such words to
distinguish identical headwords. The data for the online dictionary would be retrieved from an alphabetically
arranged list of words provided in the database. A few words from the Sangam texts are given here as a sample.
அணி அ
அ ப
அர
அ ல
அல
அவினி அற
ஆ
ஆத
ஆ
இத
இரவல
இ
இவ
உழவ
உைள
ஊ
ஊர
ஊ
எ ைத
எ
ஓ
கஞ
கயலா
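The organization of headwords described above could be sketched in Python as follows. The function name, the entry
layout and the sample entries (in transliteration) are assumptions made for illustration. The default sort is plain
code-point order, which is only a stand-in: a real dictionary would need a proper Tamil alphabetical ordering
supplied through the `sort_key` parameter.

```python
from collections import defaultdict


def organise_entries(entries, sort_key=None):
    """Arrange entries under headwords in sorted order.

    entries is an iterable of (headword, part_of_speech, meaning).
    Identical headwords are kept as separate entries and numbered 1, 2, ...
    so that homonyms can be told apart; unique headwords get number None.
    sort_key is a hypothetical hook for a Tamil-aware ordering.
    """
    grouped = defaultdict(list)
    for headword, pos, meaning in entries:
        grouped[headword].append((pos, meaning))

    organised = []
    for headword in sorted(grouped, key=sort_key):
        senses = grouped[headword]
        if len(senses) == 1:
            pos, meaning = senses[0]
            organised.append((headword, None, pos, meaning))
        else:
            for number, (pos, meaning) in enumerate(senses, start=1):
                organised.append((headword, number, pos, meaning))
    return organised


# Illustrative entries only; not drawn from the project's database.
ENTRIES = [
    ("uur", "noun", "village"),
    ("aaRu", "noun", "river"),
    ("aNi", "noun", "beauty"),
    ("aaRu", "numeral", "six"),
]

if __name__ == "__main__":
    for headword, number, pos, meaning in organise_entries(ENTRIES):
        label = headword if number is None else f"{headword} {number}"
        print(f"{label} ({pos}): {meaning}")
```

Numbering homonyms only when a headword is repeated keeps unique entries uncluttered while still giving each
homonym its own head entry.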
References
• Agesthialingom, S. 1979. A Grammar of Old Tamil (With Special Reference to patiRRuppattu), Annamalai University,
Annamalainagar.
• Andiappa Pillai, D. 1970. Descriptive Grammar of kalittokai, Ph.D. Dissertation, University of Kerala,
Trivandrum.
• Andronow, M. 1959. ‘On the future tense base on Tamil’, Tamil Culture, Vol.3, Madras.
• Andronow, M. 1978. Tense and Mood in Dravidian: A Comparative Study, in (tamiliyal) Journal of Tamil
Studies.
• Elayapermal, M. 1958. ‘The maar suffix in Early Tamil Literature’, Indian Linguistics, Ralph Turner Jubilee,
Vol.1.
• Elayapermal, M. 1975. Grammar of AiŋkuRunuuRu with Index, University of Kerala, Trivandrum.
• Eva Wilden, 2008. Word Index of NaRRiNai (vol.3), Tamil mann patippakam, Chennai.
• Israel, M. 1964. ‘The finite verbs of ‘ceyyum’ pattern in Tamil’, Indian Linguistics, Vol.25, Poona.
• Kandaswamy, S.N. 1962. paripaaTal - A Linguistic Study, M.Litt. Dissertation, Annamalai University,
Annamalainagar.
• Krishnambal, S.R. 1974. Grammar of kuRuntokai with Index, University of Kerala, Trivandrum.
• Subrahmanyam, P.S. 1971. Dravidian Verb Morphology (A comparative study), Annamalai University,
Annamalainagar.
• Subramaniyan, S.V. 1972. Grammar of AkanaanuuRu with Index, University of Kerala, Trivandrum.
• Subramoniyam, V.I. 1962. Index of puRanaanuuRu, University of Kerala, Trivandrum.