
The Process of Designing a Multidisciplinary Monolingual Sample Corpus

N. S. DASH AND B. B. CHAUDHURI
Indian Statistical Institute

This paper discusses the approach of developing a sample corpus of printed Bangla, one of the national languages of India and the only national language of Bangladesh. It is designed from data collected from various published documents. The paper highlights different issues related to corpus generation, data-file preparation, language analysis, and processing, as well as application potentials in different areas of pure and applied linguistics. It also includes statistical studies on the corpus along with some interpretation of the results. The difficulties that one may face during corpus generation are also pointed out.

KEYWORDS: corpus, data-file, dictionary, word forms, concordance, NLP, machine translation, graphic symbol, diacritic, etc.

    1. Introduction

The introduction of corpus linguistics dates back to the 1960s, though corpus-based analysis of language is much older, as noted in the history of cryptography and printing technology. The relevance of the corpus in linguistics is to provide an empirical basis for language description. One can study the language-specific word order rules, vocabulary and word structure, syntactic compositions, semanticity, mechanisms for referring to objects, sense of time, users' attitudes, prosodic conventions, variations of style, contextual

INTERNATIONAL JOURNAL OF CORPUS LINGUISTICS Vol. 5(2), 2000, 179–197. John Benjamins Publishing Co.


and world knowledge, etc. (Winograd 1972) from a language-specific corpus of reasonable size. Moreover, the corpus is a primary requirement for multipurpose linguistic studies and for developing application-oriented, computer-based tools for Natural Language Processing (NLP). The referential value of a corpus is immense, and it is growing gradually with time. People in every branch of information science now realize that a corpus, as a sample of living language, can open up new horizons of study and research.

The term corpus has a Latin origin meaning body. It is a collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting point of linguistic description or as a means of verifying hypotheses about a language (Crystal 1980). In the present context, it means the body of language which can be analyzed by collecting a large representative set of data. It is chosen to encompass the diversity of a language. For that purpose, a corpus should contain many millions of running word forms. The most authentic and scientific study of any natural language should be based on a systematically developed corpus. The decision about what should belong to the corpus and how the selection is to be made virtually controls every aspect of subsequent analysis. If designed methodically, it can reflect the language with all its features and qualities. Moreover, it can open a window into the minds of users through proper representation of the linguistic abilities of an individual or a language community.

This paper deals with the corpus development of Bangla, a language used by 220 million people of Eastern India and Bangladesh. This is the first multipurpose corpus of its kind in this language. The paper is arranged as follows. The background of the work is given in Section 2. The features of an ideal corpus and related issues are described in Sections 3 and 4. The corpus development and its application potentials are elaborated in Sections 5 and 6. Concluding remarks are provided in Section 7.

    2. The background

Taking cues from Europe and America, the Indian effort for corpus development was initiated in 1979, when the Department of Electronics (DOE), Government of India first organized a symposium in which many linguists, computer scientists, and technocrats participated. Based on the recommendations of the committee constituted in the symposium, it was decided to give


more thrust to language technology for Indian languages in the 8th Five-Year Plan. The committee recommended the initiation of a research and development program on information processing of Indian languages, with national relevance and participation of traditional knowledge in mind. Towards this goal, the DOE initiated a national-level program during the years 1990–91 on Technology Development for Indian Languages (TDIL) with the following objectives (Murthy and Despande 1998):

(i) To develop information processing tools for facilitating man-machine

interaction and information processing in Indian languages, and to develop multilingual knowledge systems.

(ii) To promote the use of information processing tools for language studies and research.

(iii) To support research and development efforts in the area of information processing in Indian languages, covering NLP, Machine Translation, Human-Machine Interaction, and Language Learning.

Since 1991, a number of developmental activities were initiated under the TDIL program. As a first step, major thrust areas were identified, which include the development of machine-readable corpora of texts of major Indian languages, machine-aided translation among languages, human-machine interface systems, computer-assisted language learning and teaching, as well as theoretical issues of NLP. For corpora development, it was estimated that machine-readable text corpora of nearly 10 million word forms in Assamese, Bangla, English, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sanskrit, Sindhi, Tamil, Telugu, and Urdu would be developed. The software for tagging, word count, frequency count, spell checking, language processing, and machine translation from English to Indian languages would be developed at the Indian Institute of Technology, Kanpur (Murthy and Despande 1998). One of the authors of this paper had an opportunity to work on developing the Bangla corpus, which includes word class determination, word form processing, and automatic tagging.


    3. Features of the corpus

The electronic corpus we have developed holds a state-of-the-art record of the language for research purposes. It keeps a faithful copy of the text for two reasons: (a) each particular investigation looks at the language according to its own priorities; (b) linguists can discover word forms, morphological divisions, primary word classes, etc. by looking at the raw and undistorted corpus; such information cannot be provided by a subject-specific corpus but can be retrieved from a general one. A systematically developed electronic corpus has the following features:

(i) The corpus truly represents the language text from which it is developed. It properly manifests the language with all its peculiarities and specialties.

(ii) It is balanced across all disciplines of printed texts, because multifaceted varieties of the language are stored in multiple disciplines. The terms or word forms used in literature may not be in medical or technical books, and vice versa. If accumulated in a proper way, the corpus reflects all linguistic usage.

(iii) The corpus faithfully preserves the spelling variations, use of punctuation marks, and even the structure of the word forms as noted in the source text. If this is not done carefully, the true image of the language cannot be reflected in the corpus.

(iv) The corpus should be large enough that it contains almost all the features of the language. Perhaps one million words marks the lower limit, while ten million words or more is a good choice.

(v) The corpus should be machine accessible in a simple way so that future algorithmic development is convenient and easy. It must be fully compatible with the computer, as it is to be used for NLP. For computer access, we have used the Indian Standard Code for Information Interchange (ISCII), parallel to ASCII, through which the language data is stored. By storing the data in ISCII, we are able to handle the data according to our choice.


    4. Issues related to corpus generation

The reason behind the collection of a text corpus was to acquire a readily accessible language database, which would be used in various fields of information technology as well as in NLP. There are different issues related to corpus development, such as the choice of documents (books, newspapers, journals, magazines, etc.); the manner of page sampling (random, regular, or selective); problems of text screening (omission of foreign word forms, quotations, dialects, mathematical symbols, diagrams, poems, pictures, graphs, etc.); the manner of data input (typing, scanning, etc.); corpus size; editing of input data; and corpus file management (long or short files). These issues are discussed below:

    4.1 Time span

Language changes with time; therefore, a particular time span must be determined in order to capture the features of the language within that span. Our corpus attempts to cover a particular period, namely materials published between 1981 and 1995, and thus represents the status of present-day Bangla. It was agreed that a corpus based on this time span would provide sufficient information about the changes that have taken place over the preceding decades. However, for some materials, we were somewhat liberal: a few books first published before our fixed time span but re-published within it are also considered.

    4.2 Size of the corpus

The size of the corpus is another important issue in corpus-based language analysis. The corpus should be as large as possible because the larger the corpus, the more genuine the observation. Moreover, it should have scope for regular augmentation. One of the reasons for having a large corpus is to study the relation between word forms and their frequency. To study the behavior of word forms in texts, we need a large number of instances where these word forms occur. Also, for studying collocations, phrases, clauses, etc., it is necessary to study large amounts of language text. Recent studies on English suggest that the detailed patterns of individual word forms are necessary evidence on which the generalizations of grammar depend


(Sinclair 1991). The minimum size of a language corpus depends on the application in mind. In Bangla, we have a printed corpus of around 3.5 million word forms.

    4.3 Method of data collection

At first we had to decide the required size of a sample corpus and which fields were to be selected for data collection. For this purpose, catalogues and the publication lists of different publishers were consulted to pick the texts for inclusion in the corpus. Initially, we input the data manually into the computer via the keyboard, because printed Bangla materials are not available in electronic form. Presently, however, the technique of machine reading (converting text into machine-readable form by an Optical Character Recognizer [OCR]) is being used for the augmentation of the Bangla corpus.

    4.4 Writers

The aim of the corpus is to identify what is central and typical in a natural language. A method of proportional representation of documents is maintained throughout. To be realistic in approach, we have considered books by ordinary writers as well as established writers. The corpus is broadly heterogeneous in all senses, as it is gathered from a variety of sources so that the individuality of any one source is obscured. This diversity is a safeguard against idiosyncrasy.

    4.5 Corpus management

The management of a corpus is tedious work: it includes tasks such as holding, processing, screening, and retrieving information. Once the corpus is developed and stored in the computer, it needs regular maintenance and augmentation. There are always some errors that require correction, and some improvements are needed. Moreover, adaptation to new hardware and software technology and to changes in the requirements of users must be exercised regularly. In addition, attention should be paid to retrieval systems as well as to processing and analytic tools.


    5. Generation of the corpus

For the electronic generation and processing of the corpus, texts on different subjects published in a specific time period were selected. We conducted a survey to determine the percentage of use of various printed materials by Bangla language users. The printed documents were selected following that percentage of use: literature 20%, fine arts 5%, social sciences 15%, natural sciences 15%, commerce 10%, mass media 30%, and translation from other languages 5%. The survey clearly reflects that newspapers have the widest readership, followed by books on literature, social sciences, natural sciences, commerce, fine arts, and translations.
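As a quick check of the arithmetic, the survey percentages translate into per-category word-count targets for a corpus of a given size. This is a minimal sketch; the function name and the 3.5-million-word total applied here are ours (the total comes from Section 4.2, not from this passage).

```python
# Allocate a target corpus size across the seven text categories
# using the survey percentages reported in the paper.

def allocate(total_words, shares):
    """Return the number of word forms to collect per category."""
    return {cat: int(total_words * pct / 100) for cat, pct in shares.items()}

SURVEY_SHARES = {
    "literature": 20, "fine arts": 5, "social sciences": 15,
    "natural sciences": 15, "commerce": 10, "mass media": 30,
    "translation": 5,
}

targets = allocate(3_500_000, SURVEY_SHARES)
print(targets["mass media"])   # mass media receives the largest share
```

Because the shares sum to 100%, the per-category targets sum exactly to the corpus total.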

    5.1 Hardware environment

For the development of the corpus, we used a PC with a GIST (Graphics and Intelligence-based Script Technology) card, a Script Processor, a monitor, a conventional computer keyboard, a multilingual printer, and some floppy diskettes. The text files were developed with the help of the GIST card installed in the PC. This GIST technology allows the display of various Indian scripts on the computer monitor screen based on the information entered through a keyboard having an overlay of the Indian scripts. Based on these developments, codes for the various keys used in Indian scripts and their layout have been standardized by the Bureau of Indian Standards. This development has led to a number of software products, currently available on the market, that enable diverse users to carry out word form processing under DOS and Windows environments, in addition to DTP, spreadsheets, spell checkers, etc.

The GIST card is a hardware add-on card that upgrades IBM PC/XT/AT compatibles for interaction in all major Indian scripts as well as English. Moreover, multilingual printing is also possible with this card. It uses ISCII as per the recommendations of the DOE, as adopted by the Bureau of Indian Standards, Government of India (ISI Code No. IS 13194:1991).

The Script Processor (SP) software supplied with the GIST card provides word processing in all Indian scripts in a uniform manner. The SP provides a simple user interface and facilitates the combination of all Indian scripts and English in the same document. The card also provides a choice of two operational display modes on the monitor: one is the conventional English


mode, and the other is the Indian multilingual mode. Of the 8-bit code, the upper 128 positions are used for defining Indian script characters, whereas the lower 128 positions are used for English.
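The 8-bit split described above can be illustrated by inspecting raw byte values: the lower 128 code points remain ASCII/English, while the upper 128 carry Indian script characters. This is only a byte-range sketch of the idea (the helper name is ours); it does not decode actual ISCII Bangla glyphs.

```python
# Classify a single byte of an ISCII-encoded stream by the 8-bit split:
# values below 0x80 are the ASCII/English half, values from 0x80 up
# fall in the half reserved for Indian script characters.

def classify_byte(b):
    """Label one byte as 'ascii' (lower 128) or 'indic' (upper 128)."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    return "ascii" if b < 0x80 else "indic"

mixed = bytes([0x41, 0x20, 0xA4, 0xB3])   # 'A', space, two upper-half bytes
print([classify_byte(b) for b in mixed])  # ['ascii', 'ascii', 'indic', 'indic']
```

This is why English and Indian scripts can coexist in the same 8-bit file: a decoder can route each byte to the right character table from its value alone.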

The GIST card has made it possible to store language data of Indian scripts in computers. Its earlier version had some deficiencies leading to difficulties in data input as well as text processing. A modified version of the card, developed recently, has solved some of these problems.

    5.2 Category determination

As stated above, the books and other materials published in the time span of 1981 to 1995 were selected and classified. The text materials include almost all branches of human knowledge. According to the nature of the texts, the materials were classified into seven major categories:

(i) Literature
(ii) Fine arts
(iii) Social science
(iv) Natural science
(v) Commerce
(vi) Mass media
(vii) Translation

Each of the seven major categories has some sub-categories. Literature includes novels, short stories, essays, etc.; fine arts relates to paintings, drawings, music, sculpture, etc.; social science includes philosophy, history, education, etc.; natural science includes physics, chemistry, mathematics, geography, etc.; mass media includes newspapers, magazines, posters, notices, advertisements, etc.; commerce includes accountancy, banking, etc.; and translation includes all the subjects translated into Bangla from other Indian and foreign languages. The total number of subjects or text categories taken for input is 87.

Once the text is selected, the computer-input format should be decided. For processing, the text is kept in a very simple format: a single long string of letters with spaces and punctuation marks. Page and line numbers are kept only for reference purposes, and other information such as layout, setting, and typeface is discarded. The method of data input was random sampling: two pages after every ten pages. This method is supported by several reasons. If a book contains many chapters, each chapter containing different


subjects written by different writers, all of the chapters might be accessed for samples. Copyright-related issues could be more easily tackled by such random sampling, since this corpus cannot be used commercially. It is also advantageous to keep detailed records of the materials so that the documents can be identified on grounds other than academic. Information on whether the text is fiction or non-fiction; book, journal, or newspaper; formal or informal; and the age, sex, and origin of the author(s) is carefully documented for both linguistic and non-linguistic studies. Generally, the most recent edition of the book is recorded.
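One reading of the sampling scheme above ("two pages after every ten pages") is that from each successive run of pages, ten are skipped and the next two are keyed in. The block length and offsets below are our interpretation of that phrase, not something the paper specifies exactly.

```python
# Sketch of the page-sampling scheme: skip `skip` pages, take the
# next `take` pages, and repeat until the book runs out of pages.

def sample_pages(n_pages, skip=10, take=2):
    """Return the 1-based page numbers selected for data entry."""
    selected = []
    page = 1
    while page <= n_pages:
        page += skip                                  # skip a run of pages
        for p in range(page, min(page + take, n_pages + 1)):
            selected.append(p)                        # key in these pages
        page += take
    return selected

print(sample_pages(30))   # pages 11, 12, 23, 24 from a 30-page text
```

Under this reading, roughly one-sixth of each book is sampled, and the samples are spread across all chapters rather than clustered at the front.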

For a data input session, the SP is opened and the name of the text file is typed in. In the first line, the physical information of the text is stored. Here we retain the name of the book, the name(s) of the author(s), the year of publication, the edition number, the name of the publisher, and the numbers of the pages taken for input. These are required for maintaining records and solving copyright problems. The input of the text begins on the next line, following the ISCII code, using the GIST card and keyboard. At the time of input, the physical line of the text is maintained on the screen line. Books on the natural and social sciences, more than stories or novels, usually contain foreign word forms (English, French, Hindi, etc.), phrases, and even sentences. Those foreign loanword forms which are already integrated as Bangla words are entered into the computer; otherwise, they are omitted. Dialectal variations, especially those having low intelligibility or showing the influence of neighboring state languages, are discarded. Punctuation marks and transliterated forms are faithfully reproduced in the machine.

After input, the entire data set is subject to editing. Generally, five types of errors can occur at the time of data entry: deletion, insertion, repetition, substitution, and transposition of characters or graphemes. The errors caused by the data entry operator are manually corrected at the time of editing. Much care is taken so that the text file resembles the physical text. It is also checked whether any word form is changed, repeated, or omitted; whether punctuation marks are properly used; whether lines are properly maintained; and whether separate paragraphs are made for the text. All spelling variations of the surface word forms in the physical text are faithfully preserved. Quotations from other languages, poems, songs, dialects, mathematical expressions, chemical formulae, geometric diagrams, tables, pictures, and other symbolic representations are omitted. Following this methodology, a large number of text


files are created, where each file contains nearly ten thousand surface word forms.
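The five entry-error types listed above can also be detected mechanically when a keyed-in word is compared with its source. The sketch below (our own illustration; the project's proofreading was manual) handles single-character errors only, treating repetition as the special case of insertion where a character is doubled.

```python
# Classify the relation between a source word and its keyed-in copy
# as one of the five data-entry error types, assuming at most one error.

def error_type(source, typed):
    if typed == source:
        return "correct"
    if len(typed) == len(source):
        diffs = [i for i, (a, b) in enumerate(zip(source, typed)) if a != b]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and source[diffs[0]] == typed[diffs[1]]
                and source[diffs[1]] == typed[diffs[0]]):
            return "transposition"          # two adjacent characters swapped
    if len(typed) == len(source) - 1:
        for i in range(len(source)):
            if source[:i] + source[i + 1:] == typed:
                return "deletion"           # one character dropped
    if len(typed) == len(source) + 1:
        for i in range(1, len(typed)):      # doubled character => repetition
            if typed[:i] + typed[i + 1:] == source and typed[i] == typed[i - 1]:
                return "repetition"
        for i in range(len(typed)):
            if typed[:i] + typed[i + 1:] == source:
                return "insertion"          # one stray character added
    return "other"

print(error_type("corpus", "cropus"))   # transposition
```

A checker like this can flag suspect tokens for the human editor, though multi-error words still need manual comparison against the printed page.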

6. Application potentials of the corpus

The potential of a well-developed corpus is immense, as it provides an empirical basis for language description (Teubert 1996). With its vast size and scope for gradual augmentation, it provides us with much valuable and novel information about context, world knowledge, pragmatics, anaphora, etc., as well as the phonology, morphology, semantics, and syntax of a language. The corpus can help us perform various statistical analyses, find the most frequently used word forms, and find multiple spelling variations of a single surface word form, utterance, etc. Through a concordance of the corpus, we can determine multiple semantic and pragmatic levels of word forms and can list homophonous word forms for morpho-phonemic and semantic studies.

The information accumulated from this sample corpus is of great importance for NLP and other linguistic studies. It is now being used for developing a monolingual dictionary, a thesaurus, and a grammar of Bangla, in both printed and electronic form. Moreover, different statistical studies compiled from the corpus are used to design primers, to build an OCR system, to develop a spell-checker, and for many other related tasks. Both tagged and untagged corpora are used for Human-Aided Machine Translation (HAMT). The books and tools developed from the corpus are of great help to native language users, researchers, writers, academicians, teachers, students, scholars, publishers, and both primary and secondary language learners.

Text processing is considered one of the basic techniques of language data processing, and it assesses the value of data in linguistic research. It is a method for determining new approaches, finding new evidence, and stating new descriptions. With the help of this technique, we can gather examples to furnish explanations that fit the evidence, rather than adjusting the evidence to fit our presupposed explanation. It has already been found that computer processing of corpora produces some results that directly contradict our intuitions (Merlo 1996). Moreover, collocations have helped us in understanding the roles and positions of word forms in a text. Recent work on lexicography (Sinclair 1991) shows that, for many common word forms, the most frequent


meaning is neither the first one that comes to mind nor the first listed in most dictionaries. Our traditional linguistic descriptions and hypotheses are challenged by new evidence accumulated from the corpus. Such evidence has not been available before, and its computer-aided assimilation will definitely contribute to the maturity of linguistics as a discipline of human knowledge.

The conventional notion of word is a long-standing problem in linguistics because it is ambiguous in nature. It either refers to a string of graphemes as it appears in speech or writing, or to a more abstract entity, a part of the structure of the language as represented in a dictionary. Words, once formed, persist and change; they take on idiosyncrasies, with the result that they are soon no longer generable by a simple algorithm of any generality (Aronoff 1981). The meaning of word forms is not always determined compositionally. In some cases, it is the word form as a whole which bears the meaning, and the relationship between the meaning of the parts and the meaning of the whole can be obscure. So there are considerable difficulties in pinning down any universally applicable notion of word; it appears that even when we restrict ourselves to morphological criteria within a single language, we find that the term itself covers a multitude of sins, which need to be carefully distinguished (Spencer 1991). Words are generally referentially opaque; that is, it is impossible to see inside them and refer to their parts. In the rules of syntax, word forms are the smallest units that compose phrases and sentences. Here a word form is a minimal free form, the smallest unit that can exist on its own.

For automatic processing, it is necessary to clarify the concept of word. Should the computer consider, for example, surface word forms like kari, karchi, karechi, karlaam, karchilaam, kartaam, karechilaam, karba, etc. as inflected forms of the single LEMMA kar 'to do', or as unique surface word forms, each with a separate entity? Moreover, using the simple notion of a word form, we can now represent a text as a succession of surface word forms. The word forms can be counted, so that the length of the text, measured in word forms, can be calculated. Next, the word forms can be compared with each other to find the many repetitions of the same word form. Another count can be made of the number of different word forms, which we call the vocabulary of the text. The most important feature of this vocabulary is its uniqueness: no word form is repeated. In Bangla, it is estimated that a corpus of 10 million running word forms would produce a


vocabulary of around 200,000 (2 lakh) word forms. In the sections below, some of the uses of the Bangla corpus are stated in brief.

    6.1 Alphabetical and frequency sorting

An algorithm was developed for easy alphabetical sorting and counting of the surface word forms in the Bangla corpus. Both of these lists can be placed in ascending or descending order. The frequency list of word forms provides clues to the nature of a text: by examining the list, we can get an idea of the type of the text. Moreover, the list helps us to know how the word forms are distributed in the text. Frequency lists are used mainly for reference; however, they are often helpful in formulating hypotheses and verifying previously made assumptions (Kjellmer 1984).
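The sorting step described above can be sketched in a few lines: tokenize, count the surface word forms, and emit the list both alphabetically and by descending frequency. The example uses romanized tokens for readability; the project itself ran the same kind of counts over ISCII-encoded Bangla.

```python
# Build the two word lists named in the text: an alphabetical list of
# surface word forms with their counts, and a frequency-ordered list.

from collections import Counter

def word_lists(text):
    tokens = text.lower().split()
    freq = Counter(tokens)
    alphabetical = sorted(freq.items())    # ascending alphabetical order
    by_frequency = freq.most_common()      # descending frequency order
    return alphabetical, by_frequency

alpha, by_freq = word_lists("naa kari naa karchi naa")
print(by_freq[0])   # ('naa', 3) -- the most frequent surface word form
```

Reversing either list gives the descending/ascending counterpart mentioned in the text.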

The numerical frequency lists formed from the corpus provide an insight into the language. The listing of a particular word form is compared against a large collection of language text for statistical purposes. The most frequent items tend to maintain a stable distribution; hence, marked changes in their order can be significant at the time of linguistic analysis.

    6.2 Concordance and key word in context (KWIC)

A concordance is an index of the surface word forms in a text. It is a collection of the occurrences of a word form, each in its own textual environment. It is indispensable in corpus linguistics, since it gives access to many important language patterns in texts. Similarly, KWIC helps us to understand the contextual importance of a word form and to determine its actual behavior in context, along with any contextual restrictions it exercises. In Bangla, a concordance list has been generated on some books of Tagore (Mallik and Nara 1996). We applied concordance to our corpus to study semantic changes of the surface word forms. It is noted that the meaning and connotation of word forms change drastically depending on the context of their use. Both diachronic and synchronic levels of semantic change are based on linguistic and non-linguistic factors, such as social motivation, political intention, scientific needs, etc., which control aspects of syntactic analysis and machine translation. Moreover, after accessing the corpus by concordance, we realized that a considerable amount of research on stylistics and language instruction could be launched from here.
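A bare-bones KWIC display of the kind described above can be produced by showing every occurrence of a keyword with a fixed window of left and right context. This is a generic sketch (the function name and the romanized sample sentence are ours), not the project's concordance tool.

```python
# Key Word In Context: collect each occurrence of `keyword` in the
# token stream together with `window` tokens of context on each side.

def kwic(tokens, keyword, window=2):
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

tokens = "se baRi jaabe naa aami baRi jaaba".split()
for left, key, right in kwic(tokens, "baRi", window=1):
    print(f"{left:>10} [{key}] {right}")
```

Aligning the keyword in a fixed column, as the loop above does, is what makes contextual restrictions on a word form easy to spot by eye.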


    6.3 Local word grouping

Local Word Grouping (LWG) is another type of text analysis, which throws some light on the patterns of use of the surface word forms in Bangla text. Bangla, like most Indian languages, has relatively free word order. Still, there are some syntactic units which occur in fixed order. For instance, finite verbs are usually followed by auxiliary verbs, and nouns are followed by suffixes or postpositions. Both noun and verb groups can be formed by using only local information, which helps in the processing of word forms, phrases, and sentences according to standard Bangla grammar. Moreover, it helps in dissolving lexical ambiguity of the surface word forms as well as in determining fine shades of meaning. To a large extent, these finer aspects are conveyed by the internal relation between the constituents and the distribution of these constituents in the context. There are many compound word forms and group verbs in Bangla where the meaning or idea denoted by a particular arrangement of word forms cannot be composed from the meanings of the individual word forms. The meanings of the individual word forms, when combined, may not denote the same shades of meaning. For understanding and for translation into another language, they must be grouped together, and the related items should not be dislocated. This grouping supplies information for dealing with the functional behavior of the constituents at the time of parsing at both the phrase and sentence level.
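The "only local information" idea above can be sketched as a single left-to-right pass that merges a token with its right neighbor whenever the neighbor belongs to a closed class (postpositions, auxiliaries). The tiny closure list and romanized tokens here are purely illustrative assumptions, not the project's actual lexicon.

```python
# Local word grouping sketch: merge a content word with a following
# closed-class item (e.g. a postposition) using only adjacency.

POSTPOSITIONS = {"theke", "diye", "jonya"}   # hypothetical sample entries

def group_locally(tokens, closure=POSTPOSITIONS):
    groups, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i + 1] in closure:
            groups.append((tokens[i], tokens[i + 1]))   # noun + postposition
            i += 2
        else:
            groups.append((tokens[i],))
            i += 1
    return groups

print(group_locally("baRi theke aami aschi".split()))
```

Because the decision uses only the adjacent token, no global parse is needed, which is exactly what makes LWG a cheap preprocessing step before phrase- and sentence-level parsing.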

    6.4 Orthography and script analysis

The corpus under study gives us a clear view of the shape, size, occurrence, and functions of the graphic symbols used in the Bangla language. The Bangla script has six types of graphic symbols: vowels, consonants, vowel diacritics, consonant modifiers, compounds, and consonant clusters (Chaudhuri and Dash 1998). From the corpus, we are able to identify each graphic symbol, diacritic, compound, or cluster with all its attributes in order to develop an OCR system for the Bangla script.

    6.5 Statistical studies

The idea of placing language within the area of statistics and quantification does not really fit the traditional conception of linguistics. But the introduction of the computer has raised a demand for quantitative linguistics, which can be used for developing various tools. Moreover, linguists without adequate knowledge of statistical methods can make mistakes in handling linguistic data and in observation. Corpus-based quantitative analysis of characters is useful for OCR, cryptography, speech analysis and recognition, computer and typewriter keyboard design, spelling error correction, electronic dictionaries and machine aids for the visually handicapped, designing telegraphic codes, information-theoretic analysis of language, and printing technology. If the study is based on a methodically designed corpus, it can produce many authentic results which might not have been observed earlier. Moreover, this kind of analysis has thrown some light on the linguistic behavior of Bangla language users.

There are many studies on the quantitative analysis of European languages. Miller (1951) made a statistical survey of numerical results to study literary style and the statistical properties of language and to conduct an information-theoretic analysis of languages. Herden (1962) made multiple quantitative investigations of some English texts, while Edwards and Chambers (1964) performed frequency analysis on natural language with interesting results. In Bangla, Chatterji (1926, 1993) initiated a word-level frequency study in a dictionary, while Bhattacharya (1965) made quantitative studies of word forms, phonemes, sentences, and syllables on a sample collection of Bangla writing. Das and Mitra (1984) studied global character occurrence in Bangla, Assamese, and Manipuri on some selected texts. A similar quantitative analysis of some books by Nobel laureate Tagore was conducted in a collaborative project between ISI, Calcutta and ILCAA, Japan (Mallik and Nara 1994, 1996). We present here some frequency statistics computed on the Bangla corpus. The studies are done at three levels: character, word, and sentence.

    In the global occurrence of characters, the occurrence of consonants (52.76%) is much higher than that of vowels (39.63%) and clusters (7.61%) in the corpus, which differs from the findings of Bhattacharya (1965), who found the occurrence of vowels at 46% and consonants at 66% in a sample Bangla text. In a similar analysis, Dewey (1950) noted that the occurrence of vowels and diphthongs is 38% and that of consonants is 62% in a sample English text. The relative frequency of punctuation marks in an English text was studied by Miller (1951). In Bangla, the use of the comma, as in English, is highest (22.32%) in occurrence, followed by the full stop (17.26%), semicolon (15.27%), hyphen (8.89%), note of interrogation (7.38%), and colon (6.16%), respectively.
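    A punctuation profile of this kind is straightforward to compute. The sketch below is a minimal illustration, not the authors' original program: it tallies the share of each mark among all punctuation tokens in a text, using a short English sample string and the six marks discussed above.

```python
from collections import Counter

# The six marks discussed in the text: comma, full stop,
# semicolon, hyphen, note of interrogation, colon.
PUNCT = [",", ".", ";", "-", "?", ":"]

def punctuation_profile(text):
    """Return each mark's share (in %) of all punctuation tokens."""
    counts = Counter(ch for ch in text if ch in PUNCT)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: round(100.0 * n / total, 2) for ch, n in counts.items()}

sample = "Wait, stop; really? Yes: go on, go on. Done."
profile = punctuation_profile(sample)
```

    Run over a whole corpus instead of the toy sample, this yields the relative frequencies reported above.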

    Elderton (1949), Herden (1962), and many others undertook the study of word length distribution for English. Dewey (1950) counted the word length of Modern English prose texts, while Hoffman (1955) and Gibson (1962) studied the word length of Shakespeare's work in syllables. Flesch (1948) made an interesting study of the word length of English with stylistic analysis. The Bangla corpus shows that word forms with 4 characters occur most often (19.34%), followed by word forms with 3 characters (16.77%) and 5 characters (16.29%). Most of the word forms are confined within 12 characters, and those having more than 12 characters are either inflected, compound, or reduplicated word forms. Yule (1964) showed that the most frequently used English word form is 'the'. In Bangla, the most frequently used word form is naa 'no', which provides a view of human preference patterns with respect to the specific types of word forms used in conversation or any other kind of communication.
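    A word-length distribution like the one above can be derived from a tokenized corpus in a few lines. The following sketch, with an English toy sample standing in for the Bangla corpus, reports the percentage of word forms at each character length.

```python
from collections import Counter

def word_length_distribution(tokens):
    """Map each character length to its percentage share of tokens."""
    lengths = Counter(len(t) for t in tokens)
    total = sum(lengths.values())
    return {k: round(100.0 * v / total, 2)
            for k, v in sorted(lengths.items())}

tokens = "the quick brown fox jumps over the lazy dog".split()
dist = word_length_distribution(tokens)
```

    For Bangla, the tokenizer and the definition of "character" (grapheme vs. code point) would need care, since conjuncts and diacritics complicate the count.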

    In English, Yule (1964) estimated the average sentence length in simple dialogues. Flesch (1948) shows that sentences of scientific English consist of approximately 30 word forms, while sentences of literary English have approximately 20 word forms. In Bangla, sentences containing 7 word forms occur most often (7.55%) in the text, followed by sentences with 6 word forms (7.42%) and 8 word forms (7.29%), respectively. Among these sentences most are simple in construction, with one subject, object, finite and non-finite verb, one or two adjective(s) and adverb(s), with or without an indeclinable. But there are some rare sentences (mostly legal and court proceedings) which run to more than 200 word forms.

    6.6 Dictionary development

    Dictionaries normally do not contain enough information about subcategorization, selection restriction, and domain of application of lexical items, which can best be extracted from a corpus. Moreover, one can prepare a dictionary of the current language using the corpus, which would be more up-to-date and useful. The lexical database developed from the corpus is used for developing a talking dictionary, primarily intended for the blind. The pronunciation of the Bangla word forms in this dictionary is controlled by the phonological rules, lexical category, semanticity, and the influence of the utterance of foreign word forms (Dash and Chaudhuri 1998). The variation of pronunciation occurs at three positions (word-initial, word-final, and intermediate) and is mostly context dependent. For computer implementation, the Bangla word forms are divided into two groups: word forms with diacritics and word forms without diacritics, because a diacritic, depending on its position in the surface word form, can modulate the utterance of its preceding or succeeding vowel or consonant.
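    The two-way grouping by diacritics might be implemented as below. This is an illustrative sketch only: it assumes Unicode text and treats the Bengali dependent vowel signs (roughly U+09BE–U+09CC) and the hasanta/virama (U+09CD) as the diacritics in question; the paper does not specify its own character encoding.

```python
# Approximate diacritic set: Bengali dependent vowel signs plus
# the hasanta/virama. A few code points in this range are
# unassigned, which is harmless for membership testing.
DIACRITICS = {chr(cp) for cp in range(0x09BE, 0x09CE)}

def split_by_diacritics(word_forms):
    """Partition word forms into (with-diacritic, without-diacritic)."""
    with_dia, without_dia = [], []
    for w in word_forms:
        if any(ch in DIACRITICS for ch in w):
            with_dia.append(w)
        else:
            without_dia.append(w)
    return with_dia, without_dia

# "বাংলা" and "না" carry the vowel sign া (U+09BE); "ঘর" has none.
with_dia, without_dia = split_by_diacritics(["বাংলা", "ঘর", "না"])
```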

    6.7 Spelling variation studies

    One of the difficult problems of the Bangla script is its perennial orthographic and spelling variation. From the corpus, it is calculated that nearly 30% of the word forms have more than one valid spelling, while some word forms have more than two and even up to twelve valid spelling variations. The change of spelling is not arbitrary: it follows certain systems regulated by some linguistic and non-linguistic factors. The patterns of spelling change are of four types, namely deletion, addition, substitution, or displacement of graphemes, which are mostly caused by different phonological factors such as vowel harmony, devoicing, deaspiration, syllabic loss, etc.
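    For a pair of spellings differing by a single grapheme operation, the four change types can be detected mechanically. The sketch below is a simplified classifier, not the paper's system: it works on code points rather than Bangla graphemes, and English strings are used for readability.

```python
def variation_type(a, b):
    """Classify how spelling `b` differs from `a` by one grapheme
    operation; return None if they differ by more than one."""
    if a == b:
        return None
    la, lb = len(a), len(b)
    if la == lb:
        diffs = [i for i in range(la) if a[i] != b[i]]
        if len(diffs) == 1:
            return "substitution"
        # Two positions whose contents are swapped: displacement.
        if (len(diffs) == 2 and a[diffs[0]] == b[diffs[1]]
                and a[diffs[1]] == b[diffs[0]]):
            return "displacement"
        return None
    if la - lb == 1:  # b lacks one grapheme of a
        for i in range(la):
            if a[:i] + a[i + 1:] == b:
                return "deletion"
        return None
    if lb - la == 1:  # b has one extra grapheme
        return "addition" if variation_type(b, a) == "deletion" else None
    return None
```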

    6.8 Word-form parsing

    Word-form parsing is a vital step towards computer understanding of a natural language. The term derives from the Latin pars, and to parse means "to tell the parts of speech of a word and the relation of the various words to each other in a sentence" (Crystal 1980). In NLP, word-form parsing means the automatic identification of a string of graphemes as a valid word form, analysis of its formation, determination of its lexical category, and automatic extraction of its meaning (single or multiple) along with all underlying grammatical information. The parsing method is based on traditional grammar systems applicable to all lexical categories with necessary modifications. After selecting a word form, it is divided into root and suffix (if any). The roots are stored in a root list and the suffixes in a suffix list in the machine. Based on certain grammatical rules and matching algorithms, the surface Bangla word forms are either parsed or generated.
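    The root-suffix split against stored lists might look like the following sketch. The root and suffix lists are tiny hypothetical stand-ins (with English forms for readability), not the actual machine lists or matching rules described in the paper.

```python
# Illustrative stand-ins for the stored root and suffix lists.
ROOTS = {"walk", "talk", "play"}
SUFFIXES = {"", "s", "ed", "ing"}  # "" licenses a bare root

def parse_word_form(word):
    """Return every (root, suffix) split licensed by the lists."""
    return [(word[:i], word[i:])
            for i in range(len(word) + 1)
            if word[:i] in ROOTS and word[i:] in SUFFIXES]
```

    A real system would add the grammatical-rule checks mentioned above, since pure list matching over-generates for categories with irregular morphology.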


    6.9 Machine translation

    Machine translation technology is primarily an offspring of AI and CL. Looking back, one can generalize that MT technology in the 60s was primarily direct in approach, using a series of pattern matching techniques. In the 70s, it was primarily a transfer-based approach making use of the transformation of syntactic rules. In the 80s, the semantics-based approach was developed using knowledge-based technology developed by AI research (Boon 1992). In the 90s, MT technology turned to neural networks, case-based reasoning, and semantic- and knowledge-based capabilities. Recent research includes human language cognition and linguistic discourse for the improvement of the existing technology.

    7. Conclusion

    This paper presents only a brief report on the development of the Bangla corpus and of the work we have initiated using the corpus. Our initial problem was the lack of an electronic corpus, which was solved by the method of manual data-entry. Moreover, the Bangla writing system is not well defined, which created problems regarding the writing of compound and reduplicated word forms. These are generally printed either with or without a space, or with a hyphen in between. This posed a great problem for word-form parsing. Because of the lack of standard utterance rules, we had to consult linguists and phoneticians regarding the standard articulation of word forms for the talking dictionary. Finally, the copyright problem was taken care of by the DOE, Government of India.

    Corpus-based research itself has some limitations, as it cannot address some social, evocative, or historical aspects of language. It also fails to reveal how world knowledge and context can play a pivotal role in determining the actual meaning of a word form, and how meaning changes, divides, or merges with the change of time and space. Moreover, generative grammarians criticize that a corpus is a sample of performance only and that one still needs a means of projecting beyond the corpus to the language as a whole (Crystal 1980). It definitely cannot encompass the generative power of the language, because it deals only with what is used in the language and not with what could have been used. So the overall potential or competence of a language cannot be examined through the corpus.

    Despite such difficulties, it is now agreed that empirical analysis of language based on a large and methodically well-developed corpus can yield findings which open up new horizons of language study hitherto unknown to our earlier masters.

    Acknowledgment

    The DOE, Government of India is acknowledged for supporting the corpus development. The anonymous reviewers of this paper are also acknowledged for providing suggestions for necessary modifications.

    References

    Aronoff, M. 1981. Word Formation in Generative Grammar. Cambridge, Mass.: MIT Press.

    Bhattacharya, N. 1965. Some Statistical Studies of the Bangla Language. Doctoral dissertation. Calcutta: Indian Statistical Institute.

    Boon, L. H. 1992. Some Lessons Learnt by a Newcomer. Handout, Institute of System Science. Singapore: National University.

    Chatterji, S. K. 1993. The Origin and Development of the Bengali Language. Calcutta: Rupa Publications.

    Chaudhuri, B. B. and N. S. Dash. 1998. Bangla Script: A Structural Study. Linguistics Today 2(1): 1–28.

    Chaudhuri, B. B. and T. Pal. 1998. Detection of Word Error Position and Correction Using Reversed Word Dictionary. In Proceedings of the International Conference on Computational Linguistics, Speech and Document Processing (ICCLSDP'98): C41–46.

    Crystal, D. 1980. A First Dictionary of Linguistics and Phonetics. Boulder, Colorado: Westview Press.

    Das, G. and S. Mitra. 1984. Representing Assamese, Bengali and Manipuri Text in Line Printer and Daisy-Wheel Printer. Journal of the Institution of Electronics and Telecommunication Engineers 30: 251–256.

    Dash, N. S. and B. B. Chaudhuri. 1998. Utterance Rules for Bangla Words and Their Computer Implementation. In Proceedings of the International Conference on Computational Linguistics, Speech and Document Processing (ICCLSDP'98): C55–62.

    Dewey, G. 1950. Relative Frequency of English Speech Sounds. Cambridge, Mass.: Harvard University Press.

    Edwards, A. W. and R. L. Chambers. 1964. Journal of the Association for Computing Machinery 2: 465–482.

    Elderton, W. P. 1949. A Few Statistics on the Length of English Words. Journal of the Royal Statistical Society, Series A (CXII): 436–445.

    Flesch, R. 1948. The Art of Plain Talk. New York: Harper & Brothers.

    Gibson, H. N. 1962. The Shakespeare Claimants: A Critical Survey of the Four Principal Theories Concerning the Authorship of the Shakespearean Plays. London: Methuen.

    Herden, G. 1962. Calculus of Linguistic Observation. The Hague: Mouton & Co.

    Hoffman, C. 1955. The Man Who Was Shakespeare. New York: Julian Messner Inc.

    Kjellmer, G. 1984. Why Great: Greatly but not Big: Bigly? Studia Linguistica 38: 1–19.

    Mallik, B. P. and T. Nara (eds.). 1994. Gitanjali: Linguistic Statistical Analysis. Tokyo: ILCAA, Tokyo University.

    Mallik, B. P. and T. Nara (eds.). 1996. Sabhyatar Sankat: Linguistic Statistical Analysis. Calcutta: Rabindra Bharati University.

    Merlo, P. 1996. Parsing with Principles and Classes of Information. Dordrecht: Kluwer Academic Publishers.

    Miller, G. A. 1951. Language and Communication. New York: McGraw-Hill.

    Murthy, B. K. and W. R. Despande. 1998. Language Technology in India: Past, Present and Future. SAARC Conference on Extending the Use of Multilingual and Multimedia Information Technology (EMMIT'98). Pune.

    Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.

    Spencer, A. 1991. Morphological Theory. Oxford: Basil Blackwell.

    Teubert, W. 1996. Editorial. International Journal of Corpus Linguistics 1: 1–2.

    Winograd, T. 1972. Understanding Natural Language. New York: Academic Press.

    Yule, G. U. 1964. The Statistical Study of Literary Vocabulary. Cambridge: Cambridge University Press.
