
CHAPTER 7

SYNTACTIC PARSER FOR KANNADA LANGUAGE

This chapter deals with the development of Penn Treebank based statistical syntactic parsers for the Kannada language. Syntactic parsing is the task of recognizing a sentence and assigning a syntactic structure to it. A syntactic parser is an essential tool for various NLP applications and for natural language understanding. The well-known Penn Treebank grammar formalism was used to create the corpus for the developed statistical syntactic parser. The parsing system was trained on a carefully constructed Treebank-style corpus of 1,000 distinct Kannada sentences, annotated with correct segmentation and POS information. The developed system uses the SVM based POS tagger generator described in chapter 5 to assign the proper tag to every word in the training and test sentences.

Penn Treebank corpora have proved their value in both linguistics and language technology all over the world, and a great deal of successful research has been carried out on Treebank based probabilistic parsing. The main advantage of Treebank based probabilistic parsing is its ability to handle the extreme ambiguity produced by context-free natural language grammars. Information obtained from the Penn Treebank corpora has challenged intuition-based language study for various NLP purposes [168]. A South Dravidian language like Kannada is morphologically rich: a single word may carry several different kinds of information, and the different morphs composing a word may themselves stand for, or indicate a relation to, other elements in the syntactic parse tree. Deciding the status of such orthographic words in the syntactic parse trees is therefore a challenging task for developers.

The proposed syntactic parser was implemented using supervised machine learning and PCFG approaches. Training, testing and evaluation were carried out with SVM algorithms. Experiments show that the developed system performs well and achieves very competitive accuracy.
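The sketch below outlines how a PCFG can be induced from Treebank-style trees and used for probabilistic parsing. It is a minimal illustration assuming the NLTK toolkit; the two bracketed trees, written with English glosses in Kannada's SOV word order, are invented stand-ins for the thesis corpus, not the actual data.

    import nltk
    from nltk import Tree
    from nltk.grammar import Nonterminal, induce_pcfg

    # Hypothetical Treebank-style trees; English glosses in SOV order
    # stand in for the 1,000-sentence annotated Kannada corpus.
    bracketed = [
        "(S (NP (NNP Rama)) (VP (NP (NN ball)) (VF threw)))",
        "(S (NP (NNP Sita)) (VP (NP (NN book)) (VF read)))",
    ]
    trees = [Tree.fromstring(s) for s in bracketed]

    # Collect every production and estimate rule probabilities from counts.
    productions = []
    for t in trees:
        productions += t.productions()
    grammar = induce_pcfg(Nonterminal("S"), productions)

    # A Viterbi (CKY-style) parser returns the most probable tree.
    parser = nltk.ViterbiParser(grammar)
    for tree in parser.parse("Rama ball threw".split()):
        print(tree)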


    7.1 RELATED WORK

A series of statistical parsers for English were developed by various researchers, namely Charniak (1997), Collins (2003), Bod et al. (2003) and Charniak and Johnson (2005) [169,170]. All these parsers were trained and tested on the standard benchmark corpus, the WSJ portion of the Penn Treebank. A probability model for a lexicalized PCFG was developed by Charniak in 1997. Around the same time, Collins described three generative parsing models, each a refinement of the previous one, achieving progressively improved performance. In 1999 Charniak introduced a much better parser based on a maximum-entropy approach; this parsing model rests on a probabilistic generative model and uses a maximum-entropy inspired technique for conditioning and smoothing. In the same period Collins also presented a statistical parser for Czech using the Prague Dependency Treebank. The first statistical parsing model based on a Chinese Treebank was developed in 2000 by Bikel and Chiang. A probabilistic Treebank based parser for German was developed by Dubey and Keller in 2003 using a syntactically annotated corpus called Negra. The latest addition to the list of available Treebanks is the French Le Monde corpus, which was made available for research purposes in May 2004. Ayesha Binte Mosaddeque and Nafid Haque wrote a CFG for 12 Bangla sentences taken from a newspaper [76] and used a recursive descent parser to parse with the CFG.

    7.2 THEORETICAL BACKGROUND

    7.2.1 Parsing

Syntactic analysis is the process of analyzing a text or sentence, made up of a sequence of words called tokens, to determine its grammatical structure with respect to given grammatical rules. Parsing is an important process in NLP, used to understand the syntax and semantics of natural language sentences within the limits of the grammar. Parsing refers to the automatic analysis of texts according to a grammar; technically, it is the practice of assigning syntactic structure to a text. Viewed the other way around, a parser is a computational system which processes an input sentence according to the productions of the grammar and builds one or more constituent structures, called parse trees, which conform to the grammar.
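As a concrete illustration of this definition, the sketch below builds a toy CFG and lets a chart parser construct the parse trees that conform to it. NLTK is assumed as the toolkit, and the English glosses in Kannada's SOV order are illustrative assumptions.

    from nltk import CFG, ChartParser

    # A toy grammar in SOV order; the glosses are illustrative only.
    grammar = CFG.fromstring("""
        S   -> NP VP
        NP  -> NNP | NN
        VP  -> NP VF
        NNP -> 'Rama'
        NN  -> 'ball'
        VF  -> 'threw'
    """)

    # The parser processes the input according to the productions above
    # and builds every constituent structure that conforms to the grammar.
    parser = ChartParser(grammar)
    for tree in parser.parse("Rama ball threw".split()):
        print(tree)  # (S (NP (NNP Rama)) (VP (NP (NN ball)) (VF threw)))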

  • 221

Before a syntactic parser can parse a sentence, it must be supplied with information about each word in the sentence. Put another way, a parser accepts as input a sequence of words together with an abstract description of the possible structural relations that may hold between words or sequences of words in some language, and produces as output zero or more structural descriptions of the input, as permitted by the structural rule set. There will be zero descriptions if either the input sequence cannot be analyzed by the grammar, i.e. is ungrammatical, or the parser is incomplete, i.e. fails to find all of the structure the grammar permits. There will be more than one description if the input is ambiguous with respect to the grammar, i.e. if the grammar permits more than one analysis of the input.
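This zero/one/many behaviour can be demonstrated directly. In the sketch below, a deliberately ambiguous toy grammar (a classic PP-attachment example, not taken from the thesis) yields two descriptions for an ambiguous input and none for an input it cannot analyze; NLTK is again assumed.

    from nltk import CFG, ChartParser

    # A deliberately ambiguous grammar: the PP may attach to the verb
    # phrase or to the object noun phrase.
    grammar = CFG.fromstring("""
        S   -> NP VP
        VP  -> V NP | VP PP
        NP  -> 'Rama' | Det N | NP PP
        PP  -> P NP
        Det -> 'the'
        N   -> 'man' | 'telescope'
        V   -> 'saw'
        P   -> 'with'
    """)
    parser = ChartParser(grammar)

    # Ambiguous input: two structural descriptions.
    parses = list(parser.parse("Rama saw the man with the telescope".split()))
    print(len(parses))  # 2

    # Ungrammatical input: zero descriptions.
    print(len(list(parser.parse("saw Rama the".split()))))  # 0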

In English, countable nouns have only two inflected forms, singular and plural, and regular verbs have only four inflected forms: the base form, the -s form, the -ed form, and the -ing form. The case is not the same for a language like Kannada, which may have hundreds of inflected forms for each noun or verb; an exhaustive lexical listing is simply not feasible. For such languages, one must build a word parser that uses the morphological system of the language to compute the part of speech and inflectional categories of any word, as sketched below.
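The following is a minimal sketch of such a word parser based on suffix stripping. The transliterated suffixes and feature labels here are illustrative assumptions only; a real Kannada analyzer would need the language's full morphological system.

    # Illustrative (transliterated) suffix table; not a real Kannada lexicon.
    SUFFIXES = [
        ("galannu", {"pos": "NN", "number": "plural", "case": "accusative"}),
        ("galu",    {"pos": "NN", "number": "plural"}),
        ("idanu",   {"pos": "VF", "tense": "past", "agreement": "3sg.m"}),
    ]

    def analyze(word):
        """Return (stem, features) for the first (longest) matching suffix."""
        for suffix, features in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix):
                return word[: -len(suffix)], features
        return word, {"pos": "unknown"}

    print(analyze("pustakagalu"))  # ('pustaka', plural noun): "books"
    print(analyze("maadidanu"))    # ('maad', past finite verb): "(he) did"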

    7.2.1.1 Top-Down Parser

Top-down parsing can be viewed as an attempt to find left-most derivations of an input stream by searching for parse trees using a top-down expansion of the given formal grammar rules. Tokens are consumed from left to right, and inclusive choice is used to accommodate ambiguity by expanding all alternative right-hand sides of grammar rules. The parser starts from the start symbol S and works down to reach the input. This gives the method its main advantage: the top-down strategy never wastes time exploring trees that cannot result in an S (root), since it begins by generating just those trees, and so it also never explores subtrees that cannot find a place in some S-rooted tree. The method also has its disadvantages. While it does not waste time on trees that do not lead to an S, it does spend considerable effort on S trees that are not consistent with the input. This weakness of top-down parsers arises from the fact that they generate trees before ever examining the input. Recursive descent parsers and LL parsers are examples of this approach, as in the sketch below.
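A minimal sketch of top-down parsing with NLTK's recursive descent implementation, reusing the toy SOV grammar from above (the glosses remain illustrative assumptions):

    from nltk import CFG, RecursiveDescentParser

    grammar = CFG.fromstring("""
        S   -> NP VP
        NP  -> NNP | NN
        VP  -> NP VF
        NNP -> 'Rama'
        NN  -> 'ball'
        VF  -> 'threw'
    """)

    # Expansion proceeds top-down from S, backtracking whenever a
    # predicted subtree fails to match the remaining input tokens.
    parser = RecursiveDescentParser(grammar)
    for tree in parser.parse("Rama ball threw".split()):
        print(tree)

Note that a recursive descent parser cannot handle left-recursive rules, which would send the top-down expansion into an infinite loop; the grammar above contains none.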


    7.2.1.2 Bottom-up Parser

A bottom-up parser starts with the input and attempts to rewrite it to the start symbol. Intuitively, the parser attempts to locate the most basic elements, then the elements containing these, and so on; this style of parsing is also known as shift-reduce parsing. The advantage of this method is that it never suggests trees that are not at least locally grounded in the input. The major disadvantage is that trees which have no hope of leading to an S, or of fitting in with any of their neighbors, are generated with wild abandon. LR parsers and operator precedence parsers are examples of this type of parser; a shift-reduce sketch follows below.
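A corresponding bottom-up sketch using NLTK's simple shift-reduce parser over the same illustrative grammar:

    from nltk import CFG, ShiftReduceParser

    grammar = CFG.fromstring("""
        S   -> NP VP
        NP  -> NNP | NN
        VP  -> NP VF
        NNP -> 'Rama'
        NN  -> 'ball'
        VF  -> 'threw'
    """)

    # Tokens are shifted onto a stack; the top of the stack is reduced
    # to a non-terminal whenever it matches a right-hand side, until
    # only the start symbol S remains.
    parser = ShiftReduceParser(grammar)
    for tree in parser.parse("Rama ball threw".split()):
        print(tree)

Being greedy and without backtracking, this simple shift-reduce parser can fail on inputs a chart parser would accept; it succeeds here because every reduction in this grammar is forced.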

    Another important distinction is whether the parser generates a leftmost derivation or

    a rightmost derivation. LL parsers will generate a leftmost derivation and LR parsers will

    generate a rightmost derivation (although usually in reverse).

    7.2.2 Syntactic Tree Structure

The different parts-of-speech tags and phrases associated with a sentence can be easily illustrated with the help of a syntactic structure. Fig. 7.1 below shows the output syntactic tree structure produced by a syntactic parser for a Kannada input sentence meaning "Rama threw the ball".

    Fig. 7.1: Syntactic tree structure
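The same structure can be written in bracketed Penn Treebank notation and rendered with NLTK; English glosses stand in here for the Kannada words:

    from nltk import Tree

    # Bracketed reconstruction of the structure in Fig. 7.1.
    tree = Tree.fromstring("(S (NP (NNP Rama)) (VP (NP (NN ball)) (VF threw)))")
    tree.pretty_print()  # draws the tree: S at the root, NP and VP below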


    For a given sentence, the corresponding syntactic tree structure conveys the following

    information.

7.2.2.1 Part-of-Speech Tags

    The syntactic trees help in identifying the tags of all the words in a given sentence as

    shown in Table 7.1.

Table 7.1: Parts-of-Speech and Phrases in a Sentence

    Parts-of-speech                  Phrases
    Node (Word)      Tag             Phrase    Name of Phrase
    "Rama"           NNP             NP        Noun phrase
    "ball"           NN              VP        Verb phrase
    "threw"          VF

7.2.2.2 Identification of Phrases

    The syntactic tree also helps in identifying the various phrases and the organization of

    these phrases into a sentence. In the above example there are two phrases as shown in

    Table 7.1.

    7.2.2.3 Useful Relationship

The syntactic tree structure also helps in identifying the relationships between different phrases or words. Indirectly, it identifies the relationship between the different functional parts of a sentence, such as subject, object, and verb. In the given example, the subject is "Rama", the object is "the ball" and the verb is "threw". Given the rule S -> NP VP in the above example, the NP is the subject of the sentence.
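These relationships can be read off the tree programmatically. In the sketch below (NLTK assumed, with the same English glosses), the NP daughter of S is taken as the subject, while the constituents of the VP give the object and the verb:

    from nltk import Tree

    tree = Tree.fromstring("(S (NP (NNP Rama)) (VP (NP (NN ball)) (VF threw)))")

    # Under S -> NP VP, the NP daughter of S is the subject; inside the
    # VP, the NP is the object and VF is the finite verb.
    subject, vp = tree[0], tree[1]
    obj, verb = vp[0], vp[1]
    print(subject.leaves(), obj.leaves(), verb.leaves())
    # ['Rama'] ['ball'] ['threw']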