Shallow Parsing for South Asian Languages -Himanshu Agrawal

Page 1: Shallow Parsing for South Asian Languages

Shallow Parsing for South Asian Languages

-Himanshu Agrawal

Page 2: Shallow Parsing for South Asian Languages

Shallow Parsing

Parts of Speech Tagging: assigning grammatical classes to words in a natural language sentence.

Text Chunking: dividing the text into syntactically correlated groups of words.

Example: [NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only # 1.8 billion ] [PP in [NP September ]] .

Page 3: Shallow Parsing for South Asian Languages

Applications

Direct Applications
• Automatic Spell Checking Software
• Grammar Suggestions (MS Word pop-ups)
• Full Parsing

Indirect Applications
• Machine Translation Systems
• Web Search

Page 4: Shallow Parsing for South Asian Languages

Nature of the problem of Shallow Parsing

A classic problem of classifying input tokens into given classes.

The sequence aspect:
• The sequence of best classes (each token classified independently), or
• The best sequence of classes (all tokens classified jointly).

Typically, the classifying information is the language context of the word under consideration.
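
The difference between "the sequence of best classes" and "the best sequence of classes" can be illustrated with a small sketch. The scores below are made-up numbers for a two-token input, and the two-tag set and function names are assumptions chosen for illustration, not taken from the slides. Greedy decoding picks the best class per token, while Viterbi decoding picks the jointly best sequence.

```python
from typing import Dict, List, Tuple

def greedy_decode(emissions: List[Dict[str, float]]) -> List[str]:
    """Sequence of best classes: pick the top-scoring tag for each token alone."""
    return [max(scores, key=scores.get) for scores in emissions]

def viterbi_decode(emissions: List[Dict[str, float]],
                   transitions: Dict[Tuple[str, str], float]) -> List[str]:
    """Best sequence of classes: maximise the product of all scores jointly."""
    # best[tag] = (score of the best path ending in tag, that path)
    best = {tag: (score, [tag]) for tag, score in emissions[0].items()}
    for scores in emissions[1:]:
        new_best = {}
        for tag, emit in scores.items():
            # Choose the previous tag that gives the highest score into `tag`.
            prev = max(best, key=lambda p: best[p][0] * transitions[(p, tag)])
            prev_score, prev_path = best[prev]
            new_best[tag] = (prev_score * transitions[(prev, tag)] * emit,
                             prev_path + [tag])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

# Toy, made-up scores (not from any trained model).
emissions = [{"NN": 0.6, "VB": 0.4}, {"NN": 0.55, "VB": 0.45}]
transitions = {("NN", "NN"): 0.1, ("NN", "VB"): 0.9,
               ("VB", "NN"): 0.5, ("VB", "VB"): 0.5}

greedy_tags = greedy_decode(emissions)
viterbi_tags = viterbi_decode(emissions, transitions)
print(greedy_tags, viterbi_tags)
```

With these toy numbers the two strategies disagree on the second token, because the transition scores penalise the NN NN sequence even though NN is the best class for each token in isolation.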

Page 5: Shallow Parsing for South Asian Languages

Shallow Parsing for English

The problem has been well studied for English.

Very efficient systems exist. Examples:
• Brill’s Tagger (’95): Transformation Based Learning.
• Adwait Ratnaparkhi (’99): Parsing with Maximum Entropy.

Significant effect on the development of MT systems for European Languages

Page 6: Shallow Parsing for South Asian Languages

Shallow Parsing for South Asian Languages

Portability of shallow parsing systems across languages? Not good!

Inflectional richness of the languages.

POS tagging only                                English   Hindi
Brill’s Transformation Based Learning           87%       79%
Ratnaparkhi’s Maximum Entropy Based Learning    89%       81%

* Training on 22,000 words and testing on 5,000 words.

Page 7: Shallow Parsing for South Asian Languages

Challenges with Indian Languages.

Poor disambiguation between certain POS categories, for example:
• NNP and NNC (Error Type 1)
• JJ and NN (Error Type 2)

Inflectional richness of the language.

Absence of markers such as the capitalization of proper nouns.

Is that Raj?

Page 8: Shallow Parsing for South Asian Languages

On Improving the performance for Hindi and other South Asian Languages.

There can be two ways:
• Improving the classifying information by using better features, language-specific information, or both.
• Improving the learning through better training and better inference.

Page 9: Shallow Parsing for South Asian Languages

A. POS Tagging

For better training and inference:

o Approach 1: Training on a hierarchical structure of tags

o Approach 2: Building a knowledge database from raw / un-annotated text to use as a `look up`.

Page 10: Shallow Parsing for South Asian Languages

Approach 1: Training on Hierarchical Tagset

Training in steps, on a hierarchical structure of classes.

[Figure: training levels 1 and 2]

Page 11: Shallow Parsing for South Asian Languages

Approach 1: Training on Hierarchical Tagset

The approach was devised to minimize the number of errors that are made within a family class.

Result: 73.33 %

Reasons:
• No mechanism to correct errors made in part 1 of training.
• Jittered language constructs while training in part 2.
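
Why a missing correction mechanism hurts can be seen in a minimal sketch of two-level decoding. The family mapping, the scores, and the function name below are hypothetical and only illustrate the idea; the slides do not specify the actual hierarchy or models. Level 1 commits to a family class, and level 2 may only choose a tag inside that family.

```python
from typing import Dict, Tuple

# Hypothetical grouping of fine-grained tags into family classes.
FAMILY_OF = {"NN": "N", "NNP": "N", "NNC": "N", "JJ": "J"}

def tag_hierarchically(family_scores: Dict[str, float],
                       fine_scores: Dict[str, float]) -> Tuple[str, str]:
    """Level 1 picks a family; level 2 picks the best fine tag in that family."""
    family = max(family_scores, key=family_scores.get)
    candidates = {tag: score for tag, score in fine_scores.items()
                  if FAMILY_OF[tag] == family}
    return family, max(candidates, key=candidates.get)

# Toy, made-up scores for a single word.
family_scores = {"N": 0.55, "J": 0.45}
fine_scores = {"NN": 0.20, "NNP": 0.25, "NNC": 0.10, "JJ": 0.45}

hierarchical_result = tag_hierarchically(family_scores, fine_scores)
flat_result = max(fine_scores, key=fine_scores.get)
print(hierarchical_result, flat_result)
```

In this toy case a flat classifier would choose JJ, but once level 1 has committed to the N family, level 2 can no longer reach JJ, so any level-1 mistake is carried through to the final tag.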

Page 12: Shallow Parsing for South Asian Languages

Approach 2: Building a knowledge database for `look up`

The knowledge database consists of words and the POS tags each word is known to have occurred with.

Why is it important?

Inflectional richness vs. per-class ambiguity

Page 13: Shallow Parsing for South Asian Languages

Building the knowledge database

• Adding words and their POS tags from the training data.
• Training on 22,000 words with gold-standard POS tags, and creating a training model ‘A’.
• Using model ‘A’ to annotate the raw text consisting of 2 lakh (200,000) words.
• Extracting the word/POS-tag pairs that were tagged with a very high confidence measure, and adding them to the database.
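
The database construction described above can be sketched roughly as follows. The function name, the placeholder words, and the confidence threshold of 0.95 are assumptions; the slides only say "very high confidence" and give no value. The automatically tagged input stands in for the output of model ‘A’ on the raw text.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

def build_knowledge_db(gold_pairs: Iterable[Tuple[str, str]],
                       auto_tagged: Iterable[Tuple[str, str, float]],
                       min_confidence: float = 0.95) -> Dict[str, Set[str]]:
    """Map each word to the set of POS tags it is known to occur with."""
    db: Dict[str, Set[str]] = defaultdict(set)
    # Step 1: every (word, tag) pair from the gold-standard training data.
    for word, tag in gold_pairs:
        db[word].add(tag)
    # Step 2: pairs from automatically tagged raw text, kept only when the
    # tagger's confidence reaches the (assumed) threshold.
    for word, tag, confidence in auto_tagged:
        if confidence >= min_confidence:
            db[word].add(tag)
    return dict(db)

# Toy data with placeholder words; confidences are made up.
gold_pairs = [("w1", "NN"), ("w2", "JJ")]
auto_tagged = [("w1", "NNP", 0.99), ("w2", "NN", 0.60), ("w3", "NNP", 0.97)]

knowledge_db = build_knowledge_db(gold_pairs, auto_tagged)
print(knowledge_db)
```

The low-confidence pair ("w2", "NN") is left out, so the database only grows with tags the first-pass model was very sure about.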

Page 14: Shallow Parsing for South Asian Languages

Using the knowledge database

For the final tagging:

• We use model ‘A’ to get the probability of each tag being associated with a word, i.e. P(tag_i | word) for every tag and for every word in the test data.
• If a word is found in the database, we choose the tag in its entry that has the highest probability.
• If it is not found, we let the tag predicted in the first run remain unchanged.
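
A minimal sketch of this look-up step is given below, assuming the per-word tag probabilities and first-run predictions are already available from model ‘A’. The function name, the toy database, and all numbers are hypothetical.

```python
from typing import Dict, List, Set

def retag_with_lookup(words: List[str],
                      tag_probs: List[Dict[str, float]],
                      first_run_tags: List[str],
                      db: Dict[str, Set[str]]) -> List[str]:
    """Restrict known words to their database tags; leave unknown words as is."""
    final_tags = []
    for word, probs, predicted in zip(words, tag_probs, first_run_tags):
        allowed = db.get(word)
        if allowed:
            # Among the tags listed for this word, take the most probable one.
            final_tags.append(max(allowed, key=lambda tag: probs.get(tag, 0.0)))
        else:
            # Word not in the database: keep the first-run prediction.
            final_tags.append(predicted)
    return final_tags

# Toy inputs; probabilities are made up, not model output.
db = {"raj": {"NN", "NNP"}}
words = ["raj", "unseen"]
tag_probs = [{"NN": 0.30, "NNP": 0.25, "JJ": 0.45},
             {"NN": 0.20, "JJ": 0.80}]
first_run_tags = ["JJ", "JJ"]

final_tags = retag_with_lookup(words, tag_probs, first_run_tags, db)
print(final_tags)
```

Here the first word's first-run tag JJ is not in its database entry, so it is replaced by NN, the most probable tag among those listed, while the unseen word keeps its first-run tag.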

Page 15: Shallow Parsing for South Asian Languages

Approach 2

Results:

84.90 %

Page 16: Shallow Parsing for South Asian Languages

Training for Model `A`

We use a linear-chain implementation of Conditional Random Fields (Taku Kudo et al., 2005).

We use simple language-independent features:
• Word window [-2, 2].
• Suffix information: the last 2, 3, and 4 characters.
• Presence of special characters.
• Word length.
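
The feature set listed above could be extracted per token roughly as in the sketch below. The feature names, the padding symbol, and the definition of "special character" (any Unicode punctuation or symbol) are assumptions, and suffixes are counted in code points, which may differ from what the authors used for Indic scripts.

```python
import unicodedata
from typing import Dict, List, Union

def token_features(words: List[str], i: int) -> Dict[str, Union[str, bool, int]]:
    """Language-independent features for the token at position i."""
    word = words[i]
    feats: Dict[str, Union[str, bool, int]] = {}
    # Word window [-2, 2]; positions outside the sentence get a pad symbol.
    for offset in range(-2, 3):
        j = i + offset
        feats[f"w[{offset}]"] = words[j] if 0 <= j < len(words) else "<PAD>"
    # Suffix information: the last 2, 3 and 4 characters (empty if too short).
    for n in (2, 3, 4):
        feats[f"suffix{n}"] = word[-n:] if len(word) >= n else ""
    # Special characters: assumed here to mean punctuation (P*) or symbols (S*).
    # str.isalnum() is avoided because it rejects Devanagari vowel signs.
    feats["has_special"] = any(unicodedata.category(ch)[0] in "PS" for ch in word)
    feats["length"] = len(word)
    return feats

sentence = ["He", "reckons", "the", "deficit", "."]
feats_reckons = token_features(sentence, 1)
feats_period = token_features(sentence, 4)
print(feats_reckons)
```

Each token's dictionary would then be serialised into whatever column or template format the chosen CRF toolkit expects.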

Page 17: Shallow Parsing for South Asian Languages

B. Chunking

We have followed the approach used by Anirudh, Himanshu ’06 NWAI.

2-step training:
• Training on the Boundary-Label scheme for extracting chunk labels.
• Training on boundaries with added information of chunk labels.
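
One way to picture the two training targets is the sketch below. It assumes the boundary-label scheme is a B/I-style encoding such as `B-NP` and `I-NP`, which the slides do not spell out; the function names are also hypothetical. The combined tags serve as the first-step target, from which chunk labels are read off, and the boundaries become the second-step target with the labels available as extra information.

```python
from typing import List, Tuple

Chunk = Tuple[str, List[str]]  # (chunk label, words in the chunk)

def chunks_to_boundary_labels(chunks: List[Chunk]) -> List[Tuple[str, str]]:
    """Encode chunks as per-token boundary-label tags (assumed B/I encoding)."""
    tagged = []
    for label, words in chunks:
        for k, word in enumerate(words):
            boundary = "B" if k == 0 else "I"  # B = chunk start, I = inside
            tagged.append((word, f"{boundary}-{label}"))
    return tagged

def split_tag(tag: str) -> Tuple[str, str]:
    """Split a combined tag such as 'B-NP' into (boundary, label)."""
    boundary, label = tag.split("-", 1)
    return boundary, label

# Prefix of the bracketed example from the Shallow Parsing slide.
chunks = [("NP", ["He"]), ("VP", ["reckons"]),
          ("NP", ["the", "current", "account", "deficit"])]

tagged = chunks_to_boundary_labels(chunks)
labels = [split_tag(tag)[1] for _, tag in tagged]      # extracted in step 1
boundaries = [split_tag(tag)[0] for _, tag in tagged]  # target of step 2
print(tagged)
```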

Page 18: Shallow Parsing for South Asian Languages

Chunking cont.

Training for identifying chunk tags is also done using a linear-chain implementation of CRF.

Features:
• Word window of [-2, 2]
• POS tag window of [-2, 2]
• Chunk labels, for chunk boundary identification [-2, 0]

Page 19: Shallow Parsing for South Asian Languages

Chunking

Results

92.69 %

Page 20: Shallow Parsing for South Asian Languages

Consolidated Results

** The results below are calculated on the development data.

               Hindi     Telugu    Bengali
POS Tagging    84.90 %   71.22 %   81.09 %
Chunking       92.69 %   91.77 %   94.90 %

Page 21: Shallow Parsing for South Asian Languages

Conclusions:

Training should be done on a tag-set that is optimal for capturing the language patterns.

If training is done in more than one step, especially such that tags in a subsequent step depend directly on the tags in the present step, then it is important that there is a way to re-tag the mis-tagged tokens.

Page 22: Shallow Parsing for South Asian Languages

References:

Charles Sutton. An Introduction to Conditional Random Fields for Relational Learning.

Adwait Ratnaparkhi. 1998. Maximum Entropy Models for Natural Language Ambiguity Resolution. Dissertation in Computer and Information Science, University of Pennsylvania.

Akshay Singh, Sushma Bendre, Rajeev Sangal. 2005. HMM Based Chunker for Hindi. IIIT Hyderabad.

Thorsten Brants. 2000. TnT - A Statistical Part-of-Speech Tagger. Proceedings of the Sixth Conference on Applied Natural Language Processing, 224–231.

Himanshu Agrawal, Anirudh Mani. 2006. Part of Speech Tagging and Chunking Using Conditional Random Fields. Proceedings of the NLPAI ML Contest Workshop, National Workshop on Artificial Intelligence.