Shallow Parsing for South Asian Languages

-Himanshu Agrawal

Shallow Parsing

Part-of-Speech Tagging: assigning grammatical classes to the words in a natural language sentence.

Text Chunking: dividing the text into syntactically correlated groups of words.

Example: [NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only # 1.8 billion ] [PP in [NP September ]] .

Applications

Direct Applications
• Automatic spell-checking software
• Grammar suggestions (MS Word pop-ups)
• Full parsing

Indirect Applications
• Machine translation systems
• Web search

Nature of the problem of Shallow Parsing

A classic problem of classifying input tokens into given classes.

The sequence aspect: the sequence of best classes (each token labeled on its own) versus the best sequence of classes (all labels chosen jointly).
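The difference between the two readings can be made concrete with a toy example. The sketch below is illustrative only: the tags N and V, the per-token scores in EMIT, and the transition scores in TRANS are invented numbers, not output of any trained model. best_classes picks each token's tag independently, while best_sequence runs a small Viterbi search over whole tag sequences.

```python
# Toy scores (assumed for illustration): two tokens, two tags.
EMIT = [{"N": 0.6, "V": 0.4}, {"N": 0.55, "V": 0.45}]
TRANS = {("N", "N"): 0.1, ("N", "V"): 0.9, ("V", "N"): 0.8, ("V", "V"): 0.2}


def best_classes(emit):
    """Sequence of best classes: label every token on its own."""
    return [max(scores, key=scores.get) for scores in emit]


def best_sequence(emit, trans):
    """Best sequence of classes: Viterbi search over joint tag sequences."""
    tags = list(emit[0])
    # best[t] = (score of the best path ending in tag t, that path)
    best = {t: (emit[0][t], [t]) for t in tags}
    for scores in emit[1:]:
        new_best = {}
        for t in tags:
            # Best previous tag to transition from when the current tag is t.
            prev = max(tags, key=lambda p: best[p][0] * trans[(p, t)])
            score = best[prev][0] * trans[(prev, t)] * scores[t]
            new_best[t] = (score, best[prev][1] + [t])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


print(best_classes(EMIT))          # independent choice per token
print(best_sequence(EMIT, TRANS))  # jointly best sequence
```

With these made-up scores the two strategies disagree: the independent choice is N N, while the jointly best sequence is N V, because the N to N transition is scored as unlikely.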

Typically, the classifying information is the language context of the word under consideration.

Shallow Parsing for English

The problem has been studied extensively for English.

Very efficient systems exist. Examples:
• Brill's Tagger ('95): Transformation Based Learning.
• Adwait Ratnaparkhi ('99): Parsing with Maximum Entropy.

Significant effect on the development of MT systems for European Languages

Shallow Parsing for South Asian Languages

Portability of Shallow Parsing Systems across languages ??

NOT GOOD !!

Inflectional richness of the languages.

POS tagging only*                               English   Hindi
Brill's Transformation Based Learning           87%       79%
Ratnaparkhi's Maximum Entropy Based Learning    89%       81%

* Training on 22,000 words and testing on 5,000 words.

Challenges with Indian Languages.

Poor disambiguation between certain POS class categories, for example:
• NNP and NNC (Error Type 1)
• JJ and NN (Error Type 2)

Inflectional Richness of the language

Absence of markers such as the capitalization of proper nouns.

Is that Raj ?

On Improving the performance for Hindi and other South Asian Languages.

There can be two ways:
• Improving the classifying information through better features, language-specific information, or both.
• Improving the learning through better training and better inference.

A. POS Tagging

For better training and inference:

o Approach 1: Training on a hierarchical structure of tags.

o Approach 2: Building a knowledge database from raw (un-annotated) text to use as a look-up.

Approach 1: Training on Hierarchical Tagset

Training in steps, on a hierarchical structure of classes.

[Figure: hierarchical structure of classes, with training levels 1 and 2]

Approach 1: Training on Hierarchical Tagset cont.

The approach was devised to minimize the number of errors that are made within a family class.
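A minimal sketch of the two-level idea, under stated assumptions: the FAMILY mapping below is a hypothetical grouping of a handful of tags, not the actual hierarchy used in this work, and the fine-tag scores are made up. Level 1 is trained on family labels; level 2 then chooses a fine tag only from within the family predicted at level 1.

```python
# Hypothetical fine tag -> family mapping (assumed for illustration only).
FAMILY = {"NN": "N", "NNP": "N", "NNC": "N", "JJ": "J", "VB": "V"}


def level1_labels(tagged):
    """Collapse gold fine tags to their families to get level-1 training data."""
    return [(word, FAMILY[tag]) for word, tag in tagged]


def level2_candidates(family):
    """Fine tags that remain possible once level 1 has fixed the family."""
    return sorted(tag for tag, fam in FAMILY.items() if fam == family)


def decode_level2(family, fine_scores):
    """Pick the best-scoring fine tag, restricted to the predicted family."""
    return max(level2_candidates(family), key=lambda tag: fine_scores.get(tag, 0.0))


print(level1_labels([("word_a", "NNP"), ("word_b", "JJ")]))
print(level2_candidates("N"))
# Level 1 said "N", so JJ cannot be chosen even though it scores highest.
print(decode_level2("N", {"JJ": 0.9, "NN": 0.3, "NNP": 0.2}))
```

The last call illustrates the design trade-off: restricting level 2 to one family removes cross-family confusions at that step, but a wrong family from level 1 can no longer be undone.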

Results

73.33 %

Reasons:
• No mechanism to correct errors made in part 1 of training.
• Jittered language constructs while training in part 2.

Approach 2: Building a knowledge database for look-up

The knowledge database consists of words and the POS tags each word is known to have occurred with.

Why is it important?

Inflectional richness vs. per-class ambiguity

Building the knowledge database

Adding words and their POS tags from the training data.

Training on 22,000 words with gold-standard POS tags, and creating a training model 'A'.

Using model 'A' to annotate the raw text consisting of 2 lakh (200,000) words.

Extracting the word/POS tag pairs that were tagged with a very high confidence measure, and adding them to the database.
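The steps above can be sketched as follows. This is a sketch under assumptions rather than the original implementation: the function name build_database, the (word, tag, confidence) triple format for the automatically tagged text, and the 0.95 threshold are all hypothetical, since the slides only say "very high confidence".

```python
from collections import defaultdict


def build_database(gold_tagged, auto_tagged, threshold=0.95):
    """Map each word to the set of POS tags it is known to occur with.

    gold_tagged: (word, tag) pairs from the annotated training data.
    auto_tagged: (word, tag, confidence) triples produced by model 'A'
                 on raw text; only confident ones are kept.
    threshold:   assumed cut-off for "very high confidence".
    """
    database = defaultdict(set)
    for word, tag in gold_tagged:
        database[word].add(tag)
    for word, tag, confidence in auto_tagged:
        if confidence >= threshold:
            database[word].add(tag)
    return dict(database)


gold = [("a", "NN"), ("b", "VB")]
auto = [("a", "JJ", 0.99), ("c", "NN", 0.50), ("d", "NNP", 0.97)]
print(build_database(gold, auto))
```

In this toy run the low-confidence entry for "c" is dropped, while "a" gains a second possible tag from the raw text.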

Using the knowledge database

For the final tagging:

We use model 'A' to get the probability of each tag being associated with a word, i.e. P(tag_i | word), for every tag and for every word in the test data.

If a word is found in the database, we choose the tag in its entry that has the highest probability.

If it is not found, we let the tag predicted in the first run remain unchanged.
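The look-up rule can be written in a few lines. This is a hedged sketch: final_tag and its argument names are hypothetical, the database is assumed to map each word to a set of tags, and tag_probs stands in for the per-tag probabilities P(tag_i | word) obtained from model 'A'.

```python
def final_tag(word, tag_probs, first_run_tag, database):
    """Choose the final tag for one word.

    tag_probs:     dict of tag -> P(tag | word) from model 'A' (assumed format).
    first_run_tag: the tag model 'A' predicted in the first run.
    database:      dict of word -> set of tags seen with that word.
    """
    allowed = database.get(word)
    if not allowed:
        # Unknown word: keep the first-run prediction unchanged.
        return first_run_tag
    # Known word: the most probable tag among those listed in its entry.
    return max(allowed, key=lambda tag: tag_probs.get(tag, 0.0))


database = {"x": {"NN", "JJ"}}
probs = {"VB": 0.5, "NN": 0.3, "JJ": 0.2}
print(final_tag("x", probs, "VB", database))  # restricted to the entry for "x"
print(final_tag("y", probs, "VB", database))  # "y" is not in the database
```

For the known word, the first-run tag VB is overridden because it is not in the word's entry; for the unknown word it is kept.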

Approach 2

Results:

84.90 %

Training for Model 'A'

We use a linear-chain implementation of Conditional Random Fields (Taku Kudo et al., 2005).

We use simple language-independent features:
• Word window [-2, 2].
• Suffix information: the last 2, 3 and 4 characters.
• Presence of special characters.
• Word length.
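A possible rendering of these features as a per-token feature dictionary is sketched below. The feature names, the "<PAD>" marker for positions outside the sentence, and the definition of "special character" as any Unicode punctuation or symbol are assumptions made for illustration; the actual feature templates used with the CRF toolkit are not given in the slides.

```python
import unicodedata


def token_features(words, i):
    """Language-independent features for the token at position i."""
    feats = {}
    # Word window [-2, 2]; positions outside the sentence are padded.
    for offset in range(-2, 3):
        j = i + offset
        feats[f"w[{offset}]"] = words[j] if 0 <= j < len(words) else "<PAD>"
    word = words[i]
    # Suffixes of length 2, 3 and 4, when the word is long enough.
    for n in (2, 3, 4):
        if len(word) >= n:
            feats[f"suffix{n}"] = word[-n:]
    # Assumed definition: any punctuation (P*) or symbol (S*) character.
    feats["has_special"] = any(unicodedata.category(ch)[0] in "PS" for ch in word)
    feats["length"] = len(word)
    return feats


sentence = ["the", "rate", "is", "1.8%", "now"]
print(token_features(sentence, 3))
```

Using Unicode categories rather than an ASCII check keeps the special-character test usable for scripts such as Devanagari, where vowel signs are combining marks rather than letters.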

B. Chunking

We follow the approach of Agrawal and Mani (2006, NWAI).

Two-step training:
• Training on the Boundary-Label scheme to extract chunk labels.
• Training on boundaries with the added information of chunk labels.
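One way to picture the two steps is to split a combined tag into its two parts. The sketch below assumes combined tags are written in a "B-NP" / "I-NP" style (boundary, hyphen, label); that encoding and the function name are assumptions, not details taken from the cited paper. Step 1 would learn the combined tags and keep only the label part; step 2 would learn the boundary part with the labels supplied as extra input.

```python
def split_boundary_label(tags):
    """Split combined tags such as 'B-NP' into boundaries and chunk labels."""
    boundaries, labels = [], []
    for tag in tags:
        boundary, _, label = tag.partition("-")
        boundaries.append(boundary)
        # A bare tag with no label part (e.g. 'O') keeps 'O' as its label.
        labels.append(label or "O")
    return boundaries, labels


# Combined tags for "He reckons the current account deficit ."
combined = ["B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP", "O"]
boundaries, labels = split_boundary_label(combined)
print(labels)      # what step 1 contributes
print(boundaries)  # what step 2 predicts, given the labels
```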

Chunking cont.

Training for identifying Chunk tags is also done using a linear chain implementation of CRF.

Features:
• Word window of [-2, 2]
• POS tag window of [-2, 2]
• Chunk labels, for chunk boundary identification [-2, 0]
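The three windows can be sketched with a single helper. Everything named here is hypothetical (the window and boundary_features functions, the feature-name format, the "<PAD>" marker), and the English POS tags in the usage lines are illustrative. Note that the chunk-label window stops at offset 0, so only labels of the current and preceding tokens are used.

```python
def window(seq, i, lo, hi, name):
    """Features for seq[i+lo .. i+hi], padded outside the sentence."""
    feats = {}
    for offset in range(lo, hi + 1):
        j = i + offset
        feats[f"{name}[{offset}]"] = seq[j] if 0 <= j < len(seq) else "<PAD>"
    return feats


def boundary_features(words, pos_tags, chunk_labels, i):
    """Features for chunk boundary identification at position i."""
    feats = {}
    feats.update(window(words, i, -2, 2, "w"))             # word window [-2, 2]
    feats.update(window(pos_tags, i, -2, 2, "pos"))        # POS tag window [-2, 2]
    feats.update(window(chunk_labels, i, -2, 0, "label"))  # chunk labels [-2, 0]
    return feats


words = ["He", "reckons", "the", "deficit"]
pos_tags = ["PRP", "VBZ", "DT", "NN"]   # illustrative tags
chunk_labels = ["NP", "VP", "NP", "NP"]
print(boundary_features(words, pos_tags, chunk_labels, 2))
```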

Chunking

Results

92.69 %

Consolidated Results

** The results below are calculated on the development data.

               Hindi      Telugu     Bengali
POS Tagging    84.90 %    71.22 %    81.09 %
Chunking       92.69 %    91.77 %    94.90 %

Conclusions:

Train on a tag-set that is optimal for capturing the language patterns.

If training is done in more than one step, especially when the tags in a subsequent step depend directly on the tags from the present step, it is important that there be a way to re-tag the mis-tagged tokens.

References:

Charles Sutton. An Introduction to Conditional Random Fields for Relational Learning.

Adwait Ratnaparkhi. 1998. Maximum Entropy Models for Natural Language Ambiguity Resolution. Dissertation in Computer and Information Science, University of Pennsylvania.

Akshay Singh, Sushma Bendre, Rajeev Sangal. 2005. HMM Based Chunker for Hindi. IIIT Hyderabad.

Thorsten Brants. 2000. TnT - A Statistical Part-of-Speech Tagger. Proceedings of the Sixth Conference on Applied Natural Language Processing, 224–231.

Himanshu Agrawal, Anirudh Mani. 2006. Part of Speech Tagging and Chunking Using Conditional Random Fields. Proceedings of the NLPAI ML Contest Workshop, National Workshop on Artificial Intelligence.
