Mono & Cross Language Experiments on Persian Text. Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian. Database Research Group, School of Electrical and Computer Engineering, University of Tehran. 18 Sep 2008. Persian@CLEF 2008.


Page 1: Mono & Cross Language Experiments on Persian Text

Mono & Cross Language Experiments on Persian Text

Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian

Database Research Group

School of Electrical and Computer Engineering

University of Tehran

18 Sep 2008

Persian@CLEF 2008

Page 2: Mono & Cross Language Experiments on Persian Text

Outline

Persian Language

Persian Test Collections

Hamshahri in CLEF 2008

UT Participants

• Using Part of Speech Tagging in Persian Information Retrieval

• Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track

• Local Cluster Analysis Using Part of Speech Tagging

• Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text

• Cross Language Experiments at Persian@CLEF 2008

Next Year

Page 3: Mono & Cross Language Experiments on Persian Text

The Persian Language

A branch of the Indo-European languages

Official language of Iran, Afghanistan and Tajikistan

Its morphological analysis is comparatively difficult

The word “خبر” has two plural forms:

• Persian rules: “خبرها”

• Arabic rules: “اخبار”

Page 4: Mono & Cross Language Experiments on Persian Text

Some Processing Issues

Writing Style Issues:

• e.g. “می شود” and “میشود” are the same

• e.g. “کتابها” and “کتاب ها” are the same

KASRE: e.g. “چراغ علی خانه را سوزاند” has two different meanings:

• CheraghAli burned the house

• Ali’s lantern burned the house
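The writing-style variants on this slide can be collapsed to one form before indexing. The following is only a minimal sketch: it assumes the variants differ by a space or a zero-width non-joiner (U+200C) after the prefix "می" or before the plural suffix "ها". The function name `normalize_spacing` is hypothetical, and real text would need a much fuller rule set (a free-standing word "می", for instance, would be wrongly attached).

```python
import re

MI = "\u0645\u06cc"   # verbal prefix "می"
HA = "\u0647\u0627"   # plural suffix "ها"
# Separators that may sit between a word and its affix:
# an ordinary space or the zero-width non-joiner (ZWNJ, U+200C).
SEP = "[ \u200c]+"

def normalize_spacing(text: str) -> str:
    """Attach the prefix MI and the suffix HA directly to their word."""
    # prefix + separator + word  ->  prefix + word
    text = re.sub(r"\b" + MI + SEP + r"(?=\w)", MI, text)
    # word + separator + suffix  ->  word + suffix
    text = re.sub(r"(?<=\w)" + SEP + HA + r"\b", HA, text)
    return text
```

With this normalization, both spellings of each example map to the same index term, so a query in one style can match documents written in the other.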

Page 5: Mono & Cross Language Experiments on Persian Text

Some Processing Issues


Encoding

Page 6: Mono & Cross Language Experiments on Persian Text

Persian in the Middle East

[Figure: User Population Growth on the Web (2000-2009)]

Source: Internet World Stats, http://internetworldstats.com/ (December 31, 2007)

Page 7: Mono & Cross Language Experiments on Persian Text

Persian Test Collections

IR Domain

• Ghavanin (domain specific)

• Hamshahri (news), WEB: http://ece.ut.ac.ir/dbrg/hamshahri

NLP Domain

• Bijankhan (2 million words), WEB: http://ece.ut.ac.ir/dbrg/bijankhan

Page 8: Mono & Cross Language Experiments on Persian Text

Hamshahri in CLEF 2008


News articles of Hamshahri newspaper from year 1996 to 2002

Size of the documents varies from short news (under 1 KB) to rather long articles (e.g. 140 KB)

22 assessors

Evaluation based on DIRECT System

Page 9: Mono & Cross Language Experiments on Persian Text

Hamshahri in CLEF 2008

Collection size: 564 MB (Unicode text)

No. of documents: 166,774

No. of unique terms: 417,339

Average length of documents: 380 terms

No. of categories: 9

No. of topics: 50 (bilingual)

Page 10: Mono & Cross Language Experiments on Persian Text

Implementation of our methods

We submitted the top 100 results for each run

Page 11: Mono & Cross Language Experiments on Persian Text

Using Part of Speech Tagging in Persian Information Retrieval

Reza Karimpour, Amineh Ghorbani, Azadeh Pishdad, Mitra Mohtarami, Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian

[Diagram: system architecture. Corpus side: the Hamshahri corpus is POS-tagged (with the Bijankhan tagged collection of documents as training data) to give the Hamshahri tagged document collection, then simply stemmed to give the stemmed and tagged corpus. Query side: the user query is POS-tagged (same training data) and simply stemmed to give stemmed and tagged queries, then refined with parts of speech and their corresponding weights before retrieval.]

Page 12: Mono & Cross Language Experiments on Persian Text

Using Part of Speech Tagging in Persian Information Retrieval

Config. | Corpus | Query
1 | Tagged | Title with equal weighting for all POS tags
2 | Stemmed and tagged | Stemmed title with equal weighting for all POS tags
3 | Stemmed | Stemmed title without POS tagging
4 | Stemmed | Stemmed title plus description
5 | Stemmed (stop words removed) | Stemmed title plus description (stop words removed)
6 | Tagged | Title plus description with equal weighting for all POS tags
7 | Tagged | Title with various weighting schemes for different POS tags
8 | Normal | Title (neither stemmed nor tagged)

Page 13: Mono & Cross Language Experiments on Persian Text

Using Part of Speech Tagging in Persian Information Retrieval

POS weighting scheme | Average precision | R-Precision
The 20 least-used tags omitted, all others weighted equally | 0.2745 | 0.3097
Noun=3, Verb=2, Adj=1, Adv=1 | 0.2635 | 0.3104
Noun=3, Verb=0, Adj=3, Adv=0 | 0.2597 | 0.2888
Noun=0, Verb=2, Adj=0, Adv=0 | 0.1108 | 0.1256
Noun=0, Verb=0, Adj=1, Adv=0 | 0.1198 | 0.1186
Noun=0, Verb=0, Adj=0, Adv=1 | 0.0977 | 0.1111
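One way to read the per-tag weights on this slide is as multipliers applied to query terms according to their POS tag. The sketch below illustrates that reading only; the function name `weight_query_terms`, the tag labels and the English example query are hypothetical, and the slides do not say how the weights were fed into the retrieval engine.

```python
from collections import defaultdict

# One of the weighting schemes listed on the slide.
POS_WEIGHTS = {"Noun": 3.0, "Verb": 2.0, "Adj": 1.0, "Adv": 1.0}

def weight_query_terms(tagged_query, pos_weights=POS_WEIGHTS):
    """Map each query term to the summed weight of its POS tags.

    tagged_query is a list of (term, tag) pairs. Terms whose tag has
    no weight, or a weight of zero, are dropped from the query.
    """
    weights = defaultdict(float)
    for term, tag in tagged_query:
        w = pos_weights.get(tag, 0.0)
        if w > 0:
            weights[term] += w
    return dict(weights)
```

For example, a scheme with Noun=3 and Verb=0 would keep only the nouns of the query, which is consistent with the noun-heavy schemes scoring highest in the results above.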

Page 14: Mono & Cross Language Experiments on Persian Text


Using Part of Speech Tagging in Persian Information Retrieval

Page 15: Mono & Cross Language Experiments on Persian Text

Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track

Zahra Aghazade, Nazanin Dehghani, Leili Farzinvash, Razieh Rahimi, Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian

Weighting Model | Description
BB2 | Bose-Einstein model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization
BM25 | The BM25 probabilistic model
DFR_BM25 | The DFR version of BM25
IFB2 | Inverse Term Frequency model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization
In_expB2 | Inverse expected document frequency model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization
In_expC2 | Inverse expected document frequency model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization with natural logarithm
InL2 | Inverse document frequency model for randomness, succession for first normalization, and Normalization 2 for term frequency normalization
PL2 | Poisson estimation for randomness, succession for first normalization, and Normalization 2 for term frequency normalization
TF_IDF | The tf*idf weighting function, where tf is given by Robertson's tf and idf is given by the standard Sparck Jones' idf

Terrier Open Source Retrieval Engine: http://ir.dcs.gla.ac.uk/terrier/
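To make one of the listed models concrete, here is a minimal sketch of a textbook BM25 term score. It is not Terrier's implementation: the parameter defaults `k1=1.2` and `b=0.75` are common textbook values assumed here, and the idf variant with `+ 1` inside the logarithm is chosen only to keep scores non-negative.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq,
                    k1=1.2, b=0.75):
    """Score contribution of one query term for one document.

    tf          -- occurrences of the term in the document
    doc_len     -- document length in terms
    avg_doc_len -- average document length in the collection
    n_docs      -- number of documents in the collection
    doc_freq    -- number of documents containing the term
    """
    # Rare terms get a higher inverse document frequency.
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    # Length normalization: long documents need more occurrences.
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + length_norm)
```

A document score is the sum of this value over all query terms. With collection statistics such as those of Hamshahri (166,774 documents, average length 380 terms), a term occurring twice in an average-length document scores higher than one occurring once, and the same count in a longer document scores lower.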

Page 16: Mono & Cross Language Experiments on Persian Text

Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track

Weighting Model | Average Precision | R-Precision
BB2 | 0.3854 | 0.4167
BM25 | 0.3562 | 0.4009
DFR_BM25 | 0.4006 | 0.4347
IFB2 | 0.4017 | 0.4328
In_expB2 | 0.3997 | 0.4329
In_expC2 | 0.4190 | 0.4461
InL2 | 0.3832 | 0.4200
PL2 | 0.4314 | 0.4548
TF_IDF | 0.3574 | 0.4017

Page 17: Mono & Cross Language Experiments on Persian Text

Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track

And two other variations of this operator: IOWA and NOWA
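The operator itself did not survive on this slide. Assuming, as the names IOWA and NOWA suggest, that it is the ordered weighted averaging (OWA) operator, a minimal sketch of plain OWA looks as follows. The function name `owa` and the example scores are hypothetical, and how the fusion weights were actually chosen is not stated here.

```python
def owa(scores, weights):
    """Ordered weighted average of the given scores.

    The weights apply to positions in the descending ordering of the
    scores, not to particular sources, and must sum to 1.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ordered = sorted(scores, reverse=True)
    return sum(w * s for w, s in zip(weights, ordered))
```

In a fusion setting, `scores` would hold the normalized scores that several retrieval methods give the same document. Weights of `[1, 0, 0]` reduce OWA to the maximum, `[0, 0, 1]` to the minimum, and equal weights to the arithmetic mean.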


Page 18: Mono & Cross Language Experiments on Persian Text

Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track


Page 19: Mono & Cross Language Experiments on Persian Text

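Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track

Post hoc Results

Retrieval Method | Toolkit | Average Precision | R-Precision | Dif
TF_IDF with unstemmed single terms | Terrier | 0.3847 | 0.4122 |
PL2 with 4gram terms | Terrier | 0.3669 | 0.3939 |
Indri with stemmed terms | Lemur | 0.3955 | 0.4149 |
IOWA | | 0.4515 | 0.4708 | +5.6
NOWA | | 0.4522 | 0.4736 | +5.67

The Dif values match the gain in average precision, in percentage points, over the best single method (Indri with stemmed terms): 0.4515 - 0.3955 = 0.0560 and 0.4522 - 0.3955 = 0.0567.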

Page 20: Mono & Cross Language Experiments on Persian Text

Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text

Amir Hossein Jadidinejad, Mitra Mohtarami, Hadi Amiri

[Diagram: processing pipeline. Preprocessing: a POS tagger (MLE and TNT) is trained on the Bijankhan Collection and applied to the Hamshahri Clear Collection (test), giving a Hamshahri Tagged Collection by MLE and a Hamshahri Tagged Collection by TNT; content-less tags are removed and the useful tags kept. Retrieval: a retrieval engine produces the retrieved results. Post processing: the retrieved results are clustered into a relevant cluster and an irrelevant cluster, and cluster analysis gives the reranked results.]

Page 21: Mono & Cross Language Experiments on Persian Text

Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text

But the results were not good on the test set

Page 22: Mono & Cross Language Experiments on Persian Text

Cross Language Experiments at Persian@CLEF 2008

Abolfazl AleAhmad, Ehsan Kamalloo, Arash Zareh, Masoud Rahgozar, Farhad Oroumchian

Run | tot-ret | rel-ret | MAP | Retrieval Model | Tool
Using Light Stemmer | 5161 | 1967 | 26.89 | Vector Space | Lucene
Without Stemmer | 5161 | 1991 | 27.08 | Vector Space | Lucene
3Grams | 5161 | 1901 | 26.07 | Language Modeling | Lemur
4Grams | 5161 | 1950 | 26.70 | Language Modeling | Lemur
5Grams | 5161 | 1983 | 27.13 | Language Modeling | Lemur
Term-Based | 5161 | 2035 | 28.14 | Language Modeling | Lemur
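The 3Grams, 4Grams and 5Grams runs index n-grams rather than whole terms. The slides do not describe how the n-grams were built, so the sketch below assumes overlapping character n-grams taken inside each whitespace-delimited word, with shorter words kept whole. The function name `char_ngrams` is hypothetical.

```python
def char_ngrams(text, n):
    """Split text into overlapping character n-grams, word by word."""
    grams = []
    for word in text.split():
        if len(word) <= n:
            # Words no longer than n are kept as a single token.
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams
```

Because n-grams are taken from the surface form, variants of a word that share a stem still share most of their n-grams, which is why n-gram indexing is often tried as an alternative to stemming.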

Page 23: Mono & Cross Language Experiments on Persian Text

Cross Language Experiments at Persian@CLEF 2008

Query Translation

Probabilistic Structured Queries (PSQ)

Combinatorial Translation Probability (CTP)
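As a rough illustration of the PSQ idea, the sketch below estimates the term frequency of a source-language query term in a target-language document as the translation-probability-weighted sum of the frequencies of its translations. This is the commonly described form of PSQ and is an assumption about what the slide refers to. The function name, the transliterated terms and the probabilities are hypothetical, and CTP is not covered.

```python
from collections import Counter

def psq_term_frequency(translations, doc_tokens):
    """Estimated frequency of a query term in a translated-language document.

    translations -- dict mapping each candidate translation to its
                    translation probability for the query term
    doc_tokens   -- list of tokens of the target-language document
    """
    counts = Counter(doc_tokens)
    # Each translation contributes its own count, scaled by how likely
    # it is to be the right translation of the query term.
    return sum(prob * counts[term] for term, prob in translations.items())
```

For instance, if the English query term "book" had the hypothetical translations `{"ketab": 0.7, "neveshteh": 0.3}`, a document containing "ketab" twice and "neveshteh" once would get an estimated frequency of 0.7 * 2 + 0.3 * 1 = 1.7 for "book".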

Page 24: Mono & Cross Language Experiments on Persian Text

Cross Language Experiments at Persian@CLEF 2008

Query Translation Results

[Figure: precision-recall curves (x-axis: Recall, y-axis: Precision) for three query translation runs]

• All Meanings; MAP 6.73

• First Meaning; MAP 12.4

• PSQ_CTP+4Grams; MAP 14.46

Page 25: Mono & Cross Language Experiments on Persian Text

Cross Language Experiments at Persian@CLEF 2008

Document Translation

Used the Shiraz machine translation system from the Computing Research Laboratory (CRL) of New Mexico State University (NMSU)

It took 10 days to translate 130,000+ documents from Persian to English

Page 26: Mono & Cross Language Experiments on Persian Text

Cross Language Experiments at Persian@CLEF 2008

Document Translation & Hybrid Results

[Figure: precision-recall curves (x-axis: Recall, y-axis: Precision) for four runs]

• Document Translation; MAP 12.88

• Monolingual; MAP 27.08

• Query Translation; MAP 14.46

• Hybrid; MAP 16.19

Page 27: Mono & Cross Language Experiments on Persian Text

Next Year

Ham2 for the Next Year

Extended Version of Hamshahri Collection

2 times larger (~1.5 GB)

<DOC>
<DOCID>HAM2-851011-001</DOCID>
<DOCNO>HAM2-851011-001</DOCNO>
<ORIGINALFILE>/1385/851011/news/_adabh.htm</ORIGINALFILE>
<ISSUE>دوشنبه 11 دي 1385 - سال چهاردهم - شماره 4172 - Jan 1, 2007</ISSUE>
<DATE>2007-01-01</DATE>
<CAT xml:lang="fa">ادب و هنر</CAT>
<CAT xml:lang="en">Literature and Art</CAT>
<TITLE><![CDATA[مديركل كتاب و كتابخواني وزارت فرهنگ و ارشاد اسلامي خبر داد آيين نامه خريد كتاب اصلاح شد]]></TITLE>
<TEXT>
<image>/1385/851011/news/008505.jpg</image>
<![CDATA[فارس: مدير كل كتاب و كتاب خواني وزارت فرهنگ و ارشاد اسلامي گفت: آيين نام ...]]>
</TEXT>
</DOC>
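A record in this format can be read with a standard XML parser. The sketch below is only an illustration: it assumes each `<DOC>` element is well formed and is parsed on its own, uses a shortened sample with placeholder title and body text, and the function name `parse_doc` is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Shortened sample; the title and body text are placeholders.
SAMPLE = """<DOC>
<DOCID>HAM2-851011-001</DOCID>
<DATE>2007-01-01</DATE>
<CAT xml:lang="fa">ادب و هنر</CAT>
<CAT xml:lang="en">Literature and Art</CAT>
<TITLE><![CDATA[sample title]]></TITLE>
<TEXT>
<image>/1385/851011/news/008505.jpg</image>
<![CDATA[sample body]]>
</TEXT>
</DOC>"""

def parse_doc(xml_text):
    """Extract the main fields of one <DOC> record into a dict."""
    doc = ET.fromstring(xml_text)
    text_el = doc.find("TEXT")
    # Body text sits before and after the <image> child, so collect the
    # element's own text plus the tail of every child element.
    parts = [text_el.text or ""] + [child.tail or "" for child in text_el]
    return {
        "docid": doc.findtext("DOCID"),
        "date": doc.findtext("DATE"),
        "categories": {c.get(XML_LANG): c.text for c in doc.findall("CAT")},
        "title": doc.findtext("TITLE"),
        "image": text_el.findtext("image"),
        "body": "".join(parts).strip(),
    }
```

The bilingual `<CAT>` elements end up keyed by language code, so `parse_doc(SAMPLE)["categories"]["en"]` gives the English category name.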

Page 28: Mono & Cross Language Experiments on Persian Text

Questions?

Thanks For Your Attention

Database Research Group

http://ece.ut.ac.ir/dbrg