Myanmar Search Engine Nyi Lynn Seck EC (MCPA)


Page 1: Myanmar Search Engine

Myanmar Search Engine

Nyi Lynn Seck EC (MCPA)

Page 2: Myanmar Search Engine

Search Engine Evolution

● 1st generation (uses only "on-page" data)
– Text data, word frequency, language

● 2nd generation (uses off-page, web-specific data)
– Link (or connectivity) analysis
– Click-through data (what people click)
– Anchor text (how people refer to this page)

● 3rd generation (answers "the need behind the query")
– Semantic analysis: what is this about?
– Focus on the user's need rather than on the query
– Context determination

Page 3: Myanmar Search Engine

Text Mining Research Area

● Information Retrieval (IR)
– Search engines
– Classification
– Recommendation

● Information Extraction (IE)
– Screen scraping
– Product information (e.g. price) scraping

● Information Understanding
– Natural Language Processing (NLP)
– Question answering
– Concept extraction from newsgroups
– Visualization
– Summarization

● Cross-Lingual Text Mining

● Trend Detection
– Outlier detection

Page 4: Myanmar Search Engine

Classical Indexing

Indexing

– Keyword Indexing

– Subject Indexing (Classification)

– Collocate subjects
– Define and assign a code (call number) to each document

Page 5: Myanmar Search Engine

Tokenization

Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens.

Assign a unique ID to each word and keep it in a lexicon.

Remove stop (noise) words before or after tokenization.
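The three steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the presenter's implementation: the stop-word set, the function names, and the regular expression are all assumptions, and the regex approach only suits space-delimited languages such as English. Myanmar text needs the syllable and word segmentation covered on later slides.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "about", "above", "across", "after", "all"}

def tokenize(text):
    """Split text into lowercase word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_lexicon(tokens, lexicon=None):
    """Assign each previously unseen word the next integer ID."""
    lexicon = {} if lexicon is None else lexicon
    for tok in tokens:
        if tok not in lexicon:
            lexicon[tok] = len(lexicon)
    return lexicon

def encode(text, lexicon):
    """Tokenize, drop stop words, and return the word IDs of what remains."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    build_lexicon(tokens, lexicon)
    return [lexicon[t] for t in tokens]

lexicon = {}
ids = encode("All about the search engine, a search index", lexicon)
print(ids)      # word IDs in order of appearance
print(lexicon)  # word -> ID mapping
```

Here stop words are removed after tokenization; removing them before would require matching them in the raw text instead.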

Page 6: Myanmar Search Engine

Stemming, Lemmatization

Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, generally a written word form.

Lemmatization is the process of reducing an inflected spelling to its lexical root or lemma form. The lemma form is the base form or headword form you would find in a dictionary. The combination of the lemma form with its word class (noun, verb, etc.) is called the lexeme.
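The difference between the two can be illustrated with a small English sketch. The suffix list and the lemma dictionary below are hypothetical toy data, and the stemmer is a naive suffix stripper rather than any published algorithm.

```python
# Hypothetical suffix list and lemma dictionary, for illustration only.
SUFFIXES = ("ing", "ed", "es", "s")
LEMMAS = {"went": "go", "better": "good", "playing": "play", "mice": "mouse"}

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lemmatize(word):
    """Look the word up in a dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)

for w in ("playing", "played", "went", "mice"):
    print(w, stem(w), lemmatize(w))
```

Rule-based stemming handles regular forms such as "played" but leaves irregular forms such as "went" unchanged, whereas the dictionary lookup maps "went" to "go" but only knows the words it lists.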

Page 7: Myanmar Search Engine

ကစ

ကစကြကင� ကစစရ အကစကစပြ

ကစ

ကစေ�နသည� ကစလ�မမ���ည�ကစခ�သည�

Page 8: Myanmar Search Engine

Inverted Index

Page 9: Myanmar Search Engine

Inverted Index
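The slide images are not reproduced here. As a rough sketch of the idea, an inverted index maps each term to the documents that contain it, so a query is answered by intersecting posting lists instead of scanning every document. The sample documents and function names are assumptions made for this example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search_all(index, query):
    """Return IDs of documents containing every query term (AND query)."""
    postings = [set(index.get(t, ())) for t in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

# Hypothetical sample documents.
docs = {
    1: "myanmar search engine",
    2: "search engine evolution",
    3: "myanmar syllable structure",
}
index = build_inverted_index(docs)
print(index["search"])
print(search_all(index, "myanmar search"))
```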

Page 10: Myanmar Search Engine

Formula & Algorithm?

The weight of a term that occurs in all documents
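The slide does not name a formula. Assuming it refers to the common TF-IDF weighting, where the weight is the term frequency multiplied by log(N / df), a term that occurs in all N documents has df = N and therefore a weight of zero. The sketch below uses hypothetical sample documents to show this.

```python
import math

def tf_idf(term, doc_tokens, all_docs):
    """TF-IDF weight: raw term frequency times log(N / document frequency)."""
    tf = doc_tokens.count(term)
    df = sum(1 for d in all_docs if term in d)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(all_docs) / df)

# Hypothetical tokenized documents; "search" occurs in every one of them.
docs = [
    ["myanmar", "search", "engine"],
    ["search", "index"],
    ["search", "ranking", "engine"],
]
print(tf_idf("search", docs[0], docs))   # term in all documents
print(tf_idf("myanmar", docs[0], docs))  # term in one document only
```

Under this assumption, such a term carries no discriminating power for ranking, which is also the usual motivation for treating very frequent words as stop words.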

Page 11: Myanmar Search Engine

Stop Words (English)

a, able, about, above, abroad, according, accordingly, across, actually, adj, after, afterwards, again, against, ago, ahead, ain't, all, allow, allows, almost, alone

What stop words will be used in a Myanmar search engine?

Page 12: Myanmar Search Engine

NGram သ သသသသသ သ သသသသသသသသသသသသ သသသသ သသသသသေ�မမတယဉမ��သတ�ထ ေေ�မမင�န�င�န �လ��အ ညည�

ေ�မမတေ�တယဉ �ယဉမ��မမ�သသတ�တ�ထထမမင�ေ�မမင�န�င�န�င�န �ရနလန%�ည&လ��အ�အ ညည�

|ေ�မမ||ေ�တ||ယဉ �||မမ�||သ||တ�||ထ||ေ�မမင�||န�င�||ရန �||လ��||အ�||သည�|

ေ�မမတယဉ �ေ�တယဉမ��ယဉမ��သမမ�သတ�သတ�ထတ�ထမမင�ထမမင�န�င�ေ�မမင�န�င�န �န�င�နလန%�ည&ရနလန%�ည&အ�လ��အ ညည�

ေ�မမတယဉမ��ေ�တယဉမ��သယဉမ��သတ�မမ�သတ�ထသတ�ထမမင�တ�ထမမင�န�င�ထမမင�န�င�န �ေ�မမင�န�င�နလန%�ည&န�င�နလန%�ည&အ�ရနလန%�ည&အ ညည�

2 Gram |ေ�မမတ||ယဉမ��||သတ�||ေ�မမင�န�င�||ရနလန%�ည&||လ��အ�||အ ညည�|

3 Gram |ေ�မမတယဉ �||သတ�ထ||ေ�မမင�န�င�န �||လ��အ ညည�|

4 Gram |ေ�မမတယဉမ��|

ေ�မမတယဉမ��သေ�တယဉမ��သတ�ယဉမ��သတ�ထမမ�သတ�ထမမင�သတ�ထမမင�န�င�တ�ထမမင�န�င�န �ထမမင�န�င�နလန%�ည&ေ�မမင�န�င�နလန%�ည&အ�န�င�နလန%�ည&အ ညည�
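As a generic illustration of n-gram generation, the sketch below produces overlapping n-grams from a sequence of units. It assumes the text has already been split into units (for Myanmar, syllables); the placeholder labels "s1" to "s4" stand in for real syllables and are not taken from the slide.

```python
def ngrams(units, n):
    """Return all overlapping n-grams of a sequence as tuples."""
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]

# Placeholder units standing in for Myanmar syllables.
syllables = ["s1", "s2", "s3", "s4"]
for n in (1, 2, 3, 4):
    print(n, ngrams(syllables, n))
```

Indexing n-grams of syllables is one way to search text that has no spaces between words, since it does not depend on a word dictionary.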

Page 13: Myanmar Search Engine

Myanmar Word Segmentation Using Syllable-Level Longest Matching: Hla Hla Htay
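The general idea of longest matching over syllables can be sketched as below. This is a simplified greedy version written for illustration, not the cited author's algorithm: the lexicon, the `max_len` limit, and the Latin placeholder syllables are all assumptions.

```python
def segment(syllables, lexicon, max_len=4):
    """Greedy left-to-right longest matching over a list of syllables."""
    words, i = [], 0
    while i < len(syllables):
        # Try the longest candidate first, then shrink one syllable at a time.
        for size in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + size])
            # A single syllable is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical lexicon and Latin placeholder syllables.
lexicon = {"myanmar", "search", "searchengine"}
print(segment(["my", "an", "mar", "search", "en", "gine"], lexicon))
```

Because the longest candidate is tried first, "searchengine" is preferred over the shorter entry "search" when both are in the lexicon.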

Page 14: Myanmar Search Engine

Simple Myanmar Syllable Structure

C = Consonant, M = Medial, V = Vowel, K = Killer, D = Diacritic

Syllable patterns: C, C+M, C+M+V, C+M+V+K, C+M+V+K+D, C+M+V+D, C+M+K, C+M+K+D, C+M+D, C+V, C+V+K, C+V+K+D, C+V+D, C+K, C+K+D
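One way to use the listed patterns is as a lookup table for validating a candidate syllable once each character has been classified as C (consonant), M (medial), V (vowel), K (killer), or D (diacritic). The sketch below works on class labels only; mapping actual Myanmar characters to these classes is left out, and the function name is an assumption.

```python
# The 15 patterns listed on the slide, written without '+' signs.
PATTERNS = {
    "C", "CM", "CMV", "CMVK", "CMVKD", "CMVD", "CMK", "CMKD", "CMD",
    "CV", "CVK", "CVKD", "CVD", "CK", "CKD",
}

def is_listed_syllable(classes):
    """Return True if a sequence of class labels matches a listed pattern."""
    return "".join(classes) in PATTERNS

print(is_listed_syllable(["C", "M", "V"]))  # listed pattern
print(is_listed_syllable(["M", "C"]))       # does not start with a consonant
```

Every listed pattern starts with a consonant and keeps the order medial, vowel, killer, diacritic, so a syllable breaker can start a new syllable whenever the next character would violate that order.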

Page 15: Myanmar Search Engine

Language Specific Search Engine: Basic Architecture

Diagram labels: WWW, Crawler, Language Identification, Language specific crawler, Page repository, Parser, Indexer, Corpus/Lexicon, Query engine, Ranking engine, query, results

Pann Yu Mon, Management and Information System Engineering Department, Nagaoka University of Technology, Japan
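The language identification step in this architecture decides whether a crawled page belongs in the language-specific collection. A very rough sketch, assuming pages are Unicode-encoded, is to measure the share of characters in the Unicode Myanmar block (U+1000 to U+109F). The 0.5 threshold and the function names are hypothetical, this is not the method of the cited work, and pages in legacy non-Unicode font encodings would need separate handling.

```python
def myanmar_ratio(text):
    """Fraction of non-space characters in the Unicode Myanmar block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u1000" <= c <= "\u109f")
    return hits / len(chars)

def is_myanmar_page(text, threshold=0.5):
    """Hypothetical rule: keep the page if enough characters are Myanmar."""
    return myanmar_ratio(text) >= threshold

print(is_myanmar_page("\u1019\u103c\u1014\u103a\u1019\u102c"))  # Myanmar-script text
print(is_myanmar_page("search engine"))                         # English text
```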

Page 16: Myanmar Search Engine

Crawling Coverage

Crawling Parameters

Seed URLs: 35
Level of depth: 6
Crawling time: 2 weeks
CPU: 2.40 GHz
Memory: 1 GB
Connection: 100 Mbit per second

Domains        Number of Pages Collected
.mm                  3,555 [  1.1%]
.com               276,554 [ 83.2%]
Other gTLDs         52,245 [ 15.7%]
Total              332,354 [100.0%]

10th July 2008