
Pat-Tree-Based Adaptive keyphrase Extraction for Intelligent Chinese Information Retrieval


Pat-Tree-Based Adaptive Keyphrase Extraction for Intelligent Chinese Information Retrieval. Source: Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C. Students: 陳道輝, 周鉦琪, 葉飛. Advisor: Prof. 黃三益.

Page 1: Pat-Tree-Based Adaptive keyphrase Extraction for Intelligent Chinese Information Retrieval

Pat-Tree-Based Adaptive Keyphrase Extraction for Intelligent Chinese Information Retrieval

Source: Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C.

Students: 陳道輝, 周鉦琪, 葉飛
Advisor: Prof. 黃三益

Page 2: Pat-Tree-Based Adaptive keyphrase Extraction for Intelligent Chinese Information Retrieval

Abstract

• PAT-tree-based adaptive approach

• IR applications: automatic term suggestion, domain-specific lexicon construction, book indexing, and document classification

Page 3: Pat-Tree-Based Adaptive keyphrase Extraction for Intelligent Chinese Information Retrieval

Introduction

• Keyphrase (keyword) extraction in Chinese is a critical problem because of difficulties in word segmentation and unknown-word identification (e.g. 哈電族).

Page 4: Pat-Tree-Based Adaptive keyphrase Extraction for Intelligent Chinese Information Retrieval

Definition of the Problems

• Lexical pattern (LP): a string that consists of more than one successive character and has a certain number of occurrences in a text collection with a specific domain.

• For example: 關鍵詞抽取 ("keyword extraction")
• LPs: 關鍵、鍵詞、詞抽、抽取、關鍵詞、鍵詞抽、詞抽取、關鍵詞抽、鍵詞抽取、關鍵詞抽取
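The LP definition can be illustrated with a minimal Python sketch that enumerates every substring of at least two characters in a segment and keeps those that recur. The function names and the `min_count` threshold are hypothetical. The slide only says a pattern must have "certain occurrences", so the exact cutoff is an assumption.

```python
from collections import Counter


def lexical_patterns(segment, min_len=2):
    """Return every contiguous substring with at least `min_len` characters, shortest first."""
    n = len(segment)
    return [segment[i:i + k] for k in range(min_len, n + 1) for i in range(n - k + 1)]


def frequent_patterns(segments, min_count=2, min_len=2):
    """Keep candidate patterns that occur at least `min_count` times (assumed cutoff)."""
    counts = Counter(p for s in segments for p in lexical_patterns(s, min_len))
    return {p: c for p, c in counts.items() if c >= min_count}


# Lists the ten candidate patterns of the slide's example string.
print(lexical_patterns("關鍵詞抽取"))
```

A five-character segment yields 4 + 3 + 2 + 1 = 10 candidates, which matches the LP list on the slide.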

Page 5: Pat-Tree-Based Adaptive keyphrase Extraction for Intelligent Chinese Information Retrieval

Definition of the Problems (cont)

• Complete lexical pattern (CLP): an LP with a complete meaning and semantic lexical boundaries.

• For example: 關鍵詞抽取
• CLPs: 關鍵、抽取、關鍵詞、關鍵詞抽取

Page 6: Pat-Tree-Based Adaptive keyphrase Extraction for Intelligent Chinese Information Retrieval

Definition of the Problems (cont)

• Significant lexical pattern (SLP): a CLP that is either "specific" or "significant" in the database.

• For example: 關鍵詞抽取
• SLPs: 關鍵詞、關鍵詞抽取

Page 7: Pat-Tree-Based Adaptive keyphrase Extraction for Intelligent Chinese Information Retrieval

Definition of the Problems (cont)

• Definition 1: SLP Extraction Problem

• Definition 2: CLP Estimation Problem

• To solve Problem 1, we must first solve Problem 2, since every SLP is a CLP.

Page 8: Pat-Tree-Based Adaptive keyphrase Extraction for Intelligent Chinese Information Retrieval

Definition of the Problems (cont)

• Proposed approach: 3 modules

– Text analysis and PAT-tree indexing module
– CLP extraction module
– SLP extraction module

Page 9: Pat-Tree-Based Adaptive keyphrase Extraction for Intelligent Chinese Information Retrieval

Definition of the Problems (cont)

Page 10: Pat-Tree-Based Adaptive keyphrase Extraction for Intelligent Chinese Information Retrieval

Estimation of CLP

• Most CLPs have strong associations between their composed and overlapped substrings.

• Association norm estimation function:

$$AE_x = MI_x = \frac{\Pr(x)}{\Pr(y) + \Pr(z) - \Pr(x)} = \frac{f_x}{f_y + f_z - f_x}$$

Here f is a frequency in the text collection, and y and z are the overlapping substrings that compose x.

• If AE is large, then in many cases patterns y and z occur together in the text collection (e.g. x = 關鍵詞抽取, with substrings 鍵詞抽取 and 關鍵詞抽).
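The association norm can be sketched in Python. This assumes it takes the form AE = f_x / (f_y + f_z - f_x), with y and z the two overlapping substrings of x that are one character shorter. The helper names and the toy corpus are hypothetical, and frequencies are counted with plain `str.count` rather than a PAT tree.

```python
def frequency(pattern, segments):
    """Count occurrences of `pattern` in a list of text segments (toy stand-in for a PAT-tree lookup)."""
    return sum(seg.count(pattern) for seg in segments)


def association_norm(f_x, f_y, f_z):
    """AE = f_x / (f_y + f_z - f_x); assumed form of the association norm."""
    denominator = f_y + f_z - f_x
    if denominator <= 0:
        raise ValueError("expected f_y + f_z > f_x")
    return f_x / denominator


# Hypothetical toy corpus: x = 關鍵詞抽取, y = 關鍵詞抽, z = 鍵詞抽取.
corpus = ["關鍵詞抽取方法", "關鍵詞抽取", "關鍵詞抽象"]
x = "關鍵詞抽取"
ae = association_norm(
    frequency(x, corpus),
    frequency(x[:-1], corpus),
    frequency(x[1:], corpus),
)
print(ae)
```

When y and z almost always appear as part of x, the three frequencies are close and AE approaches 1. When y or z often appears without x, AE falls toward 0.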

Page 11: Pat-Tree-Based Adaptive keyphrase Extraction for Intelligent Chinese Information Retrieval

Estimation of CLP (cont)

• AE alone is not enough to check whether x has complete lexical boundaries (e.g. 關鍵詞).

• To overcome this, we use two additional metrics: LCD (left context dependency) and RCD (right context dependency), e.g. 李登輝.

• With these metrics we can say:
– x is a CLP iff it has neither LCD nor RCD, and AE > t3 (a threshold).

Page 12: Pat-Tree-Based Adaptive keyphrase Extraction for Intelligent Chinese Information Retrieval

Estimation of CLP (cont)

• x has LCD if |L| < t1 or max_z f(zx)/f(x) > t2, where t1 and t2 are threshold values, z ∈ L, and |L| is the number of unique left-adjacent characters of x.

• x has RCD if |R| < t1 or max_y f(xy)/f(x) > t2, where t1 and t2 are threshold values, y ∈ R, and |R| is the number of unique right-adjacent characters of x.
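The two context-dependency tests and the resulting CLP decision can be sketched as follows. This is a minimal illustration, not the authors' implementation. Neighbors are collected by scanning plain strings instead of querying PAT trees, the function names are hypothetical, and the threshold values in the usage lines are arbitrary assumptions.

```python
from collections import Counter


def neighbor_counts(pattern, segments):
    """Return (f_x, left-neighbor counts, right-neighbor counts) for `pattern`."""
    left, right = Counter(), Counter()
    total = 0
    for seg in segments:
        start = seg.find(pattern)
        while start != -1:
            total += 1
            if start > 0:
                left[seg[start - 1]] += 1
            end = start + len(pattern)
            if end < len(seg):
                right[seg[end]] += 1
            start = seg.find(pattern, start + 1)
    return total, left, right


def has_context_dependency(neighbors, f_x, t1, t2):
    """True if there are fewer than t1 distinct neighbors, or one neighbor's share exceeds t2."""
    if len(neighbors) < t1:
        return True
    return max(neighbors.values(), default=0) / f_x > t2


def is_clp(pattern, segments, ae, t1, t2, t3):
    """x is a CLP iff it has neither LCD nor RCD and AE > t3."""
    f_x, left, right = neighbor_counts(pattern, segments)
    if f_x == 0:
        return False
    lcd = has_context_dependency(left, f_x, t1, t2)
    rcd = has_context_dependency(right, f_x, t1, t2)
    return not lcd and not rcd and ae > t3


# Hypothetical corpus and thresholds (t1=2, t2=0.5, t3=0.5); the AE value is supplied by the caller.
segments = ["推動關鍵詞抽取", "研究關鍵詞技術", "使用關鍵詞查詢"]
print(is_clp("關鍵詞", segments, ae=0.9, t1=2, t2=0.5, t3=0.5))
print(is_clp("鍵詞", segments, ae=0.9, t1=2, t2=0.5, t3=0.5))
```

In this toy corpus 鍵詞 is always preceded by 關, so it has LCD and is rejected even with a high AE, whereas 關鍵詞 has varied neighbors on both sides.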

Page 13: Pat-Tree-Based Adaptive keyphrase Extraction for Intelligent Chinese Information Retrieval

Text Analysis and PAT-Tree Indexing

• The PAT tree is the primary implementation structure, used for both text retrieval and keyphrase extraction.

• Delimiters (, “ ” .) determine segment boundaries; semi-infinite strings are then built from each segment.

• For example: 個人電腦, 人腦
– 個人電腦, 人電腦, 電腦, 腦, 人腦, 腦

• Node information: (comparison bit, number of external nodes, frequency)

• The PAT tree makes prefix search easy.
• The inverse PAT tree (IPAT) makes postfix search easy.
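The segmentation step can be sketched in a few lines of Python: split the text at delimiters, then emit every suffix of each segment as a semi-infinite string. The delimiter set below is an assumption that extends the slide's examples with their full-width forms, and the function name is hypothetical.

```python
import re

# Assumed delimiter set: comma, period, and quotation marks in half- and full-width forms.
DELIMITERS = re.compile(r'[,，.。“”"]+')


def semi_infinite_strings(text):
    """Split `text` at delimiters and return every suffix of every segment, in text order."""
    strings = []
    for segment in DELIMITERS.split(text):
        strings.extend(segment[i:] for i in range(len(segment)))
    return strings


print(semi_infinite_strings("個人電腦,人腦"))
```

Because suffixes never cross a delimiter, no candidate pattern spans two segments.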

Page 14: Pat-Tree-Based Adaptive keyphrase Extraction for Intelligent Chinese Information Retrieval

Text Analysis and PAT-Tree Indexing (cont)

• Convert the semi-infinite strings to bit sequences.
• Build the PAT tree from these bit sequences and the bit positions at which they differ.
• We also create an inverse PAT tree over the inverted data streams of the database to check the occurrences of LSs and RSs.
• For example, inverted strings: 詞鍵關、詞鍵、詞鍵關展發、詞鍵關行進
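The bit conversion and the search for the first differing bit can be sketched as below. The Big5 encoding is an assumption; it agrees with the bit patterns listed for 個, 人, and 電 in the semi-infinite string table on page 20 of the deck. The function names, the 32-bit display width, and the 0-based bit numbering are all choices made for this sketch.

```python
def to_bits(text, width=32, encoding="big5"):
    """Encode `text`, render it as a bit string, and zero-pad or truncate to `width` bits."""
    bits = "".join(f"{byte:08b}" for byte in text.encode(encoding))
    return bits[:width].ljust(width, "0")


def first_diff_bit(a, b):
    """Return the 0-based index of the first bit where `a` and `b` differ, or None if they agree."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None


print(to_bits("人", width=16))
print(first_diff_bit(to_bits("人電腦"), to_bits("電腦")))
```

In a Patricia-style tree, the position returned by `first_diff_bit` is the kind of value stored as a node's comparison bit: it is the only bit that needs to be inspected to separate the two strings.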

Page 15: Pat-Tree-Based Adaptive keyphrase Extraction for Intelligent Chinese Information Retrieval

Text Analysis and PAT-Tree Indexing (cont)

• Why use a PAT tree (Patricia tree)?

– Key-value comparisons per search are few (logarithmic).
– Computation time and space are reduced.
– Search is efficient.
– The PAT tree can be used to check RCD.
– The inverse PAT tree can be used to check LCD.
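The sketch below shows how prefix search over forward and inverted strings yields right and left neighbors. It does not implement a PAT tree. As a plainly named substitute, it uses a sorted list of semi-infinite strings with binary search, which supports the same prefix queries. All function names and the toy segments are hypothetical.

```python
from bisect import bisect_left


def build_index(segments, inverse=False):
    """Sorted list of all semi-infinite strings (of the reversed segments if `inverse`)."""
    strings = []
    for seg in segments:
        s = seg[::-1] if inverse else seg
        strings.extend(s[i:] for i in range(len(s)))
    return sorted(strings)


def prefix_range(index, prefix):
    """Return the [lo, hi) slice bounds of entries that start with `prefix`."""
    lo = bisect_left(index, prefix)
    hi = bisect_left(index, prefix + "\U0010ffff")  # upper bound: prefix + largest code point
    return lo, hi


def prefix_count(index, prefix):
    lo, hi = prefix_range(index, prefix)
    return hi - lo


def next_chars(index, prefix):
    """Distinct characters that directly follow `prefix` in the indexed strings."""
    lo, hi = prefix_range(index, prefix)
    return {s[len(prefix)] for s in index[lo:hi] if len(s) > len(prefix)}


segments = ["進行關鍵詞", "發展關鍵詞", "關鍵詞抽取"]
forward = build_index(segments)
inverse = build_index(segments, inverse=True)

print(prefix_count(forward, "關鍵詞"))   # frequency of 關鍵詞
print(next_chars(forward, "關鍵詞"))     # right-adjacent characters, used for RCD
print(next_chars(inverse, "詞鍵關"))     # left-adjacent characters, used for LCD
```

Querying the inverted index with the reversed pattern turns a postfix search into a prefix search, which is why the forward structure serves RCD and the inverse one serves LCD.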

Page 16: Pat-Tree-Based Adaptive keyphrase Extraction for Intelligent Chinese Information Retrieval

Extraction of SLP

• A CLP is not always an SLP:
– It cannot always demonstrate its significance in the text collection.
– Many CLPs are common in daily use.

• Every CLP is checked against a set of lexical rules and a general-domain corpus.

• Rules:
– Numbers, adverbs, time-related terms
– General-domain PAT tree vs. specific-domain PAT tree
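One possible reading of these rules is sketched below. The slide does not say how the general-domain and specific-domain trees are compared, so the relative-frequency ratio, the `min_ratio` value, the add-one smoothing, the numeric pattern, and the stoplist are all assumptions introduced for illustration.

```python
import re

# Assumed lexical rule: patterns made only of digits or Chinese numerals are rejected.
NUMERIC = re.compile(r"^[0-9０-９零一二三四五六七八九十百千萬]+$")


def is_slp(pattern, f_specific, n_specific, f_general, n_general,
           stoplist=frozenset(), min_ratio=2.0):
    """Accept a CLP if it passes the lexical rules and is relatively more frequent
    in the specific-domain corpus than in the general-domain corpus (assumed test)."""
    if pattern in stoplist or NUMERIC.match(pattern):
        return False
    rel_specific = f_specific / n_specific
    rel_general = (f_general + 1) / (n_general + 1)  # add-one smoothing avoids division by zero
    return rel_specific / rel_general >= min_ratio


# Hypothetical counts: (frequency in corpus, corpus size in patterns).
print(is_slp("關鍵詞", 50, 10_000, 5, 100_000))
print(is_slp("我們", 50, 10_000, 800, 100_000))
```

The stoplist stands in for the adverb and time-related rules. A real system would need a lexicon or part-of-speech information for those categories.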

Page 17: Pat-Tree-Based Adaptive keyphrase Extraction for Intelligent Chinese Information Retrieval

Evaluation

• Extraction of SLP
– Three people were asked to select CLPs and keyphrases from 50 "seed sentences".
– These test data were used to measure the accuracy of SLP extraction.

Phrase length   Keyphrases extracted   Correct keyphrases   Precision
2               3568                   3311                 92.8%
3               1130                   661                  58.5%
4               999                    687                  68.77%
5               207                    150                  72.46%
>=6             178                    151                  84.83%
Total           6082                   4960                 81.55%
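The precision column can be rechecked from the two count columns, taking precision as correct keyphrases divided by extracted keyphrases. The short script below uses only the figures in the table; the variable and function names are hypothetical.

```python
# (keyphrases extracted, correct keyphrases) per phrase length, copied from the table.
rows = {
    "2": (3568, 3311),
    "3": (1130, 661),
    "4": (999, 687),
    "5": (207, 150),
    ">=6": (178, 151),
}


def precision(extracted, correct):
    """Precision in percent."""
    return 100.0 * correct / extracted


total_extracted = sum(e for e, _ in rows.values())
total_correct = sum(c for _, c in rows.values())

for length, (extracted, correct) in rows.items():
    print(length, round(precision(extracted, correct), 2))
print("Total", total_extracted, total_correct,
      round(precision(total_extracted, total_correct), 2))
```

The per-row counts sum to the reported totals. Three-character phrases are the weakest class and two-character phrases the strongest.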

Page 18: Pat-Tree-Based Adaptive keyphrase Extraction for Intelligent Chinese Information Retrieval

Evaluation (cont)

• Speed and Space Requirements

Corpus       Corpus size (KB)   PAT tree size (KB)   Time to construct PAT tree (sec)   Time to extract keyphrases (sec)
C1-O(10k)    12                 77                   0.19                               0.01
C2-O(100k)   127                670                  2.82                               0.02
C3-O(1M)     1033               4687                 25.52                              1.62
C4-O(10M)    10048              44312                306.32                             28.51
C5-O(100M)   107333             439087               2381                               283
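A small script makes the space overhead and indexing throughput implied by the table easier to read. It derives two ratios from the reported numbers and adds no new measurements. The names are hypothetical.

```python
# (corpus, corpus size KB, PAT tree size KB, build time s, extraction time s), copied from the table.
corpora = [
    ("C1", 12, 77, 0.19, 0.01),
    ("C2", 127, 670, 2.82, 0.02),
    ("C3", 1033, 4687, 25.52, 1.62),
    ("C4", 10048, 44312, 306.32, 28.51),
    ("C5", 107333, 439087, 2381, 283),
]

# Tree size relative to corpus size, and KB of corpus indexed per second.
space_ratio = {name: tree_kb / corpus_kb for name, corpus_kb, tree_kb, _, _ in corpora}
build_rate = {name: corpus_kb / build_s for name, corpus_kb, _, build_s, _ in corpora}

for name in space_ratio:
    print(name, round(space_ratio[name], 2), round(build_rate[name], 1))
```

On these figures the tree is between about four and six and a half times the size of the corpus, and the ratio shrinks as the corpus grows.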

Page 19: Pat-Tree-Based Adaptive keyphrase Extraction for Intelligent Chinese Information Retrieval

Conclusion

• This method reduces the difficulty of keyphrase extraction in Chinese and achieves better performance.

Page 20: Pat-Tree-Based Adaptive keyphrase Extraction for Intelligent Chinese Information Retrieval

Semi-infinite strings

Node numbers:

0    2    4    6    8    9    11
個   人   電   腦   ,    人   腦

String / node        Bit 1      Bit 9      Bit 17     Bit 25
個人電腦 / node 0    10101101   11010011   10100100   …
人電腦 / node 2      10100100   01001000   10111001   …
電腦 / node 4        10111001   01110001   00000000   …
腦 / node 6          10111000   00000000   00000000   …
人腦 / node 9        10100100   01001000   00000000   …
腦 / node 6          10111000   00000000   00000000   …

Page 21: Pat-Tree-Based Adaptive keyphrase Extraction for Intelligent Chinese Information Retrieval

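[Figure: PAT tree for the semi-infinite strings. Node labels have the form (comparison bit, number of external nodes, string frequency). Labels shown: (0,6,1), (4,6,1), (5,3,1), (24,2,1), (8,3,2). Node numbers shown: 0, 2, 4, 6, 9.]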