Hierarchical Word Detection, Especially New Words, in the Chinese Language
Chengyuan Zhuang, Mingming Du, Haili Wang
Tired of downloading .arff files and running them in Weka?
OK, let's do something more
challenging and fun!
Consider this sentence:
Kobe made a good shot.
科比投了一个好球。
Is it possible to detect Chinese words without any dictionary or any annotated corpus?
Previous Work
• Dictionary-based approaches
Scan forward and backward through a sentence to detect words listed in a dictionary.
The most popular method is Maximum Matching (MM), usually augmented with heuristics to deal with ambiguities: match forward and backward through the sentence, always preferring the longest dictionary entry.
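Forward Maximum Matching can be sketched in a few lines. This is a minimal illustration, not the implementation from any cited paper; the tiny dictionary is a hypothetical example.

```python
def forward_mm(text, dictionary, max_len=4):
    """Greedily match the longest dictionary word at each position.

    Unmatched single characters are emitted as one-character words.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

dictionary = {"科比", "投", "好球"}
print(forward_mm("科比投了一个好球", dictionary))
# ['科比', '投', '了', '一', '个', '好球']
```

Backward MM works the same way from the end of the sentence; disagreement between the two passes signals an ambiguity that the heuristics must resolve.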
Previous Work
• Limitation: only words that are stored in the dictionary can be identified.
What about out-of-vocabulary (OOV) words? (e.g., domain-specific terms, person names, and newly coined words appearing every day)
Previous Work
• Supervised machine-learning approaches
Formulated as a character classification task.
Each character in the text string is predicted to be either:
1. the beginning of a word (labeled as class "B")
2. an intra-word character (labeled as class "I")
or:
1. the beginning of a word (labeled as class "B")
2. the middle of a word (labeled as class "M")
3. the end of a word (labeled as class "E")
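The labeling scheme can be illustrated by converting a gold segmentation into per-character tags. This sketch uses the B/M/E scheme; the extra "S" tag for single-character words is a common extension not mentioned on the slide.

```python
def to_bmes(words):
    """Map a segmented word list to per-character B/M/E/S labels."""
    labels = []
    for w in words:
        if len(w) == 1:
            labels.append("S")                  # single-character word
        else:
            labels.append("B")                  # beginning of a word
            labels.extend("M" * (len(w) - 2))   # middle characters, if any
            labels.append("E")                  # end of a word
    return labels

print(to_bmes(["科比", "投", "好球"]))
# ['B', 'E', 'S', 'B', 'E']
```

A supervised classifier is trained to predict these labels from character context features, then segmentation is recovered from the predicted label sequence.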
Previous Work
• Many algorithms can make this binary or ternary boundary decision:
Naive Bayes (NB)
Decision Tree (DT)
Support Vector Machine (SVM)
Conditional Random Field (CRF)
Previous Work
• Kawtrakul et al. (1997) proposed a language modeling technique based on a tri-gram Markov Model to select the optimal segmentation path.
• Meknavin et al. (1997) constructed a machine learning model by using the Part-Of-Speech (POS) features.
Previous Work
• Limitation: relies entirely on an annotated corpus; hard to deal with unseen words
• Contradiction: POS tags, Markov models, and word boundaries are all learned from previously seen words, so how can they be used to handle new words?
Want to understand text to do that?
That's an even harder problem!
Previous Work
• Unsupervised machine-learning approaches
Sproat and Shih (1993) proposed a method using neither a lexicon nor a segmented corpus.
For input text, simply group character pairs with a high mutual information value into words.
Fully automatic, although with many limitations (e.g., it over-generates unreasonable words and can only handle bi-character words)
Our method
• The formal definition is: two events A and B are independent if and only if
their joint probability equals the product of their probabilities: P(A ∧ B) = P(A) * P(B).
Two nearby characters A and B can be seen as two events:
totally independent (e.g., they do not form a word) => P(AB) = P(A) * P(B)
closely related (e.g., A and B form a word) => P(AB) >> P(A) * P(B)
Our method
• We define P(likelihood) = P(AB) / [ P(A) * P(B) ]
• If characters A and B are totally independent, then P(likelihood) should be 1.
• The higher P(likelihood) is above 1, the less likely they are independent: they do not occur together by chance; they have a close relationship.
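The ratio can be estimated directly from raw corpus counts. This is a minimal sketch under the simple assumption that P(A) is a character's relative frequency and P(AB) is the relative frequency of the adjacent pair; the toy corpus is hypothetical.

```python
from collections import Counter

def likelihood(corpus, a, b):
    """Estimate P(likelihood) = P(AB) / (P(A) * P(B)) from counts."""
    chars = Counter(corpus)
    pairs = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
    p_a = chars[a] / len(corpus)
    p_b = chars[b] / len(corpus)
    p_ab = pairs[a + b] / (len(corpus) - 1)
    return p_ab / (p_a * p_b)

corpus = "科比投球科比投球科比"
print(likelihood(corpus, "科", "比"))   # well above 1: 科 and 比 co-occur
```

In this toy corpus 科 is always followed by 比, so the ratio is far above 1, flagging 科比 as a likely word.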
Our method
• There are 2,000 ~ 4,000 core Chinese characters
• Over 9,000 characters in total
• Typically, a single character's probability is 1/1000 ~ 1/9000
• A frequent word's probability could be 1/100 ~ 1/500
Our method
• The formulas are:
P(A | B) = P(AB) / P(B)
P(B | A) = P(AB) / P(A)
• Define P(given) = max [ P(A | B), P(B | A) ]
Our method
• For tri-character words: view a tri-character word ABC as two parts, AB + C or A + BC
P1(likelihood) = P(ABC) / [ P(AB) * P(C) ]
P2(likelihood) = P(ABC) / [ P(A) * P(BC) ]
Define P(likelihood) = min[ P1(likelihood), P2(likelihood) ]
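The tri-character score combines both splits and keeps the more conservative one. A minimal sketch, assuming all probabilities come from corpus counts as in the bi-character case:

```python
def tri_likelihood(p_abc, p_a, p_b, p_c, p_ab, p_bc):
    """min of the two split-based likelihood ratios for a candidate ABC."""
    p1 = p_abc / (p_ab * p_c)   # P1: treat ABC as AB + C
    p2 = p_abc / (p_a * p_bc)   # P2: treat ABC as A + BC
    return min(p1, p2)
```

Taking the minimum ensures ABC is accepted only if it beats independence under both decompositions, so a strong bi-character word AB with a chance neighbor C does not leak through.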
Future Work
• Incorporate syntactic knowledge and a high-quality dictionary
• An annotated corpus could also be considered
• Extend our approach to four-character words and beyond