CS 396 Pattern Recognition Project Language Classifier v1.0 By Paul Troncone, David Keiper, Eugene Schvarts

Topics of discussion

• The Proposal – A Language Classifier
• Designing the project
• Implementing the project
• References
• Conclusion

The Proposal – A Language Classifier

The user inputs a text file written in a language that uses the standard A-Z alphabet.

The program then takes the text file and determines which language it was written in.

Designing the Project

[Figure: processing pipeline. Input → Cleanup → Features (f1 f2 f3 f4 f5 f6) → Vector → Classifier → Output]

INPUT

English

German

French

Italian

Spanish

Swedish

Polish

Dutch

Romanian

Portuguese

Danish

CLEANUP

Before: This is a test ..... 1987 This is a test too-oo-oo.

After: This is a test. This is a test toooooo.

Removes:

Multiple Periods

Numbers

Hyphens, Slashes, Etc.
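The slides do not show the cleanup code itself, so the following is only a minimal Python sketch of the three removals listed above. The function name clean_up, the exact set of punctuation treated as "Hyphens, Slashes, Etc.", and the whitespace handling are assumptions made for illustration.

```python
import re

def clean_up(text: str) -> str:
    """Hypothetical cleanup pass: strips numbers, joins hyphenated/slashed
    fragments, and collapses repeated periods into a single period."""
    text = re.sub(r"\d+", "", text)          # remove numbers
    text = re.sub(r"[-/\\_]", "", text)      # remove hyphens, slashes, etc. (assumed set)
    text = re.sub(r"\.{2,}", ".", text)      # collapse multiple periods into one
    text = re.sub(r"\s+", " ", text)         # normalize whitespace left behind
    text = re.sub(r"\s+\.", ".", text)       # no space before a sentence-ending period
    return text.strip()

print(clean_up("This is a test ..... 1987 This is a test too-oo-oo."))
```

On the sample sentence from the CLEANUP slide this yields the cleaned sentence shown on the slides.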

FEATURES

Feature Vector:

[0] = Average Word Length

[1] = Percent of Words Ending In Vowels

[2] = Average Sentence Length

[3] = Average Characters Per Sentence

[4] = Average Number of Vowels Per Word

[5] = Number of Words With Z’s

[6] = Number of Words That End in “ing”
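As an illustration of how such a feature vector might be computed, here is a hedged Python sketch. The function name extract_features, the vowel set (a, e, i, o, u), the use of ".", "!" and "?" as sentence boundaries, and measuring sentence length in words are all assumptions; the slides do not specify them.

```python
import re

VOWELS = set("aeiou")  # assumption: "y" and accented letters are not counted

def extract_features(text: str) -> list[float]:
    """Return the seven features in the order listed on the slide."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z]+", text.lower())
    if not words or not sentences:
        return [0.0] * 7
    n_words, n_sent = len(words), len(sentences)
    return [
        sum(len(w) for w in words) / n_words,                        # [0] average word length
        100.0 * sum(w[-1] in VOWELS for w in words) / n_words,       # [1] percent of words ending in vowels
        n_words / n_sent,                                            # [2] average sentence length (words)
        sum(len(s) for s in sentences) / n_sent,                     # [3] average characters per sentence
        sum(sum(c in VOWELS for c in w) for w in words) / n_words,   # [4] average vowels per word
        float(sum("z" in w for w in words)),                         # [5] number of words with z's
        float(sum(w.endswith("ing") for w in words)),                # [6] number of words ending in "ing"
    ]

features = extract_features("Zebras are amazing. Singing is fun.")
```

Features [5] and [6] are raw counts, as on the slide, so they grow with the length of the sample; dividing by the word count would be one way to make them comparable across files.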

FEATURES

[Figure: plot of feature values by language. Axis ticks run from 4 to 7 and from 10 to 90. Legend: Dutch, Danish, English, French, German, Italian, Portuguese, Polish, Romanian, Spanish, Swedish.]

Implementing the Project

Classifier 1

Nearest Neighbor

Features Used:

Average Word Length

Percent of Words Ending In Vowels

Accuracy:

80% – 85%
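A nearest-neighbor classifier over the two features above can be sketched as follows. This is an illustrative Python sketch, not the project's code: the Euclidean distance metric, the function name nearest_neighbor, and the training points (made-up numbers with neutral labels, not measurements from the project) are all assumptions.

```python
import math

def nearest_neighbor(sample, training):
    """Return the label of the training point closest to `sample`.

    `sample` is (average word length, percent of words ending in vowels);
    `training` is a list of (feature_tuple, label) pairs.
    """
    best_label, best_dist = None, math.inf
    for features, label in training:
        d = math.dist(sample, features)  # Euclidean distance (assumed metric)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

# Hypothetical training points, invented purely for illustration.
training = [
    ((4.5, 30.0), "language A"),
    ((5.2, 65.0), "language B"),
    ((6.1, 25.0), "language C"),
]

print(nearest_neighbor((5.0, 60.0), training))
```

Because the percentage feature spans a much wider numeric range than word length, an unscaled distance is dominated by it; rescaling both features to a comparable range would be one option to consider.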

Classifier 2

Artificial Neural Network

Features Used:

Average Word Length

Percent of Words Ending In Vowels

Average Sentence Length

Average Characters Per Sentence

Average Number of Vowels Per Word

Number of Words With Z’s

Number of Words That End in “ing”

Accuracy:

95%
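The slides give no details of the network's architecture or training, so the sketch below only illustrates the general shape of such a classifier: a forward pass that maps the seven features to one probability per language. The single hidden layer, its size (5), the tanh and softmax functions, and the random untrained weights are all assumptions; training is not shown.

```python
import math
import random

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer network: tanh hidden units, softmax output."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    scores = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(w_out, b_out)]
    top = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]           # probabilities that sum to 1

def random_layer(n_out, n_in, rng):
    """Placeholder weights; a real network would learn these from the training set."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

N_FEATURES, N_HIDDEN, N_LANGUAGES = 7, 5, 11   # hidden size is an assumption
rng = random.Random(0)
w_hidden = random_layer(N_HIDDEN, N_FEATURES, rng)
w_out = random_layer(N_LANGUAGES, N_HIDDEN, rng)
b_hidden, b_out = [0.0] * N_HIDDEN, [0.0] * N_LANGUAGES

probs = forward([0.0] * N_FEATURES, w_hidden, b_hidden, w_out, b_out)
predicted_index = max(range(N_LANGUAGES), key=lambda i: probs[i])
```

The predicted language is the output unit with the highest probability. In practice the input features would usually be scaled to similar ranges before being fed to the network.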

Creating the Graphical User Interface

Wanted to implement that Java look-and-feel

jComboBox – holds 15 samples plus the option for a random sample.

jButton – allows for the paste functionality

jTextArea – text files are read and added to this area

jRadioButton – triggered when either classifier is clicked

jTextArea – output from classifier appended here

jButton – sets all text areas to null, all buttons to false, effectively clearing the screen of text

jComboBox – holds 11 languages and an option for random

jTextArea – word count appended here

jButton – sends text sample to the feature extractors

jButton – sends text to cleanUp method

jTextArea – output from feature extractors stored in array, then appended here

References

Language Identification and IT: Addressing Problems of Linguistic Diversity on a Global Scale

Peter Constable and Gary Simons, SIL International

6,800 languages known

Region      # of languages
Africa      2062
Americas    1020
Asia        2202
Europe      237
Pacific     1312

What factors will form the basis of an operational definition of language?

There are many factors that may be considered, such as the following:
• actual linguistic similarity between speech varieties;
• intelligibility;
• literacy and ability to share a common literature;
• ethnic identities and self-perception of language communities;
• other perceptions and attitudes based on political or social factors.

Problems

• Change
• Categorization: different operational definitions of language
• Inadequate definition
• Scale: there are on the order of 6,800 languages known to exist
• Documentation

A solution to these problems would be considerably advanced by a compilation of language information
• that consistently applies an operational definition of language so that all entities for which an identifier is assigned are of a comparable nature,
• that encompasses all of the languages of the world,
• that clearly documents the speech variety that each identifier denotes,
• that is maintained and updated on an on-going basis, and
• that is freely and readily accessible to the public over the Internet.

Conclusion

Number Of Features: 7

Size of Training Set: 165 Files

Testing Set: 100+ Files

Overall Success Rate: 93%
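For reference, a success rate of this kind is the percentage of test files whose predicted language matches the true one. A minimal Python sketch follows; the function name success_rate and the example labels are illustrative only and are not taken from the project's test set.

```python
def success_rate(predicted, actual):
    """Percentage of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute a success rate for an empty test set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

# Illustrative labels only: three of four predictions are correct.
rate = success_rate(
    ["English", "German", "French", "Dutch"],
    ["English", "German", "French", "Danish"],
)
```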

Given more time to extract additional features, we could achieve 99.5% accuracy for the set of eleven languages.
