CS 396 Pattern Recognition Project
Language Classifier v1.0
By Paul Troncone, David Keiper, Eugene Schvarts
Topics of discussion…
The Proposal – A Language Classifier
Designing the project
Implementing the project
References
Conclusion
The Proposal – A Language Classifier
The user would input a text file written in a language that uses the standard A-Z alphabet.
The program would then take the text file and determine which language the file was written in.
Designing the Project
Input → Cleanup → Feature Vector (f1 f2 f3 f4 f5 f6) → Classifier → Output
INPUT
English
German
French
Italian
Spanish
Swedish
Polish
Dutch
Romanian
Portuguese
Danish
CLEANUP
Before: This is a test ..... 1987 This is a test too-oo-oo.
After: This is a test. This is a test toooooo.
Removes:
Multiple Periods
Numbers
Hyphens, Slashes, Etc.
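The three removals above can be sketched in a few lines of Java. This is only an illustration under assumptions, not the project's actual cleanUp method: the class name TextCleaner, the order of the steps, and the exact regular expressions are all hypothetical, and "Etc." is read narrowly as hyphens, forward slashes, and backslashes.

```java
/** Hypothetical helper; the slides name a cleanUp method but do not show it. */
final class TextCleaner {
    private TextCleaner() {}

    /** Applies the removals listed on the slide, in an assumed order. */
    static String cleanUp(String text) {
        String s = text.replaceAll("[0-9]+", "");     // numbers
        s = s.replaceAll("[-/\\\\]", "");             // hyphens, slashes, backslashes
        s = s.replaceAll("\\s*\\.(\\s*\\.)*", ".");   // a run of periods becomes one period
        return s.replaceAll("\\s+", " ").trim();      // tidy the whitespace left behind
    }

    public static void main(String[] args) {
        // prints: This is a test. This is a test toooooo.
        System.out.println(cleanUp("This is a test ..... 1987 This is a test too-oo-oo."));
    }
}
```

Removing digits before collapsing periods matters in this sketch: it lets a number that sits between two runs of periods disappear without leaving a stray sentence break.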
FEATURES
Feature Vector:
[0] = Average Word Length
[1] = Percent of Words Ending In Vowels
[2] = Average Sentence Length
[3] = Average Characters Per Sentence
[4] = Average Number of Vowels Per Word
[5] = Number of Words With Z’s
[6] = Number of Words That End in “ing”
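One possible way to compute this vector is sketched below. The slides do not define the features precisely, so several choices here are assumptions: sentences end at '.', '!' or '?'; sentence length is measured in words; characters per sentence include the spaces inside the sentence; and only the plain vowels a, e, i, o, u are counted. The class name FeatureExtractor is hypothetical.

```java
import java.util.Locale;

/** Hypothetical helper: computes the seven slide features from cleaned text. */
final class FeatureExtractor {
    private static final String VOWELS = "aeiou"; // assumption: plain A-Z vowels only

    private FeatureExtractor() {}

    static double[] extract(String text) {
        int sentences = 0, words = 0, letters = 0, vowels = 0;
        int sentenceChars = 0, endInVowel = 0, withZ = 0, endInIng = 0;

        // Assumption: a sentence ends at '.', '!' or '?'.
        for (String raw : text.toLowerCase(Locale.ROOT).split("[.!?]+")) {
            String sentence = raw.trim();
            if (sentence.isEmpty()) continue;
            sentences++;
            sentenceChars += sentence.length(); // includes spaces inside the sentence

            for (String token : sentence.split("\\s+")) {
                String word = token.replaceAll("[^a-z]", ""); // keep A-Z letters only
                if (word.isEmpty()) continue;
                words++;
                letters += word.length();
                for (int i = 0; i < word.length(); i++) {
                    if (VOWELS.indexOf(word.charAt(i)) >= 0) vowels++;
                }
                if (VOWELS.indexOf(word.charAt(word.length() - 1)) >= 0) endInVowel++;
                if (word.indexOf('z') >= 0) withZ++;
                if (word.endsWith("ing")) endInIng++;
            }
        }
        if (words == 0) return new double[7]; // no usable text: all features zero

        return new double[] {
            (double) letters / words,           // [0] average word length
            100.0 * endInVowel / words,         // [1] percent of words ending in vowels
            (double) words / sentences,         // [2] average sentence length (in words)
            (double) sentenceChars / sentences, // [3] average characters per sentence
            (double) vowels / words,            // [4] average number of vowels per word
            withZ,                              // [5] number of words with z's
            endInIng                            // [6] number of words that end in "ing"
        };
    }
}
```

Features [5] and [6] are raw counts in this sketch, as the slide wording suggests, so they grow with the length of the sample; the other five are averages or percentages.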
FEATURES
[Figure: plot of feature values for the eleven languages. Axis ticks: 10 to 90 in steps of 10, and 4 to 7 in steps of 0.5. Legend: Dutch, Danish, English, French, German, Italian, Portuguese, Polish, Romanian, Spanish, Swedish.]
Implementing the Project
Classifier 1
Nearest Neighbor
Features Used:
Average Word Length
Percent of Words Ending In Vowels
Accuracy:
80% – 85%
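A nearest-neighbor classifier over these two features can be sketched as follows. The slides give no implementation details, so the class name NearestNeighbor, the use of squared Euclidean distance, and the absence of any feature scaling are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical 1-nearest-neighbor classifier over two features. */
final class NearestNeighbor {
    private final List<double[]> points = new ArrayList<>();
    private final List<String> labels = new ArrayList<>();

    /** Stores one labeled training sample. */
    void add(String language, double avgWordLength, double pctVowelEndings) {
        points.add(new double[] {avgWordLength, pctVowelEndings});
        labels.add(language);
    }

    /** Returns the label of the closest stored sample. */
    String classify(double avgWordLength, double pctVowelEndings) {
        if (points.isEmpty()) {
            throw new IllegalStateException("no training samples");
        }
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < points.size(); i++) {
            double dx = points.get(i)[0] - avgWordLength;
            double dy = points.get(i)[1] - pctVowelEndings;
            double distance = dx * dx + dy * dy; // squared Euclidean distance
            if (distance < bestDistance) {
                bestDistance = distance;
                best = i;
            }
        }
        return labels.get(best);
    }
}
```

A usage example with made-up training values (not measurements from the project): after `add("English", 4.5, 30.0)` and `add("Italian", 5.0, 75.0)`, the call `classify(4.8, 70.0)` returns "Italian". Because the percentage feature spans a much wider numeric range than the word length, it dominates the unscaled distance in this sketch.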
Classifier 2
Artificial Neural Network
Features Used:
Average Word Length
Percent of Words Ending In Vowels
Average Sentence Length
Average Characters Per Sentence
Average Number of Vowels Per Word
Number of Words With Z’s
Number of Words That End in “ing”
Accuracy:
95%
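The slides do not describe the network's layers, activation functions, or training procedure. The sketch below therefore shows only the forward pass of a generic feed-forward network with one hidden layer, to illustrate how a feature vector could be turned into one score per language. The class name TinyNetwork, the tanh hidden layer, and the softmax output are all assumptions, and the weights would have to come from training that is not shown here.

```java
/** Hypothetical one-hidden-layer network: forward pass only, no training. */
final class TinyNetwork {
    private final double[][] w1; // hidden-layer weights, one row per hidden unit
    private final double[] b1;   // hidden-layer biases
    private final double[][] w2; // output-layer weights, one row per language
    private final double[] b2;   // output-layer biases

    TinyNetwork(double[][] w1, double[] b1, double[][] w2, double[] b2) {
        this.w1 = w1;
        this.b1 = b1;
        this.w2 = w2;
        this.b2 = b2;
    }

    /** Returns one probability per output class for the given feature vector. */
    double[] predict(double[] features) {
        double[] hidden = affine(w1, b1, features);
        for (int i = 0; i < hidden.length; i++) {
            hidden[i] = Math.tanh(hidden[i]);
        }
        return softmax(affine(w2, b2, hidden));
    }

    /** Computes w * x + b. */
    static double[] affine(double[][] w, double[] b, double[] x) {
        double[] out = new double[b.length];
        for (int i = 0; i < b.length; i++) {
            double sum = b[i];
            for (int j = 0; j < x.length; j++) {
                sum += w[i][j] * x[j];
            }
            out[i] = sum;
        }
        return out;
    }

    /** Numerically stable softmax: subtracts the maximum before exponentiating. */
    static double[] softmax(double[] z) {
        double max = z[0];
        for (double v : z) max = Math.max(max, v);
        double[] out = new double[z.length];
        double sum = 0.0;
        for (int i = 0; i < z.length; i++) {
            out[i] = Math.exp(z[i] - max);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) out[i] /= sum;
        return out;
    }

    /** Index of the largest value, i.e. the predicted class. */
    static int argMax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) best = i;
        }
        return best;
    }
}
```

For this project the input would have seven entries (the feature vector) and the output eleven (one per language); the hidden-layer size is a free choice that the slides do not state.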
Creating the Graphical User Interface
Wanted to implement that Java look-and-feel
jComboBox – holds 15 samples plus the option for a random sample.
jButton – allows for the paste functionality
jTextArea – text files are read and added to this area
jRadioButton – triggered when either classifier is clicked
jTextArea – output from classifier appended here
jButton – sets all text areas to null, all buttons to false, effectively clearing the screen of text
jComboBox – holds 11 languages and an option for random
jTextArea – word count appended here
jButton – sends text sample to the feature extractors
jButton – sends text to cleanUp method
jTextArea – output from feature extractors stored in array, then appended here
References
Language Identification and IT: Addressing Problems of Linguistic Diversity on a Global Scale
Peter Constable and Gary Simons, SIL International
6,800 languages known
Region # of languages
Africa 2062
Americas 1020
Asia 2202
Europe 237
Pacific 1312
What factors will form the basis of an operational definition of language?
There are many factors that may be considered, such as the following:
• actual linguistic similarity between speech varieties;
• intelligibility;
• literacy and ability to share a common literature;
• ethnic identities and self-perception of language communities;
• other perceptions and attitudes based on political or social factors.
Problems
• Change
• Categorization: Different operational definitions of language
• Inadequate definition
• Scale: There are on the order of 6,800 languages known to exist
• Documentation
A solution to these problems would be considerably advanced by a compilation of language information
• that consistently applies an operational definition of language so that all entities for which an identifier is assigned are of a comparable nature,
• that encompasses all of the languages of the world,
• that clearly documents the speech variety that each identifier denotes,
• that is maintained and updated on an on-going basis, and
• that is freely and readily accessible to the public over the Internet.
Conclusion
Number Of Features: 7
Size of Training Set: 165 Files
Testing Set: 100+ Files
Overall Success Rate: 93%
Given more time to extract additional features, we could achieve 99.5% accuracy for the set of eleven languages.