Data Mining Final Project
Nick Foti, Eric Kee


  • Data Mining Final Project
    Nick Foti, Eric Kee

  • Topic: Author Identification
    Given writing samples, can we determine who wrote them?
    This is a well-studied field; see also: stylometry.
    It has been applied to works such as the Bible, Shakespeare, and modern texts.

  • Corpus Design
    A corpus is a body of text used for linguistic analysis.
    Used Project Gutenberg to create the corpus.
    The corpus was designed as follows:
    Four authors of varying similarity: Anne Brontë, Charlotte Brontë, Charles Dickens, Upton Sinclair
    Multiple books per author
    Corpus size: 90,000 lines of text

  • Dataset Design
    Extracted features common in the literature:
    Word length
    Frequency of glue words (see Appendix A and [1, 2] for the list of glue words)
    Note: the corpus was processed using C#, Matlab, and Python.
    Data set parameters:
    Number of dimensions: 309 (word length and 308 glue words)
    Number of observations: 3,000
    Each observation: 30 lines of text from a book
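The feature extraction can be sketched as follows. This is an illustrative stand-in, not the authors' C#/Matlab/Python pipeline, and `GLUE_WORDS` here holds five words in place of the 308-word list in Appendix A:

```python
import re

# Illustrative stand-in: five glue words instead of the report's 308 (Appendix A).
GLUE_WORDS = ["the", "and", "of", "to", "a"]

def extract_features(lines):
    """One observation -> feature vector: average word length, then the
    relative frequency of each glue word."""
    words = re.findall(r"[a-z']+", " ".join(lines).lower())
    n = max(len(words), 1)
    avg_len = sum(len(w) for w in words) / n
    counts = {g: 0 for g in GLUE_WORDS}
    for w in words:
        if w in counts:
            counts[w] += 1
    return [avg_len] + [counts[g] / n for g in GLUE_WORDS]

def observations(text, lines_per_obs=30):
    """Split a book into 30-line observations, as the slide describes."""
    lines = text.splitlines()
    return [extract_features(lines[i:i + lines_per_obs])
            for i in range(0, len(lines), lines_per_obs)]
```

With the full 308-word list, each observation yields the 309-dimensional vector the slide describes.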

  • Classifier Testing and Analysis
    Tested classifiers with held-out data:
    Used separate training and testing sets (70% for training, 30% for testing)
    Used cross-validation
    Analyzed classifier performance:
    Used ROC plots
    Used confusion matrices

    Used a common plotting scheme throughout (example at right)
    [Example confusion-matrix plot: Anne B. 78% TP / 22% FP, Charlotte B. 55% TP / 45% FP; red dots indicate true-positive cases]

  • Binary Classification

  • Word Length Classification
    Calculated the average word length of each observation
    Computed a Gaussian kernel density from the word-length samples
    Used an ROC curve to calculate the cutoff
    Optimized sensitivity and specificity with equal importance
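A minimal sketch of the cutoff step: rather than reading the cutoff off an ROC curve built from the kernel densities, this version scans candidate thresholds on raw average word length and picks the one where sensitivity and specificity are most balanced (the equal-importance criterion above). The function name and the assumption that `class_a` skews toward shorter words are ours:

```python
import numpy as np

def equal_error_cutoff(class_a, class_b):
    """Scan candidate cutoffs on average word length and return the one where
    sensitivity (class_a correctly flagged, i.e. below the cutoff) and
    specificity (class_b correctly rejected) are closest to equal."""
    a, b = np.asarray(class_a), np.asarray(class_b)
    grid = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), 500)
    sens = np.array([(a < t).mean() for t in grid])
    spec = np.array([(b >= t).mean() for t in grid])
    return grid[np.argmin(np.abs(sens - spec))]
```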

  • Word Length: Anne B. vs. Upton S.
    [Confusion matrix: Anne Brontë 100% TP / 0% FP; Upton Sinclair 100% TP / 0% FP (perfect separation)]

  • Word Length: Brontë vs. Brontë
    [Confusion matrix: Anne Brontë 100% TP / 0% FP; Charlotte Brontë 78.1% TP / 21.9% FP]

  • Principal Component Analysis
    Used PCA to find a better axis
    Notice: the PCA distribution is similar to the word-length distribution
    Is word length the only useful dimension?

    [Plots: word-length density and PCA density, Anne Brontë vs. Upton Sinclair]

  • Principal Component Analysis

    It appears that word length is the most useful axis; we'll come back to this.
    [Plot: PCA density without word length, Anne Brontë vs. Upton Sinclair]
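The projection can be sketched with a plain eigendecomposition PCA (illustrative; the report's actual code is in the attached .R files). Dropping the word-length column from the data before calling it reproduces the "without word length" experiment:

```python
import numpy as np

def pca_project(X, n_components=1):
    """Project observations onto the top principal components, computed as
    eigenvectors of the covariance matrix of the centered data."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eigh returns eigenvalues in ascending order; take the largest ones.
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return Xc @ top
```

For example, `pca_project(X[:, 1:])` would project onto the leading component of everything except the word-length dimension (assuming word length is column 0, as our sketch above orders the features).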

  • K-Means
    Used K-means to find dominant patterns
    Tried both unnormalized and normalized features
    Trained K-means on the training set
    To classify observations in the test set:
    Calculate the distance from the observation to each class mean
    Assign the observation to the closest class

    Performed cross-validation to estimate performance
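The classification rule above can be sketched as follows. With one cluster per labeled class, the K-means training step reduces to computing per-class means, which is what this illustrative version does:

```python
import numpy as np

def class_means(X_train, y_train):
    """One centroid per author, computed from the training set."""
    labels = sorted(set(y_train))
    means = np.array([X_train[y_train == c].mean(axis=0) for c in labels])
    return labels, means

def classify_nearest(X_test, labels, means):
    """Assign each test observation to the closest class mean."""
    d = np.linalg.norm(X_test[:, None, :] - means[None, :, :], axis=2)
    return [labels[i] for i in d.argmin(axis=1)]
```

Normalizing each feature before computing distances gives the "normalized" variant on the following slides.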

  • Unnormalized K-means: Anne Brontë vs. Upton Sinclair
    [Confusion matrix: Anne Brontë 98.1% TP / 1.9% FP; Upton Sinclair 92.1% TP / 7.9% FP]

  • Unnormalized K-means: Anne Brontë vs. Charlotte Brontë
    [Confusion matrix: Anne Brontë 95.7% TP / 4.3% FP; Charlotte Brontë 74.7% TP / 25.3% FP]

  • Normalized K-means: Anne Brontë vs. Upton Sinclair
    [Confusion matrix: Anne Brontë 53.3% TP / 46.7% FP; Upton Sinclair 49.4% TP / 50.6% FP]

  • Normalized K-means: Anne Brontë vs. Charlotte Brontë
    [Confusion matrix: Anne Brontë 15.8% TP / 84.2% FP; Charlotte Brontë 86.7% TP / 13.3% FP]

  • Discriminant Analysis
    Performed discriminant analysis:
    Computed with equal covariance matrices (used the average Omega of the class pairs)
    Computed with unequal covariance matrices
    Quadratic discrimination fails here because the covariance matrices have zero determinant (see Equation 1)
    Computed the theoretical misclassification probability
    To perform quadratic discriminant analysis:
    Compute Equation 1 for each class
    Choose the class with the minimum value (1)
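For reference, a standard quadratic discriminant score consistent with this description (per-class mean μ_k and covariance Σ_k, class chosen by minimum value, undefined when |Σ_k| = 0) is, up to a constant:

```latex
d_k(x) = (x - \mu_k)^{\top} \Sigma_k^{-1} (x - \mu_k) + \ln\left|\Sigma_k\right|,
\qquad
\hat{y}(x) = \arg\min_k \, d_k(x)
```

Both terms break when Σ_k is singular: Σ_k^{-1} does not exist and ln|Σ_k| diverges, which is the failure the slide describes.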

  • Discriminant Analysis: Anne Brontë vs. Upton Sinclair
    [Confusion matrix: Anne Brontë 92.2% TP / 7.8% FP; Upton Sinclair 96.2% TP / 3.8% FP]
    Theoretical P(err) = 0.149; empirical P(err) = 0.116

  • Discriminant Analysis: Anne Brontë vs. Charlotte Brontë
    [Confusion matrix: Anne Brontë 92.7% TP / 7.3% FP; Charlotte Brontë 89.2% TP / 10.8% FP]
    Theoretical P(err) = 0.181; empirical P(err) = 0.152

  • Logistic Regression
    Fit a linear model to the training data on all dimensions
    Threw out singular dimensions, leaving 298 coefficients plus an intercept
    Projected the training data onto the synthetic variable
    Found a threshold by minimizing the misclassification error
    Projected the testing data onto the synthetic variable
    Used the threshold to classify points
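A minimal sketch of this procedure; the gradient-descent fit below is a stand-in for the report's R model, and the singular-dimension pruning is omitted:

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Plain gradient-descent logistic fit: returns intercept + coefficients."""
    Xb = np.hstack([np.ones((len(X), 1)), X])  # intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def best_threshold(scores, y):
    """Pick the cutoff on the synthetic variable (the linear predictor) that
    minimizes misclassifications on the training set, as the slide describes."""
    cands = np.sort(scores)
    errs = [np.mean((scores >= t) != y) for t in cands]
    return cands[int(np.argmin(errs))]
```

Test observations are then projected with the same coefficients and classified against the stored threshold.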

  • Logistic Regression: Anne Brontë vs. Charlotte Brontë
    [Confusion matrix: Anne Brontë 89.5% TP / 10.5% FP; Charlotte Brontë 92% TP / 8% FP]

  • Logistic Regression: Anne Brontë vs. Upton Sinclair
    [Confusion matrix: Anne Brontë 98% TP / 2% FP; Upton Sinclair 99% TP / 2% FP]

  • 4-Class Classification

  • 4-Class K-means
    Used K-means to find patterns among all classes
    Tried both unnormalized and normalized features
    Trained using a training set
    Tested performance as in the 2-class K-means case
    Performed cross-validation to estimate performance

  • Unnormalized K-Means
    [4-class confusion matrix for Anne Brontë, Charlotte Brontë, Upton Sinclair, and Charles Dickens; reported rates: 22%, 54%, 87%, 34%, 59%, 88%]

  • Normalized K-Means
    [4-class confusion matrix for Anne Brontë, Charlotte Brontë, Upton Sinclair, and Charles Dickens; reported rates: 20%, 67%, 26%, 67%, 70%, 67%, 27%]

  • Additional K-means Testing
    Also tested K-means without word length
    Recall that we had perfect classification with one-dimensional word length (see plot below)
    Is K-means using only one dimension to classify?

    Note: perfect classification only occurs between Anne B. and Sinclair
    [Plot: Anne Brontë vs. Upton Sinclair]

  • Unnormalized K-Means (No Word Length)
    K-means can classify without word length
    [4-class confusion matrix for Anne Brontë, Charlotte Brontë, Upton Sinclair, and Charles Dickens; reported rates: 35%, 29%, 44%, 33%, 35%, 43%, 72%]

  • Multinomial Regression
    The multinomial distribution is an extension of the binomial distribution: the random variable may take on n values
    Used multinom() to fit a log-linear model to the training data
    Used 248 dimensions (the maximum our computer could handle)
    The fit returns 3 coefficients per dimension and 3 intercepts
    Found the probability that each observation belongs to each class

  • Multinomial Regression
    The multinomial logit function is

    Pr(y_i = j) = exp(β_j · x_i + c_j) / Σ_k exp(β_k · x_i + c_k)

    where the β_j are the coefficients and the c_j are the intercepts.

    To classify:
    Compute the probabilities Pr(y_i = Dickens), Pr(y_i = Anne B.), …
    Choose the class with the maximum probability
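A sketch of the classification step, assuming a coefficient matrix `B` (one row per class) and intercepts `c` have already been fit; note that multinom() actually parameterizes coefficients relative to a baseline class, which this simplified form glosses over:

```python
import numpy as np

def softmax_probs(X, B, c):
    """Pr(y_i = j) for each class j, via the multinomial logit function:
    exponentiate each class's linear predictor and normalize across classes."""
    z = X @ B.T + c
    z -= z.max(axis=1, keepdims=True)  # for numerical stability only
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def classify_max_prob(X, B, c, classes):
    """Choose the class with maximum probability, as on the slide."""
    return [classes[i] for i in softmax_probs(X, B, c).argmax(axis=1)]
```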

  • Multinomial Regression
    [4-class confusion matrix for Anne Brontë, Charlotte Brontë, Upton Sinclair, and Charles Dickens; reported rates: 78%, 86%, 83%, 93%]

  • Multinomial Regression (Without Word Length)
    Multinomial regression does not require word length
    [4-class confusion matrix for Anne Brontë, Charlotte Brontë, Upton Sinclair, and Charles Dickens; reported rates: 76%, 79%, 79%, 91%]

  • Appendix A: Glue WordsI a aboard about above across after again against ago ahead all almost along alongside already also although always am amid amidst among amongst an and another any anybody anyone anything anywhere apart are aren't around as aside at away back backward backwards be because been before beforehand behind being below between beyond both but by can can't cannot could couldn't dare daren't despite did didn't directly do does doesn't doing don't done down during each either else elsewhere enough even ever evermore every everybody everyone everything everywhere except fairly farther few fewer for forever forward from further furthermore had hadn't half hardly has hasn't have haven't having he hence her here hers herself him himself his how however if in indeed inner inside instead into is isn't it its itself just keep kept later least less lest like likewise little low lower many may mayn't me might mightn't mine minus more moreover most much must mustn't my myself near need needn't neither never nevertheless next no no-one nobody none nor not nothing notwithstanding now nowhere of off often on once one ones only onto opposite or other others otherwise ought oughtn't our ours ourselves out outside over own past per perhaps please plus provided quite rather really round same self selves several shall shan't she should shouldn't since so some somebody someday someone something sometimes somewhat still such than that the their theirs them themselves then there therefore these they thing things this those though through throughout thus till to together too towards under underneath undoing unless unlike until up upon upwards us versus very via was wasn't way we well were weren't what whatever when whence whenever where whereas whereby wherein wherever whether which whichever while whilst whither who whoever whom with whose within why without will won't would wouldn't yet you your yours yourself yourselves

  • Conclusions
    Authors can be identified by their word-usage frequencies
    Word length may be used to distinguish between the Brontë sisters
    Word length does not, however, extend to all authors (see Appendix C)
    The glue words describe genuine differences between all four authors
    K-means distinguishes the same patterns that multinomial regression classifies
    This indicates that supervised training finds legitimate patterns, rather than artifacts
    The Brontë sisters are the most similar authors; Upton Sinclair is the most different

  • Appendix B: CodeSee Attached .R files

  • Appendix C: Single-Dimension 4-Author Classification
    Classification using multinomial regression on word length alone
    [4-class confusion matrix for Anne Brontë, Charlotte Brontë, Upton Sinclair, and Charles Dickens; reported rates: 22%, 46%, 94%, 6%, 11%, 54%, 96%, 3%]

  • References

    [1] Argamon, Šarić, and Stein, "Style Mining of Electronic Messages for Multiple Authorship Discrimination: First Results," SIGKDD, 2003.

    [2] Mitton, "Spelling Checkers, Spelling Correctors and the Misspellings of Poor Spellers," Information Processing and Management, 1987.