13
©2012 Paula Matuszek CSC 9010: Text Mining Applications Lab 3 Dr. Paula Matuszek [email protected] [email protected] (610) 647-9789

©2012 Paula Matuszek CSC 9010: Text Mining Applications Lab 3 Dr. Paula Matuszek [email protected] [email protected] (610) 647-9789

Embed Size (px)

Citation preview

Page 1: ©2012 Paula Matuszek CSC 9010: Text Mining Applications Lab 3 Dr. Paula Matuszek Paula.Matuszek@villanova.edu Paula.Matuszek@gmail.com (610) 647-9789

©2012 Paula Matuszek

CSC 9010: Text Mining Applications

Lab 3

Dr. Paula Matuszek

[email protected]

[email protected]

(610) 647-9789

Page 2: ©2012 Paula Matuszek CSC 9010: Text Mining Applications Lab 3 Dr. Paula Matuszek Paula.Matuszek@villanova.edu Paula.Matuszek@gmail.com (610) 647-9789

©2012 Paula Matuszek

Goals Goals for this lab are:

– More Python– Run a naive Bayes classifier– Evaluate the results

Page 3: ©2012 Paula Matuszek CSC 9010: Text Mining Applications Lab 3 Dr. Paula Matuszek Paula.Matuszek@villanova.edu Paula.Matuszek@gmail.com (610) 647-9789

Python

The Natural Language Processing with Python book covers a lot of Python, interspersed with a lot of NLP.

We are mostly interesting in the parts relevant to text mining, so we are skipping a lot.

Unfortunately that means we skip a lot of the Python, some of which we might want.

Page 4: ©2012 Paula Matuszek CSC 9010: Text Mining Applications Lab 3 Dr. Paula Matuszek Paula.Matuszek@villanova.edu Paula.Matuszek@gmail.com (610) 647-9789

(Very) Brief Python Overview

Borrowing a presentation: http://www.cis.upenn.edu/~matuszek/Concise

Guides/Concise Python.html To use the NLTK and do the homework

assignments, you don’t actually need a lot of Python. Just plunge in.

If you need more (for your project, for instance), there is a good tutorial at http://docs.python.org/tutorial/

You can also work through more of the NLTK book.

Page 5: ©2012 Paula Matuszek CSC 9010: Text Mining Applications Lab 3 Dr. Paula Matuszek Paula.Matuszek@villanova.edu Paula.Matuszek@gmail.com (610) 647-9789

Getting Your Documents In

First step is to get documents into your program.

Hopefully you have all done this. You can give complete paths. If you’re

working in Windows, either use / instead of \ or use \\ (because \ is the escape character)

At this point you have one long string.

Page 6: ©2012 Paula Matuszek CSC 9010: Text Mining Applications Lab 3 Dr. Paula Matuszek Paula.Matuszek@villanova.edu Paula.Matuszek@gmail.com (610) 647-9789

Breaking It Down

Most of our operations expect a list of tokens, not a single string.

NLTK has a decent default tokenizer We might also want to do things like

stem it.

Page 7: ©2012 Paula Matuszek CSC 9010: Text Mining Applications Lab 3 Dr. Paula Matuszek Paula.Matuszek@villanova.edu Paula.Matuszek@gmail.com (610) 647-9789

Classifying

Basically we:– develop a feature set. NLTK classifiers

expect the input to be pairs of (hashmap of features, class)

– ({'length': 8, 'lastletter': 'e', 'firstletter': 'L'}, 'female')

Choose training and test documents Run a classifier Look at the results.

Page 8: ©2012 Paula Matuszek CSC 9010: Text Mining Applications Lab 3 Dr. Paula Matuszek Paula.Matuszek@villanova.edu Paula.Matuszek@gmail.com (610) 647-9789

Classifying

Last time we:– developed a feature set. Dictionary of expect the

input to be a dictionary of (label, value) pairs and a class.

– ({'length': 8, 'lastletter': 'e', 'firstletter': 'L'}, 'female') Chosse training and test documents Ran a classifier Looked at the results. Classification task was names into male and

female

Page 9: ©2012 Paula Matuszek CSC 9010: Text Mining Applications Lab 3 Dr. Paula Matuszek Paula.Matuszek@villanova.edu Paula.Matuszek@gmail.com (610) 647-9789

©2012 Paula Matuszek

Goals Goals for this lab are:

– Use NLTK Naive Bayes Classifier to classify documents based on word frequency

– Evaluate the results

Page 10: ©2012 Paula Matuszek CSC 9010: Text Mining Applications Lab 3 Dr. Paula Matuszek Paula.Matuszek@villanova.edu Paula.Matuszek@gmail.com (610) 647-9789

Classifying Documents

Same set of steps Create a feature set.

– Get a frequency distribution of words in the corpus

– Pick the 2000 most common– Create a feature set of “word there”, true or

false. Classify into positive and negative reviews Evaluate results

Page 11: ©2012 Paula Matuszek CSC 9010: Text Mining Applications Lab 3 Dr. Paula Matuszek Paula.Matuszek@villanova.edu Paula.Matuszek@gmail.com (610) 647-9789

Movie Reviews

The NLTK corpus includes a set of 2000 movie reviews, classified into directories of positive and negative. (From Cornell, released in 2004).

NLTK.corpus includes methods to get the categories of reviews, the fileids in each category and the words in each fileid.

Page 12: ©2012 Paula Matuszek CSC 9010: Text Mining Applications Lab 3 Dr. Paula Matuszek Paula.Matuszek@villanova.edu Paula.Matuszek@gmail.com (610) 647-9789

Creating the feature set

Too many terms for us! (almost 40K) Get a frequency count and take the most

frequent. For each of the words in that list, for each

document, create a feature:– 'contains(like)': True,

Each document is a two-item list: dictionary of features, category

The featureset is a list of these documents

Page 13: ©2012 Paula Matuszek CSC 9010: Text Mining Applications Lab 3 Dr. Paula Matuszek Paula.Matuszek@villanova.edu Paula.Matuszek@gmail.com (610) 647-9789

Doing this for your documents

Decide your features and your categories! Input your documents and their categories. Categories could be:

– the file they are in (like names)– the directory they are in (like movie reviews)– a tag in the document itself (first token, for instance)

Build feature list for each document: a dictionary of label-value pairs– BOW, length, diversity, number of words, etc, etc.

Create a feature set which contains for each document:– a dictionary of features: label, value pairs– a category

Randomize and create training and test sets Run it and look at results :-)