©2012 Paula Matuszek CSC 9010: Text Mining Applications Lab 3 Dr. Paula Matuszek [email protected] [email protected] (610) 647-9789

©2012 Paula Matuszek

CSC 9010: Text Mining Applications

Lab 3

Dr. Paula Matuszek

[email protected]

[email protected]

(610) 647-9789

mailto:[email protected]

mailto:[email protected]


Goals Goals for this lab are:

– More Python– Run a naive Bayes classifier– Evaluate the results

Python

The Natural Language Processing with Python book covers a lot of Python, interspersed with a lot of NLP.

We are mostly interesting in the parts relevant to text mining, so we are skipping a lot.

Unfortunately that means we skip a lot of the Python, some of which we might want.

(Very) Brief Python Overview

Borrowing a presentation: http://www.cis.upenn.edu/~matuszek/Concise

Guides/Concise Python.html To use the NLTK and do the homework

assignments, you don’t actually need a lot of Python. Just plunge in.

If you need more (for your project, for instance), there is a good tutorial at http://docs.python.org/tutorial/

You can also work through more of the NLTK book.

Getting Your Documents In

First step is to get documents into your program.

Hopefully you have all done this. You can give complete paths. If you’re

working in Windows, either use / instead of \ or use \\ (because \ is the escape character)

At this point you have one long string.

Breaking It Down

Most of our operations expect a list of tokens, not a single string.

NLTK has a decent default tokenizer We might also want to do things like

stem it.

Classifying

Basically we:– develop a feature set. NLTK classifiers

expect the input to be pairs of (hashmap of features, class)

– ({'length': 8, 'lastletter': 'e', 'firstletter': 'L'}, 'female')

Choose training and test documents Run a classifier Look at the results.

Classifying

Last time we:– developed a feature set. Dictionary of expect the

input to be a dictionary of (label, value) pairs and a class.

– ({'length': 8, 'lastletter': 'e', 'firstletter': 'L'}, 'female') Chosse training and test documents Ran a classifier Looked at the results. Classification task was names into male and

female


Goals Goals for this lab are:

– Use NLTK Naive Bayes Classifier to classify documents based on word frequency

– Evaluate the results

Classifying Documents

Same set of steps Create a feature set.

– Get a frequency distribution of words in the corpus

– Pick the 2000 most common– Create a feature set of “word there”, true or

false. Classify into positive and negative reviews Evaluate results

Movie Reviews

The NLTK corpus includes a set of 2000 movie reviews, classified into directories of positive and negative. (From Cornell, released in 2004).

NLTK.corpus includes methods to get the categories of reviews, the fileids in each category and the words in each fileid.

Creating the feature set

Too many terms for us! (almost 40K) Get a frequency count and take the most

frequent. For each of the words in that list, for each

document, create a feature:– 'contains(like)': True,

Each document is a two-item list: dictionary of features, category

The featureset is a list of these documents

Doing this for your documents

Decide your features and your categories! Input your documents and their categories. Categories could be:

– the file they are in (like names)– the directory they are in (like movie reviews)– a tag in the document itself (first token, for instance)

Build feature list for each document: a dictionary of label-value pairs– BOW, length, diversity, number of words, etc, etc.

Create a feature set which contains for each document:– a dictionary of features: label, value pairs– a category

Randomize and create training and test sets Run it and look at results :-)