©2012 Paula Matuszek CSC 9010: Text Mining Applications Lab 3 Dr. Paula Matuszek...

CSC 9010: Text Mining Applications

Dr. Paula Matuszek

Paula.Matuszek@villanova.edu

Paula.Matuszek@gmail.com

(610) 647-9789

Goals

Goals for this lab are:

– More Python
– Run a naive Bayes classifier
– Evaluate the results

Python

The Natural Language Processing with Python book covers a lot of Python, interspersed with a lot of NLP.

We are mostly interested in the parts relevant to text mining, so we are skipping a lot.

Unfortunately that means we skip a lot of the Python, some of which we might want.

(Very) Brief Python Overview

Borrowing a presentation: http://www.cis.upenn.edu/~matuszek/Concise Guides/Concise Python.html

To use the NLTK and do the homework assignments, you don’t actually need a lot of Python. Just plunge in.

If you need more (for your project, for instance), there is a good tutorial at http://docs.python.org/tutorial/

You can also work through more of the NLTK book.

Getting Your Documents In

The first step is to get documents into your program.

Hopefully you have all done this. You can give complete paths. If you’re working in Windows, either use / instead of \ or use \\ (because \ is the escape character).

At this point you have one long string.
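A minimal sketch of this step, using only the Python standard library. The file name, its contents, and the `read_document` helper are made up for illustration; the raw-string spelling is an extra option not mentioned on the slide.

```python
import os
import tempfile

def read_document(path):
    """Return the whole file at `path` as one long string."""
    with open(path, encoding="utf-8") as f:
        return f.read()

# Create a small throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "review.txt")  # hypothetical file
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("A surprisingly good movie.\nI liked it.")
    raw = read_document(sample_path)

print(type(raw).__name__, len(raw))

# On Windows, these spellings all name the same (hypothetical) file:
windows_forward = "C:/docs/review.txt"    # forward slashes
windows_escaped = "C:\\docs\\review.txt"  # doubled backslashes
windows_raw = r"C:\docs\review.txt"       # raw string: \ is not an escape here
```

The doubled-backslash and raw-string forms produce identical strings; the forward-slash form is a different string that Windows resolves to the same file.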

Breaking It Down

Most of our operations expect a list of tokens, not a single string.

NLTK has a decent default tokenizer.

We might also want to do things like stem the tokens.
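A short sketch of tokenizing and stemming. The usual default tokenizer, `nltk.word_tokenize`, needs the Punkt model to be downloaded first, so this example swaps in `wordpunct_tokenize`, a simpler regular-expression tokenizer that needs no downloaded data. The sample sentence is made up.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

raw = "The cats are running."  # stand-in for the long string read from a file

# Split the string into a list of word and punctuation tokens.
tokens = wordpunct_tokenize(raw)

# Reduce each token to its stem with the Porter stemmer.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

print(tokens)
print(stems)
```

Whether to stem at all is a design choice: it merges related word forms into one feature, at the cost of occasionally merging words that should stay distinct.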

Classifying

Basically we:

– Develop a feature set. NLTK classifiers expect the input to be pairs of (hashmap of features, class):
  ({'length': 8, 'lastletter': 'e', 'firstletter': 'L'}, 'female')
– Choose training and test documents
– Run a classifier
– Look at the results
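The steps above can be sketched for the names task as follows. The `gender_features` function is one plausible extractor that yields the dictionary shown on the slide, and the eight labeled names are invented for illustration; a real run would use a much larger list.

```python
import nltk

def gender_features(name):
    """Build the feature dictionary for one name."""
    return {
        "length": len(name),
        "lastletter": name[-1],
        "firstletter": name[0],
    }

# Tiny made-up training list; far too small for meaningful results.
labeled_names = [
    ("Lorraine", "female"), ("Anna", "female"),
    ("Julie", "female"), ("Maria", "female"),
    ("John", "male"), ("Peter", "male"),
    ("David", "male"), ("Mark", "male"),
]

# NLTK classifiers take a list of (feature dictionary, class) pairs.
train_set = [(gender_features(name), gender) for name, gender in labeled_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(gender_features("Lorraine"))
print(classifier.classify(gender_features("Sophie")))
```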

Classifying

Last time we:

– Developed a feature set. NLTK classifiers expect the input to be a dictionary of (label, value) pairs and a class:
  ({'length': 8, 'lastletter': 'e', 'firstletter': 'L'}, 'female')
– Chose training and test documents
– Ran a classifier
– Looked at the results

The classification task was sorting names into male and female.

Goals

Goals for this lab are:

– Use the NLTK Naive Bayes classifier to classify documents based on word frequency

– Evaluate the results

Classifying Documents

The same set of steps:

– Create a feature set.
  – Get a frequency distribution of words in the corpus
  – Pick the 2000 most common
  – Create a feature set of “word there”, true or false
– Classify into positive and negative reviews
– Evaluate results

Movie Reviews

The NLTK corpus includes a set of 2000 movie reviews, classified into directories of positive and negative. (From Cornell, released in 2004).

The nltk.corpus module provides methods to get the categories of reviews, the fileids in each category, and the words in each fileid.
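A sketch of how those three methods combine into a list of labeled documents. The `load_labeled_documents` helper and the `TinyCorpus` class are hypothetical: `TinyCorpus` only imitates the `categories()`, `fileids()`, and `words()` methods so the example runs without downloading the movie review data.

```python
def load_labeled_documents(corpus):
    """Return a list of (list of words, category) pairs from a categorized corpus."""
    return [
        (list(corpus.words(fileid)), category)
        for category in corpus.categories()
        for fileid in corpus.fileids(category)
    ]

class TinyCorpus:
    """Hypothetical stand-in exposing the same three methods as the real corpus."""
    _files = {
        "neg/a.txt": ("neg", ["dull", "and", "slow"]),
        "pos/b.txt": ("pos", ["funny", "and", "smart"]),
    }

    def categories(self):
        return sorted({category for category, _ in self._files.values()})

    def fileids(self, category):
        return [fid for fid, (cat, _) in self._files.items() if cat == category]

    def words(self, fileid):
        return self._files[fileid][1]

documents = load_labeled_documents(TinyCorpus())
print(documents)

# With the real corpus, after running nltk.download("movie_reviews"):
#   from nltk.corpus import movie_reviews
#   documents = load_labeled_documents(movie_reviews)
```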

Creating the feature set

Too many terms for us! (almost 40K)

Get a frequency count and take the most frequent.

For each of the words in that list, for each document, create a feature:
– 'contains(like)': True,

Each document is a two-item list: dictionary of features, category

The featureset is a list of these documents
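A small sketch of building such a featureset. The four documents are made up, and `TOP_N` is set to 4 so the toy example stays readable; the lab uses the 2000 most common words. The `document_features` name is chosen here for illustration.

```python
import nltk

# Made-up stand-ins for (words, category) pairs from the review corpus.
documents = [
    (["i", "like", "this", "movie"], "pos"),
    (["i", "like", "the", "cast"], "pos"),
    (["a", "dull", "movie"], "neg"),
    (["dull", "and", "slow"], "neg"),
]

# Frequency count over every word in every document.
all_words = nltk.FreqDist(word for words, _ in documents for word in words)

TOP_N = 4  # assumption for this toy data; the lab uses 2000
word_features = [word for word, _ in all_words.most_common(TOP_N)]

def document_features(words):
    """One 'contains(word)' True/False feature per frequent word."""
    present = set(words)
    return {"contains(%s)" % word: (word in present) for word in word_features}

# The featureset: one (feature dictionary, category) pair per document.
featuresets = [(document_features(words), category) for words, category in documents]
print(featuresets[0])
```

Converting each document to a set first keeps the membership checks fast, which matters once there are 2000 features per document.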

Doing this for your documents

– Decide your features and your categories!
– Input your documents and their categories. Categories could be:
  – the file they are in (like names)
  – the directory they are in (like movie reviews)
  – a tag in the document itself (first token, for instance)
– Build feature list for each document: a dictionary of label-value pairs
  – BOW, length, diversity, number of words, etc, etc.
– Create a feature set which contains for each document:
  – a dictionary of features: label, value pairs
  – a category
– Randomize and create training and test sets
– Run it and look at results :-)
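The steps above can be sketched end to end as follows. Everything specific here is an assumption for illustration: the sample texts, the feature choices in `document_features`, the thresholds, and the 80/20 split. Numeric measures such as length and diversity are turned into True/False values because the NLTK naive Bayes classifier treats every feature value as a discrete label.

```python
import random

import nltk

def document_features(words):
    """Hypothetical extractor mixing word-presence, length and diversity features."""
    n_words = len(words)
    diversity = len(set(words)) / n_words if n_words else 0.0
    return {
        "long": n_words > 5,           # length, binned into True/False
        "diverse": diversity > 0.8,    # distinct words / total words, binned
        "contains(good)": "good" in words,
        "contains(bad)": "bad" in words,
    }

# Made-up documents; the category is supplied alongside each text.
pos_texts = ["a good film", "good cast and a good script",
             "really good fun", "good acting throughout the whole long film"]
neg_texts = ["a bad film", "bad cast and a bad script",
             "really bad mess", "bad acting throughout the whole long film"]
labeled_texts = ([(text, "pos") for text in pos_texts] * 5
                 + [(text, "neg") for text in neg_texts] * 5)

featuresets = [(document_features(text.split()), label)
               for text, label in labeled_texts]

# Randomize, then hold out the last 20% as the test set.
random.seed(0)
random.shuffle(featuresets)
cutoff = int(len(featuresets) * 0.8)
train_set, test_set = featuresets[:cutoff], featuresets[cutoff:]

classifier = nltk.NaiveBayesClassifier.train(train_set)
accuracy = nltk.classify.accuracy(classifier, test_set)
print("accuracy:", accuracy)
classifier.show_most_informative_features(5)
```

Shuffling before the split matters when documents arrive grouped by category; without it the test set could contain only one class.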
