Extracting Intelligence Step by Step
Excerpted from
Collective Intelligence in Action (Early Access Edition)
Satnam Alag
MEAP Release: February 2008 | Softbound print: August 2008 (est.) | 425 pages
ISBN: 1933988312
This article is taken from the book Collective Intelligence in Action. This segment shows an example of how intelligence can be extracted from text.
Text processing involves a number of steps: creating tokens from the text, normalizing the text, removing common words that are not helpful, stemming words to their roots, injecting synonyms, and detecting phrases.
At this stage it is helpful to walk through an example of how the term vector can be computed by analyzing text. The intent of this section is to demonstrate the concepts and to keep things simple, so we will develop simple classes for this example. Remember, the typical steps involved in text analysis are shown in Figure 4.8:
1. Tokenization: parse the text to generate terms. Sophisticated analyzers can also extract phrases from the text.
2. Normalization: convert the terms into lowercase.
3. Stop word elimination: remove terms that appear very often and carry little meaning.
4. Stemming: convert the terms into their stemmed form, e.g., remove plurals.
For Source Code, Sample Chapters, the Author Forum and other resources, go tohttp://www.manning.com/alag
Figure 4.8: Typical steps involved in analyzing text (Tokenization → Normalize → Eliminate Stop Words → Stemming)
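The four steps above can be sketched end to end with plain JDK classes. The class name, the stop word list, and the plural rule below are illustrative choices for this sketch, not the book's code:

```java
import java.util.*;

public class PipelineSketch {
    private static final Set<String> STOP_WORDS =
        new HashSet<String>(Arrays.asList("and", "of", "the", "to", "is"));

    // Runs tokenize -> normalize -> eliminate stop words -> naive stemming
    public static List<String> analyze(String text) {
        List<String> terms = new ArrayList<String>();
        for (StringTokenizer st = new StringTokenizer(text); st.hasMoreTokens();) {
            String token = st.nextToken().toLowerCase();        // normalize case
            token = token.replaceAll("[.,]$", "");              // strip trailing punctuation
            if (token.length() == 0 || STOP_WORDS.contains(token)) {
                continue;                                       // eliminate stop words
            }
            if (token.endsWith("s") && token.length() > 1) {
                token = token.substring(0, token.length() - 1); // naive plural stemming
            }
            terms.add(token);
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Collective Intelligence and Web2.0"));
        // -> [collective, intelligence, web2.0]
    }
}
```

Each stage here is deliberately simplistic; the rest of this section refines exactly these steps one at a time.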
In this section we will first set up the example that we will use. We will begin with a simple but naïve way to analyze the text – simply tokenizing it, analyzing the title and body, and taking term frequency into account. Next, we will show the results of eliminating the stop words, followed by the effect of stemming. Lastly, we will show the effect of detecting phrases on the analysis.
Setting up the Example

Let us assume that a reader has posted the following blog entry:
Title: “Collective Intelligence and Web2.0”
Body: “Web2.0 is all about connecting users to users, inviting users to participate and applying their collective
intelligence to improve the application. Collective intelligence enhances the user experience”
There are a few interesting things to note about the blog entry:

The blog entry discusses collective intelligence and Web2.0, and is pertinent to how they affect users.
Notice the number of occurrences of "user" – "users", "users,", and "user".
The title provides valuable information about the content.
We have talked about metadata and the term vector – the code for this is fully developed in Chapter 8. So as not to confuse things, for this example simply think of metadata as being represented by an implementation of the interface MetaDataVector, as shown in listing 4.1.
Listing 4.1 The MetaDataVector Interface
package com.alag.ci;
import java.util.List;
public interface MetaDataVector {
    public List<TagMagnitude> getTagMetaDataMagnitude(); <#1>
    public MetaDataVector add(MetaDataVector other);     <#2>
}

#1 Gets the sorted list of non-zero terms and their weights
#2 Gives the result of adding another MetaDataVector
We have two methods: the first for getting the terms and their weights, and the second for adding another MetaDataVector. Further, assume that we have a way to visualize this MetaDataVector – after all, it consists of tags or terms and their relative weights.1
1 If you really want to see the code for the implementation of the MetaDataVector jump ahead to Chapter 8 or download the available code
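To make the add operation concrete, here is a minimal map-backed sketch of how combining two sparse term vectors might work. The class and method names are illustrative, not the Chapter 8 implementation:

```java
import java.util.*;

public class TermVectorSketch {
    // Adds two sparse term vectors by summing weights term by term
    public static Map<String, Double> add(Map<String, Double> a, Map<String, Double> b) {
        Map<String, Double> result = new HashMap<String, Double>(a);
        for (Map.Entry<String, Double> e : b.entrySet()) {
            Double w = result.get(e.getKey());
            result.put(e.getKey(), (w == null ? 0.0 : w) + e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> title = new HashMap<String, Double>();
        title.put("collective", 1.0);
        title.put("intelligence", 1.0);
        Map<String, Double> body = new HashMap<String, Double>();
        body.put("intelligence", 2.0);
        body.put("user", 3.0);
        // "intelligence" appears in both inputs, so its weights are summed to 3.0
        System.out.println(add(title, body));
    }
}
```

Terms present in only one vector keep their weight; terms present in both have their weights summed, which is the behavior the rest of this example relies on when combining title and body.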
Let us define an interface MetaDataExtractor for the algorithm that will extract metadata – in the form of keywords or tags – by analyzing the text. This is shown in listing 4.2.
Listing 4.2 The MetaDataExtractor Interface
package com.alag.ci.textanalysis;
import com.alag.ci.MetaDataVector;
public interface MetaDataExtractor {
    public MetaDataVector extractMetaData(String title, String body);
}
The interface has only one method, extractMetaData, which analyzes the title and body of the text to generate a MetaDataVector. The MetaDataVector is, in essence, the term vector for the text being analyzed.
Figure 4.9 shows the hierarchy of increasingly complex text analyzers that we will use in the next few sections. First, we will use a simple analyzer to create tokens from the text. Next, we will remove the common words. This will be followed by taking care of plurals. Lastly, we will detect multi-term phrases.
Figure 4.9: The hierarchy of analyzers used to create MetaData from text
With this background, we are now ready to have some fun and work through some code to analyze our blog
entry!
Naïve Analysis

Let's begin by simply tokenizing the text, normalizing it, and getting the frequency count associated with each term. We will also analyze the title and body separately and then combine the information from each. For this we use SimpleMetaDataExtractor, a naïve implementation of our analyzer, whose implementation is shown in listing 4.2.
Listing 4.2 Implementation of the SimpleMetaDataExtractor
package com.alag.ci.textanalysis.impl;
import java.util.*;import com.alag.ci.*;import com.alag.ci.impl.*;import com.alag.ci.textanalysis.MetaDataExtractor;
public class SimpleMetaDataExtractor implements MetaDataExtractor {
    private Map<String, Long> idMap = null; <#1>
    private Long currentId = null;          <#2>

    public SimpleMetaDataExtractor() {
        this.idMap = new HashMap<String, Long>();
        this.currentId = new Long(0);
    }

    public MetaDataVector extractMetaData(String title, String body) {
        MetaDataVector titleMDV = getMetaDataVector(title); <#3>
        MetaDataVector bodyMDV = getMetaDataVector(body);
        return titleMDV.add(bodyMDV);
    }

    private Long getTokenId(String token) { <#4>
        Long id = this.idMap.get(token);
        if (id == null) {
            id = this.currentId++;
            this.idMap.put(token, id);
        }
        return id;
    }

#1 Keeps a map of all the text/tags that are found
#2 Variable used to generate unique ids for tokens found
#3 Uses a heuristic of placing equal weight on title and body
#4 Generates unique ids for text/tags that are found
Since the title provides valuable information, as a heuristic let us say that the resulting MetaDataVector is a combination of the MetaDataVector for the title and the one for the body. Note that as tokens or tags are extracted from the text we need to provide them with a unique id; the method getTokenId takes care of this for the example. In your application, you will probably get the id from the tags table.
The following code extracts metadata for the article:

    MetaDataVector titleMDV = getMetaDataVector(title);
    MetaDataVector bodyMDV = getMetaDataVector(body);
    return titleMDV.add(bodyMDV);

Here, we create a MetaDataVector for the title and one for the body and then simply combine the two.
As new tokens are extracted, a unique id is assigned to them by the following code:

    private Long getTokenId(String token) {
        Long id = this.idMap.get(token);
        if (id == null) {
            id = this.currentId++;
            this.idMap.put(token, id);
        }
        return id;
    }
The remaining piece of code, shown in listing 4.3, is a lot more interesting.
Listing 4.3 Continuing with the implementation of SimpleMetaDataExtractor
    private MetaDataVector getMetaDataVector(String text) {
        Map<String, Integer> keywordMap = new HashMap<String, Integer>();
        StringTokenizer st = new StringTokenizer(text); <#1>
        while (st.hasMoreTokens()) {
            String token = normalizeToken(st.nextToken()); <#2>
            if (acceptToken(token)) { <#3>
                Integer count = keywordMap.get(token);
                if (count == null) {
                    count = new Integer(0);
                }
                count++;
                keywordMap.put(token, count); <#4>
            }
        }
        MetaDataVector mdv = createMetaDataVector(keywordMap); <#5>
        return mdv;
    }

    protected boolean acceptToken(String token) { <#6>
        return true;
    }

    protected String normalizeToken(String token) { <#7>
        String normalizedToken = token.toLowerCase().trim();
        if ((normalizedToken.endsWith(".")) || (normalizedToken.endsWith(","))) {
            int size = normalizedToken.length();
            normalizedToken = normalizedToken.substring(0, size - 1);
        }
        return normalizedToken;
    }
}

#1 Uses a simple StringTokenizer – space delimited
#2 Normalizes the token
#3 Should we accept this token as a valid token?
#4 Keeps a frequency count
#5 Creates a MetaDataVector
#6 Method to decide whether a token is to be accepted
#7 Converts to lowercase and removes punctuation
Here, we use a simple StringTokenizer to break the text into individual words:

    StringTokenizer st = new StringTokenizer(text);
    while (st.hasMoreTokens()) {
We want to normalize the tokens so that they are case insensitive – i.e., "user" and "User" are the same word for us – and also remove the punctuation "," and ".".

    String token = normalizeToken(st.nextToken());

The normalizeToken method simply lowercases the token and removes trailing punctuation:

    protected String normalizeToken(String token) {
        String normalizedToken = token.toLowerCase().trim();
        if ((normalizedToken.endsWith(".")) || (normalizedToken.endsWith(","))) {
            int size = normalizedToken.length();
            normalizedToken = normalizedToken.substring(0, size - 1);
        }
        return normalizedToken;
    }
We may not want to accept all the tokens, so we have a method acceptToken to decide whether a token is to be accepted:

    if (acceptToken(token)) {

All tokens are accepted in this implementation.
The logic behind the method is fairly simple: find the tokens, normalize them, see whether they are to be accepted, and then keep a count of how many times each occurs. The title and body are equally weighted to create the resulting MetaDataVector. With this we have met our goal of creating a set of terms and their relative weights to represent the metadata associated with the content.
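As a quick sanity check of what the naive extractor counts, the raw term frequencies over the sample body can be reproduced with a few lines. The helper class here is hypothetical, written only to mirror the tokenize-normalize-count loop above:

```java
import java.util.*;

public class FrequencyCount {
    // Tokenizes on whitespace, lowercases, strips a trailing '.' or ',' and counts terms
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (StringTokenizer st = new StringTokenizer(text); st.hasMoreTokens();) {
            String token = st.nextToken().toLowerCase().replaceAll("[.,]$", "");
            Integer c = counts.get(token);
            counts.put(token, (c == null ? 0 : c) + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        String body = "Web2.0 is all about connecting users to users, inviting users "
            + "to participate and applying their collective intelligence to improve "
            + "the application. Collective intelligence enhances the user experience";
        Map<String, Integer> counts = count(body);
        System.out.println(counts.get("users") + " vs " + counts.get("user")); // prints "3 vs 1"
    }
}
```

The output makes the fragmentation concrete: "users" is counted three times and "user" once, even though they refer to the same concept.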
A tag cloud is a very useful way to visualize the output of the algorithm. First, let us look at the title, as shown in Figure 4.10. The algorithm tokenizes the title and extracts four equally weighted terms: "and", "collective", "intelligence", and "web2.0". Note that "and" appears as one of the four terms and that "collective" and "intelligence" are two separate terms.
Figure 4.10: The tag cloud for the title – it consists of four terms
Similarly, the tag cloud for the body of the text is shown in Figure 4.11. Notice that words such as "the" and "to" occur frequently and that "user" and "users" are treated as separate terms. There are a total of 20 terms in the body.
Figure 4.11: The tag cloud for the body of the text
Combining the vectors for both the title and the body we get the resulting MetaDataVector whose tag cloud is
shown in Figure 4.12.
Figure 4.12: The resulting tag cloud obtained by combining the title and the body
The three terms "collective", "intelligence", and "web2.0" stand out. However, there are quite a few noise words, such as "all", "and", "is", "the", and "to", that occur so frequently in the English language that they don't add much value. Let us next enhance our implementation by eliminating these terms.
Removing Common Words

Commonly occurring terms are also called stop terms (see Section 2.2) and can be specific to the language and domain. We will implement SimpleStopWordMetaDataExtractor to remove these stop words. The code for this is shown in listing 4.4.
Listing 4.4 Implementation of SimpleStopWordMetaDataExtractor
package com.alag.ci.textanalysis.impl;
import java.util.*;
public class SimpleStopWordMetaDataExtractor extends SimpleMetaDataExtractor {
    private static final String[] stopWords =
        {"and", "of", "the", "to", "is", "their", "can", "all", ""}; <#1>
    private Map<String, String> stopWordsMap = null;

    public SimpleStopWordMetaDataExtractor() {
        this.stopWordsMap = new HashMap<String, String>();
        for (String s : stopWords) {
            this.stopWordsMap.put(s, s);
        }
    }

    protected boolean acceptToken(String token) { <#2>
        return !this.stopWordsMap.containsKey(token);
    }
}

#1 Dictionary of stop words
#2 Don't accept the token if it is a stop word
This class has a dictionary of terms that are to be ignored – in our case a short list; in your application this list will be a lot longer.

    private static final String[] stopWords =
        {"and", "of", "the", "to", "is", "their", "can", "all", ""};
The acceptToken method is overridden to reject any token that appears in the stop word list:

    protected boolean acceptToken(String token) {
        return !this.stopWordsMap.containsKey(token);
    }
Figure 4.13 shows the tag cloud after removing the stop words – we now have 14 terms, down from the original 20. The terms "collective", "intelligence", and "web2.0" stand out. But "user" and "users" are still fragmented and treated as separate terms.
Figure 4.13: The Tag Cloud after removing the stop words
To combine “user” and “users” as one term we need to stem the words.
Stemming

Stemming is the process of converting words to their stemmed form. There are fairly complex algorithms for doing this, Porter stemming being the most commonly used.

There is only one plural in our example: "user" and "users". For now we will enhance our implementation with SimpleStopWordStemmerMetaDataExtractor, whose code is in listing 4.5.
Listing 4.5 Implementation of SimpleStopWordStemmerMetaDataExtractor
package com.alag.ci.textanalysis.impl;
public class SimpleStopWordStemmerMetaDataExtractor extends SimpleStopWordMetaDataExtractor {
    protected String normalizeToken(String token) {
        if (acceptToken(token)) { <#1>
            token = super.normalizeToken(token);
            if (token.endsWith("s")) { <#2>
                int index = token.lastIndexOf("s");
                if (index > 0) {
                    token = token.substring(0, index);
                }
            }
        }
        return token;
    }
}

#1 If the token will be rejected, don't bother normalizing it
#2 Strips the trailing "s" to handle plurals
Here, we override the normalizeToken method. First, it checks to make sure that the token is not a stop word:

    protected String normalizeToken(String token) {
        if (acceptToken(token)) {
            token = super.normalizeToken(token);

Then it simply removes a trailing "s" from the term.
Figure 4.14 shows the tag cloud obtained by stemming the terms. The algorithm merges "user" and "users" into one term and bubbles "user" up.
Figure 4.14: The tag cloud after normalizing the terms
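A caveat worth flagging here: stripping a trailing "s" is a heuristic, not real stemming, and it mangles non-plural words. A small sketch of the rule in isolation makes the limitation visible (the class below is illustrative, not from the book, and is not the Porter algorithm):

```java
public class NaiveStemmer {
    // Strips a single trailing 's' -- the plural heuristic used in this section
    public static String stem(String token) {
        if (token.endsWith("s") && token.length() > 1) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(stem("users"));    // "user" -- as intended
        System.out.println(stem("analysis")); // "analysi" -- a mangled non-plural
    }
}
```

This is one reason production systems typically use a tested algorithm such as Porter's stemmer rather than a suffix rule; for this small example the heuristic is good enough.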
We now have four terms – "collective", "intelligence", "user", and "web2.0" – to describe the blog entry. But "collective intelligence" is really one phrase, so let us enhance our implementation to detect this term.

Detecting Phrases

"Collective intelligence" is the only two-term phrase that we are interested in. For this we will implement SimpleBiTermStopWordStemmerMetaDataExtractor, the code for which is shown in listing 4.6.
Listing 4.6 Implementation of SimpleBiTermStopWordStemmerMetaDataExtractor
package com.alag.ci.textanalysis.impl;
import java.util.*;
import com.alag.ci.MetaDataVector;
public class SimpleBiTermStopWordStemmerMetaDataExtractor
        extends SimpleStopWordStemmerMetaDataExtractor {

    protected MetaDataVector getMetaDataVector(String text) {
        Map<String, Integer> keywordMap = new HashMap<String, Integer>();
        List<String> allTokens = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            String token = normalizeToken(st.nextToken());
            if (acceptToken(token)) {
                Integer count = keywordMap.get(token);
                if (count == null) {
                    count = new Integer(0);
                }
                count++;
                keywordMap.put(token, count);
                allTokens.add(token); <#1>
            }
        }
        String firstToken = allTokens.get(0);
        for (String token : allTokens.subList(1, allTokens.size())) {
            String biTerm = firstToken + " " + token;
            if (isValidBiTermToken(biTerm)) { <#2>
                Integer count = keywordMap.get(biTerm);
                if (count == null) {
                    count = new Integer(0);
                }
                count++;
                keywordMap.put(biTerm, count);
            }
            firstToken = token;
        }
        MetaDataVector mdv = createMetaDataVector(keywordMap);
        return mdv;
    }

    private boolean isValidBiTermToken(String biTerm) { <#3>
        if ("collective intelligence".compareTo(biTerm) == 0) {
            return true;
        }
        return false;
    }
}

#1 Stores the normalized tokens in the order they appear
#2 Takes two tokens at a time and checks whether they form a valid phrase
#3 Phrases are tested for validity against a phrase dictionary
Here, we override the getMetaDataVector method. We store the valid tokens, in the order they appear, in the list allTokens.

Next, the following code combines two consecutive tokens and checks whether they form a valid phrase:

    String firstToken = allTokens.get(0);
    for (String token : allTokens.subList(1, allTokens.size())) {
        String biTerm = firstToken + " " + token;
        if (isValidBiTermToken(biTerm)) {

In our case, there is only one valid phrase, "collective intelligence", and the check is done in the method isValidBiTermToken:

    private boolean isValidBiTermToken(String biTerm) {
        if ("collective intelligence".compareTo(biTerm) == 0) {
            return true;
        }
        return false;
    }
Figure 4.15 shows the tag cloud for the title of the blog after using our new analyzer. As desired, there are four terms: "collective", "collective intelligence", "intelligence", and "web2.0".
Figure 4.15: Tag cloud for the title after using the bi-term analyzer
The combined tag cloud for the blog now contains 14 terms, as shown in Figure 4.16. Five tags stand out: "collective", "collective intelligence", "intelligence", "user", and "web2.0".
Figure 4.16: Tag cloud for the blog after using a bi-term analyzer
Using phrases in the term vector can help in finding other similar content. For example, if we had another article, "Intelligence in a Child", with tokens "intelligence" and "child", there would be a match on the term "intelligence". However, if our analyzer were intelligent enough to extract only "collective intelligence", without the terms "collective" and "intelligence", there would be no match between the two pieces of content.
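One common way to quantify "a match between two pieces of content" is the cosine similarity of their term vectors. The sketch below assumes raw weights stored in a map; the book's MetaDataVector may weight terms differently, so treat this only as an illustration of the idea:

```java
import java.util.*;

public class CosineSimilarity {
    // Cosine similarity between two sparse term-weight vectors:
    // dot product of shared terms divided by the product of the vector norms
    public static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        double normA = 0.0, normB = 0.0;
        for (double w : a.values()) normA += w * w;
        for (double w : b.values()) normB += w * w;
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Double> blog = new HashMap<String, Double>();
        blog.put("collective intelligence", 2.0);
        blog.put("intelligence", 2.0);
        blog.put("web2.0", 2.0);
        Map<String, Double> article = new HashMap<String, Double>();
        article.put("intelligence", 1.0);
        article.put("child", 1.0);
        // Nonzero because both vectors contain "intelligence"
        System.out.println(similarity(blog, article));
    }
}
```

If the analyzer kept only "collective intelligence" and dropped "intelligence", the dot product above would be zero – exactly the trade-off described in the preceding paragraph.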
Hopefully, this gives you a good overview of how text can be analyzed automatically to extract relevant
keywords or tags and build a MetaDataVector.
Now, every Item in your application has an associated MetaDataVector. As Users interact on your site, you can use the MetaDataVectors associated with the Items to develop a profile for each user. Finding items similar to a given item amounts to finding items with similar MetaDataVectors.