Extracting Intelligence Step by Step
Excerpted from
Collective Intelligence in Action (Early Access Edition)
Satnam Alag
MEAP Release: February 2008 | Softbound print: August 2008 (est.) | 425 pages
ISBN: 1933988312
This article is taken from the book Collective Intelligence in Action. This segment shows an example of how intelligence can be extracted from text.
Text processing involves a number of steps: creating tokens from the text, normalizing the text, removing common words that are not helpful, stemming words to their roots, injecting synonyms, and detecting phrases.
At this stage it is helpful to walk through an example of how the term vector can be computed by analyzing text. The intent of this section is to demonstrate the concepts and to keep things simple, so we will develop simple classes for this example. Remember, the typical steps involved in text analysis are shown in Figure 4.8:
1. Tokenization: parse the text to generate terms. Sophisticated analyzers can also extract phrases from the text.
2. Normalization: convert the terms into lowercase.
3. Stop word elimination: remove terms that appear very often and carry little meaning.
4. Stemming: convert the terms into their stemmed form, e.g., remove plurals.
For Source Code, Sample Chapters, the Author Forum and other resources, go tohttp://www.manning.com/alag
Figure 4.8: Typical steps involved in analyzing text (Tokenization → Normalize → Eliminate Stop Words → Stemming)
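The four steps above can be sketched end to end with plain JDK classes. The class name, the stop word list, and the plural rule below are illustrative choices for this sketch, not the book's code:

```java
import java.util.*;

public class PipelineSketch {
    private static final Set<String> STOP_WORDS =
        new HashSet<String>(Arrays.asList("and", "of", "the", "to", "is"));

    // Runs tokenize -> normalize -> eliminate stop words -> naive stemming
    public static List<String> analyze(String text) {
        List<String> terms = new ArrayList<String>();
        for (StringTokenizer st = new StringTokenizer(text); st.hasMoreTokens();) {
            String token = st.nextToken().toLowerCase();        // normalize case
            token = token.replaceAll("[.,]$", "");              // strip trailing punctuation
            if (token.length() == 0 || STOP_WORDS.contains(token)) {
                continue;                                       // eliminate stop words
            }
            if (token.endsWith("s") && token.length() > 1) {
                token = token.substring(0, token.length() - 1); // naive plural stemming
            }
            terms.add(token);
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Collective Intelligence and Web2.0"));
        // -> [collective, intelligence, web2.0]
    }
}
```

Each stage here is deliberately simplistic; the rest of this section refines exactly these steps one at a time.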
In this section we will first set up the example that we will use. We will begin with a simple but naïve way to analyze the text – simply tokenizing it, analyzing the title and body, and taking term frequency into account. Next, we will show the results of eliminating the stop words, followed by the effect of stemming. Lastly, we will show the effect of detecting phrases on the analysis.
Setting up the Example

Let us assume that a reader has posted the following blog entry:
Title: “Collective Intelligence and Web2.0”
Body: “Web2.0 is all about connecting users to users, inviting users to participate and applying their collective
intelligence to improve the application. Collective intelligence enhances the user experience”
There are a few interesting things to note about the blog entry:

The blog entry discusses collective intelligence and Web2.0, and is pertinent to how they affect users.
Notice the number of occurrences of "user" – "users", "users,", and "user".
The title provides valuable information about the content.
We have talked about metadata and the term vector – the code for this is fully developed in Chapter 8. So as not to confuse things, for this example simply think of metadata as being represented by an implementation of the interface MetaDataVector, as shown in listing 4.1.
Listing 4.1 The MetaDataVector Interface
package com.alag.ci;
import java.util.List;
public interface MetaDataVector {
    public List<TagMagnitude> getTagMetaDataMagnitude(); <#1>
    public MetaDataVector add(MetaDataVector other);     <#2>
}

#1 Gets the sorted list of non-zero terms and their weights
#2 Gives the result of adding another MetaDataVector
We have two methods: the first for getting the terms and their weights, and the second for adding another MetaDataVector. Further, assume that we have a way to visualize this MetaDataVector – after all, it consists of tags or terms and their relative weights.1
1 If you really want to see the code for the implementation of the MetaDataVector jump ahead to Chapter 8 or download the available code
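To make the add operation concrete, here is a minimal map-backed sketch of how combining two sparse term vectors might work. The class and method names are illustrative, not the Chapter 8 implementation:

```java
import java.util.*;

public class TermVectorSketch {
    // Adds two sparse term vectors by summing weights term by term
    public static Map<String, Double> add(Map<String, Double> a, Map<String, Double> b) {
        Map<String, Double> result = new HashMap<String, Double>(a);
        for (Map.Entry<String, Double> e : b.entrySet()) {
            Double w = result.get(e.getKey());
            result.put(e.getKey(), (w == null ? 0.0 : w) + e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> title = new HashMap<String, Double>();
        title.put("collective", 1.0);
        title.put("intelligence", 1.0);
        Map<String, Double> body = new HashMap<String, Double>();
        body.put("intelligence", 2.0);
        body.put("user", 3.0);
        // "intelligence" appears in both inputs, so its weights are summed to 3.0
        System.out.println(add(title, body));
    }
}
```

Terms present in only one vector keep their weight; terms present in both have their weights summed, which is the behavior the rest of this example relies on when combining title and body.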
Let us define an interface MetaDataExtractor for the algorithm that will extract metadata – in the form of keywords or tags – by analyzing the text. This is shown in listing 4.2.
Listing 4.2 The MetaDataExtractor Interface
package com.alag.ci.textanalysis;
import com.alag.ci.MetaDataVector;
public interface MetaDataExtractor {
    public MetaDataVector extractMetaData(String title, String body);
}
The interface has only one method, extractMetaData, which analyzes the title and body of the text to generate a MetaDataVector. The MetaDataVector is, in essence, the term vector for the text being analyzed.
Figure 4.9 shows the hierarchy of increasingly complex text analyzers that we will use in the next few sections. First, we will use a simple analyzer to create tokens from the text. Next, we will remove the common words. This will be followed by taking care of plurals. Lastly, we will detect multi-term phrases.
Figure 4.9: The hierarchy of analyzers used to create MetaData from text
With this background, we are now ready to have some fun and work through some code to analyze our blog
entry!
Naïve Analysis

Let's begin by simply tokenizing the text, normalizing it, and getting the frequency count associated with each term. We will also analyze the title and body separately and then combine the information from each. For this we use SimpleMetaDataExtractor, a naïve implementation of our analyzer, whose implementation is shown in listing 4.2.
Listing 4.2 Implementation of the SimpleMetaDataExtractor
package com.alag.ci.textanalysis.impl;
import java.util.*;import com.alag.ci.*;import com.alag.ci.impl.*;import com.alag.ci.textanalysis.MetaDataExtractor;
public class SimpleMetaDataExtractor implements MetaDataExtractor {
    private Map<String, Long> idMap = null; <#1>
    private Long currentId = null;          <#2>

    public SimpleMetaDataExtractor() {
        this.idMap = new HashMap<String, Long>();
        this.currentId = new Long(0);
    }

    public MetaDataVector extractMetaData(String title, String body) {
        MetaDataVector titleMDV = getMetaDataVector(title); <#3>
        MetaDataVector bodyMDV = getMetaDataVector(body);
        return titleMDV.add(bodyMDV);
    }

    private Long getTokenId(String token) { <#4>
        Long id = this.idMap.get(token);
        if (id == null) {
            id = this.currentId++;
            this.idMap.put(token, id);
        }
        return id;
    }

#1 Keeps a map of all the text/tags that are found
#2 Variable used to generate unique ids for tokens found
#3 Uses a heuristic of placing equal weight on title and body
#4 Generates unique ids for text/tags that are found
Since the title provides valuable information, as a heuristic let us say that the resulting MetaDataVector is a combination of the MetaDataVector for the title and the one for the body. Note that as tokens or tags are extracted from the text we need to provide them with a unique id; the method getTokenId takes care of this for the example. In your application, you will probably get the id from the tags table.
The following code extracts metadata for the article:

    MetaDataVector titleMDV = getMetaDataVector(title);
    MetaDataVector bodyMDV = getMetaDataVector(body);
    return titleMDV.add(bodyMDV);

Here, we create a MetaDataVector for the title and one for the body and then simply combine the two.
As new tokens are extracted, a unique id is assigned to them by the following code:

    private Long getTokenId(String token) {
        Long id = this.idMap.get(token);
        if (id == null) {
            id = this.currentId++;
            this.idMap.put(token, id);
        }
        return id;
    }
The remaining piece of code, shown in listing 4.3, is a lot more interesting.
Listing 4.3 Continuing with the implementation of SimpleMetaDataExtractor
    private MetaDataVector getMetaDataVector(String text) {
        Map<String, Integer> keywordMap = new HashMap<String, Integer>();
        StringTokenizer st = new StringTokenizer(text); <#1>
        while (st.hasMoreTokens()) {
            String token = normalizeToken(st.nextToken()); <#2>
            if (acceptToken(token)) { <#3>
                Integer count = keywordMap.get(token);
                if (count == null) {
                    count = new Integer(0);
                }
                count++;
                keywordMap.put(token, count); <#4>
            }
        }
        MetaDataVector mdv = createMetaDataVector(keywordMap); <#5>
        return mdv;
    }

    protected boolean acceptToken(String token) { <#6>
        return true;
    }

    protected String normalizeToken(String token) { <#7>
        String normalizedToken = token.toLowerCase().trim();
        if ((normalizedToken.endsWith(".")) || (normalizedToken.endsWith(","))) {
            int size = normalizedToken.length();
            normalizedToken = normalizedToken.substring(0, size - 1);
        }
        return normalizedToken;
    }
}

#1 Uses a simple StringTokenizer – space delimited
#2 Normalizes the token
#3 Should we accept this token as a valid token?
#4 Keeps a frequency count
#5 Creates a MetaDataVector
#6 Method to decide whether a token is to be accepted
#7 Converts to lowercase and removes punctuation
Here, we use a simple StringTokenizer to break the text into individual words:

    StringTokenizer st = new StringTokenizer(text);
    while (st.hasMoreTokens()) {
We want to normalize the tokens so that they are case insensitive – i.e., "user" and "User" are the same word for us – and also remove the punctuation "," and ".".

    String token = normalizeToken(st.nextToken());

The normalizeToken method simply lowercases the token and removes trailing punctuation:

    protected String normalizeToken(String token) {
        String normalizedToken = token.toLowerCase().trim();
        if ((normalizedToken.endsWith(".")) || (normalizedToken.endsWith(","))) {
            int size = normalizedToken.length();
            normalizedToken = normalizedToken.substring(0, size - 1);
        }
        return normalizedToken;
    }
We may not want to accept all the tokens, so we have a method acceptToken to decide whether a token is to be accepted:

    if (acceptToken(token)) {

All tokens are accepted in this implementation.
The logic behind the method is fairly simple: find the tokens, normalize them, see whether they are to be accepted, and then keep a count of how many times each occurs. The title and body are equally weighted to create the resulting MetaDataVector. With this we have met our goal of creating a set of terms and their relative weights to represent the metadata associated with the content.
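As a quick sanity check of what the naive extractor counts, the raw term frequencies over the sample body can be reproduced with a few lines. The helper class here is hypothetical, written only to mirror the tokenize-normalize-count loop above:

```java
import java.util.*;

public class FrequencyCount {
    // Tokenizes on whitespace, lowercases, strips a trailing '.' or ',' and counts terms
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (StringTokenizer st = new StringTokenizer(text); st.hasMoreTokens();) {
            String token = st.nextToken().toLowerCase().replaceAll("[.,]$", "");
            Integer c = counts.get(token);
            counts.put(token, (c == null ? 0 : c) + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        String body = "Web2.0 is all about connecting users to users, inviting users "
            + "to participate and applying their collective intelligence to improve "
            + "the application. Collective intelligence enhances the user experience";
        Map<String, Integer> counts = count(body);
        System.out.println(counts.get("users") + " vs " + counts.get("user")); // prints "3 vs 1"
    }
}
```

The output makes the fragmentation concrete: "users" is counted three times and "user" once, even though they refer to the same concept.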
A tag cloud is a very useful way to visualize the output of the algorithm. First, let us look at the title, as shown in Figure 4.10. The algorithm tokenizes the title and extracts four equally weighted terms: "and", "collective", "intelligence", and "web2.0". Note that "and" appears as one of the four terms and that "collective" and "intelligence" are two separate terms.
Figure 4.10: The tag cloud for the title – it consists of four terms
Similarly, the tag cloud for the body of the text is shown in Figure 4.11. Notice that words such as "the" and "to" occur frequently and that "user" and "users" are treated as separate terms. There are a total of 20 terms in the body.
Figure 4.11: The tag cloud for the body of the text
Combining the vectors for both the title and the body we get the resulting MetaDataVector whose tag cloud is
shown in Figure 4.12.
Figure 4.12: The resulting tag cloud obtained by combining the title and the body
The three terms "collective", "intelligence", and "web2.0" stand out. However, there are quite a few noise words, such as "all", "and", "is", "the", and "to", that occur so frequently in the English language that they don't add much value. Let us next enhance our implementation by eliminating these terms.
Removing Common Words

Commonly occurring terms are also called stop terms (see Section 2.2) and can be specific to the language and domain. We will implement SimpleStopWordMetaDataExtractor to remove these stop words. The code for this is shown in listing 4.4.
Listing 4.4 Implementation of SimpleStopWordMetaDataExtractor
package com.alag.ci.textanalysis.impl;
import java.util.*;
public class SimpleStopWordMetaDataExtractor extends SimpleMetaDataExtractor {
    private static final String[] stopWords =
        {"and", "of", "the", "to", "is", "their", "can", "all", ""}; <#1>
    private Map<String, String> stopWordsMap = null;

    public SimpleStopWordMetaDataExtractor() {
        this.stopWordsMap = new HashMap<String, String>();
        for (String s : stopWords) {
            this.stopWordsMap.put(s, s);
        }
    }

    protected boolean acceptToken(String token) { <#2>
        return !this.stopWordsMap.containsKey(token);
    }
}

#1 Dictionary of stop words
#2 Don't accept the token if it is a stop word
This class has a dictionary of terms that are to be ignored – in our case a short list; in your application this list will be a lot longer.

    private static final String[] stopWords =
        {"and", "of", "the", "to", "is", "their", "can", "all", ""};
The acceptToken method is overridden to reject any token that appears in the stop word list:

    protected boolean acceptToken(String token) {
        return !this.stopWordsMap.containsKey(token);
    }
Figure 4.13 shows the tag cloud after removing the stop words – we now have 14 terms, down from the original 20. The terms "collective", "intelligence", and "web2.0" stand out. But "user" and "users" are still fragmented and treated as separate terms.
Figure 4.13: The Tag Cloud after removing the stop words
To combine “user” and “users” as one term we need to stem the words.
Stemming

Stemming is the process of converting words to their stemmed form. There are fairly complex algorithms for doing this, Porter stemming being the most commonly used.

There is only one plural in our example: "user" and "users". For now we will enhance our implementation with SimpleStopWordStemmerMetaDataExtractor, whose code is in listing 4.5.
Listing 4.5 Implementation of SimpleStopWordStemmerMetaDataExtractor
package com.alag.ci.textanalysis.impl;
public class SimpleStopWordStemmerMetaDataExtractor extends SimpleStopWordMetaDataExtractor {
    protected String normalizeToken(String token) {
        if (acceptToken(token)) { <#1>
            token = super.normalizeToken(token);
            if (token.endsWith("s")) { <#2>
                int index = token.lastIndexOf("s");
                if (index > 0) {
                    token = token.substring(0, index);
                }
            }
        }
        return token;
    }
}

#1 If the token will be rejected, don't bother normalizing it
#2 Strips the trailing "s" to handle plurals
Here, we override the normalizeToken method. First, it checks to make sure that the token is not a stop word:

    protected String normalizeToken(String token) {
        if (acceptToken(token)) {
            token = super.normalizeToken(token);

Then it simply removes a trailing "s" from the term.
Figure 4.14 shows the tag cloud obtained by stemming the terms. The algorithm merges "user" and "users" into one term and bubbles "user" up.
Figure 4.14: The tag cloud after normalizing the terms
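A caveat worth flagging here: stripping a trailing "s" is a heuristic, not real stemming, and it mangles non-plural words. A small sketch of the rule in isolation makes the limitation visible (the class below is illustrative, not from the book, and is not the Porter algorithm):

```java
public class NaiveStemmer {
    // Strips a single trailing 's' -- the plural heuristic used in this section
    public static String stem(String token) {
        if (token.endsWith("s") && token.length() > 1) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(stem("users"));    // "user" -- as intended
        System.out.println(stem("analysis")); // "analysi" -- a mangled non-plural
    }
}
```

This is one reason production systems typically use a tested algorithm such as Porter's stemmer rather than a suffix rule; for this small example the heuristic is good enough.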
We now have four terms – "collective", "intelligence", "user", and "web2.0" – to describe the blog entry. But "collective intelligence" is really one phrase, so let us enhance our implementation to detect this term.

Detecting Phrases

"Collective intelligence" is the only two-term phrase that we are interested in. For this we will implement SimpleBiTermStopWordStemmerMetaDataExtractor, the code for which is shown in listing 4.6.
Listing 4.6 Implementation of SimpleBiTermStopWordStemmerMetaDataExtractor
package com.alag.ci.textanalysis.impl;
import java.util.*;
import com.alag.ci.MetaDataVector;
public class SimpleBiTermStopWordStemmerMetaDataExtractor
        extends SimpleStopWordStemmerMetaDataExtractor {

    protected MetaDataVector getMetaDataVector(String text) {
        Map<String, Integer> keywordMap = new HashMap<String, Integer>();
        List<String> allTokens = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            String token = normalizeToken(st.nextToken());
            if (acceptToken(token)) {
                Integer count = keywordMap.get(token);
                if (count == null) {
                    count = new Integer(0);
                }
                count++;
                keywordMap.put(token, count);
                allTokens.add(token); <#1>
            }
        }
        String firstToken = allTokens.get(0);
        for (String token : allTokens.subList(1, allTokens.size())) {
            String biTerm = firstToken + " " + token;
            if (isValidBiTermToken(biTerm)) { <#2>
                Integer count = keywordMap.get(biTerm);
                if (count == null) {
                    count = new Integer(0);
                }
                count++;
                keywordMap.put(biTerm, count);
            }
            firstToken = token;
        }
        MetaDataVector mdv = createMetaDataVector(keywordMap);
        return mdv;
    }

    private boolean isValidBiTermToken(String biTerm) { <#3>
        if ("collective intelligence".compareTo(biTerm) == 0) {
            return true;
        }
        return false;
    }
}

#1 Stores the normalized tokens in the order they appear
#2 Takes two tokens at a time and checks whether they form a valid phrase
#3 Phrases are tested for validity against a phrase dictionary
Here, we override the getMetaDataVector method. We store the valid tokens, in the order they appear, in the list allTokens.

Next, the following code combines two consecutive tokens and checks whether they form a valid phrase:

    String firstToken = allTokens.get(0);
    for (String token : allTokens.subList(1, allTokens.size())) {
        String biTerm = firstToken + " " + token;
        if (isValidBiTermToken(biTerm)) {

In our case, there is only one valid phrase, "collective intelligence", and the check is done in the method isValidBiTermToken:

    private boolean isValidBiTermToken(String biTerm) {
        if ("collective intelligence".compareTo(biTerm) == 0) {
            return true;
        }
        return false;
    }
Figure 4.15 shows the tag cloud for the title of the blog after using our new analyzer. As desired, there are four terms: "collective", "collective intelligence", "intelligence", and "web2.0".
Figure 4.15: Tag cloud for the title after using the bi-term analyzer
The combined tag cloud for the blog now contains 14 terms, as shown in Figure 4.16. Five tags stand out: "collective", "collective intelligence", "intelligence", "user", and "web2.0".
Figure 4.16: Tag cloud for the blog after using a bi-term analyzer
Using phrases in the term vector can help in finding other similar content. For example, if we had another article, "Intelligence in a Child", with tokens "intelligence" and "child", there would be a match on the term "intelligence". However, if our analyzer were intelligent enough to extract only "collective intelligence", without the terms "collective" and "intelligence", there would be no match between the two pieces of content.
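One common way to quantify "a match between two pieces of content" is the cosine similarity of their term vectors. The sketch below assumes raw weights stored in a map; the book's MetaDataVector may weight terms differently, so treat this only as an illustration of the idea:

```java
import java.util.*;

public class CosineSimilarity {
    // Cosine similarity between two sparse term-weight vectors:
    // dot product of shared terms divided by the product of the vector norms
    public static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        double normA = 0.0, normB = 0.0;
        for (double w : a.values()) normA += w * w;
        for (double w : b.values()) normB += w * w;
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Double> blog = new HashMap<String, Double>();
        blog.put("collective intelligence", 2.0);
        blog.put("intelligence", 2.0);
        blog.put("web2.0", 2.0);
        Map<String, Double> article = new HashMap<String, Double>();
        article.put("intelligence", 1.0);
        article.put("child", 1.0);
        // Nonzero because both vectors contain "intelligence"
        System.out.println(similarity(blog, article));
    }
}
```

If the analyzer kept only "collective intelligence" and dropped "intelligence", the dot product above would be zero – exactly the trade-off described in the preceding paragraph.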
Hopefully, this gives you a good overview of how text can be analyzed automatically to extract relevant
keywords or tags and build a MetaDataVector.
Now, every Item in your application has an associated MetaDataVector. As Users interact on your site, you can use the MetaDataVectors associated with the Items to develop a profile for each user. Finding items similar to a given item amounts to finding items with similar MetaDataVectors.