
Mining the Wisdom of the Crowds

“Detecting new product ideas by text mining and machine learning techniques”

by Kasper Christensen

Number of characters: 125.853

May 2013

Supervisor: Professor Joachim Scholderer

Quantitative Analytics Group (QUANTS)

Department of Business Administration

School of Business and Social Sciences

Aarhus University


“This, of course, is great idea, but these days we come to except such things from Brian.

I've got to believe that some simple modules could be priced very affordably, perhaps in the $15-$30 range. More complex ones, of course, could be priced higher.

But the hook-them-together-ness would be a great selling point. It would be most cool.”

Comment: Message detected by means of machine learning and text mining.


List of tables
List of figures
List of equations
List of appendices
Abstract
1 - Introduction
  1.1 - Idea generation within online communities
  1.2 - Objectives
  1.3 - Methodology
  1.4 - Delimitations and assumptions
  1.5 - Structure
2 - Idea generation in online communities
  2.1 - Online communities
  2.2 - Collective intelligence
  2.3 - Creativity
  2.4 - Summary
3 - Detecting ideas
  3.1 - The nature of data in online communities
  3.2 - Text mining and natural language processing
  3.3 - Machine learning in the text classification domain
    3.3.1 - The imbalanced learning problem
    3.3.2 - Feature selection
    3.3.3 - Topic modelling
    3.3.4 - Classification algorithms
    3.3.5 - Performance measures
  3.4 - The power of different text classification methods
  3.5 - Summary
4 - Aims of study
5 - Method
  5.1 - Mining the online community of Lugnet
  5.2 - Construction of target variable
  5.3 - Modelling the concept of an idea
    5.3.1 - Data exploration
    5.3.2 - Data partitioning
    5.3.3 - Classification algorithms
    5.3.4 - Term weighting scheme
    5.3.5 - Data processing steps
    5.3.6 - Feature selection methods
    5.3.7 - Choice of final model
  5.4 - Effect of seasonality and historical events on idea generation
6 - Results
  6.1 - Reliability of the manual classification of target variable
  6.2 - Detecting ideas
    6.2.1 - Data partitioning
    6.2.2 - Exploratory analysis
    6.2.3 - Classifier performance given term weighting and processing steps
    6.2.4 - Classifier and term weighting scheme given processing steps
    6.2.5 - Assessing performance given varying feature selection methods
    6.2.6 - Assessing candidate models
  6.3 - Effect of seasonality on idea generation
    6.3.1 - Exploratory analysis
    6.3.2 - Creating dataset and handling missing data
    6.3.3 - Variables exploration and variable transformations
    6.3.4 - Defining model and assessing model assumptions
    6.3.5 - Parameter estimates and goodness-of-fit
7 - Discussion & Conclusion
8 - Bibliography

List of tables

Table 1 - Overview over studies reviewed
Table 2 - Twenty discriminative terms and four positive and negative topics from training set
Table 3 - Results of term weighting, processing and feature selection assessment
Table 4 - Results of candidate models' performance
Table 5 - Twenty discriminative terms and four positive and negative topics from prediction set
Table 6 - Regression results

List of figures

Figure 1 - ROC chart of candidate models
Figure 2 - ROC chart of support vector machines with an under- and oversampled training set
Figure 3 - Fluctuations in ACTIVITY and IDEA given YEAR
Figure 4 - Histograms of ACTIVITY and IDEA
Figure 5 - Histogram of ERPM and box plot of ERPM from 1999 to 2012
Figure 6 - Histogram of LN.ERPM
Figure 7 - Fluctuations in LN.ERPM given MONTH and YEAR
Figure 8 - Residuals plot and histogram of residual distribution

List of equations

Equation 1 - Information gain
Equation 2 - Chi-square
Equation 3 - Optimal margin classifier objective function
Equation 4 - Objective function of support vector machine with slack variables
Equation 5 - Radial basis transformation
Equation 6 - Bayes theorem
Equation 7 - Accuracy
Equation 8 - Recall
Equation 9 - Precision
Equation 10 - F-measure
Equation 11 - Event rate per month
Equation 12 - Logit transformed event rate per month
Equation 13 - Regression model

List of appendices

Appendix A - Message view at www.lugnet.com
Appendix B - Message in .eml format
Appendix C - Message in .txt format
Appendix D - Descriptive statistics of regression data


Abstract

The rise of Web 2.0, coupled with the availability of information technology like computers, tablets and smartphones, has created increasing opportunities for consumers and organizations to interact. This development is predicted to revolutionize how we understand and utilize innovation. In product innovation, idea generation has long been a topic of interest, and from this perspective the concepts of collective intelligence and crowdsourcing have become areas of interest in the product innovation literature. One drawback of crowdsourcing is that it often requires special software and a large number of persons dedicated to the crowdsourcing task. In this thesis we propose an alternative method for utilizing the wisdom of the crowds, based on text mining and machine learning. Our study shows how a classification algorithm can be trained to detect ideas generated inside an online community. Furthermore, we study the effect of seasonality and historical events on idea generation inside a Lego online community. Our results suggest that holiday seasons have an impact on idea generation in online communities in our particular case of toys. The primary implication of our results is that organisations can use our method to tap into online sources and detect ideas even though the source was never designed for crowdsourcing.


1 - Introduction

1.1 - Idea generation within online communities

The rise of Web 2.0 has led to new opportunities for organizations to interact with consumers. The increasing availability of information technology like computers, tablets and smartphones (Tapscott & Williams, 2008) has created “the era of big data” (Hsinchun Chen, Chiang, & Storey, 2012, p. 1185). A key characteristic of the big data age is that information on the Web is not structured in tables; rather, data is often stored in the form of simple text. Sources estimate that approximately 80% of organizations' information is stored as text (Tan, 1999). This being the case, the Web offers a huge potential for analytical approaches such as machine learning and text mining, which are geared towards big data (Hsinchun Chen et al., 2012). Several authors have pointed out the potential these massive amounts of data might offer to organisations, suggesting that businesses will benefit if they can find ways to manage them (Johnson, 2012; LaValle, Lesser, Shockley, Hopkins, & Kruschwitz, 2011; T. W. Malone, Laubacher, & Dellarocas, 2010; Eric Bonabeau, 2009; Argamon & Olsen, 2006). Some even go so far as to claim that information technology is about to revolutionize innovation in all of its facets, by allowing organisations to tap into these huge amounts of data and exploit them for innovation purposes (Brynjolfsson, 2010).

Idea generation and innovation management are sources of value creation for an organisation, not least because of the low success rate of new product developments (Goldenberg, Lehmann, & Mazursky, 2001; Di Gangi, Wasko, & Hooker, 2010). Although idea generation has primarily been the task of company professionals, crowdsourcing has proven to be a suitable method for generating new ideas (Poetz & Schreier, 2012). Some even suggest that crowdsourced ideas can in fact outperform ideas created by company professionals on the attributes of novelty and customer benefit (Poetz & Schreier, 2012).

Crowdsourcing is an open innovation model, and a way for the firm to “enrich the company's own knowledge base through the integration of suppliers, customers, and external knowledge sourcing” (Enkel, Gassmann, & Chesbrough, 2009, p. 312). From an open innovation perspective, one can be misled into believing that no distinction exists between “open source” and “crowdsourcing” (Albors, Ramos, & Hervas, 2008). However, this is not true, because for something to become open source, the crowdsourcer must give everybody permission to modify the product. Examples of open source include Linux and the programming language R, whereas examples of crowdsourcing are Threadless, iStockphoto, Innocentive, Amazon's Mechanical Turk, Youtube, etc. (Brabham, 2008; Estellés-Arolas & González-Ladrón-de-Guevara, 2012). A recent successful example of crowdsourcing is the Dell IdeaStorm community, which generates new product ideas from crowdsourcing. This community is based on a collaborative filtering system where users suggest and vote on the ideas they like. In this way, Dell is able to identify the most popular ideas without assigning corporate staff to create and/or assess them. When the most popular ideas have been identified, they can be distributed to the relevant departments in Dell and be used as input for developing new products or services (Poetz & Schreier, 2012; Di Gangi et al., 2010).

1.2 - Objectives

Sufficient research documents the potential of ideas generated by the crowd; however, our review of the academic literature did not reveal whether these ideas could be detected by means other than collaborative filtering, as used in the Dell case. We define our main research question as:

• How are ideas generated in online communities and how can one detect these ideas by applying text mining and machine learning?

The main objective of this thesis is to assess whether one can successfully detect ideas inside an online community. We consider this relevant as it will allow organizations and researchers to detect ideas from sources other than crowdsourcing communities. Detecting ideas without asking people to vote will widen the scope of the crowdsourcing concept to include all types of online communities. The data source which we mined was a Lego community called Lugnet¹. Lugnet is a company-independent online community where any person with an interest in Lego can post a message. The goal of Lugnet is to unite Lego fans from all over the world, and anybody can read the messages posted on Lugnet. However, in order to post messages, one needs to pay a membership fee of $10. The forum consists of 247 sub-forums, categorized by a variety of brands, products, countries, etc. Because Lugnet is a company-independent website, one can discuss whether it is actually a good case of crowdsourcing. However, Lego can read what happens on the forum, and there will presumably be individuals in the forum writing about ideas. This makes Lugnet a good case for this thesis.

¹ http://www.lugnet.com

The second objective of this thesis is to assess the extent to which seasonality and historical events influence idea generation inside online communities; therefore, we define our second research question as:

• To what degree do seasonality and historical events influence idea generation inside online communities?

If one can show that idea generation does not depend on, for example, seasonal factors, it leaves room for investigating other factors, such as marketing spending, that might influence idea generation inside online communities.

To answer these two questions we must first address how ideas are generated inside online communities, in particular how collective intelligence and consumer creativity influence the development of ideas in online communities. Both fields are included because collective intelligence theory focuses on group creativity, while consumer creativity explains the creative ability of individuals. We view collective intelligence and consumer creativity as the clockwork that generates ideas, so in order to help answer our main question, one needs a basic understanding of these two concepts. Therefore we ask as our first sub-question:

• What defines idea generation inside online communities and how are these ideas generated from a perspective of collective intelligence and consumer creativity?


Furthermore, our two research questions will benefit from a discussion of the nature of the data contained in online communities, as well as how one can detect ideas hiding inside these communities through text mining and machine learning. This topic is relevant as no academic literature reports on how to detect ideas inside online communities by the means we propose; therefore we ask as our second sub-question:

• What is the nature of data created inside online communities and how can techniques from text mining and machine learning be combined to detect ideas generated inside online communities?

1.3 - Methodology

Our approach to answering the main research question is inductive. We argue that if we can determine which factors influence idea generation inside a single online community, then this may apply to other online communities of a similar nature. We rely on quantitative approaches in most facets of our study; in particular, we rely on text mining and machine learning. However, we also rely on qualitative assessments, as we use human judges to construct our target variable.

1.4 - Delimitations and assumptions

We must first acknowledge that we do not have access to what is considered a crowdsourcing community in its most applied shape, since crowdsourcing communities are often initiated and owned by a firm that proposes a task. Rather, the online community we use has existed since the mid-1990s, and so allows us to look at idea generation over a rather long time period. To use our results from a crowdsourcing perspective, we assume that crowdsourcing communities are influenced by the same factors as online communities in general. We consider this to be a reasonable assumption, although one might debate whether crowdsourcing communities and simple online communities are alike (Estellés-Arolas & González-Ladrón-de-Guevara, 2012; Vukovic & Bartolini, 2010; Buecheler, Sieg, Füchslin, & Pfeifer, 2010; Brabham, 2008).

Furthermore, our secondary research question will only assess seasonality and historical events. We consider this a natural delimitation, due to the nature of the data we have available.


1.5 - Structure

The theoretical foundations of the thesis will be addressed in chapters two and three. Chapter two will address collective intelligence and creativity from the perspective of idea generation within online communities. Chapter three outlines text mining and machine learning, which are the tools necessary to detect ideas in online communities. Chapter four defines the aim of our study, chapter five covers the methods used and the study setup, chapter six reports the results, and in chapter seven we discuss and conclude on our results.


2 - Idea generation in online communities

The objective of this chapter is to understand how collective intelligence and consumer creativity lead to the generation of ideas inside online communities. In particular, we wish to investigate what defines idea generation inside online communities, and how these ideas are generated from the perspective of collective intelligence and consumer creativity. As this can be somewhat theoretical, we will relate the theory covered to examples of ideas from the online community of Lugnet, which we will later use as our data source.

2.1 - Online communities

“Online community” is a term that covers several genres of online networks. Personal homepages, message boards, e-mail lists and newsletters, chat groups, weblogs and directories, as well as wikis, are genres that fit under the online community umbrella (Bishop, 2009). The common driver of these communities is that they allow people to exchange goods or information by interacting with other people in the network (Wilson & Peterson, 2002). Online communities can be seen as what one would associate with normal communities, where people gather based on a common interest. However, one of the differences between online communities and communities in their traditional shape is that online communities are virtual and that people do not always know each other in real life. Another difference is that, since people do not always know each other, they are not aware of who they might share interests with inside the community (Faraj, Jarvenpaa, & Majchrzak, 2011). Even though people inside online communities might not know each other's skills and interests, online communities have already proved useful for collective creativity and innovation purposes (Dahlander, Frederiksen, & Rullani, 2008; Tapscott & Williams, 2008).

2.2 - Collective intelligence

The benefit of a crowd's collective intelligence can be illustrated by the simple inequality: collective outcome ≥ sum of individual efforts (Fischer, Giaccardi, Eden, Sugimoto, & Ye, 2005). Collective intelligence is by no means a new idea (Leimeister, 2010), but what is new is the potential scale of collective intelligence enabled by Web 2.0. To gain some perspective, one might see traditional research techniques, like the application of surveys and focus groups, as an attempt to tap into the collective intelligence of existing or potential customers. If we stay with this comparison, surveys can be seen as a way of averaging the intelligence of a group (Segaran, 2007), whereas focus groups provide a different form of collective intelligence, as they allow people to interact and thereby create solutions (Eric Bonabeau, 2009).

Taking the concept of collective intelligence in its two separate terms of “collective” and “intelligence”, we achieve a deeper understanding of what the concept actually contains. The term “collective” describes a group of individuals who are not required to have the same attitudes or viewpoints (Leimeister, 2010). The term “intelligence” refers to the ability to learn, to understand, and to adapt to an environment by using one's own knowledge (Leimeister, 2010). Collective intelligence is defined as (T. Malone et al., 2009, p. 2)²:

“A group of individuals doing things collectively that seems intelligent.”

This definition captures the essence of collective intelligence, which is that we are dealing with a group of people who collaborate in order to solve a problem. The definition does not state that a group of individuals performing collective intelligence is solving a problem, but this is assumed to be reasonable in most cases (Burroughs, Morreau, & Mick, 2008).

² Please note that the source is only a working paper, but we do consider the main author Thomas W. Malone as reliable.

2.3 - Creativity

An important facet of problem solving is an individual's creative ability. Creativity is especially important when designing new products and being innovative (Sarkar & Chakrabarti, 2011). According to Burroughs et al. (2008), the concept of creativity can be separated into three perspectives: the creative process, the creative person and the creative outcome.

i. The creative process - the generation of an idea is not a two-step process where a task is proposed in the first step and a perfect solution is proposed in the next. One can instead see the generation of an idea as a process containing four stages: exploration, fixation, incubation and insight. In the exploration phase, the individual searches for known solutions to the problem. In the fixation phase, the individual decides on a given path that is likely to lead to a solution. In the incubation phase, the individual becomes unfocussed due to mental exhaustion, leading to new ways of looking at the problem. In the final phase of insight, the individual reaches a solution to the problem (Burroughs et al., 2008).

ii. The creative person - the creative intelligence of an individual can primarily be defined in terms of abilities, motivations and affect (Amabile, 1983). Two factors of special importance are the knowledge background of the person and the person's motivation. Knowledge background is domain specific in the sense that an individual might be very knowledgeable within one domain but not know anything about another domain. Motivation can be intrinsic or extrinsic in nature. Intrinsic motivation can be thought of as the degree to which an individual is truly motivated from within to participate in a given creative activity. Extrinsic motivation is the opposite, as the individual is motivated by some kind of reward, often of a monetary nature (Ryan & Deci, 2000; Hennessey & Amabile, 2010).

iii. The creative outcome - an individual engaged in a creative process should at some point generate a creative outcome, or in our case an idea. The creative outcome can be of varying nature, which implies that a creative outcome does not have to be, for example, the work of Einstein or Mozart; it can also be a man getting the idea to fix his car using nothing more than a hairpin (Hennessey & Amabile, 2010). All are valid examples of individuals in creative processes producing a creative outcome, product or idea (Reisberg, 2010).

The three text passages below are examples of ideas extracted from a pool of messages from the online community we pulled our data from. We had two human judges classify pieces of text or messages from within the online community. The first text displayed below is an example where two out of two judges classified the text as an idea.


“Actually, I think Winnie the Pooh will be a big hit in the Duplo Market. My wife is certainly looking forward to the Pooh LEGO.”

The second text passage displayed below is an example where one out of the two judges classified the text as an idea.

“Long term, the true solution is to move to a development tool that delivers smaller application-footprints than VB. Which is just about anything.

I don't know of any time that would definitely be better than any other. I'd suggest trying in the morning, your time. Most everyone around here is asleep at that time.”

The final text passage displayed below is an example where two out of two judges classified the text as a non-idea.

“Can I build a Historic Site replica building in this scale to show more accurate details?”

Returning to our theoretical discussion, defining the degree of creativity is a central problem in idea generation and creativity research, because setting the boundaries for what constitutes an idea is difficult (Kaufmann, 2004). The fact that our judges could not agree on categorizing the second text passage is an example of this problem. The first text passage is a straightforward case of a product idea, because it supports that there might be a need for “Winnie the Pooh” Lego. The third text passage is also straightforward, as it can be categorized as a question rather than anything else. Our personal opinion about the second text message is that it is an idea, as the terms “solution” and “suggest” occur in the text passage. However, we do recognize that it is not a straightforward case compared to the two other cases.

2.4 - Summary

Online communities can take many different shapes. Common to all online communities is that they provide a platform for people to exchange goods or information by interacting with other people in the network. Within online communities, idea generation is a type of problem solving where the collective outcome is often greater than the sum of individual efforts. To be more specific, ideas are often the product of an individual's or a group's creativity. Creativity requires the individual or group to go through several phases before an idea or a solution is reached. Creative ability is mainly determined by two factors: domain knowledge and motivation (both intrinsic and extrinsic). Finally, the creative outcome, or the idea, can take many shapes and can be difficult to assess.


3 - Detecting ideas

In this chapter we investigate how idea generation inside online communities can be detected through text mining and machine learning. We will do this by answering two questions: what is the nature of the data created inside online communities, and which text mining and machine learning techniques need to be combined in order to detect ideas generated in online communities?

3.1 - The nature of data in online communities

By data we mean information stored digitally in a structured or an unstructured format, and as stated in the introduction, we are operating in a Web 2.0 domain. Operating in a Web 2.0 domain also means that one will be handling large amounts of data (Bughin, Chui, & Manyika, 2010). One might say that data becomes big data when it is “too big for conventional systems to handle” (Gobble, 2013, p. 64). How data becomes big depends on three dimensions: volume, frequency and variety (Gobble, 2013). In this context, online communities score high on all three dimensions, as the data quantity is often large, it changes every time people use the community, and it is to a high degree unstructured (as most of the data is text-based). Analysing this type of data requires the use of big data analytics, like text mining and machine learning (Hsinchun Chen et al., 2012).

3.2 - Text mining and natural language processing

To better understand the concept of text mining, one can turn towards the concept of natural language processing (NLP). NLP was originally a mixture of artificial intelligence and linguistics, taking its beginning in the 1950s. Word-for-word machine translation provides a good example of homography, a common problem in NLP, as one word can have different meanings depending on context. The complex nature of NLP led to a shift of focus in the 1980s. This shift meant that NLP should extract semantics, or meaning, to a higher degree. It ultimately meant the birth of statistical NLP, including the use of machine learning and statistics for NLP purposes, as well as the idea of annotated corpora for use in machine learning (Nadkarni, Ohno-Machado, & Chapman, 2011). An annotated corpus can be seen as the equivalent of a training set in traditional data mining terms.


Text mining can be defined as the “knowledge-intensive process in which a user interacts with a document collection over time, by using a suite of analysis tools” (Feldman & Sanger, 2006, p. 1). The aim of text mining is to “extract useful information from data sources through the identification and exploration of interesting patterns” (Feldman & Sanger, 2006, p. 1). This definition is similar to the definition of data mining (see Linoff & Berry, 2011 for a further explanation of data mining), but whereas data mining is based on data stored in database records, normally structured by rows and columns, text mining is based on unstructured text data stored inside collections of documents.

The task of turning unstructured text into rows and columns results in a bag-of-words representation of a given document (Erk & Padó, 2008). This allows one to apply conventional machine learning methods to model a given concept, which is often represented by a target variable within a training set, depending on whether one is doing supervised or unsupervised machine learning (Dharmadhikari, Ingle, & Kulkarni, 2011).

The bag-of-words concept can be seen as one extreme end of a scale, with the NLP approach at the other end. Bag-of-words is very naive in its nature, as it simply ignores grammar and any relations between the words. NLP, on the other hand, tries to capture semantics, represented by a set of words within a document. The relationship between the bag-of-words approach and the NLP approach is a trade-off, and it is important to stress that it is not necessarily a decision about using one or the other, as there are techniques that allow one to capture more meaning than the simplest bag-of-words approach will allow (Linoff & Berry, 2011).

Simple text processing can be seen as the removal of numbers, punctuation marks, stopwords and whitespace, stemming, tokenization, pruning, the conversion of upper case letters to lower case, the use of n-grams, and the choice of which term weighting scheme to apply. Tokenization refers to the splitting of a document into smaller segments, which are often just single terms, and is an implicit step in the bag-of-words approach (I. Feinerer, Hornik, & Meyer, 2008; Zanasi, 2007). Some of these tasks are self-explanatory, but pruning, stemming, n-grams and the choice of term weighting scheme require elaboration (a short code sketch follows the list below):

i. Pruning: the task whereby a word is deleted based on the number of times it occurs within a document collection, be it extremely few or extremely many times. This means that one sets an upper and a lower threshold for how many times a given term should occur in order to be part of the analysis. As an example, if one sets the upper pruning level to 0.99, all terms occurring in more than 99% of the messages will be omitted from the analysis. Pruning is a very aggressive way of deleting features, and therefore one needs to be careful when setting pruning boundaries. Pruning can be seen as a necessary evil, as even a very small collection of documents will create many unique terms, leading to a lot of noise as well as increased computation time.

ii. Stemming: the cutting of a word down to its stem, with a minimal loss of information. Stemming can be seen as a dimensionality reduction method (Linoff & Berry, 2011).

iii. N-grams: a more advanced processing method that extracts ordered sets of terms or characters. Instead of having tokens that contain only a single term, a token can contain a single term as well as several terms. One can look at this step as taking the bag-of-words approach a step in the direction of the NLP approach (Zanasi, 2007).

iv. Term weighting: refers to the numerical representation of terms in the bag-of-words. In general, a good term weighting metric should discriminate one unstructured text source from another. There is a variety of weighting schemes, such as binary term occurrences, term occurrences, normalized term frequency and term frequency-inverse document frequency. The binary weighting scheme assigns the value of either one or zero to a term, regardless of how many times it occurs in a particular document. Term occurrences can take integer values ranging from zero to the total number of terms inside each document. Term frequency comes in several variants; one version takes the length of the document into account by normalizing the frequency count with the square root of the total number of terms in that document (Salton & Buckley, 1988). Term frequency-inverse document frequency is an expression of how discriminating terms are in comparison to the whole document collection (Zanasi, 2007; Feinerer et al., 2008).
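To make these steps concrete, the short sketch below builds a bag-of-words representation with lower and upper pruning thresholds, n-grams and a term frequency-inverse document frequency weighting scheme. It is a minimal sketch using Python's scikit-learn library, not the tooling used in this thesis; the example messages and parameter values are hypothetical, and stemming is omitted (it would require an additional library such as NLTK).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical example messages standing in for community posts.
messages = [
    "I think a Winnie the Pooh set would be a big hit.",
    "My wife is looking forward to the Pooh LEGO.",
    "Can I build a replica building in this scale?",
]

# Bag-of-words with tf-idf weights. min_df/max_df implement the lower and
# upper pruning thresholds; ngram_range=(1, 2) adds bigrams on top of
# single terms; stopword removal and lowercasing are handled implicitly.
vectorizer = TfidfVectorizer(
    lowercase=True,
    stop_words="english",
    ngram_range=(1, 2),
    min_df=1,      # drop terms occurring in fewer than 1 document
    max_df=0.99,   # drop terms occurring in more than 99% of documents
)

X = vectorizer.fit_transform(messages)  # documents as rows, terms as columns
print(X.shape)
print(vectorizer.get_feature_names_out()[:10])
```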


3.3 - Machine learning in the text classification domain

Machine learning can be seen as a combination of mathematics, statistics, software engineering, and computer science (Conway & M. White, 2012). It is “concerned with turning data into something practical and usable” (Conway & M. White, 2012, p. 1). In particular, one can see machine learning as the problem of making inferences about a given concept (Witten & Frank, 2005). The strength of machine learning is that it allows us to infer patterns over large quantities of data, as we let machines do what they do best, namely calculations (Linoff & Berry, 2011). A common weakness of machine learning is that the models learned tend to overfit or overgeneralize the concept.

3.3.1 - The imbalanced learning problem

A general problem in machine learning arises when the class distribution of the target variable is severely skewed. One can refer to this as the imbalanced learning problem. Defining when a dataset becomes imbalanced between classes is relative, but ratios of 100:1, 1,000:1 and 10,000:1 are reported as common (He & Garcia, 2009). Although skewness happens for several reasons, it commonly happens because the classes are skewed by nature and/or it is very costly to collect the data for one of the classes. A consequence of a skewed distribution of the target variable is typically a bias towards the majority class (Kao & Poteet, 2007). This does not mean that one cannot use skewed datasets, and it has long been the assumption that classification algorithms should be trained on data that have a distribution similar to the one occurring naturally (Weiss & Provost, 2001). It has, however, been shown that balancing the dataset has a positive effect on classifier performance (Weiss & Provost, 2003). Several methods exist to solve the problem of class imbalance. Two of the most basic and widely described techniques are random undersampling and random oversampling (Kotsiantis, Kanellopoulos, & Pintelas, 2006). Random undersampling removes cases from the majority class at random, whereas random oversampling resamples cases from the minority class at random, in order to create a more balanced relationship between classes. A disadvantage of random undersampling is that one risks throwing away valuable information (Xu-Ying Liu, Jianxin Wu, & Zhi-Hua Zhou, 2009), whereas a disadvantage of random oversampling is that one risks overfitting the data (Chawla, 2010). When comparing undersampling with oversampling, some research findings indicate that undersampling performs best (Drummond & Holte, 2003). Other sources report that a combination of the two techniques can be used (Estabrooks, Jo, & Japkowicz, 2004).
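As an illustration of the two techniques, the sketch below balances a skewed binary target by random undersampling of the majority class and random oversampling of the minority class. It is a minimal sketch using only the Python standard library; the dataset and its 9:1 class ratio are hypothetical.

```python
import random

random.seed(42)

# Hypothetical skewed dataset: 90 non-ideas (class 0) and 10 ideas (class 1).
data = [("msg%d" % i, 0) for i in range(90)] + [("idea%d" % i, 1) for i in range(10)]

majority = [d for d in data if d[1] == 0]
minority = [d for d in data if d[1] == 1]

# Random undersampling: keep only as many majority cases as minority cases.
undersampled = random.sample(majority, len(minority)) + minority

# Random oversampling: resample minority cases (with replacement) until
# the classes are balanced.
oversampled = majority + [random.choice(minority) for _ in range(len(majority))]

print(len(undersampled))  # 20 cases, balanced 10:10
print(len(oversampled))   # 180 cases, balanced 90:90
```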

3.3.2 - Feature selection

Feature selection is a way of reducing dimensionality. The concept falls into two categories: statistical feature selection and arbitrary feature selection. Statistical feature selection is the technique whereby the distribution of feature counts between classes is used to assign a weight to each feature; the features with the highest weights are then used to train the classifier. Arbitrary feature selection is also known as the processing steps of stopword removal, stemming and pruning (Yu, 2008), which were discussed earlier. Statistical feature selection chooses the features that discriminate best, based on the distribution of terms between the classes of the target variable. Feature selection can lead to increased model performance, both in terms of accuracy and the generalizability of a model, because too many features can increase the likelihood of overfitting (Yang & Pedersen, 1997). Even though we have not yet presented the support vector machine classifier, it deserves mention here that it is debated whether support vector machines can actually benefit from feature selection (Guyon, Weston, Barnhill, & Vapnik, 2002), as some claim that support vector machines are so robust to overfitting that they do not need any feature selection technique (Forman, 2003). The outcome of applying feature selection is highly dependent on the nature of the data, meaning that one will need to experiment, as there is no generic rule for applying feature selection (Kao & Poteet, 2007). We will restrict ourselves to the two feature selection methods of information gain and chi-square. We choose these two methods as both have been used within text classification with good results (Zhang, Zhu, & Yao, 2004; Zheng, Wu, & Srihari, 2004; Forman, 2003; Yang & Pedersen, 1997).

Information gain and chi-square are functions of the four counts A, B, C and D. A is an integer count of messages belonging to class c where term t occurs at least once. B is an integer count of messages not belonging to class c where term t occurs at least once. C is an integer count of messages belonging to class c where term t does not occur. D is an integer count of messages not belonging to class c where term t does not occur. N is the total number of documents. Information gain can be calculated by the following equation (Kao & Poteet, 2007):


Equation 1 - Information gain

\mathrm{IG}(t, c) = -\frac{A+C}{N}\log\frac{A+C}{N} + \frac{A}{N}\log\frac{A}{A+B} + \frac{C}{N}\log\frac{C}{C+D}   (1)

Chi-square can be calculated by the equation (Kao & Poteet, 2007):

Equation 2 - Chi-square

\chi^2(t, c) = \frac{N\,(AD - CB)^2}{(A+C)(B+D)(A+B)(C+D)}   (2)

Based on a given feature selection technique, one can then calculate how good a discriminator a given term is for a given class. In order to sort out noise, one can order the features by their calculated weights and then use only the best features for modelling.
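The sketch below computes both weights directly from the four counts defined above, following Equations 1 and 2. It is a minimal illustration in Python, not the implementation used in this thesis; the counts in the usage example are hypothetical.

```python
import math

def chi_square(A, B, C, D):
    # Chi-square of a term/class pair from the four message counts (Equation 2).
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

def information_gain(A, B, C, D):
    # Information gain of a term/class pair (Equation 1); log terms with a
    # zero numerator contribute nothing by convention.
    N = A + B + C + D
    def xlogy(x, y):
        return x * math.log(y) if x > 0 and y > 0 else 0.0
    return (-xlogy((A + C) / N, (A + C) / N)
            + xlogy(A / N, A / (A + B) if A + B else 0.0)
            + xlogy(C / N, C / (C + D) if C + D else 0.0))

# Hypothetical counts: the term occurs in 40 of 50 in-class messages
# and in 10 of 150 out-of-class messages.
print(chi_square(40, 10, 10, 140))
print(information_gain(40, 10, 10, 140))
```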

3.3.3 - Topic modelling

In the domain of machine learning and natural language processing, probabilistic topic modelling can be seen as the task of assigning topic labels to text documents. This labelling should not be confused with the labelling one does when doing classification, because the label has no meaning before a human being has labelled the topics. As an example, one can perform topic modelling on a collection of text documents, and the result of the analysis could be ten topics named topic one to topic ten. Each topic would contain a list of words with assigned probability estimates, which can be interpreted as the likelihood that a word is contained in the given topic. One can order each list by the probability estimates and use the words with the highest estimates to assign a meaningful label to the topic, looking at each list of words and using reasoning to assign the label manually. To a topic containing the top five words “song”, “guitar”, “sound”, “play” and “drums”, one could assign the label “music”. Similarly, to a topic where the top five words are “ball”, “player”, “play”, “attacker” and “goal”, one could assign the label “football”. One may notice that the word “play” occurs in both topics. This is referred to as the problem of polysemy, which means that one word can have multiple meanings depending on the topic. This means that one has to look at the other words in order to decide on the meaning of a word with multiple meanings (Steyvers & Griffiths, 2007).


Topic models come in several variants. Common to all topic models is that they apply term frequencies as term weights, which also entails the bag-of-words assumption. Two variations are latent Dirichlet allocation and the correlated topics model. The main difference between the two types of models is that latent Dirichlet allocation assumes no correlation between topics, whereas the correlated topics model allows topics to be correlated, which one can argue is more realistic (Grün & Hornik, 2011). We choose latent Dirichlet allocation because we consider the results of this model the easiest to interpret in our case.
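As an illustration, the sketch below fits a small latent Dirichlet allocation model and prints the highest-probability words for each topic, which a human would then label. It is a minimal sketch using scikit-learn's LDA implementation rather than the tooling used in this thesis; the mini-corpus and the choice of two topics are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical mini-corpus mixing two themes.
docs = [
    "guitar song sound play drums",
    "song guitar play sound band",
    "ball player play attacker goal",
    "player goal ball attacker match",
]

# LDA works on raw term frequencies (the bag-of-words assumption).
vec = CountVectorizer()
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the top five words per topic; a human assigns labels such as
# "music" or "football" by inspecting these lists.
terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]
    print("topic %d:" % k, [terms[i] for i in top])
```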

3.3.4 - Classification algorithms

Several algorithms are available for text classification, including k-nearest neighbours, naïve Bayes, decision trees, decision rules and support vector machines (Dharmadhikari et al., 2011). For this paper, we choose to focus on the support vector machine and naïve Bayes algorithms.

Support vector machines

Support vector machines were developed by Corinna Cortes and Vladimir Vapnik in 1995; originally they were named support vector networks. Support vector machines build upon the optimal margin classifier algorithm (Boser, Guyon, & Vapnik, 1992). The optimal margin classifier identifies the separating hyperplane with the widest margin between the two classes; the wider the margin, the better the generalization ability of the classifier. In mathematical terms, this can be defined as the problem of minimizing the objective function (Ben-Hur & Weston, 2010; Scholderer, 2013):

Equation 3 - Optimal margin classifier objective function

f(\mathbf{w}) = \frac{1}{2}\lVert\mathbf{w}\rVert^2   (3)

In their 1995 paper, Cortes and Vapnik introduced the support vector machine and the notion of soft margins, also called slack variables. A soft margin allows for class overlap. If the cases cannot be perfectly separated, the support vector machine algorithm seeks to identify a hyperplane that minimizes the overlap of the cases (Hastie, Tibshirani, & Friedman, 2008). If the objective function is rewritten to include slack variables, we get a new objective function able to deal with cases that are not linearly separable. We still want to minimize the objective function (Ben-Hur & Weston, 2010; Scholderer, 2013):

Equation 4 - Objective function of support vector machine with slack variables

f(\mathbf{w}, \boldsymbol{\xi}) = \frac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_{i=1}^{n}\xi_i   (4)

One can apply the linear variant of a support vector machine, whose objective function we have just defined. In its linear form, the support vector machine can be seen as the equivalent of a linear discriminant function (Ben-Hur & Weston, 2010; Hastie et al., 2008). One can also use a kernel. Support vector machines utilizing a kernel transformation work in the way that, in the case of non-linearly separable data, the data are transformed into a higher-dimensional feature space (Cortes & Vapnik, 1995). The transformation is chosen beforehand, and the final result is a linearly separable decision boundary. This transformation is also known as the “kernel trick”. A kernel can take many different shapes, such as the Gaussian kernel, which is also known as the radial basis function (Conway & M. White, 2012). We will limit ourselves to the linear version of the support vector machine and the radial basis version. The radial basis transformation is mathematically defined as (Ben-Hur & Weston, 2010; Scholderer, 2013):

Equation 5 - Radial basis transformation

k_{\mathrm{RBF}}(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma\,\lVert\mathbf{x}_i - \mathbf{x}_j\rVert^2\right)   (5)

An advantage of the radial basis kernel is that with properly defined tuning

parameters, it can take the shape of the linear support vector machine (Keerthi & Lin,

2003). What however makes the linear kernel worth considering, is it only requires

the optimization of one parameter, whereas the radial basis kernel requires the

optimization of two parameters, adding to the computational costs (Hsu, Chang, &

Lin, 2010). A way to work with these two versions is to use the linear support vector

machine as a baseline, and use the radial basis kernel to try and improve performance

of the model at the end of the modelling phase (Ben-Hur & Weston, 2010).

The linear support vector machine depends on the soft-margin constant C, whereas the radial basis kernel depends on both the soft-margin constant C and the hyperparameter γ (Varewyck & Martens, 2011). The function of the C parameter does not differ between the linear and the radial basis version. Decreasing C widens the margins, providing more generalized results and less overfitting of the classifier (Ben-Hur & Weston, 2010). The C parameter plays the same role


regardless of whether a kernel is used, but the radial basis kernel additionally depends on γ. Optimizing γ allows for more flexible margins, but as with the C parameter, this might also lead to overfitting. The effect of changing γ follows the same rule as the C parameter: the lower γ, the better the algorithm is at generalizing (Ben-Hur & Weston, 2010).

The training of an optimal support vector machine classifier requires one to make several decisions about how to train the classifier: how to prepare the data, which kernel to use, and how to set the parameters of the support vector machine and the kernel. Data preparation does not differ from the normal steps one undertakes in order to prepare data for analysis (Ben-Hur & Weston, 2010). Deciding on which kernel to use depends on the nature of the data and the underlying relationships between the features.

The naïve Bayes classifier

The naïve Bayes algorithm is an alternative approach to classification compared to support vector machines. Naïve Bayes is considered a solid approach to text classification tasks because it can handle large numbers of features, much like support vector machines (Dharmadhikari et al., 2011). The naïve Bayes classifier is based on Bayes' theorem (Han & Kamber, 2006):

Equation 6 - Bayes' theorem

$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)} \qquad (6)$$

The naïve Bayes classifier works as follows: given a training set labelled with class occurrences, it first calculates the prior probability of a given document belonging to a given class. The prior probability, expressed by P(H), is the probability that a document will belong to a class regardless of the contents of the document. Secondly, the classifier calculates the posterior probability of X conditioned on H, expressed by P(X|H). This probability can be thought of as the probability that a document contains certain terms, given that we know that the document belongs to a given class. Thirdly, the prior probability P(X) is calculated. This can be thought of as the probability that a document from our corpus contains a set of terms. P(X) is assumed to be constant for all classes. Performing these calculations and applying Bayes' formula enables one to calculate the posterior probability of H


conditioned on X. This can be thought of as the probability that a given document belongs to a certain class, given the terms the document contains (Han & Kamber, 2006).

The naïve Bayes classifier comes in two different variations - the multivariate

Bernoulli model and the multinomial model. The multivariate Bernoulli model

applies binary word vectors as the representation of a document. This means that a given term is represented either as zero or one. Zero means that the term is not contained inside a given document; one means that the term appears at least once inside a given document. For example, if a term appears five times inside a document, it will be

assigned the value of one. The multinomial model uses word frequencies as input.

This means that instead of assigning only values of zero and one to a given term, a

term can be assigned an integer value from zero to the length of the document. In this

way, if a given term occurs five times in a document, it is assigned the value of five.
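As an illustration of the difference between the two representations, consider the following minimal base R sketch (the term counts are invented):

```r
# Term counts for three documents (invented toy data)
counts <- matrix(c(5, 0, 2,
                   0, 1, 0,
                   3, 0, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:3),
                                 c("lego", "train", "idea")))

# Multinomial model: uses the raw term frequencies as input
counts

# Multivariate Bernoulli model: any non-zero count becomes one
binary <- (counts > 0) * 1
binary
```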

Naïve Bayes rests on the assumption that terms inside a document are independent of one another. This assumption is unrealistic in real-world settings. Despite this, the naïve Bayes classifier often performs well (McCallum & Nigam, 1998). It has been shown that both versions of the naïve Bayes classifier perform equally well with small vocabularies, whereas the multinomial model performs better as the size of the vocabulary increases. We chose naïve Bayes over k-nearest neighbours and the other learning algorithms because of its ability to perform well on training sets with few cases (Dharmadhikari et al., 2011).

3.3.5 - Performance measures

As an introduction to model performance measures, we start with the confusion matrix, as this is at the heart of model assessment. In short, the confusion matrix displays the combination of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) in a two-by-two matrix (for binary classification). The values on the main diagonal are the correct predictions, whereas the upper right and lower left quadrants are misclassifications (Chawla, 2010).

The receiver operating characteristic curve (ROC curve) is created by plotting the true positive rate against the false positive rate, which is the same as sensitivity plotted against 1 − specificity. Inspecting the ROC curve visually enables one to assess how well a model performs, but as a check on performance it cannot stand alone (Chawla, 2010).
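A hedged sketch of how such a curve can be produced in R with the ROCR package (the package choice, scores and labels are assumptions for illustration; the thesis does not state how its ROC charts were drawn):

```r
library(ROCR)

# Invented example: predicted scores and true class labels
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(1, 1, 0, 1, 0, 1, 0, 0)

pred <- prediction(scores, labels)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf)  # true positive rate against false positive rate
```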


Accuracy can be used as an overall measure of performance, and is simply the

number of correctly classified cases as a ratio of total cases. In the case of evenly

distributed target classes, accuracy is a solid measure of model performance, but it

should not stand alone because it neglects to assess a model’s ability to predict

positive and negative cases. Accuracy is defined as (Witten & Frank, 2005):

Equation 7 - Accuracy

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (7)$$

Recall can be interpreted as how good the model is at finding all the cases within a given class. Precision can be interpreted as how large a share of the cases assigned to a given class actually belong to that class. Precision and recall are of special importance, as they tell us to what degree we can trust our model to classify positive cases correctly, and depending on the task, one might prefer higher recall or higher precision. Recall and precision are defined as (Witten & Frank, 2005):

Equation 8 - Recall

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (8)$$

Equation 9 - Precision

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (9)$$

The F-measure seeks to balance precision and recall, and serves as a good performance measure in the case of an unbalanced dataset. If one trains a model without getting an increase in performance on both recall and precision, one is just adjusting the trade-off between precision and recall (Forman, 2003), whereas the F-measure provides a single measurement of this trade-off. The F-measure is defined as (Witten & Frank, 2005):

Equation 10 - F-measure

$$F = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}} \qquad (10)$$
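All four measures can be computed directly from the confusion matrix counts; a minimal base R sketch (the example counts are those of the linear support vector machine reported later in Table 4):

```r
# Performance measures from confusion matrix counts (equations 7-10)
classifier_metrics <- function(tp, fp, fn, tn) {
  accuracy  <- (tp + tn) / (tp + tn + fp + fn)
  recall    <- tp / (tp + fn)
  precision <- tp / (tp + fp)
  f_measure <- 2 * recall * precision / (recall + precision)
  c(accuracy = accuracy, recall = recall,
    precision = precision, f = f_measure)
}

classifier_metrics(tp = 42, fp = 70, fn = 9, tn = 329)
```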


3.4 - The power of different text classification methods

This section will identify and comment on the literature on binary text classification tasks. Until now, we have identified and discussed different text mining and machine learning processing and modelling tools. It is important to mention that the aim of this literature study was never to make perfect guesses about the choice of modelling parameters from the very beginning. Rather, we wished to set boundaries for the approach we will later take to modelling the concept of an idea. In the studies chosen, we will identify the aim of the study, processing methods, term weighting

schemes, feature selection methods, performance assessment measures and choice of

classification algorithm. We will rank the methods according to performance in the

given study. At the end of the section we comment on any shortcomings of the

studies and summarize our findings.

Starting with the most recent study, Yu (2008) showed that support vector machines and naïve Bayes perform equally well classifying sentimental novels, while naïve Bayes outperforms the support vector machine on an erotic poem task. The setup was a binary term weighting scheme for both classifiers, term frequency for the naïve Bayes, and normalized term frequency and term frequency inverse document frequency for the support vector machine. For processing, stopword removal and stemming were used. For feature selection, the support vector machine used support vector weights and naïve Bayes used the odds ratio. The authors argued that stemming eliminated highly discriminative terms, reducing the performance of the support vector machine.

Lai (2007) compared the support vector machine, naïve Bayes and k-nearest neighbours for text classification, where the support vector machine outperformed all other methods, while naïve Bayes performed consistently. The setup was variations of stemming and stopword removal; it showed no significant effect of stemming and a small improvement from stopword removal. Two weighting schemes were applied: one weighting scheme was used as a baseline and was not reported, whereas the term frequency inverse document frequency score was reported together with the support vector machine. This combination gave a slight improvement in performance.

Zhang, Zhu, and Yao (2004) investigated the performance of the five classification methods of support vector machines, maximum entropy model, memory-based learning, naïve Bayes and ada boost. Maximum entropy models,


nearest neighbours and ada boost we will not comment on further, but we acknowledge that they were a part of the study. The results showed that the support vector machine outperformed the other methods, while naïve Bayes underperformed. The setup reported was document frequency, information gain and chi-square for feature selection.

Sculley and Wachman (2007) did a spam classification study with support vector machines, trying to prove that focusing too much attention on tuning C is unnecessary, as it does not increase performance to such a degree that it can make up for the extra computational costs incurred. The setup was no feature selection and a binary weighting scheme. This paper is relevant because character-level 3-grams and 4-grams were a part of the setup. 4-grams gave the best performance, supporting the use of n-grams in the modelling process.

Another spam classification task was undertaken by Webb, Chitti, and Pu (2005). Here support vector machine, naïve Bayes and logit boosting algorithms were compared against a well-known spam filter called Spam Probe. The support vector machine and logit boosting generally performed better than naïve Bayes, but only by a small margin. The setup was information gain for feature selection, and neither stemming nor stopword removal was applied. We noticed that only one feature selection method was applied and no weighting scheme was reported.

Critical assessment

Taking a critical stance on the reviewed studies, we noticed the lack of information about feature selection methods in the Sculley and Wachman (2007) and Lai (2006) papers. Even though we feel it is a downside of both studies, we cannot claim that this missing information is problematic, as one does not need to apply a feature selection

technique. However, missing information about the term weighting scheme in the

Webb et al. (2005) paper and the missing information about processing steps in Webb

et al. (2005) and Zhang et al. (2004) does deserve criticism, as the choice one makes

on these parameters can have a significant impact on classifier performance. It also

deserves critique that Yu (2008), Sculley and Wachman (2007) and Lai (2006) apply

accuracy as a metric of performance without stating the distribution of the target

classes in their datasets. An overview of the applied setup with regards to

classification algorithm, processing steps, choice of term weighting scheme, feature

selection method and performance measures is shown in Table 1.


Summary

From these studies we learned that the classifiers of support vector machines and naïve Bayes seem to be widely applied with good results. There seems to be no consistency in the text processing steps, whereas the term frequency inverse document frequency score and the binary weighting scheme seem to be the preferred choices of term weighting scheme. There seems to be a minor preference towards information gain for feature selection, whereas the choice of the chi-square statistic for feature selection is more doubtful because it is not supported by more than one study. For performance measurement, the F-measure, accuracy, precision and recall are widely used. We realize that other performance measurements were also applied in the studies, but we limit ourselves to the ones chosen.


Table 1 - Overview of the studies reviewed

| Paper | Aim of study | Classification algorithms | Processing | Term weight | Feature selection | Performance measures | Performance ranking |
|---|---|---|---|---|---|---|---|
| Yu (2008) | Classification of novels and erotic poems | SVM, NB | Stemming, stopword removal | BIN, TF, NTF, TF-IDF | Support vector weights (SVM), OR (NB) | Accuracy | NB, SVM |
| Sculley & Wachman (2007) | Spam | SVM | Words, 2-grams, 3-grams, 4-grams | BIN | NA | Accuracy, precision, recall, F | SVM |
| Lai (2006) | Spam | SVM, NB, K-NN | Stemming, stopword removal | TF-IDF | NA | Accuracy | SVM, NB, K-NN |
| Webb et al. (2005) | Spam | SVM, NB, LB | NA | NA | IG | WAcc | SVM, LB, NB |
| Zhang et al. (2004) | Spam | SVM, MEM, ADA, K-NN, NB | NA | BIN | DF, IG, CHI | Precision, recall, F, WAcc, TCR | SVM, MEM, ADA, NB, MBL |

Abbreviations: SVM = Support vector machine, NB = Naïve Bayes, K-NN = K-nearest neighbours, LB = Logit boosting, MEM = Maximum entropy model, ADA = Ada boosting, BIN = Binary weighting scheme, TF = Term frequency, NTF = Normalized term frequency, TF-IDF = Term frequency inverse document frequency, OR = Odds ratio, IG = Information gain, DF = Document frequency, CHI = Chi-square, WAcc = Weighted accuracy, TCR = Total cost ratio, NA = Not reported


3.5 - Summary

In answer to sub question two, we state that the data created inside online

communities is characterized as being in an unstructured format, and is often big in

terms of volume, frequency and variety. This means that one must apply text mining

and machine learning in order to model the concepts of interest. One can organize

unstructured data by means of a bag-of-words. If one wishes to transform

unstructured textual data into structured data, then a variety of processing steps must

be undertaken, including pruning, tokenization, stemming, creation of n-grams and

choice of term weighting scheme. The results within the reviewed literature are

inconclusive when it comes to a choice between stemming, stopword removal and n-

grams, but the term frequency inverse document frequency and binary weighting schemes are well supported.

When applying machine learning one needs to be aware of imbalance in the

target variable. Class imbalance can be partially solved by random oversampling

and/or random undersampling. One must choose which feature selection methods to

include, as feature selection can enhance performance and prevent overfitting. Methods supported by the literature include information gain and the chi-square statistic. One must also choose a classification algorithm, where we pointed out support vector machines and naïve Bayes as alternatives. To assess the performance of the trained classifier one must decide which measures to assess, especially when assessing an unbalanced training or test set. The F-measure in particular makes a good performance measure, but one might also use accuracy, recall, precision and ROC assessment.


4 - Aims of study

In light of the reviewed theory as well as our main and secondary research questions,

we have set up a study to assess how the concept of an idea can be captured inside a

target variable. Based on this target variable, we will use text mining and machine

learning to detect ideas generated in our online community of Lugnet. Having

detected the ideas, we will assess to what degree seasonality and historical events

influence idea generation inside our online community.

Our study is divided into three parts. Firstly, we will create a training set that

captures the concept of an idea within its target variable. Secondly, we will use the

training set to train a classifier to detect ideas from messages contained within our

online community. In order to do so, we will choose the best settings for two

particular types of classifiers (support vector machines and naïve Bayes), and finally

choose the best classifier given the settings chosen. Lastly, we will use the trained

model to detect ideas generated in the online community of Lugnet and determine to

what degree idea generation inside the online community can be explained by

seasonality and historical events.


5 - Method

The first section of this chapter will describe how we mined the data from Lugnet.

The second section describes how the training set and target variable were created. As

mentioned already, ideas are complex and therefore it becomes relevant to assess the

reliability of the concept we extract. The importance of reliability should be seen

relative to the importance of having a training set of sufficient balance. We created

our target variable through several rounds of manual classification, and assessed reliability through Cohen's kappa. The third section outlines how the concept of an idea was modelled, that is, how the right combination of weighting scheme, processing steps and feature selection method was selected in combination with a classification algorithm. This process is important in order to create the best possible model. The fourth and final section describes how we used the model to filter the entire document collection and build a dataset for assessing the variation in idea generation in the online community, and by which means this was done.

5.1 - Mining the online community of Lugnet

At the point when we downloaded the forum, it had a volume of 529,040 messages, and each

individual message contains a variety of information. An example of a message

displayed by a regular internet browser, is shown in Appendix A, whereas the same

message in .eml format and .txt format, is shown in Appendix B and Appendix C.

Each message contains the text together with the metadata belonging to each text. We

were only interested in the text, the unique identifier, and the date for each post. The

unique identifier is the Message-ID which for a random post is of the form

<[email protected]>, and the date is of the format Tue, 29 Sep 1998

18:51:35 GMT for the same random message. We would need the unique identifier in

order to merge our predictions from our model with the dates.

As mentioned, the forum had a total of 529,040 messages, which we stored as .eml files in different folders according to their sub-forum. Lugnet's messages were downloaded as separate .eml files and handled by the software package R. The files were stored under the name of the subject of the individual message. In order to give all messages their own unique identifier, we created a piece


of code in R that loads all .eml files from each folder and moves them into a single folder. This process assigns each message a name corresponding to its "Message-ID" instead of its subject title (Feinerer & Kurt, 2012; Ingo Feinerer, 2012). Some of the messages posted contain citations, i.e. pieces of text quoted from an earlier message. We decided to remove citations, as we did not consider them valid information. The removal of the citations was accomplished through text mining software in R (Feinerer & Kurt, 2012; Ingo Feinerer, 2012). This left us with a total of 440,036 messages, a reduction of 16.8%, which happened automatically as each duplicate was overwritten in the process. The additional pieces of information, shown in Appendix C, were removed during this process, leaving only the Message-ID and the text without citations for further analysis.
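A minimal base R sketch of this step (the folder paths and the citation pattern are assumptions for illustration; the thesis performed it with the tm infrastructure of Feinerer & Kurt, 2012):

```r
# Collect all .eml files from the sub-forum folders (paths are invented)
files <- list.files("lugnet", pattern = "\\.eml$",
                    recursive = TRUE, full.names = TRUE)
dir.create("merged", showWarnings = FALSE)

for (f in files) {
  msg <- readLines(f, warn = FALSE)

  # Extract the Message-ID header to use as the new, unique file name
  id_line <- grep("^Message-ID:", msg, value = TRUE)[1]
  id <- gsub("^Message-ID:\\s*<|>\\s*$", "", id_line)

  # Drop citation lines (quoted text from earlier messages, starting with ">")
  msg <- msg[!grepl("^>", msg)]

  # Duplicates share a Message-ID and are overwritten automatically
  writeLines(msg, file.path("merged", paste0(id, ".eml")))
}
```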

5.2 - Construction of target variable

The concept of an idea is complex. In light of this, the task of creating a

reliable target variable becomes important for the performance of the classifier. We define our classification task as a binary classification task, as most of the reviewed research has been within this particular type of task, and because we consider it reasonable to handle our problem in this way.

We used human judges to create the training set, and Cohen's kappa statistic (κ) to assess the reliability of our target variable. This measure was chosen because it corrects for the agreement that occurs simply by chance, thereby providing a chance-adjusted measure of reliability for the concept assessed. κ relies on three assumptions: (1) units of observation are independent; (2) categories of the scale are independent, mutually exclusive and exhaustive; (3) judges operate independently (Cohen, 1960). We used the benchmark scale

suggested by Landis and Koch (1977) and define κ < 0 as poor, 0 < κ ≤ 0.20 as slight,

0.20 < κ ≤ 0.40 as fair, 0.4 < κ ≤ 0.60 as moderate, 0.60 < κ ≤ 0.80 as substantial, and

0.80 < κ ≤ 1 as almost perfect (Landis & Koch, 1977). The process of constructing the

target variable would result in a training set containing the classifications from each

judge and the Message ID as unique identifier.
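As an illustration of the chance correction, κ for two judges can be computed from their agreement table; a minimal base R sketch with invented classifications:

```r
# Cohen's kappa for two judges (invented binary classifications)
judge1 <- c(1, 0, 0, 1, 0, 0, 1, 0, 0, 0)
judge2 <- c(1, 0, 0, 0, 0, 0, 1, 0, 1, 0)

tab <- table(judge1, judge2)
po  <- sum(diag(tab)) / sum(tab)                      # observed agreement
pe  <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2  # chance agreement
(po - pe) / (1 - pe)                                  # kappa
```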

The process of constructing the target variable was divided into five steps. In the first step, we extracted 300 posts and had three judges classify them manually. In the second step, we built an intermediate model by modelling the 300-case


training set we created. In the third step, we extracted 3,000 posts with the help of the intermediate model; these posts were more likely than randomly sampled posts to contain an idea. In the fourth step, we first had three judges classify 300 messages in order to pick the two judges with the highest pairwise κ, and these two judges would then manually classify the 3,000 messages. Finally, in the fifth step, we assessed κ for the classifications of the two judges. Each step is explained in detail below.

Step 1: Initial assessment

In the first step we extracted 300 documents from a sub-forum within Lugnet called "Dear Lego". The purpose of this sub-forum is to allow people to send open letters to Lego, implying that there might be a higher likelihood that people will express ideas within this sub-forum than outside it. We had three judges (people connected to the project) classify the messages. We assessed κ in order to identify whether it is possible to extract ideas with at least a slight degree of reliability.

Step 2: Intermediate model

Based on the 300 classifications, we created an intermediate model to help increase

the likelihood of attaining a training set with a higher degree of balance. This model

was created with a binary weighting scheme, 3-grams, stopword removal, pruning

above 0.99 and below 0.01, information gain for feature selection and a linear support

vector machine optimized with regard to the soft-margin parameter C.

Step 3: Extract 3,000 cases with improved event frequency

In this step we applied the intermediate model to the entire corpus of 440,036 messages and selected the 1,500 messages that were most likely to contain an idea, together with 1,500 documents chosen at random. Before extracting the 3,000 messages, we discarded all messages that were more than ± 2.5 standard deviations away from the mean with regard to token count, which corresponds to the size of the messages. The exclusion based on message length was done because some messages were extremely short and some were extremely long. The arguments for taking message length into consideration were that reading long posts was more likely to tire our judges out and that very short messages would contain no information.


Step 4: Manually classify 300 cases to assess classification reliability

In this step, three student helpers were recruited, of whom only two would get to do the classifications of the 3,000 cases. First we created a dataset of 300 messages and distributed it to each student helper. Each student helper was instructed to do the classifications independently. We used κ to assess reliability between judges, and the two judges with the highest pairwise κ then did the rest of the classifications.

Step 5: Manually classify 3000 cases to construct the training set

Having chosen two of the three judges, these two judges then classified the 3,000 messages. Again both judges were instructed to do the classifications independently, and again κ was used to assess the reliability of the judges. As this was the final step in the process of creating our target variable, we would after this step have our training set, which we could use for modelling the concept of an idea. We decided to use the ideas identified by at least one judge as positive cases in our training set.

5.3 - Modelling the concept of an idea

Having created our training set, we used it to train our model. This section describes the steps we performed in order to build the model. We did this in seven steps, which are described in detail in the following subsections.

5.3.1 - Data exploration

In order to get a feel for our data, we used the feature selection method of information gain to filter out the twenty most discriminative terms between the two classes of positives and negatives. This gave us an idea of how our judges had made their decisions, and of which terms are good predictors of whether a message contains an idea. For the task of assessing the twenty most discriminative terms, we removed stop words, extracted 3-grams and pruned above 0.99 and below 0.01. Hereafter, we looked at the topics discussed in the extracted cases. We looked at the positive cases and negative cases separately in order to assess whether we could identify a pattern in what people discuss within ideas and non-ideas. We extracted four topics and five terms for each topic. We used the topicmodels package in R to perform this analysis; we removed stop words, applied stemming and used term occurrences as the term weighting scheme (Grün & Hornik, 2011). We used the entire sample of positive cases and negative cases.
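A minimal sketch of this kind of exploration with the tm and topicmodels packages (the toy corpus and all parameter values are assumptions for illustration):

```r
library(tm)
library(topicmodels)

# Toy corpus standing in for the extracted cases
corpus <- VCorpus(VectorSource(c(
  "it would be nice to see a new lego train set",
  "you could build a robot with a motor and a sensor",
  "i posted some pictures of my lego castle yesterday",
  "the new bricks come in many colors and shapes")))

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)

# Term-occurrence document-term matrix
dtm <- DocumentTermMatrix(corpus)

# Four topics, then the five most probable terms per topic
lda <- LDA(dtm, k = 4)
terms(lda, 5)
```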


5.3.2 - Data partitioning

As a first step, we partitioned our data and created two datasets, a training set and a

test set. Included in the training set was a validation set, and the test set was to be put

aside for assessing our candidate models. We did a 70/15/15 split, meaning that the training set consisted of 70% of the total cases, our validation set consisted of

15% of the cases, and our test set consisted of 15% of the cases. We used the two

validation techniques of 10-fold cross validation and split validation.

When we used cross validation, we undersampled the majority class to fit the

minority class and assessed performance on the validation set. When choosing among

candidate models, we undersampled as well, but we used split validation instead and

assessed performance on the test set, rather than the validation set.

As debated earlier, it is not obvious whether undersampling or oversampling produces the best results; therefore we also created a training set where we oversampled. This training set was not used until we had our final model. We stress that we made sure that the training set and validation set were kept separate from the test set throughout the entire process, and the test set was only used at the end for choosing among candidate models.
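A minimal base R sketch of the 70/15/15 split with random undersampling of the majority class (the split proportions and class counts come from the text; everything else is invented):

```r
set.seed(42)

# Toy stand-in for the labelled cases (337 positives, 2,661 negatives)
data <- data.frame(id = 1:2998,
                   label = rep(c(1, 0), times = c(337, 2661)))

# 70/15/15 split into training, validation and test sets
idx   <- sample(nrow(data))
train <- data[idx[1:2099], ]
valid <- data[idx[2100:2548], ]
test  <- data[idx[2549:2998], ]

# Random undersampling: shrink the majority class to the minority size
pos <- train[train$label == 1, ]
neg <- train[train$label == 0, ]
train_under <- rbind(pos, neg[sample(nrow(neg), nrow(pos)), ])
```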

5.3.3 - Classification algorithms

Based on the results presented in Table 1 as well as the arguments presented earlier,

we used the support vector machine and naïve Bayes as classification algorithms. We

did not expect the naïve Bayes to perform better than the support vector machine, but as stated earlier, naïve Bayes can provide a baseline as well as achieve high performance on a small training set. We used the linear support vector machine to choose modelling settings, and in the end we added a support vector machine with a radial basis kernel. All our support vector machine classifiers were trained by using a grid search for the optimal C and the optimal γ value.
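A hedged sketch of such a grid search with e1071's tune.svm (the package choice and the grid values are assumptions; the thesis does not report its exact grid):

```r
library(e1071)

# Toy data standing in for the document-term matrix and labels
set.seed(1)
x <- matrix(rnorm(200), ncol = 4)
y <- factor(rep(c("idea", "none"), each = 25))

# Grid search over the soft-margin constant C and (for the RBF kernel) gamma,
# evaluated with the package's built-in cross validation
tuned <- tune.svm(x, y, kernel = "radial",
                  cost  = 10^(-1:2),
                  gamma = 10^(-3:0))
tuned$best.parameters
```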

5.3.4 - Term weighting scheme

We could not assume that the problems mainly researched in the existing literature (classifying spam and poems) are similar to classifying ideas, so we needed to assess

which weighting scheme performs better on the particular task of detecting ideas. Our

setup included four different weighting schemes - binary weighting scheme, term

occurrences, normalized term frequency and term frequency inverse document


frequency. Neither could we expect the support vector machine and the naïve Bayes algorithm to have the same preferences with regard to weighting schemes, so we applied all four weighting schemes to both classifiers. We chose to apply stopword removal, 3-grams and pruning above 0.99 and below 0.01 as the setup for each individual weighting scheme, giving us a total of eight different scenarios and 2,241 features in all settings. We assessed the performance of the weighting schemes by comparing accuracies using a Bonferroni-adjusted paired t-test. We denoted the mean accuracies $\bar{a}_1$ and $\bar{a}_2$ respectively. In order to extract the accuracy measures we applied ten-fold cross validation, and based on these measures we calculated the mean difference, denoted $\bar{d}$, for each pair of weighting schemes and tested whether $\bar{d}$ was significantly different from zero (Yu, 2008). We then chose the single best weighting scheme for each classifier.
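A minimal base R sketch of one such comparison (the fold-level accuracies are invented; the Bonferroni correction simply multiplies the p-value by the number of comparisons):

```r
# Accuracies from ten-fold cross validation for two weighting schemes
acc_tfidf <- c(0.83, 0.81, 0.84, 0.80, 0.82, 0.85, 0.81, 0.83, 0.82, 0.84)
acc_bin   <- c(0.78, 0.77, 0.79, 0.76, 0.78, 0.80, 0.77, 0.78, 0.77, 0.79)

# Paired t-test of the mean difference against zero
tt <- t.test(acc_tfidf, acc_bin, paired = TRUE)

# Bonferroni adjustment for the number of pairwise comparisons made
n_comparisons <- 6
min(tt$p.value * n_comparisons, 1)
```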

5.3.5 - Data processing steps

For choosing among processing steps we applied the best term weighting scheme

from the previous step to each classifier. We focused on the two processing steps of

n-grams and stemming. We used n-grams to extract more semantics, and we decided to use 3-grams, which also include 2-grams and uni-grams. We removed stop words

to get a lower number of terms and thereby reduce computational costs, as well as

filter out noise. This led to three setups, which were only stemming, only 3-grams and

neither of them. We pruned above 0.99 and below 0.01 and we removed stopword in

all settings. We applied a paired t-test with Bonferronis correction, as with the term

weighting schemes, and we picked one processing setup out of the three for each

classifier. This left us with two different classifiers, with an individual weighting

scheme and an individual processing setup to continue with.

An important point about this approach is that we also applied a feature

selection technique in order to keep the feature level constant. We have earlier

discussed that classifier performance can be influenced by the number of features, and

as we faced different levels of features given 3-grams (2,241 features), stemming

(1,018 features) and none of these (1,234 features), we needed to keep the feature

level constant in order to avoid the potential effect of feature number on classifier

performance. We did this by applying information gain for feature selection to all settings and using only the top 1,018 features.


5.3.6 - Feature selection methods

For feature selection we used information gain and the chi-square statistic. We

assessed performance in terms of accuracy, recall and precision, and set up a grid

search to pick the best percentage threshold of features for each classifier. This gave

us two setups for each classification algorithm, from which we picked the best

performing feature selection technique for each classifier.

5.3.7 - Choice of final model

Having assessed the performance given term weighting scheme, processing steps and feature selection methods, we had the settings for how to train two classifiers. As it has been debated whether support vector machines actually benefit from feature selection, we also set up a linear support vector machine without any feature selection technique, and a support vector machine with a radial basis kernel, also without any feature selection technique. This gave us a total of four classifier setups: the first was a naïve Bayes with a given feature selection method, the second was a linear support vector machine with a given feature selection method, and the third and fourth were a linear support vector machine and a support vector machine with a radial basis kernel, both with no feature selection technique. We assessed ROC, F-measure, recall, precision and accuracy on the test set, and we used split validation with the undersampled training set and applied the optimal feature threshold from the previous step to the classifiers that utilize feature selection.

Based on the assessment of the four candidate models, we picked our final

model. Instead of using the under-sampled training set, we used the oversampled

training set, and assessed whether there was any improvement in performance. Having

chosen our final model we applied the model to all messages from the forum, giving

us a dataset with a date variable and a prediction variable, allowing us to determine to

what degree variability in idea generation within our online community can be

explained by month and/or year over a given time period.

5.4 - Effect of seasonality and historical events on idea generation

How can idea generation within crowdsourcing communities be explained by

seasonality and historical events? In this section we will answer this question by (1)

exploring the data in the same manner as earlier; (2) creating our dataset and handling missing data; (3) assessing the nature of our variables and performing the necessary


variable transformations; (4) defining the regression model; and (5) assessing

parameter estimates and goodness-of-fit.

We performed the same data exploration as we did after extracting the training

set. In particular we extracted an even number of positive and negative cases and

assessed the twenty most discriminative terms. This second comparison is relevant for

two reasons. Firstly, we could assess whether our model actually extracted the same pattern

as our judges, and secondly we could assess the difference between messages

containing ideas and non-ideas. We extracted a sample of 2,500 ideas and 2,500 non-

ideas and used the same setup as earlier.

Having trained a model, we applied the model to the entire corpus of 440,036

messages in order to detect the ideas in the entire forum. Based on the dates of the

particular messages, a dataset was created containing a count of ideas and a count of

total messages posted within a given month and year. This gave us a dataset with the

four variables, MONTH, YEAR, IDEA and ACTIVITY, where IDEA and

ACTIVITY are counts of messages containing ideas and total message count

respectively. In this step we also assessed missing data and omitted years with too

many missing months.

In order to adjust for the relationship between number of ideas created and

number of messages posted on the forum, we derived a variable which we named

event rate per month (ERPM), by calculating IDEA as a ratio of ACTIVITY for each

row in our dataset. To create a dependent variable that fulfilled the requirements of

linear regression, we did a logit transformation on ERPM (LN.IDEA), giving us a

continuous dependent variable and two categorical independent variables, MONTH

and YEAR.
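A minimal base R sketch of these transformations and the subsequent regression (the variable names follow the text; the counts are invented):

```r
set.seed(7)

# Invented monthly counts standing in for the real dataset
df <- data.frame(YEAR     = factor(rep(1998:1999, each = 12)),
                 MONTH    = factor(rep(month.abb, times = 2)),
                 IDEA     = rpois(24, lambda = 40),
                 ACTIVITY = rpois(24, lambda = 600))

# Event rate per month and its logit transformation
df$ERPM    <- df$IDEA / df$ACTIVITY
df$LN.IDEA <- log(df$ERPM / (1 - df$ERPM))

# Regression of the transformed rate on the categorical predictors
fit <- lm(LN.IDEA ~ MONTH + YEAR, data = df)
summary(fit)  # coefficient estimates, p-values and R-squared
```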

Since the forum has been inactive for the last four years, we omitted these

years from our analysis, and then we assessed whether our model fulfilled the assumptions of

linear regression. We assessed model assumptions by inspecting how the error terms

were distributed in a standard residuals plot and a histogram of the residuals. After

this we defined our model.

Finally we assessed the results of our regression model, where we reported the

total amount of variance explained ($R^2$). We reported coefficient estimates and their

corresponding p-values.


6 - Results

6.1 - Reliability of the manual classification of the target variable

The initial assessment of the coding scheme was performed on 300 randomly selected posts from the 'Dear Lego' sub-forum. Judge one classified 10% of the posts as containing an idea, judge two classified 6.7%, and judge three classified 3%. The average share of ideas extracted was 6.6%. Calculating κ, we found that κ = 0.391 ± 0.18 at α = 0.05 for judge one and judge two, κ = 0.382 ± 0.193 at α = 0.05 for judge one and judge three, and κ = 0.604 ± 0.21 at α = 0.05 for judge two and judge three, the last of which we considered substantial agreement.

Based on the classifications from the initial assessment, we constructed an

intermediate model to extract 3,000 messages with higher likelihood of containing an

idea than if sampled at random. Of these 3,000 messages, 300 were selected at

random and distributed to three new judges. Of the 300 cases, judge one classified

13.3%, judge two classified 6.7%, and judge three classified 11% as containing an

idea. (Average = 10.33%). κ = 0.451 ± 0.161 at α = 0.05 for judge one and judge two,

κ = 0.444 ± 0.173 at α = 0.05 for judge two and judge three and κ = 0.486 ± 0.15 for

judge one and judge three at α = 0.05, which we consider moderate reliability.

We received the remaining cases with a minor note from both judges, saying that two posts were duplicates of other posts, despite having different Message-IDs. These two posts were excluded, leaving us with a training set of 2,998 cases. From the 3,000 cases, judge one classified 8.7% of the cases as containing an idea, and judge two classified 6.90% as containing an idea. Calculating the average share of ideas extracted yielded 7.84%, compared to 6.56% for the messages extracted in the initial assessment, which came from a sub-forum with a higher likelihood of ideas occurring. For the two remaining judges κ = 0.548 ± 0.056 at α = 0.05, which we consider moderate agreement. As a final comment we note that both judges agreed on 137 positive cases, whereas a further 200 cases were considered an idea by only one of the two judges. This gave us a training set of 337 positive cases and 2,661 negative cases. We decided to continue with as many positive cases as possible, as we considered a training set with 137 positive cases too small.


Summary of results

The important results for this part of the study include the training set of 2,998 cases, 337 of them positive, and a κ statistic of 0.548 ± 0.056 at α = 0.05, which we consider moderate reliability.

6.2 - Detecting ideas

Having created a training set with a reliable target variable, we built our model by (1)

partitioning the data based on the distribution of our target class; (2) performing an

exploratory analysis to assess potential patterns in the data; (3) assessing the

consequences of varying the term weighting scheme; (4) assessing the consequences

of varying the processing steps; (5) assessing the consequences of varying the feature

selection methods; and finally (6) assessing the performance of our candidate models

and picking the best performing model.

6.2.1 - Data partitioning

Our training set was 2,998 cases large, distributed over 337 positive cases and 2,661 negative cases. This gave us an unbalanced dataset, as the positive ratio was approximately 11:100. Based on a 70/15/15 split we achieved a training set with 236 positive cases and 1,863 negative cases. For the validation and test sets this left 51 positive cases and 399 negative cases each. Applying random undersampling, our cross validation training set had 286 positive cases and 286 negative cases, while using random oversampling, it had 2,262 positive cases and 2,262 negative cases. The test set was left untouched no matter what technique we applied in order to balance the dataset.

6.2.2 - Exploratory analysis

Before doing the actual modelling, we explored the data extracted. Table 2 shows the

twenty most discriminative terms or n-grams, as well as four topics extracted from the

pool of positive cases and four topics extracted from the pool of negative cases.


Table 2 - Twenty discriminative terms and four positive and negative topics from the training set

Top 20 discriminative terms:

| 1 to 5 | 6 to 10 | 11 to 15 | 16 to 20 |
|---|---|---|---|
| would_be | that_would | it_would | tlg |
| lego | you_could | com | like_to_see |
| idea | i_would | ideas | to_see |
| could_be | see | etc | that_would_be |
| sets | nice | theme | be_a |

Idea topics:

| Topic 1 (Idea) | Topic 2 (Robotics) | Topic 3 (Trains) | Topic 4 (Lego) |
|---|---|---|---|
| lego | sensor | train | brick |
| build | robot | model | lego |
| idea | motor | build | color |
| product | control | track | build |
| castl | program | look | piec |

Non-idea topics:

| Topic 1 (Lego) | Topic 2 (Lego) | Topic 3 (Robotics) | Topic 4 (Construct) |
|---|---|---|---|
| peopl | lego | robot | brick |
| dont | build | train | build |
| time | post | build | piec |
| lego | time | program | lego |
| that | dont | lego | plate |

From Table 2 we can see that many of the discriminative terms are n-grams. This

makes sense because when people have an idea or are being creative, they use phrases like "could be", "would be", "you could", "like to see" and, obviously, "idea". The text

piece below shows an example of a message containing some of the identified terms.

That's an awesome idea!

You could make a Buddha, or a Renaissance man!

That would be totally cool.

If we assess the results of the topic modelling, we labelled the idea topics "Idea", "Robotics", "Trains" and "Lego". The non-idea topics we labelled "Lego", "Lego", "Robotics" and "Construct". We will not claim that there is a clear distinction between idea topics and non-idea topics. For example, the topic of robotics seems to be widely debated in both an idea and a non-idea domain.

6.2.3 - Classifier performance given term weighting and processing steps

When we assessed the results of the support vector machines we found that the

support vector machine with normalized term frequency performed significantly

better than the support vector machine with binary weighting scheme. We discovered


that the support vector machine with term frequency inverse document frequency

performed significantly better than the support vector machine with binary weighting

scheme. The support vector machine with term frequency inverse document

frequency performed significantly better than the support vector machine with term

occurrences. From these results we could exclude the support vector machine with

binary weighting scheme and the support vector machine with term occurrences, but

we decided to proceed with the support vector machine with term frequency inverse

document frequency. Our reason for this was that the support vector machine with

normalized term frequency fails to achieve a significant difference when compared to

the support vector machine with term occurrences.

When assessing the results for the naïve Bayes classifiers we found that the

naïve Bayes with binary weighting scheme performed significantly better than the

naïve Bayes with normalized term frequency. We also discovered that the naïve

Bayes with binary weighting scheme performed significantly better than the naïve

Bayes with term frequency inverse document frequency. From these results we

excluded the naïve Bayes with normalized term frequency and the naïve Bayes with

term frequency inverse document frequency, and we decided to proceed with the naïve Bayes with binary weighting scheme, because we failed to obtain any significant performance differences with regard to the naïve Bayes with term occurrences. The results of the mean comparisons for the ten-fold cross validation are

reported in Table 3.

6.2.4 - Classifier and term weighting scheme given processing steps

Assessing the results of the support vector machine with term frequency inverse

document frequency, we found that the support vector machine with term frequency

inverse document frequency and 3-grams performed significantly better than the

support vector machine with term frequency inverse document frequency and

stemming. Moreover, we determined that the support vector machine with term

frequency inverse document frequency with no processing step performed

significantly better than the support vector machine with term frequency inverse

document frequency and stemming. From these results we excluded the support

vector machine with term frequency inverse document frequency and stemming, but

our results could not distinguish between the support vector machine with term

frequency inverse document frequency with 3-grams and the support vector machine


with term frequency inverse document frequency and no processing step. However,

we decided to proceed with the support vector machine with term frequency inverse

document frequency and 3-grams. Our reason for this was that the argument for choosing the support vector machine with term frequency inverse document frequency with no processing step would be a lower likelihood of overfitting due to the lower number of features. But as we considered feature selection and feature selection thresholds in the next step, we preferred to keep in as much information as possible at that point.

Assessing the results of the naïve Bayes with binary weighting scheme, we

found that naïve Bayes with binary weighting scheme and stemming performed

significantly better than the naïve Bayes with binary weighting scheme and 3-grams.

We also found that the naïve Bayes with binary weighting scheme and stemming

performed significantly better than the naïve Bayes with binary weighting scheme and

no processing step. From these results we excluded the naïve Bayes with binary

weighting scheme and 3-grams and the naïve Bayes with binary weighting scheme

and no processing step, and we decided to proceed with the naïve Bayes with binary

weighting scheme and stemming. We discovered that the naïve Bayes classifier seemed to perform better with a dimensionality reduction method, which stemming can be considered to be. The results of the mean comparisons for the 10-fold cross validation are reported in Table 3.

6.2.5 - Assessing performance given varying feature selection methods

Given the results so far we decided that the support vector machine should be

modelled with the term frequency inverse document frequency as a weighting scheme

and with 3-grams as a processing step. With regards to the naïve Bayes algorithm, we

decided that this classifier should be modelled with binary weighting scheme and

stemming as a processing step.

Assessing the results of the support vector machine with term frequency inverse document frequency and 3-grams given information gain for feature selection, we found that the optimal feature percentage threshold was 90% of the features. At this feature threshold the support vector machine with term frequency inverse document frequency and 3-grams achieved accuracy = 0.892, recall = 0.902 and precision = 0.885. When using the chi-square statistic for feature selection, we calculated accuracy = 0.873, recall = 0.863 and precision = 0.880. We decided to


proceed with information gain for feature selection, as this method performed better than the chi-square statistic.

Assessing the results for the naïve Bayes with binary weighting scheme and stemming with information gain for feature selection, we discovered that the optimal feature percentage threshold was 90% of the features. At this feature threshold the naïve Bayes with binary weighting scheme and stemming achieved accuracy = 0.833, recall = 0.882 and precision = 0.804. Given the chi-square statistic for feature selection, we calculated accuracy = 0.833, recall = 0.882 and precision = 0.804. It is worth noting that these results were identical, which might seem strange. But if we recall that the dataset extracted by means of stemming gave 1,018 features, it is not unreasonable that the two feature selection methods (information gain and the chi-square statistic) discard the same 10% of features that contribute the least to performance, and thereby give identical results. As the chi-square statistic is less computationally expensive, we decided to continue with this feature selection method.


Table 3 - Results of the term weighting, processing and feature selection assessments

Term weighting:

| Comparison | $\bar{a}_1$ | $\bar{a}_2$ | $\bar{d}$ | p-value |
|---|---|---|---|---|
| SVM_NTF vs. SVM_BIN | 0.811 | 0.778 | 0.033 | 0.048 |
| SVM_TF-IDF vs. SVM_BIN | 0.824 | 0.778 | 0.046 | 0.019 |
| SVM_TF-IDF vs. SVM_NTF | 0.824 | 0.811 | 0.012 | 1.000 |
| SVM_TO vs. SVM_BIN | 0.766 | 0.778 | 0.012 | 1.000 |
| SVM_TO vs. SVM_NTF | 0.766 | 0.811 | 0.045 | 0.139 |
| SVM_TO vs. SVM_TF-IDF | 0.766 | 0.824 | 0.058 | 0.015 |
| NB_NTF vs. NB_BIN | 0.645 | 0.694 | 0.049 | 0.015 |
| NB_TF-IDF vs. NB_BIN | 0.643 | 0.694 | 0.051 | 0.023 |
| NB_TF-IDF vs. NB_NTF | 0.643 | 0.645 | 0.002 | 1.000 |
| NB_TO vs. NB_BIN | 0.689 | 0.694 | 0.005 | 1.000 |
| NB_TO vs. NB_NTF | 0.689 | 0.645 | 0.044 | 0.064 |
| NB_TO vs. NB_TF-IDF | 0.689 | 0.643 | 0.046 | 0.150 |

Processing steps:

| Comparison | $\bar{a}_1$ | $\bar{a}_2$ | $\bar{d}$ | p-value |
|---|---|---|---|---|
| SVM_TF-IDF_NONE vs. SVM_TF-IDF_3G | 0.820 | 0.820 | 0.000 | 1.000 |
| SVM_TF-IDF_STEM vs. SVM_TF-IDF_3G | 0.771 | 0.820 | 0.049 | 0.022 |
| SVM_TF-IDF_STEM vs. SVM_TF-IDF_NONE | 0.771 | 0.820 | 0.049 | 0.004 |
| NB_BIN_NONE vs. NB_BIN_3G | 0.706 | 0.692 | 0.014 | 0.454 |
| NB_BIN_STEM vs. NB_BIN_3G | 0.736 | 0.692 | 0.044 | 0.007 |
| NB_BIN_STEM vs. NB_BIN_NONE | 0.736 | 0.706 | 0.030 | 0.004 |

Comments: These tables display the results of the different assessments regarding term weighting schemes and processing steps. One can read the comparison column as $\bar{a}_1$ vs. $\bar{a}_2$, where a positive $\bar{d}$ value corresponds to $\bar{a}_1$ achieving higher accuracy than $\bar{a}_2$. One can interpret the p-value as indicating whether the true mean accuracy difference $\bar{d}$ is significantly different from zero.

Abbreviations: SVM = Support vector machine, NB = Naïve Bayes, NTF = Normalized term frequency, TF-IDF = Term frequency inverse document frequency, BIN = Binary, TO = Term occurrences, NONE = No processing step, 3G = 3-grams, STEM = Stemming


6.2.6 - Assessing candidate models

Having excluded and chosen term weighting schemes, processing steps and feature selection methods for our two classifiers, we trained the classifiers and assessed them on our test set. Recall that we decided to add a support vector machine with a radial basis kernel, as well as a linear support vector machine with no feature selection method. This gave us four setups - a linear support vector machine, a support vector machine with a radial basis kernel, a linear support vector machine with information gain for feature selection and a naïve Bayes with the chi-square statistic for feature selection. The processing setup for the support vector machines was the term frequency inverse document frequency weighting scheme with 3-grams and stopword removal (yielding 2,241 features), and for the naïve Bayes it was a binary weighting scheme with stemming (yielding 1,018 features). A ROC chart of the four classifiers is displayed in Figure 1.

Figure 1 - ROC chart of candidate models

From the ROC chart one can see that, especially in the beginning, the naïve Bayes

classifier with the chi-square statistic as the feature selection technique performed the

worst. The two linear support vector machines performed more or less equally,

whereas the support vector machine with the radial basis kernel performed the best.

The numerical assessment of the classifiers based on the undersampled training set is displayed in Table 4. We decided to proceed with the linear support vector machine because this model had the best F-measure. We recognized that the ROC assessment favoured the support vector machine with the radial


basis kernel, but this model was not very precise, and as we weighted the combination of recall and precision more heavily than the true positive rate, we decided to continue with the linear support vector machine.
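For reference, the F-measure used here is the harmonic mean of precision and recall; as a worked check, the linear support vector machine's values from Table 4 give

$F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2 \cdot 0.375 \cdot 0.824}{0.375 + 0.824} \approx 0.515$

which reproduces the F-measure reported for that model in Table 4.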

We then applied our oversampled training set, in order to see if we could enhance the performance of the linear support vector machine by adding more information in the form of the excess negative cases we had available. We compared the model trained on the oversampled training set with the model trained on the original undersampled training set. The ROC assessment of these two models is displayed in Figure 2.

Figure 2 - ROC chart of support vector machines with an under- and oversampled training set
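The oversampled training set can be thought of along the lines of the following hedged R sketch (object and label names are illustrative assumptions): all negative cases are kept, and positive cases are resampled with replacement until the two classes are balanced.

    # Random oversampling sketch; a data frame 'train' with a 'label' column is assumed.
    set.seed(1)                                     # for reproducibility
    pos <- train[train$label == "idea", ]
    neg <- train[train$label == "no_idea", ]
    pos_over   <- pos[sample(nrow(pos), nrow(neg), replace = TRUE), ]
    train_over <- rbind(neg, pos_over)              # balanced training set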

The above ROC assessment shows that the linear support vector machine with the

oversampled training set performed slightly better than the linear support vector

machine with the undersampled training set. We do not claim that the difference in

performance is large, but it is noteworthy. The numerical assessment of the undersampled and oversampled models is displayed in Table 4, together with the counts of true positives, false positives, false negatives and true negatives. The model performed only moderately on recall and precision. But as these two measures are balanced, the model achieved a relatively high F-measure, which is noteworthy compared to the earlier candidate models. Finally, we note that this classifier achieved an accuracy of 0.911, which is also the highest among the models we trained. We therefore selected the linear support vector machine with an oversampled training set as our final model.


Summary of results

The most important result of this section is that we trained a linear support vector machine with term frequency-inverse document frequency as the term weighting scheme and 3-grams with stopword removal as processing steps. We used no feature selection method, and we used oversampling. This particular model achieved F = 0.608, recall = 0.608, precision = 0.608 and accuracy = 0.911. All results are displayed in Table 4.

Table 4 - Results of candidate models' performance

Performance measure   SVM - Linear   SVM - RBF   SVM - Linear IG   NB - CHI   SVM - Over
F-measure             0.515          0.385       0.490             0.411      0.608
Recall                0.824          0.922       0.745             0.726      0.608
Precision             0.375          0.244       0.365             0.287      0.608
Accuracy              0.824          0.667       0.824             0.764      0.911
TP #                  42             47          38                37         31
FP #                  70             146         66                92         20
FN #                  9              4           13                14         20
TN #                  329            253         333               307        379

Abbreviations: SVM - Linear = Linear support vector machine with undersampled training set, SVM - RBF = Support vector machine with radial basis kernel and undersampled training set, SVM - Linear IG = Linear support vector machine with information gain as feature selection method and undersampled training set, NB - CHI = Naïve Bayes with chi-square statistic for feature selection and undersampled training set, SVM - Over = Linear support vector machine with oversampled training set, TP # = True positive count, FP # = False positive count, FN # = False negative count, TN # = True negative count

6.3 - Effect of seasonality on idea generation

This section is divided into five subsections. In the first, we describe the results of an exploratory analysis of the messages classified by our model. In the second, we describe how we created our regression dataset and how we handled missing data. In the third, we explore our regression variables and explain how we transformed the dependent variable. In the fourth, we define the regression model, and in the fifth, we assess the results of our regression analysis.

6.3.1 - Exploratory analysis

Table 5 shows the twenty most discriminative terms and n-grams, as well as four topics extracted from the pool of ideas and four topics extracted from the pool of non-ideas.


Table 5 - Twenty discriminative terms and four positive and four negative topics from the prediction set

Top 20 discriminative terms
1 to 5       6 to 10       11 to 15    16 to 20
would_be     that_would    it_would    tlg
lego         you_could     com         like_to_see
idea         i_would       ideas       to_see
could_be     see           etc         that_would_be
sets         nice          theme       be_a

Idea topics                                    Non-idea topics
Topic 1    Topic 2    Topic 3    Topic 4       Topic 1    Topic 2    Topic 3     Topic 4
Robotics   Lego       Color      Idea          Writing    Writing    Robotics    Lego
motor      lego       black      idea          set        wrote      dat         lego
sensor     brick      space      lego          brick      write      wrote       lugnet
robot      piec       piec       build         wrote      look       motor       site
control    color      white      wrote         lego       lego       file        post
lego       castl      reduc      post          write      time       dont        version

We see the same pattern in the discriminative terms as displayed earlier in Table 2, where n-grams such as "would_be" and "could_be" are the most discriminative terms. Regarding the results of the topic modelling, topic one of the idea topics is robotics. The robotics topic also appears among the non-ideas.

6.3.2 - Creating the dataset and handling missing data

In order to predict which posts contain an idea, we applied the linear support vector machine with the oversampled training set to our entire document collection. In the process of merging the predictions with the date information based on Message-ID, 624 messages were removed because they did not contain the information necessary for the merge3. This left us with 439,412 observations, which we collapsed by month and year, yielding a dataset of 206 observations from January 1995 to November 2012. We noted that the period from January 1995 to November 2012 spans more than 206 months, but in the start-up period of the forum there were several months with no activity. This was a problem we dealt with in the next step.

Besides the two variables YEAR and MONTH, the dataset contained a variable named ACTIVITY, which was a count of how many messages were posted in a given month and year. The dataset also contained a fourth variable named IDEA, which was a count of how many messages posted in a given month and year contained an idea. This resulted in a total of four variables in the initial dataset. An example of an observation in this dataset: 5,349 (ACTIVITY) posts were written during February (MONTH) 2002 (YEAR), and 184 (IDEA) of these posts were classified as containing an idea.

3 When we searched for the errors causing the 624 missing messages, we discovered that the structure of the .eml files is not consistent. Therefore the R routine created to extract the meta-information was not able to detect the Message-ID for the 624 missing posts.
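The collapsing step can be sketched in R as follows. The data frame 'msgs', with a DATE column and a PRED column of predicted classes, is an assumed structure for illustration, not the thesis's actual code.

    msgs$MONTH <- as.integer(format(msgs$DATE, "%m"))
    msgs$YEAR  <- as.integer(format(msgs$DATE, "%Y"))
    # One row per month/year: ACTIVITY counts all posts, IDEA counts predicted ideas
    monthly <- aggregate(data.frame(ACTIVITY = rep(1, nrow(msgs)),
                                    IDEA     = as.integer(msgs$PRED == "idea")),
                         by  = list(MONTH = msgs$MONTH, YEAR = msgs$YEAR),
                         FUN = sum)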

6.3.3 - Variable exploration and transformations

We started by exploring the YEAR variable. There were twelve observations for each year from 1996 to 2011, equivalent to twelve months per year. For 1995, there were only three observations. For 2012, there were eleven observations, because the messages were downloaded in November 2012. As we could not explain the low number of observations in 1995, we omitted this year from the analysis. Our argument for doing so was that so few observations for a given unit of analysis would be too sparse to model. For the years 1996, 1997 and 1998 there were only very few observations, and even fewer ideas, which meant that we also omitted these years from the analysis. This left us with 167 observations: twelve per year, evenly distributed between 1999 and 2011, plus the eleven observations for 2012. Figure 3 illustrates the activity and idea generation inside the forum between January 1999 and November 2012.

The histograms in Figure 4 display the distributions of ACTIVITY and IDEA. Both are dominated by low values, which we consider reasonable, especially in light of the relatively long period during which the forum can be considered inactive. As it is very reasonable to assume that there is a relationship between ACTIVITY and IDEA, we derived a variable that accounts for this relationship, the event rate per month (ERPM): the IDEA count for a given month within a given year as a ratio of the ACTIVITY count for the corresponding month and year. We define

ERPM as:

Equation 11 - Event rate per month

$\text{ERPM} = \frac{\text{IDEA}}{\text{ACTIVITY}}$ (11)


Figure 5 shows that ERPM is right-skewed and that the variability of ERPM within each year seems to grow over time. We decided to omit the years 2009, 2010, 2011 and 2012. Our argument for doing so was the assumption that the large variability within these years is a consequence of too few posts (i.e. the community is dead). This meant that instead of 167 observations from 1999 to 2012, there were 120 observations from 1999 to 2008. In order to normalize our dependent variable, we applied a logit transformation to ERPM. This gave us a new variable that we named

LN.ERPM, which we define as:

Equation 12 - Logit-transformed event rate per month

$\text{LN.ERPM} = \ln\!\left(\frac{\text{ERPM}}{1-\text{ERPM}}\right)$ (12)
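In R, equations (11) and (12) translate directly (continuing the illustrative 'monthly' data frame sketched above):

    monthly$ERPM    <- monthly$IDEA / monthly$ACTIVITY         # equation (11)
    monthly$LN.ERPM <- log(monthly$ERPM / (1 - monthly$ERPM))  # equation (12), logit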

Figure 6 shows that LN.ERPM is approximately normally distributed. As a final note, all descriptive statistics of our variables are displayed in Appendix D.

6.3.4 - Defining the model and assessing model assumptions

We created our regression model with YEAR and MONTH as predictor variables:

Equation 13 - Regression model

$y = a + b_1 \text{MONTH}_{\text{Jan}} + b_2 \text{MONTH}_{\text{Feb}} + \dots + b_{11} \text{MONTH}_{\text{Dec}} + b_{12} \text{YEAR}_{1999} + b_{13} \text{YEAR}_{2000} + \dots + b_{20} \text{YEAR}_{2008} + \epsilon$ (13)

From Figure 7 we see that the mean of LN.ERPM given month was stable, although January, February and July had higher means than the rest of the months. With regard to LN.ERPM for a given year, we noticed large fluctuations in the mean for 1999, 2006 and 2008. There was high variability in 2006, low variability in 2007 and then again large variability in 2008. From the two plots in Figure 8 we see that the residuals have approximately equal variance and are approximately normally distributed.


Figure 3 - Fluctuations in ACTIVITY and IDEA given YEAR

Figure 4 - Histograms of ACTIVITY and IDEA

Figure 5 - Histogram of ERPM and box plot of ERPM from 1999 to 2012


Figure 6 - Histogram of LN.ERPM

Figure 7 - Fluctuations in LN.ERPM given MONTH and YEAR

Figure 8 - Residuals plot and histogram of residual distribution


6.3.5 - Parameter estimates and goodness of fit

The results of the regression are shown in Table 6. We used March and 2005 as the baseline, meaning that the intercept can be interpreted as the predicted value of LN.ERPM when the month is March and the year is 2005.

Table 6 - Regression results

Coefficients   Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)    -3.588     0.131        -27.447   0.000     ***
MONTHJan        0.381     0.140          2.726   0.008     **
MONTHFeb        0.252     0.140          1.803   0.074     .
MONTHApr        0.074     0.140          0.529   0.597
MONTHMay        0.212     0.140          1.520   0.132
MONTHJun       -0.006     0.140         -0.046   0.963
MONTHJul        0.368     0.140          2.632   0.010     *
MONTHAug        0.200     0.140          1.429   0.156
MONTHSep        0.217     0.140          1.550   0.124
MONTHOct        0.126     0.140          0.899   0.371
MONTHNov        0.207     0.140          1.480   0.142
MONTHDec        0.180     0.140          1.287   0.201
YEAR1999        0.446     0.128          3.494   0.000     ***
YEAR2000       -0.031     0.128         -0.245   0.807
YEAR2001       -0.169     0.128         -1.325   0.188
YEAR2002       -0.059     0.128         -0.465   0.643
YEAR2003       -0.174     0.128         -1.361   0.177
YEAR2004        0.004     0.128          0.028   0.978
YEAR2006       -0.283     0.128         -2.220   0.029     *
YEAR2007       -0.268     0.128         -2.098   0.038     *
YEAR2008       -0.241     0.128         -1.886   0.062     .

Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.313 on 99 degrees of freedom
Multiple R²: 0.401
F-statistic: 3.376 on 20 and 99 DF, p-value: 0.0001

R² indicates that our model explains 0.401 of the overall variance in the dependent variable (p < 0.0001), which we consider a noteworthy amount. The coefficients of the months January, February and July are significant compared to the baseline. With regard to the years, the coefficients of 1999, 2006, 2007 and 2008 are significant when compared to the baseline.


Based on these observations, we make the assumption that Christmas and summer holidays have an effect on idea generation in this domain. The reason for this might be that people generate ideas when they get new toys and have time to play with them.


7 - Discussion & Conclusion

Main research question

• How are ideas generated in online communities and how can one detect these

ideas by applying text mining and machine learning?

Idea generation in online communities can be seen as a type of problem-solving process. Online communities allow large groups of people to interact, which sometimes results in novel ideas being generated. Ideas are the product of a creative process. This process requires the individual to go through several phases before the individual or group generates an idea or a solution. The creative ability of the individual is mainly determined by domain knowledge and intrinsic and extrinsic motivation, whereas the creative outcome, the idea, can take many shapes and so be quite difficult to assess.

Data created in online communities are typically unstructured. The data are often "big" in terms of volume, frequency and variety, necessitating the use of text mining and machine learning techniques. One can organize the unstructured data by means of a bag-of-words model, and in order to transform the unstructured textual data into structured data, it is necessary to perform a variety of processing steps (pruning, tokenization, stemming, the creation of n-grams and the choice of a term weighting scheme). When applying machine learning techniques, one needs to be aware of imbalance in the target variable, which can be addressed by random oversampling and/or random undersampling. The choice of feature selection method can influence performance and prevent overfitting. One will also have to choose an appropriate classification algorithm: the support vector machine provides a state-of-the-art algorithm, but naïve Bayes is also a reasonable alternative. To assess the performance of the trained classifier, one will have to decide which measures to assess. The F-measure in particular is a good performance measure for skewed datasets, but one might also apply accuracy, recall, precision and ROC assessment. A condensed sketch of this processing pipeline is given below.
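As an illustration of the pipeline just described, a minimal R sketch with the tm package might look as follows; 'messages' is an assumed character vector of raw posts, and n-gram creation and term weighting would be configured at the document-term-matrix stage.

    library(tm)
    corpus <- VCorpus(VectorSource(messages))               # raw forum messages
    corpus <- tm_map(corpus, content_transformer(tolower))  # normalize case
    corpus <- tm_map(corpus, removePunctuation)             # tokenization clean-up
    corpus <- tm_map(corpus, removeWords, stopwords("en"))  # stopword removal
    corpus <- tm_map(corpus, stemDocument)                  # stemming
    dtm <- DocumentTermMatrix(corpus)                       # bag-of-words matrix
    dtm <- removeSparseTerms(dtm, 0.99)                     # pruning of rare terms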

In order to create a training set for machine learning, one can capture idea generation by extracting messages from an online forum in textual format and have judges manually classify these messages as containing an idea or not. In our particular study, we managed to extract 337 ideas out of 2,998 cases with a reliability measure of


0.548 ± 0.056 at α = 0.05, which we turned into a classification model characterized by a linear support vector machine using term frequency-inverse document frequency as the term weighting scheme, 3-grams and stopword removal, as well as an oversampled training set. In our particular study, this resulted in an F-measure of 0.608, a recall of 0.608, a precision of 0.608 and an accuracy of 0.911 for our final model.

Secondary research question

• To what degree do seasonality and historical events influence idea generation

inside online communities?

From our regression study we learned that 0.401 of the variation in our target variable could be explained by seasonality and historical events. In particular, we noticed the significant deviations of January, February and July, implying that Christmas and summer holidays enhance idea generation in our Lego case. We also learned that the effect on idea generation of the years 1999, 2006, 2007 and 2008 was significantly different from the baseline. This implies that factors other than seasonality might have played a role in the forum's ability to create ideas over a period of years.

Implications of method for creating training set

Our approach to creating the training set gave us a reasonably sized and balanced training set. However, the method of building an intermediate model and using it to extract a higher number of positive cases has implications. The intermediate model we trained was developed from 12 positive cases and 288 negative cases, a rather small and skewed dataset. This may have created a bias towards a certain type of idea, which is a downside of our method. However, we did adjust for this problem, as we only used the intermediate model for extracting 1,500 cases, whereas the remaining 1,500 cases were chosen at random.

Another issue related to the training set is our decision to classify a message as a positive case if at least one judge had classified the message as an idea. Alternatively, we could have set the threshold higher by requiring both judges to agree. This would have yielded even higher skewness in our target variable and the problem of fewer event cases.


Business implications of study

From our study we learned that ideas created inside an online community can be detected through text mining and machine learning. People tend to use certain expressions (n-grams) when being creative. The fact that people write these expressions enables us to detect the ideas by the means we propose. We identified this pattern in our explorative studies by applying an information gain criterion to filter the twenty most discriminative terms. Many of these terms were n-grams. This confirms that n-grams are more informative than single words when one seeks to extract meaning from text. Reflecting upon this, we believe that the task of detecting ideas in online communities via machine learning is primarily a task of extracting semantics.

Our method will allow organizations to filter ideas from old as well as new data sources. For example, an organization with a corporate Facebook page would be able to download all the messages posted on its Facebook page and use our model to filter out the posts containing ideas. As discussed earlier, crowdsourcing often requires a specific software platform designed to allow a crowd to solve problems. Such a platform relies on the crowd to do the filtering, and as we demonstrated, having such a platform might not be necessary. Our model will allow organizations to utilize the wisdom of the crowd on other platforms, for example a corporate Facebook page or a message board like Lugnet.

If, for example, Lego would like to implement such a model, Lego would first need to train the model. Next, Lego would need to download the text content it would like to apply the model to and perform the text processing steps described in this thesis. Based on these two steps, one can apply the model to the new text data and filter the messages the model classifies as ideas. As the model is not perfect, it would be necessary to have human judges sort the ideas detected by the model. The primary task of the judges is to assess whether the content of the detected ideas is useful, and to which product development team within Lego the ideas might be useful. Doing this manual classification would also improve the model and keep it up to date, because one would then be able to retrain the model each time an idea is confirmed or disconfirmed by a human judge. A sketch of this deployment loop is given below.
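The following hedged R sketch reuses the illustrative objects from the earlier training sketch ('fit_linear', 'dtm'); note that a faithful deployment would reuse the training IDF weights rather than recompute them on the new collection, as done here for brevity.

    library(tm)
    new_corpus <- VCorpus(VectorSource(new_posts))           # assumed character vector
    new_dtm <- DocumentTermMatrix(new_corpus,
                 control = list(dictionary = Terms(dtm),     # align feature space
                                weighting  = weightTfIdf))
    pred <- predict(fit_linear, as.matrix(new_dtm))
    candidates <- new_posts[pred == "idea"]                  # hand these to human judges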

An organization utilizing crowdsourcing for new product development should consider the deviations in creativity due to seasonality and historical events. As we have shown, there are certain time-dependent events where people are more likely to discuss ideas in online communities. In a business context, this means that if Lego would like to utilize crowdsourcing for producing new ideas, it should take these events into account. As an example, if Lego follows a yearly product launch cycle, it should consider that its crowdsourcing community is very likely to produce more ideas around the summer holidays and Christmas. Assuming this is the case, it makes sense to schedule future product launches with this in mind. To give a specific example, the development of next year's Lego Christmas toy collection (if such a collection exists) should start in February, as the crowdsourcing community would have peaked in its production of ideas at this point.

Detecting ideas by means of text mining and machine learning

As a final personal note, it is our impression that ideas can be detected in online communities, as shown in this thesis. However, one might consider reframing the concept of our classification task as "detecting the creative process in online communities", because n-grams like "you could", "one could", "we could" etc. are expressions of individuals who are currently in the early stages of the creative process. Framing the task as "detecting ideas" emphasizes the final outcome more than the creative process, and we do not believe the final outcome is realistic to detect by the means applied in this thesis. We do, however, believe that detecting messages which are part of a creative process is very realistic, as shown in this thesis. For future research, this can help us discover where ideas are generated, as we consider it a reasonable assumption that if one can detect creative messages within a thread on e.g. Facebook, the thread is likely to contain one or several ideas.





Appendix A - Message view at www.lugnet.com


Source: “http://news.lugnet.com/dear-Lego/?n=10”

Comment: This document is shown in a regular web browser


Appendix B - Message in .eml format

Comment: This document is shown in regular mail software


Appendix C - Message in .txt format

Comment: This document is shown in a text editor


Appendix D - Descriptive statistics of regression data

Variable   n     mean      sd        median    trimmed   mad       min      max       range     skew   kurtosis   se
ACTIVITY   120   3495.95   2373.25   3322.50   3315.27   2989.66   338.00   9369.00   9031.00   0.50   -0.76      216.65
IDEA       120   114.59    88.28     98.00     105.74    94.89     4.00     420.00    416.00    0.87   0.40       8.06
ERPM       120   0.03      0.01      0.03      0.03      0.01      0.01     0.11      0.10      2.52   11.93      0.00
LN.ERPM    120   -3.48     0.37      -3.50     -3.48     0.31      -4.74    -2.07     2.67      0.17   2.02       0.03