Digital Prosumer - Identification of Personas through Intelligent Data Mining (Clustering)

Department of Information Systems and Computing

BSc (Hons) Information Systems (Business)

Academic Year 2013 – 2014

Digital Prosumer - Identification of Personas through Intelligent

Data Mining (Clustering)

Adebowale Nadi

1008089

A report submitted in partial fulfilment of the requirements for the degree of

Bachelor of Science

Brunel University Department of Information Systems and Computing

Uxbridge Middlesex

UB8 3PH United Kingdom

T: +44 1895 203397 F: +44 (0) 1895 251686

Abstract

The main objective of this paper is to explore the idea of prosumption and how the digital personhood data that we produce can be extracted, filtered, analysed and given back to us [prosumers] in a way that is commodifiable, thereby empowering citizens to utilize the data that they produce. One aspect of this hypothesis is the identification of personas through clustering, which is a facet of intelligent data analysis. The sole aim is to build a Persona Identification Application (PIA) whose purpose is to deduce personas from data stores.

In 2011 it was estimated that 274.2 million Americans were connected to the internet, leading to 81 billion minutes being spent on social networking sites and blogs. In the same year 117.6 million people accessed the internet via a mobile phone, accounting for $246 billion spent on online purchases (Palis, 2012). The renowned management consultancy firm Boston Consulting Group projects that the Internet economy will contribute $4.2 trillion to the G20's total GDP by 2016. This led co-author David Dein to emphasise that “If it were a national economy [internet economy], it would rank in the world’s top five, behind only the U.S., China, India, and Japan, and ahead of Germany,” (Dein, 2012). With the rise of the internet economy, coupled with the growing number of mobile devices connected to the internet, an unprecedented amount of data is being held. Intelligent data analysis is needed to isolate the key information and so produce personas that can later be traded on a futures market.

This paper will look at the rise of the internet economy coupled with the emergence of the digital prosumer. In addition, clustering will be examined in fine detail: the various clustering techniques that could be used in the proposed application are reviewed, along with the advantages and disadvantages of each, before deciding which method is appropriate for this project. Furthermore, this paper will detail the step-by-step implementation of the application, including all the design and requirements analysis that took place beforehand. Finally, a detailed evaluation will be explained and carried out, relaying the findings from the application and establishing whether, in fact, the application meets the aim in a coherent and comprehensible manner.

Acknowledgements

First and foremost I would like to take this opportunity to thank my Lord Jesus Christ for guiding me through this project and giving me the strength to complete this dissertation. I would also like to thank my Mum and Dad for their unwavering and unconditional support throughout my time working on this project. In addition, I would like to extend my sincere gratitude to all the people who helped, supported and assisted me in any way, shape or form in putting this dissertation together. (There are too many to name personally, but they know who they are.) Last but certainly not least, I would like to thank my supervisor Panos Louvieris and his assistant Natalie Clewley for all the support they gave me throughout this project. This dissertation was, no doubt, the biggest challenge I have faced in all my 19 years in education, but definitely the most rewarding: I learned a highly complex topic (data mining) and learned to code in a completely new software environment with no prior experience. I truly would not have been able to complete it without their guidance, assistance and motivation. In closing, I would like to wish Panos and his team the best of luck in completing their EPSRC-sponsored project, Digital Personhood: Digital Prosumer.

Total Words: 15,500

I certify that the work presented in the dissertation is my own unless referenced.

Signature Adebowale Olatunde Nadi

Date 24/03/2014

Table of Contents

Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
1 Introduction
1.1 Problem Definition
1.2 Aims and Objectives
1.3 Project Approach
1.4 Dissertation Outline
2 Literature Review
2.1 Personal Data
2.2 Value of Personal Data
2.3 The Internet [Digital] Economy
2.3.1 Midata
2.3.2 Information Economy Strategy (IES)
2.4 What is a Persona?
2.5 What is a Prosumer?
2.5.1 The Rise of the Digital Prosumer
2.6 Data Mining
2.6.1 Knowledge Discovery from Data [KDD]
2.7 Cluster Analysis
2.7.1 Partitioning Technique
2.7.2 Advantages and Disadvantages
2.7.3 Hierarchical Technique
2.7.4 Advantages and Disadvantages
2.8 Critical Discussion
2.9 Summary
3 Methodology
3.1 Design Science
3.2 Positivist Approach (Positivism)
3.3 Interpretive Approach
3.4 Critical Discussion
3.5 Software Development Lifecycle Models
3.5.1 Rapid Application Development (RAD)
3.5.2 Analysis
3.6 Waterfall Model
3.7 Analysis
3.8 User Interface Evaluation
3.8.1 Nielsen Heuristics
3.8.2 Advantages and Disadvantages
3.9 Critical Discussion
3.9.1 Cognitive Walkthrough
3.10 Critical Discussion
3.11 Summary
4 Requirements Analysis and Design
4.1 Customer Requirements
4.2 Functional Requirements
4.3 Non-Functional Requirements
4.4 Requirements Summary
4.5 Design
4.6 Activity Diagram
4.7 Use Case
Summary
5 Implementation
5.1 Software Environment - R
5.2 Software Environment - MatLab
5.3 Persona Identification Application Implementation
5.3.1 Application Coding Screenshots
5.3.2 Application Interface Screenshots
5.4 Assumptions
5.5 Summary
6 Results and Evaluation
6.1 Data Pre-Processing
6.2 Results Summary
6.3 Evaluation
6.3.1 Participant selection
6.4 Black-Box Testing
6.5 Evaluation Results
6.6 Black Box Testing Results
6.7 Evaluation Summary
7 Conclusion
7.1.1 Aim - Identify individual personas from prosumers personal information.
7.1.2 Objective 1 - Undertake a state-of-the-art literature review to inform, create a design specification for an identifying personas/Investigate in greater detail the pros and cons of clustering with reference to appropriate literature
7.1.3 Objective 2 - Build a persona identification application.
7.1.4 Objective 3 - Evaluate the application.
7.2 Future Development
Appendix A Personal Reflection
A.1 Reflection on Project
A.2 Personal Reflection
Bibliography
A.3 Appendices
A.4 Appendices

List of Tables

Table 1 - User Requirements
Table 2 - Functional Requirements
Table 3 - Non-Functional Requirements
Table 4 - Use Case Narrative

List of Figures

Figure 1 - Fayyad KDD representation
Figure 2 - Example of a word sorting dendrogram output from: http://www.macs.hw.ac.uk/texturelab/people/thomas-methven/
Figure 3 - Design Science Guideline from MIS Quarterly Research Essay.
Figure 4 - The Engineering Cycle
Figure 5 - Epistemological Assumptions for Qualitative and Quantitative Research from http://dstraub.cis.gsu.edu:88/quant/2philo.asp
Figure 6 - RAD Diagram
Figure 7 - Waterfall Model
Figure 8 - Activity Diagram of Persona Identification Application
Figure 9 - Use Case Diagram of Persona Identification Application
Figure 10 - Import csv file plus description
Figure 11 - Choose variables plus description
Figure 12 - Standardize data and run k-means plus description
Figure 13 - Choose K function plus description
Figure 14 - Show analysis results plus description
Figure 15 - Download results csv file plus description
Figure 16 - Screenshot of Persona Application Interface 1.0
Figure 17 - Screenshot of Persona Identification Application 2.0
Figure 18 - Evidence of data pre-processing Results
Figure 19 - Screenshot of results out CSV file
Figure 20 - Identifying Personas Breakdown
Figure 21 - Percentage Calculator Example
Figure 22 - Persona Percentage Results (Test 1)
Figure 23 - Persona Percentage Results (Test 2)
Figure 24 - System Usability Questionnaire
Figure 25 - Graph showing the optimum number of evaluators
Figure 26 - Functional Test Questionnaire
Figure 27 - Table of Usability Questionnaire Results
Figure 28 - Bar Chart of Usability Questionnaire Results
Figure 29 - Bar Chart showing average usability questionnaire results
Figure 30 - Results of System Functionality Questionnaire

1 Introduction

This dissertation looks at the digital prosumer, concentrating in particular on the identification of personas from whole prosumer data stores, which can be used as valuable commodities to sell on the ‘futures’ market. I plan to do this by identifying specific personas from a digital vault of prosumer personal information using intelligent data analysis, in this case clustering. During the course of this dissertation I expect to isolate, analyze and categorize raw prosumer data and present it in a way where I can link it to a persona. I also expect to find the best clustering technique through an extensive literature review, analyzing the advantages and disadvantages of each selected method before coming to a conclusion on the best technique to use. I will also develop a persona identification application, which will be used to analyze the data and group it into clusters that can then be classified into personas. Finally, I will undertake a comprehensive evaluation of the application to gauge its overall effectiveness.

1.1 Problem Definition

Personal data can generate unprecedented economic and social value for governments,

organizations and individuals in many ways. By 2020 it is estimated that more than 50 billion

devices may be connected to the Internet (Nagel, 2013) and more than 40 times as many

personal data records stored. With the large amounts of data collected from prosumers,

smarter data mining techniques need to be employed to efficiently analyze the data and

identify personas for which data can be traded on a data exchange.

Data mining is the search for valuable information within large volumes of data by

systematically exploring underlying patterns, trends, and relationships hidden in available

data. Data mining techniques can generally be categorized into: (i) classification and

prediction; (ii) clustering; (iii) outlier prediction; (iv) association rules; (v) sequence

analysis; (vi) time series analysis; and (vii) text mining.
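As a minimal illustration of clustering, the second category above, the sketch below groups a handful of made-up prosumer records with k-means after standardizing each variable. It is written in Python purely for illustration; the application described in this dissertation is built in R, and the variable names and values used here (hours online per week, monthly online spend) are hypothetical, as is the choice to seed the centroids with the first k records.

```python
import math


def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 for a constant column to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def kmeans(points, k, iterations=100):
    """Plain Lloyd-style k-means; the first k points seed the centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids


# Hypothetical records: [hours online per week, monthly online spend]
data = [
    [2.0, 20.0], [3.0, 25.0], [2.5, 22.0],          # light users
    [30.0, 400.0], [28.0, 380.0], [32.0, 420.0],    # heavy users
]

labels, centroids = kmeans(standardize(data), k=2)
print(labels)
```

Each resulting cluster label would then need to be interpreted by hand, for example by inspecting the cluster centroids, before it could be described as a persona.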

1.2 Aims and Objectives

The aim of this project is to identify individual personas from prosumers' personal information stored in a digital vault using an intelligent data analysis technique, clustering. To help me achieve this aim I have set out a list of objectives that will help develop the body of this dissertation as well as assist me in determining whether the project aim has been successfully satisfied.

• Undertake a state-of-the-art literature review to inform and create a design specification for identifying personas from digital personhood data using intelligent data analysis techniques (clustering).

• Investigate in greater detail the pros and cons of clustering with reference to appropriate literature.

• Build a persona identification application (e.g. using MatLab or R).

• Evaluate the application.

1.3 Project Approach

In order to successfully complete this project I have adopted a five-step approach. At each stage I will set a group of deliverables that will help me achieve my aims and objectives and complete this project on time.

The first step will be to conduct a state-of-the-art literature review. This review will look at different cluster analysis techniques from a variety of physical and online sources. This will inform the design of my application, which is the cornerstone of this project. In addition I will look at what has already been done in terms of cluster analysis and try to synthesize that information and relate it back to my project. The second step will be to look at different methodological principles and models, picking the most appropriate method for this project with appropriate reference to the literature. Selecting the right methodology is pivotal to the success of this project. The third stage will be to analyse the user requirements, discuss the design of my application and evaluate the GUI. Once this has been discussed and illustrated, I will proceed to code my application, which will be done in R-Studio. The fourth stage will be to ascertain the results of the application and try to find personas in the clustered dataset. The way I went about deciphering the information and deducing personas will be shown and explained at this stage. The final stage of this project will involve evaluating the application and the project as a whole. This will be coupled with a personal reflection on my experience of putting this project together.

1.4 Dissertation Outline

Chapter 2: Literature Review - This chapter will look into previous literature that will help me gain a deeper understanding of my research problem. It will subsequently help inform the design of my application.

Chapter 3: Methodology - This chapter will look at different methodological principles as well as software development lifecycle (SDLC) models, critically discussing the strengths and weaknesses of each before isolating the principle and SDLC that are most appropriate for my project.

Chapter 4: Requirement Analysis and Design - This chapter will look at the requirements of the application set out by the user and analyze the functional and non-functional requirements. In addition I will go through the design process of my application and how I intend to put it all together.

Chapter 5: Implementation - This chapter will demonstrate the coding of the logic of my application in R and the coding of the interface using R-Shiny. I will include fully annotated screenshots as evidence of implementation.

Chapter 6: Results and Evaluation - This chapter will show the results of the application as well as how I went about deducing personas from them. I will also evaluate the application to see whether it has met the aims and objectives set out at the beginning.

Chapter 7: Conclusion - This chapter will draw conclusions from all the findings of this project. I will conclude on my aim as well as all 3 of my objectives. In addition I will evaluate my application from a subjective point of view, as well as the project in its entirety. I will also suggest future work to improve my application.

2 Literature Review

In this chapter I will discuss and review the different clustering methodologies available, analyzing the advantages and disadvantages of each technique with reference to the appropriate literature. This, along with personal evaluation, will help me conclude which technique is the most appropriate for this project and give me adequate justification for that choice. In addition, I will look in further detail at what personal data is, as well as how it has metamorphosed into an increasingly important aspect of economic growth and corporate supremacy, consequently giving rise to a new breed of prosumer, the digital prosumer.

2.1 Personal Data

If we look at the European Data Protection Directive [Article 2] we see that personal data is defined “by reference to whether information relates to an identified or identifiable individual” (Information Commissioner Office, 2010). In other words, personal data is any piece of information that can be used to identify an individual or an individual characteristic. The Data Protection Act of 1998 adds a different dimension to the EDPD definition of ‘data’ by taking into account the way the information was processed before it can be regarded as data, e.g. processed automatically or processed non-automatically. The EDPD and the Data Protection Act share a common view of what personal data/information is:

- Information processed, or intended to be processed, wholly or partly by

automatic means (that is, information in electronic form) (ICO, 2010)

- Information processed in a non-automated manner which forms part of, or is

intended to form part of, a ‘filing system’ (that is, manual information in a

filing system) (ICO, 2010)

2.2 Value of Personal Data

Personal information is an increasingly important asset in the twenty-first century, in terms of corporate monetary value, government efficiency and economic prowess. Accordingly, companies around the world have begun investing heavily in software that facilitates the collation of consumer data (Schwartz, 2003). It is estimated that people across the world send 10 billion text messages every day and make 1 billion posts to blogs or social media sites, leading to the emergence of a new type of economy: the Internet economy. It is estimated that the Internet economy within the G20 amounted to $2.3 trillion, or 4.1% of total GDP, in 2010 (Group, 2012).

2.3 The Internet [Digital] Economy

Sometimes called the digital or web economy, the Internet economy is a concept based on digital technologies fusing with the traditional economy. The term was first established by Don Tapscott in his critically acclaimed book ‘The Digital Economy: Promise and Peril in the Age of Networked Intelligence’, and it is widely believed that the Internet economy is positioning itself as the new cornerstone for any emerging or established economy (Tapscott, 1997). This is evident from the recent figures released by the Boston Consulting Group in their Digital Manifesto report, which states that the value of the Internet economy is currently larger than the economies of countries like Brazil and Italy, and that by 2016 its value is expected to double to $4.2 trillion. The report also goes on to say that “no company or country can afford to ignore this [Internet economy] phenomenon” (David Dean, 2012). The rise in the amount of data being produced is strongly linked to innovation in mobile technology since the turn of the millennium, which allows more devices than ever to connect to the Internet. Steve Wojtowecz, Vice President of storage software development at IBM, stated that by 2015 over a trillion devices would be connected to the Internet (King, 2011). As a consequence, the UK government has started two initiatives, Midata and the Information Economy Strategy (IES), to give prosumers improved access to the personal data that companies hold about them (BIS, 2011).

2.3.1 Midata

These are the key principles [aims] of the Midata initiative outlined in its government report:

(Department for Business, Innovation & Skills , 2013)

- Get more private sector businesses to release personal data to consumers

electronically

- Make sure consumers can access their own data securely

- Encourage businesses to develop applications (apps) that will help

consumers make effective use of their data

2.3.2 Information Economy Strategy (IES)

These are the key principles [aims] of the IES project outlined in its government report:

(Department for Business, Innovation and Skills, 2013)

- A strong, innovative, information economy sector exporting UK excellence to the

world

- UK businesses and organizations, especially small and medium enterprises

(SMEs), confidently using technology, able to trade online, seizing technological

opportunities and increasing revenues in domestic and international markets

- Citizens with the capability and confidence to make the most of the digital age

and benefiting from excellent digital services.

Long-term success will be underpinned by:

- A highly skilled digital workforce (whether specialists who create and develop

information technologies, or non-specialists who use them)

- The digital infrastructure (both physical and regulatory) and the framework for

cyber security and privacy necessary to support growth, innovation and

excellence. (Department for Business, Innovation and Skills, 2013)

It is important to remember that both of these government initiatives are being reinforced by reviews of, and changes to, legislation such as the Data Protection Act, the Consumer Rights Bill [at both UK and EU level] and the Enterprise and Regulatory Reform Act 2013. The reason is that this legislation will require companies to disclose customers’ personal data to them if the companies opt not to do so voluntarily. (Department for Business, Innovation & Skills , 2013)

2.4 What is a Persona?

Typically used as a tool in marketing and human-centered design [HCD], personas are hypothesized groups of users that show similar behavioral patterns in their use of technology, lifestyle decisions, customer service preferences and purchasing decisions. Angus Jenkinson first came up with a top-down analytical approach in his 1994 journal article Beyond Segmentation; it works by ‘grouping’, focusing on a synthetic clustering process that leads to ‘customer communities’ and to the creation and preservation of loyalty within these communities (Jenkinson, 1994). This concept was later refined by Alan Cooper in his pioneering book The Inmates Are Running the Asylum, in which Cooper creates the actual concept of the ‘persona’ that is used today to identify customers’ relative behavior and consumption patterns (Cooper, 1998).

2.5 What is a Prosumer?

Alvin Toffler is widely considered to be the creator of the concept of prosumption. He defines prosumers in his book ‘The Third Wave’ as people who “produce some of the goods and services entering their own consumption” (Toffler, 1980) (Kotler, 1986). In other words, people who produce and consume their own products and services are prosumers. In the 21st

century the prosumer has become more and more prominent, replacing the traditional consumer of the Industrial Age. This lends credence to Toffler’s own prediction that, as society moves towards the Post-Industrial Age, the number of pure consumers will decline, being replaced by “prosumers” (Toffler, 1980).

2.5.1 The Rise of the Digital Prosumer

Consequently, as we delve deeper into the Information Age and the Internet economy continues to evolve into an economic juggernaut, a new type of prosumer has emerged: the digital prosumer. The digital prosumer is a person who creates and consumes his or her own data. As of today, the biggest beneficiaries of the personal data produced are the so-called big three data companies, Google, Facebook and Twitter, which make upwards of $1200 from a user profile (Madrigal, 2012).

2.6 Data Mining

Data mining is the iterative process of extracting or “mining” knowledge from large data stores, which can then be put into perspective and turned into useful information. Data mining is thought to involve six common classes of task that lead to prediction and description, which are among the primary goals of data mining: (Wikipedia, 2011) (Kamber, 2006)

• Classification – is learning a function that classifies a single data item into one of

several predefined classes. Examples of classifications techniques:

- Bayesian classifiers

- K-nearest neighbor

- Linear classifiers

• Regression – is learning a function that maps a data item to a prediction variable.

In other words, regression estimates the relationship between a dependent variable and one or more other variables.

Some examples of regression models are:

- Percentage regression

- Bayesian linear regression

- Nonparametric regression

• Clustering – is a descriptive task that aims to identify clusters or

categories that describe the data. Examples of clustering techniques are:

- Hierarchical

- Partitioning

- Density-Based

- Centroid-Based

• Summarization – is a method for finding a cohesive description of a data set, this

includes analytical representation such as visualization and report generation

• Dependency modeling – is a method that consists of finding a model that depicts

significant dependencies between variables

• Change and deviation detection – is a method that focuses on finding the most

significant changes from previously measured data. (Usama Fayyad, 2008)

2.6.1 Knowledge Discovery from Data [KDD]

KDD is often mistaken for data mining itself; in fact, data mining is one essential step within the wider knowledge discovery process. Usama Fayyad proposed the KDD methodology in 1995 with the purpose of making the data produced by companies useful to their business needs. (Deutsch, 2010)

Figure 1 - Fayyad KDD representation

Knowledge discovery is an iterative sequence of steps, which consists of: (Kamber, 2006)

• Data Cleaning – to remove noise and inconsistent data

• Data Integration – where multiple data sources may be combined

• Data Selection - where data relevant to the analysis task are retrieved from the

database

• Data Transformation - where data are transformed or consolidated into forms

appropriate for mining

• Data Mining – an essential process where intelligent methods are applied in order to

extract data patterns

• Pattern Evaluation – to identify the truly interesting patterns representing

knowledge

• Knowledge Presentation – where visualization and knowledge representation are

used to present the finished knowledge to the user
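
The seven KDD steps listed above can be read as a simple pipeline in which the output of one step feeds the next. The following is only a minimal sketch in Python (the application in this project is written in R); the record fields, the two data sources, the threshold rule used in place of a real mining algorithm and all function names are hypothetical and are not taken from the project dataset.

```python
from collections import defaultdict

# Hypothetical raw records from two sources; None marks a missing (noisy) value.
SOURCE_A = [
    {"household": 1, "spend": 42.0, "items": 10},
    {"household": 2, "spend": None, "items": 7},
]
SOURCE_B = [
    {"household": 3, "spend": 120.0, "items": 35},
    {"household": 4, "spend": 38.0, "items": 9},
]


def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]


def select(records, fields):
    """Data selection: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records):
    """Data transformation: derive a mining-ready feature (spend per item)."""
    return [
        {"household": r["household"], "spend_per_item": r["spend"] / r["items"]}
        for r in records
    ]


def mine(records, threshold):
    """Data mining (placeholder rule): group households by a threshold."""
    groups = defaultdict(list)
    for r in records:
        label = "high" if r["spend_per_item"] >= threshold else "low"
        groups[label].append(r["household"])
    return dict(groups)


def evaluate(groups, min_size):
    """Pattern evaluation: keep only groups with enough members."""
    return {label: m for label, m in groups.items() if len(m) >= min_size}


def present(groups):
    """Knowledge presentation: format the groups as readable lines."""
    return [f"{label}: households {m}" for label, m in sorted(groups.items())]


def run_kdd(threshold=4.0):
    """Run the steps in the order given in the list above."""
    data = integrate(clean(SOURCE_A), clean(SOURCE_B))
    data = select(data, ["household", "spend", "items"])
    data = transform(data)
    return present(evaluate(mine(data, threshold), min_size=1))


if __name__ == "__main__":
    for line in run_kdd():
        print(line)
```

In practice the process is iterative, so a poor result at the evaluation step would send the analyst back to an earlier step (for example, selecting different fields) rather than straight to presentation.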

2.7 Cluster Analysis

Cluster analysis can be defined as the process of grouping a set of physical or abstract objects into classes of similar objects. In other words, a cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to objects in other clusters. An advantage of cluster analysis is that it can single out useful features that define the characteristics of different groups, which, in turn, will help me in my aim of identifying personas from prosumer data (Kamber, 2006). There are various cluster analysis techniques, such as Partitioning, Hierarchical (Agglomerative and Divisive) and the Single Link Method (Raza Ali, 2004).

2.7.1 Partitioning Technique

Partitioning methods relocate data objects from one cluster to another, starting from an initial partitioning. These methods also require the number of clusters to be pre-set by the user. It is commonly cited that achieving global optimality in this type of clustering requires an exhaustive enumeration of all possible partitions. Because this is rarely feasible, most applications instead choose one of two popular heuristic algorithms, K-means and K-medoids (Kamber, 2006):

• K-Means Algorithm

K-means enables the user to mine data by representing each of the K clusters

by the mean value (centroid) of the objects present in the cluster

• K-Medoids Algorithm

K-medoids, on the other hand, represents each cluster

by one of the objects located nearest to the center of the cluster.
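
To make the difference between the two representations concrete, the following is a minimal sketch in plain Python (the application itself is written in R) of the standard k-means iteration, together with a helper that picks the medoid of a cluster. The sample points, the initial centroids and the function names are hypothetical, chosen only for illustration; a real application would normally rely on a library implementation.

```python
import math


def mean_point(points):
    """Return the coordinate-wise mean (centroid) of a list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def medoid(points):
    """Return the cluster member with the smallest total distance to the rest."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))


def k_means(points, initial_centroids, max_iter=100):
    """Basic k-means; K is the number of initial centroids supplied."""
    centroids = list(initial_centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no change, so the algorithm has converged
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical two-dimensional data forming two visible groups.
sample = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(sample, initial_centroids=[(1, 1), (8, 8)])
first_medoid = medoid(clusters[0])
```

Note that the centroid returned by k-means is usually not one of the original data objects, whereas the medoid always is.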

2.7.2 Advantages and Disadvantages

The K-means technique has advantages as well as disadvantages. One of the main advantages is that k-means works well for finding spherical-shaped clusters within small to medium-sized data stores. Another advantage is that the method tends to produce tighter, more compact clusters than, say, hierarchical clustering. (Lior Rokach, 2010)

However, there are also disadvantages to this technique, one being that it is limited in the type of cluster structure it can be applied to. The effectiveness of the algorithm is predicated on spherical (sometimes called globular) clusters, as this shape places the mean value close to the center of the cluster. Consequently, clusters of differing sizes, or very large datasets, won’t work

well with this algorithm. Another disadvantage is that the algorithm is very sensitive to noisy data and outliers, which can increase the squared error significantly. In addition, the user is required to know the number of clusters beforehand, and determining this can be a very tedious task. (Improved Outcomes Software (ios), 2009)
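
The sensitivity to outliers can be illustrated with a small calculation. The sketch below, in plain Python with hypothetical points, computes the mean of a tight cluster and its sum of squared errors (SSE), then repeats the calculation after a single outlier is added. It is an illustration of the effect only, not a measurement from the project dataset.

```python
import math


def mean_point(points):
    """Return the coordinate-wise mean (centroid) of a list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def sse(points, center):
    """Sum of squared Euclidean distances from each point to the center."""
    return sum(math.dist(p, center) ** 2 for p in points)


# Hypothetical tight cluster, then the same cluster with one outlier added.
tight_cluster = [(1, 1), (1, 2), (2, 1), (2, 2)]
with_outlier = tight_cluster + [(20, 20)]

clean_center = mean_point(tight_cluster)  # sits in the middle of the four points
noisy_center = mean_point(with_outlier)   # dragged towards the outlier

clean_sse = sse(tight_cluster, clean_center)
noisy_sse = sse(with_outlier, noisy_center)

if __name__ == "__main__":
    print("center without outlier:", clean_center, "SSE:", clean_sse)
    print("center with outlier:   ", noisy_center, "SSE:", noisy_sse)
```

A single distant point both moves the mean away from the bulk of the cluster and inflates the squared error, which is why noisy data is usually cleaned before k-means is applied.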

2.7.3 Hierarchical Technique

Hierarchical methods aim to create a hierarchical decomposition of the given set of data objects. These methods can be divided into two techniques: Agglomerative and Divisive. The agglomerative method, also called the bottom-up approach, starts with each data object forming a separate group; the clusters are then successively merged until the desired cluster structure is achieved. The divisive method, also called the top-down approach, starts with all the data objects in the same cluster, which is then partitioned into sub-clusters, which in turn are partitioned into further sub-clusters. This sequential process is repeated until the desired cluster structure is obtained. One of the intriguing things about hierarchical clustering is that it provides a readable visual of the algorithm’s output called a dendrogram. This is a useful summarization tool that makes hierarchical clustering extremely popular. (Lior Rokach, 2010)

Figure 2 - Example of a word sorting dendrogram output from: http://www.macs.hw.ac.uk/texturelab/people/thomas-methven/
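
The bottom-up (agglomerative) procedure described above can be sketched in a few lines. The following plain-Python example uses the single-link distance between clusters (the smallest distance between any two of their members) and records each merge, which is the information a dendrogram is drawn from. The sample points and function names are hypothetical, and the naive search over all cluster pairs is chosen for clarity rather than efficiency.

```python
import math


def single_link(cluster_a, cluster_b):
    """Single-link distance: the closest pair of points across two clusters."""
    return min(math.dist(a, b) for a in cluster_a for b in cluster_b)


def agglomerative(points, target_clusters):
    """Merge the two closest clusters until target_clusters remain."""
    clusters = [[p] for p in points]  # start with every object on its own
    merges = []                       # (left, right, distance) for each merge
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters, merges


# Hypothetical two-dimensional data.
sample = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
three_clusters, merge_log = agglomerative(sample, target_clusters=3)
two_clusters, full_log = agglomerative(sample, target_clusters=2)
```

Cutting the merge sequence at different points yields different partitions of the same data, here three clusters or two. A merge, once recorded, is never revisited by the procedure.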

2.7.4 Advantages and Disadvantages

It is important to remember that hierarchical techniques have advantages as well as disadvantages. One of the advantages is versatility; methods like single-link maintain a strong performance on datasets containing well-separated, chain-like and concentric clusters. Another advantage of hierarchical methods is that they produce multiple partitions, which is particularly useful for users who want to choose different

partitions from those already nested in the overall cluster according to the desired similarity

level chosen by the user.

On the other hand, the disadvantages of this technique are quite evident. Hierarchical algorithms are notorious for their inability to scale well; they are also associated with high I/O costs when clustering a large number of objects. Another disadvantage of the hierarchical technique is its rigidity: once a step in the sequence is done, it can never be undone or modified. (Lior Rokach, 2010)

2.8 Critical Discussion

Having reviewed the advantages and disadvantages of the hierarchical and partitioning techniques, it is important to analyze both in relation to this project in order to identify the most appropriate technique for clustering. From my research I can see that partitioning clustering works better on small data sets than on bigger ones, and the dataset used in this project is fairly large, containing data from the weekly shop of 2,500 households. Partitioning clustering also produces tighter, more cohesive clusters through its k-means algorithm, which makes it easier to identify the key features within a cluster, which in turn define persona characteristics. On the other hand, the algorithm is sensitive to noisy data and requires users to know the number of clusters in advance, which is nearly impossible with a database of this size.

Looking at the other side of the coin, the hierarchical technique is very versatile, offering different methods such as single link, complete link and average link, which consequently deliver separate clusters. I believe this will work well in this project, as it will aid in presenting personas from the dataset provided. In addition, the hierarchical family includes algorithms designed to ensure cluster quality, such as Chameleon, which will help ensure that the personas defined are validated. On the other hand, the hierarchical technique is very rigid, so if an erroneous decision occurs it is nearly impossible to correct. This is a big disadvantage for this project, as identifying personas needs a great deal of flexibility because the parameters for personas can change at any given time.

In light of all the information reviewed, it is fair to say that both techniques offer a number of advantages and disadvantages; in order to obtain the best and most concise results, I believe consensus clustering would be the best option. However, due to time constraints and a lack of expertise in coding, I have decided to use the K-Means algorithm to provide the logic for my application. I intend to then build an interface that simplifies the steps of the K-Means algorithm and presents them in a way that is easy for the user to administer. The choice of

which software environment I will use to code the interface as well as the justifications for it

will be made in Chapter 5.

2.9 Summary

In this chapter I have discussed personal data and its value, and I have looked into the definition of personas along with the rise of the prosumer and the Internet economy. Furthermore, I have discussed in detail what cluster analysis is, looking in particular at two clustering techniques (Hierarchical and Partitioning) and offering an in-depth critical discussion of the technique I have chosen to take forward into my application. The findings of this chapter will further equip me to meet the aims and objectives set out for this project. In addition, they will assist me in constructing a design specification for my application.

3 Methodology

This chapter explores different research methodologies and justifies the methodology chosen for this project. The three methods in question are Design Science, Positivist and Interpretive. The methodology I have decided to use is the design science approach. The justification will be supported by reference to the appropriate literature, as well as by a personal analysis of the different approaches.

3.1 Design Science

As previously mentioned, the design science approach is my chosen methodology for this project. Design science, simply put, is the methodical form of designing, or research design. First established by American inventor Richard Buckminster Fuller in 1963, the concept of design science was further developed by Gregory in his 1966 book “The Design Method”, in which he demarcates the relationship between design method and scientific method. He further emphasizes his view that design is not inherently a science and that the term design science pertains to the scientific study of design. As technology continued to evolve at the turn of the century, design science became more integrated into information systems research and software design projects. In 2004 Alan Hevner produced a seven-guideline framework with the aim of assisting information systems researchers to conduct, evaluate and present design-science research. (Alan R. Hevner, 2004)

Figure 3 - Design Science Guideline from MIS Quarterly Research Essay.

This framework was later refined by Peffers in order to explain how the regulative cycle fits into the design science research framework.

Figure 4 - The Engineering Cycle

This framework is widely used today by information systems researchers, as it gives them a means to analyze and decipher an existing problem and offer a solution design or solution hypothesis. They can then assess whether the solution or hypothesis is effective or meets the specified criteria; this can be done through a pilot scheme or prototyping, after which the full implementation can take place (Roel Wieringa, 2010). In my opinion this principle suits my project best, as I aim to design a software solution (a clustering program), build it, and then evaluate the effectiveness of the solution.

3.2 Positivist Approach (Positivism)

The positivist approach is a methodology based on an objective hypothesis, formed through introspection or intuition, that is validated or disproved by scientific testing and experimentation (Sage Publications, 2009). In other words, a positivist approach starts with a hypothesis about a subject area and then goes on to prove or discredit that hypothesis by experimentation or by building a solution (University of the West of England, 2007). The origins of the method lie with the sociologist Auguste Comte, who coined and developed the term in the early 19th century. Today the positivist approach is used increasingly in IS and software engineering projects (Sociology Guide, 2008). One advantage of the positivist approach is that it relies heavily on quantitative rather than qualitative data; quantitative data is seen as more scientific and thus a more reliable basis for hypotheses. Another advantage is that it follows a very stringent structure, as positivists believe that there are guidelines in place that need to be adhered to, which as a consequence should minimize room for error. Positivists believe that this reduced room for error makes the whole approach more accurate when it comes to experiments and applications. However, on the other hand, there

are drawbacks to the approach, one of them being human behavior. Positivists strongly believe in objective assumptions; however, there is no guarantee that bias or subjective analysis won’t corrupt the study. (Johnson, 2010) (Wikipedia, 2014)

Figure 5- Epistemological Assumptions for Qualitative and Quantitative Research from

http://dstraub.cis.gsu.edu:88/quant/2philo.asp

3.3 Interpretive Approach

The interpretive approach is a qualitative research method that is based on subjective assumptions, with knowledge derived from value-laden, socially constructed interpretations (Packer, 2007). In stark contrast to the positivist approach, interpretivist researchers aim to understand and interpret human behavior as opposed to generalizing and predicting cause and effect. The impact this has on information systems and software design projects is that, once the scope of the project has been defined, the researcher will ask several open-ended questions, generally through questionnaires or unstructured/semi-structured interviews and sometimes observations, to gather as much primary information as possible (WordPress, 2012). This approach also enables the researcher to remain open to new ideas throughout the duration of the project, as opposed to positivists, who believe in pre-ordained rules and guidelines. With that being said, there are advantages as well as disadvantages to this approach. One advantage is that the research methodology is highly qualitative, meaning that the data gathered will be more in-depth. However, a drawback is that interpretivists take a subjective view of the project, which can lead to bias getting in the way of ascertaining the correct results or the best methods to apply in completing the project. (Institute of Public & International Affairs, 2009) (Slideshare, 2013)

3.4 Critical Discussion

Having looked at all three research approaches in appropriate detail, highlighting the advantages and disadvantages of each, it is safe to say that all have adequate potential to be the framework for an information systems project. However, I believe that the best approach to adopt for this particular project is the Design Science approach, as this offers the strongest correlation between what I am trying to achieve in this project and the

actual design science approach itself (design, build, evaluate). With that being said, I believe that I can still look at this project from a positivist point of view. The reason I say this is that the idea of using data mining to develop ‘personas’ is relatively novel, so by using a hypothesis I am trying to prove positively that it is possible and can be done.

3.5 Software Development Lifecycle Models

There are many models that can be used to develop a software project. All of these models follow the design science principle of design, build, evaluate. In this section I aim to identify and describe two common models, offering adequate analysis of each, after which I will isolate the model best suited to my project.

3.5.1 Rapid Application Development (RAD)

Rapid Application Development is an iterative model that favors rapid, early software prototyping over traditional planning. This approach consequently allows the development of software to take place much sooner. It also keeps stakeholders at the heart of the development process and allows requirement changes to take place easily. RAD typically follows four phases in its model: the Requirements Planning Phase, User Design Phase, Construction Phase and Cutover Phase. (Wikipedia, 2014) (David C. Yen, 1999)

1. Requirements Planning Phase – The inaugural phase of the project, where the project team meets with the stakeholders to go over the business needs of the client, the project scope, system requirements and constraints. This is followed by an agreement on the key issues that need to be addressed, after which the relevant authorization needs to be obtained in order to proceed.

2. User Design Phase – In the second phase of the project the stakeholders maintain a dialogue with the project analysts to develop prototype models of the system that clearly represent all system input and output features plus all the processes within the system. This phase of RAD is a continuous, interactive process that allows the stakeholders to play an active role in understanding, modifying and eventually approving a working prototype once they see a model that caters to their business needs.

3. Construction Phase – The penultimate phase of the project focuses on program and application development. Stakeholders continue to participate by suggesting changes and improvements to any user interfaces or reports, which are typically developed at this phase. Unit, integration and system testing, programming and application development are done at this phase of RAD.

4. Cutover Phase – The final phase of RAD is typically when the whole project is brought to a head. Tasks such as testing, data conversion, user training and system changeover are done at this stage. Compressing all these tasks into the final stage enables the new system to be delivered to the stakeholders in a much shorter timeframe.

Figure 6 - RAD Diagram

3.5.2 Analysis

The RAD model comes with advantages as well as disadvantages. However, the key is to be able to synthesize them and relate them back to my project. One of the common advantages of the RAD model is that it drastically reduces the time needed for requirement analysis and software requirements. Also, all prototypes created can be stored for future use, which consequently speeds up the software development of the product. That said, heavy prototyping is not necessary for my project, as it is a fairly short, small project with strict user requirements. (Rouse, 2007) (ISTQB Exam Certification, 2012)

3.6 Waterfall Model

The waterfall model is a sequential design model in which software development flows downward through several phases of tasks/activities (reminiscent of an actual waterfall). It differs from conventional agile development models in that it seeks to fully describe the application through written documents before actual software development commences. Originally developed by Royce in 1970, the waterfall model follows seven sequential phases. (The Waterfall Development Methodology, 2012)

1. Requirements Specification – The requirements are gathered from the stakeholders and agreed in principle with the development team.

2. Design – The blueprint of the project is drawn up and given to the developers to commence coding and start implementation.

3. Implementation – The actual system is developed at this stage; all coding is completed, resulting in the working program.

4. Integration – The system created is integrated into the environment agreed on in the preliminary phase.

5. Testing – Full testing of the integrated system is performed at this stage. Debugging also happens here, with a view to finding any bugs and working on potential fixes and patches.

6. Installation – The system is installed and the old system removed at this stage. This stage also includes training for all stakeholders and staff members.

7. Maintenance – The installed system is maintained through continuous updates and patches being developed and installed.

The waterfall model follows a strict principle: you can only move on to the next phase once the current phase has been completed, and once a phase is completed it is not revisited. (ISTQB Exam Certification, 2012)

Figure 7 - Waterfall Model

3.7 Analysis

The waterfall model comes with many advantages. One of the most commonly cited is the sequential nature of the model, which makes it very easy to understand and execute. Another advantage is that it works well on fairly small projects with strict, set-in-stone requirements, which suits my project. Another reason I favor this SDLC is that it seems to go hand in hand with the design science approach (design, build & evaluate). (Select Business Solutions, Inc., 2010)

3.8 User Interface Evaluation

One of the most integral parts of any software project is being able to evaluate the design of the artefact coherently. As previously stated, the user requirements inform the design of the application; once the design is complete, a framework or set of principles is needed in order to evaluate it. One of the most popular techniques for usability evaluation is Nielsen’s heuristics. In this section of the report I discuss Nielsen’s heuristics in detail, along with another usability inspection method, the Cognitive Walkthrough, in order to compare the two methods qualitatively. This in turn will help me decide on the most suitable approach for evaluating the usability of the Persona Identification Application.

3.8.1 Nielsen Heuristics

As previously stated, Nielsen’s heuristics are one of the most popular and widely used usability evaluation techniques today. It is important to remember that heuristic evaluation complements conventional user testing: it provides a template, or set of principles, that helps uncover problems a user is likely to come across. It was Jakob Nielsen’s work with Rolf Molich in the 1990s that originated the heuristics widely used today. However, it was in his 1994 publication Usability Engineering that the actual ten heuristics were published for the first time. (Nielsen, 1994)

(Some of the heuristics have been shortened for brevity)

1. Simple and Natural Dialogue – The dialogue should not contain information that is

irrelevant or rarely needed

2. Speak the User’s Language – The dialogue should be expressed clearly in words,

phrases, and concepts familiar to users rather than in system oriented terms

3. Minimize the User’s Memory Load – The user should not have to remember

information from one part of the dialogue to another

4. Consistency – Users should not have to wonder whether different words, situations

or actions mean the same thing

5. Feedback – The system should always keep users informed about what is going on,

through appropriate feedback within reasonable time.

6. Clearly Marked Exits – Users often choose system functions by mistake and will need a clearly marked ‘emergency exit’

7. Shortcuts (Accelerators) – Unseen by novice users, but often speed up the interaction for expert users.

8. Good Error Messages – They should be expressed in plain language (no code) to

precisely indicate the problem

9. Prevent Errors – Even better than good error messages is a careful design that prevents a problem from occurring in the first place

10. Help and Documentation – Even though it is better if the system can be used without documentation, it may be necessary to provide help and documentation. Any such information should be easy to search, be focused on the user’s tasks, list concrete steps to be carried out and not be too large.

3.8.2 Advantages and Disadvantages

Nielsen’s heuristics come with advantages as well as disadvantages. One advantage is that heuristic evaluation is a useful and relatively inexpensive way of providing quick feedback to designers, which can reduce the overall time a product spends in the usability evaluation stage. Furthermore, it is a good way of obtaining qualitative feedback early in the design process. Another advantage is that it can help greatly in suggesting the best corrective measures for designers, provided that the correct heuristic has been assigned in the first place. This would prove helpful when designing the user interface for the Persona Identification Application (PIA). Looking deeper, there are a few disadvantages to this evaluation principle. One is that it requires specialist knowledge and experience for the application of the heuristics to be effective. Moreover, usability experts trained to administer the heuristics effectively are hard to come by and can be relatively expensive to source. Another disadvantage is that heuristic evaluation can be misleading, in that it tends to identify more of the minor issues and fewer of the major issues with the design. (Usability.Gov, 2010) (Nielsen, 1994)

Moving forward, it is important to remember that heuristic evaluation does not replace conventional usability testing and should not be seen as an alternative to it. Having weighed the benefits and drawbacks above, I am in no doubt that Nielsen’s heuristics are the most suitable means of evaluating the user interface of the application. The reason is that, in essence, they cover all the basic requirements set by the stakeholders, and they also give me things to consider while designing the app (e.g. accelerators and consistency) as well as things to evaluate at the end of the design process.

3.9 Critical Discussion

As set out in Section 3.8.2, Nielsen’s heuristics offer quick, relatively inexpensive and early qualitative feedback, at the cost of requiring evaluators with specialist knowledge and of a tendency to surface minor issues more readily than major ones. Heuristic evaluation also does not replace conventional usability testing. On balance, I remain confident that Nielsen’s heuristics are the most suitable means of evaluating the user interface of the application. I intend to carry out the heuristic evaluation by constructing a usability questionnaire together with a system functionality test, so that I can ascertain the usability of the system and test its functionality, thus validating the user requirements.

3.9.1 Cognitive Walkthrough

To balance the argument over which evaluation technique to use, it is imperative to draw a comparison. A direct comparison to Nielsen’s heuristics is the Cognitive Walkthrough approach, which was developed as an additional tool in usability engineering. The technique involves a group of evaluators undertaking a set of tasks on the interface to evaluate its ease of learning and understandability. Lewis and Polson first set out the concept of cognitive walkthrough, and it works by tasking the evaluators with four questions: (usabilityfirst, 2011) (Cathleen Wharton, 1994)

• Will the user try to achieve the right effect?

• Will the user notice that the correct action is available?

• Will the user associate the correct action with the effect to be achieved?

• If the correct action is performed, will the user see that progress is being made toward solution of the task?

After all these questions have been answered, the evaluator attempts to construct a ‘success story’ for each incremental step of the process. If this turns out to be impossible, the evaluator instead creates a ‘failure story’, which aims to assess why the user cannot accomplish the task based on the GUI. The findings from the walkthrough are later aggregated and used to make improvements to the application, in this case the Persona Identification App. Like the heuristics discussed earlier, cognitive walkthrough has advantages as well as disadvantages. One of the main advantages is that it is useful for identifying problems early in the design phase, and it helps define users’ goals and assumptions with fewer resources than, say, full user testing would demand. This technique fits well with the scope of my project, as it provides a short and concise evaluation of the user interface I will be designing; it also provides a user-centered perspective similar to what the heuristics offer. However, one of the main issues with cognitive walkthrough is that it is more susceptible to subjective bias from the evaluators, which may result in the main issues not being covered. Another issue is that it can be very difficult for a seasoned evaluator to assume the perspective of an inexperienced user of the system. (Lewis, 1997)

3.10 Critical Discussion

As set out in Section 3.9.1, cognitive walkthrough is useful for identifying problems early in the design phase with fewer resources than full user testing would demand, and it offers a user-centered perspective similar to that of the heuristics. Its main weaknesses are that it is more susceptible to subjective bias from the evaluators, and that a seasoned evaluator can find it very difficult to assume the perspective of an inexperienced user of the system.

3.11 Summary

In this chapter I have looked in depth at three design principles, evaluating each of them and choosing the most appropriate one for my project. In addition, I looked into software development lifecycles and picked out the waterfall model as the most efficient lifecycle for this project. Finally, I looked into user interface evaluation, choosing Nielsen’s heuristics as my way of evaluating the application interface. The findings of this chapter have helped me choose the appropriate methodology and evaluation approach for this project.

4 Requirements Analysis and Design

In this chapter I review and discuss the fundamental requirements of this project. There are many categories of requirements that can be used; in this project I use three: customer requirements, functional requirements and non-functional requirements. In addition, I discuss the design process of my project, making use of activity diagrams, use case diagrams and narrative to help illustrate the design of my application.

4.1 Customer Requirements

Customer requirements are direct statements or expectations that come from the principal stakeholders, or prime actors, of the project being developed. They directly affect the scope of the project and have clear ramifications for the key features of the system. In this case I spoke directly to some of the principal stakeholders for the Persona Identification Application, who told me that their requirements were the following:

1. To be able to use a whole dataset (Excel)

2. To be able to cluster the dataset through an application interface

3. To be given back a visual representation of the clustering results through the application interface

4. To be able to download a CSV table showing the clustering results, which can help facilitate the identification of personas

Table 1 – User Requirements

4.2 Functional Requirements

Functional requirements are the mandatory tasks and activities that must be fulfilled in order to deliver the full functionality of the app. In other words, they describe what the system should do and the features it should provide to its users. The table below shows the functional requirements for the Persona Identification Application.

Table 2 - Functional Requirements

4.3 Non-Functional Requirements

Non-functional requirements describe the qualities and constraints of the system, in this case the Persona Identification Application, rather than the specific functions it performs. The table below shows the non-functional requirements for this system.

Table 3 - Non-Functional Requirements

4.4 Requirements Summary

Requirements gathering and analysis plays a crucial role in informing the design of the software solution. The requirements, along with the research conducted in the literature review, will assist me in putting together an adequate design of the system, which is shown in the second half of this chapter.

4.5 Design

In this part of the chapter I concentrate on the design of the Persona Identification Application. As previously stated, the outcomes of my literature review, coupled with the results of the requirements analysis, have helped put this part of the chapter together. I will draw up different diagrams to show clearly the interaction between the user and the system. I will also provide the reasoning behind why each method was chosen.

4.6 Activity Diagram

One of the important UML models, an activity diagram illustrates the workflow of a business process. In this case the diagram below shows the set of incremental steps that an end user needs to complete to attain his or her end goal. Along the way there are different decision points that the user will face, all of which ultimately lead to the same main deliverable. I opted to construct an activity diagram because it is one of the most comprehensible diagrams, offering a clear understanding of the business flow within the system not only to the developers but to the stakeholders as well. (Wang Linzhang, 2004)

Figure 8 - Activity Diagram of Persona Identification Application

4.7 Use Case

Another important UML model, the use case diagram, aims to offer the simplest way of demonstrating the user’s interaction with the proposed system. The diagram below shows the user’s interactions with the Persona Identification App. In addition to the diagram I put together a use case narrative, which provides a more in-depth description of the use case diagram. I chose to produce a use case diagram and narrative because they provide an abstract view of the application from the user’s perspective. (Elenburg, 2005)

Figure 9 - Use Case Diagram of Persona Identification Application

Table 4 - Use Case Narrative

Summary

This chapter has looked at the requirements set out by the users, setting out the functional and non-functional requirements of the application. It has also shown how I went about designing the application. In addition, I have discussed different techniques for evaluating the usability of the application interface and its functionality. The findings in this chapter will help me greatly in implementing the application with the users’ requirements in mind; equally, they will help me evaluate the application as a whole, which is explained further in Chapter 6.

5 Implementation

In this chapter I discuss the implementation of the Persona Identification App. In particular, I look at the software environment in which I chose to implement the application, which in this project is R, and justify that choice. I also detail the full functionality of the application by way of screenshots, with a description of each.

5.1 Software Environment – R

R is a free programming language and software environment designed specifically for statistical computing and data mining. Its environment enables users to build statistical software as well as graphical user interfaces. R is driven from a command line, meaning it runs through an MS-DOS-style console; however, several GUI front ends, such as RStudio, have been developed to use alongside it. One of the main reasons I decided to use R to implement this system is that it is free, meaning that I could use it at will without having to obtain a license. Another reason was that I felt comfortable using a command-line system due to my prior experience with MS-DOS. In addition, R offers a good, easy-to-understand package for developing interactive web-based interfaces (Shiny), which I used to develop the interface.

5.2 Software Environment - MATLAB

MATLAB is a high-level, interactive programming environment written in a number of programming languages, including Java, C and C++. One of its advantages is that it gives users access to a wide range of features, such as plotting functions and data, implementing algorithms and using built-in math functions. Furthermore, MATLAB allows users to create graphical user interfaces to work hand in hand with the programs coded in its environment. The main reason I chose not to use MATLAB to develop and implement the Persona Identification App was that I was unable to obtain a license from the university to use it at home, meaning that every time I wanted to work on development I would have to come on site, which is neither feasible nor efficient.

5.3 Persona Identification Application Implementation

As previously stated, I developed the persona identification program in R and then developed the interface using the R package Shiny. To do this I coded the individual functions and then put them together in a Shiny-based application. Below I have enclosed screenshots of the code for the most important functions, with annotations describing what each function does. For convenience, the functions are also listed here:

1. Import CSV file and convert to data matrix

2. Choose variables

3. Standardize data option and cluster data

4. Show within-groups sum of squared errors (number of clusters)

5. Show results

5.3.1 Application Coding Screenshots

1. Import CSV File

Figure 10 - Import csv file plus description

2. Choose variables

Figure 11 – Choose variables plus description

3. Standardize data and run K-Means algorithm

Figure 12 – Standardize data and run k-means plus description

4. Show within-groups sum of squared errors (number of clusters)

Figure 13 – Choose K function plus description

5. Show Analysis Results

Figure 14 – Show analysis results plus description

6. Download cluster results CSV file

Figure 15 – Download results csv file plus description
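Since the code screenshots are not reproduced here, the standardize, cluster and within-groups sum of squares steps listed above can be sketched as follows. This is an illustration only, not the application's code: the application was written in R with Shiny, whereas this sketch uses plain Python and Lloyd's algorithm, which may differ from the algorithm behind the R function the application called. The function names (`standardize`, `kmeans`), the seed and the sample data are all assumptions made for the example.

```python
import math
import random


def standardize(rows):
    """Scale each column to zero mean and unit standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Sample standard deviation (n - 1); constant columns fall back to 1.0
    # so that we never divide by zero.
    sds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1)) or 1.0
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iterations=100, seed=0):
    """Lloyd's algorithm: returns (labels, centres, within-groups SS)."""
    rng = random.Random(seed)
    centres = [list(r) for r in rng.sample(rows, k)]  # k random start points
    labels = None
    for _ in range(iterations):
        # Assign every row to its nearest centre.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(row, centres[j]))
            for row in rows
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Move each centre to the mean of its members (empty clusters stay put).
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    wss = sum(sq_dist(row, centres[lab]) for row, lab in zip(rows, labels))
    return labels, centres, wss


# Hypothetical two-column numeric data with two obvious groups.
data = [[1, 1], [1.2, 0.9], [0.8, 1.1], [9, 9], [9.1, 8.8], [8.9, 9.2]]
scaled = standardize(data)

# Within-groups sum of squares for several values of k; the "elbow" in
# these values is what guides the choice of the number of clusters.
wss_by_k = {k: kmeans(scaled, k)[2] for k in range(1, 4)}
```

Plotting `wss_by_k` against k gives the kind of curve used to choose the number of clusters: the point at which adding another cluster stops reducing the within-groups sum of squares appreciably.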

5.3.2 Application Interface Screenshots

In this part of the chapter I present screenshots of the actual interface of the application, to give a visual impression of the code explained earlier. The screenshots are annotated to provide more in-depth descriptions of what is happening within the application.

Figure 16 - Screenshot of Persona Application Interface 1.0

Figure 17 – Screenshot of Persona Identification Application 2.0

5.4 Assumptions

To run the application successfully, some prerequisites need to be met. First, all the data in the CSV file needs to be numeric, otherwise the K-Means algorithm will throw errors. Second, the data input has to be pre-processed in order to obtain meaningful results; this is discussed further in Chapter 6. Finally, when running this application in R, the shiny library needs to be installed and loaded, after which the command runApp(".") is entered to run the application.
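The first prerequisite, that every value in the CSV file is numeric, could be checked before clustering is attempted. The following is a minimal sketch under stated assumptions: it is written in Python rather than the R of the application, the function name `non_numeric_cells` is hypothetical, and the sample text is made up for illustration.

```python
import csv
import io


def non_numeric_cells(csv_text):
    """Return (row_number, column_name, value) for every non-numeric cell."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column, value in row.items():
            try:
                float(value)
            except (TypeError, ValueError):  # text, or a missing cell
                problems.append((row_number, column, value))
    return problems


# Illustrative input: the second data row still holds a category name.
sample = "hkey,prodcatID\n1,1001\n2,GROCERY\n"
issues = non_numeric_cells(sample)
```

An empty result means the file satisfies the numeric assumption; otherwise each entry points to the offending row and column so that it can be fixed during pre-processing.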

5.5 Summary

This chapter has shown the implementation of the application as well as the reasoning

behind why I chose the software environment to code it in. I have also discussed the

prerequisites that need to be fulfilled in order for the application to work. The findings in

this chapter have demonstrated my ability to code an application and present it in a user-

friendly manner.

6 Results and Evaluation

In this chapter I will be looking at the results gained from the application developed. I will also detail how I went about deriving personas from the results data. It is important to remember that this application can work with any dataset as long as it is numeric; for the purposes of this project I have focused on a dataset containing the weekly shop of 500 families over a two-month period. Furthermore, I will be evaluating the usability of the application using Nielsen’s heuristics and conducting black-box testing to test the system functionality.

6.1 Data Pre-Processing

As previously stated, data pre-processing is an essential part of the data mining process, as it lays the foundation for more concise analysis of the results. It also helps clear up the so-called ‘garbage’ data that may skew the results. To pre-process the data used for this project, I first chose the two variables most important for identifying personas from the dunnhumby dataset, which in this case were household key (hkey) and product category (prodcatID). I used a technique called “Quota Sampling” to select which data I wanted to use for this analysis (Riley, 2012). I then created my own data subset, a CSV file containing only these two variables. Finally, because K-Means requires numeric input, I assigned each of the 22 product categories a numeric value and entered these values into the data subset, keeping a reference of each category and the numeric value assigned to it, which can be seen below. For ease of understanding I used the product categories as the “personas”, e.g. GROCERY will be a grocery persona.

Figure 18 – Evidence of data pre-processing Results
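The category-to-number step described above can be sketched as below. This is an illustrative Python sketch, not the procedure actually used, which appears to have been done by hand in the CSV file. Only the pairing of 1001 with GROCERY comes from the report; the other codes, the helper name `build_category_codes` and the sample rows are assumptions.

```python
def build_category_codes(categories, start=1001):
    """Assign consecutive integer codes to category names in first-seen order."""
    codes = {}
    for name in categories:
        if name not in codes:
            codes[name] = start + len(codes)
    return codes


# Illustrative rows of (household key, product category); not real data.
purchases = [(1, "GROCERY"), (1, "DRUG GM"), (2, "GROCERY"), (3, "PRODUCE")]

# The reference table of category name to numeric code (hypothetical values
# apart from GROCERY = 1001).
codes = build_category_codes(category for _, category in purchases)

# The numeric subset that would be written to the CSV file for clustering.
encoded = [(hkey, codes[category]) for hkey, category in purchases]
```

Keeping `codes` alongside the encoded file is what allows the cluster results to be translated back into named personas later.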

Once the results CSV file is downloaded, its contents show four columns: kclust, which shows how many clusters there are; hkey and prodcatID, the two variables chosen for analysis; and fit.cluster, which shows the cluster to which each row has been assigned.

Figure 19 - Screenshot of results out CSV file

From here I can see that each prodcatID and hkey has been assigned to a fit.cluster, the number of clusters having already been set by the user. I can then filter the rows in the CSV file to see how many instances of each numeric category code (e.g. 1001, 1002) are in each cluster. Once I have found out how many instances of each code are in each cluster, I total them, which in turn lets me work out a persona percentage for each category in each cluster. All the results are documented, as can be seen below.

Figure 20 - Identifying Personas Breakdown

The formula I used to work out the percentage was relatively straightforward. After totalling the instances in a cluster, I calculated the instances of each category as a percentage of that total. For example, 1001 (Grocery) has 2050 instances in cluster 1; I divide that number by the total number of instances in cluster 1, using an online percentage calculator.

Figure 21 –Percentage Calculator Example
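The persona percentage described above is the number of instances of a category in a cluster divided by the total number of instances in that cluster, multiplied by 100. As a hedged illustration, the same calculation can be done for every category and cluster at once. The sketch below is in Python, the function name `persona_percentages` is hypothetical, and the rows are made-up values rather than the report's data.

```python
from collections import Counter


def persona_percentages(rows):
    """rows: iterable of (category_code, cluster) pairs.

    Returns {cluster: {category_code: percentage of that cluster's rows}}.
    """
    rows = list(rows)
    cluster_totals = Counter(cluster for _, cluster in rows)
    pair_counts = Counter(rows)
    result = {}
    for (code, cluster), count in pair_counts.items():
        share = 100.0 * count / cluster_totals[cluster]
        result.setdefault(cluster, {})[code] = round(share, 1)
    return result


# Illustrative (prodcatID, fit.cluster) pairs; not the report's data.
rows = (
    [(1001, 1)] * 6 + [(1002, 1)] * 3 + [(1003, 1)]
    + [(1001, 2)] * 2 + [(1002, 2)] * 2
)
percentages = persona_percentages(rows)
```

In this toy data cluster 1 holds ten rows, six of which carry code 1001, so that code's share of cluster 1 is 60%.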

6.2 Results Summary

To be able to identify personas, thus meeting my aim, I conducted some tests on my own data

sub-set (Figure 11). The first test I ran was with K (Number of Clusters) set to 3, which is the

optimum number of clusters for this dataset (see Figure 10). After mining the raw data

based on the method stated above, the following results were found:

Figure 22 - Persona Percentage Results (Test 1)

From the results I can say that the GROCERY persona was the most consistent and populous persona in the dataset, at around 60-65% in terms of persona percentage. The next most common persona was DRUG GM, at around 10-11%. This tells me that the dataset is heavily populated with GROCERY personas, with comparatively few of any other persona. To validate this finding I ran the application again on the same dataset, this time with K = 4. The results were as follows:

Figure 23- Persona Percentage Results (Test 2)

The results of this test are consistent with the first test, conducted with K set to 3. The GROCERY persona averages between 63% and 66% persona percentage across the 4 clusters, which is very similar to the first test run. The DRUG GM persona holds at around 10%, with PRODUCE coming in at around 9-10%. This indicates to me that the dataset is densely populated with GROCERY personas.

6.3 Evaluation

As mentioned in Section 3.8.1, I have chosen to use Nielsen’s heuristics to evaluate the usability of the application interface. To do this I have used the System Usability Scale questionnaire, which was developed by John Brooke (Brooke, 2011). The questionnaire is ten questions long and is scored on a Likert scale (1 = Strongly disagree, 5 = Strongly agree); a participant who is uncertain of an answer selects 3. I chose this questionnaire because its questions are similar to Nielsen’s 1994 heuristics, which is what I originally planned to evaluate the system with. In addition, a Likert scale makes the questionnaire more coherent and easier for the participants to complete, thus saving time (Dane Bertram, 2012). Below is an example of the questionnaire that will be given to the participants:

Figure 24 - System Usability Questionnaire
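For reference, the System Usability Scale has a standard scoring rule that turns one participant's ten answers into a single score from 0 to 100. The sketch below shows that rule in Python. It assumes the questionnaire kept the standard SUS wording, in which odd-numbered items are positively worded and even-numbered items negatively worded; the report does not state whether this scoring was applied, and the function name `sus_score` is hypothetical.

```python
def sus_score(responses):
    """Score one participant's ten SUS answers (each 1-5) on a 0-100 scale."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("expected ten answers, each an integer from 1 to 5")
    total = 0
    for item_number, answer in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += answer - 1  # odd items: higher agreement is better
        else:
            total += 5 - answer  # even items: lower agreement is better
    return total * 2.5


# A participant who selects the neutral option (3) for every question.
neutral = sus_score([3] * 10)
```

Averaging the per-participant scores gives one overall usability figure that can sit alongside the per-question results.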

6.3.1 Participant selection

Selecting the number of participants to evaluate the application is very important, especially for this project. Ideally, the more evaluators I have the better, as different evaluators can pick up different usability issues. However, according to Nielsen the optimum number for evaluating a software system is 5 evaluators, or at least 3 (Nielsen, 1995).

Figure 25 - Graph showing the optimum number of evaluators

The above figure (25) plots the number of evaluators against the proportion of usability problems found. I can see here that 5 evaluators can find 75% of usability problems.
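
A curve of this shape is commonly modelled as 1 - (1 - λ)^n, where n is the number of evaluators and λ is the probability that a single evaluator finds a given problem (the model usually attributed to Nielsen and Landauer). The Python sketch below is illustrative only. The detection rate of 0.24 is an assumption chosen here because it gives roughly the 75% quoted above for 5 evaluators; it is not a value taken from this report or from the cited article.

```python
def proportion_found(evaluators, detection_rate):
    """Expected share of usability problems found by `evaluators` people.

    Implements 1 - (1 - detection_rate) ** evaluators, where
    detection_rate is the chance that one evaluator finds a given problem.
    """
    if evaluators < 0:
        raise ValueError("evaluators must be non-negative")
    if not 0.0 <= detection_rate <= 1.0:
        raise ValueError("detection_rate must be between 0 and 1")
    return 1.0 - (1.0 - detection_rate) ** evaluators


ASSUMED_RATE = 0.24  # hypothetical value, picked to match the ~75% figure

for n in (1, 3, 5, 10):
    print(n, round(proportion_found(n, ASSUMED_RATE), 2))
```

The diminishing returns are visible in the output: each additional evaluator adds less than the one before, which is the argument for stopping at around 5.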

6.4 Black-Box Testing

Black-box testing is a form of functional testing that aims to test whether the software developed does what it is supposed to do. To do this I created a questionnaire based on the functional requirements, to be filled out by the same participants who tested the usability (Williams, 2006).

Figure 26 - Functional Test Questionnaire

I designed the questions this way (Figure 26) so that I could gauge whether or not the functional requirements have been met with a straightforward yes or no response. The outcome of this questionnaire therefore indicates to me how far I have gone in meeting the user requirements.
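
Yes/no responses of this kind can be tallied per requirement, as in the minimal Python sketch below. It is illustrative only and is not the project's code; the function names are hypothetical. The example data has the shape of the evaluation described in this chapter, with 5 evaluators each answering 7 questions.

```python
def requirement_pass_rates(responses):
    """Return the share of participants answering YES for each question.

    `responses` holds one list per participant; each inner list has one
    boolean per functional-requirement question (True means YES).
    """
    if not responses:
        raise ValueError("at least one participant is required")
    question_count = len(responses[0])
    if any(len(answers) != question_count for answers in responses):
        raise ValueError("every participant must answer every question")
    participant_count = len(responses)
    return [
        sum(answers[q] for answers in responses) / participant_count
        for q in range(question_count)
    ]


def all_requirements_met(responses):
    """True only when every participant answered YES to every question."""
    return all(rate == 1.0 for rate in requirement_pass_rates(responses))


# Hypothetical answers: 5 evaluators, 7 questions, all answered YES.
example = [[True] * 7 for _ in range(5)]
print(requirement_pass_rates(example), all_requirements_met(example))
```

A per-question rate below 1.0 would point to the specific requirement that at least one participant could not exercise.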

6.5 Evaluation Results

After the evaluation was completed, I collated all the results from the questionnaire and produced a bar chart from them to give a visual representation of the evaluation results. The first thing I did was to put all the answers from each participant in a table (Figure 27). After this I was able to construct a bar chart using Excel.

Figure 28 - Bar Chart of Usability Questionnaire Results

To make the output more meaningful, I aggregated the results and drew up a bar chart to give a visual representation of the average score of the usability questionnaire.

Figure 27 - Table of Usability Questionnaire Results

Figure 29 - Bar Chart showing average usability questionnaire results

6.6 Black Box Testing Results

As previously stated, the system functionality (black-box) testing was conducted concurrently with the usability testing. Everyone who took part reported that they were able to execute all the functionality that the system offers. The results are illustrated below in Figure 30.

Figure 30 - Results of System Functionality Questionnaire

6.7 Evaluation Summary

To conclude this chapter, I can say that the usability and system evaluation was highly successful, in particular the black-box testing. The responses from all 5 subject experts who conducted the evaluation were highly positive, which tells me that, from an expert point of view, the application is very usable and does what it set out to do. On the functionality side, 5/5 evaluators answered YES to all 7 functionality questions (Figure 30). This tells me that the system functionality is fit for purpose and, crucially, it validates the customer requirements set out in Chapter 4.

7 Conclusion

This dissertation has covered a lot of topics as well as fresh, novel ideas, e.g. persona identification. However, it is important to draw conclusions competently from the findings of this project, appraising the positives found and offering constructive critique of the weaker aspects of the dissertation.

7.1.1 Aim - Identify individual personas from prosumers' personal information.

In answer to this aim, I can say that I was able to identify individual “personas” from prosumer data; however, I came across some issues in doing so.

The first issue was the strength of the persona. The main persona found in the dataset tested was the GROCERY “persona”, which some analysts could deem too vague or not in-depth enough. Through my own investigation into this perception, I found that a much deeper pre-processing method, e.g. using sub-product categories instead of main product categories, would be required in order to draw out more ‘features’ within the clusters. This would help facilitate more diverse and meaningful “personas”. It is important to stress that this could have been achieved within the boundaries of this project; however, I believed that deriving personas from main product categories, e.g. grocery, produce, nutrition etc., would be a much better way of obtaining good individual personas. In hindsight, I believe a deeper pre-processing method would have produced more meaningful personas. Nevertheless, I believe this should not take away from the fact that I was able to identify individual “personas”, which was the ultimate aim of this dissertation.

7.1.2 Objective 1 - Undertake a state-of-the-art literature review to inform and create a design specification for identifying personas / Investigate in greater detail the pros and cons of clustering with reference to appropriate literature

To conclude this objective, I can confidently say that a state-of-the-art literature review was undertaken (see Chapter 2), carefully analysing two of the main clustering methods (hierarchical and partitioning), drawing out their advantages and disadvantages and relating them back to the aim of this project. In addition, I looked into the importance of personal data and how it has risen to be the new “oil”. I also looked at the rise of the digital prosumer, in particular how prosumption is poised to take over typical consumption, lending credence to Toffler's prediction that prosumption would take over consumption by the turn of the 21st century. All of this provided the necessary justification for undertaking the project and exposed the potential value in building an application that can identify personas.

In essence I believe this objective was met to a high standard, making use of various white literature. This subsequently enabled me to create a design specification for my application.

7.1.3 Objective 2 - Build a persona identification application.

This part of the project was by far the most challenging yet the most rewarding. First, I had to choose the appropriate software environment in which to code the application; once this was decided, code development began. Although this was a very tedious task, involving numerous failed attempts and heavily bugged versions, a final version was created, bringing to life all the research and personal hypotheses set out at the beginning of the project (see Chapter 5). Overall I was hugely satisfied with the implementation of the application. Despite the fact that it took a huge amount of time and resources to put together, I believe it is a very strong, well put together application that is fit for purpose.

7.1.4 Objective 3 - Evaluate the application.

The final part of this dissertation required me to evaluate the application, not only to provide validation against my aim but also to validate the customer requirements defined in Chapter 4. I first evaluated the usability of the system via a questionnaire that was heavily influenced by the Nielsen heuristic principles. After this, a black-box test was put together to evaluate the functionality of the application. Both tests were a huge success; as I was using experts to evaluate the system, there was a lot of extra scrutiny on both the usability and the functionality. The feedback was highly positive, which went a long way in validating my aim and user requirements (see Chapter 6).

7.2 Future Development

One of the most underrated pitfalls of any project is to neglect the things that have not been done, due to time or resources, and to over-emphasise the things that have been achieved. I believe that there is a world of benefits to be unlocked by stepping back and looking at what can be developed in the future to make this project even better.

There are a number of things that could be achieved with future work that would enhance the application even further. The first is a much deeper pool of personas, as explained earlier in this chapter. Another would be adding more algorithms to the application instead of just K-Means; this was explained in more detail in Chapter 2.8. A further development would be the ability to put the application on a server and connect it to a database. Data from the data lockers could then be stored in the database and called into the application via a database query, making the application more robust and expanding its overall potential and value. The use of classification to assign cluster data to pre-defined ‘personas’ would also be a very useful future development, as it would make the application more user friendly, unlocking the potential for non-clustering experts to benefit from the app. This leads me onto my final future development point, which is integrating a digital data locker into the application, with the addition of a highly persuasive and user-friendly interface. With the addition of classification and of database and server integration, the app would then transform from a back-end technical tool into a fully-fledged everyday application that the average person can take advantage of.

Appendix A Personal Reflection

A.1 Reflection on Project

The biggest problem that I encountered in putting this project together was grasping the idea of data mining and cluster analysis. This is a complex topic, and understanding it well enough to relate it back to this project requires a particularly high level of expertise. I found it hard at the beginning to understand these concepts because I was too focused on getting my literature review and methodology done as soon as possible. If I were to do it all again, I would spend 2-3 weeks learning and understanding cluster analysis in much more detail, as this would have greatly reduced the amount of downtime I had at various points during this project. Another issue I had was time management. As the project wore on my time management improved dramatically; however, at the beginning I was not allocating my time well, assigning too much of it to my methodology and literature review instead of starting my implementation.

In essence I believe that I grew into this project as time wore on and I started to understand the level that needs to be attained in order to achieve the highest marks possible in the project. Even though too much time was spent coding my application and getting it to work, I believe that overall all areas of my project maintained the highest standard, despite the circumstances.

A.2 Personal Reflection

On reflection, I would say there are a number of personal adjustments I could have made, or things I could have done differently altogether, to make the project even better. To begin with, I would have spent a lot less time on the implementation part of the project, as this ate up valuable time and energy that could have been spent on other sections of this project. It took me around 8-10 weeks to fully code the functionality and the interface of my application, as I had to code it using both R and R-Shiny. In hindsight, I believe I should have either used a language I was a lot more familiar with or used a more robust software environment in which I could code both the application and the interface under the umbrella of one language, like MATLAB. Another suggestion is that I could have picked a topic that I was more familiar with, as this would have made my life a whole lot easier and more straightforward than it has been. I hugely enjoyed working with Panos on his Digital Prosumer project, but from a dissertation point of view, having to learn about intelligent data analysis as well as statistical coding, with no computer science background, presented me with a myriad of challenges. I did manage to overcome them, which is a huge credit to my ability and the fundamental skills that I have developed and honed during my time at Brunel University. However, with the aid of hindsight, I would have picked a topic that I was more comfortable with, easing the steep learning curve that I had to navigate. My final suggestion would be to make my application more aesthetically pleasing, as this would have made me feel a little better about it. I am not ashamed in any way, shape or form of what I managed to produce; I would just have preferred something that looked the part in addition to acting the part, which my application does 150%.

Bibliography

Select Business Solutions, Inc. (2010, March 15). What is the Waterfall Model? Retrieved December 20, 2013, from Select Business Solutions: http://www.selectbs.com/analysis-and-design/what-is-the-waterfall-model

Alan R. Hevner, S. T. (2004). DESIGN SCIENCE IN INFORMATION SYSTEMS RESEARCH. MIS Quarterly , 1-25.

BIS. (2011). Better Choices: Better Deals Consumer Powering Growth. London: Cabinet Office.

Brooke, J. (2011, January 15). SUS - A quick and dirty usability scale. Reading, Berkshire, United Kingdom.

Cathleen Wharton, J. R. (1994, January 18). A Cognitive Walkthrough Method: A Practitioner's Guide. Boulder, Colorado, USA.

Cooper, A. (1998). The Inmates Are Running the Asylum. London: Sams - Pearson Education.

Creswell, J. W. (2007). Qualitative Inquiry and Research Design: Choosing Among Five Approaches. California: SAGE Publications.

Dane Bertram. (2012, May 5). Likert Scales …are the meaning of life. Retrieved February 25, 2014, from al-huda.net: http://www.al-huda.net/2012/PA/2014/topic-dane-likert.pdf

David C. Yen, W. S. (1991, June 16). Rapid Application Development (RAD). Florence, Kentucky, USA.

David Dean, S. D. (2012). The Digital Manifesto. The Boston Consulting Group, 1-12.

Dein, D. (2012, March 19). PRESS RELEASES. Retrieved March 25, 2014, from Clicks Grow Like BRICS: G-20 Internet Economy to Expand at 10 Percent a Year Through 2016: http://www.bcg.com/media/PressReleaseDetails.aspx?id=tcm:12-100468

Department for Business, Innovation & Skills . (2013, December 13). Midata Providing better information and protection for consumers. Retrieved December 20, 2013, from UK Government Website: https://www.gov.uk/government/policies/providing-better-information-and-protection-for-consumers/supporting-pages/personal-data

Department for Business, Innovation and Skills. (2013, June 14). Information Economy Strategy. Retrieved December 15, 2013, from UK Goverment Website: https://www.gov.uk/government/publications/information-economy-strategy

Deutsch, G. (2010, January 21). Data Mining in the KDD Environment. Retrieved December 12, 2013, from Data Mining – Blog.com: http://www.data-mining-blog.com/data-mining/data-mining-kdd-environment-fayyad-semma-five-sas-spss-crisp-dm/

Elenburg, D. (2005). Use Cases: Background, Best Practices, and Benefits. MKS, 4-7.

Group, T. B. (2012, May 12). Rethinking Personal Data: Strengthening Trust. Retrieved October 17, 2013, from World Economic Forum: http://www.weforum.org/reports/rethinking-personal-data-strengthening-trust

ICO. (2010, 28 May). Key definitions of the Data Protection Act. Retrieved December 15, 2013, from Information Commissioner's Office: http://www.ico.org.uk/for_organisations/data_protection/the_guide/key_definitions

Improved Outcomes Software (ios). (2009, May 5). K-Means Clustering Overview. Retrieved February 2, 2014, from Improved outcomes software: http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/K-Means_Clustering_Overview.htm

Information Commissioner Office. (2010). Determining what information is 'data' for the purposes of the DPA. Information Commissioner Office, 1-2.

Institute of Public & International Affairs. (2009, May 5). WHAT IS INTERPRETIVE RESEARCH? Retrieved January 10, 2014, from INSTITUTE OF PUBLIC & INTERNATIONAL AFFAIRS : http://www.ipia.utah.edu/imps/html/research.html

ISTQB Exam Certification. (2012, January 13). What is Incremental model- advantages, disadvantages and when to use it? Retrieved January 15, 2014, from ISTQB EXAM CERTIFICATION: http://istqbexamcertification.com/what-is-incremental-model-advantages-disadvantages-and-when-to-use-it/

ISTQB Exam Certification. (2012, January 13). What is RAD model- advantages, disadvantages and when to use it? Retrieved January 10, 2014, from ISTQB Exam Certification: http://istqbexamcertification.com/what-is-rad-model-advantages-disadvantages-and-when-to-use-it/

Jenkinson, A. (1994). Journal of Targeting, Measurement and Analysis for Marketing. Beyond Segmentation, 60-65.

Johnson, S. (2010, April 10). Advantages & Disadvantages of Positivism. Retrieved December 25, 2013, from eHow: http://www.ehow.com/info_12088541_advantages-disadvantages-positivism.html

Kamber, J. H. (2006). Data Mining Concepts and Techniques (2nd Edition ed.). Illinois: Morgan Kaufmann.

King, R. (2011, September 7). IBM panel discusses tackling big data storage as problem escalates. Retrieved October 17, 2013, from Smartplanet: http://www.smartplanet.com/blog/smart-takes/ibm-panel-discusses-tackling-big-data-storage-as-problem-escalates/19010

Kotler, P. (1986). The Prosumer Movement: A new challenge for marketers. Advances in Consumer Research Volume 13 , 510-513.

Lewis, C. (1997). Cognitive Walkthroughs. In C. Lewis, Handbook of Human-Computer Interaction (pp. 335-345). Colorado: Elsvier Science.

Lior Rokach, O. M. (2010, June 10). Chapter 15: CLUSTERING METHODS. Retrieved December 15, 2013, from Gurion University of the Negev: http://www.ise.bgu.ac.il/faculty/liorr/hbchap15.pdf

Madrigal, A. C. (2012, March 19). How Much Is Your Data Worth? Mmm, Somewhere Between Half a Cent and $1,200. Retrieved October 17, 2013, from The Atlantic: http://www.theatlantic.com/technology/archive/2012/03/how-much-is-your-data-worth-mmm-somewhere-between-half-a-cent-and-1-200/254730/

Myers, M. D. (2010, August 5). Qualitative Research in Information Systems: References on Interpretive Research. Retrieved December 15, 2013, from Association for Information Systems: http://www.qual.auckland.ac.nz/interp.aspx

Nagel, D. (2013, October 7). thejournal.com. Retrieved January 25, 2014, from 212 Billion Devices To Make Up 'The Internet of Things' by 2020: http://thejournal.com/articles/2013/10/07/212-billion-devices-to-make-up-the-internet-of-things-by-2020.aspx

Nielsen, J. (1994). Enhancing the explanatory power of usability heuristics. Human factors in computing systems, 1-4.

Nielsen, J. (1994). Usability Engineering. In J. Nielsen, Discount Usability Engineering (pp. 19-21). London: Elsevier.

Nielsen, J. (1995, January 1). How to Conduct a Heuristic Evaluation. Retrieved February 28, 2014, from Nielsen Norman Group: http://www.nngroup.com/articles/how-to-conduct-a-heuristic-evaluation/

Packer, M. (2007, May 10). Interpretive Research - An Overview. Retrieved December 15, 2013, from The Duquesne Mathematics: http://www.mathcs.duq.edu/~packer/IR/IRmain.html

Palis, C. (2012, March 20). Internet Economy: How Essential Is The Internet To The U.S.? Retrieved March 25, 2014, from The Huffington Post: http://www.huffingtonpost.com/2012/03/20/internet-economy-infographic_n_1363592.html

Raza Ali, U. G. (2004, May 17). Data Clustering and Its Applications. Retrieved October 17, 2013, from members.tripod.com: http://members.tripod.com/asim_saeed/paper.htm

Riley, J. (2012, September 23). Marketing Research - Sampling. Retrieved March 26, 2014, from tutor2u: http://www.tutor2u.net/business/marketing/research_sampling.asp

Rouse, M. (2007, February 10). rapid application development (RAD). Retrieved December 10, 2013, from SearchSoftwareQuality: http://searchsoftwarequality.techtarget.com/definition/rapid-application-development

Sage Publications. (2009, August 31). Retrieved December 20, 2013, from http://www.sagepub.com/upm-data/30646_mukherji_chp_1.pdf

Schwartz, P. M. (2003). Property, Privacy and Personal Data. HeinOnline, 2056.

Slideshare. (2013, January 22). Slideshare. Retrieved December 13, 2013, from Interpretivism: http://www.slideshare.net/deepalipatel246/interpretivism-16119473

Sociology Guide. (2008, March 10). Sociology Guide. Retrieved January 20, 2014, from Sociology Guide - A Students Guide To Sociology: http://www.sociologyguide.com/thinkers/Auguste-Comte.php

Tapscott, D. (1997). Digital Economy: Promise and Peril in the Age of Networked Intelligence . UK: McGraw-Hill Companies.

The Waterfall Development Methodology. (2012, May 28). The Waterfall Development Methodology. Retrieved December 15, 2013, from The Smart Method Limited: http://www.learnaccessvba.com/application_development/waterfall_method.htm

Toffler, A. (1980). The Third Wave. USA: Bantam Books.

University of the West of England. (2007, May 20). UWE Bristol. Retrieved December 20, 2013, from Research Observatory: http://ro.uwe.ac.uk/RenderPages/RenderLearningObject.aspx?Context=7&Area=1&Room=3&Constellation=24&LearningObject=104

Usability.Gov. (2010, May 15). Heuristic Evaluations and Expert Reviews. Retrieved February 15, 2014, from Usability.gov: http://www.usability.gov/how-to-and-tools/methods/heuristic-evaluation.html

usabilityfirst. (2011, June 12). Cognitive Walkthroughs. Retrieved February 25, 2014, from usabilityfirst: http://www.usabilityfirst.com/usability-methods/cognitive-walkthroughs/

Usama Fayyad, G. P.-S. (2008). From Data Mining to Knowledge Discovery in Databases . KDnuggets, 37-54.

Wang Linzhang, Y. J. (2004). Generating Test Cases from UML Activity Diagram based on Gray-Box Method. Software Engineering Conference, 284-291.

Wikipedia. (2011, January 28). Data Mining. Retrieved December 15, 2013, from Wikipedia: http://en.wikipedia.org/wiki/Data_mining

Wikipedia. (2014, February 10). Positivism. Retrieved February 15, 2014, from Wikipedia: http://en.wikipedia.org/wiki/Positivism

Wikipedia. (2014, February 6). Rapid application development. Retrieved February 15, 2014, from Wikipedia: http://en.wikipedia.org/wiki/Rapid_application_development

Wikipedia. (2014, February 12). Waterfall Model. Retrieved February 15, 2014, from Wikipedia: http://en.wikipedia.org/wiki/Waterfall_model

Williams, L. (2006, June 12). realsearchgroup.com/. Retrieved March 4, 2014, from Testing Overview and Black-Box Testing Techniques : http://agile.csc.ncsu.edu/SEMaterials/BlackBox.pdf

WordPress. (2012, March 14). WordPress.com. Retrieved December 5, 2013, from Interpretivism and Postivism (Ontological and Epistemological Perspectives): http://prabash78.wordpress.com/2012/03/14/interpretivism-and-postivism-ontological-and-epistemological-perspectives/

Zambito, T. (2013, May 27). What is a Buyer Persona? Why the Original Definition Still Matters to B2B. Retrieved October 17, 2013, from tonyzambito.com: http://tonyzambito.com/buyer-persona-original-definition-matters/

A.3 Appendices

A.4 Appendices
