Managing Reputation through Online Analytics

S0806637

Michael White © 2012 Word Count: 9,876

Maintaining Reputation through Online Analytics

Why the Public Relations Industry must Adapt or Die

Presented as part of the requirement for an award within the Undergraduate Modular

Scheme at the University of Gloucestershire (April 2012)

April 2012 Maintaining Reputation through Online Analytics PUR334


Declaration

This dissertation is the product of my own work. I agree that it

may be made available for reference and photocopying at the discretion of the

University.

Author’s Signature:

Michael White

Date 11/04/2012



Abstract

Over the last 6 years the communication landscape has changed significantly.

The advent of Facebook, Twitter, YouTube and other social networks introduced

a range of additional communication channels. Never has it been more

important for the public relations industry to maintain reputation. Whilst an

understanding of social networking tools now exists within the industry,

confusion still exists surrounding the range of metrics available online and how

these can be utilised to effectively provide Return on Investment (ROI) for

clients. A number of 3rd party measuring tools now exist, offering similar ‘search,

measure, understand and engage’ solutions.

This report examines in depth alternative approaches to the ‘measure’ and

‘understand’ stages for capturing quantitative and qualitative data. These

metrics and techniques may be used by the public relations industry to achieve

its campaign objectives.

Main Findings:

1. ROI is relied upon for reputation management and direct sales.

2. Third party measuring tools exist but are not perfect.

3. The PR industry needs standardisation.

4. Semantic Analysis works but has not yet been perfected.



Acknowledgements

Throughout writing this dissertation concerning online measurement the author

drank approximately 180 cups of coffee, smoked 450 cigarettes and listened to

900 hours’ worth of music. Despite this congenial lifestyle writing this

dissertation was only made possible by the following people.

The author’s parents: For having hope in their seven-year-old child with

dyslexia who could not read or write.

Lecturer, Practitioner and Extraordinaire, David Phillips: For his guidance

surrounding semantic analytics.

Microsoft Librarian, David K. Stewart: For providing research through the

Microsoft UK library.

Graduated PR student, Michael Healey: Who once interviewed the author

for his own dissertation and was an inspiration for writing this one.

Wikipedia: The author’s unreferenced secret weapon.



Table of Contents

Declaration

Abstract

Acknowledgements

Introduction

1.0 Literature Review

1.1 Public Relations Industry: Adapt or Die

1.2 Web Analytics 2.0

1.3 How to Measure Sales and Relationships

1.4 Introducing the Semantic Web

2.0 Methodology

2.1 Research Sample Design

2.2 Ethical Considerations

3.0 Latent Semantic Indexing Research into Neville Hobson’s Twitter timeline

3.1 LSI Python Script

3.2 Retrieval, Filter and Identification

3.3 Term Count Model and Singular Value Decomposition

3.4 The Results

4.0 Evaluation

4.1 Evaluation of Latent Semantic Indexing

4.2 Bayesian Inference and Other Interpretations

5.0 Conclusion

5.1 ROI is relied upon for reputation management and direct sales

5.2 Third party measuring tools exist but are not perfect

5.3 The PR industry needs standardisation

5.4 Semantic Analysis works but has not yet been perfected

References

Illustrations

Appendix



Introduction

The public relations industry is in a state of rapid change. On the 1st March 2012

the Public Relations Society of America (PRSA) announced the results of a vote

which concluded with their modern definition of PR (White, 2012):

“Public relations is a strategic communication process that builds mutually

beneficial relationships between organizations and their publics.”

This definition is similar to that of the UK’s Chartered Institute of Public Relations (CIPR)

(CIPR, 2012):

“Public relations is about reputation – the result of what you do, what you say

and what others say about you. Public relations is the discipline which looks after

reputation, with the aim of earning understanding and support and influencing

opinion and behaviour. It is the planned and sustained effort to establish and

maintain good will and mutual understanding between an organisation and its

publics”.

The definition of PR given by the Public Relations Consultants Association (PRCA), a UK

organisation, is extremely similar to the CIPR’s (PRCA, 2012):

“Public relations is all about reputation. It’s the result of what you do, what you

say, and what others say about you. It is used to gain trust and understanding

between an organisation and its various publics – whether that’s employees,

customers, investors, the local community – or all those stakeholder groups…”

For this PR society, chartered institute and association the emphasis on

‘reputation’ is clear, but a modern definition of PR must take into consideration

how the growth of digital communication channels provides an opportunity for

the PR industry to expand into additional service areas outside of reputation

management.

Furthermore, a viable method of measuring reputation has not yet been

discovered. Measuring online sentiment levels fulfils the CIPR’s understanding of

reputation being “what others say about you” but it is not yet possible to align

sentiment with the global values of a brand, product or service.

Whilst the sharp increase of communication channels being made available

across a range of communication platforms will inevitably impact reputation

management, the definition of PR should also be in question. Since 2006 the

introduction of now-popular social networks has made available additional

measurement metrics. Some of these metrics are already being utilised by the

online advertising industry to generate direct sales for their clients, as made

clear by some small public relations agencies managing online advertising

campaigns for their clients (Jefkins, 2000). This dissertation explores the

possibility of public relations finding additional ways to measure reputation

online and understanding that digital PR is not just concerned with reputation

but also direct sales.

Within the literature review a succinct but broad assessment of the various

online measurement metrics is made before an in-depth study into how

semantic analysis could be used to measure public relations activities. All

documents associated with the study can be found in the appendices.



1.0 Literature Review

This review seeks to identify, examine and compare key forms of online

measurement. The purpose is to understand the scale of online metrics available

for digital public relations campaigns and the interpretation of data involved. The

information within this literature review serves as a necessary

foundation for the research presented in section 3.0.

Preparing this review has involved consolidating relevant published texts, gaining

insights through marketing based blogs, examining online journal databases and

keyword searches on micro-blogging platform Twitter. The author’s personal

experiences within the field of public relations and online advertising are also

included.

This review comprises the following sections:

Public relations industry: adapt or die

Web Analytics 2.0

How to measure sales and relationships

Introducing the semantic web



1.1 Public Relations Industry: Adapt or Die

Edelman’s annual 8095 report surveyed 3,100 millennials across 8 different

countries. The millennial generation comprises those between the ages of 17

and 32 as of 2010, whose behaviours show a stark difference compared to baby

boomers (born 1946 – 1964) and generation X (born early 1960s – 1980).

Evidence in the report illustrated the close relationship millennials have with

brands online (Gould, 2010):

28% relied upon brands to make a positive impact in the world

36% relied upon brands to learn about new trends

18% announced they would switch to a competing brand if they were

offered tools to help them in other areas of life

16% relied upon brands to help achieve personal goals

Organisations must ensure that their brands adapt effectively to fit to the online

environment, often referred to as Web 2.0 (Gordon, 2011). The pressure is on

the public relations industry to have the confidence to manage customer

relations on the front lines. The latest PRCA barometer revealed a gloomy

outlook, accurately summarised in a response to the report by Weber

Shandwick’s vice-president (Owens, 2012):

“Clients are saying there is an uncertain market and that we have got to be

smarter with our budgets. We are seeing more quarter-by-quarter release of

budgets – there is a desire for more control”.

PRCA’s barometer revealed a worrying lack of confidence which the public

relations industry is beginning to face on the verge of a possible double-dip

recession. Clients are holding back their budgets and the public relations industry

needs to prove effective ROI. The rise of social networking platforms over the

last 6 years has pressured a vast array of industries to adapt or die. Whilst public

relations agencies, in-house professionals and consultants are all gradually

endorsing social media as part of a wider campaign strategy, knowing strategy

and tactics is not enough. Performance measurement must reach

a standard which not only upholds the values in the definition of public relations

but is also endorsed by the CIPR. Not

only have the tools which public relations professionals use changed, but the

industry’s very definition must be adapted.

ROI is calculated as the return on an investment (the gain less the cost) divided by the cost

(Investopedia, 2011).
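
As a minimal sketch of this calculation, assuming the common reading of ROI as the gain from an investment less its cost, divided by its cost, the function below works through one example. The function name and the campaign figures are hypothetical and are not taken from any campaign discussed in this dissertation.

```python
def roi(gain: float, cost: float) -> float:
    """Return on investment as a fraction: (gain - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (gain - cost) / cost


# Hypothetical campaign: 4,000 spent, 10,000 of return attributed to it.
campaign_roi = roi(gain=10_000, cost=4_000)
print(f"ROI: {campaign_roi:.0%}")
```

The difficult part for a public relations campaign is deciding what counts as the gain; the arithmetic itself is simple once that value has been agreed with the client.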

The public relations industry has to identify the key values from social media, in

relation to the campaign they are running, in order to complete the necessary

ROI calculation. Public relations theory is integral to understanding how

communication channels should adapt.

Prominent thought leader Brian Solis argues in his latest book, “The End of

Business as Usual” (2011), that the medium is no longer the message, a play on

Marshall McLuhan’s famous phrase from many years before, “the

medium is the message”. Audiences share heavily on social networks, which

is transforming behaviours and, in western society, suggests that the

hypodermic needle theory is no longer effective. According to hypodermic needle theory

(also known as magic bullet theory), “the mass media could influence a very

large group of people directly and uniformly by ‘shooting’ or ‘injecting’ them

with appropriate messages designed to trigger a response” (Gupta, 2006, p. 36).

According to Brian Solis, “media channels that compete for our attention are

transforming our behaviours, empowering users to take control of the

information that reaches them… messages are reborn through context and the

relevant experiences of people and organisations we value” (Solis, 2011, p. 15).

Figure 1 - ROI

The public relations industry has reached a critical stage requiring quick but

considered evolution. All content focused industries must evolve much like

natural selection in nature. When British evolutionary biologist, Richard Dawkins,

wrote “Climbing Mount Improbable” he referred to an analogy of creatures

reaching the peak of their evolution which resulted in their fixture in the natural

world or extinction as another creature continued through natural selection. The

same applies for the public relations industry as the internet landscape is shared

with online advertising. The CIPR must protect the industry by defining its role

through the purpose of public relations campaigns. The public relations we see

today may be unrecognisable in three years’ time.

Only three years ago there were many websites designed with landing pages for

users once referred through a search engine (Phillips & Young, 2009). Last year

Facebook could have been considered the social hub for many users before

visiting a website. In the last few months Google+’s effect upon the Google

search algorithm has ushered in an era of social search (Goold, 2012). The landing

page of a website could be considered less significant in an era when online

recommendation takes place first. This is a powerful factor considering Edelman’s

8095 report (at the beginning of this chapter), as more millennials discover

through sharing. This is only one of many developments which public relations

has experienced in the 21st century. In a recent CIPR interview Dr Jon White

provided a quick definition of public relations as a social psychology (CIPR TV,

2011); the public relations industry must understand how to measure and

understand. Discovering ROI measurements starts with evaluating a

message’s context, which explains its relevancy for a public relations campaign. The

industry must adapt, not die.



1.2 Web Analytics 2.0

The public relations industry must adapt or die which is why measurement is

integral for every business to survive. To understand corporate reputation,

relationships must be measured for success (Paine, 2011). The term Web 2.0 is

frequently used to describe the evolution of the web, in which websites provide the

facilities for information sharing and collaboration. This form of communication

can be likened to several of Grunig and Hunt’s four models (Grunig & Hunt,

1984):

1. Press Agentry

Description: One Way Communication. Publicity focused

In Practice: Little research into the audience necessary. Half-truths can be told

with the outcome of behaviour manipulation.

2. Public Information

Description: One Way Communication. Accuracy Necessary.

In Practice: Little research into the audience necessary. Accuracy is essential but

feedback is not measured.

3. Two Way Asymmetric

Description: Feedback used to change attitudes

In Practice: Feedback from the audience used to adapt messages for behavioural

change, not manipulation.

4. Two Way Symmetric

Description: A conversation

In Practice: Removes the need of a journalist as a mediator, allowing

conversation and adaptation from both parties involved.



In terms of online communication channels Web 1.0 describes how messages

were communicated across websites as one way communication through the use

of ‘Press Agentry’ and ‘Public Information’ models. Just as the information was

communicated one way, it could be said that analytics 1.0 applied. The metrics

available were based on clickstream data. This data has its

limitations. Avinash Kaushik is the author of the leading research and analytics

blog, Occam’s Razor. Within his latest book “Web Analytics 2.0” he makes the

distinction that clickstream answers the question ‘what?’ rather than ‘why?’

Clickstream data includes (Google Analytics, 2012):

Visits – The total number of visits to a website

Unique Visits – The unduplicated number of visits to a website

New visits – A measurement of new visits versus returning visits

Page views – The number of pages viewed on a website

Time – The average amount of time across all visits

Frequency of Visit – The total number of times a user has returned to a

website

Bounce Rate – The percentage of single-page visits in which the person

left your site from the landing page

Traffic Sources – This includes data from search engines, referring sites

and other traffic sources.

Keywords – This shows the keywords a user typed into a search engine

before arriving on a website.
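
To illustrate how several of these clickstream figures are derived from raw visit records, the sketch below summarises a small, entirely hypothetical visit log. The record layout (visitor id, pages viewed, seconds on site) and the function name are assumptions made for illustration; they do not describe how Google Analytics stores or reports its data.

```python
from collections import Counter

# Hypothetical visit log: (visitor_id, pages_viewed, seconds_on_site)
visits = [
    ("a", 1, 10),
    ("b", 4, 180),
    ("a", 3, 95),
    ("c", 1, 5),
    ("d", 2, 60),
]


def clickstream_summary(visit_log):
    """Derive basic clickstream metrics from a list of visit records."""
    total_visits = len(visit_log)
    visits_per_visitor = Counter(visitor for visitor, _, _ in visit_log)
    single_page_visits = sum(1 for _, pages, _ in visit_log if pages == 1)
    return {
        "visits": total_visits,
        "unique_visitors": len(visits_per_visitor),
        "returning_visitors": sum(1 for n in visits_per_visitor.values() if n > 1),
        "page_views": sum(pages for _, pages, _ in visit_log),
        "avg_time_seconds": sum(secs for _, _, secs in visit_log) / total_visits,
        # Bounce rate: share of visits that viewed only the landing page.
        "bounce_rate": single_page_visits / total_visits,
    }


for metric, value in clickstream_summary(visits).items():
    print(f"{metric}: {value}")
```

Every figure here answers a ‘what?’ question; nothing in the log explains why a visitor left after a single page.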

What clickstream statistics can mean for a digital public relations campaign is

increased revenue, reduced costs and an improvement of customer satisfaction

(Kaushik, 2010). Google Analytics specialises in clickstream data analysis; it is free

to use and vital for measuring results online. The formula for use depends upon the

values you place in your ROI.



The origin of online collaboration is almost impossible to pinpoint. It may

have begun with Morse code in the 1800s. In practice, collaboration began with the

arrival of two historical developments: the CTSS (Compatible Time-Sharing System) and

the invention of HTML (Hypertext Mark-up Language). Both of these

developments are examples of the human and technological advances

which explain where Web 2.0 is today. Email, an early feature of time-sharing

systems such as CTSS, was human communication, and HTML introduced the

hyperlinks which define the WWW (World Wide

Web). The Barabasi-Albert model is drawn from an algorithm which represents

scale-free networks (Barabasi et al, 1999). It is an example of the interconnected

structure of the internet but also how humans connect across a social network.
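
The growth rule behind a scale-free network can be sketched in a few lines. The code below is a simplified illustration of preferential attachment, in which each new node links to existing nodes with probability proportional to their current number of links. The function name, the parameters `n`, `m` and `seed`, and the choice of a small fully connected starting core are assumptions made for this sketch rather than details taken from Barabasi et al.

```python
import random
from collections import Counter


def barabasi_albert(n: int, m: int, seed: int = 0):
    """Grow a graph of n nodes where each new node attaches to m existing
    nodes chosen in proportion to their current degree."""
    if not 1 <= m < n:
        raise ValueError("need 1 <= m < n")
    rng = random.Random(seed)
    # Start from a small fully connected core of m + 1 nodes.
    edges = [(i, j) for j in range(m + 1) for i in range(j)]
    # Each node appears in 'stubs' once per link it has, so a uniform
    # draw from the list is a degree-proportional draw over nodes.
    stubs = [node for edge in edges for node in edge]
    for new_node in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for target in sorted(targets):
            edges.append((target, new_node))
            stubs.extend((target, new_node))
    return edges


edges = barabasi_albert(n=18, m=2)
degree = Counter(node for edge in edges for node in edge)
print(f"{len(degree)} nodes, {len(edges)} links")
print("Most connected nodes:", degree.most_common(3))
```

Because well-connected nodes are more likely to attract each new link, a few hubs tend to emerge, which is the pattern seen both in hyperlinked websites and in social networks.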

Figure 3 - Barabasi-Albert model

Figure 2 - ROI



The Barabasi-Albert model above is shown with 18 points of connection. Imagine

the scale of Facebook with its 800 million active users, 800 million points of

connection (Facebook, 2012). Web Analytics 2.0 is made possible through the

transparency which is exhibited through the content sharing across a vast array

of social networking platforms. Content is flowing freely across the internet and

it must be listened to and measured because:

You need to keep track of your stakeholders

You need to provide your client the best ROI

We need the public relations industry to evolve

Over the last 6 years we have not just seen the rise of social networking

platforms but also 3rd party measuring tools such as Brandwatch, Radian6 and

Sysomos. These leading self-service social media analytics tools all offer services playing

on a variation of ‘search, measure, understand and engage’ technology.

Organisations who use these tools as part of a social media strategy type in

search terms along with Boolean strings. The results not only show what

customers may be saying but also allow organisations to plan engagement

tactics. In evaluating this data it is necessary to grasp the definitions upheld by

the industry:

Quantitative: Data that refers to numbers and frequencies (number of updates,

average subscribing rate, etc.)

Qualitative: Data that provides information of meaning (status updates, tweets,

etc.)

Correlation: Works with quantifiable data to find relation between variables.

The exponential growth of social media requires the public relations industry to

consider the correlation of data before the data mining processes of 3rd party

providers. We are currently heading towards an era of correlation based digital

public relations where mass sentiment results in reaction based communication.
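
As a small illustration of correlation between two quantitative measures, the sketch below computes Pearson's correlation coefficient by hand. The two data series (daily brand mentions and daily website visits) are hypothetical values invented for this example, and Pearson's coefficient is only one of several ways the relation between variables could be measured.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical daily figures over five days.
mentions = [10, 20, 30, 40, 50]
site_visits = [110, 135, 170, 190, 225]

r = pearson(mentions, site_visits)
print(f"Correlation between mentions and visits: {r:.3f}")
```

A coefficient close to 1 shows that the two measures rise together; it does not show that the mentions caused the visits.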



1.3 How to measure sales and relationships

Web Analytics 2.0 is concerned with presenting the ‘What?’ and ‘Why?’ behind

clickstream data (Kaushik, 2010). During the 1990s the online advertising

industry witnessed the revolution of ‘one-to-one’ marketing which is “where

direct response, direct mail, the internet and the interactive opportunities of

digital TV come together” (White, 2000, p. 203). Over the following 10 years the

online advertising industry formed its own standardisation for measurement. This

involves measuring the below metrics (Gay, Charlesworth and Esen, 2007):

CPC (Cost Per Click)

CPA (Cost Per Action)

CPL (Cost Per Lead)

CTR (Click Through Rate)

CR (Conversion Rate)

CPM (Cost Per Thousand)

Calculations

CTR = CLICKS / IMPRESSIONS

CR = ACTIONS / CLICKS

CPM = (TOTAL COST / IMPRESSIONS)*1000
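
These calculations can be sketched as three small functions. The conversion rate is taken here as completed actions (for example sales) divided by clicks, which is the usual reading of the metric. The function names and the campaign figures are hypothetical.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """CTR: the share of ad impressions that resulted in a click."""
    return clicks / impressions


def conversion_rate(actions: int, clicks: int) -> float:
    """CR: the share of clicks that resulted in a completed action."""
    return actions / clicks


def cost_per_thousand(total_cost: float, impressions: int) -> float:
    """CPM: the cost of one thousand ad impressions."""
    return total_cost * 1000 / impressions


# Hypothetical campaign figures.
impressions, clicks, actions, total_cost = 500_000, 2_500, 125, 1_000.0

print(f"CTR: {click_through_rate(clicks, impressions):.2%}")
print(f"CR:  {conversion_rate(actions, clicks):.2%}")
print(f"CPM: {cost_per_thousand(total_cost, impressions):.2f}")
```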

Using the above metrics and calculations correlation may then be found between

the advertised product/brand and advertising MPU (Media Placement Units). For

instance a clothing brand may be advertising jeans for males between the ages of

18 – 25. Costs within network advertising can be attributed to individual metrics;

usually the CPC, CPA or CPM. In some instances a hybrid cost method may be

attributed (CPC and CPA) to provide the client with a better ROI.

This advertising campaign is run on the basis of sales, which means an action tag

will be placed on the client’s website attributed with the cost of £5.00; this is a

CPA cost method. For each sale through advertising they spend £5.00; the RRP

(Recommended Retail Price) of the jeans on the website is £19.99 each. If the

advertising campaign were to run then the graph of results may appear as below.


In the above example the client spent £16,835 running the network advertising

campaign (excluding internal marketing costs), with £5.00 being spent on

each sale through advertising. However, the actual sale price of the jeans on the

client’s website was £19.99, leaving a £14.99 profit gap per sale. Gross revenue was

therefore £67,306.33, leaving net revenue of £50,471.33.

TOTAL SALES - ADVERTISING SPEND = NET PROFIT
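
The arithmetic in this example can be reproduced as follows. The number of sales is not stated in the text; it is inferred here by dividing the advertising spend by the £5.00 cost per sale, and the variable names are chosen for illustration only.

```python
from decimal import Decimal

advertising_spend = Decimal("16835.00")  # total CPA spend from the example
cost_per_sale = Decimal("5.00")          # CPA attributed to the action tag
price_per_sale = Decimal("19.99")        # RRP of the jeans

# Inferred: every pound of spend corresponds to a tracked sale at £5.00 each.
sales = advertising_spend / cost_per_sale
gross_revenue = sales * price_per_sale
net_revenue = gross_revenue - advertising_spend

print(f"Sales: {sales}")
print(f"Gross revenue: £{gross_revenue:,.2f}")
print(f"Net revenue after advertising: £{net_revenue:,.2f}")
```

Decimal arithmetic is used rather than floating point so that the pence are exact.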

As stated the above calculations are set out as an example of how analytics are

used in network advertising to generate sales. These analytics are found through

the same JavaScript method as Google Analytics (and a host of other free

analytics tools); they are not exclusive to advertising.

Figure 3 – Online advertising example statistics

Advertising is tasked with tracking sales based upon clickstream data. Could 2012

be the year when the public relations industry utilises these metrics to not only

raise awareness through social networks but also track sales? It would not be the

first time that the public relations industry has relied upon the advertising industry

for validity. Even though the CIPR does not officially endorse AVE (Ad Value

Equivalency), a research paper published in 2003 by the IPR¹ describes the

demand for its use by bosses and clients (Fox, 2003). The calculation for AVE is:

MEASURING COLUMN INCHES * ADVERTISING RATES = EQUIVALENT COST

Or

SECONDS WITHIN BROADCAST MEDIA * ADVERTISING RATES = EQUIVALENT COST
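
A minimal sketch of the AVE calculation is given below. The advertising rates and coverage sizes are hypothetical, and the optional `multiplier` parameter is an assumption added to show the effect of the 1.5 to 1.6 weighting discussed in this section.

```python
def ave(units: float, rate_per_unit: float, multiplier: float = 1.0) -> float:
    """Advertising value equivalent: coverage size (column inches or
    broadcast seconds) times the advertising rate, optionally weighted."""
    if units < 0 or rate_per_unit < 0:
        raise ValueError("units and rate must not be negative")
    return units * rate_per_unit * multiplier


# Hypothetical print coverage: 12 column inches at 150.00 per inch.
print(f"Print AVE: {ave(12, 150.0):.2f}")
# The same coverage with a 1.5 weighting applied.
print(f"Weighted print AVE: {ave(12, 150.0, multiplier=1.5):.2f}")
# Hypothetical broadcast coverage: 30 seconds at 40.00 per second.
print(f"Broadcast AVE: {ave(30, 40.0):.2f}")
```

The weighting changes the reported value by half again without any change in the coverage itself, which is why such multipliers are open to criticism.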

The comparison between advertising and public relations is a cause for concern

as clients may presume an equal outcome of messages’ effect. This is before considering

the additional multipliers of 1.5 to 1.6 (industry standard rates) which may be applied

to the figure to manipulate ROI for the client. In

essence AVE follows the same calculation as CPM in online advertising – with a

greater concept of accuracy. Within public relations the outcome of relationships

could be measured through symmetrical communication (Childers, 1999) which

assists with:

Understanding the needs of stakeholders

Tracking the effectiveness of messages

Listening to mediators (Journalists, Bloggers and Opinion Leaders) to

provide them with relevant content.

1 The IPR (Institute of Public Relations) gained Chartered status in 2005 making it the

CIPR. (http://publicsphere.typepad.com/mediations/2005/02/ipr_wins_charte.html)

In terms of clickstream data the contextual relevancy of messages is found

through correlation. Patterns within data are evaluated against performance

objectives to assume poor or positive results. With social networks it is possible

to focus upon quantitative and qualitative statistics with the introduction of the

semantic web.



1.4 Introducing the Semantic Web

Today social networks are largely comprised of text based content which

requires an algorithm for detecting linguistics and presenting such data as

qualitative data sets. Semantic analytics are therefore an amalgamation between

text analytics and network ontologies. Recent research presents a dependency

upon RDF (Resource Description Framework)², a model which allows data sets to

be placed within web pages (RDF Working Group, 2004). This creates a

noteworthy distinction between hyperlinks and RDF (Lee, 2009);

“Like the web of hypertext, the web of data is constructed with documents on

the web. However, unlike the web of hypertext where links are relationships

anchors in hypertext documents written in HTML, for data they links between

arbitrary things described by RDF”³.

The creation of RDF links allows navigation across one data source to many

others; with the addition of a FOAF (Friend of a Friend) data link it is possible to

attribute identification to another author (Lee, 2009). When FOAF is used with

RDF a social network is created between individuals and data sets (Golbeck &

Rothstein, 2008), allowing a significant degree of accuracy between content,

context and a network of relationships.
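
The idea of linking people and data through RDF-style statements can be sketched without any RDF library by holding each statement as a (subject, predicate, object) triple. The FOAF namespace and the `name` and `knows` properties are part of the FOAF vocabulary; the people, the `example.org` identifiers and the helper functions are hypothetical and exist only for this illustration.

```python
FOAF = "http://xmlns.com/foaf/0.1/"    # FOAF vocabulary namespace
EX = "http://example.org/people/"      # hypothetical identifiers

# Each statement is a (subject, predicate, object) triple.
triples = [
    (EX + "alice", FOAF + "name", "Alice"),
    (EX + "bob", FOAF + "name", "Bob"),
    (EX + "carol", FOAF + "name", "Carol"),
    (EX + "alice", FOAF + "knows", EX + "bob"),
    (EX + "bob", FOAF + "knows", EX + "carol"),
]


def objects(subject, predicate):
    """All objects stated for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def friends_of_friends(person):
    """People reachable in two 'knows' links but not already known directly."""
    direct = set(objects(person, FOAF + "knows"))
    indirect = set()
    for friend in direct:
        indirect.update(objects(friend, FOAF + "knows"))
    return indirect - direct - {person}


def name_of(resource):
    """The stated name of a resource, or its identifier if none is stated."""
    names = objects(resource, FOAF + "name")
    return names[0] if names else resource


for resource in sorted(friends_of_friends(EX + "alice")):
    print("Friend of a friend:", name_of(resource))
```

Following links from one statement to the next is what allows content to be connected to its author and, through the author, to a wider network of relationships.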

Currently data mining for semantic data is achieved through semantic search

engines (through crawlers) or semantic web browsers. Presenting data in an

understandable format is accomplished through the use of OWL (Web Ontology

Language), a sublanguage for applying additional vocabulary for when data

needs to be processed by machines rather than humans (McGuiness &

Harmelen, 2004).

2 RDF is commonly written using XML (Extensible Mark-up Language)

3 This quote, written in 2006 (revised in 2009) by the founding father of the World Wide Web,

Tim Berners-Lee, signalled the founding of the Linking Open Data project, which aims to make data freely available to everyone.



The semantic web is made possible through all the above technical elements; the

question is how to utilise the conventions for analytical processing. A research

paper published by University of Georgia and University of Maryland entitled

“Semantic Analytics on Social Networks: Experience in Addressing the Problem of

Conflict of Interest Detection” describes the semantic research method as

follows (Meza, 2005):

1) Obtaining high quality data

Extraction of data from sites which includes metadata extraction from sources to

ensure relevancy.

2) Data preparation

Mostly data clean-up and evaluation

3) Entity disambiguation

Attach relevant data to the correct entity

4) Metadata and ontology representation

Importing or exporting data as RDF/RDFS and OWL.

5) Querying and inference techniques

Data processing to enable semantic analytics and discovery

6) Visualization

Prepare data in a readable format

7) Evaluation

Comparison needed between shown data and other evidence to see if a

correlation appears.



The process of obtaining semantic analytics depends upon the task; research is

new and therefore experimental. Semantic measurement methods require a

re-imagining of ranking methods which may have been used to measure blogs in the past

simply based upon clickstream data, as proposed by Katie Delahaye Paine (2011);

such methods could now be considered archaic.

One prominent semantic measurement method is Latent Semantic

Analysis (LSA), which evaluates the underlying meaning and concepts behind

language to build relationships between nouns and adjectives (Puffinware,

2010). Content is no longer king, context is king (Solis, 2011), and LSA provides

contextual relationships which closely illustrate natural language recognition

(Landauer, Foltz & Laham, 1998).
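
LSA starts from a matrix of term counts per document and then applies singular value decomposition to it. The sketch below covers only that first step, building the term count matrix and comparing documents by cosine similarity; the decomposition itself is not shown. The three short texts are hypothetical and are not taken from the Twitter data studied later in this dissertation.

```python
import math
from collections import Counter

# Hypothetical short documents standing in for tweets.
documents = {
    "tweet_1": "measuring reputation online",
    "tweet_2": "online reputation and measurement tools",
    "tweet_3": "coffee and music",
}

# Count how often each term appears in each document.
counts = {name: Counter(text.lower().split()) for name, text in documents.items()}
vocabulary = sorted(set().union(*counts.values()))

# Term count matrix: one row per term, one column per document.
matrix = {term: [counts[name][term] for name in documents] for term in vocabulary}


def cosine(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between the term count vectors of two documents."""
    vec_a = [counts[doc_a][term] for term in vocabulary]
    vec_b = [counts[doc_b][term] for term in vocabulary]
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm = math.sqrt(sum(a * a for a in vec_a)) * math.sqrt(sum(b * b for b in vec_b))
    return dot / norm if norm else 0.0


for term in vocabulary:
    print(f"{term:<12} {matrix[term]}")
print(f"tweet_1 vs tweet_2: {cosine('tweet_1', 'tweet_2'):.3f}")
print(f"tweet_1 vs tweet_3: {cosine('tweet_1', 'tweet_3'):.3f}")
```

Plain term counts treat ‘measuring’ and ‘measurement’ as unrelated words. Uncovering that kind of underlying relationship is the purpose of the decomposition step that LSA adds on top of this matrix.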

With regards to the semantic web, PR professionals are already ahead of the

game with their knowledge of values behind relationships. Just as Brian Solis

observed that new technologies are adjusting our behaviours (Solis, 2011), the

public relations industry must change their behaviour of how they utilise new

media – adapt or die. This begins with:

Adjusting our terminology from referring to stakeholders as ‘audiences’

to instead ‘publics’, removing the illusion of control that public relations

professionals still believe they have (Grunig, 2009).

Building relationships on a symmetrical basis rather than asymmetrical

(Grunig, 2011).

Understanding that intent is necessary⁴ so that a stakeholder

understands that a message is relevant (Theaker, 2008) and listening to

feedback.

4 This theory is discussed in the textbook “Human Communication” written by Michael

Burgoon, Frank G. Hunsaker and J. Dawson.



Based upon the nexus of values associated with multiple entities (individuals)

created from semantic analytics a study of linguistic pragmatics can be used to

form the correct rhetoric for stakeholders. The approach considers context of

content online and provides a method for public relations professionals to

provide meaning behind their messages (Mackey, 2005). This allows the

completion of campaign objectives to assist in raising awareness and changing

behaviour (which may even result in direct sales).



2.0 Methodology

As stated in the introduction to this dissertation the author intends to research

into three different areas:

1) The unprecedented growth of digital communication channels.

2) To assess the current usage of online metrics for evaluating web 1.0 and

web 2.0 platforms.

3) To assess the potential usage of semantic analysis for the public relations

industry.

To accurately research each of these areas the author deployed a variety of

different research methods. The main research piece of this dissertation is the

study of semantic analysis; the research presented in the literature review

is used to complement it and provide perspective for its

conclusion. Due to the modern nature of the research within this

dissertation it was not possible for the author to reference or interview anyone

involved with PR campaigns using semantic measurement, as nobody is practising

it yet.

The table below outlines the mixture of secondary and primary research utilised,

along with how these align with research aims and objectives. The first research

aim is designed to provide an academic insight into the growth of digital

communication channels and how they are measured. The second research aim

relies heavily upon primary research as it is an experimental piece of research.

| Research Aim | Objectives | Secondary | Primary |

| To assess the current usage of online metrics for evaluating web 1.0 and web 2.0 platforms. | To explore the increasing use of digital communication channels | Literature Review | N/A |

| | To explore the symmetry between traditional and digital communication channels | | |

| | To explore current metrics for online measurement | Literature Review | Observations |

| | To explore the potential of the semantic analysis method | | |

| To assess the potential usage of semantic analysis for the public relations industry. | Conducting research into latent semantic indexing (LSI) | Literature Review; Published Texts | Observations; Testing |

Figure 4 – Research table


26 | P a g e

Figure 5 – Research layout

To make it clear how the range of research methods and several research aims

provide conclusions to the questions provided at the start of this dissertation the

author has constructed a visual table.

Data collected for the semantic analysis research was achieved through

extracting data from Neville Hobson’s Twitter timeline and interpreting data

manually and through a python script5. This interpretation includes visually

displaying results using a singular value decomposition algorithm. The results of

this research can be found within the conclusion of this dissertation.

2.1 Research sample design

Literature Review

The literature review was conducted to gain an understanding of the progress of digital communication and the range of metrics available, and to gain perspective on semantic analysis. The review drew on a wide range of PR publications, practitioners’ published books and wider reading into online marketing. All material was selected on the basis of its relevance. This also included using digital communication channels:

5 This script is available to view in the appendices.

[Figure 5 diagram boxes: Research Methods; Literature Review; Observations and Testing; Findings; Research Evaluation; Conclusion]



Facebook

Twitter

Google+

Google Reader

Online Journals

Online Databases

Due to the nature of the research within the literature review no primary sources of data collection were chosen. However, all secondary evidence was selected based upon the credentials of its authors.

Data Analysis

Before approaching the research into Latent Semantic Indexing (LSI) it was

important for the author to note the types of data which would be collected:

Quantitative Data

This data takes the form of numerical figures.

Qualitative Data

This data takes the form of letters, words and sentences.

Correlating Data

Observing patterns between two or more pieces of data and presenting

these patterns as results. In terms of LSI this could take the form of

contextual patterns.

Each stage of the LSI analysis was established through additional research, all of which is referenced within this dissertation. The stages of this research are set out in section 3.0 in order to maintain research integrity.



2.2 Ethical considerations

The decision about which online data to collect for the LSI research was made conscientiously. It was important that the data used had a clear human source so that patterns could be detected. Neville Hobson’s Twitter timeline was eventually selected due to its public nature, but care was still taken not to publish a tweet widely if it had the potential to distress the original author.

The script used for LSI analysis was not originally programmed by the author of this dissertation. However, modifications were made to the data inputted into the script, slight modifications were made to variables to show appropriate results, and a correction was made to the script due to an update made to Python 2.7. This script has been made available in the appendices.

All other material referenced within this dissertation is publicly available.



3.0 Latent Semantic Indexing (LSI)

Research into Neville Hobson’s Twitter

timeline

An ideal example for presenting the benefit of Latent Semantic Indexing (LSI) is to observe how search engines such as Google operate. When a user provides a search term, an exact lexical match would not be appropriate because of the existence of synonyms (Duz, 2008). An example search of “Cheap gardening spades shop” could therefore produce lexical matches relating to card playing, gambling, gardening, etc. A purely Boolean search query would return every Google indexed webpage that includes all four words. Instead Google is understood to use a version of LSI (among other methods) to understand the patterns of words across indexed webpages. This mathematical technique uses Singular Value Decomposition (SVD) to identify the context between words. The process assumes that similar words will be used within the same contexts, which are discovered through the relationships between words. By weighting words according to their context, LSI is able to identify the category of written documents. For public relations professionals this method, when delivered through an automated algorithm, re-imagines stakeholder analysis.

Words that are usually written about a celebrity can be analysed to

understand associated values.

Research into competitors can be done to understand related terms

which can then be targeted in Search Engine Optimisation (SEO)

adaptations.

Understanding the values behind stakeholder groups to craft messages

effectively.



Automatic categorisation of media releases to understand the contexts

they should appear in.

Brand values become something being referred to by users online rather

than fixed in a marketing department.

The possibilities of LSI in public relations will become clear through time. As a

piece of research into LSI this technique has been used within this dissertation to

identify the key themes surrounding Neville Hobson’s Twitter timeline.

Neville Hobson first began blogging in 2002, a hobby which grew to incorporate

how a business should communicate using digital communication channels.

Today he has over 25 years’ experience in public relations, marketing

communication and financial relations (Hobson, 2012). His standing is exemplified by his popular Twitter profile, which had over 10,000 followers (as of 12/02/2012).

3.1 LSI Python Script

This LSI research was conducted using a modified version of this Python Latent Semantic Analysis code: http://www.puffinwarellc.com/index.php/news-and-articles/articles/33-latent-semantic-analysis-tutorial.html?start=2. The script was run using Python 2.7 with the additional scientific libraries NumPy and SciPy. Modifications to the script include a change of subject data, a change of stop words, a display command to print index words and a line to stop the program automatically closing upon build.

Evaluating Neville Hobson’s Twitter timeline using LSI has involved the following

steps:

1. Retrieve 50 tweets from Hobson’s timeline (9 Feb – 11 Feb 2012).
As a piece of manual LSI research 50 tweets provided an adequate sample. An automated algorithm could pull hundreds of tweets for analysis.

2. Filter URLs, hashtags, retweets, @replies and numerical values.
LSI is concerned with qualitative data in the form of words taken out of English syntax. All the data needs to be associated with Neville Hobson (hence no retweets).

3. Identify index words.
These are words which occur twice or more in the sample data, are not stop words (such as ‘it’, ‘the’, ‘a’, ‘if’, etc.) and carry meaning.

4. Discover correlation using the Term Count Model (TCM).
The TCM presents the initial stages of LSI by capturing the frequency of index words from retrieved data.

5. Apply weightages to index words.
Once the frequency of index words has been discovered an algorithm is used to apply contextual weightages to words.

6. Visual display of results.
Each index word, with its unique weightage, is presented in a graph. Words plotted in certain sections of the graph indicate categories.

7. Interpretation of results.
Understand the data.



3.2 Retrieval, Filter and Identification

Retrieving 50 tweets from Neville Hobson’s Twitter timeline involved a simple copy and paste into a Word document6. The tweets were extracted from a single period between the 9th and 11th February 2012. Any Re-Tweets (RTs) were disregarded as this research into LSI requires data unique to Neville Hobson.

The data filtration process is the clean-up needed to leave purely qualitative data. For data posted on Twitter this involved removing:

URLs

HashTags

Re-Tweets (RTs)

Numerical values

@replies to other users
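As an illustration only, the filtration rules listed above could be automated along the following lines. This is a minimal sketch in Python 3 and is not the process used in this research, where the clean-up was carried out by hand. The function name clean_tweet, the regular expressions and the two example tweets are assumptions made for the example.

```python
import re

def clean_tweet(text):
    """Return the plain words of a tweet, or None if the tweet is a retweet."""
    if re.match(r"\s*RT\b", text):
        return None                                   # retweets are discarded
    text = re.sub(r"https?://\S+", " ", text)         # URLs
    text = re.sub(r"[@#]\w+", " ", text)              # @replies and hashtags
    text = re.sub(r"\d+(?:[.,]\d+)*", " ", text)      # numerical values
    return " ".join(text.split())                     # collapse whitespace

# Hypothetical tweets, invented for this example
print(clean_tweet("@ellee 5 tips for #pr people http://example.com/tips"))
print(clean_tweet("RT @someone: worth a read"))
```

Rules of this kind would need to be rewritten for each digital communication channel, since each channel has its own markers for links, mentions and shared posts.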

Once the data has been filtered the second stage of LSI is to identify the “index

words” of the document. These are words that appear twice or more within the

captured data. So for instance, if the first tweet contained the word “social” and

the thirtieth also contained “social” – this makes “social” an index word. All index

words are connotative which means that their meanings can be interpreted

against other index words.

6 This document can be found in the appendix

Retrieving the index words of this document involved several forms of verification. The first stage involved manually reading over Neville Hobson’s tweets, highlighting each index word and measuring its frequency of appearance. To verify this manual process, which is subject to error, the Python script was adjusted to display the self.keys variable (line 95), which holds the index words:

1. Advice

2. Business

3. Comments

4. Daily

5. Era

6. Event

7. Fun

8. Global

9. Google

10. Hobson

11. Looks

12. Media

13. Morning

14. Networks

15. Neville

16. Perspectives

17. Post

18. Reading

19. Recording

20. Sharing

21. Snow

22. Social

23. Today

In doing so it was possible to identify any “stop words” within the sample data through a process of elimination. This involves examining English sentence syntax to identify words such as coordinating conjunctions, pronouns, adjectives and verbs that carry little meaning on their own. For this sample data the following words were omitted:

'on','just','to','for', 'great', 'i', 'between','and','a','good', 'is', 'the', 'of', 'some', 'in',

'other', 'why', 'get', 'by', 'I', 'as', 'use', 'says', 'out', 'too', 'via', 'here', 'it', 'about',

'an', 'at', 'be', 'coming', 'especially', 'I', 'into', 'its', 'make', 'need', 'not', 'one',

'prime', 'still', 'thanks','that', 'we', 'well', 'what', 'will', 'with'.
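The identification of index words can be sketched in a few lines of Python 3. The sketch follows the logic described above (lower-case each word, strip punctuation, skip stop words, keep words that occur twice or more). The function name index_words and the shortened stop word set are illustrative assumptions, and str.maketrans is used in place of the Python 2.7 translate call found in the appendix script. The three tweets are taken from the sample data in the appendix.

```python
from collections import Counter

def index_words(tweets, stopwords, ignorechars=",:'!"):
    """Return, in alphabetical order, the non-stop words occurring at least twice."""
    strip = str.maketrans("", "", ignorechars)   # table that deletes ignored characters
    counts = Counter()
    for tweet in tweets:
        for word in tweet.split():
            word = word.lower().translate(strip)
            if word and word not in stopwords:
                counts[word] += 1
    return sorted(w for w, n in counts.items() if n >= 2)

tweets = [
    "Global perspectives on social media",
    "Blog Global perspectives on social media",
    "The Neville Hobson Daily is out",
]
stopwords = {"on", "the", "is", "out"}           # shortened list for the example
print(index_words(tweets, stopwords))
```

With only three tweets, words such as "neville" occur once and are therefore not index words; across all fifty tweets they occur often enough to qualify.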

3.3 Term Count Model and Singular Value Decomposition

There are several stages in measuring the initial results of LSI. These include the Term Count Model (TCM) and Singular Value Decomposition (SVD). The TCM marks the initial stage of LSI and records the frequency of index word mentions. LSI works by setting aside the structured syntax of language and instead recognising individual key words. The TCM places the retrieved data into a count model so that it is possible to understand how frequently key words appear in each extracted tweet. This process alone does not produce any interpretable results but does allow SVD to take place later in the process.

The TCM results of Neville Hobson’s Twitter timeline data can be found on the next page7. Figures 6 and 7 show the data as the initial spreadsheet table and as a graph. At this early stage it is already apparent that the key word ‘social’ is by far the most frequent word.
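To make the structure of the TCM concrete, the sketch below builds a small count table with one row per index word and one column per tweet. It is an illustration in Python 3 with NumPy rather than the spreadsheet used in this research; the function name term_count_matrix is an assumption, and the two tweets are taken from the sample data in the appendix.

```python
import numpy as np

def term_count_matrix(tweets, index_words, ignorechars=",:'!"):
    """Count how often each index word (row) appears in each tweet (column)."""
    strip = str.maketrans("", "", ignorechars)
    row = {word: i for i, word in enumerate(index_words)}
    A = np.zeros((len(index_words), len(tweets)), dtype=int)
    for col, tweet in enumerate(tweets):
        for word in tweet.split():
            word = word.lower().translate(strip)
            if word in row:
                A[row[word], col] += 1
    return A

tweets = [
    "Global perspectives on social media",
    "things you still need to know about social media social business",
]
A = term_count_matrix(tweets, ["media", "social"])
print(A)                 # one row per index word, one column per tweet
print(A.sum(axis=1))     # total mentions of each index word
```

The row totals correspond to the frequencies shown in the TCM graph, while the full table is what SVD operates on.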


7 Larger versions of figure 6 and 7 can be found in the illustrations.


Figure 6 – TCM

Figure 7 – Visualisation of TCM

Now that the TCM table has been constructed it is necessary to return to the Python script to have the selected data broken down into different dimensions. This process is called Singular Value Decomposition (SVD), an algorithm which makes it possible to show visually the relationship between each key word and the tweet from which it originates. The number of dimensions used in SVD depends upon the data sets selected and the purpose of the SVD process. For evaluating Twitter timeline data three dimensions have been used. A histogram can be used to understand the importance of each singular value based upon the data sets used (Puffinware, 2010). The meaning behind each dimension is as follows:

Dimension 1: The TCM frequency of each index word.

Dimension 2: The X value relationship dimension.

Dimension 3: The Y value relationship dimension.

As the first dimension of SVD simply reflects the frequency of each index word it is not plotted. Dimensions two and three are therefore used for the SVD model, forming the X and Y axes of a comparative scatter graph. Each key word’s values in dimension two and dimension three give its coordinates on the graph. As each dimension has been derived by an algorithm which captures each key word’s relationship with the tweets it originates from, the data should show clusters of similar words gathering around particular values. For instance ‘advice’, ‘comments’ and ‘sharing’ may closely align with each other and may be interpreted as a social category.
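The extraction of dimensions two and three can be sketched as follows. This is a hedged illustration in Python 3 that uses numpy.linalg.svd in place of the scipy.linalg.svd call in the appendix script; the function name svd_coordinates and the small term count matrix are assumptions invented for the example. The sign of each dimension returned by SVD is arbitrary, so a plot may appear mirrored between runs or libraries without changing which words cluster together.

```python
import numpy as np

def svd_coordinates(A):
    """Return (coords, S): each word's values in dimensions two and three,
    and the singular values in descending order."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    coords = U[:, 1:3]      # skip dimension one, keep dimensions two and three
    return coords, S

# Hypothetical term count matrix: 4 index words (rows) across 5 tweets (columns)
A = np.array([[1, 0, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 0, 1, 1]], dtype=float)

coords, S = svd_coordinates(A)
for word, (x, y) in zip(["w1", "w2", "w3", "w4"], coords):
    print(word, round(x, 3), round(y, 3))
print("share of each singular value:", np.round(S**2 / np.sum(S**2), 3))
```

The final line gives the figures a histogram of singular values would display, which is one way to judge how many dimensions are worth keeping. The sketch needs at least three index words and three tweets for dimensions two and three to exist.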

Fig 8 shows a list of each key word and the associated values under dimension

two and dimension three.



Figure 8 – SVD dimensions table

Once these values have been plotted from a Microsoft Excel spreadsheet table the results appear as shown in Figure 9.



3.4 The Results

As expected certain key words have aligned more closely with some than with others, depending upon their relationship with the tweets from which they originated. This explains why the individual key words ‘Neville’, ‘Hobson’ and ‘Daily’ have aligned to form their original simple sentence again, as each word appears in exactly the same tweets. The original fifty tweets have not been included on this graph as their sheer number would have made it impossible to interpret the key word results, and including them would not help to fulfil the research task of this dissertation. If smaller data sets had been used (perhaps evaluating a handful of newspaper articles) then the original documents would have had a meaningful value when compared against the extracted key words. The final stage of this LSI research concludes with a manual interpretation of the weighted key word sets.

Figure 9 – Visual SVD



Figure 10 – Visual SVD with categorisation

This final stage requires manual interpretation of the categories which are

present as a result of SVD and LSI research. The three circled categories could be

classed as the following:

Red: Broadcasting

These three words are loosely based around the application of

broadcasting.

Blue: Community

Without a doubt these key words are all associated with community

activities and social business. Notice how all four of the words are to do

with the creation and sharing of information on Twitter. This may also

show that Neville Hobson has some influence as a user on Twitter.

Yellow: Authority & Teaching

This could also be labelled as a social category, but with respect to Neville Hobson’s timeline it suggests that he has authority and a teaching role. Notice how ‘comments’, ‘reading’ and ‘advice’ are closely weighted on the scatter graph, which may indicate that some tweets are about commenting on and publishing articles.



4.0 Evaluation

4.1 Evaluation of Latent Semantic Indexing

Despite the apparent success of the research within this dissertation concerning

LSI the author must note there are five important areas of improvement needed

with this system.

Small data set

For this research 50 tweets were captured for analysis which has left

words such as “era”, “networks” and “snow” uncategorised when

weighted by SVD. A larger data sample would provide increased accuracy

and depth into Neville Hobson’s online activity.

Shared meaning

LSI is unable to understand that some words may be spelt exactly the same but differ in meaning. Whilst the word ‘reading’ was categorised under “Authority & Teaching”, the context of the sentence it originated from may actually have referred to the location Reading. In order for LSI to understand the actual meanings behind words an additional research process would need to be used before SVD.

The clean-up process

Extracting tweets from Twitter for analysis is a process which requires a large amount of data clean-up. For an automated process an algorithm would need to be constructed in order to identify hashtags, URLs and @replies. As LSI can be implemented on a number of different digital communication channels, separate algorithms would need to be constructed to implement the different data clean-up processes.



Interpretation

LSI represents patterns of words. Within this example we can see how the words “Neville”, “Hobson” and “Daily” have all been attributed the same weightage through SVD, as all three words appear only in the same tweets. As LSI groups words by the tweets in which they appear rather than by their meaning, the word “Morning” is left entirely separate from “Daily” even though both share a close meaning. In the same way the words “social” and “networks” have been grouped differently even though the two words are frequently used together to describe the same term, “social networks”. Therefore LSI provides a pattern but additional interpretation is needed to identify word categories.

Automation is key

The research into LSI in this dissertation is extremely basic in comparison to the large data sets that would exist within a PR agency or in-house environment. It has taken a month for the author to fully understand the process of LSI in order to process a small data set from Neville Hobson’s Twitter timeline. For this measurement process to be used professionally an automated system would need to be constructed which can quickly crawl, extract, clean up and process data. Despite extensive research the author has not found an organisation or agency offering these services.
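The point made under “Interpretation”, that words appearing in exactly the same tweets receive exactly the same weightage, can be checked with a small experiment. The sketch below is an illustration in Python 3 with NumPy; the term count matrix is hypothetical and was invented for the example rather than taken from the research data.

```python
import numpy as np

words = ["neville", "hobson", "daily", "morning", "snow"]

# Hypothetical term count matrix: one row per word, one column per tweet.
# The first three words appear in exactly the same tweets.
A = np.array([[1, 0, 1, 1],
              [1, 0, 1, 1],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 0, 0]], dtype=float)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
coords = U[:, 1:3]          # dimensions two and three

for word, (x, y) in zip(words, coords):
    print(f"{word:8s} {x: .3f} {y: .3f}")
```

The first three rows of the output are identical because SVD only sees which tweets a word occurs in. It has no access to meaning, which is why a word such as “morning” is placed by its co-occurrence pattern rather than by its similarity in sense to “daily”.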



4.2 Bayesian Inference and Other Interpretations

The key stage of LSI is concerned with the nature of the SVD which takes place.

For the research within this dissertation the author has approached key word

weighting based upon a three dimensional analysis but discarding the first

dimension for more accurate results. However, curating the results of LSI can

take many forms which all take place after the SVD process. These processes

have not been applied to the processed data within this research piece due to

the small data set. Yet these different processes have been listed below.

Bayesian inference

Bayesian inference is a mathematical method used to estimate how probable it is that a notion is true or false. Where Boolean logic treats a notion as strictly true or false, Bayesian inference expresses it as a degree of probability (MS Research, 1998), and reasoning of this kind works in the background of many computing solutions. In this respect (Radford, 1998), “all forms of uncertainty are expressed in terms of probability”. The system therefore works upon a posteriori8 justifications, which makes it well suited to curating the results of LSI. If an LSI system used an advanced Bayesian inference script then the LSI algorithm could be completely automated, based upon an initial human evaluation categorising key words against sub-set categories.

Benefits: Fully automated system; Machine learning environment.

Considerations: Advanced script needed; Risk of misinterpretation of

words.
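A worked example may help to show what a Bayesian update involves. The sketch below, in Python 3, applies Bayes’ rule to the ‘Reading’ ambiguity noted in section 4.1. The function name posterior is an assumption, and every probability is a hypothetical figure chosen for illustration; none of them was measured in this research.

```python
def posterior(prior, likelihood):
    """Apply Bayes' rule over a set of mutually exclusive categories."""
    unnormalised = {c: prior[c] * likelihood[c] for c in prior}
    total = sum(unnormalised.values())
    return {c: p / total for c, p in unnormalised.items()}

# Hypothetical initial human evaluation: how often 'Reading' is a place or an activity
prior = {"location": 0.3, "activity": 0.7}

# Hypothetical evidence: chance of seeing the word 'driving' in the same tweet
likelihood = {"location": 0.6, "activity": 0.1}

print(posterior(prior, likelihood))
```

Although the prior favours the activity, the evidence of a neighbouring word shifts the balance towards the location. In an automated system the priors would come from the initial human categorisation and the likelihoods from previously categorised tweets.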

Natural language analysis

This process would involve taking the end results of LSI and putting them through a further process so that each key word is categorised under certain concepts. For instance the word ‘Reading’ can be defined as either an activity or a location. This would be achieved by manually weighting the word closer to one of the two concepts by reinforcing its relationship with nearby words. For instance if ‘Reading’ and ‘Car’ were to appear within the same sentence then natural language analysis would treat ‘Reading’ as a location in this instance. This process works on the basis of probability, which means it could be used in parallel with Bayesian inference.

8 The term ‘a posteriori’ is Latin for “from the later” and in philosophy describes knowledge gained from empirical evidence or experience.

Benefits: Fully automated system; Machine learning environment; More

accurate results.

Considerations: Advanced script needed; Risk of multiple languages;

Unknown semantic concepts.

Manual weighting system

The simplest way to curate the results from LSI would be to use a manual weighting system. This would involve users of a partially automated LSI programme making judgements concerning the results of analysis. This may take the form of a star based rating system, a numbered relevance system (1 – 10) or manually grouping certain results together under their own set categories.

Benefits: Easy to set up.

Considerations: Time consuming; Risk of human error; No machine

learning.



5.0 Conclusion

1. ROI is relied upon for Reputation Management and Direct Sales

The public relations industry has always relied on some form of calculation in order to understand how a client receives their ROI. In the past this has involved the use of AVE models, but online it is necessary for the CIPR to introduce a standard for practitioners to use.

2. Third party measuring tools exist but are not perfect

Clickstream data exists to answer the ‘What?’ and ‘Why?’ questions behind data. However there is a range of third party measuring tools that capture this clickstream data and use their own algorithms to provide sentiment levels. These programmes can be used, but only at a professional’s own discretion, as the calculations for sentiment are not usually publicly available.

3. The PR industry needs standardisation

The online advertising industry has been used as an example within this dissertation to show how that particular industry has applied its own standardisation to online metrics. In this respect the public relations industry is years behind; not only is there no standard for measuring traditional PR, but a standard does not yet exist for digital public relations either. As the chartered body, the CIPR must organise standard measurement metrics so that services can be better understood by clients and by the agencies offering them.

4. Semantic Analysis works but has not yet been perfected

The research into LSI shows how this measurement method could be utilised by PR professionals to measure reputation online. As of the publication of this dissertation the author is not aware of any organisation offering this form of measurement. However this may change in the next couple of years. This form of measurement is already being utilised by Google to deliver its search results and may well be used by the PR industry to measure its activities online. A bigger research study would be required to show fully how LSI could revolutionise the digital PR industry.



References

1. Barabasi, et al. (1999) ‘Emergence of Scaling in Random Networks’, Science Journal, 509-512 [online]. Available at: http://www.sciencemag.org/content/286/5439/509.full (Accessed: 26 January 2012)

2. Childers, L. (1999) ‘Guidelines for Measuring Relationships in Public

Relations’, the Institute for Public Relations. University of Florida.

3. CIPR TV. (2011) ‘CIPR TV Discusses Broadcast PR and the PR 2020 Report’. Retrieved January 26, 2012 from YouTube: http://www.youtube.com/watch?v=pzUYBEm-E6w&feature=youtu.be

4. CIPR. (2012) ‘What is PR?’ Retrieved April 04, 2012 from CIPR website:

http://www.cipr.co.uk/content/careers-cpd/careers-pr/what-pr

5. Duz, M. (2008) ‘Latent Semantic Indexing LSI Explained’. Retrieved April 04, 2012 from SEO blog: http://www.seo-blog.com/latent-semantic-indexing-lsi-explained.php

6. Facebook. (2012) ‘Statistics’. Retrieved January 26, 2012 from Facebook: https://www.facebook.com/press/info.php?statistics

7. Fox, J. B. (2003) ‘A Discussion of Advertising Value Equivalency (AVE)’, The

Institute for Public Relations. University of Florida.

8. Gay, R. Charlesworth, A and Esen, R. (2007) Online Marketing: a

customer-led approach. Oxford: Oxford University Press.

9. Golbeck, J. and Rothstein, M. (2008) ‘Linking Social Networks on the Web with FOAF: A Semantic Web Case Study’. University of Maryland.

10. Google Analytics. (2012) ‘Google Analytics Product Tour’. Retrieved

January 26, 2012 from Google Analytics website: http://www.google.com/analytics/tour.html

11. Goold, P. (2012) ‘Google’s ‘Search, plus Your World’ Highlights the

Additional Benefits of Social Activity, says Punch Communications’. Retrieved January 26, 2012 from Yahoo News website: http://news.yahoo.com/google-search-plus-world-highlights-additional-benefits-social-081625020.html

12. Gordon, A. (2011) Public Relations. Oxford: Oxford University Press.

13. Gould, D. (2010) ‘8095 Report: For Millennials, Brand Preference is a

Form of Self Expression’. Retrieved January 26, 2012 from PSFK website:



http://www.psfk.com/2010/10/8095-report-for-millennials-brand-preference-is-a-form-of-self-expression.html

14. Grunig, E. J. and Hunt, T. T. (1984). Managing Public Relations. United

States: Holt, Rinehart & Winston.

15. Grunig, J. E. (2009). Paradigms of global public relations in an age of digitalisation. Prism 6(2): http://praxis.massey.ac.nz/prisms_on-line_journ.html

16. Gupta, O. (2006). Encyclopaedia of Journalism and Mass Communication. India: Isha Books

17. Hobson, N. (2012) ‘About’. Retrieved April 04, 2012 from Neville Hobson’s blog: http://www.nevillehobson.com/about/

18. Investopedia. (2011) ‘Return on Investment – ROI’. Retrieved January 26, 2012 from Investopedia website: http://www.investopedia.com/terms/r/returnoninvestment.asp#axzz1jzpgEI6N

19. Jefkins, F. (2000) Advertising. Edinburgh: Pearson Education Limited.

20. Kaushik, A. (2010) Web Analytics 2.0: The art of online accountability & science of customer centricity. Indiana: Wiley Publishing

21. Landauer, K. T., Foltz, W. P. and Laham, D. (1998) ‘An Introduction to

Latent Semantic Analysis’, Discourse Processes Journal, 25, 259-284 [online] Available: http://lsa.colorado.edu/papers/dp1.LSAintro.pdf (Accessed: 26 January 2012)

22. Lee, B. T. (2009) ‘Linked Data’. Retrieved January 26, 2012 from W3

website: http://www.w3.org/DesignIssues/LinkedData.html

23. Mackey, S. (2005) ‘Rhetorical Theory of Public Relations: Opening the door to semiotic and pragmatism approaches’, The Annual Meeting of Australian and New Zealand Communication Association. Deakin University.

24. McGuinness, L. D. and Harmelen, V. F. (2004) ‘OWL Web Ontology

Language Overview’. Retrieved January 26, 2012 from W3 website: http://www.w3.org/TR/owl-features/

25. Meza, A. B, et al. (2005) ‘Semantic Analytics on Social Networks:

Experiences in Addressing the Problem of Conflict of Interest Detection’. University of Georgia & University of Maryland.



26. MS Research. (1998) ‘Basics of Bayesian Inference and Belief Networks’. Retrieved April 07, 2012 from Microsoft Research website: http://research.microsoft.com/en-us/um/redmond/groups/adapt/msbnx/msbnx/basics_of_bayesian_inference.htm

27. Owens, J. (2012) ‘PRCA Trends Barometer Reveals Concerns About

Industry Outlook’. Retrieved January 26, 2012 from PR Week: http://www.prweek.com/news/rss/1112783/PRCA-trends-barometer-reveals-concerns-industry-outlook/

28. Paine, D. K. (2007) ‘How to set benchmarks in social media: Exploratory

research for social media, lessons learned’. KDPaine & Partners.

29. Paine, D. K. (2011) Measure what Matters: Online Tools for Understanding Customers, Social Media, Engagement, and Key Relationships. New Jersey: John Wiley & Sons.

30. Phillips, D. and Young, P. (2009) Online Public Relations: A practical guide

to developing an online strategy in the world of social media (2nd Ed). London: Kogan Page.

31. PRCA. (2012) ‘What is PR?’ Retrieved April 04, 2012 from PRCA website:

http://www.prca.org.uk/What_is_PR

32. Puffinware. (2010) ‘Latent Semantic Analysis (LSA) Tutorial’. Retrieved January 26, 2012 from iMetaSearch website: http://www.puffinwarellc.com/index.php/news-and-articles/articles/33-latent-semantic-analysis-tutorial.html

33. Radford, M. (1998) ‘Philosophy of Bayesian Inference’. Retrieved April 07, 2012 from Toronto University website: http://www.cs.toronto.edu/~radford/res-bayes-ex.html

34. RDF Working Group. (2004) ‘Resource Description Framework (RDF)’.

Retrieved January 26, 2012 from W3C website: http://www.w3.org/RDF/

35. Solis, B. (2011) The End of Business as Usual: Rewire the way you work to succeed in the consumer revolution. New Jersey: John Wiley & Sons.

36. Theaker, A. (2008). The Public Relations Handbook (3rd Ed). Oxon:

Routledge

37. White, M. (2012) ‘Considering PRSA’s Definition of PR’. Retrieved April 04, 2012 from Michael White’s blog: http://www.mikewhite.co.uk/2012/03/19/considering-prsas-definition-of-pr/

38. White, R. (2000) Advertising (4th Ed). Berkshire: McGraw-Hill Publishing

Company.



Illustrations

Figure 1: ROI

Figure 2: Barabasi-Albert model

Figure 3: Online advertising example statistics



Figure 4: Research table


Figure 5: Research layout


Figure 6: TCM

Figure 7: Visualisation of TCM

Figure 8: SVD Dimensions table

Figure 9: Visual SVD

Figure 10: Visual SVD with categorisation

Appendix

Copy of Python LSI Tweet Analysis Code

from numpy import zeros

from scipy.linalg import svd

#following needed for TFIDF

from math import log

from numpy import asarray, sum

titles = ["Firefox on Win just updated to version. Critical security fix",

"March release for Samsung Galaxy S II Android update. Anywhere between and days away",

"Asus Transformer Prime review stars and a good q: What is the point of the Prime",

"yw. Great post, some good contribs to the issue in the other comments",

"Added to the conversation on Guardian post about recording phone interviews for podcasts",

"Why Social Media Jobs Get Filled By Younger Folks: Infographic",

"Viewpoint: V for Vendetta and the rise of Anonymous. Great read",

"FTW",

"I tend to take a power strip with sockets and only one adapter",

"things you still need to know about social media social business. Spot on. especially",

"tips for managing negative comments online. Good advice",

"thanks, Kerry, good refocus",

"The Neville Hobson Daily is out",

"Breakfast supplies",

"U.S. Air Force May Buy 18,000 Apple IPad2s for Flight Crews. Businessweek via",

"we can do that, Ellee, would be fun",

"Morning. Beautiful, sunny and, terrific start to the weekend",

"Thinking that Google Hangouts is a pretty neat tool, especially the screen sharing feature",



"An imaginative approach to a difficult (macabre, perhaps) topic to talk about - what happens to your digital conte",

"fyi re Feel free to RT",

"thanks. A good story. Almost as good as",

"hi Sylvie. Not aware of any recent surveys on smallbiz and use of social networks in Australia",

"of UK small businesses use social networks for business, says survey",

"Global perspectives on social media",

"Blog Global perspectives on social media",

"The Neville Hobson Daily is out",

"Google is getting into the music hardware business says the",

"The FTSE social media index. Ranking methodology explained, too. Via",

"Texas Jury Strikes Down Patent Trolls Claim to Own the Interactive Web. Good result",

"What does (and doesn't) on Twitter and Facebook. Hard to get English plainer than this",

"there's a good shoe shiner in the enclosed courtyard at Devonshire Square, EC",

"yes, same here, not much traffic coming in to Reading from the A4 east",

"Ads coming to the LinkedIn mobile app",

"Seeing that Harry Redknapp is still a news headline. Come on, FA, just give him the job",

"of course a lot of snow is a relative expression",

"Driving into Reading shortly should be fun",

"Morning. Quite a bit of snow out there. Well, an inch or two anyway",

"uksnow Will it settle? Looks unlikely although tomorrow morning will tell",

"The tone of life on social networking sites Behavioural study by Pew, interesting findings",

"File Sharing in the Post MegaUpload Era Mainly, staggeringly less efficient.",

"End of an era: Kodak discontinues its camera business",


"I suspect is the one to ask that: is anyone recording the Google session at",

"Looks a must-be-there event: Google at",

"we need to make that happen",

"that looks a great event, Holly, thanks But I won't be in London that day unfortunately",

"Many thanks to for his superb insight & advice on social media monitoring today <= my pleasure",

"Windows Consumer Preview due February: why it's not called beta",

"The Neville Hobson Daily is out! Top stories today via",

"wrestles with microblog revenue plan user loyalty, monetize",

"we'll make it work"

]

stopwords = ['on','just','to','for', 'great', 'i', 'between','and','a','good', 'is', 'the', 'of', 'some', 'in', 'other', 'why', 'get', 'by', 'I', 'as', 'use', 'says', 'out', 'too', 'via', 'here', 'it', 'about', 'an', 'at', 'be', 'coming', 'especially', 'I', 'into', 'its', 'make', 'need', 'not', 'one', 'prime', 'still', 'thanks','that', 'we', 'well', 'what', 'will', 'with']

ignorechars = ''',:'!'''

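# Python 2 listing. It relies on zeros, asarray and sum (NumPy), svd
# (numpy.linalg or scipy.linalg) and log (math) being imported at the
# top of the script.

class LSA(object):

    def __init__(self, stopwords, ignorechars):
        self.stopwords = stopwords
        self.ignorechars = ignorechars
        self.wdict = {}    # word -> list of document indices it occurs in
        self.dcount = 0    # number of documents parsed so far

    def parse(self, doc):
        # Record every non-stopword of one document (tweet) in wdict.
        words = doc.split()
        for w in words:
            # Lower-case the word and strip the ignored punctuation.
            w = w.lower().translate(None, self.ignorechars)
            if w in self.stopwords:
                continue
            elif w in self.wdict:
                self.wdict[w].append(self.dcount)
            else:
                self.wdict[w] = [self.dcount]
        self.dcount += 1

    def build(self):
        # Keep only words that occur more than once, then fill the
        # word-by-document count matrix A.
        self.keys = [k for k in self.wdict.keys() if len(self.wdict[k]) > 1]
        self.keys.sort()
        self.A = zeros([len(self.keys), self.dcount])
        for i, k in enumerate(self.keys):
            for d in self.wdict[k]:
                self.A[i,d] += 1

    def calc(self):
        # Singular value decomposition of the matrix A.
        self.U, self.S, self.Vt = svd(self.A)

    def TFIDF(self):
        # Optional TF-IDF weighting of A. It is not called in the run below.
        WordsPerDoc = sum(self.A, axis=0)
        DocsPerWord = sum(asarray(self.A > 0, 'i'), axis=1)
        rows, cols = self.A.shape
        for i in range(rows):
            for j in range(cols):
                # A document with no retained words has a zero total;
                # its entries are already zero, so skip the division.
                if WordsPerDoc[j] > 0:
                    self.A[i,j] = (self.A[i,j] / WordsPerDoc[j]) * log(float(cols) / DocsPerWord[i])

    def printA(self):
        print self.keys
        print 'Here is the count matrix'
        print self.A

    def printSVD(self):
        print 'Here are the singular values'
        print self.S
        print 'Here are the first 3 columns of the U matrix'
        print -1*self.U[:, 0:3]
        print 'Here are the first 3 rows of the Vt matrix'
        print -1*self.Vt[0:3, :]

# Parse every tweet, build the count matrix and print its decomposition.
mylsa = LSA(stopwords, ignorechars)

for t in titles:
    mylsa.parse(t)

mylsa.build()
mylsa.printA()
mylsa.calc()
mylsa.printSVD()

raw_input("\n\nPress The Enter Key To Exit")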
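The listing above is written for Python 2 and defines a TFIDF step that the final run never calls. As a minimal sketch only, and not the code used in this study, the same pipeline (count matrix, optional TF-IDF weighting, SVD) can be expressed for Python 3 with NumPy as follows. The function names build_count_matrix and tfidf, the shortened stopword set and the four sample tweets are illustrative assumptions. The sketch also skips empty tokens and guards against documents with no retained words, which are deviations from the listing.

```python
import numpy as np

# Assumed, shortened stopword set for illustration (a subset of the listing's).
STOPWORDS = {"the", "is", "out", "on", "too", "via"}
IGNORECHARS = ",:'!"


def build_count_matrix(docs, stopwords=STOPWORDS, ignorechars=IGNORECHARS):
    """Return (sorted terms, term-by-document count matrix).

    As in LSA.build, only words occurring more than once overall are kept.
    Unlike the listing, tokens that become empty after stripping are skipped.
    """
    occurrences = {}
    for d, doc in enumerate(docs):
        for raw in doc.split():
            w = "".join(c for c in raw.lower() if c not in ignorechars)
            if not w or w in stopwords:
                continue
            occurrences.setdefault(w, []).append(d)
    terms = sorted(t for t, ds in occurrences.items() if len(ds) > 1)
    A = np.zeros((len(terms), len(docs)))
    for i, t in enumerate(terms):
        for d in occurrences[t]:
            A[i, d] += 1
    return terms, A


def tfidf(A):
    """TF-IDF weighting in the style of LSA.TFIDF.

    Columns for documents with no retained words are left at zero
    instead of being divided by a zero word total.
    """
    words_per_doc = A.sum(axis=0)
    docs_per_word = (A > 0).sum(axis=1)
    n_docs = A.shape[1]
    tf = np.divide(A, words_per_doc, out=np.zeros_like(A),
                   where=words_per_doc > 0)
    idf = np.log(n_docs / docs_per_word)
    return tf * idf[:, None]


# Four tweets taken from the titles list above.
docs = [
    "The Neville Hobson Daily is out",
    "Global perspectives on social media",
    "Blog Global perspectives on social media",
    "The FTSE social media index. Ranking methodology explained, too. Via",
]

terms, A = build_count_matrix(docs)
W = tfidf(A)
U, S, Vt = np.linalg.svd(W, full_matrices=False)

print(terms)
print(A)
print(S)
```

In this small sample the first tweet shares no repeated word with the others, so its column in the count matrix is entirely zero. That is the case the guard in tfidf is meant to handle.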
