Social Media Analytics with a pinch of semantics

Harith Alani

http://people.kmi.open.ac.uk/harith/

@halani

harith-alani


Outline of my talk

§  I’ll start talking
§  Then I’ll finish talking
§  You’ll wonder what you’ve learned!
§  You will clap regardless
§  You’ll be convinced you learned nothing
§  You could be right!
§  But you’re wrong of course
§  We go to the bar tonight and forget all about the talk!

•  Why social media analytics?
   –  It’s where everyone is!
   –  Real time information
   –  Low cost
   –  Much of it

Survey of 3800 marketers on how they use social media to grow their business

Social Media for Businesses

§  “they can't be forced to use social apps, they must opt-in”

§  “need a detailed understanding of social networks: how people are currently working, who they work with and what their needs are”


Measuring Social Media


Tools for monitoring social networks

LinkedIn Group Analytics

Facebook Insights
•  Provides measurements on FB Page performance
•  Provides demographic data about visitors, and their engagement with posts
•  “Experiment with different types of posts to see what your audience responds to best.”

Social Media Challenges
•  Integration
   –  How to represent and connect this data?
•  Behaviour
   –  How can we measure and predict behaviour?
   –  Which behaviours are good/bad in which community type?
•  Change
   –  Can we influence behaviour change?
•  Community Health
   –  What health signs should we look for?
   –  How to predict them?
•  Engagement
   –  How can we maximise engagement?
•  Sentiment
   –  How to measure it? track it?
   –  Can we predict sentiment towards entities (brands, people, events)?

Forum on a celebrity

Forum on transport

June 25, 2013

In-house Social Platforms

Jan 29, 2013

Semantically-Interlinked Online Communities (SIOC)
•  SIOC aims to enable the integration of online community information.
•  SIOC provides a Semantic Web ontology for representing rich data from the Social Web in RDF

sioc-project.org

Semantics in FB Open Graph

Behaviour Analysis

Why monitor behaviour?

§  Understand impact of behaviour on community evolution
§  Forecast community future
§  Learn when intervention might be needed
§  Learn which behaviour should be encouraged or discouraged
§  Find what could trigger certain behaviours
§  What is the best mix of behaviour to increase engagement in the community
§  To see which users need more support, which ones should be confined, and which ones should be promoted

Behaviour analysis in Social Media

§  Bottom-up analysis
§  Every community member is classified into a “role”
§  Unknown roles might be identified
§  Copes with role changes over time

Roles: initiators, lurkers, followers, leaders

Structural, social network, reciprocity, persistence, participation

Feature levels change with the dynamics of the community

Associations of roles with a collection of feature-to-level mappings e.g. in-degree -> high, out-degree -> high

Run rules over each user’s features and derive the community role composition
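The rule-based step above can be sketched in a few lines of Python. This is only an illustration, not the SPIN encoding used in the work: the thresholds, the feature names `in_degree` and `out_degree`, and the pairing of each role label with a degree pattern are all hypothetical values chosen for the example.

```python
from collections import Counter

# Hypothetical cut points per feature: below the first value is "low",
# below the second is "mid", anything else is "high".
THRESHOLDS = {
    "in_degree": (2, 10),
    "out_degree": (2, 10),
}

# Hypothetical role rules: each role is a feature-to-level mapping.
ROLE_RULES = {
    "leader": {"in_degree": "high", "out_degree": "high"},
    "initiator": {"in_degree": "high", "out_degree": "low"},
    "follower": {"in_degree": "low", "out_degree": "high"},
    "lurker": {"in_degree": "low", "out_degree": "low"},
}


def to_level(feature, value):
    """Translate a raw feature value into low / mid / high."""
    low_cut, high_cut = THRESHOLDS[feature]
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "mid"
    return "high"


def infer_role(features):
    """Return the first role whose rule matches the user's feature levels."""
    levels = {name: to_level(name, value) for name, value in features.items()}
    for role, rule in ROLE_RULES.items():
        if all(levels.get(feature) == level for feature, level in rule.items()):
            return role
    return "unclassified"


def role_composition(users):
    """Share of users per role in one community."""
    counts = Counter(infer_role(features) for features in users.values())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {role: count / total for role, count in counts.items()}


users = {
    "alice": {"in_degree": 25, "out_degree": 30},
    "bob": {"in_degree": 0, "out_degree": 1},
    "carol": {"in_degree": 1, "out_degree": 15},
    "dave": {"in_degree": 0, "out_degree": 0},
}
composition = role_composition(users)
```

Because the thresholds live in one table, they can be recomputed whenever the feature levels shift with the dynamics of the community, without touching the rules themselves.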

Modelling user features and interactions

Encoding Rules in Ontologies with SPIN

Clustering for identifying emerging roles

–  Map the distribution of each feature in each cluster to a level (i.e. low, mid, high)

–  Align the mapping patterns with role labels
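A minimal sketch of the mapping step, assuming equal-frequency binning into three levels and using the cluster median as the summary of each feature's distribution (the median is an assumption made for this example). The sample users and the cluster membership are invented for illustration.

```python
from statistics import median


def equal_frequency_cuts(values):
    """Two cut points that split the sorted values into three equally sized bins."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered[n // 3], ordered[(2 * n) // 3]


def level_of(value, cuts):
    """Translate a value into low / mid / high using the two cut points."""
    low_cut, high_cut = cuts
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "mid"
    return "high"


def cluster_pattern(cluster_rows, cuts_per_feature):
    """Map the median of each feature within a cluster to a level."""
    return {
        feature: level_of(median(row[feature] for row in cluster_rows), cuts)
        for feature, cuts in cuts_per_feature.items()
    }


# Invented feature values for nine users.
users = [
    {"dispersion": 0.1, "quality": 7},
    {"dispersion": 0.1, "quality": 8},
    {"dispersion": 0.2, "quality": 9},
    {"dispersion": 0.3, "quality": 1},
    {"dispersion": 0.4, "quality": 2},
    {"dispersion": 0.5, "quality": 3},
    {"dispersion": 0.6, "quality": 4},
    {"dispersion": 0.8, "quality": 5},
    {"dispersion": 0.9, "quality": 6},
]

cuts = {
    feature: equal_frequency_cuts([user[feature] for user in users])
    for feature in ("dispersion", "quality")
}

# Pretend a clustering algorithm produced these two groups.
cluster_a = users[:3]
cluster_b = users[6:]
pattern_a = cluster_pattern(cluster_a, cuts)
pattern_b = cluster_pattern(cluster_b, cuts)
```

A pattern such as low dispersion with high quality would then be aligned by hand with a role label, for example a focussed expert.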

Table 1: Correlation coefficients of dimensions

              Dispersion  Engagement  Contribution  Initiation  Quality  Popularity
Dispersion    1.000       0.277       0.168         0.389       0.086    0.356
Engagement    0.277       1.000       0.939**       0.284       0.151    0.926**
Contribution  0.168       0.939**     1.000         0.274       0.086    0.909**
Initiation    0.389       0.284       0.274         1.000       -0.059   0.513
Quality       0.086       0.151       0.086         -0.059      1.000    0.065
Popularity    0.356       0.926**     0.909**       0.513       0.065    1.000

Figure 7: Cumulative density functions of each dimension showing the skew in the distributions for initiated and in-degree ratio

same forum and do not deviate away, at the other extreme very few users are found to post in a large range of forums. For initiated (initiation) and in-degree ratio (popularity) the density functions are skewed towards low values where only a few users initiate discussions and are replied to by large portions of the community. Average points per post (quality) is also skewed towards lower values indicating that the majority of users do not provide the best answers consistently.

These plots indicate that feature levels derived from these distributions will be skewed towards lower values, for instance for initiated the definition of high for this feature is anything exceeding 1.55x10^-5.

The distribution of each dimension is shown in Figure 8 for each of the 11 induced clusters. We assess the distribution of each feature for each cluster against the levels derived from the equal-frequency binning of each feature, thereby generating a feature-to-level mapping. This mapping is shown in Table 2 where certain clusters are combined together as they have the same feature-level mapping patterns (i.e. 5,7 and 8,9). We then interpreted the role labels from these clusters, and their subsequent patterns, as follows:

Figure 8: Boxplots of the feature distributions in each of the 11 clusters. Feature distributions are matched against the feature levels derived from equal-frequency binning

• 0 - Focussed Expert Participant: this user type provides high quality answers but only within select forums that they do not deviate from. They also have a mix of asking questions and answering them.

• 1 - Focussed Novice: this user is focussed within a few select forums but does not provide good quality content.

• 2 - Mixed Novice: is a novice across a medium range of topics

Table 2: Mapping of cluster dimensions to levels

Cluster  Dispersion  Initiation  Quality  Popularity
0        L           M           H        L
1        L           L           L        L
2        M           H           L        H
3        H           H           H        H
4        L           H           H        M
5,7      H           H           L        H
6        L           H           M        M
8,9      M           H           H        H
10       L           H           M        H

• 3 - Distributed Expert: an expert on a variety of topics and participates across many different forums

• 4 - Focussed Expert Initiator: similar to cluster 0 in that this type of user is focussed on certain topics and is an expert on those, but to a large extent starts discussions and threads, indicating that his/her shared content is useful to the community

• 5,7 - Distributed Novice: participates across a range of forums but is not knowledgeable on any topics

• 6 - Focussed Knowledgeable Member: contributes to only a few forums, has medium-level expertise (i.e. he/she is neither an expert nor a novice) and has medium popularity

• 8,9 - Mixed Expert: medium-dispersed user who provides high-quality content

• 10 - Focussed Knowledgeable Sink: focussed user who has medium-level expertise but who gets a lot of the community replying to them - hence a sink. Differs from cluster 6 in terms of popularity.

6. Analysis: Community Health

Deriving a community’s role composition provides community operators and hosts with a macro-level view of how their community is operating and how it is functioning. Understanding what is a healthy and unhealthy composition in a community involves analysing how a given role composition has been associated with community activity, interaction or some other measure in the past and reusing that knowledge. Forums and communities operating within the same platform may also differ such that what turns a community healthy in one location may be different from another. In this section we describe how community analysis is possible through our presented approach to derive the role composition of a community using semantic rules.

6.1. Experimental Setup

To demonstrate the utility of our approach we analysed each of the 33 SAP communities from 2009 through to 2011. Figure 9 shows how our dataset was divided into the tuning section - i.e. the first half of 2008 in which we derived our clusters and aligned them to roles (as described in Section 5) - and the analysis section. We began with the 1st January 2009 as our collect date by taking a feature window 6 months prior to this date (going back to the 2nd half of 2008) in which we measured the behaviour dimensions for each community’s users. In order to gauge the role composition in a community over time we move our collect date on one week at a time and use the 6-months prior to this date as our feature window. As Figure 9 demonstrates we repeat this process until we reach 2011.

Figure 9: Windows used for a) tuning of the clusters and the derivation of roles and b) the analysis of community health. Role composition is derived every week from 2009 onwards using a 6-month window going back from the collection date.
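The weekly collection procedure can be sketched as a generator of (window start, collect date) pairs. This is a rough illustration only: the function name `collection_windows` is hypothetical, and the 6-month feature window is approximated here as 182 days, which is an assumption rather than the paper's stated definition.

```python
from datetime import date, timedelta


def collection_windows(first_collect, last_collect, window_days=182, step_days=7):
    """Yield (window_start, collect_date) pairs, moving the collect date
    forward one step at a time until it passes last_collect."""
    collect = first_collect
    while collect <= last_collect:
        yield collect - timedelta(days=window_days), collect
        collect += timedelta(days=step_days)


# First five weekly collect dates, starting on 1st January 2009.
windows = list(collection_windows(date(2009, 1, 1), date(2009, 1, 29)))
```

For each pair, the behaviour dimensions would be measured over posts dated inside the window, and the role composition derived at the collect date.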

By measuring the behaviour dimensions of individual users in individual communities we are able to infer the roles of the users using the semantic rules described in Section 4. This provides a micro-level assessment of the roles that individual users assume. We can then look at the macro-level by deriving the role composition of a given community at a given point in time by measuring how many users have a specific role. Such role composition analysis allows for predictions to then be made. To demonstrate the application of such analysis we performed three distinct experiments (each designed to explore one of our three aforementioned research questions):

1. Composition Analysis: assesses the average role composition in each community and clusters them based on the compositions. We also pick out each community’s most popular role and measure what percentage of the community that role covers.

2. Activity Increase/Decrease: we perform a binary classification task such that at timestep t = k + 1 we predict whether the community’s activity (i.e. number of posts) has increased or decreased since

•  1 - Focussed Novice: focussed within a few select forums but does not provide good quality content.

•  2 - Mixed Novice: a novice across a medium range of topics

•  3 - Distributed Expert: expert on a variety of topics and participates across many different forums

….

Mapping of cluster dimensions to levels

Correlation of behaviour with community activity

§  How does the existence of certain behaviour roles impact activity in an online community?

Online Community Health Analytics

[Figure: ROC curves (TPR against FPR) in four panels: Churn Rate, User Count, Seeds / Non-seeds Prop, Clustering Coefficient]

•  Machine learning models to predict community health based on compositions and evolution of user behaviour

•  Churn rate: proportion of community leavers in a given time segment.

•  User count: number of users who posted at least once.

•  Seeds to Non-seeds ratio: proportion of posts that get responses to those that don’t

•  Cluster coefficient: extent to which the community forms a clique.
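The four health indicators listed above can be computed from a flat list of posts. The sketch below is one possible reading of the definitions, not the authors' implementation: the post fields `id`, `author` and `reply_to` are hypothetical, churn is taken as the share of the previous segment's users who no longer post, and the clustering coefficient is the average local coefficient on the undirected reply graph.

```python
from itertools import combinations


def user_count(posts):
    """Number of users who posted at least once."""
    return len({post["author"] for post in posts})


def churn_rate(previous_users, current_users):
    """Share of the previous segment's users absent from the current one."""
    if not previous_users:
        return 0.0
    return len(previous_users - current_users) / len(previous_users)


def seed_ratio(posts):
    """Thread starters that received a reply, relative to those that did not."""
    starters = {post["id"] for post in posts if post["reply_to"] is None}
    replied = {post["reply_to"] for post in posts if post["reply_to"] is not None}
    seeds = len(starters & replied)
    non_seeds = len(starters) - seeds
    return seeds / non_seeds if non_seeds else float("inf")


def clustering_coefficient(posts):
    """Average local clustering coefficient of the undirected reply graph."""
    author_of = {post["id"]: post["author"] for post in posts}
    neighbours = {}
    for post in posts:
        parent = author_of.get(post["reply_to"])
        if parent is not None and parent != post["author"]:
            neighbours.setdefault(post["author"], set()).add(parent)
            neighbours.setdefault(parent, set()).add(post["author"])
    scores = []
    for nbrs in neighbours.values():
        if len(nbrs) < 2:
            scores.append(0.0)
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in neighbours[a])
        scores.append(2 * links / (len(nbrs) * (len(nbrs) - 1)))
    return sum(scores) / len(scores) if scores else 0.0


posts = [
    {"id": 1, "author": "ann", "reply_to": None},
    {"id": 2, "author": "bob", "reply_to": 1},
    {"id": 3, "author": "cat", "reply_to": 1},
    {"id": 4, "author": "bob", "reply_to": 3},
    {"id": 5, "author": "dan", "reply_to": None},
    {"id": 6, "author": "eve", "reply_to": None},
]
current_users = {post["author"] for post in posts}
previous_users = {"ann", "bob", "zed", "yan"}
```

Computed per time segment, these values give the target variables that a model trained on role compositions would try to predict.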

Health categories

[Figure: ROC curves in four panels (Churn Rate, User Count, Seeds / Non-seeds Prop, Clustering Coefficient); x-axis: False Positive Rate, y-axis: True Positive Rate]

The fewer Focussed Experts in the community, the more posts will receive a reply! There is no “one size fits all” model!

Community Types

Community types
§  Do communities of different types behave differently?
§  Analysed IBM Connections communities to study participation, activity, and behaviour of users
§  Helps us to know what is normal and healthy in a community, and what is not!
§  Compare exhibited community behaviour with what users say they use the community for
§  Does macro behaviour match micro needs?

Community types

Diagram elements: Community; Wiki Page, Blog Post, Forum Thread; Wiki Edit, Blog Comment, Forum Reply; Bookmark, Tag, File

§  Data consists of non-private info on IBM Connections Intranet deployment
§  Communities:
   §  ID
   §  Creation date
   §  Members
   §  Used applications (blogs, Wikis, forums)
§  Forums:
   §  Discussion threads
   §  Comments
   §  Dates
   §  Authors and responders

Community types

§  Muller, M. (CHI 2012) identified five distinct community types in IBM Connections:
   §  Communities of Practice (CoP): for sharing information and networking
   §  Teams: shared goal for a particular project or client
   §  Technical Support: support for a specific technology
   §  Idea Labs Communities: for focused brainstorming
   §  Recreation Communities: recreational activities unrelated to work.

§  Our data consisted of the 186 most active communities:
   §  100 CoPs, 72 Teams, and 14 Tech communities
   §  No Idea Labs or Recreation communities

Behaviour in different community types

•  Members of Team communities are more engaged, popular, and initiate more discussions

•  Tech users are mostly active in a few communities, and don’t initiate or contribute much

•  CoP users disperse their activity across many communities, and contribute more

Macro

Mean and Standard Deviation (in brackets) of the distribution of micro features within the different community types

Need an ontology and inference engine of community types

Matthew Rowe, Miriam Fernandez, Harith Alani, Inbal Ronen, Conor Hayes and Marcel Karnstedt: Behaviour Analysis across different types of Enterprise Online Communities. ACM WebSci 2012

User needs and value

Community Value (response percentages, in chart order)
[Quality of content]: 41%, 47%, 8%, 3%, 1%
[Number of members]: 18%, 46%, 26%, 8%, 2%
[Diversity of expertise]: 31%, 53%, 13%, 2%, 1%
[Level of entertainment]: 2%, 15%, 30%, 30%, 23%

Community Member Value (response percentages, in chart order)
[Provides accurate answers to questions]: 44%, 50%, 4%, 2%
[Contributes good quality and well presented content]: 38%, 55%, 5%, 2%
[Provides quick answers to questions]: 21%, 60%, 14%, 5%
[Has good expertise in a domain]: 38%, 49%, 8%, 5%
[Contributes content frequently]: 11%, 58%, 25%, 6%
[Has many contacts (e.g. Facebook friends)]: 1%, 17%, 34%, 30%, 18%
[Has many fans (e.g. Twitter followers, positive replies to posts)]: 2%, 14%, 32%, 31%, 21%

Value of community features

Measurements of value and needs satisfaction
•  Assessing user engagement and needs satisfaction
•  Measuring value of individual users to their communities
•  Measuring value of communities to their members

Monitoring Online Communities

Maslow’s Hierarchy of Needs

Mapping Maslow’s hierarchy of needs to social media communities

Self-actualisation: Altruistic behaviour: helping others, replying to queries, giving ratings

Self-Esteem: Need to be rated and ranked higher in the community, promotion of roles from novice to active member to expert and moderator

Social Belongingness: Need to be part of the community, groups, need for interaction and engagement

Security: Need for privacy, security from identity theft, security from online abuse, trolling and bullying

Physical: Need for Hardware, Software, Information, Internet access.

User groups based on ‘needs’

High Helping Need
•  Reply a lot
•  Last 17% longer in system
•  Contribute to many forums
•  High and consistent engagement
•  (Self-actualisation)

High Information Need
•  Contribute 70% less
•  Don’t care about ‘points’ and ‘reputation’
•  Don’t stay for long
•  Engage with very few users
•  (Basic needs)

High Social Need
•  High level of social interaction
•  Moderate reputation scores
•  High contribution level
•  Low information needs
•  (Social belongingness)

Recognition Need
•  High ‘reputation’
•  Moderate contribution level
•  High engagement
•  (Self-esteem)

~90% of users are happily staying at the lower levels of the ‘needs hierarchy’

experts to-be

about to churn

on right path to leadership

Behaviour evolution patterns

§  Can we predict future behaviour role?
§  Who’s on the path to become a leader? an expert? a churner?
§  Which users do we want to encourage to stay or leave?

into becoming an expert - however this development only occurs 4 times

Fig. 6. Progression patterns where users progress from a novice to an expert role over time

Engagement Analysis

Tweet recipe for generating engagement

§  Identifying seed posts

Top features: Time in Day, Readability, Out-Degree, Polarity, Informativeness

Top features: Referral Count, Topic Likelihood, Informativeness, Readability, User Age

For both datasets:
•  Content features play a greater role than user features
•  The combination of all features provides the best results

•  Predicting discussion activity

Top features: Referral Count(-), Complexity(-)

Top features: URLs(-), Polarity(-), Topic Likelihood(+), Complexity (+)

For both, a decrease in URLs is associated with max activity. Language and terminology are more significant for Boards.ie.

Engagement in different communities
§  How the results differ:
   §  from one community type to another
   §  from random datasets to topic-based ones
   §  from related experiments in the literature
§  Experimented with 7 datasets, from:
   §  Boards.ie
   §  Twitter
   §  SAP
   §  Server Fault
   §  Facebook

Impact of features on engagement

[Figure: β coefficients of each feature, one panel per dataset: Boards.ie, Twitter Random, Twitter Haiti, Twitter Union, Server Fault, SAP, Facebook]

Features: In-degree, Out-degree, Post Count, Age, Post Rate, Post Length, Referrals Count, Polarity, Complexity, Readability, Readability Fog, Informativeness, EF-IPF, CF-IPF, Entity Entropy, Concept Entropy, Entity Degree Centrality, Concept Degree Centrality, Entity Network Entropy, Concept Network Entropy

Effects of individual social, content, and semantic features on the response variable (i.e. whether the post seeds engagement or not).

Semantic Sentiment Analysis

Semantic sentiment analysis on social media

§  Offers fast and cheap access to the public’s feelings towards brands, businesses, people, etc.

§  A range of features and statistical classifiers have been used in recent years

§  Semantics are often neglected

§ We add semantics as additional features into the training set for sentiment analysis

§ Measure the correlation of the representative concept with negative/positive sentiment

Sentiment Analysis

hate: negative
honest: positive
inefficient: negative
love: positive
…

Sentiment Lexicon

I hate the iPhone

I really love the iPhone

Lexical-Based Approach
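A minimal sketch of the lexicon-based approach, using only the four lexicon entries shown on the slide. The scoring rule (count positive words minus negative words) and the "neutral" fallback are assumptions made for the example; real lexicons are far larger and usually handle negation and intensity.

```python
# Lexicon entries taken from the slide; everything else is illustrative.
LEXICON = {
    "hate": "negative",
    "honest": "positive",
    "inefficient": "negative",
    "love": "positive",
}


def lexicon_sentiment(text):
    """Label a text by counting positive and negative lexicon words."""
    score = 0
    for token in text.lower().split():
        label = LEXICON.get(token.strip(".,!?"))
        if label == "positive":
            score += 1
        elif label == "negative":
            score -= 1
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


first = lexicon_sentiment("I hate the iPhone")
second = lexicon_sentiment("I really love the iPhone")
```

A text with no lexicon words falls back to "neutral", which is exactly the case where lexical clues run out and other features are needed.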

Learn Model

Apply Model

Naïve Bayes, SVM, MaxEnt , etc.

Training Set

Test Set

Model

Machine Learning Approach

Semantic Concept Extraction §  Extract semantic concepts from tweets data and incorporate them

into the supervised classifier training.
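One way to picture this augmentation is to append a concept token for every recognised entity before training. The sketch below is a toy illustration under stated assumptions: the `ENTITY_CONCEPTS` table stands in for the output of an entity extraction tool, the training tweets are invented, and the classifier is a small hand-written multinomial Naive Bayes with add-one smoothing rather than the setup used in the paper.

```python
import math
from collections import Counter, defaultdict

# Hypothetical entity-to-concept table, standing in for an extraction tool.
ENTITY_CONCEPTS = {
    "iphone": "Product",
    "ipad": "Product",
    "flu": "Illness",
    "migraine": "Illness",
}


def features(text):
    """Unigrams plus one concept feature per recognised entity."""
    tokens = text.lower().split()
    concepts = ["CONCEPT=" + ENTITY_CONCEPTS[t] for t in tokens if t in ENTITY_CONCEPTS]
    return tokens + concepts


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for feature in features(text):
                self.feature_counts[label][feature] += 1
                self.vocabulary.add(feature)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.feature_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(doc_count / total_docs)
            for feature in features(text):
                if feature in self.vocabulary:  # skip features never seen in training
                    score += math.log((counts[feature] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training tweets.
train_texts = ["love my iphone", "great iphone", "got the flu", "flu again"]
train_labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayes().fit(train_texts, train_labels)

# "ipad" never appears in training, but its concept feature does.
prediction = model.predict("new ipad")
```

The test tweet shares no words with the training data, so the decision rests entirely on the shared concept feature, which is the effect the augmentation is meant to provide.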

Fig. 1. Measuring correlation of semantic concepts with negative/positive sentiment. These semantic concepts are then incorporated in sentiment classification.

OpenCalais and Zemanta. Their experimental results showed that AlchemyAPI performs best for entity extraction and semantic concept mapping. Our datasets consist of informal tweets, and hence are intrinsically different from those used in [10]. Therefore we conducted our own evaluation, and randomly selected 500 tweets from the STS corpus and asked 3 evaluators to evaluate the semantic concept extraction outputs generated from AlchemyAPI, OpenCalais and Zemanta.

Extraction Tool  No. of Concepts Extracted  Entity-Concept Mapping Accuracy (%)
                                            Evaluator 1  Evaluator 2  Evaluator 3
AlchemyAPI       108                        73.97        73.8         72.8
Zemanta          70                         71           71.8         70.4
OpenCalais       65                         68           69.1         68.7

Table 2. Evaluation results of AlchemyAPI, Zemanta and OpenCalais.

The assessment of the outputs was based on (1) the correctness of the extracted entities; and (2) the correctness of the entity-concept mappings. The evaluation results presented in Table 2 show that AlchemyAPI extracted the largest number of concepts and it also has the highest entity-concept mapping accuracy compared to OpenCalais and Zemanta. As such, we chose AlchemyAPI to extract the semantic concepts from our three datasets. Table 3 lists the total number of entities extracted and the number of semantic concepts mapped against them for each dataset.

                 STS    HCR  OMD
No. of Entities  15139  723  1194
No. of Concepts  29     17   14

Table 3. Entity/concept extraction statistics of STS, OMD and HCR using AlchemyAPI.

Likely sentiment for a concept

§  Semantic concepts can help determine sentiment even when no good lexical clues are present

Impact of adding semantic features

§  Incorporating semantics increases accuracy by 6.5% for negative sentiment, and 4.8% for positive sentiment
   §  F = 75.95%, with 77.18% Precision and 75.33% Recall
   §  Using baselines of unigrams and part-of-speech features

§  More to-dos:
   §  Semantic Concepts Extraction: Explore more fine-grained approach for the entity extraction and the entity-concept mapping
   §  Selective Method: Interpolate semantic concepts based on their contribution to the classification performance

Saif, Hassan; He, Yulan and Alani, Harith (2012). Semantic sentiment analysis of twitter. In: The 11th International Semantic Web Conference (ISWC 2012), 11-15 November 2012, Boston, MA, USA

OK, and now what?!

OUSocials

§  Many FB groups exist for students of OU courses

§  Created and used by students to discuss and share opinions on courses and get support

Behaviour Analysis

Sentiment Analysis

Topic Analysis

Course tutors

Real time monitoring

•  How are opinion and sentiment towards a course evolving?

•  Who’s providing positive/negative support?

•  What topics are emerging? How do they change over time?

•  Do students get the answers and support they need?

Analytics over FB groups

§  Compare findings to course performance, and student performance

Reel Lives

Problem Summary

•  Fragmented digital selves don’t support social learning and individual empowerment

•  Need to enable:
   –  Digital empowerment
   –  Improved understanding and social cohesion
   –  Informed decision making (for individuals)
   –  Informed policy making (for organisations)
   –  Facilitating creative participation
   –  Co-curating of digital personhoods

Creating the ‘reels’

Changing energy consumption behaviour

A Decarbonisation Platform for Citizen Empowerment and Translating Collective Awareness into Behavioural Change

August 2012

Energy Monitors

www.efergy.com greenenergyoptions.co.uk

fastcompany.com tdevice.net

powerp.co.uk

www.energycircle.com

indiegogo.com

greentechadvocates.com

•  Do they change how we consume energy in our homes?

•  Are they enough? •  Why? How? What if? Where?

Social Eco Feedback Technology

Thanks to ..

Matthew Rowe (now at Uni Lancaster)
Sofia Angeletou (now at BBC)
Gregoire Burel
Miriam Fernandez
Smitashree Choudhury
Hassan Saif

Papers
http://oro.open.ac.uk/view/person/ha2294.html

§  Rowe, Matthew; Fernandez, Miriam; Angeletou, Sofia and Alani, Harith (2012). Community analysis through semantic rules and role composition derivation. Journal of Web Semantics, 18(1)

§  Rowe, Matthew; Fernandez, Miriam; Alani, Harith; Ronen, Inbal; Hayes, Conor and Karnstedt, Marcel (2012). Behaviour analysis across different types of Enterprise Online Communities. In: ACM Web Science Conference 2012 (WebSci12), 22-24 June 2012, Evanston, U.S.A.

§  Rowe, Matthew; Stankovic, Milan and Alani, Harith (2012). Who will follow whom? Exploiting semantics for link prediction in attention-information networks. In: 11th International Semantic Web Conference (ISWC 2012), 11-15 November 2012, Boston, USA

§  Rowe, Matthew and Alani, Harith (2012). What makes communities tick? Community health analysis using role compositions. In: 4th IEEE International Conference on Social Computing, 3-6 September 2012, Amsterdam, The Netherlands

§  Wagner, Claudia ; Rowe, Matthew; Strohmaier, Markus and Alani, Harith (2012). Ignorance isn't bliss: an empirical analysis of attention patterns in online communities. In: 4th IEEE International Conference on Social Computing, 3-6 September 2012, Amsterdam, The Netherlands

§  Saif, Hassan; He, Yulan and Alani, Harith (2012). Semantic sentiment analysis of twitter. In: The 11th International Semantic Web Conference (ISWC 2012), 11-15 November 2012, Boston, MA, USA.

§  Rowe, Matthew; Angeletou, Sofia and Alani, Harith (2011). Predicting discussions on the social semantic web. In: 8th Extended Semantic Web Conference (ESWC 2011), 29 May - 2 June 2011, Heraklion, Greece.

§  Rowe, Matthew; Angeletou, Sofia and Alani, Harith (2011). Anticipating discussion activity on community forums. In: Third IEEE International Conference on Social Computing (SocialCom2011) , 9-11 October 2011, Boston, MA, USA.

§  Angeletou, Sofia; Rowe, Matthew and Alani, Harith (2011). Modelling and analysis of user behaviour in online communities. In: 10th International Semantic Web Conference (ISWC 2011), 23 - 27 Oct 2011, Bonn, Germany.

§  Karnstedt, Marcel ; Rowe, Matthew; Chan, Jeff ; Alani, Harith and Hayes, Conor (2011). The Effect of User Features on Churn in Social Networks. In: ACM Web Science Conference 2011 (WebSci2011), 14 - 17 June 2011, Koblenz, Germany.