42
Trust, Influence and Bias in Social Media Anupam Joshi Joint work with Tim Finin and several students Ebiquity Group, UMBC [email protected] http://ebiquity.umbc.edu/

Trust, Influence and Bias in Social Media

  • Upload
    birch

  • View
    45

  • Download
    1

Embed Size (px)

DESCRIPTION

Trust, Influence and Bias in Social Media. Anupam Joshi Joint work with Tim Finin and several students Ebiquity Group, UMBC [email protected] http://ebiquity.umbc.edu/. Knowing & Influencing your Audience. Your goal is to campaign for a presidential candidate - PowerPoint PPT Presentation

Citation preview

Page 1: Trust, Influence and Bias in Social Media

Trust, Influence andBias in Social Media

Anupam JoshiJoint work with Tim Finin and several

studentsEbiquity Group, UMBC

[email protected]://ebiquity.umbc.edu/

Page 2: Trust, Influence and Bias in Social Media

Knowing & Influencing your Audience

•Your goal is to campaign for a presidentialcandidate

•How can you track the buzz about him/her?•What are the relevant communities and

bogs?•Which communities are supporters, which are

skeptical, which are put off by the hype?•Is your campaign having an effect? The desired

effect?•Which bloggers are influential with political

audience? Of these, which are already onboard and which are lost causes?

•To whom should you send details or talk to?

Page 3: Trust, Influence and Bias in Social Media

Knowing & Influencing your Market

•Your goal is to market Zune•How can you track the buzz about

it?•What are the relevant communities

and blogs?•Which communities are fans, which

are suspicious, which are put offby the hype?

•Is your advertising having an effect?The desired effect?

•Which bloggers are influential in this market? Of these, which are already onboard and which are lost causes?

•To whom should you send details or evaluation samples?

Page 4: Trust, Influence and Bias in Social Media

What is Influence?“the act or power of producing an effect without apparent exertion of force or direct exercise of command’’

Measurable InfluenceThe ability of a blogger to persuade another blogger to• Take action by means of creating a new post about the

topic and commenting on the original (text and graph mining) .

• Quote the blogger’s views in her post (text mining) .• Link to the original post via trackbacks, comments (graph

mining) .• Link to the blogger through other means like del.icio.us,

digg, citeULike, Connotea, etc. (graph mining) • Subscribe to the blog feed (graph mining) .

Page 5: Trust, Influence and Bias in Social Media

A community in real world is represented in a graph as a set of nodes that have more links within the set than outside it.

Graph• Citation Network• Affiliation Network• Sentiment Information• Shared Resource (tags, videos..)

Political Blogs

Twitter Network

Facebook Network

What is a Community

Page 6: Trust, Influence and Bias in Social Media

Finding Communities (and Feeds) That Matter

Before Merge

After Merge

http://ftm.umbc.edu

Analysis of Bloglines Feeds 83K publicly listed subscribers 2.8M feeds, 500K are unique 26K users (35%) use folders to organize subscriptions Data collected in May 2006

Top Advertising Feeds1. Adrants » Marketing and Advertising News With Attitude 2. Adverblog: advertising and new media marketing 3. http://ad-rag.com 4. adfreak 5. AdJab 6. MIT Advertising Lab: future of advertising and advertising technology7. AdPulp: Daily Juice from the Ad Biz 8. Advertising/Design Goodness

Related Tags: advertising  marketing  media  news  design 

Page 7: Trust, Influence and Bias in Social Media

Feeds That MatterTop Feeds for “Politics”Merged folders: “political”, “political

blogs”• Talking Points Memo: by Joshua Micah M

arshall• Daily Kos: State of the Nation• Eschaton• The Washington Monthly• Wonkette, Politics for People with Dirty M

inds• http://instapundit.com/• Informed Comment • Power Line• AMERICAblog: Because a great nation de

serves the truth• Crooks and Liars

Top Feeds for “Knitting”Merged folders “knitting blogs”• Yarn Harlotknitting

• Wendy Knits!

• See Eunny Knit!

• the blue blog

• Grumperina goes to local yarn shops and Home Depot

• You Knit What??

• Mason-Dixon Knitting

• knit and tonic

• Crazy Aunt Purl

• http://www.lollygirl.com/blog/

Page 8: Trust, Influence and Bias in Social Media

Long Tail• 80/20 Rule or Pareto distribution• Few blogs get most attention/links• Most are sparsely connected

Motivation• Web graphs are large, but sparse • Expensive to compute community

structure over the entire graph

Goal• Approximate the membership of the

nodes using only a small portion of the entire graph.

Special Properties of Social Datasets

Page 9: Trust, Influence and Bias in Social Media

Special Properties of Social Datasets

Intuition • Communities defined by the core (A)• Membership of rest (B) approxi-

mated by how they link to the core

Direct Method • NCut (Baseline)

Approximation• Singular value decomposition (SVD)• sampling• Heuristic

Page 10: Trust, Influence and Bias in Social Media

• SVD (low rank) • Sampling based Approach

• Communities can be extracted by sampling only columns from the head (Drineas et al.)

• Heuristic Cluster head to find initial communities. Assign cluster that the tail nodes most frequently link to.

Approximating CommunitiesNodes ordered by degree

~Ur

rrV

r

T

AB T

BC

A BB T B TA 1B

U

B TU 1

U TA 1U TB

AUUT

r

ICWSM ‘08

Page 11: Trust, Influence and Bias in Social Media

Approximating CommunitiesDataset: A blog dataset of 6000 blogs.

ICWSM ‘08

Original Adjacency Heuristic Approximation

Modularity = 0.51

Page 12: Trust, Influence and Bias in Social Media

Approximating Communities

Low ModularityMore Time

Similar ModularityLower Time

• Advantages: faster detection using small portion of graph, less memory

• Complexity: SVD O(n3), Ncut O(nk), Sampling O(r3), Heuristic O(rk) where n = # blogs, k = # clusters, r = # columns

ICWSM ‘08

Page 13: Trust, Influence and Bias in Social Media

Approximating CommunitiesICWSM ‘08

Additional evaluations using Variation of Information score

Page 14: Trust, Influence and Bias in Social Media

Tags are free meta-data!

Other semantic features:• Sentiments• Named Entities• Readership information• Geolocation information• etc.

How to combine this for detecting communities?

Page 15: Trust, Influence and Bias in Social Media

Social Media GraphsLinks Between Nodes Links Between Nodes and Tags

Simultaneous Cuts

Page 16: Trust, Influence and Bias in Social Media

A community in the real world is identified in a graph as a set of nodes that have more links within the set than outside it and share similar tags.

Communities in Social Media

Page 17: Trust, Influence and Bias in Social Media

1 1 1 0 01 1 1 0 0

1 0 1 1 01 0 0 1 1

1 0 0 1 11 1 0 0 0 1 1 1 01 1 1 0 0 1 1 0 00 0 1 1 1 0 0 1 10 0 0 1 1 0 0 1 1

Nodes

Nodes

Nod

esTa

gs

Tags

Nod

es

Tags

Tags

11 1 1 111 1 1

Fiedler Vector Polarity

W ' I CC T W

β= 0 Entirely ignore link information

β= 1 Equal importance to blog-blog and blog-tag,

β>> 1 NCut

WebKDD ‘08SimCUT: Clustering Tags and Graphs

Page 18: Trust, Influence and Bias in Social Media

SimCUT: Clustering Tags and Graphs

β= 0 Entirely ignore link information

β= 1 Equal importance to blog-blog and blog-tag,

β>> 1 NCut

Clustering Only Links

Clustering Links + Tags

W ' I CC T W

WebKDD ‘08

Page 19: Trust, Influence and Bias in Social Media

Datasets

• Citeseer (Getoor et al.)• Agents, AI, DB, HCI, IR, ML• Words used in place of tags

• Blog data • derived from the WWE/Buzzmetrics dataset• Tags associated with Blogs derived from del.icio.us• For dimensionality reduction 100 topics derived from blog homepages using LDA (Latent Dirichilet Allocation)

• Pairwise similarity computed • RBF Kernel for Citeseer• Cosine for blogs

Page 20: Trust, Influence and Bias in Social Media

Clustering Tags and GraphsClustering Only Links

Clustering Links + Tags

Page 21: Trust, Influence and Bias in Social Media

Varying Scaling Parameter β

Accuracy = 36% Accuracy = 62%

Higher accuracy by adding ‘tag’ informationSimple Kmeans ~23% Content only, binaryContent only ~52% (Getoor et al. 2004)

β >> 1 β=1β=0

Accuracy = 39%

Only Graph Only Tags Graphs & Tags

Page 22: Trust, Influence and Bias in Social Media

Mutual Information • Measures the dependence between two random variables.• Compares results with ground truth

Effect of Number of Tags, Clusters

Citeseer

Link only has lower MIMore

Semantics helps

Similar results for real, blog datasets

Page 23: Trust, Influence and Bias in Social Media

http://instapundit.com

http://michellemalkin.com/

http://dailykos.com

http://crooksandliars.com

http://volokh.com

http://rightwingnews.com

Influence in Communities

Communities detected using “Fast algorithm for detecting community structure in networks”, M.E. J. Newman

Page 24: Trust, Influence and Bias in Social Media

Authority and PopularityAuthority

• contributes to influence• Influence may be

subjective.• A source, authoritative

in one community could influence another community negatively.

Within a community, an authoritative source is influential.

Popularity• Authority and

popularity often treated equally

• On blog search engines, authority is measured using inlinks, which is at best popularity

• Popularity doesn’t mean influenceDilbert is extremely popular but not influential

Page 25: Trust, Influence and Bias in Social Media

Link Polarity& Sentiment

Page 26: Trust, Influence and Bias in Social Media

Link Polarity and Bias•Linking alone is not indicator of influence•Polarity (+/- sentiment) indicates type of

influence•Consistent negative/positive opinion indicates

bias•Link polarity/citation signal helps determine

trust

Democrat Blog

Republican Blog

Strong Negative

Opinion

Mildly Negative

opinion

Strongly Positive

opinion

Page 27: Trust, Influence and Bias in Social Media

Propagating InfluenceBased on work of Guha et al[1] for modeling propagation of trust and distrust. Framework:•Mij represents influence/bias from user i to j.(0 <= Mij <= 1)•Mij is initialized to the polarity from i to j.•Belief Matrix M (sparse) represents initial set of known

beliefs•Goal is to compute all unknown values in M

•Belief Matrix after ith atomic propagation•Mi+1 = Mi * Ci

•Combined Operator •Ci = a1 * M + a2 * MT*M + a3 * MT + a4 * M*MT

•a {0.4, 0.4, 0.1, 0.1} represents weighing factor[1] Guha R, Kumar R, Raghavan P, Tomkins A. Propagation of trust and distrust. In: Proceedings of the Thirteenth

International World Wide Web Conference, New York, NY, USA, May 2004. ACM Press, 2004.

Page 28: Trust, Influence and Bias in Social Media

Recognizing subjectivity & sentiment

•We’ve developed ΔTFIDF as a simple feature-engineering technique to increase the accuracy of subjectivity detection and sentiment analysis

•Our preliminary analysis shows that ΔTFIDF• Works well in different subject domains• Improves accuracy for documents of varying

sizes: sentence fragments, sentences, paragraphs and multi-paragraph documents

• Helps on text classification tasks other than sentiment analysis

Page 29: Trust, Influence and Bias in Social Media

Feature Engineering for Text Classification

•Typical features: words and/or phrases along with term frequency or (better) TF-IDF scores

•ΔTFIDF amplifies the training set signals by using the ratio of the IDF for the negative and positive collections

•Results in a significant boost in accuracy

Text: The quick brown fox jumped over the lazy white dog.Features: the 2, quick 1, brown 1, fox 1, jumped 1, over 1, lazy 1, white 1, dog 1, the quick 1, quick brown 1, brown fox 1, fox jumped 1, jumped over 1, over the 1, lazy white 1, white dog 1

Page 30: Trust, Influence and Bias in Social Media

ΔTFIDF BoW Feature Set• Value of feature t in document d is • Where

• Ct,d = count of term t in document d• Nt = number of negative labeled training docs with term t• Pt = number of positive labeled training docs with term t

• Normalize to avoid bias towards longer documents• Gives greater weight to rare (significant) words• Downplays very common words• Similar to Unigram + Bigram BoW in other aspects

Ct,d ∗log2N tPt

⎛ ⎝ ⎜

⎞ ⎠ ⎟

Page 31: Trust, Influence and Bias in Social Media

Example: ΔTFIDF vs TFIDF vs TFΔtfidf tfidf tf, city angels ,cage is angels is themediocrity , city .criticized of angels toexhilarating maggie , ofwell worth city of aout well maggie andshould know angel who isreally enjoyed movie goers thatmaggie , cage is itit's nice seth , whois beautifully goers inwonderfully angels , moreof angels us with youUnderneath the city but

15 features with highest values for a review of City of Angels

Page 32: Trust, Influence and Bias in Social Media

Improvement over TFIDF (Uni- + Bi-grams)

•Movie Reviews: 88.1% Accuracy vs. 84.65% at 95% Confidence Interval

•Subjectivity Detection (Opinionated or not): 91.26% vs. 89.4% at 99.9% Confidence Interval

•Congressional Support for Bill (Voted for/ Against): 72.47% vs. 66.84% at 99.9% Confidence Interval

•Enron Email Spam Detection: (Spam or not): 98.917% vs. 96.6168 at 99.995% Confidence Interval

•All tests used 10 fold cross validation•At least as good as mincuts + subjectivity detectors on movie reviews (87.2%)

Page 33: Trust, Influence and Bias in Social Media

Link Polarity ExperimentsDomain•Political Blogosphere•Dataset from Buzzmetrics[2] provides post-post link structure over

14 million posts•Few off-the-topic posts help aggregation•Potential business valueReference Dataset •Hand-labeled dataset from Lada Adamic et al[3] classifying political

blogs into right and left leaning bloggers•Timeframe : 2004 presidential elections, over 1500 blogs analyzed•Overlap of 300 blogs between Buzzmetrics and reference dataset Goal•Classify the blogs in Buzzmetrics dataset as democrat and

republican and compare with reference dataset [2] Lada A. Adamic and Natalie Glance, "The political blogosphere and the 2004 US Election", in Proceedings of the WWW-2005

WorkshopBuzzmetrics – www.buzzmetrics.com

Page 34: Trust, Influence and Bias in Social Media

Evaluation of Link Polarity

Confusion Matrix

•Accuracy = 73%•True positive (Recall) =

78%•False positive (FP) =

31%•True negative (Recall)

= 69%•False negative (FN) =

21%•Precision (R) = 75%•Precision (D) = 72%

Polarity Improves Classification by almost 26%

Page 35: Trust, Influence and Bias in Social Media

Trust Propagation Sample Data•Compensates for initial incorrect polarity (DK–AT)

•Doesn’t change correct polarity (AT-DK)

•Assigns correct polarity for non-existent direct links (AT-IP)

•Numbers in italics are problematic (MM-AT)Improve sentiment detection ?

Page 36: Trust, Influence and Bias in Social Media

MSM Classification Results

Page 37: Trust, Influence and Bias in Social Media

Interesting Observations•24 of 27 sources correct-ly classifiedguardian, foxnews, human-eventsonline, mediamatters

•Outliers: “The Nation” & “Boston Globe”

•Left and right leaning blogs talk negatively about “ny times” & “abc news” and positively about “raw story” and “examiner”

Page 38: Trust, Influence and Bias in Social Media

Identifying Bias using KL Divergence

Page 39: Trust, Influence and Bias in Social Media

Conclusion

Page 40: Trust, Influence and Bias in Social Media

Conclusion• Using topic, social structure and opinions we

can develop a model for influence, bias and trust in social media

• We apply this framework on real-world data and describe techniques for identifying influence

• Splogs are a big issue – we have developed efficient techniques to detect them in near real time

• Does the Game Theoretic Nature of this system raise fundamental new challenges for Data Mining

Page 41: Trust, Influence and Bias in Social Media

Assets: Good, Bad and Wanted•How the assets (data, APIs) were helpful?•Where these assets failed to be helpful and why?Since we go “beyond search”, search data not that useful

•Which research questions you would like to address if you had unlimited access to assets?•Unlimited livespaces link and content data to validate some of our approaches.•Use to place ads on social media sites

Page 42: Trust, Influence and Bias in Social Media

http://ebiquity.umbc.edu