75
12/2/2007 1 Data Mining for Social Network Analysis Jaideep Srivastava University of Minnesota [email protected] Joint work with: Arindam Banerjee, Nishith Pathak, Sandeep Mane, Muhammad A. Ahmad David Kuo-Wei Hsu, Young Ae Kim, University of Minnesota Noshir S. Contractor, Northwestern University Dmitri Williams, University of Southern California Sony Online Entertainment Special thanks to Enron (via US DoJ) Sponsors: US National Science Foundation US Army Research Institute Digital Technology Center, University of Minnesota Australasian Data Mining Conference (AusDM) 2007 December 3 rd –4 th , 2007 Gold Coast, Australia

Data Mining for Social Network Analysis - University of Haifagsb.haifa.ac.il/~draban/SocialDynamics/presentation.pdf12/2/2007 1 Data Mining for Social Network Analysis Jaideep Srivastava

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

  • 12/2/2007 1

    Data Mining for Social Network Analysis

    Jaideep Srivastava

    University of Minnesota

    [email protected] work with:Arindam Banerjee, Nishith Pathak, Sandeep Mane, Muhammad A. Ahmad David Kuo-Wei Hsu, Young Ae Kim, University of MinnesotaNoshir S. Contractor, Northwestern UniversityDmitri Williams, University of Southern California

    Sony Online EntertainmentSpecial thanks to Enron (via US DoJ)

    Sponsors:US National Science FoundationUS Army Research InstituteDigital Technology Center, University of Minnesota

    Australasian Data Mining Conference (AusDM) 2007

    December 3rd – 4th, 2007Gold Coast, Australia

  • 12/2/2007 Jaideep Srivastava 2

    � Introduction to Social Network Analysis (SNA)� Computer Science and SNA

    � A Detailed Case Study� Socio-cognitive analysis from e-mail logs

    � Modeling socio-cognitive networks

    � Analysis of a socio-cognitive network

    � Experiments with the Enron dataset

    � Extracting concealed relationships� An IR-inspired approach

    � Other applications� SNA from MMORPG logs� Trust in social networks� Expert finding in social networks� Social networks in health care management

    � Some Emerging Applications

    � References

    Outline

  • Introduction to Social Network AnalysisIntroduction to Social Network Analysis

  • 12/2/2007 Jaideep Srivastava 4

    Social Networks

    � A social network is a social structure of people, related (directly or indirectly) to each other through a common

    relation or interest

    � Social network analysis (SNA) is the study of social networks to understand their structure and behavior

    (Source: Freeman, 2000)

  • 12/2/2007 Jaideep Srivastava 5

    SNA in Popular Science Press

    Social Networks have captured the public imagination in recent years as evident in the number of popular science treatment of

    the subject

  • 12/2/2007 Jaideep Srivastava 6

    � Types of Networks (Contractor, 2006)�Social Networks

    � “who knows who”

    �Socio-Cognitive Networks� “who thinks who knows who”

    �Knowledge Networks� “who knows what”

    �Cognitive Knowledge Networks� “who thinks who knows what”

    Networks in Social Sciences

  • 12/2/2007 Jaideep Srivastava 7

    Types of Social Network Analysis

    � Sociocentric (whole) network analysis� Emerged in sociology

    � Involves quantification of interaction among a socially well-defined group of people

    � Focus on identifying global structural patterns

    � Most SNA research in organizations concentrates on sociometricapproach

    � Egocentric (personal) network analysis� Emerged in anthropology and psychology

    � Involves quantification of interactions between an individual (called ego) and all other persons (called alters) related (directly or indirectly) to ego

    � Make generalizations of features found in personal networks

    � Difficult to collect data, so till now studies have been rare

  • 12/2/2007 Jaideep Srivastava 8

    Networks Research in Social Sciences

    � Social science networks have widespread application in various fields

    � Most of the analyses techniques have come from Sociology, Statistics and Mathematics

    � See (Wasserman and Faust, 1994) for a comprehensive introductionto social network analysis

    Epidemiology

    Sociology

    OrganizationalTheory

    SocialPsychologyAnthropology

    Socio-Cognitive

    Networks

    Cognitive

    Knowledge

    Networks

    SocialNetworks

    KnowledgeNetworks

    Perception

    Reality

    Acquaintance(links)

    Knowledge(content)Epidemiology

    Sociology

    OrganizationalTheory

    SocialPsychologyAnthropology

    Socio-Cognitive

    Networks

    Cognitive

    Knowledge

    Networks

    SocialNetworks

    KnowledgeNetworks

    Perception

    Reality

    Acquaintance(links)

    Knowledge(content)

    Socio-Cognitive

    Networks

    Cognitive

    Knowledge

    Networks

    SocialNetworks

    KnowledgeNetworks

    Socio-Cognitive

    Networks

    Cognitive

    Knowledge

    Networks

    SocialNetworks

    KnowledgeNetworks

    Perception

    Reality

    Acquaintance(links)

    Knowledge(content)

  • Computer Science and Social Network Analysis

  • 12/2/2007 Jaideep Srivastava 10

    Computer networks as social networks

    � “Computer networks are inherently social networks, linking people, organizations, and knowledge” (Wellman,

    2001)

    � Data sources include newsgroups like USENET; instant messenger logs like AIM; e-mail messages; social

    networks like Orkut and Yahoo groups; weblogs like Blogger; and online gaming communities

    USENET

  • 12/2/2007 Jaideep Srivastava 11

    Key Drivers for CS Research in SNA

    � Computer Science has created the über-cyber-infrastructure for

    �Social Interaction

    �Knowledge Exchange

    �Knowledge Discovery

    � Ability to capture � different about various types of social interactions

    � at a very fine granularity

    � with practically no reporting bias

    � Data mining techniques can be used for building descriptive and predictive models of social interactions

    ���� Fertile research area for data mining research

  • 12/2/2007 Jaideep Srivastava 12

    A shift in approach:from ‘synthesis’ to ‘analysis’

    Problems• High cost of manual surveys

    • Survey bias- Perceptions of individuals may be incorrect

    • Logistics- Organizations are now spread across several countries.

    Sdfdsfsdf

    FvsdfsdfsdfdfsdSdfdsfsd

    fSdfsdfs

    `

    Sdfdsfsdf

    FvsdfsdfsdfdfsdSdfdsfsd

    fSdfsdfs

    `

    SdfdsfsdfFvsdfsdfsdfdfsd

    SdfdsfsdfSdfsdfs

    `

    Electroniccommunication

    - Email- Web logs

    Analysis

    Socialnetwork

    Cognitivenetwork

    SocialNetwork

    Employee Surveys

    Cognitive network for B

    Cognitive network for C

    Cognitive network for A

    A

    B

    C

    Synthesis

    Shift in approach

  • Data Mining for SNA Case Study

    Socio-Cognitive Analysis from E-mail Logs

  • 12/2/2007 14

    Modeling a Socio-Cognitive Network

  • 12/2/2007 Jaideep Srivastava 15

    Example of E-mail Communication

    � A sends an e-mail to B

    � With Cc to C

    � And Bcc to D

    � C forwards this e-mail to E

    � From analyzing the header, we can infer

    � A and D know that A, B, C and D know about this e-mail

    � B and C know that A, B and C know about this e-mail

    � C also knows that E knows about this e-mail

    � D also knows that B and C do not know that it knows about this e-mail; and that A knows this fact

    � E knows that A, B and C exchanged this e-mail; and that neither A nor B know that it knows about it

    � and so on and so forth …

  • 12/2/2007 Jaideep Srivastava 16

    Modeling Pair-wise Communication

    � Modeling pair-wise communication between actors

    �Consider the pair of actors (Ax,Ay)

    �Communication from Ax to Ay is modeled using the Bernoulli distribution L(x,y)=[p,1-p]

    �Where,

    �p = (# of emails from Ax with Ay as recipient)/(total # of emails exchanged in the network)

    � For N actors there are N(N-1) such pairs and therefore N(N-1) Bernoulli distributions

    � Every email is a Bernoulli trial where success for L(x,y) is realized if Ax is the sender and Ay is a recipient

  • 12/2/2007 Jaideep Srivastava 17

    Modeling an agent’s belief about global communication

    � Based on its observations, each actor entertains

    certain beliefs about the communication strength

    between all actors in the network

    � A belief about the communication expressed by

    L(x,y) is modeled as the Beta distribution, J(x,y),

    over the parameter of L(x,y)

    � Thus, belief is a probability distribution over all

    possible communication strengths for a given

    ordered pair of actors (Ax,Ay)

  • 12/2/2007 Jaideep Srivastava 18

    Model for Belief Update

    � Jk(x,y) is the Beta distribution maintained by actor Akregarding its belief about the communication from Ax to Ay

    � a and b, the two parameters of Jk(x,y), are associated with the number of emails observed by Ak which are � from Ax to Ay , i.e. number of successes, and

    � from Ax not to Ay, i.e. number of failures

    � Initialization� a and b start out with default initial values

    �Many different possibilities� For example, values can be chosen to be small so that they do not

    have much of an impact and can be “washed out” by future observations

    � Belief update� on observing a success or failure, Ak increments a or b

    respectively

  • 12/2/2007 Jaideep Srivastava 19

    Belief State of an Actor

    � Every actor maintains Beta distributions (or beliefs) for all ordered pairs of actors in the network

    � Actor Ak’s belief state is defined to be the set of all N(N-1)Beta distributions (one for every Bernoulli distribution)

    � We also introduce a “super-actor” in the network

    �The super-actor is an actor who observes all the communication in the network

    �Super-actor is used as the baseline for reality

    �E-mail server is the “super-actor”

  • 12/2/2007 20

    Quantitative Measures for Perceptual Closeness

  • 12/2/2007 Jaideep Srivastava 21

    Types of Perceptual Closeness

    � We analyze the following aspects

    �Closeness between an actor’s belief and reality, i.e.

    “true knowledge” of an actor

    �Closeness between the beliefs of two actors, i.e. the

    “agreement” between two actors

    � We define two metrics, r-closeness and a-

    closeness for measuring the closeness to reality

    and closeness in the belief states of two actors

    respectively

  • 12/2/2007 Jaideep Srivastava 22

    Measuring the Closeness Between Beliefs

    � For measuring the closeness between two belief states, the KL-divergence across the expected Bernoulli distributions

    for the two respective beliefs is computed.

    � The expected Bernoulli distribution for a belief is the expectation of the Beta distribution corresponding to that belief

    � If J(a,b)k,t is the Beta distribution, then the corresponding expected Bernoulli distribution (denoted by E[J(a,b)k,t]) is obtained by normalizing the parameters of Beta distribution J(a,b)k,t

  • 12/2/2007 Jaideep Srivastava 23

    Belief Divergence Measures

    � The divergence of one belief, expressed by the Beta distribution J(a,b)x,t, from another, expressed by J(a,b)y,tat a given time t, is defined as,

    where, and

    � The divergence of a belief state By,t from the belief state

    Bx,t for two actors Ay and Ax respectively, at a given time t, is defined as,

  • 12/2/2007 Jaideep Srivastava 24

    Belief Divergence Measures (contd.)

    � The a-closeness measure is defined as the level of agreement between two given actors Ax and Ay with belief states Bx,t and By,t respectively, at a given time t and is given by,

    � The r-closeness measure is defined as the closeness of the given actorAk’s belief state Bk,t to reality at a given time t and it is given by,

    Where BS,t is the belief state of the super-actor AS at time t

  • 12/2/2007 Jaideep Srivastava 25

    Interpretation of the Metrics

    � The r-closeness measure� An actor who has accurate beliefs regarding only few communications is

    closer to reality than some other actor who has a relatively large number of less accurate beliefs

    � Thus, accuracy of knowledge is important

    � The a-closeness measure between actor pairs� Consider three actors Ax, Ay and Az� Suppose we want to determine how divergent are Ay’s and Az’s belief states

    from that of Ax’s� If Ay and Ax have few beliefs in common, but low divergence for each of

    these few common beliefs, then their belief states may be closer than those of Az and Ax, who have a relatively larger number of common beliefs with greater divergence across them

    � a-closeness measure can be used to construct an “agreement graph” (or a who agrees with whom graph)� Actors are represented as nodes and an edge exists between two actors

    only if the agreement or the a-closeness between them exceeds some threshold t

  • 12/2/2007 26

    r-closeness and a-closeness experiments with Enron

    E-mail logs

  • 12/2/2007 Jaideep Srivastava 27

    Enron Email Logs

    �Publicly available: http://www.cs.cmu.edu/~enron/�Cleaned version of data

    � 151 users, mostly senior management of Enron� Approximately 200,399 email messages� Almost all users use folders to organize their emails� The upper bound for number of folders for a user was

    approximately the log of the number of messages for that user

    � A visualization of Enron data (Heer, 2005)

    �For experiments emails

    exchanged between users for the

    months of October 2000 and

    October 2001 were used

  • 12/2/2007 Jaideep Srivastava 28

    Testing ‘conventional wisdom’ using r-closeness

    � Conventional wisdom 1: As an actor moves higher up the organizational hierarchy, it has a better perception of the social network� It was observed that majority of the top positions were

    occupied by employees

    � Conventional wisdom 2: The more communication an actor observes, the better will be its perception of reality�Even though some actors observed a lot of communication,

    they were still ranked low in terms of r-closeness.

    �These actors focus on a certain subset of all communications and so their perceptions regarding the social network were skewed towards these “favored” communications

    �Executive management actors who were communicatively active exhibited this “skewed perception” behavior� which explains why they were not ranked higher in the r-

    closeness measure rankings as expected in 1

  • 12/2/2007 Jaideep Srivastava 29

    Experiment with r-closeness – Oct 2000� For October 2000, based on their r-closeness rankings actors can

    be roughly divided into three categories

    � Top ranks: Actors who are communicatively active and observe a lot of diverse communications

    � Mid ranks: Actors who also observe a lot of communication but had skewed perceptions

    � Low ranks: Actors who are communicatively inactive and hardly observe any of the communication

    Ranks

    Not

    Avail-

    able

    Emplo-

    yees

    Higher

    Manage-

    ment

    Executive

    Manage-

    ment

    Others

    1-10 2.6%

    (1)

    14.6%

    (6)0% (0)

    6.9%

    (2)

    6.67%

    (1)

    11–50 28.9%

    (11)

    34.1%

    (14)21.4% (6)

    24.1%

    (7)

    13.33%

    (2)

    51-151 68.5%

    (26)

    51.3%

    (21)

    78.6%

    (22)

    69%

    (20)

    80%

    (12)

    Total 100% (38)

    100%

    (41)100% (28)

    100%

    (29)

    100%

    (15)

  • 12/2/2007 Jaideep Srivastava 30

    Experiment with r-closeness – Oct 2001

    � r-closeness rankings for the crisis month Oct, 2001 show a

    significant increase (31% to 65.5%) in the percentage of senior executive management level actors in the top 50

    ranks, with employees moving down

    Ranks

    Not

    Avail-

    Able

    Emplo-

    yees

    Higher

    Manage-

    ment

    Executive

    Manage-

    ment

    Others

    1-10 7.9 % (3)

    9.75% (4)

    0% (0) 10.3%

    (3) 0% (0)

    11–50 21.1 %

    (8)

    17.1%

    (7) 25% (7)

    55.2%

    (16)

    13.33%

    (2)

    51-151 71%

    (27)

    73.15%

    (30) 75% (21)

    34.5%

    (10)

    86.67%

    (13)

    Total 100% (38)

    100%

    (41) 100% (28)

    100%

    (29)

    100%

    (15)

  • 12/2/2007 Jaideep Srivastava 31

    Agreement Graph for Oct 2000 (threshold = 0.95) Agreement Graph for Oct 2001 (threshold = 0.95)

    Mean a-closeness against time

    0.00

    0.01

    0.02

    0.03

    0.04

    0.05

    0.06

    0.07

    0.08

    0.09

    6_19

    998_

    1999

    10_1

    999

    12_1

    999

    2_20

    004_

    2000

    6_20

    008_

    2000

    10_2

    000

    12_2

    000

    2_20

    014_

    2001

    6_20

    018_

    2001

    10_2

    001

    12_2

    001

    2_20

    024_

    2002

    6_20

    02

    Time

    Mean a

    -clo

    seness a

    cro

    ss

    acto

    rs

    Mean a-closenessagainst time

    Experiment with a-closeness

  • 12/2/2007 Jaideep Srivastava 32

    a-closeness between Lavorato, Denailey, Lay and Skilling

    Lavorato

    Skilling Lay

    Denailey

    Lavorato

    Skilling Lay

    Denailey

    Lavorato

    Skilling Lay

    Denailey

    Initial State up to Oct,1999

    0.01

    0.01

    0.010.01

    0.011.00

    Lavorato

    Skilling Lay

    Denailey0.077

    0.0070.019

    0.0050.007

    0.016

    Lavorato

    Skilling Lay

    Denailey0.06

    0.0160.012

    0.030.011

    0.016

    0.047

    0.0170.019

    0.0270.012

    0.017

    0.049

    0.0150.019

    0.0270.012

    0.017

    Nov,1999-Dec,2000

    Dec, 2000

    Jan, 2001-

    Aug, 2001

    Aug, 2001

    Sept,2001-Oct,2001

    Oct, 2001

    Nov,2001-

    Dec,2001Dec, 2001

  • 12/2/2007 33

    Automatic Extraction of Concealed Relations

  • 12/2/2007 Jaideep Srivastava 34

    Concealed Relations

    � Concealed/Covert Relations: Relations between

    groups of actors that

    � have high strength

    � but are known to very few actors in the network outside the group

    � Problem: Given email log data for all actors, extract

    the concealed relations from this data

  • 12/2/2007 Jaideep Srivastava 35

    An IR-Motivated Approach

    � Use an approach motivated by informational retrieval

    � Use a tf-idf style scheme for relations� an actor’s view of the social network � document in a

    corpus

    � an (unordered) pair-wise actor relation � a term in a document

    � number of instances of a relation observed by an actor � term frequency (tf) in a document

    � number of actors that know about a relation � document frequency (df)

    � actual frequency of a relation is used to determine a ‘global’ ranking of concealed relations

  • 12/2/2007 Jaideep Srivastava 36

    tf-idf in a Social Network

    � If N is the total number of actors, then for relation rkl between two actors Ak and Al, we define its tf-idf score to be

    � The score skl induces a ranking among relations based on decreasing levels of concealment

    � If nkl is replaced by mklj (# of instances of rkl observed by Aj) then we

    get score for relation rkl relative to actor Aj (denoted by sklj)

    � The higher the value for sklj the more privy is actor Aj to the relation

    rkl

  • 12/2/2007 Jaideep Srivastava 37

    Top 10 Concealed Relations

    Score of this relation has dropped

  • 12/2/2007 Jaideep Srivastava 38

    Top actors from top clusters

  • 12/2/2007 39

    SNA from MMORPG logs

  • 12/2/2007 Jaideep Srivastava 40

    MMO Games

    � MMO (Massively Multiplayer Online) Games are computer games that allow hundreds to thousands of players to interact and play together in a persistent online world

    Popular MMO

    Games- Everquest 2,

    World of Warcraft

    and Second Life

  • 12/2/2007 Jaideep Srivastava 41

    Examples MMOs

    http://everquest.station.sony.com/

    http://www.worldofwarcraft.com/burningcrusade/

    Example virtual world

    http://secondlife.com/

    Sociology research questions• How do networks within the ecosystem of groupsenable and constrain the formation of groups?

    • How do micro-group processes influence groupeffectiveness and social identity?

    Psychology research questions

    • What impact does playing video games haspeople’s real lives?

    • Is online behavior different in MMOs vs.tradition video games?

    Macroeconomics research questions

    • lots of them

    Computer Science research questions• quantitative metrics• algorithms

    ( ) ( )oi

    biob MMsimilarityMMsimilarity,,≠∀≥

    ( ) ( )

    −−∃∑

    thresholdtfttfttwi

    iboffsetiooffset

    ( ) ( )oi

    biob MMsimilarityMMsimilarity,,≠∀≥

    ( ) ( )

    −−∃∑

    thresholdtfttfttwi

    iboffsetiooffset

    ( ) ( )oi

    biob MMsimilarityMMsimilarity,,≠∀≥

    ( ) ( )

    −−∃∑

    thresholdtfttfttwi

    iboffsetiooffset

    Fig. X. Group evolution

    over time

    t

    1

    t

    2

    t

    3

    Impact on Social Science & Comp Sc

  • 12/2/2007 Jaideep Srivastava 42

    MMORPG Example – Everquest 2

    � MMORPGs (MMO Role Playing Games) are the most popular of MMO Games� Examples: World of Warcraft by Blizzard and Everquest 2 by Sony Online

    Entertainment

    � Various logs of players’ behavior are maintained� Player activity in the environment as well his/her chat is recorded at

    regular time instances, each such record carries a time stamp and a location ID

    � Some of the logs capture different aspects of player behavior� Guild membership history (member of, kicked out of, joined, left)

    � Achievements (Quests completed, experience gained)

    � Items exchanged and sold/bought between players

    � Economy (Items/properties possessed/sold/bought, banking activity, looting, items found/crafted)

    � Faction membership (faction affiliation, record of actions affecting faction affiliation)

  • 12/2/2007 Jaideep Srivastava 43

    DM Challenges for Social Science Research with Everquest 2 Data

    � Inferring player relationships and group memberships from game logs� Basic elements of the underlying social network such player-player and

    layer-group relationships need to be extracted from the game logs

    � Developing measures for studying player and group characteristics� Novel measures need to be developed that measure individual and group

    relationships for dynamic groups

    � Novel metrics must also be developed for quantifying relationships between the groups themselves, the groups and the underlying social network as well as the groups and the environment

    � Efficient computational models for analyzing group behavior� Extend existing group analysis techniques from the social science domain

    to handle large datasets

    � Develop novel group analysis techniques that account for the dynamic multiple group scenario as well as the data scale

  • 12/2/2007 Jaideep Srivastava 44

    Measuring Player-Player Relationship (1)

    � Given multiple sources of player-player interaction such as traded with, teamed up with, chat data etc, how does one

    combine all this data into an accurate measure

    representing social interaction strength between players

    � Naïve approach –

    � Reduce each interaction type into a single statistic, for example:

    � Traded with (TAB)- number of times A traded with B

    � Teamed up with (QAB)– total time (in secs) A and B were teamed up

    together

    � Chatted with (MAB)– total number of words exchanged between A and B

    � Represent social interaction strength between A and B by the vector SAB = [TAB QAB MAB]

    � All future analysis work with the vector or a combination/aggregation of the features

  • 12/2/2007 Jaideep Srivastava 45

    Measuring Player-Player Relationship (2)

    � Key Issue with Naïve approach –� Different forms of interaction are indicative of different levels of

    social interaction strength which is unknown, i.e. it is not clear what is the relative weighting among the features of the vector SAB� For example if A and B exchange 100 messages and trade 20 times, it

    is not clear what is the difference in degrees of social interactions are associated with “100 messages exchanged” and “traded 20 times”

    � Traditional normalization techniques that account for magnitude difference do not solve this problem

    � Addressing the issue� Normalize the features of the vector SAB for an accurate measure of

    social interaction strength between players A and B

    � It is desirable that the proposed approach is computationally efficient due to the size of the social networks involved in ourdomain

    � Key Idea: The social interaction strength indications of the different features will be derived from the data itself

  • 12/2/2007 Jaideep Srivastava 46

    Measuring Player-Player Relationship (3)

    � Approach 2

    � Let S* = {Sij | for all pairs of actors i and j}

    � For each feature in the elements of S*, fit the corresponding values into two clusters C1 and C2with means S1 and S2 respectively.

    � It is expected that the cluster with lower mean (say C1) will consist of the weaker social interactions and the cluster with higher mean (say C2) will consist of stronger social interactions

    � The midpoint between the boundaries of the two clusters will serve as a mid-point point for the range of social interaction strengths represented by that feature i.e. halfway between weak and strong social interaction strengths

  • 12/2/2007 Jaideep Srivastava 47

    Measuring Player-Player Relationship (4)

    � Proposed Approach (Continued)

    � If Sm denotes the mid-point then all values of the feature in set S*can be transformed into a corresponding normalized measures of social interaction strength using the transformation

    � f*={fij = sigmoid( d(fij – fm) )|for all pairs of actors i and j}

    � The above procedure is performed for each feature to obtain a normalized set of feature vectors

    � The features now indicate equal levels of social interaction strength and standard vector based analysis or combination/aggregation techniques can be used safely

  • 12/2/2007 Jaideep Srivastava 48

    Measuring Player-Player Relationship (5)

    � The main idea is that the clustering provides a rough sketch of the distribution of feature values across different social interaction strengths

    � One of the inherent assumptions made is that the set S* covers the gamut of social interaction strengths in the underlying social network

    � Not an unrealistic assumption if the set of actors is large enough

    � If the set of actors is small then the social interaction strengths obtained will be relative to the set of actors considered –

    � In such a case the obtained social interaction strengths can still be used for certain important SNA techniques, such as centrality measures and community extraction, that are more reliant on the relative connection strengths rather than the absolute connection strengths

  • 12/2/2007 49

    Trust in Social Networks

  • 12/2/2007 Jaideep Srivastava 50

    What is Trust?

    � “Trust is a bet about the future contingent actions of others” (Sztompka 1999)

    � Main two components of Trust: Belief & Commitment

    � Alice trusts Bob if she commits to an action based on a belief that Bob’s future actions will lead to a good outcome

    � Trust in not a single value: Context Dependent

    � If a trusts b then that does not mean that a has to always trust b.

    � Properties of Trust

    � Transitivity: Trust can be passed along a chain of trusting people. Since Trust is not perfectly transitive, it could degrades along a chain of acquaintances

    � Composability: With information from many trusted people, there is more reasoning and justification for the belief

    � Personalization and Asymmetry: Trust is a inherently a personal opinion

  • 12/2/2007 Jaideep Srivastava 51

    Key Terms

    � Trust propagation

    � An approach for inferring trust values in a network

    � Trust Metrics

    � An estimate of how much trust an agent a should accord an agent b, taking into account trust ratings

    from other persons in the network.

    � Trust in Online Social Network (Web of Trust)

    � Users either explicitly rate other people in their social

    network or trust values are inferred from them

  • 12/2/2007 Jaideep Srivastava 52

    Trust Propagation

    � The Objective of trust propagation� A user trusts some of his friends, his/her friends trust their friends and so on…

    � Given trust and/or distrust values between a handful of pairs of users, predict unknown trust/distrust values between any two users

    � When the source and sink are not connected, � look at the paths that connect them, use information from intermediate people to

    make a trust recommendation

    � TidalTrust Algorithm (Golbeck, 2005)

    � Breadth First based search from source to sink

    � Search minimum possible depth from source to sink

    � Accept ratings from only the highest rated neighbours to sink

    � Use weighted average of trust (rating)

    � Issues in Trust propagation� Trust Decay: How much trust should be passed on to the next node.

    � Avoiding Cycles: Take the shortest path

    � Transitivity:� Partial Transitivity� Direct Trust vs. Recommendation Trust comes into play when discussing transitivity.

    � Normalization� “The normalized trust value from a person who has made many trust ratings will be lower

    than if only one or two people had been rated.”

  • 12/2/2007 Jaideep Srivastava 53

    Trust Metric

    � Network Perspective� Global: Compute a single value for trust from the global

    perspective� Local: Compute values for trust from the perspective of individual

    nodes.

    � Computational Locus� Distributed: Compute trust values based on neighbourhood

    information. Distributed metrics are always global.� Centralized: A central 'authority' computes the trust values.

    � Privacy Issues: Assumes that trust values are publicly available.

    � Disadvantages of Distributed Approaches� Trust Data Storage

    � Convergence takes a long time

    � Advantages of Distributed Approaches� Trust values readily available.

    � Link Evaluation� Scalar: Compute trust between a and b.� Group: Compute trust between a set of individuals in V.

  • 12/2/2007 Jaideep Srivastava 54

    Applications

    � Social Browsing

    � Spam Filtering

    � P2P Systems

    � Identifying the sources of 'good' data

    � Web of Trust

    � Semantic Web

    � Trust based Recommender Systems

  • 12/2/2007 Jaideep Srivastava 55

    Application: Trust based Recommender Systems� Due to the propagation of trust over the social network, it is

    possible to compute the trust weight in more users and more items

    � It reduces a cold start users problem

    � It use trust information explicitly expressed by the users� the fake identities used for the attacks are not trusted explicitly by

    the active users (and by the users she trusts)

  • An Ant Colony Optimization Approach

    Expert Identification in Social Networks

  • Jaideep Srivastava

    Expert Identification Problem

    � How does one route queries in knowledge

    markets which are

    �Set up like social networks

    �One does not know the topic distribution or the expertise of the users beforehand

    �The system is not centrally managing the system

    � Problem Statement

    �Given a graph of E experts and a topic distribution T, devise an approach for query routing that can be incrementally updated

  • Jaideep Srivastava

    Possible Solutions

    � Have a centralized repository of expertise and experts.

    � Assumes that one already knows what the 'topics' are and

    who the corresponding experts are.

    � Alternatively maintain a topic hierarchy over the network.

    � Also assumes that the topics and that the topics are stationary.

    � Desired qualities of a solution

    � identify experts

    � routes queries without assuming a predefined topic scheme,

    � allows for changing topics, expertise and topology, and

    � does it efficiently

  • Jaideep Srivastava

    ACO Based Approach (1)

    � ACO (Ant Colony Optimization) �

    � Inspired from how ants forage for food� Initially an ant randomly forages for food

    �When it finds a source it retraces its path

    �Ants lay chemical trials called pheromones in their path which can evaporate if not reinforced

    �Frequently used trails become stronger

    �Queries are represented as ants

    �Whenever a query ant finds an answer to a query it retraces its path and lays out a trail

    �Experts are the nodes with strong trails leading to them

  • Jaideep Srivastava

    ACO Based Approach (2)�� Approach

    � For the first k iterations, queries are flooded in the network

    � Queries are then be routed based on the scents

    � Small amount of randomization is introduced in order to ensure that alternative routes (experts) are discovered

    � Unlike traditional ACO techniques, different pheromones in our setting can be combined for cases where one encounters an unfamiliar query

    � Advantages� The 'solution' self-organizes

    � 'Solution' can be incrementally built

    � Graceful degradation of performance

    � Can account for changes in the network

    � Topics for expertise do not have to be predefined

  • Jaideep Srivastava

    ACO Based Approach

  • Social Network Analysis of Medicare Data

  • Jaideep Srivastava

    A Patient Profile

    Pulmonologist

    Cardiologist Geriatrics

    Podiatrist

    Rheumatologist

    Medical Problems

    Patient

  • Jaideep Srivastava

    Background

    � In many cases people visit multiple doctors and

    specialists for their medical needs.

    � The patients would be served better if there were

    better coordination between these specialists.

    � To encourage cooperation one first has to identify

    clusters or groups of practitioners that people with

    certain characteristics are likely to go to

    � Solution: find networks of referrals

  • Jaideep Srivastava

    Referral Networks and Cooperation

    � Problem:

    �Identify Referral Networks to encourage specialists to work together to offer better services.

    � Possible Solution

    �Offer incentives individually to specialists.

    � Defects in the Solution

    �Each specialist may want to “optimize” his/her own

    incentives.

    �In such settings local optimization of services does not

    lead to global optimization of services

  • Jaideep Srivastava

    SNA Formulation of Poblem

    � Problem is reformulated as a SNA Problem

    � Matrices can be constructed for variables of interest e.g.,

    Patient- Specialist Matrix, Specialist-Location matrix

    etc.

    � Techniques like Clustering, Link Analysis, Co-clustering,

    identifying cliques can be used.

  • Some Emerging Applications

  • 12/2/2007 Jaideep Srivastava 68

    Key Idea: Approach:

    Status and Future WorkKey Benefits

    Yahoo’s My Web: Me, My Interests and My PeopleYahoo’s My Web: Me, My Interests and My People

    What does MyWeb represent?

    •What does creator think about a page?

    •What do I think about the page?

    •What do others think about the page?

    What can be inferred?

    •Who are the community of people who are “voted” as good resources on a topic?

    •What are the community of pages which are “voted” as good resources on a topic?

    •Who are people/pages authoritative on a topic.

    • Improve Webpage ranking

    • Discovering communities of people and Webpages based on what users think

    • Discovering expert Webpages and people on given topics

    • Personalized Web and Community

    •Excellent source for personalized ads.

    Current Ranking Schemes:

    • Creator Based Ranking.

    Future Work:

    • Use of User Votes to improve ranking

    • Determining a most resourceful person.

    {tags}

    {tags}

    {tags}

    {tags}

    {tags}

    {tags}

    Community Aware PageRank

    PageRank

    Tag Aware PageRank

    {tags}

    P1

    P2

    P3

  • 12/2/2007 Jaideep Srivastava 69

    Key Idea:

    - Identifying the true experts among Yahoo Answers participants

    - Keep track of users who consistently provide good answers for particular topics

    - Provide incentives for experts to stay on

    Yahoo! Answers in order to improve service

    Approach:

    Status and Future Work

    - Develop a PageRank style scoring

    scheme for ranking experts for various topics

    - Develop efficient algorithms for the same

    - Do we penalize users for possible ‘bad’

    answers? If so how do we identify bad answers?

    Key Benefits

    - The study of trends among questions

    answers posted by the users esp.

    comparing behavior of the experts and non-experts

    - The above study as well as retaining

    the experts can help improve the service provided by Yahoo! Answers

    Yahoo! Answers: Identifying the ExpertsYahoo! Answers: Identifying the Experts

    Answer_1User_1

    Answer_2User_2

    Answer_nUser_n

    User_x

    User_y

    User_z

    User

    VotesQuestion

  • 12/2/2007 Jaideep Srivastava 70

    Key Idea:

    - Current recommendation models

    assume all users’ opinions to be independent, i.e. the i.i.d assumption

    - Can we make use of the social

    network data of actors to relax this i.i.d assumption

    Approach:

    Status (Research Issues)

    - Statistical Techniques exist for relaxing the i.i.d

    assumption. Eg. Multilevel modeling and Random mixed effects models

    - Research effort needs to be directed towards

    extending or integrating the ideas presented in these techniques with existing recommendation systems

    - Alternatively, one can also work towards

    designing complex graphical models for the proposed problem

    Key Benefits

    - Understanding the impact of social networks on market behavior

    - Improved recommendation systems

    Influence of Social Networks on Product RecommendationsInfluence of Social Networks on Product Recommendations

    A1A2.

    .

    .

    AN

    A1 A2 A3 … AN

    P1 P2 P3 … PMA1A2.

    .

    .

    AN

    Recommendation

    System

    Social Network

    Product Opinion

  • 12/2/2007 Jaideep Srivastava 71

    trelease

    trelease

    Approach:

    1. Define feature vector, Mo, for objective movie

    - genre, MPAA rating, distributor, cast

    2. Use feature vector as the basis to cluster movies

    3. Take clustered movies as the training data to do classification for the new movie

    4. Find the closet movie’s popularity function, fb

    where f is normalized

    5. Get the current popularity function (query statistics) for the new movie

    - related queries include, e.g., movie’s name, stars

    6. Use pattern matching to compute the distance

    between the objective movie (new one) and the

    similar movie (old one), and further to verify if the

    new movie is popular for each region in each time (interval)

    if not exists, increase ad.

    Example:

    Using Query Statistics to Help Movie AdvertisementUsing Query Statistics to Help Movie Advertisement

    ( ) ( )oibi

    ob MMsimilarityMMsimilarity ,,≠∀≥

    ( ) ( )

    −−∃ ∑

    thresholdtfttfttwi

    iboffsetiooffset

    t

    t t

    t#

    qu

    erie

    s#

    qu

    erie

    s

    # q

    uer

    ies

    # q

    uer

    ies

    MN CA

    Queries related to “Harry Potter”

    II

    trelease

    trelease

    I

    as popular as usual in MN

    need more ad. in CA

  • 12/2/2007 Jaideep Srivastava 72

    Summary

    � Research in Social Network Analysis has significant history

    � Social sciences: Sociology, Psychology, Anthropology, Epidemiology, …

    � Physical and mathematical sciences: Physics, Mathematics, Statistics, …

    � Late 1990s: computer networks provided a mechanism to study social networks at a granular level

    � Computer scientists joined the fray

    � 2000 onwards: Explosion in infrastructure, tools, and applications to enable social networking, and capture data

    about the interactions

    � Opens up exciting areas of data mining research

  • 12/2/2007 Jaideep Srivastava 73

    Impact on Organizational Dynamics Research� Analyzing social networks

    � Hypothesis Testing and model verification� Example [Raymond2001]

    • Individual job performance was positively related to centrality in advice networks and negatively related to centrality in hindrance networks composed of relationships tending to thwart task behaviors

    • Hindrance network density was significantly and negatively related to group performance

    � Identifying who consults whom when confronted with a problem.� Such patterns can be identified by mining the chains of emails

    � Negative relationships may also be determined by mining changes in patterns of emails across time

    � The building of trust among individuals can be studied over time

    � Analyzing knowledge networks� Mining patterns of knowledge in network

    � These patterns may change/evolve over time

    � Social Networks data verification � Verifying subjective data (collected via surveys) with observed event

    sequences

  • 12/2/2007 Jaideep Srivastava 74

    Impact on Organizational Policy Research� Data security

    � An absolute must

    � Privacy

    � Careful balance between privacy and data analysis

    � Impact of SNA on employee-organization relationship

    � Careful thought needed in managing this

    � Should there be ‘opt-in’ or ‘opt-out’ options for employees?

    � Is this too ‘big brother-ish’?

    � Bottom line

    � New technologies are radically transforming the workplace, impacting organization information flow like never before

    � Not managed properly, they can lead to serious problems, e.g. employee releasing corporate secrets in blogs (Google)

    � Need to have tools that enable the understanding (and thus management) of organizational information flow

  • 12/2/2007 75

    Thank you!

    And be careful with that e-mail ☺☺☺☺