Networks and emotion-driven user communities at popular blogs

  • Published on
    06-Aug-2016


  • Eur. Phys. J. B 77, 597–609 (2010) DOI: 10.1140/epjb/e2010-00279-x

    Regular Article

    THE EUROPEAN PHYSICAL JOURNAL B

    Networks and emotion-driven user communities at popular blogs

    M. Mitrović¹, G. Paltoglou², and B. Tadić¹,ᵃ

    ¹ Department of theoretical physics, Jožef Stefan Institute, Box 3000, SI-1001 Ljubljana, Slovenia
    ² Statistical Cybernetics Research Group, School of Computing and Information Technology, University of Wolverhampton, UK

    Received 6 May 2010 / Received in final form 25 August 2010
    Published online 27 September 2010 © EDP Sciences, Società Italiana di Fisica, Springer-Verlag 2010

    Abstract. Online communications at web portals represent technology-mediated user interactions, leading to massive data and potentially new techno-social phenomena not seen in real social mixing. Apart from being dynamically driven, the user interactions via posts are indirect, suggesting the importance of the contents of the posted material. We present a systematic way to study Blog data by combined approaches of the physics of complex networks and computer science methods of text analysis. We map the Blog data onto a bipartite network where users and posts with comments are the two natural partitions. With machine learning methods we classify the texts of posts and comments by their emotional content as positive or negative, or otherwise objective (neutral). Using the spectral methods of weighted bipartite graphs, we identify topological communities featuring the users clustered around certain popular posts, and underline the role of emotional contents in the emergence and evolution of these communities.ᵇ

    1 Introduction

    Science of the Web [1] is an emerging multidisciplinary area with interconnected contributions from the physics of complex dynamical systems, computer science, and social science. Apart from developing technology and algorithms for safe and efficient information processing, Web research is concerned with understanding the structure of the Web [2] and the underlying evolution mechanisms [3], as well as the emergent social phenomena among Web users [4,5]. In this work we present a systematic methodology for the study of collective user behavior on Web portals. The approach is based on the physics of complex networks and the computer science methods of text analysis.

    Emotions & Emerging Behavior in Cyberspace. Recent developments of communication technologies have induced new types of human interactions mediated by computer networks and the on-line availability of different types of data. This forms the basis for a new practice of social communication, leading to potentially new technology-driven social phenomena not observed in conventional social mixing and thus calling for new scientific approaches [1,4,6,7]. On the other hand, a huge amount of data on user communications over different Web portals is rapidly accumulating, which offers rich possibilities for empirical study. The methodology of complex dynamical systems and the mapping of the data onto networks provide the means for detailed quantitative analysis.

    ᵃ e-mail: Bosiljka.Tadic@ijs.si
    ᵇ All data are fully anonymized. No information about user IDs is given.

    An important feature of online communications is that user interactions are mediated by the posted material, e.g., the text of posts and comments on the Blogs studied here. The indirect interactions not only change the conventional social rules known from face-to-face communication, but also indicate the importance of the contents of the posted material [8–10]. In the Blogs, the posted text may affect the behavior of the users who read it in different ways, depending on the information that the text contains, but also on its aesthetic, moral or emotional contents [8,11]. Recent studies increasingly show that the emotions expressed in the text (or other posted materials) play an important role in online social dynamics. The strength of the emotions expressed by an individual, e.g., the user reading a posted text, can be measured in the laboratory [12] and observed at the level of large-scale social effects [11,13,14].

    A number of conceptually different Web sites are currently available, ranging from sites with consumers' opinions about products, e.g., the movie database (IMDb) and books and music records (Amazon), across sites with exchange of opinions about everyday events (Diggs, Blogs, Forums), to fast on-line communication on friendship-based networks (Facebook, FriendFeed, MySpace). The Blogs are conceptually in between the consumer networks and the friendship networks mentioned above, and thus play a special role in the study of social on-line communities [8,9,15–19]. In Blogs, authors express and exchange their opinions via (short) written texts with other users, who are generally not acquaintances in real life. Registration of bloggers is required on many Blog sites, which enables quantitative analysis and tracing of users' activity over time.

    Network representations of on-line interactions. Network representations of complex dynamical systems, including social systems, have proved to be a useful tool for quantitative study, both in terms of the structure and of the dynamics over networks (for recent reviews see Refs. [20,21]). Mapping the data related to different social media onto networks reveals correlated dynamical behaviors, which are manifested in power-law dependences in the structure of networks and other related distributions [19,22–27]. Studies of group formation in networks related to movie data [22,28,29], music genres [24,25], the subjects of posts in Blogs [19], forums [30], news sites and conference publications [31], etc., show that similar mechanisms might underlie the behavior of humans in these on-line communications. Methods for analyzing the content of short messages and textual posts [5,32] and their emotional content [12,33,34] enable understanding of how interactions at the micro level (user-to-post-to-user) lead to large-scale behavior within these virtual communities.

    Mapping the data onto bipartite networks [19,22,25,28] is a suitable representation which enables the analysis and identification of different user communities. Statistical theory and community detection using the methods of eigenvalue spectral analysis of networks reveal that different mechanisms may drive the dynamics on very popular posts compared to all other posts. (Details of the spectral analysis of modular networks are described in Ref. [35], while other methods based on maximization of modularity are reviewed in Refs. [36,37].) In particular, the behavior of bloggers on normally popular posts [19] appears to follow a pattern of self-organized dynamical behavior, with communities mostly related to subject preference. In contrast, subjects appear completely mixed in the case of very popular Blogs [19], indicating different underlying mechanisms.

    In this work we focus on studying popular posts collected from bbc.co.uk/blogs/ by mapping the high-resolution data onto bipartite graphs and finding communities of users on them. We study the text of posts and comments of users within these communities with the aid of machine learning approaches, trained to detect and distinguish emotions in text. This enables us to study systematically the role of emotions in the emergence and evolution of the user communities and the patterns of user behavior at these popular posts.

    2 Data structure and contents of popular blogs

    We collected data [19] from the bbc.co.uk/blogs/ site for a time period of nearly two years, from June 2007 till February 2009. The dataset records, with high temporal resolution, the user actions identified by user IDs (posting a comment related to a given post), as well as the IDs of the posts and comments and their text. The concept of the BBC Blogs is rather special: the original posts are written by a few (invited) authors, who often do not take part in the discussion. All posts belong to one of the predefined categories, according to their subjects. Users are registered by IDs and allowed to comment on these posts. Information about comment-on-comment relations is not stored, so that all comments are automatically attributed to the original post. The whole dataset consists of $N_P = 3792$ posts and $N_C = 80873$ comments written by $N_U = 21462$ users.

    As mentioned above, we focus on the popular posts and on the analysis of the emotional contents of user comments related to them. As the popularity break-point occurs at 100 comments per post (see the discussion below and Ref. [19]), from the entire dataset we select these posts and all users and their comments related to them. We find $N_P = 248$ popular posts and $N_U = 13674$ users who wrote $N_C = 53606$ comments on these posts. We downloaded the text of each of these posts and of each related comment, and analyzed it with the emotion classifier described below. These posts belong to five different subject categories: Business and Economy, Music and Art, Sport, Technology, and Nature and Science. Knowing the authors and the posting times of all posts and comments, we are able to reconstruct temporal patterns of user behavior and link them to the emotional contents of the texts.
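The selection step described above can be sketched as follows. This is only an illustration under stated assumptions: the tuple layout `(comment_id, user_id, post_id)` and the function name `select_popular` are hypothetical, and treating the break-point as an inclusive lower bound (`>= threshold`) is our reading, not something the text states.

```python
from collections import Counter


def select_popular(comments, threshold=100):
    """Keep posts with at least `threshold` comments, plus their users and comments.

    comments: iterable of (comment_id, user_id, post_id) tuples (hypothetical layout).
    Returns (popular_post_ids, user_ids, kept_comments).
    """
    comments = list(comments)
    per_post = Counter(post_id for _, _, post_id in comments)
    # Assumption: the popularity break-point is an inclusive lower bound.
    popular = {p for p, n in per_post.items() if n >= threshold}
    kept = [c for c in comments if c[2] in popular]
    users = {user_id for _, user_id, _ in kept}
    return popular, users, kept


# Toy data: post "A" receives 5 comments from 3 users, post "B" receives 1 comment.
toy = [(f"C{k}", f"U{k % 3}", "A") for k in range(5)] + [("C5", "U9", "B")]
popular, users, kept = select_popular(toy, threshold=3)
```

With the toy threshold of 3, only post "A" and its three commenting users survive the selection.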

    2.1 Mapping the data onto bipartite networks

    The Blog data can be suitably represented by directed bipartite graphs with users as one partition and posts and comments as the other partition [19]. By definition [38], in bipartite networks links are allowed only between nodes of different partitions, which fully respects the structure of the interaction between users over posts and comments in the Blog data. In the data we have $i_U = 1, \ldots, N_U$ users and $j_B = 1, \ldots, N_P + N_C$ posts and comments, which together make $N = N_U + N_P + N_C = 106127$ nodes of the bipartite network; this is reduced to $N = 67528$ nodes in the case of the popular posts. The post/comment $j_B$ is linked to its author $i_U$ through a directed link that points from the user to the post/comment, $(i_U \to j_B)$. A directed link in the opposite direction, $(i_U \leftarrow j_B)$, indicates that the user $i_U$ left a comment on the post $j_P$. (Note that one user can write more than one comment on a specific post, resulting in multiple outgoing links.) Following these rules we obtain a directed bipartite network representation of the Blog data. An example of a single-post network which illustrates these rules is shown in Figure 1a. Together with the information about the time of appearance of each user, post, comment and link, the network contains the full information from the dataset. Applying graph theory methods we can now study the structure of the interactions at different levels, from individual nodes of both partitions, through the mesoscopic (community) structure, to the level of the entire network, as well as the evolution of the network. We utilize the emotional content of each post and comment as an additional feature that interferes with the network structure. It is indicated by the color of the comments (see detailed discussion below). Note that these networks can become very large, depending on the dataset considered.

    Fig. 1. (Color online) Example of a directed bipartite network with users (circles) and comments (squares) related to a particular post (a) and a weighted symmetrical posts-and-users network (b). The color of the posts and comments indicates their emotional content, classified as positive (red), negative (black) or neutral (white).
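The mapping rules of this subsection can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the record layout (`id`, `author`, `comments`), the node tuples `("user", ...)` / `("item", ...)` and the function `build_bipartite` are assumptions made for the example. The link directions follow our reading of the text: user to authored post or comment, and post to each user who commented on it.

```python
def build_bipartite(posts):
    """Build a directed bipartite graph from post records.

    posts: list of dicts (hypothetical layout)
        {"id": str, "author": str, "comments": [(comment_id, user_id), ...]}
    Returns (users, items, edges); edges is a list so that multiple links survive.
    """
    users, items, edges = set(), set(), []
    for post in posts:
        p = ("item", post["id"])
        a = ("user", post["author"])
        users.add(a)
        items.add(p)
        edges.append((a, p))          # author -> post (authorship)
        for comment_id, user_id in post["comments"]:
            u = ("user", user_id)
            c = ("item", comment_id)
            users.add(u)
            items.add(c)
            edges.append((p, u))      # post -> user: this user commented on the post
            edges.append((u, c))      # user -> comment (authorship)
    return users, items, edges


toy_posts = [
    {"id": "P1", "author": "U0",
     "comments": [("C1", "U1"), ("C2", "U2"), ("C3", "U1")]},
]
users, items, edges = build_bipartite(toy_posts)

# Every link joins a user node and a post/comment node, never two nodes of one partition.
is_bipartite = all(src[0] != dst[0] for src, dst in edges)
```

Keeping `edges` as a list rather than a set preserves the multiple links that arise when one user comments several times on the same post (user U1 in the toy data).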

    Other suitable representations can be obtained by compressing or projecting these bipartite graphs. In reference [19] we studied several networks of Blog data obtained by projecting the bipartite networks onto the user partition (user projection) and onto the post-and-comment partition (post-and-comments projection). For the purpose of this work, i.e., the networks of popular posts, we consider a compression of the whole network into a weighted bipartite network, which consists of users and posts only, while the weight of the link between a user and a post is given by the number of comments that the user left on that post. An example of such a network is given in Figure 1b, together with the color that indicates the cumulative emotional content of all comments related to the post.
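The compression into a weighted user-post network amounts to counting comments per (user, post) pair. The sketch below is illustrative only: the input layout (one `(user_id, post_id)` pair per comment) and the function names are assumptions, and the symmetric adjacency matrix is built as plain nested lists so that the example needs no external library.

```python
from collections import Counter


def compress_to_weighted(comment_records):
    """Weight of link (user, post) = number of comments the user left on that post.

    comment_records: iterable of (user_id, post_id), one entry per comment (hypothetical layout).
    """
    return Counter(comment_records)


def adjacency_matrix(weights):
    """Symmetric weighted adjacency matrix with users first, then posts."""
    users = sorted({u for u, _ in weights})
    posts = sorted({p for _, p in weights})
    index = {("user", u): k for k, u in enumerate(users)}
    index.update({("post", p): len(users) + k for k, p in enumerate(posts)})
    n = len(index)
    A = [[0] * n for _ in range(n)]
    for (u, p), w in weights.items():
        i, j = index[("user", u)], index[("post", p)]
        A[i][j] = A[j][i] = w   # undirected weighted link between user and post
    return users, posts, A


comment_records = [("U1", "P1"), ("U2", "P1"), ("U1", "P1"), ("U1", "P2")]
weights = compress_to_weighted(comment_records)
users, posts, A = adjacency_matrix(weights)
```

A matrix of this form (zero user-user and post-post blocks, weights in the off-diagonal blocks) is the kind of input that spectral methods for weighted bipartite graphs operate on.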

    2.2 Extracting emotions from text sentences

    We view the problem of extracting the emotions from text sentences as a classification problem. The general aim of classification is, given a document¹ $D$ and a fixed set of

    1 The term document is used here in the broadest of senses,signifying a sequence of words. In realistic environments, doc-

    classes $C = \{c_1, c_2, \ldots, c_t\}$, to assign $D$ to one or more of the available classes.

    We have implemented two supervised machine-learning classifiers for estimating the probabilities of whether a document $D$ is objective or subjective, and positive or negative. The classifiers function in a two-tier fashion: the first-stage classification determines the probabilities of whether $D$ is objective or subjective, i.e., $C_1 = \{obj, sub\}$, and the second determines the probabilities of the polarity of the document, i.e., $C_2 = \{neg, pos\}$, if it was classified as subjective in the first-tier classification. The final output of the classifiers is therefore one of $\{obj, neg, pos\}$.
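The two-tier decision logic can be sketched as below. Only the control flow is taken from the text; the function name `classify_two_tier` and the keyword-based stand-in classifiers are hypothetical placeholders, not the language-model classifiers used in the paper.

```python
def classify_two_tier(doc, subjectivity_clf, polarity_clf):
    """Return 'obj', 'neg' or 'pos' for a document.

    subjectivity_clf(doc) -> 'obj' or 'sub'   (first tier, class set C1)
    polarity_clf(doc)     -> 'neg' or 'pos'   (second tier, class set C2)
    """
    if subjectivity_clf(doc) == "obj":
        return "obj"              # objective documents never reach the second tier
    return polarity_clf(doc)


# Toy keyword stand-ins (hypothetical; used only to make the sketch runnable).
POSITIVE = {"great", "love"}
NEGATIVE = {"awful", "hate"}


def toy_subjectivity(doc):
    words = set(doc.lower().split())
    return "sub" if words & (POSITIVE | NEGATIVE) else "obj"


def toy_polarity(doc):
    words = set(doc.lower().split())
    return "pos" if len(words & POSITIVE) >= len(words & NEGATIVE) else "neg"


docs = ("The match starts at noon", "I love this post", "What an awful idea")
labels = [classify_two_tier(d, toy_subjectivity, toy_polarity) for d in docs]
```

Passing the two classifiers as arguments keeps the tiers independent, so either stage could be retrained or replaced without touching the other.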

    We have utilized language model classifiers [39,40] for both classification tasks. The aim of the classifiers is to maximize the posterior probability $P(c|D)$ that a given document $D$ belongs to class $c$. Typically, the best class is the maximum a posteriori (MAP) class $c_{MAP}$:

    \[ c_{MAP} = \arg\max_{c \in C} \{ P(c|D) \}. \tag{1} \]

    Using Bayes' rule, we get:

    \[ c_{MAP} = \arg\max_{c \in C} \left\{ \frac{P(D|c)\, P(c)}{P(D)} \right\} = \arg\max_{c \in C} \{ P(D|c)\, P(c) \}. \tag{2} \]

    Here we have removed the denominator $P(D)$ since it does not influence the outcome of the classification. $P(c)$ is the prior that indicates the relative frequency of class $c$, i.e., all other things being equal, the classifier will prefer the most frequent class.
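A minimal sketch of the MAP decision of equation (2) is given below. Working with logarithms is a common numerical convenience and is our assumption, not something stated in the text; the function name `map_class` and the toy likelihood and prior values are likewise hypothetical.

```python
import math


def map_class(log_likelihoods, priors):
    """Return (best_class, scores) with scores[c] = log P(D|c) + log P(c).

    The denominator P(D) is omitted: it is the same for every class and
    therefore cannot change which class attains the maximum.
    """
    scores = {c: log_likelihoods[c] + math.log(priors[c]) for c in priors}
    return max(scores, key=scores.get), scores


# Toy numbers: the likelihood favours 'sub' strongly enough to outweigh the prior.
priors = {"obj": 0.6, "sub": 0.4}
log_lik = {"obj": math.log(1e-6), "sub": math.log(3e-6)}
best, scores = map_class(log_lik, priors)

# With equal likelihoods the prior decides, i.e. the most frequent class wins.
tie_best, _ = map_class({"obj": 0.0, "sub": 0.0}, priors)
```

In the first call the products are 0.6e-6 for `obj` and 1.2e-6 for `sub`, so `sub` is selected; in the second call the classes differ only in their priors.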

    Language models operate by estimating the probability of observing document $D$, given class $c$. We represent $D$ as a token sequence $\{w_1, w_2, \ldots, w_n\}$; the aim of a language model is therefore to estimate the probability of observing this sequence, given $c$:

    \[
    \begin{aligned}
    P(D|c) &= P(w_1, w_2, \ldots, w_n | c) \\
           &= P(w_1|c)\, P(w_2|c, w_1) \cdots P(w_n|c, w_1, w_2, \ldots, w_{n-1}) \\
           &= \prod_{i=1}^{n} P(w_i|c, w_1, \ldots, w_{i-1}).
    \end{aligned}
    \tag{3}
    \]

    Usually, an n-gram approximation is used to estimate equation (3), which assumes that the probability of token $w_i$ appearing in document $D$ depends only on the preceding $n-1$ tokens:

    \[ P(w_i|c, w_1, \ldots, w_{i-1}) = P(w_i|c, w_{i-(n-1)}, \ldots, w_{i-1}). \tag{4} \]
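As a small worked illustration (ours, not part of the original text), consider a bigram model, i.e., a context of one preceding token, applied to a document of three tokens. The chain rule of equation (3) is exact; the approximation only shortens the context of the last factor, since the first two factors already condition on at most one preceding token:

```latex
\[
P(D|c) = P(w_1|c)\, P(w_2|c, w_1)\, P(w_3|c, w_1, w_2)
       \approx P(w_1|c)\, P(w_2|c, w_1)\, P(w_3|c, w_2).
\]
```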

    A straightforward way to calculate the maximum likelihood estimate of $P(w_i|c, w_{i-(n-1)}, \ldots, w_{i-1})$ during the training phase of the classifier, given a set of documents and their respective categories, is by counting the frequency of occurrences of the token sequences:

    \[ P(w_i|c, w_{i-(n-1)}, \ldots, w_{i-1}) = \frac{\#(c, w_{i-(n-1)} \ldots w_i)}{\#(c, w_{i-(n-1)} \ldots w_{i-1})}, \tag{5} \]

    ument can be any sort of textual communication between twoor more parties, such as blog posts, forum comments or In-stance Messaging utterances.


    where $\#(c, w_{i-(n-1)} \ldots w_{i-1})$ is the number of occurrences of token sequence $w_{i-(n-1)}$ ....
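The counting estimate of equation (5) can be sketched as follows. This is a simplified illustration under several assumptions: the training data layout (`{class: [token lists]}`) and the function names are hypothetical, no smoothing is applied (unseen sequences get probability 0), the first tokens of a document are not padded, and a context is counted only where a next token follows it, which differs from a raw occurrence count only at document ends.

```python
from collections import Counter


def train_ngram_counts(docs_by_class, n=2):
    """Count n-grams and their (n-1)-token contexts separately for each class.

    docs_by_class: {class_label: [list_of_tokens, ...]} (hypothetical layout).
    Returns (ngram_counts, context_counts), both keyed by (class, token_tuple).
    """
    ngrams, contexts = Counter(), Counter()
    for c, docs in docs_by_class.items():
        for tokens in docs:
            for i in range(n - 1, len(tokens)):
                context = tuple(tokens[i - (n - 1):i])
                ngrams[(c, context + (tokens[i],))] += 1
                contexts[(c, context)] += 1   # counted only when a next token exists
    return ngrams, contexts


def mle_prob(token, context, c, ngrams, contexts):
    """Ratio of counts as in equation (5); returns 0.0 for an unseen context."""
    denom = contexts[(c, tuple(context))]
    if denom == 0:
        return 0.0
    return ngrams[(c, tuple(context) + (token,))] / denom


training = {
    "pos": [["i", "love", "this"], ["i", "love", "it"]],
    "neg": [["i", "hate", "this"]],
}
ngrams, contexts = train_ngram_counts(training, n=2)
p_love = mle_prob("love", ["i"], "pos", ngrams, contexts)
p_this = mle_prob("this", ["love"], "pos", ngrams, contexts)
```

In the toy data, "love" always follows "i" in the positive class, while "this" follows "love" in one of two cases, so the two estimates are 1.0 and 0.5. A practical classifier would normally need some form of smoothing for unseen n-grams; whether and how the authors do this is not visible in this excerpt.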
