
How Well Does Language-based Community Detection Work for Reddit?

Urvashi Khandelwal
Stanford University

[email protected]

Silei Xu
Stanford University

[email protected]

Abstract

Online communities are becoming more and more important in the modern era. This makes it imperative for us to understand the structure of these communities. In addition, content generation sites like Reddit, Tumblr and Quora have an abundance of text in comments and posts, which can be used to model user interactions and network substructures. In this paper, we propose to study community detection in Reddit solely based on language features, to better understand how well language informs the boundaries between different communities. We use supervised prediction tasks and unsupervised community detection to gauge the quality of these features, and find that they provide a fairly robust signal for understanding and modeling user interactions in the network.

1 Introduction

One of the most important tasks in understanding a social network is community detection. By understanding how the network organizes itself into communities, we gain insight that can explain the various relations between entities. An important signal that lends itself well to community detection in an online social network is text. Danescu-Niculescu-Mizil et al. showed in their work on beer-review communities that language plays a key role in identifying the life-cycle of a user within a community (Danescu, 2013). This is one instance of the importance of language in understanding online social networks. For a content generation site like Reddit, we utilize language features from user comments to predict subreddits for test users, as well as to detect communities within the user population.

Traditionally, community detection algorithms have looked at network structure as well as node attributes. However, in this study we focus our attention on language features. How well do language similarities connect users with similar interests? This is an important question when trying to understand an individual user's diverse interests, as well as which communities suit them best. Such an understanding can aid collaborative filtering, content recommendation and personalized search.

The rest of the paper is organized as follows: in Section 2, we introduce related work. In Section 3, we describe the dataset, its processing and feature extraction. In Section 4, we give a brief overview of the learning algorithms, proceeding to the experiments in Section 5 and wrapping up with a conclusion in Section 6.

2 Related Work

Online communities have been studied for decades, and most traditional community detection algorithms focus on analyzing the graph structure of the data (Fiedler, 1973; Pothen, 1990; Newman, 2004). However, in the real world, apart from the topological structure, we also have content information available to us. In recent years, community detection on networks with node attributes has gained more and more attention (Zhou, 2009; Yang, 2013). One of the most significant attributes of nodes (users) in content generation sites is their language (Danescu, 2013). In this project, we go one step further by detecting communities solely based on users' language, to see how well language informs user interactions and ground-truth communities.

3 Dataset and Feature Extraction

Reddit is an online content generation website organized into topically specific subreddits, where users can submit posts and have other users comment on them.


Figure 1: A subset of unigrams for a randomly selected user, where the size of a word is correlated with its frequency.

The Reddit comments dataset¹ is a publicly available compilation of comments from nearly 97,000 subreddits since 2008. The number of comments in these communities follows a heavy-tailed distribution, such that only about 7,000 communities have at least 1,000 comments in 2014.

In this paper, we only consider comments posted during 2014. In addition, we consider only mid-sized communities, defined as those with at least 100 thousand and at most 1 million comments over the entire year. This choice avoids extremely large communities, where users are likely to feel a lack of loyalty, and extremely small communities, where the number of comments is too small to constitute a full community. After filtering in this way, we end up with 633 communities.

Next, we looked at the user population for the included communities. First, we filtered out bots, identified as usernames ending with the string 'bot' or 'Bot', as well as users that posted over 6,000 comments in 2014, posted to over 200 communities in 2014, or posted over 700 comments to a single community (activity representative of bots upon inspection of the data). Once we have this raw list of users, we further prune all those who posted fewer than 1,000 comments through the year, in order to have a large number of features per user. In this study, we include 5,123 users, each having posted in at least 10 subreddits. Overall, we have 7,043,339 comments.
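To make these filtering rules concrete, the following is a minimal sketch of how they could be applied, assuming the comments are loaded into a pandas DataFrame with hypothetical 'author' and 'subreddit' columns; this is an illustration of the rules above, not the authors' actual pipeline.

```python
import pandas as pd

def filter_users(comments: pd.DataFrame) -> list:
    """comments: one row per comment, with 'author' and 'subreddit' columns."""
    stats = comments.groupby("author").agg(
        n_comments=("subreddit", "size"),
        n_subreddits=("subreddit", "nunique"),
        max_per_subreddit=("subreddit", lambda s: s.value_counts().max()),
    )
    # Bot heuristics from the text: username suffix or extreme activity.
    is_bot = (
        stats.index.str.lower().str.endswith("bot")
        | (stats["n_comments"] > 6000)
        | (stats["n_subreddits"] > 200)
        | (stats["max_per_subreddit"] > 700)
    )
    kept = stats[~is_bot]
    # Keep only highly active, topically diverse users.
    kept = kept[(kept["n_comments"] >= 1000) & (kept["n_subreddits"] >= 10)]
    return kept.index.tolist()
```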

Finally, we extract language features for each user. Since the language on Reddit is extremely informal, we start by looking at unigrams. For each user, we tokenize each comment using NLTK². A canonical list of stop words is used to filter for the more topical and user-specific tokens. Tokens containing punctuation, URLs, numbers and non-ASCII characters are removed. Finally, nouns and verbs are lemmatized. This generates a list of unigrams with term frequencies. The overall size of the unigram vocabulary is 1,512,526 words, with an average of 5,180.5 unigrams per user.

¹ https://www.reddit.com/r/datasets/
² www.nltk.org

For the entire set of users, we proceed to compute Term Frequency-Inverse Document Frequency scores as follows:

$$\mathrm{tfidf}(w) = \left(1 + \log(\mathrm{tf}(w))\right) \times \log\frac{N}{\mathrm{df}(w) + 1}$$

where $w$ is a single token, $N$ is the number of users, $\mathrm{tf}$ is the term frequency, and $\mathrm{df}$ is the document frequency, treating each user's comments collectively as a document. We use the tf-idf scores as the feature vector for each user. Given the sparsity of the resulting user-unigram matrix, we use SVD to map the features to lower dimensions, making different users' language more comparable.
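A minimal sketch of this feature pipeline is shown below, implementing the tf-idf weighting above directly (scikit-learn's TfidfVectorizer uses a slightly different smoothing) and reducing dimensionality with TruncatedSVD; the matrix layout and the component count are assumptions, not values from the paper.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

def tfidf_svd_features(counts: sparse.csr_matrix, n_components: int = 100) -> np.ndarray:
    """counts: (n_users x vocabulary) matrix of raw term frequencies."""
    n_users = counts.shape[0]
    df = np.asarray((counts > 0).sum(axis=0)).ravel()   # document frequency per token
    idf = np.log(n_users / (df + 1.0))                  # log(N / (df(w) + 1))
    tf = counts.astype(float)
    tf.data = 1.0 + np.log(tf.data)                     # 1 + log(tf(w)) on nonzero entries
    tfidf = sparse.csr_matrix(tf.multiply(idf))         # scale each column by its idf
    # SVD maps the sparse user-unigram matrix to a dense low-dimensional space.
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    return svd.fit_transform(tfidf)
```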

4 Learning

In order to gauge the quality of our language-based features, we looked at two types of tasks: supervised prediction and unsupervised community detection.

4.1 Supervised Learning - Multilabel Classification

For supervised prediction of subreddits for test users, we use multilabel classification algorithms, since each user posts to multiple subreddits and belongs to multiple classes.

4.1.1 Decision Trees

Decision Trees (Quinlan, 1986) are described as follows:

1. Start with the entire dataset.
2. Split along the attribute/dimension that provides the highest information gain.
3. Divide examples into children based on the splitting attribute.
4. Recurse on the children, going back to step 2 for each.

Information Gain on a set $X$, given that we know the value of attribute $A$, is defined as follows, in terms of information entropy ($H$) and conditional entropy:

$$IG(X|A) = H(X) - H(X|A)$$

$$H(X) = -\sum_i p(x_i) \log_2 p(x_i)$$

$$H(X|A) = \sum_{i,j} p(x_i, a_j) \log_2 \frac{p(a_j)}{p(x_i, a_j)}$$


In the context of our setting, Decision Trees can be thought of as splitting based on different topics, thereby creating a kind of topic hierarchy in the process. The subreddit labels lie in the leaves, and each user can belong to multiple subreddits.
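As a toy illustration of the quantities above (not the authors' code), the information gain of a candidate attribute can be computed as follows, using the equivalent form $H(X|A) = \sum_a p(a) H(X|A=a)$:

```python
import numpy as np

def entropy(labels: np.ndarray) -> float:
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(labels: np.ndarray, attribute: np.ndarray) -> float:
    """IG(X|A) = H(X) - H(X|A)."""
    h_conditional = sum(
        (attribute == a).mean() * entropy(labels[attribute == a])
        for a in np.unique(attribute)
    )
    return entropy(labels) - h_conditional

y = np.array([0, 0, 1, 1])      # class labels
a = np.array([0, 0, 1, 1])      # a perfectly informative attribute
print(information_gain(y, a))   # 1.0 bit
```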

4.1.2 Random Forests

Random Forests are an ensemble technique that combines multiple decision trees (Breiman, 2001). Each tree $k = 1, \ldots, n$ is constructed based on i.i.d. random vectors $\Theta_k$ sampled from the feature distribution. The outputs of all trees are combined by majority vote, with some threshold in multilabel settings. This helps to reduce variance by training on different parts of the dataset and averaging over all predictions. In our setting, this is expected to help reduce overfitting as well as the generalization error.
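A hedged sketch of the multilabel setup with scikit-learn (the library the paper cites): its tree ensembles accept a binary indicator matrix of subreddit labels directly. Apart from the 5 trees reported in the experiments below, the toy data and parameters here are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))            # stand-in for SVD-reduced tf-idf features
Y = rng.integers(0, 2, size=(100, 8))     # binary indicators: user x subreddit

# Multilabel random forest; a single DecisionTreeClassifier works the same way.
forest = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, Y)
Y_pred = forest.predict(X[:3])            # one 0/1 label vector per test user
```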

4.1.3 ML k-NN

Multi-Label k-Nearest Neighbor is an extension of the k-NN algorithm. For every query user from the test set, we compute the k nearest neighbors in the feature space, using their class distribution as a prior. Then, the MAP estimate is used to compute the label set for the query user (Zhang, 2007). In our setting, this helps to demonstrate how well the distribution in the feature space represents actual subreddits.

This algorithm should perform better than the two based on decision trees: topical hierarchies try to isolate a subreddit's language, but multiple subreddits might have similar language, which may lead to higher error. ML k-NN, on the other hand, attempts to find similar users, who might all be posting to a similar set of subreddits, yielding higher confidence in the predictions.
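A sketch using the scikit-multilearn implementation of ML k-NN follows; this is an assumption, since the paper does not name the implementation it used. k=5 matches the experiments below.

```python
import numpy as np
from skmultilearn.adapt import MLkNN   # assumes the scikit-multilearn package

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))            # user feature vectors
Y = rng.integers(0, 2, size=(100, 8))     # binary subreddit indicators

clf = MLkNN(k=5)                          # 5 neighbors, as in the experiments
clf.fit(X, Y)
Y_hat = clf.predict(X[:3]).toarray()      # MAP estimate of each user's label set
```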

4.2 Unsupervised Learning

Community detection can be seen as a clustering of nodes into sets of similar entities. Hence, in order to detect communities in Reddit based on users' language, we tried two unsupervised clustering algorithms.

4.2.1 k-means Clustering

We start with the k-means clustering algorithm to group users into communities, with the objective of using coordinate descent to minimize the within-cluster sum of squares metric:

$$J(c, \mu) = \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$

where $J$ is the distortion function that measures the sum of squares between each $n$-dimensional example $x^{(i)}$ and the center $\mu_{c^{(i)}}$ of the cluster to which it is assigned. In this setting, the $x^{(i)}$'s are feature vectors containing tf-idf scores for user $i$. This algorithm performs hard clustering, i.e. every user is assigned to a single cluster.
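A minimal sketch with scikit-learn's KMeans (a Lloyd-style implementation, consistent with the experiments below); the toy data and cluster count are assumptions. Its inertia_ attribute is exactly the distortion $J(c, \mu)$ above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))    # stand-in for the users' tf-idf/SVD features

km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X)
labels = km.labels_               # hard assignment: one cluster per user
distortion = km.inertia_          # within-cluster sum of squares J(c, mu)
```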

4.2.2 EM Algorithm

The EM algorithm is a soft clustering algorithm which uses coordinate ascent to maximize the following objective function:

$$J(Q, \theta) = \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$$

where $x^{(i)}$ is the feature vector containing tf-idf scores for user $i$, $z^{(i)}$ is the subreddit (cluster) being considered, and $Q_i(z^{(i)})$ is the distribution over the labels for user $i$.
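A soft-clustering sketch with scikit-learn's GaussianMixture, which the experiments below report using; the diagonal covariance choice is an assumption for high-dimensional features, not a setting from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

gmm = GaussianMixture(n_components=20, covariance_type="diag", random_state=0).fit(X)
resp = gmm.predict_proba(X)   # soft memberships: Q_i(z) for each user i and cluster z
```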

5 Experiments and Analytical Discussion

In this section, we present results for the supervised and unsupervised approaches to learning described above. We evaluated both techniques using Precision, Recall and F-1 scores. Precision is defined as the fraction of predicted subreddit labels that matched the ground truth, Recall is the fraction of ground-truth subreddit labels that were predicted, and the F-1 score is the harmonic mean of Precision and Recall.
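Since these are per-user (example-based) definitions, scikit-learn's average="samples" option computes them directly, as in this sketch; the paper's exact evaluation code is not given, so this is one plausible realization.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

Y_true = np.array([[1, 0, 1], [0, 1, 1]])   # ground-truth subreddit sets
Y_pred = np.array([[1, 0, 0], [0, 1, 1]])   # predicted subreddit sets

p = precision_score(Y_true, Y_pred, average="samples", zero_division=0)
r = recall_score(Y_true, Y_pred, average="samples", zero_division=0)
f = f1_score(Y_true, Y_pred, average="samples", zero_division=0)
print(p, r, f)   # 1.0, 0.75, ~0.83: averaged over users, not labels
```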

5.1 Ground Truth

Ground truth in this dataset exists in the form of a set of subreddits for each user, where a subreddit is included if the user posted to it during the year. This ground truth is useful for evaluating both the multilabel classification and the clustering approaches.

5.2 Prediction Task Results

The prediction task involves training a model to learn which subreddits a user belongs to based on the feature vectors, in order to predict these for a test set of users. We randomly pick 4,123 users as our training set and test on the remaining 1,000 users, using the supervised learning algorithms described in Section 4.1: decision trees, random forests, and multi-label k-NN.


Figure 2: Supervised learning Precision/Recall curves. (a) Decision Trees, (b) Random Forests, (c) k-NN.

For random forests, the number of trees used was 5. For ML k-NN, the number of neighbors considered was 5. The Precision, Recall, and F-1 scores are shown in Figure 2.

As we can see, the results for decision trees and random forests are relatively better with fewer features. This can be explained by the fact that after performing SVD and picking the top n features, a small value of n tends to include the most topically relevant features, which help to distinguish between similar subreddits when classifying a new user. For large values of n, we tend to include noisier features, which cause the model to overfit. For instance, with fewer features we get a shallower decision tree; as the tree gets deeper with higher n, performance degrades, since picking the correct subreddits at the leaves gets harder due to confusion caused by noisy features.

Using a higher number of decision trees (5 in random forests) helped to increase precision, in the sense that fewer subreddits were being predicted and a larger fraction of these were correct. But recall was lower, since fewer predictions meant retrieving less of the ground truth. Overall, fewer noisy results (given the confusion caused by noisy features) meant that random forests were an improvement over decision trees, as expected.

For ML k-NN, the performance remained fairly stable with a larger number of features, because the algorithm relies on other similar users to make a prediction, and when the feature vector dimensions change, all users are affected in a similar way. This shows that the similarity of users in the feature vector space is robust to noisy features.

5.3 Community Detection Results

Baseline: We implemented a naive baseline based on random assignment, to get a sense of how the clustering algorithms perform in comparison to random guessing. In this baseline, every user is assigned a cluster between 1 and k uniformly at random.
k-means clustering: We adopted Lloyd's algorithm for k-means clustering. The number of clusters was varied from 2 to 600, where each user is assigned to a single cluster.
EM: We used the Gaussian Mixture Model implementation for EM. The number of latent variables was varied from 2 to 600, and a probability was computed for each user's membership in each class.

We first evaluate the performance of k-means and EM by calculating the average weight of intra-cluster and inter-cluster edges of the detected clusters, where edges are defined as follows. For a user $u_i$, let $C_i$ denote the set of subreddits $u_i$ commented in, and for each subreddit $c \in C_i$, let $N_i(c)$ denote the number of comments $u_i$ posted in $c$. Then the weight of the edge between a pair of users $u_i$ and $u_j$, denoted by $w_{i,j}$, is defined as

$$w_{i,j} = \frac{\sum_{c \in C_i \cap C_j} N_i(c)}{\sum_{c \in C_i} N_i(c)} \times \frac{\sum_{c \in C_i \cap C_j} N_j(c)}{\sum_{c \in C_j} N_j(c)}$$

Note that if $C_i \cap C_j = \emptyset$, then $w_{i,j} = 0$, and if $C_i = C_j$, then $w_{i,j} = 1$. The results are shown in Fig. 3a.
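A direct sketch of this edge weight, representing each user's activity as a hypothetical dict mapping subreddit to comment count:

```python
def edge_weight(Ni: dict, Nj: dict) -> float:
    """Ni, Nj: {subreddit: number of comments} for users u_i and u_j."""
    shared = set(Ni) & set(Nj)
    if not shared:
        return 0.0                                   # C_i and C_j are disjoint
    frac_i = sum(Ni[c] for c in shared) / sum(Ni.values())
    frac_j = sum(Nj[c] for c in shared) / sum(Nj.values())
    return frac_i * frac_j

# Identical subreddit sets give weight 1, regardless of comment counts:
print(edge_weight({"python": 3, "nba": 1}, {"python": 2, "nba": 5}))   # 1.0
```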

The availability of ground-truth subreddits allows us to quantitatively evaluate the performance of these two unsupervised learning algorithms. However, it is hard to find the mapping between detected communities and ground-truth subreddits. Thus, we evaluate the clustering results by calculating the average F-1 score of the best matching ground-truth community to each detected community and the best matching detected community to each ground-truth community:

$$\frac{1}{2}\left(\frac{\sum_{c_i \in C^*} F_1(c_i, \hat{c}_{m_i})}{|C^*|} + \frac{\sum_{c_i \in C} F_1(c_{m'_i}, c_i)}{|C|}\right)$$

where $C^*$ and $C$ denote the sets of ground-truth and detected communities, respectively, and $m_i$, $m'_i$ denote the best matches: $m_i = \arg\max_{c \in C} F_1(c_i, c)$ and $m'_i = \arg\max_{c \in C^*} F_1(c, c_i)$. The results are shown in Fig. 3b.

Figure 3: Clustering evaluation, varying the number of clusters from 2 to 600. (a) Average weight of intra-cluster and inter-cluster edges for k-means and EM, compared against the average weight over all edges. (b) Average F1 score for k-means, EM and the random baseline.
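A sketch of this matching score, representing communities as sets of user ids (the data layout is an assumption); the F-1 between two sets reduces to $2|A \cap B| / (|A| + |B|)$:

```python
def f1(a: set, b: set) -> float:
    """F-1 between two communities viewed as sets of users."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

def matching_f1(truth: list, detected: list) -> float:
    """Average of best-match F-1 in both directions, as defined above."""
    t_side = sum(max(f1(c, d) for d in detected) for c in truth) / len(truth)
    d_side = sum(max(f1(c, d) for c in truth) for d in detected) / len(detected)
    return (t_side + d_side) / 2

truth = [{1, 2, 3}, {4, 5}]
detected = [{1, 2}, {3, 4, 5}]
print(matching_f1(truth, detected))   # 0.8
```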

From Fig. 3a, one can observe that both the k-means and EM algorithms perform much better than randomly guessing cluster assignments: the average weight of intra-cluster edges is much higher than that of inter-cluster edges, whereas the two metrics should be similar if we clustered users randomly. From Fig. 3b, one can see that both algorithms perform poorly when trying to match the ground truth precisely. The algorithms perform relatively better when trying to discover fewer clusters. This is because similar subreddits have similar language, which allows them to be grouped into super-communities, and the clustering algorithms seem to be better at detecting these than the more fine-grained subreddits we are working with.

5.4 User Activity

The prediction task and community detection results collectively demonstrate that the unigram tf-idf scores do not form a strong set of features. One reason for this may be that we construct these features from comments posted throughout the year. However, as shown in Fig. 4, users post comments to different subreddits at different times of the year. Mixing signals from throughout the year might add noise, and it could be interesting to consider comments and ground truth from specific time windows (day, week, etc.).

Figure 4: Event plot showing the comments posted to different subreddits for a randomly selected user who posted 501 comments to 15 subreddits through the year.

6 Conclusion and Future Work

From this study, we see that the language in comments holds signals that inform the modeling of user interactions on Reddit. If we consider the task of understanding interactions based on similar interests, rather than precise subreddits, these techniques hold a lot of potential. In that case, it might work to our advantage to cluster the ground truth into groups of similar topics. Another possible future direction is making the language features more robust, i.e. considering bigrams, processing the comments using more advanced natural language processing techniques, and using word vectors to match language based on semantic meaning. Finally, it might also be beneficial to consider time windows for the comments. This would not only improve the current study, but also allow us to understand how user interactions change over time. In conclusion, our project serves as a good starting point for using text to model interactions between users in the Reddit network, and having confirmed that language features provide a robust signal, we can pursue many future directions to make a more compelling argument about how they can be useful.

References

Fiedler, Miroslav. 1973. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal.

Pothen, Alex, Horst D. Simon, and Kang-Pu Liou. 1990. Partitioning sparse matrices with eigenvectors of graphs. SIAM Journal on Matrix Analysis and Applications.

Newman, Mark E. J. 2004. Fast algorithm for detecting community structure in networks. Physical Review E.

Yang, J., McAuley, J., and Leskovec, J. 2013. Community Detection in Networks with Node Attributes. IEEE International Conference on Data Mining, Dallas, TX, USA.

Zhou, Y., Cheng, H., and Yu, J. 2009. Graph Clustering Based on Structural/Attribute Similarities. Proceedings of the VLDB Endowment, Lyon, France.

Zhang, M., and Zhou, Z. 2007. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition.

Breiman, L. 2001. Random Forests. Machine Learning.

Quinlan, J. R. 1986. Induction of Decision Trees. Machine Learning.

Pedregosa, F., et al. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research.

Danescu-Niculescu-Mizil, C., West, R., Jurafsky, D., Leskovec, J., and Potts, C. 2013. No Country for Old Members: User lifecycle and linguistic change in online communities. ACM International Conference on World Wide Web, Rio de Janeiro, Brazil.