Analyzing Political Discourse on Reddit


  • 8/10/2019 Analyzing Political Discourse on Reddit

    1/20

    Network Analysis Final Report

    Max Candocia

    December 12, 2014

    Abstract

    This paper shows how to apply semi-supervised generative models to classify the political
    ideologies of users in an online network, and then analyzes interactions throughout the
    website using homophily measures. Reddit, a large social media website, is used as the
    sample source. The model uses two sets of variables: the parts of the website where users
    post comments, together with whether those comments are positively received, and various
    n-grams used within the comments. Internet users often discuss and debate political topics,
    but the manner in which these discussions take place varies greatly. After classifying
    users, analyzing the homophily of different portions of the website provides insight into
    how diversity of ideologies (or lack thereof) plays a role in facilitating discussions. In
    addition, the proportion of positively scored comments between people of different or
    similar ideologies is used as a way of detecting hostility between groups. For the
    generative model, one of the interesting results is which n-grams are used more often by
    people of differing ideologies. While these words do not include context, the second set of
    variables, which involve posting locations, helps provide that context using Reddit's
    website structure and scoring system. Hyperparameters control how sharply the different
    variable sets discriminate between classes, and are shown to reduce both error and impurity
    under 5-fold cross-validation. Finally, the limitations of the algorithm are discussed, as
    comment histories are treated as corpora, which can be tens of thousands of words long.

    Introduction

    User interactions on websites are often a topic of interest. However, the amount of data
    one can gather on a particular user is often limited, especially if the user has little
    information publicly available or little text available for analysis. For the purposes of
    this project, Reddit is used. Reddit, one of the largest social media websites in the
    United States, hosts tens of millions of discussions between millions of users. Many of
    these users have extensive posting histories, and discussions are divided into sections of
    the website called Subreddits. A Subreddit is a section whose content revolves around a
    single topic, which can be either specific, such as a sports team in Germany, or broad,
    such as US politics. There are Subreddits for a variety of topics, and they are usually
    named by putting /r/ (pronounced like the English letter r) before the name, such as
    /r/politics, /r/Conservative, or /r/DebateCommunism. Anecdotally, these communities are
    biased, and the bias is relatively obvious to any user who browses the site, but no attempt
    has been made to measure it in terms of political ideology or affiliation, largely due to
    the difficulty of classifying the ideologies of tens of thousands of users.

    With such classifications, it is possible to analyze the composition of various Subreddits,
    as well as to quantify interactions between users of different ideologies in them. The
    results may be interpreted on a scale of -1 to 1, where -1 indicates that a Subreddit is a
    debating ground (i.e., the sole purpose of commenting is to argue with someone of a
    different political ideology), while 1 indicates that it is a pure hive mind (i.e.,
    believers of an ideology only discuss amongst themselves). Of course, the limitations of
    the classification algorithm should be taken into account whenever analyzing the results.

    Background

    Ideologies

    For this study, users are grouped into five different ideological categories. This is not a 100%

    accurate representation of the political spectrum, but the groups are general enough to apply

    to a large number of people, and they are different enough in terms of philosophical aspects

    that they can be considered distinct. These terms are relatively unambiguous when used in

    the context of United States politics.

    Anarcho-Capitalism

    Anarcho-Capitalism is a subset of libertarianism as it is described below. Generally
    speaking, it is the belief that all interactions between humans should be facilitated
    according to principles of laissez-faire capitalism and that government should not exist.

    Conservatism

    In this context, conservatism refers to conservatism in the United States. Generally
    speaking, it is the belief that the government should have a smaller role in economic
    issues. As far as social issues go, there are varying degrees of belief that government
    should control certain social interactions between people, but on Reddit, the number of
    conservatives who believe in strong government control over social interactions is small
    relative to the total number of conservatives. Most of the followers of this ideology on
    Reddit lean towards the US Republican party.

    Liberalism

    In this context, liberalism refers to what is generally known as left-liberalism, which
    favors less government involvement in social affairs between people and more government
    involvement in people's well-being and in the regulation of businesses. Most of the
    followers of this ideology on Reddit lean towards the US Democratic party.

    Libertarianism

    In this context, libertarianism refers to what is generally known as right-libertarianism,
    which is the belief that government should have minimal involvement in both the social and
    economic affairs of people. While anarcho-capitalism is a subset of this ideology, it is
    excluded from this category because of the divide between the idea of minimal government
    and the idea of no government.

    Socialism

    In this context, socialism mostly refers to communism, but it can also refer to the
    ideologies of social democrats, who prefer more gradual change than other socialists; the
    latter often describe themselves as anarchists. One of the common goals of socialism is to
    eliminate hierarchy in society via the elimination of capitalism.

    Learning Algorithm

    Learning algorithms for large networks require relatively simple models that can support a
    large number of parameters. For this reason, generative models are often used. Nigam et al.
    describe a basic framework for working with a semi-supervised model in general. Wang et al.
    describe several hyperparameter choices, such as link type importance, that are used in
    this methodology. However, CATHYHIN, as described in Wang et al., is an unsupervised
    learning method that is designed to work on bibliographic data and to create subtopics for
    clustering (clusters within clusters). While this would certainly be possible with
    ideologies (e.g., treating Anarcho-Capitalism as a subtopic of Libertarianism), this aspect
    of the algorithm is ignored here in order to explore a simpler model.

    Homophilies of Online Communities

    Homophilies of online communities are of great interest to researchers, as the data are
    easier to gather and far more abundant than real-life data. Studies measuring homophily
    generally use race, gender, and location as attribute variables. However, Bisgin et al.
    describe how online communities have the potential to attract people of similar interests
    due to the very low barrier of entry. They describe methods for measuring pairwise
    similarity using Jaccard similarity coefficients, where the similarity attribute is the set
    of interests shared between users. For this data, however, we rely on clustering to
    determine whether users are similar, so that method cannot be applied here. Granted, by
    definition of the generative model used, users who use similar words or post in similar
    places will tend to be clustered together.

    Data

    Sampling

    The data was collected from http://www.reddit.com/ using the Reddit API over the course of
    two weeks. The scraper cycled sequentially through 24 different Subreddits and grabbed the
    most recent post that had at least 1 comment reply and was at least 2 days old. The reason
    for the age cutoff is that most posts do not acquire new comments after 1 day. Within each
    comment thread, each comment is collected, as well as the commenting history (last 200
    comments) of the user who posted the comment. There are cutoffs for threads with an
    abnormally high number of comments. However, this occurs very rarely, and most of the
    comments beyond the cutoff have less substance and no replies, making them less meaningful.
    This is partly due to Reddit's ranking algorithm, which places more recent and more
    positively voted comments most visibly.

    Labeling

    Since ideologies are not labeled a priori, they were manually labeled by selecting a random
    sample from the set of all users. Each sampled user's entire comment history was analyzed,
    and a decision was made regarding classification. Some users could not be classified due to
    insufficient information, so they were skipped. The group sizes from this classification
    were used to construct priors for the generative model. The group sizes for
    anarcho-capitalism, conservatism, liberalism, libertarianism, and socialism were 53, 80,
    296, 87, and 59, respectively, for a total of 575 users. This equates to priors of 9.2%,
    13.9%, 51.5%, 15.1%, and 10.3%. Note that this prior is specific to the sample, not to
    Reddit as a whole.
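The priors are simply the group sizes divided by their sum. A minimal Python sketch of that arithmetic follows; the variable and function names are illustrative and not taken from the original analysis.

```python
# Manually labeled group sizes, as reported in the Labeling section.
label_counts = {
    "anarcho-capitalism": 53,
    "conservatism": 80,
    "liberalism": 296,
    "libertarianism": 87,
    "socialism": 59,
}


def class_priors(counts):
    """Return each group's share of the labeled sample."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


priors = class_priors(label_counts)
for name, p in priors.items():
    # Print each prior as a percentage with one decimal place.
    print(f"{name}: {p:.1%}")
```

The five counts sum to 575, and dividing by that total reproduces the percentages quoted in the text.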

    Description

    A total of 18,545 users were recorded, each with up to 200 comments from their comment
    history, subject to a time cutoff of 4 months before the date of collection. 116,810
    comments from 100 different Subreddits were used in the learning algorithm. 3,550 different
    threads were analyzed, with a total of 70,738 comments between users. Note that this value
    is the sum of all weighted edges between users.

    The n-grams used for the general algorithm were hand-selected, as topical clustering can
    occur in many other ways. For example, people from all ideologies might talk about the
    Ebola virus, ISIS's invasion of Iraq, or other current events that aren't politically
    lopsided. Also, note that the values for Subreddits were weighted logarithmically in order
    to give more weight to comments that are received more positively or negatively by the
    community. Additionally, the Subreddit variables were divided into two parts: positively
    scored comments on the Subreddit and negatively scored comments. This increases the number
    of Subreddit variables from 100 to 200.
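One way the split, log-weighted Subreddit variables could be built for a single user is sketched below. The exact weighting used in the study is not stated, so the form 1 + log(1 + |score|), the treatment of a zero score as negative, and all names here are assumptions made for illustration.

```python
import math
from collections import defaultdict


def subreddit_features(comments):
    """Build log-weighted Subreddit variables for one user.

    `comments` is an iterable of (subreddit, score) pairs. Each Subreddit
    yields two variables: one for positively scored comments and one for
    negatively scored ones. The weight 1 + log(1 + |score|) is an assumed
    form; the text only says that the weighting is logarithmic.
    """
    features = defaultdict(float)
    for subreddit, score in comments:
        # Scores of zero are grouped with negative ones (an arbitrary choice).
        sign = "pos" if score > 0 else "neg"
        features[(subreddit, sign)] += 1.0 + math.log1p(abs(score))
    return dict(features)


history = [("politics", 12), ("politics", -3), ("Libertarian", 1)]
features = subreddit_features(history)
```

Under this scheme a comment with a large absolute score contributes more than a comment near zero, but far less than linearly, which matches the stated intent of the weighting.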


    Method

    Semi-Supervised Classification

    Symbol              Meaning
    b                   the set of ideologies
    s                   the set of Subreddits
    t                   the set of n-grams
    b_k                 kth ideology
    s_i                 ith Subreddit
    t_i                 ith n-gram
    v_t                 the relative importance of terms versus Subreddits
    W_s                 the initial weight of manually labeled data versus the entire data set for Subreddits
    W_t                 the initial weight of manually labeled data versus the entire data set for n-grams
    d_s                 the exponential decay rate for the weight of manually labeled data for Subreddits
    d_t                 the exponential decay rate for the weight of manually labeled data for n-grams
    w_s                 the minimum weight of manually labeled data after applying decay for Subreddits
    w_t                 the minimum weight of manually labeled data after applying decay for n-grams
    \omega_s            the calculated weight of trained (manually labeled) data for Subreddits
    \omega_t            the calculated weight of trained (manually labeled) data for n-grams
    \alpha_s            the raw count added to each ideology's Subreddit count; a smoothing parameter
    \alpha_t            the raw count added to each ideology's n-gram count; a smoothing parameter
    \theta^s_{ik}       the probability of s_i appearing in b_k
    \theta^t_{ik}       the probability of t_i appearing in b_k
    u_i                 the ith user
    u^s_{ij}            the weighted value of s_j for u_i
    u^t_{ij}            the weighted value of t_j for u_i
    u^b_i               the ideology of u_i
    \tilde{u}^b_{ik}    the likelihood of b_k for u_i
    \hat{u}^b_{ik}      the probability of b_k for u_i
    q                   the iteration number of the EM algorithm, starting at 0
    N                   the total number of users
    N_s                 the total number of Subreddits
    N_t                 the total number of n-grams
    M                   the total number of manually labeled users
    u_m                 the set of manually labeled users


    Assumptions

    1. Each user will post a certain number of comments with no relation to their ideology.

    2. Each user will use a certain number of n-grams with no relation to their ideology.

    3. When a user who follows b_k posts a comment in a Subreddit, there is a \theta^s_{ik}
       probability that the user would post in Subreddit s_i.

    4. When a user who follows b_k uses a key n-gram in a comment, there is a \theta^t_{ik}
       probability that that particular n-gram is t_i.

    5. For any randomly selected user, there is a p_k chance that they belong to b_k, where p_k
       is the prior probability.

    Algorithm

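    For each iteration number q, we first estimate the weights of manually labeled data for
    Subreddits and n-grams, \omega_s and \omega_t:

    \[ \omega_s = \frac{N W_s \max(e^{-q d_s}, w_s)}{M} \]

    \[ \omega_t = \frac{N W_t \max(e^{-q d_t}, w_t)}{M} \]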

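    Normally algorithms run in supervised mode.

    Then we calculate the Subreddit and n-gram probabilities for each ideology, \theta^s_{jk}
    and \theta^t_{jk}, where \alpha_s and \alpha_t are the smoothing counts and I(\cdot) is the
    indicator function:

    \[ \theta^s_{jk} = \frac{\alpha_s + \sum_{u_i | u^b_i = b_k} (1 + \omega_s I(u_i \in u_m)) \, u^s_{ij}}
                            {\alpha_s N_s + \sum_{s_l \in s} \sum_{u_i | u^b_i = b_k} (1 + \omega_s I(u_i \in u_m)) \, u^s_{il}} \]

    \[ \theta^t_{jk} = \frac{\alpha_t + \sum_{u_i | u^b_i = b_k} (1 + \omega_t I(u_i \in u_m)) \, u^t_{ij}}
                            {\alpha_t N_t + \sum_{t_l \in t} \sum_{u_i | u^b_i = b_k} (1 + \omega_t I(u_i \in u_m)) \, u^t_{il}} \]

    Each denominator is the corresponding numerator summed over all Subreddits (or n-grams), so
    the probabilities for an ideology sum to one.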

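    Using these values, we then calculate the ideology membership likelihoods of different
    users:

    \[ \tilde{u}^b_{ik} = \prod_{s_j \in s} \left(\theta^s_{jk}\right)^{u^s_{ij}}
                          \left( \prod_{t_j \in t} \left(\theta^t_{jk}\right)^{u^t_{ij}} \right)^{v_t} \]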

    Then the probabilities are calculated by dividing each likelihood by the sum of likelihoods:

    \[ \hat{u}^b_{ik} = \frac{\tilde{u}^b_{ik}}{\sum_{b_l \in b} \tilde{u}^b_{il}} \]

    Then new memberships are assigned by selecting the ideology with the maximum probability,

    \[ u^b_i = b_{\arg\max_k \hat{u}^b_{ik}} \]

    This process is repeated several times (usually one to two dozen) for convergence. Note

    that logarithms are preferred for computational purposes, but the mathematics is the same.

    Additionally, class priors were not updated, as they usually are in generative models. Doing

    so favors the largest groups, which disproportionately increases error in other groups.
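The iteration described in this section can be sketched in plain Python as follows. This is a simplified illustration rather than the code used for the study: it works in log space, keeps manually labeled users fixed to their labels, estimates the first iteration from labeled users only, and scores classes by the likelihood alone (adding the log of a fixed class prior to each score would bring in the priors). All function names, argument names, and default hyperparameter values are hypothetical.

```python
import math


def label_weight(q, n_users, n_labeled, w_init, decay, w_min):
    """Extra weight given to manually labeled users at iteration q."""
    return n_users * w_init * max(math.exp(-q * decay), w_min) / n_labeled


def feature_probs(X, assign, labeled, n_classes, alpha, omega):
    """Smoothed per-class feature probabilities; each row sums to one."""
    n_feat = len(X[0])
    counts = [[alpha] * n_feat for _ in range(n_classes)]
    for i, row in enumerate(X):
        if assign[i] is None:
            continue  # user not yet assigned to any class
        weight = 1.0 + (omega if i in labeled else 0.0)
        for j, x in enumerate(row):
            counts[assign[i]][j] += weight * x
    return [[c / sum(row) for c in row] for row in counts]


def log_likelihood(row, probs_k):
    """Sum of value * log(probability) over the features of one user."""
    return sum(x * math.log(p) for x, p in zip(row, probs_k))


def classify(S, T, labels, n_classes, v_t=1.0, iters=20,
             W=(1.0, 1.0), d=(0.1, 0.1), w_min=(0.1, 0.1), alpha=(1.0, 1.0)):
    """S, T: per-user Subreddit and n-gram values; labels: {user index: class}.

    The smoothing counts in `alpha` must be positive so that no probability
    is zero when its logarithm is taken.
    """
    n = len(S)
    assign = [labels.get(i) for i in range(n)]  # None for unlabeled users
    for q in range(iters):
        om_s = label_weight(q, n, len(labels), W[0], d[0], w_min[0])
        om_t = label_weight(q, n, len(labels), W[1], d[1], w_min[1])
        th_s = feature_probs(S, assign, labels, n_classes, alpha[0], om_s)
        th_t = feature_probs(T, assign, labels, n_classes, alpha[1], om_t)
        for i in range(n):
            if i in labels:
                continue  # manually labeled users keep their labels
            scores = [log_likelihood(S[i], th_s[k])
                      + v_t * log_likelihood(T[i], th_t[k])
                      for k in range(n_classes)]
            assign[i] = max(range(n_classes), key=scores.__getitem__)
    return assign


# Toy data: users 0 and 1 are labeled; users 2 and 3 resemble them.
S = [[5, 0], [0, 5], [4, 1], [1, 4]]
T = [[3, 0], [0, 3], [2, 0], [0, 2]]
result = classify(S, T, {0: 0, 1: 1}, n_classes=2)
```

Because the normalizing constant is the same for every class, taking the maximum of the log-likelihood scores selects the same ideology as taking the maximum of the normalized probabilities.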

    Variable Selection

    Because variables were initially selected without much regard for their statistical
    usefulness, a simple and efficient method is used to reduce the number of Subreddit and
    n-gram variables: after the generative model algorithm is run, it is rerun without any
    variable that does not have a factor of at least 20 between the highest and lowest
    probabilities among the ideologies. This means that variables must be more discriminatory
    in order to be considered in the model. Other methods, such as using cross-validation to
    evaluate model criteria, are possible, but this cutoff is the most efficient considering
    the size of the dataset.
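The cutoff described above amounts to a simple ratio test on each variable's per-ideology probabilities. A minimal sketch, assuming the probabilities are stored as one list per ideology and are all positive (for example, because they are smoothed); the function and argument names are illustrative.

```python
def discriminating_variables(probs, min_ratio=20.0):
    """Return the indices of variables worth keeping.

    `probs[k][j]` is the probability of variable j under ideology k. A
    variable is kept when its highest probability across ideologies is at
    least `min_ratio` times its lowest.
    """
    n_vars = len(probs[0])
    keep = []
    for j in range(n_vars):
        column = [row[j] for row in probs]
        if max(column) / min(column) >= min_ratio:
            keep.append(j)
    return keep


# Two ideologies, three variables: only the first differs by a factor of 20.
example_probs = [
    [0.50, 0.30, 0.20],
    [0.02, 0.35, 0.63],
]
kept = discriminating_variables(example_probs)
```

The model would then be rerun using only the columns listed in `kept`.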

    Validation

    Because manually labeled data exists, the model may be validated via k-fold
    cross-validation.

    1. Choose a number k that represents the number of folds (i.e., iterations of validation)
       you want to use.

    2. Divide the manually labeled data into k partitions.

    3. For each partition, run the generative model without having data from that partition
       being labeled.

    4. Compare the results of the classifications for each partition with their true (manually
       labeled) classifications. Record the error rate and the impurity rate for each
       classification.


    Depending on the goal of classification, other loss criteria may be used to quantify the
    model's accuracy.
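The validation steps listed in this section can be sketched as follows. The `classify` argument is a stand-in for the generative model: any function that takes the labels that remain visible and returns a predicted ideology for every user. Names and the shuffling seed are illustrative assumptions.

```python
import random


def k_fold_partitions(labeled_ids, k, seed=0):
    """Shuffle the labeled user ids and split them into k near-equal folds."""
    ids = list(labeled_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]


def cross_validate(labels, k, classify):
    """Hide each fold's labels in turn and collect (true, predicted) pairs.

    `labels` maps user id -> manual label. `classify` receives the labels
    that are still visible and must return a dict of predictions that
    covers every held-out user.
    """
    pairs = []
    for fold in k_fold_partitions(labels, k):
        held_out = set(fold)
        visible = {u: b for u, b in labels.items() if u not in held_out}
        predicted = classify(visible)
        pairs.extend((labels[u], predicted[u]) for u in fold)
    return pairs


# Toy usage with a dummy classifier that always recovers the true label.
toy_labels = {i: i % 2 for i in range(10)}
pairs = cross_validate(toy_labels, 5, lambda visible: {u: u % 2 for u in range(10)})
```

The collected pairs can then be tabulated into a classification matrix, from which error and impurity rates follow.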

    Homophily Measures

    Note that if one of these variables appears without operating on a Subreddit, it refers to the

    entire dataset.

    Variable             Meaning
    f(s_k)               the frequency matrix of comments from users of one ideology to another in s_k
    f_{ij}(s_k)          the frequency of comment responses from someone of b_i to someone of b_j in s_k
    f_{\cdot j}(s_k)     the frequency of responses from all users to someone of b_j in s_k
    f_{i \cdot}(s_k)     the frequency of responses from someone of b_i to all users in s_k
    H(s_k)               the homophily of s_k
    E_{ij}(s_k)          the expected number of responses from someone of b_i to someone of b_j in s_k,
                         assuming that there is no correlation with respect to i and j
    ER_{ij}(s_k)         f_{ij}(s_k) / E_{ij}(s_k), defined as 1 when it would otherwise be 0/0
    S_{ij}(s_k)          the proportion of comment replies from someone of b_i to someone of b_j in s_k
                         that are positively scored

    Overall Homophily

    Since homophily is defined here in terms of class identification, a matrix of responses,
    f(s_k), for which the entry in the ith row and jth column is the number of comment replies
    from users of b_i to users of b_j in s_k, is used to measure the correlation. The
    homophily, H(s_k), is defined to be

    \[ H(s_k) = \frac{2 \operatorname{trace} f(s_k) - \sum f(s_k)}{\sum f(s_k)} \]

    where \sum f(s_k) is the sum of all entries of the matrix. A value of 1 means that all
    interactions are between people of the same ideology, and a value of -1 means all
    interactions are between users of different ideologies.
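As a check on the definition, the homophily can be computed directly from a reply matrix: twice the diagonal total minus the grand total, divided by the grand total. The sketch below uses the all-Subreddit cell counts shown in Figure 2; the function name is illustrative.

```python
def homophily(f):
    """H = (2 * trace(f) - sum(f)) / sum(f) for a square reply matrix f."""
    total = sum(sum(row) for row in f)
    if total == 0:
        raise ValueError("reply matrix has no interactions")
    trace = sum(f[i][i] for i in range(len(f)))
    return (2 * trace - total) / total


# Cell counts for all sampled Subreddits, as shown in Figure 2.
replies = [
    [1957, 123, 1762, 569, 670],
    [133, 1248, 3081, 478, 123],
    [1740, 3131, 34300, 3893, 2134],
    [562, 482, 3742, 1636, 192],
    [677, 113, 1970, 191, 4516],
]
overall = homophily(replies)
```

With these counts the value comes out at roughly 0.26, close to the overall homophily of 0.25 reported in the results; the small gap is plausibly due to the displayed cells being rounded averages.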

    Response Likelihood

    Another metric of interest is how likely someone is to respond to someone else in a
    Subreddit given both their ideologies, versus what you'd expect if they responded
    completely randomly. The matrix of expected values is the total number of responses
    multiplied by the two marginal proportions:

    \[ E_{ij}(s_k) = \sum f(s_k) \cdot \frac{f_{i \cdot}(s_k) \, f_{\cdot j}(s_k)}{\left( \sum f(s_k) \right)^2}
                   = \frac{f_{i \cdot}(s_k) \, f_{\cdot j}(s_k)}{\sum f(s_k)} \]

    Then the ratio is calculated,

    \[ ER_{ij}(s_k) = \frac{f_{ij}(s_k)}{E_{ij}(s_k)} \]

    Ratios that have large enough cell counts are then reported.
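A minimal sketch of the ratio calculation, assuming the expected count for a cell is its row total times its column total divided by the grand total (the usual independence assumption for a contingency table). The function name is illustrative, and the example matrix holds the all-Subreddit counts from Figure 2.

```python
def response_ratios(f):
    """Ratio of observed to expected replies for each pair of ideologies.

    The expected count assumes the two ideologies are paired independently:
    expected[i][j] = row_total[i] * col_total[j] / grand_total.
    A 0/0 ratio is defined as 1.
    """
    n = len(f)
    row_totals = [sum(row) for row in f]
    col_totals = [sum(f[i][j] for i in range(n)) for j in range(n)]
    total = sum(row_totals)
    if total == 0:
        return [[1.0] * n for _ in range(n)]
    ratios = []
    for i in range(n):
        out = []
        for j in range(n):
            expected = row_totals[i] * col_totals[j] / total
            if expected == 0:
                out.append(1.0)  # the observed count is necessarily 0 as well
            else:
                out.append(f[i][j] / expected)
        ratios.append(out)
    return ratios


# Cell counts for all sampled Subreddits, as shown in Figure 2.
replies = [
    [1957, 123, 1762, 569, 670],
    [133, 1248, 3081, 478, 123],
    [1740, 3131, 34300, 3893, 2134],
    [562, 482, 3742, 1636, 192],
    [677, 113, 1970, 191, 4516],
]
ratios = response_ratios(replies)
```

For the anarcho-capitalist to anarcho-capitalist cell this gives about 5.3, which agrees with the value displayed in Figure 4.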

    Response Positivity

    One final metric that is used, although not directly related to homophily, is the proportion

    of responses from users of bi to bj within sk that are positive. This metric is extremely

    straightforward to calculate, although it is mostly useful in analyzing specific Subreddits.

    The main purpose of this metric is to indicate hostility towards particular groups from

    another group, which is useful information when analyzing homophily.
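The positivity metric is a per-pair proportion, which could be computed as in the sketch below. Treating a score above zero as positive is an assumption, as are the function name and the input layout.

```python
from collections import defaultdict


def positive_share(replies):
    """Share of positively scored replies for each (replier, poster) pair.

    `replies` is an iterable of (replier_ideology, poster_ideology, score)
    triples. A reply counts as positive when its score is above zero,
    which is an assumed threshold.
    """
    positive = defaultdict(int)
    total = defaultdict(int)
    for replier, poster, score in replies:
        total[(replier, poster)] += 1
        if score > 0:
            positive[(replier, poster)] += 1
    return {pair: positive[pair] / n for pair, n in total.items()}


sample = [
    ("liberal", "conservative", 3),
    ("liberal", "conservative", -2),
    ("socialist", "liberal", 1),
]
shares = positive_share(sample)
```

In practice, pairs with very few replies would be dropped before reporting, since a proportion based on a handful of comments is not informative.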

    Results

    Semi-Supervised Classification

    For variable selection, 144 of the original 200 Subreddit variables and 154 of the original
    330 n-grams were used in the model.

    For the semi-supervised classification, 5-fold cross-validation was used to determine the
    error rate and the impurity of each group. The error rate for each group is the proportion
    of members of that group that were not classified into the proper group. The impurity rate
    for each group is the percentage of users classified into a group that do not belong to
    that group. In this experiment, the primary goal was to minimize impurity, as it helps
    ensure that the subjects used for inference are most likely to have accurate labels.

                Anarcho-Capitalism   Conservatism   Liberalism   Libertarianism   Socialism
    Error             32.8%             61.3%          11.5%         48.1%          39.2%
    Impurity          26.7%             34.0%          29.0%         33.8%          21.7%

    While these results are significantly better than random guesses, there is still a
    considerable amount of error.

    With rows representing actual (manually labeled) classes and columns representing the
    classes into which users were sorted, this is the classification matrix from the
    cross-validation:

                          Anarcho-Capitalism   Conservatism   Liberalism   Libertarianism   Socialism
    Anarcho-Capitalism            33                 0            12              4             4
    Conservatism                   0                31            43              5             1
    Liberalism                     7                10           262             13             4
    Libertarianism                 3                 6            32             45             1
    Socialism                      2                 0            20              1            36
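Error and impurity follow from the classification matrix by dividing each diagonal entry by its row total and its column total, respectively. The sketch below applies those definitions to the pooled matrix; figures computed this way need not match the reported rates exactly if those were averaged fold by fold. The names are illustrative.

```python
GROUPS = ["Anarcho-Capitalism", "Conservatism", "Liberalism",
          "Libertarianism", "Socialism"]

# Rows are manual labels; columns are the classes assigned by the model.
confusion = [
    [33, 0, 12, 4, 4],
    [0, 31, 43, 5, 1],
    [7, 10, 262, 13, 4],
    [3, 6, 32, 45, 1],
    [2, 0, 20, 1, 36],
]


def error_and_impurity(matrix):
    """Return (error, impurity) lists, one entry per group.

    Error is the share of a group's true members assigned elsewhere (row
    based); impurity is the share of users assigned to a group who do not
    belong to it (column based).
    """
    n = len(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    error = [1 - matrix[k][k] / row_totals[k] for k in range(n)]
    impurity = [1 - matrix[k][k] / col_totals[k] for k in range(n)]
    return error, impurity


error, impurity = error_and_impurity(confusion)
for name, e, p in zip(GROUPS, error, impurity):
    print(f"{name}: error {e:.1%}, impurity {p:.1%}")
```

The row totals of this matrix are 53, 80, 296, 87, and 59, matching the manually labeled group sizes.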

    [Figure: "Ideology Representation from Top 10 Hub and Authority-Ranked Users in Each
    Subreddit". Two panels, Authority and Hub, each with ticks 1 to 10 on the horizontal axis;
    vertical axis: Subreddit; color: Ideology (Anarcho-Capitalism, Conservatism, Liberalism,
    Libertarianism, Socialism). Subreddits shown: AmericanPolitics, Anarchism,
    Anarcho_Capitalism, Bad_Cop_No_Donut, BasicIncome, changemyview, Conservative, conspiracy,
    DebateAnarchism, DebateFascism, Economics, EnoughLibertarianSpam, gunpolitics, Libertarian,
    MensRights, moderatepolitics, NeutralPolitics, occupywallstreet, PoliticalDiscussion,
    politics, privacy, progressive, socialism, worldpolitics.]

    Figure 1: The ideologies of the most important users in various Subreddits. Hub rankings
    and authority rankings derived from the HITS algorithm are used here. Users with high hub
    rankings tend to respond to more users and to more important users, and users with high
    authority rankings tend to be responded to by more users and by more important users.
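For readers unfamiliar with HITS, hub and authority scores can be obtained by a short power iteration over the reply graph. The sketch below is a generic, unweighted version and is not the implementation used to produce Figure 1; treating a reply as a link from the replier to the user being replied to is an assumption, as are all names.

```python
def hits(edges, iterations=50):
    """Hub and authority scores by power iteration.

    `edges` is a list of (replier, original_poster) pairs; each reply is
    treated as a link from the replier to the user being replied to.
    """
    nodes = {u for edge in edges for u in edge}
    hub = {u: 1.0 for u in nodes}
    auth = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        # Authority: sum of the hub scores of users who reply to this user.
        auth = {u: 0.0 for u in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        norm = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        auth = {u: a / norm for u, a in auth.items()}
        # Hub: sum of the authority scores of users this user replies to.
        new_hub = {u: 0.0 for u in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        norm = sum(h * h for h in new_hub.values()) ** 0.5 or 1.0
        hub = {u: h / norm for u, h in new_hub.items()}
    return hub, auth


# Toy reply graph: "a" replies to "c" and "d"; "b" replies to "c".
hub, auth = hits([("a", "c"), ("b", "c"), ("a", "d")])
```

In this toy graph, "c" receives the highest authority score and "a" the highest hub score, matching the intuition in the caption.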


    Homophily Measures

    The following results average the cross-validated classifications together with the
    classifications obtained when all labeled data is used. Additionally, results were averaged
    over classification-calculated ideologies (i.e., values of 1 or 0) and
    probability-calculated ideologies (derived in the model).

    The overall homophily for all Subreddits combined was 0.25.

    Frequency of Replies in Sampled Subreddits by Ideology of Replier and Original Poster
    (rows: original poster's ideology; columns: replier's ideology; cell shading in the
    original figure: number of comments)

                          Anarcho-Capitalism   Conservatism   Liberalism   Libertarianism   Socialism    Total
    Anarcho-Capitalism          1,957               123          1,762           569            670       5,080
    Conservatism                  133             1,248          3,081           478            123       5,062
    Liberalism                  1,740             3,131         34,300         3,893          2,134      45,197
    Libertarianism                562               482          3,742         1,636            192       6,614
    Socialism                     677               113          1,970           191          4,516       7,467
    Total                       5,068             5,096         44,855         6,767          7,634      69,421

    Displayed values are averaged over the different methods of predicting ideology.

    Figure 2: The frequencies of various response types across all Subreddits. Note that the
    number of comments here differs from the number in the summary statistics, as some comments
    are mediated by unclassified users due to various forms of censoring on the website (e.g.,
    banning, comment removal).


    Figure 3: The homophilies of various Subreddits, ordered from highest to lowest.


    Ratio of Actual Responses to Expected Responses on Sampled Subreddits
    (rows: original poster's ideology; columns: replier's ideology; values: actual replies /
    expected replies; color in the original figure is based on a log scale)

                          Anarcho-Capitalism   Conservatism   Liberalism   Libertarianism   Socialism
    Anarcho-Capitalism           5.3                0.3           0.5            1.1           1.2
    Conservatism                 0.4                3.4           0.9            1             0.2
    Liberalism                   0.5                0.9           1.2            0.9           0.4
    Libertarianism               1.2                1             0.9            2.5           0.3
    Socialism                    1.2                0.2           0.4            0.3           5.5

    Figure 4: The ratios of actual responses to expected responses for the entire network, by

    ideologies of replier and original poster.


    Percentage of Positively-Scored Responses by Ideology of Responder and Original Poster
    (rows: original poster's ideology; columns: replier's ideology; values: % positive)

                          Anarcho-Capitalism   Conservatism   Liberalism   Libertarianism   Socialism
    Anarcho-Capitalism          89.5%              80.9%         86.3%          87.9%         84.5%
    Conservatism                83.7%              82.4%         80.9%          84.9%         78.2%
    Liberalism                  83.5%              76.6%         82.5%          80.3%         84.1%
    Libertarianism              87.1%              84.0%         80.3%          83.7%         80.4%
    Socialism                   82.8%              78.6%         86.3%          79.8%         87.3%

    Figure 5: The percentage of responses scored positively, by ideologies of the replier and
    original poster. Lower values mean that comment replies receive negative feedback from the
    community they were posted in, and likely from the original poster as well.

    Some specific results for Subreddits that will be used as examples in the conclusion:


    Frequency of Replies in /r/Anarcho_Capitalism by Ideology of Replier and Original Poster
    (rows: original poster's ideology; columns: replier's ideology; cell shading in the
    original figure: number of comments)

                          Anarcho-Capitalism   Conservatism   Liberalism   Libertarianism   Socialism    Total
    Anarcho-Capitalism          1,766                 5            324           221            306       2,623
    Conservatism                   18                 0              4             1              1          24
    Liberalism                    323                 2             69            48             55         496
    Libertarianism                212                 0             46            32             37         328
    Socialism                     320                 1             56            38             69         485
    Total                       2,639                 8            499           340            468       3,955

    Frequency of Replies in /r/Libertarian by Ideology of Replier and Original Poster
    (same layout)

                          Anarcho-Capitalism   Conservatism   Liberalism   Libertarianism   Socialism    Total
    Anarcho-Capitalism             60                22            235           221             13         552
    Conservatism                   23                12            108            92              7         243
    Liberalism                    238                92            912           799             59       2,100
    Libertarianism                228                91            745           814             39       1,917
    Socialism                      12                 5             49            40              4         110
    Total                         561               222          2,050         1,965            122       4,922

    Displayed values are averaged over the different methods of predicting ideology.

    Figure 6: Frequencies of /r/Anarcho_Capitalism and /r/Libertarian comment replies by
    ideology of replier and original poster.


    Percentage of Comment Reply Scores that are Positive for /r/politics
    (rows: original poster's ideology; columns: replier's ideology; values: % positive replies)

                          Anarcho-Capitalism   Conservatism   Liberalism   Libertarianism   Socialism
    Anarcho-Capitalism           86%                66%           81%            87%           94%
    Conservatism                 75%                78%           78%            76%           84%
    Liberalism                   75%                64%           78%            72%           84%
    Libertarianism               83%                76%           81%            82%           91%
    Socialism                    82%                40%           70%            64%           88%

    Blank cells have too few entries to display.

    Figure 7: Percentage of comments with positive scores by ideologies of replier and original

    poster in /r/politics.

    Conclusions

    Classification

    The semi-supervised classification algorithm was fairly effective at classifying liberals
    correctly, but across the board, the error rate was somewhat high, even for a
    classification problem with 5 different groups. There is not too much to say about the
    algorithm without using a competing algorithm and comparing results. The primary purpose of
    classification is to have the labels as opposed to testing the algorithm.

    Homophily

    Unsurprisingly, the majority of interactions on the sampled Subreddits are facilitated by
    liberals. This effect can be seen in that a majority of Subreddits have high homophilies.
    Two notable exceptions are /r/Anarcho_Capitalism and /r/Libertarian. The frequency tables
    for these specific Subreddits show what is happening. In the case of /r/Anarcho_Capitalism,
    anarcho-capitalists tend to respond to many people of differing ideologies, which lowers
    the homophily significantly. In the case of /r/Libertarian, the majority of users (not
    shown in the figures) are libertarians, but most of the interactions are facilitated by
    liberals. This means that there are many liberals responding to libertarians and vice
    versa, which makes /r/Libertarian a heterophilic Subreddit.

    While the sampling procedures introduce bias into statistics generalized to the whole
    website, users of each ideology demonstrated a level of preferential self-attachment, as is
    seen in Figure 4. What is most notable, though, is the tendency for anarcho-capitalists and
    socialists to communicate much less frequently with those of other ideologies. There is an
    obvious exception, where anarcho-capitalists are slightly more likely to respond to
    libertarians and socialists and vice versa, but with the majority of the website consisting
    of liberals, clustering is obvious.

    Finally, looking at the tendency for comments to have a positive score, it is apparent that
    there is some hostility towards conservatives from liberals, but this effect is averaged
    over the entire sample. In the main politics Subreddit, /r/politics, shown in Figure 7, the
    story becomes much clearer: conservatives are not very welcome among socialists, liberals,
    or anarcho-capitalists. Looking at these results by Subreddit reduces biases caused by
    sampling and allows for a much more precise understanding of how interactions take place in
    a community.

    As for the practicality of these results, individuals can use them to decide which sections
    of the website are more hospitable to discussion given their ideology, and content-makers
    (e.g., blog writers) can use the general census results of the Subreddits to strategically
    post content targeting the appropriate audience.

    Limitations and Future Work

    There are many limitations of this analysis, the largest being the error introduced by the
    learning algorithm. People often post little information about their beliefs, so it is very
    difficult to classify them. Additionally, even when the ideology is more obvious to a
    human, machine learning algorithms become very complicated when a high level of accuracy is
    desired.

    One possible solution in future experiments is to shift the model to use natural language
    processing to establish variables, rather than rely purely on n-grams, which are much
    noisier in the context of comments on social media. Additionally, a model with
    hyperparameters for each class, such as those implemented in CATHYHIN in Wang et al., would
    allow for more flexibility in discriminating between ideologies. Also, akin to the topical
    hierarchies that are created in CATHYHIN, a more advanced model could find subclusters
    within ideologies so that top-level groups do not have to be as homogeneous with respect to
    their parameters.

    There is also the question of how dissimilar users are in their beliefs, regardless of
    ideology. If two users have one political viewpoint in common and discuss it with each
    other, should that be considered heterophilic? Coincidentally, that type of process is
    described by Damon Centola et al. in Homophily, Cultural Drift, and the Co-Evolution of
    Cultural Groups as one that can actually bridge groups together. However, trying to extend
    this model to a network evolution model would be very difficult given the Reddit API
    limitations and the vast amounts of data that would need to be collected.

    Lastly, while the sample size was sufficient for the analysis that was done, a larger
    sample size would enable more complex algorithms to be used, since the number of parameters
    is very high for such models. One major difficulty is the time-consuming task of manually
    labeling users, but perhaps the nature of semi-supervised learning would not necessitate an
    impractically large training size.

    References

    1. Bisgin, Halil; Agarwal, Nitin; and Xu, Xiaowei. Investigating Homophily in Online
       Social Networks.
       http://www.researchgate.net/publication/221158453_Investigating_Homophily_in_Online_Social_Networks/links/0fcfd5089ae5348b1c000000

    2. Centola, Damon et al. Homophily, Cultural Drift, and the Co-Evolution of Cultural
       Groups. http://nsr.asc.upenn.edu/files/Centola-et-al-2007-JCR.pdf

    3. Wickham, Hadley. ggplot2: Elegant Graphics for Data Analysis. Springer New York, 2009.

    4. Wickham, Hadley (2014). scales: Scale functions for graphics. R package version 0.2.4.
       http://CRAN.R-project.org/package=scales

    5. Nigam, Kamal; McCallum, Andrew; and Mitchell, Tom M. Semi-Supervised Text
       Classification Using EM. http://people.cs.umass.edu/~mccallum/papers/semisup-em.pdf

    6. R Core Team (2014). R: A language and environment for statistical computing. R
       Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/

    7. Wang et al. Constructing Topical Hierarchies in Heterogeneous Information Networks.
       http://web.engr.illinois.edu/~hanj/pdf/icdm13_cwang.pdf


    Appendix

    Figure 8: Visualization of comment replies in sample. Gold = anarcho-capitalist, Red =

    conservative, Blue = liberal, Green = libertarian, Pink = socialist, Black = unlabeled.

    Layout created using Yifan-Hu algorithm in Gephi.
