Upload
others
View
12
Download
0
Embed Size (px)
Citation preview
Phrasing and its Ideological Implications:
Using Natural Language Processing to Measure Political Affiliation
Alexander Williamson
June 6th, 2013
Abstract
I explore the utility of natural language processing in estimating
political bias in the media. I propose two methods for estimation.
Building on prior research, the first method involves two sets of OLS
regressions. The second method uses a Support Vector Regression, a
tool borrowed from the rapidly growing field of machine learning.
Both models are applied to the topic of political polarization amongst
US daily newspapers to illustrate the value of these and similar
methodologies.
2
Acknowledgements
I would like to very sincerely thank Professor Lynne
Kiesling for her guidance and encouragement. I would
also like to thank Jonathan Friedman for his endless
advice on technology and implementation. Finally, I
would like to thank my mother, father, and sister for their
relentless support over the last four years.
3
Table of Contents
1 INTRODUCTION 4 2 LITERATURE REVIEW 5 THE INFLUENCE OF THE NEWS INDUSTRY 5 2.1 ESTIMATIONS OF POLITICAL BIAS 6 2.2 POLITICAL POLARIZATION 9 2.3 MY CONTRIBUTION 10 2.4
3 ESTIMATING IDEOLOGY 11 INTEREST GROUP SCORES 11 3.1 WHAT IS POLITICAL BIAS? 13 3.2 NOMINATE SCORES 14 3.3 LANGUAGE AS A LINK 15 3.4 BIAS IN CONTEXT 17 3.5
4 MODELING IDEOLOGY AND POLARIZATION 18 THE OLS ALGORITHM 18 4.1 SUPPORT VECTOR REGRESSION 22 4.2 MODELING POLARIZATION 23 4.3
5 DATA COLLECTION 24 6 ANALYSIS 26 RESULTS 26 6.1 DISCUSSION 26 6.2
7 CONCLUSION 29 BIBLIOGRAPHY 34 APPENDIX A: PROCEDURAL DETAILS 37 APPENDIX B: WORDS REMOVED FROM ANALYSIS 39
4
1 Introduction
On March 13, 2013, two newspapers ran articles reporting the verdict in
the trial of Kermit B. Gosnell, titled “Murder: Gosnell guilty verdict hailed on
both sides of abortion debate” and “Abortion doctor Kermit Gosnell convicted of
murder in deaths of three infants”, respectively. If a reader were to observe
the titles chosen by these publications, they would begin to speculate about the
political inclinations of the two organizations, and after reading the first fifty
words in each article, the reader will observe that phrases such as “pro-life
advocates”, “abortionist”, and “abortion industry” are unique to the first
article’s introduction. By the time readers have reached the end of both
articles they will have observed a systematic difference in phrasing.
The first article is from The Washington Times, and the second is from
The Washington Post, two papers commonly believed to be slanted to the right
and left, respectively. Anecdotal demonstrations such as these tend to pervade
conversations about bias in the media, despite susceptibility to bias from both
those making the argument and those observing it. Furthermore, not every
purported incidence of bias, or ‘slant’, will provide such a clear dichotomy, and
common questions in the discussion of current US politics and its relation to
the media are often more subtle. To this end, political scientists and
economists have proposed several successful methods of measuring media bias
5
that fulfill specific research objectives. Developing a sufficiently generalizable
measure of bias, however, has proved to be difficult.
In this thesis, I explore the capacity of natural language as a gauge of
political ideology. After reviewing prior research, I describe the inherent
difficulties of defining ‘liberal’ and ‘conservative’. Then building on the work of
Gentzkow & Shapiro (2010), I consider two alternative estimates of political
bias that compare media outlets to U.S. Congress Members. In the first
approach, I use the OLS estimator developed by Gentzkow & Shapiro with a
few important modifications, including the use of Common-Space NOMINATE
scores (Poole & Rosenthal, 1997) as a proxy for legislator ideology. In the
second model, I train a support vector regression on the set of legislators then
use the trained model to estimate bias in U.S. daily newspapers. To explore
the utility of these procedures I apply each to the topic of political polarization.
I calculate media bias for 14 daily newspapers and assess how it has changed
from 1994 to 2012.
2 Literature Review
The Influence of the News Industry 2.1
Fully appreciating the significance of political bias in the media begins
with an understanding of the crucial role news sources play in democratic
societies, and the potential issues raised by the swift changes that are
occurring in the media industry. The advent of the Internet let media outlets
6
tailor to a smaller group of readers, allowing many people to find an ideological
‘niche’ in which to obtain news (Gaskins & Jerit, 2012). The presence and
ideological catering of numerous new alternative media outlets led to declines
in both advertising revenue and circulation for newspapers around the world
(Kaye & Quinn, 2010). Researchers now refer to the current state of the
Newspaper Industry as a crisis without hesitation (Boczkowski & Siles, 2012).
Because a vital news industry is widely considered to be an essential part of a
healthy democratic society (Gaskins & Jerit, 2012; Gentzkow & Shapiro, 2010;
Boczkowski & Siles, 2012), there is growing concern over the future of the
media industry and whether the various media replacing newspapers will
sufficiently serve the same important functions (Boczkowski & Siles, 2012).
What is more concerning is the evidence that these structural changes are also
altering the nature of content being produced by news organizations, especially
when viewed in the light of media’s influence on political engagement (Vigna &
Kaplan, 2007).
Estimations of Political Bias 2.2
The issue of political bias in the media is not new: Gentzkow, Shapiro, &
Sinkinson (2011) construct a model of newspaper entry and exit in which they
analyze the political endorsements of newspapers around the turn of the 20th
century. Endorsement data, however, are often sparse and can be explanatory
only to a point. In the case of presidential endorsements, it is plausible that
two newspapers with the same endorsement record could have very different
7
ideologies but agree on the choice of candidate every four years. Fortunately
for researchers, the same technological changes that adversely affect the
newspaper industry are creating new ways to analyze media. This new
empirical capacity parallels important and innovative work on media bias over
the last two decades (Groseclose & Milyo, A Measure of Media Bias, 2005;
Gentzkow & Shapiro 2010; Lott & Hassett, 2004; Puglisi, 2011; Ho & Quinn
2008). These authors create distance between themselves and subjectivity by
identifying less explicit ways in which media outlets reveal political affiliation.
Perhaps unsurprisingly, such research is frequently in alignment with
common perception. For example, the widely held belief that major news
outlets favor democrats (Watts, Domke, Shah, & David, 1999) has been
confirmed to various degrees by multiple authors. Puglisi (2011) examines
every article from The New York Times written between 1946 and 1997 and
concludes that the newspaper gives more emphasis to subject areas in which
the public perceives the Democratic Party to be more competent. Lott &
Hassett (2004) find evidence that newspapers give less positive coverage to
releases of economic indicators, though this claim was revisited by Larcinese,
Puglisi, & Snyder (2011), who conclude that newspapers with histories of
endorsing Democratic Presidents tend to give less coverage to negative
unemployment news if the current President is a Democrat, a significantly
weaker conclusion. They find no evidence of this occurring on other topics,
such as the trade deficit, the budget deficit, or inflation.
8
Each of the cited authors provides plausible evidence of the presence of
political bias in various news organizations, but do not attempt to generalize
their results for comparison across media outlets. Moreover, such an attempt
would need to address inherent problems in the respective estimations.
Larcinese, Puglisi, & Snyder rely on endorsement data, and Lott & Hasset
only consider bias as manifested in the framing of economic news. Puglisi only
measures bias in one newspaper.
Groseclose & Milyo (2005) proposed a general method for estimating
media bias that proved to be influential. The authors created a mapping from
media outlets to the space of Congress members using adjusted ADA scores
(Groseclose, Levitt, & Snyder, Jr., 1999). This involved counting how many
times a newspaper cites various think tanks and interest groups and
comparing the count to the number of times that Congress Members cite the
same groups, ideally providing a metric by which to compare media outlets to
each other and to legislators. These estimates, however, are inconsistent over
time and in many instances are not robust to the removal of a individual think
tank (Gasper, 2011 ). Even aside from the problematic nature of the estimates
for media outlets, the general method of calculating interest group scores faces
significant theoretical obstacles.1
1 See Section 3 for details
9
In order to estimate the bias of a general newspaper, Gentzkow &
Shapiro (2010) create an algorithm to quantify political bias of U.S. daily
newspapers by analyzing the language each paper uses. After identifying a set
of 1,000 phrases that are favored by one of the two major parties, the authors
assign an ideological measure to each phrase by regressing the relative
frequency of usage on the legislative ideologies. In the next step, the authors
search newspapers and record the frequency of phrase usage then perform a
second regression of Newspaper phrase usage on phrase ideology. The slope
from the second regression is the estimated political bias of the newspaper.
The model is correlated with survey data on media bias2, indicating the value
of further analysis. To this end, exploring the capacity of natural language to
measure political bias forms a basis for my research in this thesis. A detailed
comparison of Gentzkow and Shapiro’s model and my own is given in Section
4.
Political Polarization 2.3
The current political environment in the US augments the academic
value of accurately estimating media bias due to the stronger reliance on party
affiliation by political participants. Several researchers have suggested that
political participants are becoming more polarized in the sense that a stronger
2 The authors find a 0.40 correlation coefficient with data from media directory Mondo Times. The authors exclude opinion articles from their analysis and Mondo Times does not, so the correlation is not a perfect indicator of similarity.
10
dichotomy exists between political parties. Using the NOMINATE process,
Poole & Rosenthal (1997) illustrate that before 1990 several Republicans were
more liberal than the most conservative democrats, but in recent years there
has been a strict separation. Similarly, the percentage of roll-call votes in
which a majority of republicans oppose a majority of democrats has risen in
the last 20 years (Sinclair, 2000).
As for political polarization in the electorate, the extent to which
compelling evidence exists depends both on how polarization is defined and on
how evidence is interpreted (Prior, 2013). For example, party identification is
more correlated with voting outcomes in the current electorate than it was
prior to the 1970’s (Bartels, 2000), but this may solely be a result of more
cohesive and ideologically distant parties (Fiorina & Abrams, 2008). The
media’s role in political polarization is also unclear: despite partisanship in
particular news sources, especially new media outlets (Gaskins & Jerit, 2012),
there is no evidence that longstanding outlets have become more partisan over
time (Prior, 2013).
My Contribution 2.4
My research focuses on optimizing the procedure proposed by Gentzkow
& Shapiro to contribute to the literature on estimating media bias. This thesis
also contributes to the literature surrounding political polarization and is the
first attempt to measure political polarization quantitatively in long standing
11
media outlets. In a more general sense, this thesis serves as an illustration of
the value of technological innovation to researchers and as an exploration of
machine learning applications to traditional empirical problems
3 Estimating Ideology
Interest Group Scores 3.1
If a quantitative measure of media bias is estimated via a map into the
space of Congress members, the Congressional metric must be meaningful for
that measure to be useful. The fact that interest group scores are inherently
problematic (Shor & McCarty, 2011) undermines the validity of Groseclose’s &
Milyo’s model using Congressional mapping. Even in the case of comparing
legislators, the ADA scores face theoretical barriers (McCarty, 2010). One
reason is that any given think tank’s position may signal different ideologies in
different legislatures. To see this, consider the general procedure by which the
scores are constructed:
• Given a policy group and legislative body, isolate the votes in the legislative body that are relevant to the policy group.
• For each vote, determine whether the policy group prefers a positive or negative outcome.
• For each legislator, determine the percentage of time his or her vote matches the policy group’s preference. This is the legislators ADA score.
In some scenarios this will yield a meaningful one-dimensional ordering of
political entities consistent with other measures and intuition (Groseclose &
Milyo, A Measure of Media Bias, 2005).
12
If such an ordering juxtaposes multiple legislatures with no common
voting record, however, the results can be misleading. Consider a simple
example of interest group scores in which Legislator A, who is considered very
liberal, and Legislator B, who is considered very conservative, vote in different
legislatures, different years, or both. Suppose Think Tank X is a group whose
only mission is to promote policies that favor more relaxed speeding laws for
automobiles. Legislator A is faced with a series of votes in which he is asked
to decide between allocating money to wider, safer roads or to more
environmentally-friendly government buildings. In all of the votes in which he
participates, he chooses to allocate money to government building renovations.
Legislator B is faced with a series of votes in which he is asked to decide
between reducing the budget for either traffic cops or a local laboratory that
conducts stem-cell related research. In all of the votes in which he
participates, he votes to cut funding for the laboratory. As a result, Legislator
A and Legislator B receive the same interest group score, having voted many
times against the wishes of Think Tank X, despite having very different
political views.
Additionally, the interest groups’ preferred point on the 1-dimentional
scale cannot lie on the interior of the distribution of legislative points. That is,
the scores are calculated with the interest group or think tank at one extreme
of the possible scores with the underlying assumption that the organization is
either the most liberal or the most conservative in the distribution (McCarty,
13
2010). If the interest group is politically moderate, then moderate legislators
will receive a very high score while conservative and liberal legislators will
receive similarly low scores. The resulting estimations may be useful as a
political thermometer (Poole & Rosenthal, 1997, p. 169), but will not be useful
in gauging ideological distance between candidates.
What is political bias? 3.2
In addition to the issues mentioned above, most measures of ideology share
the same difficult task of defining bias and also defining what it means to be
‘liberal’ or ‘conservative’. Within the context of US politics, the two terms
seem to derive their meaning exclusively from rough perceptions of a
dichotomy existing solely in legislative bodies. The question of whether an
entity is conservatively biased or liberally biased can have different meanings
when applied to different entities, so caution must be exercised when
attempting to provide an answer. The NOMINATE scores (Poole & Rosenthal,
1997) discussed in the following section estimate 2-dimensional coordinates for
legislators and the issues on which they vote, with the transparent goal of
predicting whether a Congress Person will vote ‘yay’ or ‘nay’ on a given issue,
but it comes at the loss of a clear rhetorical explanation of the coordinates and
does not always yield an obvious method for comparison across different
legislatures, let alone various other political agents.
14
NOMINATE scores 3.3
Poole & Rosenthal (1997) constructed a process for estimating the
ideology of US Congress Members referred to as NOMINATE (Nominal Three-
Step Estimation) process. Rather than trying to measure a priori definitions of
‘liberal’ and ‘conservative’, the authors estimate coordinates in n-dimensional
Euclidean space, or ‘ideal points’, for each legislator based on a probabilistic
model of roll call voting. The authors assume that each legislator’s utility for a
‘yay’ or ‘nay’ is given in part by his or her distance from the ideological
coordinates of the vote outcomes and in part by a stochastic error term with a
weight on the deterministic part of the function common to all legislators. By
assuming that the error terms in the model are drawn from a logit
distribution, they are able to estimate a log-likelihood function for the entire
set of outcomes for a given session in Congress. Using a gradient descent
algorithm, they estimate the set of legislator points, the set vote-outcome
points, and the weight factor, one set at a time and reiterating through until
the estimates have converged.
This procedure circumvents deciding the number of appropriate
dimensions for ideological: the authors simply add a dimension until they
reach a point such that an additional dimension does not increase the fit of the
15
data in a meaningful way. They find that the fit is not appreciably increased
beyond 2-dimensions.3
This procedure has had a tremendous influence on ideological
estimation and has inspired several similar studies into other legislative
bodies in the US (Bailey & Chang, 2001; Bailey, Kamoie, & Maltzman, 2005)
and in other nations (Hix, Noury, & Roland, 2006; Morgenstern, 2004). By
observing legislators who have moved from one legislative body to another,
researchers are also able to use these scores for comparison across legislatures.
Note that any measure of political bias incorporating these estimates as
proxies for ideology will implicitly be estimating the probability with which
entities would vote ‘yay’ on a typical vote from the underlying Congress.
Because I use these common scores in part of my analysis, I will discuss how to
interpret their use in the next section.
Language as a Link 3.4
This thesis explores the idea that the language newspapers use can hint
at political ideology. The same observations can be made with respect to
members of Congress. Veritable proof of this claim was found in 2005, when a
memo from a Republican strategist leaked to the public, revealing the
proactive efforts to get Republicans to say ‘private accounts’ rather than
3 In the case of D-NOMINATE scores, the first dimension is sufficient to describe the variation in most years. The common space coordinates, however, are constructed so that each of the two dimensions is equally salient.
16
‘personal accounts’ (Luntz, 2005). Another examples include Republican usage
of ‘death tax’ rather than ‘estate tax’ (Gentzkow & Shapiro, 2010), ‘spending
programs’ vs. ‘assistance programs’, or ‘government bureaucrats’ vs. ‘civil
servants’. Additionally, there are phrases that refer to topics altogether
avoided by one party or the other, and a wide range of phrases that are
somewhere in between these two categories.4
The relationship between ideology and phrase usage might be extremely
noisy, but if enough speech is observed it may be possible to associate patterns
among political agents. Using an OLS based algorithm, Gentzkow & Shapiro
(2010) explore this possibility by mapping U.S. daily newspapers into to the
space of Congressional ideology using similarities in the usage of certain
ideologically revealing phrases. I make several modifications to their model
and use it as the first of two procedures through which I estimate bias.
Because the data is noisy, however, the resulting estimates could be
unnecessarily imprecise. Figure 1 shows the relative frequency5 with which
Congress members of varying ideologies were on the record saying
“Washington Times”.6 To address the potential inadequacy of a simple linear
4 See Error! Reference source not found.
5 Relative frequency of a phrase is defined as the number of times a legislator uses that phrase divided by the number of times they use any phrase included in the analysis.
6 Incidentally, I found “Washington Times” to be strongly correlated with ideology, but I did not include it in the analysis because it would skew the estimate of bias for The Washington Times. See Section 4 for details on the measurement process.
17
regression, I also use a Support Vector Regression, a machine learning
technique that performs well in problems of pattern recognition (Drucker,
Burges, Kaufman, Smola, & Vapnik, 1997).
Bias in Context 3.5
In the light of the discussion from the previous section, it is important to
define what is meant by bias to facilitate a meaningful interpretation of
estimates. In this circumstance, I measure bias as the tendency to endorse,
explicitly or subtly, policies in a specific region of the NOMINATE space. If
language is a strong proxy for voting patterns, then this measure will have a
useful interpretation as a media organization’s propensity for pushing the
electorate towards a particular set of legislative outcomes; if the associations
generated by language analysis are coincidental, however, and do not
accurately reflect what the unobservable voting patterns of a given newspaper
0.001
0.021
0.041
0.061
-1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1
!st
Dim
ensi
on
of
Co
mm
on
-Sp
ace
So
cre
Liberal <-> Conservative
Relative frequency of phrase "Washington Times" in 1994 Congressional Record
Figure 1. Each ‘x’ in the above graph represents a legislator. As illustrated by the
fitted line, a simple linear regression may not yield imprecise estimates.
18
would be if it were a legislator, then the measure might still be valuable as an
alternative to the definition of ideology used in NOMINATE calculations.7
4 Modeling Ideology and Polarization
The OLS Algorithm 4.1
To test the hypothesis that language patterns could be used to estimate
bias, Gentzkow & Shapiro (2010) (henceforth GS) design an algorithm that
implements two series of OLS regressions. In designing my model I make
several important modifications, but the two algorithms share the same
general framework:
• Determine the set of phrases that were used in Congress during the time interval of interest and the frequency with which each Congress Member used each phrase
• Determine the phrases that are used disproportionately by members with a certain political affiliation
• Determine the frequency with which news providers use the phrases selected in the previous step
• Create a measure of ideology using phrase frequencies to link news providers to congressmen
Specifically, the GS model begins by testing for every two-word and
three-word phrase found in the 2005 Congressional Record the null hypothesis
that the propensity to use a given phrase is the equal for Republicans and
7 A hypothesis that might then be fruitful for future research would be that legislators have a professed ideology, exhibited through speech patterns, and an actual ideology as exhibited through voting patterns.
19
Democrats, as quantified by a Pearson χ2 statistic. Subject to certain
constraints on frequency of appearance in Newspaper headlines8, the 1,000
phrases with the highest χ2 number are designated as ideologically revealing.
A regression for each selected phrase is ran using the relative frequency of
usage as the dependent variable and the ideology of legislators as the
independent variable9. The slope and intercept estimate from each regression
are then used in a second set of regressions: for each newspaper, the relative
frequency of phrase usage minus the intercept of the first-stage estimate is
regressed on the first-stage slope estimate. The slope estimate from the
second stage regression is the newspaper’s media ‘slant’. (Gentzkow &
Shapiro, 2010, pp. 41-46).
Some of the changes I made to the GS model are done purely to reduce
the computational burden of the algorithm: these are detailed in Section 5.
There are two changes, however, that are particularly rooted in a desire to
theoretically improve the model. The first major change is that I estimate two
dimensions of ideology using the “Common-Space” DW-NOMINATE scores
that can be downloaded at http://voteview.com/dwnomjoint.asp. Though the
dimensions of the NOMINATE do not necessarily have strict rhetorical
8 Gentzkow & Shapiro exclude phrases that appear too often or two infrequently in headlines. The requirements vary depending on phrase-length and are arbitrary. (Gentzkow & Shapiro 2010, p.42)
9 Ideology is approximated by the percentage of votes going to the last presidential election’s conservative candidate in the legislators home district.
20
interpretations, the first dimension can be roughly thought of as the economic
dimension, and the second as the social dimension. Positive scores are closely
associated with legislators who fit popular perceptions of conservatism,
particularly in the first dimension. Like the GS model, mine consists of two
stages of estimation, but with two dimensions in each:
𝑓! = α + 𝛽!𝑑!" + 𝛽!𝑑!"
(𝑓! − c) = c + 𝛽!"𝑑! + 𝛽!"𝑑!
𝑓 denotes the relative frequency of phrase usage, and 𝑑!"is the speakers
NOMINATE score in dimension i. The first regression is indexed by speaker
and executed for each phrase. The second is indexed by phrase and executed
for every newspaper. The coefficients estimated by the first regression are
then used as the dependent variable in the second.
The second change is in the Pearson χ2 statistic calculated during
phrase selection. If the algorithm is to produce meaningful results, it must
only select phrases that are ideologically revealing across all political
participants. Phrases like ‘republican colleague’ and ‘friends across the aisle’
will most likely be indicative of partisanship within congress but are unlikely
to be used in the same manner by newspapers. To help filter out procedural
21
phrases specific to each chamber10, I calculated four Pearson χ2 statistics for
each phrase: one for the house and senate, and one for both dimensions of
ideological scores for a total of 4 values for any given phrase. Out of the
phrases that 5000 phrases with the highest χ2 in the Senate also in the top
5000 phrases with the highest χ2 in the house, the 350 phrases with the
highest average first-dimension χ2 between the two chambers and the 150
phrases with the highest average second-dimension χ2 are selected for
newspaper search and analysis.
Using a separate χ2 measure for the House and Senate and cross-
checking the lists of highest scoring phrases ensures that the phrases selected
are ideologically revealing in more than one chamber of Congress, an
important step in choosing phrases that are ideologically revealing outside of
Congress. However, because the ideological scores are less accurate for
moderate legislators, phrases associated with moderate ideologies will be more
susceptible to error in the remainder of the algorithm. To address this, the
Pearson χ2statistic is designed to rule out very moderate phrases. Rather than
simply testing for similar phrase usage between republicans and democrats, I
first order the chamber’s legislators by the ideological dimension being
measured and divide them into seven groups. I then test the null hypothesis
10 If phrases are not politically telling across the chambers of congress, there is no reason to believe that they will be in the space of media outlets. Hence, this provides a necessary condition, but not a sufficient one.
22
that the three groups to the left and to the right of the middle group have the
same propensity to use the phrase. The percentage share of total phrases used
by each collection of legislators is used to calculate the expected number of
uses of any given word for the portion of legislators. The statistic is then
calculated as follows:
𝜒! = 𝑂! − 𝐸! !
𝐸!!
where 𝐸! is the expected frequency of phrase usage my group i, and 𝑂! is the
observed frequency.
Support Vector Regression 4.2
As previously noted, the distribution of phrase usage may not be suited
for a simple linear model; it may also be the case that more general linear
models will perform poorly with regards to the data. An alternative to the
Least Squares regression-based algorithm is training a Support Vector
Machine Regression (SVR) to predict ideology amongst legislators and
applying the learned model to newspapers for each dimension of ideology. This
can be done with minimal assumptions about the true underlying relationship
between ideology and phrase usage. The idea behind an SVR is to map
nonlinear data to a high-dimensional space where it will have linear structure.
Support vector machines predict data based on an input vector by using a
23
‘kernel trick’,11 which implicitly maps inputs to an inner-product space without
ever having defining the function explicitly. I use the sequential minimization
algorithm developed by Shevade, et. al (2000) with a normalized 2nd degree
polynomial kernel.12
Modeling Polarization 4.3
As with political bias, the meaning of polarization can be context-
specific, so it is necessary to clearly state what is being tested. Within the
context of the models I use, the following is a testable hypotheses concerning
political polarization in Newspapers:
H1: On average, liberal newspapers have become more liberal and conservative newspapers have become more conservative
To test the hypotheses I will assume that newspapers with scores less
than zero are liberal and greater than zero are conservative. If the first
hypothesis is true, then newspapers with ideological scores less than zero will
have a lower score in 2012 than in 1994 and newspapers with ideological
scores greater than zero in 1994 will have a greater ideological score in 2012.
11 See Smola & Scholkopf, 2003 for a thorough and accessible explanation of the history and theory of Support Vector Machine Regresions.
12 I use WEKA to implement the algorithm and predictions (Hall, et.al 2009)
24
5 Data Collection
To collect and format the data for analysis, I wrote a series of automated
scripts in Ruby and Ruby on Rails. The first script scrapes the text from every
page of the Congressional Record (except Extensions of Remarks) from the
Government Printing Office (http://www.gpo.gov) and breaks it into passages,
which defined as sections of the text in which there is a single speaker. With a
few exceptions,13 ‘Mr.’, ‘Mrs.’, or ‘Ms.’ followed by a name in capital letters
signifies a new speaker. Extremely common words are then removed from all
passages, and each passage is deconstructed into sequential two-word phrases
and grouped by their phonetic root as determined by the Porter Stemmer
algorithm (Porter, 1980). This allows variations in tense and pluralizing to be
treated as the same phrase.
At this phase there are two important distinctions between the GS
procedure and mine. First, GS removes all words that are found in a
prominent stop list of over 400 words (Fox, 1989-1990). To address the
possibility that list is too restrictive, I begin by removing only the 75 most
common words in the English language and proceed to systematically remove
words that are likely have a distinct meaning in Congress.14 Second, phrase
usage is measured as binary value for each passage: rather than count the
13 See Appendix A for a complete detail of the collection process
14 See Appendix B for a list of words and phrases removed from the text
25
number of times each phrase is used in each passage, I only document whether
or not a phrase was used in a given passage.15
After each passage has been parsed into two-word phrases, the phrases
that appear in only one chamber of Congress are removed from the dataset.
The four chi-2 statistics are then calculated for each phrase and those most
likely to be ideologically revealing are selected for analysis, as described in
section 4.1. Using the NewsBank database via Northwestern University’s
Library webpage16, a script records the relative frequency with which each of
the 14 sampled daily newspapers use the selected phrases.17 Finally, the
relative frequencies of legislators and newspapers are standardized within the
two groups, i.e. scaled to have a mean of zero and unit variance, in order to
account for the large difference in phrase volume of legislators and
newspapers. The data are then exported for analysis.
15 The measure of relative frequency used in later steps is defined as the number of passages or articles in which a legislator or newspaper, respectively, uses a given phrase divided by the number of times they used any of the selected phrases in a passage or article, respectively.
16 http://www.library.northwestern.edu/
17 The database contains several thousand newspapers, but because of time and computational constraints, only 14 were selected. See Appendix A for a full account of the difficulties associated with the procedure.
26
6 Analysis
Results 6.1
Table 1 displays the phrase most associated with conservatism (Highest
Dimension i coefficients) and the most associated with liberalism (Lowest
Dimension i coefficients) for each dimension i and each year. Tables 2 and 3
present the results of the OLS method and the SVR method, respectively.
In both models, traditional measures of goodness-of-fit like R2 will not
be clearly defined. An alternative way to measure goodness-of-fit is the
correlation coefficient from applying the model to legislators rather than
newspapers. This will measure how closely the each model fits the data used
in its derivation and can be applied to both dimensions. These results are
presented in table 4.
To test the Polarization Hypothesis, the sample of newspapers was split
into four groups according to ideology in each dimension (conservative-
conservative, conservative-liberal, liberal-conservative, liberal-liberal). The
mean ideological shift within each group from 1994 to 2012 is shown in Table
5.
Discussion 6.2
The two models share directional estimates of mean ideology in the
sample of newspapers. On average, the newspapers moved towards center
from the left in the first dimension, and past the center from the right in the
second dimension. Using rough interpretations of what the dimensions
27
signify, one may interpret this as newspapers becoming less fiscally liberal and
more socially liberal. Both models estimate The Washington Times to be
significantly more conservative than any of the other newspapers, though the
OLS model also estimates a leftward shift in the second, social-policy oriented
dimension.
One potential explanation is that phrases selected by the 2nd-dimension
Χ2 score have less predictive power in 2012 than in 1994. The subsample of
phrases in Table 1 supports this explanation. The most ideologically revealing
phrases from 1994 have clear explanations rooted in partisan policy
preference, but phrases like ‘were sent’ and ‘across nation’ among 2012’s
conservative-indicative phrases have no clear explanation. This could indicate
the abandonment of the second dimension by conservatives: if conservatives
are not able to use partisan language in the second dimension, then the words
selected for analysis will be arbitrary. Military references seem to be the
strongest indicator of second-dimension conservatism in 1994, but are less
prevalent in 2012. That is not to say conservatives have no preference in the
second dimension, but it that they do not express these preferences through
partisan language, a phenomenon that would lead to the low correlation
coefficients seen in Table 4.
As shown in Table 5, both models show that 1994’s first-dimension
liberal newspapers have become more conservative in the first dimension, and
that on average first-dimension liberal newspapers have moved towards the
28
center in the second dimension. Again, lack of conservative rhetoric cannot be
ruled out as an explanation for this movement. As measured by the OLS
model, there are only two dimensions in which newspapers have moved away
from center. Newspapers that were Conservative/Conservative in 1994
became more Conservative in the first dimension, and those that were
Conservative liberal became more liberal in the second dimension. This
provides very weak evidence for the notion that conservative movement away
from center is driving polarization (Poole & Rosenthal, 1997).
Perhaps surprisingly, the OLS algorithm appears at first to be a more
effective method of estimation. The correlation coefficient of the model as
applied to legislators is above 0.80 for dimension 1 in both years, and several
of its predictions are intuitively appealing. The Washington Times is
estimated to be significantly more conservative than any other newspaper in
the sample, and the sample on average is slanted liberally. Another
interesting result is that the Arkansas Democrat Gazette became more
economically conservative in the time between 1994 and 2012, suggesting a
correlation with the Clinton Presidency.
The SVR approach should not immediately be discounted. One
disadvantage of using kernel methods is their potential for overfitting the data
(Smola & Scholkopf, 2003). That is, the algorithm may ‘learn’ a structure that
fits the training data perfectly, but is not representative of the true underlying
model. To avoid this in my work, I use an algorithm that estimates a model of
29
several different subsamples - called ‘folds’ – and tests them on a fold set aside
for validation. The model is learned through aggregating the estimates of all
the subsample models. The OLS algorithm, however, does not designate an
equivalent check against overfitting, so it could be susceptible to overfitting, in
addition to underfitting. It cannot be ruled out that the SVR model is actually
a more accurate depiction of the noisiness of the data.
7 Conclusion
The results produced in this paper should adequately demonstrate the
utility of natural language processing in assessing political ideology. The
preceding analysis both confirms prior research and public perception
concerning media bias and provides new insight into the nature of the
relationship between legislators and the media. By simply expanding the
newspaper sample, a future researcher could meticulously examine the
difference between various models of partisan phrasing and determine which
is best suited for various lines of inquiry. It is important to note that this
procedure need not be confined to traditional newspapers; the model is
sufficiently generalizable to any entity that produces large amounts of
rhetorical content. Moreover, the rapidly growing field of machine learning
and the digitization of media outlets will continue to facilitate innovation in
research methodology.
30
The results above were derived using only sequential pairs of words;
expanding the natural language analysis to three-word phrases as in the GS
model could provide additional explanatory power, but it is unnecessary to
restrict analysis to consecutive words. As technology progresses, researchers
may be able to incorporate context, setting, and even inflection into analysis.
Procedures like the ones outlined in this paper may become an invaluable
research tool in many fields.
31
Highest Dim. 1 Coefficient Highest Dim. 2 Coefficient Highest Dim. 1 Coefficient Highest Dim. 2 Coefficientsocial spending armed services raising taxes pass farm
washington times rural america federal government bridges infrastructurebig government loving wife new tax medal purple
government bureaucrats deficit spending tax increase high schoolamerican taxpayers appalachian regional job creators gulf coast
tax increase agricultural research care law roads bridgesnew taxes active duty red tape investments infrastructure
republican alternative veterans employment government spending across nationwelfare spending constitutional duty increased taxes rules regulationsincrease history cut defense taxes more passed bipartisan
taxes more technology programs taxing american were sentnew social women veterans live within american farm
endless appeals pay raise government takeover transportation infrastructuresocial welfare natural resources independent payment safety netprice controls public works national debt traditional energy
Lowest Dim. 1 Coefficient Lowest Dim. 2 Coefficient Lowest Dim. 1 Coefficient Lowest Dim. 2 Coefficientagainst women against women tax break care womenviolence against violence against tax cut prevent publicmost vulnerable improvement education middle class republican party
insurance companies against government wealthiest americans tax breakwithout health citizens against repeal affordable women health
domestic violence community notification cancer screenings oil companiescoverage health performance standards super pacs public safetynelson mandela peace security preventative health cervical cancer
south africa funds schools women health millions womencivil rights women health student loans women country
community service nation schools preventative care governor romneyguarantee health kinds situations children preexisting care planning
women health funds local american women public healthchildren working long history costs seniors protects womenworkers rights wages benefits republicans proposed affordable care
1994 2012
Table 1
32
Newspaper Title 1994 2012 Δ 1994 2012 ΔArizona Daily Star -0.311 0.122 0.433 0.141 -0.003 -0.145
Arkansas Democrat-Gazette 0.141 0.265 0.124 0.730 0.176 -0.554Chicago Sun-Times -0.471 -0.448 0.024 -0.441 -0.789 -0.348
Christian Science Monitor 0.520 -0.065 -0.585 -0.968 -1.063 -0.095Columbian -0.023 0.080 0.103 -0.224 -0.035 0.189
Herald & Review -0.764 -0.007 0.757 0.067 0.704 0.638Modesto Bee -0.768 0.033 0.801 0.290 0.437 0.148
Morning Call -0.311 0.063 0.374 -0.010 0.016 0.027New Hampshire Union Leader 0.157 0.079 -0.078 0.523 -0.348 -0.871
News & Observer -0.121 -0.302 -0.181 -0.649 -0.387 0.262News & Record -0.802 -0.326 0.476 -0.042 0.072 0.114
News Tribune -0.274 0.091 0.365 -0.309 0.378 0.688USA TODAY 0.113 -0.433 -0.547 -0.017 -0.276 -0.259
Washington Times 1.018 2.029 1.011 0.207 -1.057 -1.264Mean -0.135 0.084 0.220 -0.050 -0.155 -0.105
Dimension 2: OLSDimension 1: OLS
Table 2
Newspaper Title 1994 2012 Δ 1994 2012 ΔArizona Daily Star -0.003 -0.034 -0.031 0.001 -0.222 -0.223
Arkansas Democrat-Gazette -0.036 0.156 0.192 0.046 0.002 -0.044Chicago Sun-Times -0.075 0.025 0.100 0.027 -0.149 -0.176
Christian Science Monitor -0.031 0.043 0.074 -0.155 -0.065 0.089Columbian -0.039 0.094 0.133 -0.041 -0.012 0.029
Herald & Review -0.088 0.006 0.095 -0.047 0.029 0.076Modesto Bee -0.054 0.104 0.158 -0.080 0.007 0.088
Morning Call -0.003 0.170 0.174 0.003 -0.074 -0.077New Hampshire Union Leader -0.066 0.102 0.168 -0.013 -0.060 -0.047
News & Observer -0.051 -0.020 0.031 -0.072 -0.113 -0.041News & Record -0.153 0.083 0.236 0.017 -0.064 -0.082
News Tribune -0.038 0.090 0.128 0.017 0.006 -0.011USA TODAY -0.105 -0.051 0.054 -0.054 -0.093 -0.039
Washington Times 0.003 0.370 0.367 -0.122 0.037 0.159Mean -0.053 0.081 0.134 -0.034 -0.055 -0.021
Dimension 1: SVR Dimension 2: SVR
Table 3
33
Table 4
Table 5
OLS 1994 OLS 2012 SVR 1994 SVR 2012
Dimension 1 0.825 0.859 0.596 0.714Dimension 2 0.651 0.708 0.254 0.234
Correlation Coefficients
Dimension 1 Dimension 2 Dimension 1 Dimension 2Conservative-Conservative 0.352 -0.896 - -
Conservative-Liberal -0.566 -0.177 0.367 0.159Liberal-Conservative 0.202 -0.030 0.133 -0.102
Liberal-Liberal 0.158 0.181 0.102 0.022
Mean shift: OLS Mean shift: SVRInitial Ideology
34
Bibliography
Watts, M. D., Domke, D., Shah, D. V., & David, F. P. (1999). Elite Cues and Media Bias in Presidential Campaigns: Explaining Public Perceptions of a Liberal Press. Communication Research , 26 (2), 144-175.
Vigna, S. D., & Kaplan, E. (2007). The Fox News Effect: Media Bias and Voting. The Quarterly Journal of Economics , 122 (3), 1187-1234.
Bailey, M. A., & Chang, K. H. (2001). Comparing Presidents, Senators, and Justices: Interinstitutional Preference Estimation. Journal of Law, Economics, and Organization , 71 (2), 477-506.
Bailey, M. A., Kamoie, B., & Maltzman, F. (2005). Signals from the Tenth Justice: The Political Role of the Solicitor General in Supreme Court Decision Making. American Journal of Political Science , 49 (1), 72-85.
Bartels, L. M. (2000). Partisanship and voting behavior. American Journal of Political Science , 44, 35-50.
Boczkowski, P. J., & Siles, I. (2012). Making sense of the newspaper crisis: A critical assessment of existing research and an agenda for future work. New Media and Society , 14 (8), 1375–1394.
Drucker, H., Burges, C. J., Kaufman, L., Smola, A., & Vapnik, V. (1997). Support Vector Regression Machines. Bell Labs and Monmouth University, Department of Electronic Engineering. West Long Branch: MIT Press.
Fiorina, M. P., & Abrams, S. J. (2008). Political Polarization in the American Public. Annual Review of Political Science , 11, 563-588.
Fox, C. (1989-1990). A stop list for general text. ACM SIGIR Forum , 24 (1-2), 19-21.
Gaskins, B., & Jerit, J. (2012). Internet News : Is It a Replacement for Traditional Media Outlets? The International Journal of Press/Politics , 17 (2), 190-2013.
Gaskins, B., & Jerit, J. (2012). Internet News: Is It a Replacement for Traditional Media Outlets? The International Journal of Press Politics , 17 (190), 190-213.
Gasper, J. T. (2011 ). Shifting Ideologies? Re-examining Media Bias . Quarterly Journal of Political Science , 6, 85-102.
Gentzkow, M., & Shapiro, J. M. (2010). What drive media slant? Evidence from U.S. Daily Newspapers. Econometrica , 78 (1), 35-71.
35
Gentzkow, M., & Shapiro, J. M. (2006). Media Bias and Reputation. Journal of Political Economy , 114 (2), 280-316.
Gentzkow, M., Shapiro, J. M., & Sinkinson, M. (2011). The Effect of Newspaper Entry and Exit on Electoral Politics. American Economic Review , 101 (7), 2980-3018.
Groseclose, T., & Milyo, J. (2005). A Measure of Media Bias. The Quarterly Journal of Economics , 120 (4), 1191-1237.
Groseclose, T., Levitt, S. D., & Snyder, Jr., J. M. (1999). Comparing interest group scores across time and chambers; adjusted ADA scores for the U.S. Congress. American Political Science Review , 93 (1), 33.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. The WEKA Data Mining Software: An Update. SIGKDD Explorations , 11 (1).
Hix, S., Noury, A., & Roland, G. (2006). Dimensions of Politics in the European Parliament. American Journal of Political Science , 50 (2), 494-511.
Kaye, J., & Quinn, S. (2010). Funding Journalism in the Digital Age: Business Models, Strategies, Issues and Trends. New York: Peter Lang.
Luntz, F. (2005). Introducing a Searchable, Easily Accessed, Text-Version of the Frank Luntz Republican Playbook. Retrieved May 11, 2013, from politicalstrategy.org: http://www.politicalstrategy.org/archives/001185.php
Laver, M., Benoit, K., & J., a. G. (2003). Extracting Policy Positions From Political Texts Using Words as Data. American Political Science Review , 97, 311–331.
Larcinese , V., Puglisi , R., & Synder, J. M. (2011). Partisan bias in economic news: Evidence on the agenda-setting behavior of U.S. newspapers. Journal of Public Economics , 95 (9-10), 1178-1189.
Lott, Jr., J. R., & Hasset, K. A. (2004, October 19). Is Newspaper Coverage of Economic Events Politically Biased? Retrieved May 19, 2012, from SSRR: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=588453
McCarty, N. (2010, February 26). Measuring Legislative Preferences. Retrieved 19 2013, May, from Nolan McCarty: http://www.princeton.edu/~nmccarty/idealpoint3.pdf
Morgenstern, S. (2004). Patterns of Legislative Politics: Roll Call Voting in Latin America and the United States. Cambridge: Cambridge University Press.
Puglisi, R. (2011). BEING THE NEW YORK TIMES: THE POLITICAL BEHAVIOUR OF A NEWSPAPER. The B.E. Journal of Economic Analysis & Policy , 11 (1).
Poole, K. T., & Rosenthal, H. (1997). Congress: A Political-Economic History of Roll Call Voting. New York: Oxford University Press.
Poole, K. T., & Rosenthal, H. (1984). The Polarization of American Politics. The Journal of Politics , 46, 1061-1079.
36
Prior, M. (2013). Media and Political Polarization. Annual Review of Political Science , 16, 101–27.
Sinclair, B. (2000). Polarized Politics: Congress and the President in a Partisan Era. In J. Bond, & R. Fleisher (Eds.), Polarized Politics: Congress and the President in a Partisan Era (pp. 134–53 ). Washington, DC : CQ Press.
Shevade, S. K., Keerthi, S. S., Bhattacharyya, C., & Murthy, K. R. (2000). Improvements to the SMO Algorithm for SVM Regression. IEEE TRANSACTIONS ON NEURAL NETWORKS , 11 (5), 1188-1193.
Shor, B., & McCarty, N. (2011). The Ideological Mapping of American Legislatures. American Political Science Review , 105 (3), 530-551.
Smola, A. J., & Scholkopf, B. (2003). A Tutorial on Support Vector Regression. Australian National University, Max-Planck-Institut fur biologische Kybernetik.
The Washington Post. (2013, May 13). Abortion doctor Kermit Gosnell convicted of murder in deaths of three infants. The Washington Post .
The Washington Times. (2013, May 13). Murder: Gosnell guilty verdict hailed on both sides of abortion debate. The Washington Times .
37
Appendix A: Procedural Details
Because each step of the data collection process takes a significant
amount of time (completing the analysis for one year and 10 Newspapers takes
approximately one week), necessary exceptions to the scripted procedures were
difficult to find and repair in a timely fashion. It was often the case that after
a few days of carrying out a program, I would discover a new exception in some
part of the analysis that needed to be taken into account. Many times these
error detections required the procedure to completely restart. Originally, I
planned on analyzing every year between 1994 and 2012, but this soon became
infeasible. Similarly, I was forced to greatly reduce the sample size of
Newspapers to ensure the analysis was completed in time.
The text pages containing the Congressional Record are spread across
approximately 15,000 URLs per year, so a necessary first step is to collect all
the links in a text file. After this is complete, an automated Ruby/Ruby on
Rails script employs the Watir:Webdriver gem to extract every page of text
from the URLs. With some exceptions, each page is formatted the same way:
any time a new Congress Member begins to speak, a new paragraph is started
with ‘Mr.’, ‘Ms.’, or Mrs. followed by the person’s last name in upper case
letters. The following the people and instances are exceptions to the rule that
were periodically discovered and implemented:
• Officers (Speaker, Speaker Pro Tempore, President, President Pro Tempore, Clerk, Vice President, Acting Vice President, Chairman
38
• Hyphenated last names
• Multi-word last names
• Names shared by other legislators
• Prefixes in which one letter is lowercase
• Introducing Amendments
• Requesting additional time
• Roll Call Votes
• Speaking during readings
• Addressing the chair
When the Congressional Record is parsed into two-word phrases, I use a
Stemmer algorithm to strip words down to the root. Because the stemmer
cannot be reversed, it is necessary to also store the original text of a given
stem. To this end, a phrase being used for the first time will have both the
stem and the complete phrase stored in the database. From this point on the
complete text of the stem is set: additional uses of the stem do not affect the
associated complete phrase, so the newspapers are searched for the first
iteration of the stem found in the Congressional Record. Fortunately, many of
the phrases selected for analysis do no exhibit significant linguistic variation.
Because there is a possibility that certain phrases involving an extremely
common word are nonsensical with the common word removed, the
newspapers are not searched for only the phrases as they appear in the
database. Rather, the search constraint search for all combinations of the
words in the phrase that have no more than one word separating them.
39
Appendix B: Words Removed From Analysis
a about
absence absent
accordance act
act sec action
activities activity
add adding adjourn
administration
affairs after
agreed all
also am
amend amendment
america owes an
and any
appears appropirated appropriat
appropriations
april are
article as
ask asked
assemble assemebled
at august
authorize authorized
balance be
because of illness been
before bill
budget business to
come business
today but by can
carolina chair
chairman clause clerk
clinton close of
business colleague comments
commission committee concurrent
confere congress
congressional consent consider contents
could dakota
date day days dear
debate december definition
detail details district divided
do does due each
en bloc enact
enacted equally divided
establish established
et seq every man
except extend extent
february federal debt
first florida follow
followed following follows
for forth friday friend friends from
gentleman get
given permission
go going good has
have he
hearing hearings
her hereby
him his
house i
i ask if ii iii
immediate consideration
in include includes
insert inserted
into it its
january july june just
know ladies
gentleman later
leader leadership legislation legislative
lieu like look
madam majority
make man woman
child march may me
meet meet during meet session
member members mention minority minute
modified modify monday
more more morning business motion
my myself nays
new jersey new york
no not
november now
object objection october
of
on one only or
order ordered other ought our out
paragraph pargraph pending people
per per capita
percent period please
precede preceding president presiding presiding
Officer previously
pro tempore procedure procedures
proceed provision purpose
pursuant put back quorum
read recess
recognize reconsider
record reform reject
remark report
representative
request reserve
reserved reserving the
right
resolution resume revise
rhode island rollcall
rule said
saturday say
secretary section
see semicolon
senate senator senators
september session sessions
shall she
should sincerely
so some
speaker stand
statement statements
states stood
subdivision submit
subsection such
sunday support suspend
table table en
take technical tempore
testimony than
thank thank you
that the
their them
then there these they this
through thursday
time title to
today tomorrow
too total
tuesday unanimous unanimous
consent under united
up upon
us use later in
the day va
vote votes was we
wednesday west virginia
what when which who will with
withdraw would year yeas yet
yield you your