Phrasing and its Ideological Implications€¦ · Phrasing and its Ideological Implications: Using Natural Language Processing to Measure Political Affiliation Alexander Williamson

Phrasing and its Ideological Implications:

Using Natural Language Processing to Measure Political Affiliation

Alexander Williamson

June 6th, 2013

Abstract

I explore the utility of natural language processing in estimating

political bias in the media. I propose two methods for estimation.

Building on prior research, the first method involves two sets of OLS

regressions. The second method uses a Support Vector Regression, a

tool borrowed from the rapidly growing field of machine learning.

Both models are applied to the topic of political polarization amongst

US daily newspapers to illustrate the value of these and similar

methodologies.

2

Acknowledgements

I would like to very sincerely thank Professor Lynne

Kiesling for her guidance and encouragement. I would

also like to thank Jonathan Friedman for his endless

advice on technology and implementation. Finally, I

would like to thank my mother, father, and sister for their

relentless support over the last four years.

3

Table of Contents

1 INTRODUCTION 4 2 LITERATURE REVIEW 5 THE INFLUENCE OF THE NEWS INDUSTRY 5 2.1 ESTIMATIONS OF POLITICAL BIAS 6 2.2 POLITICAL POLARIZATION 9 2.3 MY CONTRIBUTION 10 2.4

3 ESTIMATING IDEOLOGY 11 INTEREST GROUP SCORES 11 3.1 WHAT IS POLITICAL BIAS? 13 3.2 NOMINATE SCORES 14 3.3 LANGUAGE AS A LINK 15 3.4 BIAS IN CONTEXT 17 3.5

4 MODELING IDEOLOGY AND POLARIZATION 18 THE OLS ALGORITHM 18 4.1 SUPPORT VECTOR REGRESSION 22 4.2 MODELING POLARIZATION 23 4.3

5 DATA COLLECTION 24 6 ANALYSIS 26 RESULTS 26 6.1 DISCUSSION 26 6.2

7 CONCLUSION 29 BIBLIOGRAPHY 34 APPENDIX A: PROCEDURAL DETAILS 37 APPENDIX B: WORDS REMOVED FROM ANALYSIS 39

4

1 Introduction

On March 13, 2013, two newspapers ran articles reporting the verdict in

the trial of Kermit B. Gosnell, titled “Murder: Gosnell guilty verdict hailed on

both sides of abortion debate” and “Abortion doctor Kermit Gosnell convicted of

murder in deaths of three infants”, respectively. If a reader were to observe

the titles chosen by these publications, they would begin to speculate about the

political inclinations of the two organizations, and after reading the first fifty

words in each article, the reader will observe that phrases such as “pro-life

advocates”, “abortionist”, and “abortion industry” are unique to the first

article’s introduction. By the time readers have reached the end of both

articles they will have observed a systematic difference in phrasing.

The first article is from The Washington Times, and the second is from

The Washington Post, two papers commonly believed to be slanted to the right

and left, respectively. Anecdotal demonstrations such as these tend to pervade

conversations about bias in the media, despite susceptibility to bias from both

those making the argument and those observing it. Furthermore, not every

purported incidence of bias, or ‘slant’, will provide such a clear dichotomy, and

common questions in the discussion of current US politics and its relation to

the media are often more subtle. To this end, political scientists and

economists have proposed several successful methods of measuring media bias

5

that fulfill specific research objectives. Developing a sufficiently generalizable

measure of bias, however, has proved to be difficult.

In this thesis, I explore the capacity of natural language as a gauge of

political ideology. After reviewing prior research, I describe the inherent

difficulties of defining ‘liberal’ and ‘conservative’. Then building on the work of

Gentzkow & Shapiro (2010), I consider two alternative estimates of political

bias that compare media outlets to U.S. Congress Members. In the first

approach, I use the OLS estimator developed by Gentzkow & Shapiro with a

few important modifications, including the use of Common-Space NOMINATE

scores (Poole & Rosenthal, 1997) as a proxy for legislator ideology. In the

second model, I train a support vector regression on the set of legislators then

use the trained model to estimate bias in U.S. daily newspapers. To explore

the utility of these procedures I apply each to the topic of political polarization.

I calculate media bias for 14 daily newspapers and assess how it has changed

from 1994 to 2012.

2 Literature Review

The Influence of the News Industry 2.1

Fully appreciating the significance of political bias in the media begins

with an understanding of the crucial role news sources play in democratic

societies, and the potential issues raised by the swift changes that are

occurring in the media industry. The advent of the Internet let media outlets

6

tailor to a smaller group of readers, allowing many people to find an ideological

‘niche’ in which to obtain news (Gaskins & Jerit, 2012). The presence and

ideological catering of numerous new alternative media outlets led to declines

in both advertising revenue and circulation for newspapers around the world

(Kaye & Quinn, 2010). Researchers now refer to the current state of the

Newspaper Industry as a crisis without hesitation (Boczkowski & Siles, 2012).

Because a vital news industry is widely considered to be an essential part of a

healthy democratic society (Gaskins & Jerit, 2012; Gentzkow & Shapiro, 2010;

Boczkowski & Siles, 2012), there is growing concern over the future of the

media industry and whether the various media replacing newspapers will

sufficiently serve the same important functions (Boczkowski & Siles, 2012).

What is more concerning is the evidence that these structural changes are also

altering the nature of content being produced by news organizations, especially

when viewed in the light of media’s influence on political engagement (Vigna &

Kaplan, 2007).

Estimations of Political Bias 2.2

The issue of political bias in the media is not new: Gentzkow, Shapiro, &

Sinkinson (2011) construct a model of newspaper entry and exit in which they

analyze the political endorsements of newspapers around the turn of the 20th

century. Endorsement data, however, are often sparse and can be explanatory

only to a point. In the case of presidential endorsements, it is plausible that

two newspapers with the same endorsement record could have very different

7

ideologies but agree on the choice of candidate every four years. Fortunately

for researchers, the same technological changes that adversely affect the

newspaper industry are creating new ways to analyze media. This new

empirical capacity parallels important and innovative work on media bias over

the last two decades (Groseclose & Milyo, A Measure of Media Bias, 2005;

Gentzkow & Shapiro 2010; Lott & Hassett, 2004; Puglisi, 2011; Ho & Quinn

2008). These authors create distance between themselves and subjectivity by

identifying less explicit ways in which media outlets reveal political affiliation.

Perhaps unsurprisingly, such research is frequently in alignment with

common perception. For example, the widely held belief that major news

outlets favor democrats (Watts, Domke, Shah, & David, 1999) has been

confirmed to various degrees by multiple authors. Puglisi (2011) examines

every article from The New York Times written between 1946 and 1997 and

concludes that the newspaper gives more emphasis to subject areas in which

the public perceives the Democratic Party to be more competent. Lott &

Hassett (2004) find evidence that newspapers give less positive coverage to

releases of economic indicators, though this claim was revisited by Larcinese,

Puglisi, & Snyder (2011), who conclude that newspapers with histories of

endorsing Democratic Presidents tend to give less coverage to negative

unemployment news if the current President is a Democrat, a significantly

weaker conclusion. They find no evidence of this occurring on other topics,

such as the trade deficit, the budget deficit, or inflation.

8

Each of the cited authors provides plausible evidence of the presence of

political bias in various news organizations, but do not attempt to generalize

their results for comparison across media outlets. Moreover, such an attempt

would need to address inherent problems in the respective estimations.

Larcinese, Puglisi, & Snyder rely on endorsement data, and Lott & Hasset

only consider bias as manifested in the framing of economic news. Puglisi only

measures bias in one newspaper.

Groseclose & Milyo (2005) proposed a general method for estimating

media bias that proved to be influential. The authors created a mapping from

media outlets to the space of Congress members using adjusted ADA scores

(Groseclose, Levitt, & Snyder, Jr., 1999). This involved counting how many

times a newspaper cites various think tanks and interest groups and

comparing the count to the number of times that Congress Members cite the

same groups, ideally providing a metric by which to compare media outlets to

each other and to legislators. These estimates, however, are inconsistent over

time and in many instances are not robust to the removal of a individual think

tank (Gasper, 2011 ). Even aside from the problematic nature of the estimates

for media outlets, the general method of calculating interest group scores faces

significant theoretical obstacles.1

1 See Section 3 for details

9

In order to estimate the bias of a general newspaper, Gentzkow &

Shapiro (2010) create an algorithm to quantify political bias of U.S. daily

newspapers by analyzing the language each paper uses. After identifying a set

of 1,000 phrases that are favored by one of the two major parties, the authors

assign an ideological measure to each phrase by regressing the relative

frequency of usage on the legislative ideologies. In the next step, the authors

search newspapers and record the frequency of phrase usage then perform a

second regression of Newspaper phrase usage on phrase ideology. The slope

from the second regression is the estimated political bias of the newspaper.

The model is correlated with survey data on media bias2, indicating the value

of further analysis. To this end, exploring the capacity of natural language to

measure political bias forms a basis for my research in this thesis. A detailed

comparison of Gentzkow and Shapiro’s model and my own is given in Section

4.

Political Polarization 2.3

The current political environment in the US augments the academic

value of accurately estimating media bias due to the stronger reliance on party

affiliation by political participants. Several researchers have suggested that

political participants are becoming more polarized in the sense that a stronger

2 The authors find a 0.40 correlation coefficient with data from media directory Mondo Times. The authors exclude opinion articles from their analysis and Mondo Times does not, so the correlation is not a perfect indicator of similarity.

10

dichotomy exists between political parties. Using the NOMINATE process,

Poole & Rosenthal (1997) illustrate that before 1990 several Republicans were

more liberal than the most conservative democrats, but in recent years there

has been a strict separation. Similarly, the percentage of roll-call votes in

which a majority of republicans oppose a majority of democrats has risen in

the last 20 years (Sinclair, 2000).

As for political polarization in the electorate, the extent to which

compelling evidence exists depends both on how polarization is defined and on

how evidence is interpreted (Prior, 2013). For example, party identification is

more correlated with voting outcomes in the current electorate than it was

prior to the 1970’s (Bartels, 2000), but this may solely be a result of more

cohesive and ideologically distant parties (Fiorina & Abrams, 2008). The

media’s role in political polarization is also unclear: despite partisanship in

particular news sources, especially new media outlets (Gaskins & Jerit, 2012),

there is no evidence that longstanding outlets have become more partisan over

time (Prior, 2013).

My Contribution 2.4

My research focuses on optimizing the procedure proposed by Gentzkow

& Shapiro to contribute to the literature on estimating media bias. This thesis

also contributes to the literature surrounding political polarization and is the

first attempt to measure political polarization quantitatively in long standing

11

media outlets. In a more general sense, this thesis serves as an illustration of

the value of technological innovation to researchers and as an exploration of

machine learning applications to traditional empirical problems

3 Estimating Ideology

Interest Group Scores 3.1

If a quantitative measure of media bias is estimated via a map into the

space of Congress members, the Congressional metric must be meaningful for

that measure to be useful. The fact that interest group scores are inherently

problematic (Shor & McCarty, 2011) undermines the validity of Groseclose’s &

Milyo’s model using Congressional mapping. Even in the case of comparing

legislators, the ADA scores face theoretical barriers (McCarty, 2010). One

reason is that any given think tank’s position may signal different ideologies in

different legislatures. To see this, consider the general procedure by which the

scores are constructed:

• Given a policy group and legislative body, isolate the votes in the legislative body that are relevant to the policy group.

• For each vote, determine whether the policy group prefers a positive or negative outcome.

• For each legislator, determine the percentage of time his or her vote matches the policy group’s preference. This is the legislators ADA score.

In some scenarios this will yield a meaningful one-dimensional ordering of

political entities consistent with other measures and intuition (Groseclose &

Milyo, A Measure of Media Bias, 2005).

12

If such an ordering juxtaposes multiple legislatures with no common

voting record, however, the results can be misleading. Consider a simple

example of interest group scores in which Legislator A, who is considered very

liberal, and Legislator B, who is considered very conservative, vote in different

legislatures, different years, or both. Suppose Think Tank X is a group whose

only mission is to promote policies that favor more relaxed speeding laws for

automobiles. Legislator A is faced with a series of votes in which he is asked

to decide between allocating money to wider, safer roads or to more

environmentally-friendly government buildings. In all of the votes in which he

participates, he chooses to allocate money to government building renovations.

Legislator B is faced with a series of votes in which he is asked to decide

between reducing the budget for either traffic cops or a local laboratory that

conducts stem-cell related research. In all of the votes in which he

participates, he votes to cut funding for the laboratory. As a result, Legislator

A and Legislator B receive the same interest group score, having voted many

times against the wishes of Think Tank X, despite having very different

political views.

Additionally, the interest groups’ preferred point on the 1-dimentional

scale cannot lie on the interior of the distribution of legislative points. That is,

the scores are calculated with the interest group or think tank at one extreme

of the possible scores with the underlying assumption that the organization is

either the most liberal or the most conservative in the distribution (McCarty,

13

2010). If the interest group is politically moderate, then moderate legislators

will receive a very high score while conservative and liberal legislators will

receive similarly low scores. The resulting estimations may be useful as a

political thermometer (Poole & Rosenthal, 1997, p. 169), but will not be useful

in gauging ideological distance between candidates.

What is political bias? 3.2

In addition to the issues mentioned above, most measures of ideology share

the same difficult task of defining bias and also defining what it means to be

‘liberal’ or ‘conservative’. Within the context of US politics, the two terms

seem to derive their meaning exclusively from rough perceptions of a

dichotomy existing solely in legislative bodies. The question of whether an

entity is conservatively biased or liberally biased can have different meanings

when applied to different entities, so caution must be exercised when

attempting to provide an answer. The NOMINATE scores (Poole & Rosenthal,

1997) discussed in the following section estimate 2-dimensional coordinates for

legislators and the issues on which they vote, with the transparent goal of

predicting whether a Congress Person will vote ‘yay’ or ‘nay’ on a given issue,

but it comes at the loss of a clear rhetorical explanation of the coordinates and

does not always yield an obvious method for comparison across different

legislatures, let alone various other political agents.

14

NOMINATE scores 3.3

Poole & Rosenthal (1997) constructed a process for estimating the

ideology of US Congress Members referred to as NOMINATE (Nominal Three-

Step Estimation) process. Rather than trying to measure a priori definitions of

‘liberal’ and ‘conservative’, the authors estimate coordinates in n-dimensional

Euclidean space, or ‘ideal points’, for each legislator based on a probabilistic

model of roll call voting. The authors assume that each legislator’s utility for a

‘yay’ or ‘nay’ is given in part by his or her distance from the ideological

coordinates of the vote outcomes and in part by a stochastic error term with a

weight on the deterministic part of the function common to all legislators. By

assuming that the error terms in the model are drawn from a logit

distribution, they are able to estimate a log-likelihood function for the entire

set of outcomes for a given session in Congress. Using a gradient descent

algorithm, they estimate the set of legislator points, the set vote-outcome

points, and the weight factor, one set at a time and reiterating through until

the estimates have converged.

This procedure circumvents deciding the number of appropriate

dimensions for ideological: the authors simply add a dimension until they

reach a point such that an additional dimension does not increase the fit of the

15

data in a meaningful way. They find that the fit is not appreciably increased

beyond 2-dimensions.3

This procedure has had a tremendous influence on ideological

estimation and has inspired several similar studies into other legislative

bodies in the US (Bailey & Chang, 2001; Bailey, Kamoie, & Maltzman, 2005)

and in other nations (Hix, Noury, & Roland, 2006; Morgenstern, 2004). By

observing legislators who have moved from one legislative body to another,

researchers are also able to use these scores for comparison across legislatures.

Note that any measure of political bias incorporating these estimates as

proxies for ideology will implicitly be estimating the probability with which

entities would vote ‘yay’ on a typical vote from the underlying Congress.

Because I use these common scores in part of my analysis, I will discuss how to

interpret their use in the next section.

Language as a Link 3.4

This thesis explores the idea that the language newspapers use can hint

at political ideology. The same observations can be made with respect to

members of Congress. Veritable proof of this claim was found in 2005, when a

memo from a Republican strategist leaked to the public, revealing the

proactive efforts to get Republicans to say ‘private accounts’ rather than

3 In the case of D-NOMINATE scores, the first dimension is sufficient to describe the variation in most years. The common space coordinates, however, are constructed so that each of the two dimensions is equally salient.

16

‘personal accounts’ (Luntz, 2005). Another examples include Republican usage

of ‘death tax’ rather than ‘estate tax’ (Gentzkow & Shapiro, 2010), ‘spending

programs’ vs. ‘assistance programs’, or ‘government bureaucrats’ vs. ‘civil

servants’. Additionally, there are phrases that refer to topics altogether

avoided by one party or the other, and a wide range of phrases that are

somewhere in between these two categories.4

The relationship between ideology and phrase usage might be extremely

noisy, but if enough speech is observed it may be possible to associate patterns

among political agents. Using an OLS based algorithm, Gentzkow & Shapiro

(2010) explore this possibility by mapping U.S. daily newspapers into to the

space of Congressional ideology using similarities in the usage of certain

ideologically revealing phrases. I make several modifications to their model

and use it as the first of two procedures through which I estimate bias.

Because the data is noisy, however, the resulting estimates could be

unnecessarily imprecise. Figure 1 shows the relative frequency5 with which

Congress members of varying ideologies were on the record saying

“Washington Times”.6 To address the potential inadequacy of a simple linear

4 See Error! Reference source not found.

5 Relative frequency of a phrase is defined as the number of times a legislator uses that phrase divided by the number of times they use any phrase included in the analysis.

6 Incidentally, I found “Washington Times” to be strongly correlated with ideology, but I did not include it in the analysis because it would skew the estimate of bias for The Washington Times. See Section 4 for details on the measurement process.

17

regression, I also use a Support Vector Regression, a machine learning

technique that performs well in problems of pattern recognition (Drucker,

Burges, Kaufman, Smola, & Vapnik, 1997).

Bias in Context 3.5

In the light of the discussion from the previous section, it is important to

define what is meant by bias to facilitate a meaningful interpretation of

estimates. In this circumstance, I measure bias as the tendency to endorse,

explicitly or subtly, policies in a specific region of the NOMINATE space. If

language is a strong proxy for voting patterns, then this measure will have a

useful interpretation as a media organization’s propensity for pushing the

electorate towards a particular set of legislative outcomes; if the associations

generated by language analysis are coincidental, however, and do not

accurately reflect what the unobservable voting patterns of a given newspaper

0.001

0.021

0.041

0.061

-1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1

!st

Dim

ensi

on

of

Co

mm

on

-Sp

ace

So

cre

Liberal <-> Conservative

Relative frequency of phrase "Washington Times" in 1994 Congressional Record

Figure 1. Each ‘x’ in the above graph represents a legislator. As illustrated by the

fitted line, a simple linear regression may not yield imprecise estimates.

18

would be if it were a legislator, then the measure might still be valuable as an

alternative to the definition of ideology used in NOMINATE calculations.7

4 Modeling Ideology and Polarization

The OLS Algorithm 4.1

To test the hypothesis that language patterns could be used to estimate

bias, Gentzkow & Shapiro (2010) (henceforth GS) design an algorithm that

implements two series of OLS regressions. In designing my model I make

several important modifications, but the two algorithms share the same

general framework:

• Determine the set of phrases that were used in Congress during the time interval of interest and the frequency with which each Congress Member used each phrase

• Determine the phrases that are used disproportionately by members with a certain political affiliation

• Determine the frequency with which news providers use the phrases selected in the previous step

• Create a measure of ideology using phrase frequencies to link news providers to congressmen

Specifically, the GS model begins by testing for every two-word and

three-word phrase found in the 2005 Congressional Record the null hypothesis

that the propensity to use a given phrase is the equal for Republicans and

7 A hypothesis that might then be fruitful for future research would be that legislators have a professed ideology, exhibited through speech patterns, and an actual ideology as exhibited through voting patterns.

19

Democrats, as quantified by a Pearson χ2 statistic. Subject to certain

constraints on frequency of appearance in Newspaper headlines8, the 1,000

phrases with the highest χ2 number are designated as ideologically revealing.

A regression for each selected phrase is ran using the relative frequency of

usage as the dependent variable and the ideology of legislators as the

independent variable9. The slope and intercept estimate from each regression

are then used in a second set of regressions: for each newspaper, the relative

frequency of phrase usage minus the intercept of the first-stage estimate is

regressed on the first-stage slope estimate. The slope estimate from the

second stage regression is the newspaper’s media ‘slant’. (Gentzkow &

Shapiro, 2010, pp. 41-46).

Some of the changes I made to the GS model are done purely to reduce

the computational burden of the algorithm: these are detailed in Section 5.

There are two changes, however, that are particularly rooted in a desire to

theoretically improve the model. The first major change is that I estimate two

dimensions of ideology using the “Common-Space” DW-NOMINATE scores

that can be downloaded at http://voteview.com/dwnomjoint.asp. Though the

dimensions of the NOMINATE do not necessarily have strict rhetorical

8 Gentzkow & Shapiro exclude phrases that appear too often or two infrequently in headlines. The requirements vary depending on phrase-length and are arbitrary. (Gentzkow & Shapiro 2010, p.42)

9 Ideology is approximated by the percentage of votes going to the last presidential election’s conservative candidate in the legislators home district.

20

interpretations, the first dimension can be roughly thought of as the economic

dimension, and the second as the social dimension. Positive scores are closely

associated with legislators who fit popular perceptions of conservatism,

particularly in the first dimension. Like the GS model, mine consists of two

stages of estimation, but with two dimensions in each:

𝑓! = α + 𝛽!𝑑!" + 𝛽!𝑑!"

(𝑓! − c) = c + 𝛽!"𝑑! + 𝛽!"𝑑!

𝑓 denotes the relative frequency of phrase usage, and 𝑑!"is the speakers

NOMINATE score in dimension i. The first regression is indexed by speaker

and executed for each phrase. The second is indexed by phrase and executed

for every newspaper. The coefficients estimated by the first regression are

then used as the dependent variable in the second.

The second change is in the Pearson χ2 statistic calculated during

phrase selection. If the algorithm is to produce meaningful results, it must

only select phrases that are ideologically revealing across all political

participants. Phrases like ‘republican colleague’ and ‘friends across the aisle’

will most likely be indicative of partisanship within congress but are unlikely

to be used in the same manner by newspapers. To help filter out procedural

21

phrases specific to each chamber10, I calculated four Pearson χ2 statistics for

each phrase: one for the house and senate, and one for both dimensions of

ideological scores for a total of 4 values for any given phrase. Out of the

phrases that 5000 phrases with the highest χ2 in the Senate also in the top

5000 phrases with the highest χ2 in the house, the 350 phrases with the

highest average first-dimension χ2 between the two chambers and the 150

phrases with the highest average second-dimension χ2 are selected for

newspaper search and analysis.

Using a separate χ2 measure for the House and Senate and cross-

checking the lists of highest scoring phrases ensures that the phrases selected

are ideologically revealing in more than one chamber of Congress, an

important step in choosing phrases that are ideologically revealing outside of

Congress. However, because the ideological scores are less accurate for

moderate legislators, phrases associated with moderate ideologies will be more

susceptible to error in the remainder of the algorithm. To address this, the

Pearson χ2statistic is designed to rule out very moderate phrases. Rather than

simply testing for similar phrase usage between republicans and democrats, I

first order the chamber’s legislators by the ideological dimension being

measured and divide them into seven groups. I then test the null hypothesis

10 If phrases are not politically telling across the chambers of congress, there is no reason to believe that they will be in the space of media outlets. Hence, this provides a necessary condition, but not a sufficient one.

22

that the three groups to the left and to the right of the middle group have the

same propensity to use the phrase. The percentage share of total phrases used

by each collection of legislators is used to calculate the expected number of

uses of any given word for the portion of legislators. The statistic is then

calculated as follows:

𝜒! = 𝑂! − 𝐸! !

𝐸!!

where 𝐸! is the expected frequency of phrase usage my group i, and 𝑂! is the

observed frequency.

Support Vector Regression 4.2

As previously noted, the distribution of phrase usage may not be suited

for a simple linear model; it may also be the case that more general linear

models will perform poorly with regards to the data. An alternative to the

Least Squares regression-based algorithm is training a Support Vector

Machine Regression (SVR) to predict ideology amongst legislators and

applying the learned model to newspapers for each dimension of ideology. This

can be done with minimal assumptions about the true underlying relationship

between ideology and phrase usage. The idea behind an SVR is to map

nonlinear data to a high-dimensional space where it will have linear structure.

Support vector machines predict data based on an input vector by using a

23

‘kernel trick’,11 which implicitly maps inputs to an inner-product space without

ever having defining the function explicitly. I use the sequential minimization

algorithm developed by Shevade, et. al (2000) with a normalized 2nd degree

polynomial kernel.12

Modeling Polarization 4.3

As with political bias, the meaning of polarization can be context-

specific, so it is necessary to clearly state what is being tested. Within the

context of the models I use, the following is a testable hypotheses concerning

political polarization in Newspapers:

H1: On average, liberal newspapers have become more liberal and conservative newspapers have become more conservative

To test the hypotheses I will assume that newspapers with scores less

than zero are liberal and greater than zero are conservative. If the first

hypothesis is true, then newspapers with ideological scores less than zero will

have a lower score in 2012 than in 1994 and newspapers with ideological

scores greater than zero in 1994 will have a greater ideological score in 2012.

11 See Smola & Scholkopf, 2003 for a thorough and accessible explanation of the history and theory of Support Vector Machine Regresions.

12 I use WEKA to implement the algorithm and predictions (Hall, et.al 2009)

24

5 Data Collection

To collect and format the data for analysis, I wrote a series of automated

scripts in Ruby and Ruby on Rails. The first script scrapes the text from every

page of the Congressional Record (except Extensions of Remarks) from the

Government Printing Office (http://www.gpo.gov) and breaks it into passages,

which defined as sections of the text in which there is a single speaker. With a

few exceptions,13 ‘Mr.’, ‘Mrs.’, or ‘Ms.’ followed by a name in capital letters

signifies a new speaker. Extremely common words are then removed from all

passages, and each passage is deconstructed into sequential two-word phrases

and grouped by their phonetic root as determined by the Porter Stemmer

algorithm (Porter, 1980). This allows variations in tense and pluralizing to be

treated as the same phrase.

At this phase there are two important distinctions between the GS

procedure and mine. First, GS removes all words that are found in a

prominent stop list of over 400 words (Fox, 1989-1990). To address the

possibility that list is too restrictive, I begin by removing only the 75 most

common words in the English language and proceed to systematically remove

words that are likely have a distinct meaning in Congress.14 Second, phrase

usage is measured as binary value for each passage: rather than count the

13 See Appendix A for a complete detail of the collection process

14 See Appendix B for a list of words and phrases removed from the text

25

number of times each phrase is used in each passage, I only document whether

or not a phrase was used in a given passage.15

After each passage has been parsed into two-word phrases, the phrases

that appear in only one chamber of Congress are removed from the dataset.

The four chi-2 statistics are then calculated for each phrase and those most

likely to be ideologically revealing are selected for analysis, as described in

section 4.1. Using the NewsBank database via Northwestern University’s

Library webpage16, a script records the relative frequency with which each of

the 14 sampled daily newspapers use the selected phrases.17 Finally, the

relative frequencies of legislators and newspapers are standardized within the

two groups, i.e. scaled to have a mean of zero and unit variance, in order to

account for the large difference in phrase volume of legislators and

newspapers. The data are then exported for analysis.

15 The measure of relative frequency used in later steps is defined as the number of passages or articles in which a legislator or newspaper, respectively, uses a given phrase divided by the number of times they used any of the selected phrases in a passage or article, respectively.

16 http://www.library.northwestern.edu/

17 The database contains several thousand newspapers, but because of time and computational constraints, only 14 were selected. See Appendix A for a full account of the difficulties associated with the procedure.

26

6 Analysis

Results 6.1

Table 1 displays the phrase most associated with conservatism (Highest

Dimension i coefficients) and the most associated with liberalism (Lowest

Dimension i coefficients) for each dimension i and each year. Tables 2 and 3

present the results of the OLS method and the SVR method, respectively.

In both models, traditional measures of goodness-of-fit like R2 will not

be clearly defined. An alternative way to measure goodness-of-fit is the

correlation coefficient from applying the model to legislators rather than

newspapers. This will measure how closely the each model fits the data used

in its derivation and can be applied to both dimensions. These results are

presented in table 4.

To test the Polarization Hypothesis, the sample of newspapers was split

into four groups according to ideology in each dimension (conservative-

conservative, conservative-liberal, liberal-conservative, liberal-liberal). The

mean ideological shift within each group from 1994 to 2012 is shown in Table

5.

Discussion 6.2

The two models share directional estimates of mean ideology in the

sample of newspapers. On average, the newspapers moved towards center

from the left in the first dimension, and past the center from the right in the

second dimension. Using rough interpretations of what the dimensions

27

signify, one may interpret this as newspapers becoming less fiscally liberal and

more socially liberal. Both models estimate The Washington Times to be

significantly more conservative than any of the other newspapers, though the

OLS model also estimates a leftward shift in the second, social-policy oriented

dimension.

One potential explanation is that phrases selected by the 2nd-dimension

Χ2 score have less predictive power in 2012 than in 1994. The subsample of

phrases in Table 1 supports this explanation. The most ideologically revealing

phrases from 1994 have clear explanations rooted in partisan policy

preference, but phrases like ‘were sent’ and ‘across nation’ among 2012’s

conservative-indicative phrases have no clear explanation. This could indicate

the abandonment of the second dimension by conservatives: if conservatives

are not able to use partisan language in the second dimension, then the words

selected for analysis will be arbitrary. Military references seem to be the

strongest indicator of second-dimension conservatism in 1994, but are less

prevalent in 2012. That is not to say conservatives have no preference in the

second dimension, but it that they do not express these preferences through

partisan language, a phenomenon that would lead to the low correlation

coefficients seen in Table 4.

As shown in Table 5, both models show that 1994’s first-dimension

liberal newspapers have become more conservative in the first dimension, and

that on average first-dimension liberal newspapers have moved towards the

28

center in the second dimension. Again, lack of conservative rhetoric cannot be

ruled out as an explanation for this movement. As measured by the OLS

model, there are only two dimensions in which newspapers have moved away

from center. Newspapers that were Conservative/Conservative in 1994

became more Conservative in the first dimension, and those that were

Conservative liberal became more liberal in the second dimension. This

provides very weak evidence for the notion that conservative movement away

from center is driving polarization (Poole & Rosenthal, 1997).

Perhaps surprisingly, the OLS algorithm appears at first to be a more

effective method of estimation. The correlation coefficient of the model as

applied to legislators is above 0.80 for dimension 1 in both years, and several

of its predictions are intuitively appealing. The Washington Times is

estimated to be significantly more conservative than any other newspaper in

the sample, and the sample on average is slanted liberally. Another

interesting result is that the Arkansas Democrat Gazette became more

economically conservative in the time between 1994 and 2012, suggesting a

correlation with the Clinton Presidency.

The SVR approach should not immediately be discounted. One

disadvantage of using kernel methods is their potential for overfitting the data

(Smola & Scholkopf, 2003). That is, the algorithm may ‘learn’ a structure that

fits the training data perfectly, but is not representative of the true underlying

model. To avoid this in my work, I use an algorithm that estimates a model of

29

several different subsamples - called ‘folds’ – and tests them on a fold set aside

for validation. The model is learned through aggregating the estimates of all

the subsample models. The OLS algorithm, however, does not designate an

equivalent check against overfitting, so it could be susceptible to overfitting, in

addition to underfitting. It cannot be ruled out that the SVR model is actually

a more accurate depiction of the noisiness of the data.

7 Conclusion

The results produced in this paper should adequately demonstrate the

utility of natural language processing in assessing political ideology. The

preceding analysis both confirms prior research and public perception

concerning media bias and provides new insight into the nature of the

relationship between legislators and the media. By simply expanding the

newspaper sample, a future researcher could meticulously examine the

difference between various models of partisan phrasing and determine which

is best suited for various lines of inquiry. It is important to note that this

procedure need not be confined to traditional newspapers; the model is

sufficiently generalizable to any entity that produces large amounts of

rhetorical content. Moreover, the rapidly growing field of machine learning

and the digitization of media outlets will continue to facilitate innovation in

research methodology.

30

The results above were derived using only sequential pairs of words;

expanding the natural language analysis to three-word phrases as in the GS

model could provide additional explanatory power, but it is unnecessary to

restrict analysis to consecutive words. As technology progresses, researchers

may be able to incorporate context, setting, and even inflection into analysis.

Procedures like the ones outlined in this paper may become an invaluable

research tool in many fields.

31

Highest Dim. 1 Coefficient Highest Dim. 2 Coefficient Highest Dim. 1 Coefficient Highest Dim. 2 Coefficientsocial spending armed services raising taxes pass farm

washington times rural america federal government bridges infrastructurebig government loving wife new tax medal purple

government bureaucrats deficit spending tax increase high schoolamerican taxpayers appalachian regional job creators gulf coast

tax increase agricultural research care law roads bridgesnew taxes active duty red tape investments infrastructure

republican alternative veterans employment government spending across nationwelfare spending constitutional duty increased taxes rules regulationsincrease history cut defense taxes more passed bipartisan

taxes more technology programs taxing american were sentnew social women veterans live within american farm

endless appeals pay raise government takeover transportation infrastructuresocial welfare natural resources independent payment safety netprice controls public works national debt traditional energy

Lowest Dim. 1 Coefficient Lowest Dim. 2 Coefficient Lowest Dim. 1 Coefficient Lowest Dim. 2 Coefficientagainst women against women tax break care womenviolence against violence against tax cut prevent publicmost vulnerable improvement education middle class republican party

insurance companies against government wealthiest americans tax breakwithout health citizens against repeal affordable women health

domestic violence community notification cancer screenings oil companiescoverage health performance standards super pacs public safetynelson mandela peace security preventative health cervical cancer

south africa funds schools women health millions womencivil rights women health student loans women country

community service nation schools preventative care governor romneyguarantee health kinds situations children preexisting care planning

women health funds local american women public healthchildren working long history costs seniors protects womenworkers rights wages benefits republicans proposed affordable care

1994 2012

Table 1

32

Newspaper Title 1994 2012 Δ 1994 2012 ΔArizona Daily Star -0.311 0.122 0.433 0.141 -0.003 -0.145

Arkansas Democrat-Gazette 0.141 0.265 0.124 0.730 0.176 -0.554Chicago Sun-Times -0.471 -0.448 0.024 -0.441 -0.789 -0.348

Christian Science Monitor 0.520 -0.065 -0.585 -0.968 -1.063 -0.095Columbian -0.023 0.080 0.103 -0.224 -0.035 0.189

Herald & Review -0.764 -0.007 0.757 0.067 0.704 0.638Modesto Bee -0.768 0.033 0.801 0.290 0.437 0.148

Morning Call -0.311 0.063 0.374 -0.010 0.016 0.027New Hampshire Union Leader 0.157 0.079 -0.078 0.523 -0.348 -0.871

News & Observer -0.121 -0.302 -0.181 -0.649 -0.387 0.262News & Record -0.802 -0.326 0.476 -0.042 0.072 0.114

News Tribune -0.274 0.091 0.365 -0.309 0.378 0.688USA TODAY 0.113 -0.433 -0.547 -0.017 -0.276 -0.259

Washington Times 1.018 2.029 1.011 0.207 -1.057 -1.264Mean -0.135 0.084 0.220 -0.050 -0.155 -0.105

Dimension 2: OLSDimension 1: OLS

Table 2

Newspaper Title 1994 2012 Δ 1994 2012 ΔArizona Daily Star -0.003 -0.034 -0.031 0.001 -0.222 -0.223

Arkansas Democrat-Gazette -0.036 0.156 0.192 0.046 0.002 -0.044Chicago Sun-Times -0.075 0.025 0.100 0.027 -0.149 -0.176

Christian Science Monitor -0.031 0.043 0.074 -0.155 -0.065 0.089Columbian -0.039 0.094 0.133 -0.041 -0.012 0.029

Herald & Review -0.088 0.006 0.095 -0.047 0.029 0.076Modesto Bee -0.054 0.104 0.158 -0.080 0.007 0.088

Morning Call -0.003 0.170 0.174 0.003 -0.074 -0.077New Hampshire Union Leader -0.066 0.102 0.168 -0.013 -0.060 -0.047

News & Observer -0.051 -0.020 0.031 -0.072 -0.113 -0.041News & Record -0.153 0.083 0.236 0.017 -0.064 -0.082

News Tribune -0.038 0.090 0.128 0.017 0.006 -0.011USA TODAY -0.105 -0.051 0.054 -0.054 -0.093 -0.039

Washington Times 0.003 0.370 0.367 -0.122 0.037 0.159Mean -0.053 0.081 0.134 -0.034 -0.055 -0.021

Dimension 1: SVR Dimension 2: SVR

Table 3

33

Table 4

Table 5

OLS 1994 OLS 2012 SVR 1994 SVR 2012

Dimension 1 0.825 0.859 0.596 0.714Dimension 2 0.651 0.708 0.254 0.234

Correlation Coefficients

Dimension 1 Dimension 2 Dimension 1 Dimension 2Conservative-Conservative 0.352 -0.896 - -

Conservative-Liberal -0.566 -0.177 0.367 0.159Liberal-Conservative 0.202 -0.030 0.133 -0.102

Liberal-Liberal 0.158 0.181 0.102 0.022

Mean shift: OLS Mean shift: SVRInitial Ideology

34

Bibliography

Watts, M. D., Domke, D., Shah, D. V., & David, F. P. (1999). Elite Cues and Media Bias in Presidential Campaigns: Explaining Public Perceptions of a Liberal Press. Communication Research , 26 (2), 144-175.

Vigna, S. D., & Kaplan, E. (2007). The Fox News Effect: Media Bias and Voting. The Quarterly Journal of Economics , 122 (3), 1187-1234.

Bailey, M. A., & Chang, K. H. (2001). Comparing Presidents, Senators, and Justices: Interinstitutional Preference Estimation. Journal of Law, Economics, and Organization , 71 (2), 477-506.

Bailey, M. A., Kamoie, B., & Maltzman, F. (2005). Signals from the Tenth Justice: The Political Role of the Solicitor General in Supreme Court Decision Making. American Journal of Political Science , 49 (1), 72-85.

Bartels, L. M. (2000). Partisanship and voting behavior. American Journal of Political Science , 44, 35-50.

Boczkowski, P. J., & Siles, I. (2012). Making sense of the newspaper crisis: A critical assessment of existing research and an agenda for future work. New Media and Society , 14 (8), 1375–1394.

Drucker, H., Burges, C. J., Kaufman, L., Smola, A., & Vapnik, V. (1997). Support Vector Regression Machines. Bell Labs and Monmouth University, Department of Electronic Engineering. West Long Branch: MIT Press.

Fiorina, M. P., & Abrams, S. J. (2008). Political Polarization in the American Public. Annual Review of Political Science , 11, 563-588.

Fox, C. (1989-1990). A stop list for general text. ACM SIGIR Forum , 24 (1-2), 19-21.

Gaskins, B., & Jerit, J. (2012). Internet News : Is It a Replacement for Traditional Media Outlets? The International Journal of Press/Politics , 17 (2), 190-2013.

Gaskins, B., & Jerit, J. (2012). Internet News: Is It a Replacement for Traditional Media Outlets? The International Journal of Press Politics , 17 (190), 190-213.

Gasper, J. T. (2011 ). Shifting Ideologies? Re-examining Media Bias . Quarterly Journal of Political Science , 6, 85-102.

Gentzkow, M., & Shapiro, J. M. (2010). What drive media slant? Evidence from U.S. Daily Newspapers. Econometrica , 78 (1), 35-71.

35

Gentzkow, M., & Shapiro, J. M. (2006). Media Bias and Reputation. Journal of Political Economy , 114 (2), 280-316.

Gentzkow, M., Shapiro, J. M., & Sinkinson, M. (2011). The Effect of Newspaper Entry and Exit on Electoral Politics. American Economic Review , 101 (7), 2980-3018.

Groseclose, T., & Milyo, J. (2005). A Measure of Media Bias. The Quarterly Journal of Economics , 120 (4), 1191-1237.

Groseclose, T., Levitt, S. D., & Snyder, Jr., J. M. (1999). Comparing interest group scores across time and chambers; adjusted ADA scores for the U.S. Congress. American Political Science Review , 93 (1), 33.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. The WEKA Data Mining Software: An Update. SIGKDD Explorations , 11 (1).

Hix, S., Noury, A., & Roland, G. (2006). Dimensions of Politics in the European Parliament. American Journal of Political Science , 50 (2), 494-511.

Kaye, J., & Quinn, S. (2010). Funding Journalism in the Digital Age: Business Models, Strategies, Issues and Trends. New York: Peter Lang.

Luntz, F. (2005). Introducing a Searchable, Easily Accessed, Text-Version of the Frank Luntz Republican Playbook. Retrieved May 11, 2013, from politicalstrategy.org: http://www.politicalstrategy.org/archives/001185.php

Laver, M., Benoit, K., & J., a. G. (2003). Extracting Policy Positions From Political Texts Using Words as Data. American Political Science Review , 97, 311–331.

Larcinese , V., Puglisi , R., & Synder, J. M. (2011). Partisan bias in economic news: Evidence on the agenda-setting behavior of U.S. newspapers. Journal of Public Economics , 95 (9-10), 1178-1189.

Lott, Jr., J. R., & Hasset, K. A. (2004, October 19). Is Newspaper Coverage of Economic Events Politically Biased? Retrieved May 19, 2012, from SSRR: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=588453

McCarty, N. (2010, February 26). Measuring Legislative Preferences. Retrieved 19 2013, May, from Nolan McCarty: http://www.princeton.edu/~nmccarty/idealpoint3.pdf

Morgenstern, S. (2004). Patterns of Legislative Politics: Roll Call Voting in Latin America and the United States. Cambridge: Cambridge University Press.

Puglisi, R. (2011). BEING THE NEW YORK TIMES: THE POLITICAL BEHAVIOUR OF A NEWSPAPER. The B.E. Journal of Economic Analysis & Policy , 11 (1).

Poole, K. T., & Rosenthal, H. (1997). Congress: A Political-Economic History of Roll Call Voting. New York: Oxford University Press.

Poole, K. T., & Rosenthal, H. (1984). The Polarization of American Politics. The Journal of Politics , 46, 1061-1079.

36

Prior, M. (2013). Media and Political Polarization. Annual Review of Political Science , 16, 101–27.

Sinclair, B. (2000). Polarized Politics: Congress and the President in a Partisan Era. In J. Bond, & R. Fleisher (Eds.), Polarized Politics: Congress and the President in a Partisan Era (pp. 134–53 ). Washington, DC : CQ Press.

Shevade, S. K., Keerthi, S. S., Bhattacharyya, C., & Murthy, K. R. (2000). Improvements to the SMO Algorithm for SVM Regression. IEEE TRANSACTIONS ON NEURAL NETWORKS , 11 (5), 1188-1193.

Shor, B., & McCarty, N. (2011). The Ideological Mapping of American Legislatures. American Political Science Review , 105 (3), 530-551.

Smola, A. J., & Scholkopf, B. (2003). A Tutorial on Support Vector Regression. Australian National University, Max-Planck-Institut fur biologische Kybernetik.

The Washington Post. (2013, May 13). Abortion doctor Kermit Gosnell convicted of murder in deaths of three infants. The Washington Post .

The Washington Times. (2013, May 13). Murder: Gosnell guilty verdict hailed on both sides of abortion debate. The Washington Times .

37

Appendix A: Procedural Details

Because each step of the data collection process takes a significant

amount of time (completing the analysis for one year and 10 Newspapers takes

approximately one week), necessary exceptions to the scripted procedures were

difficult to find and repair in a timely fashion. It was often the case that after

a few days of carrying out a program, I would discover a new exception in some

part of the analysis that needed to be taken into account. Many times these

error detections required the procedure to completely restart. Originally, I

planned on analyzing every year between 1994 and 2012, but this soon became

infeasible. Similarly, I was forced to greatly reduce the sample size of

Newspapers to ensure the analysis was completed in time.

The text pages containing the Congressional Record are spread across

approximately 15,000 URLs per year, so a necessary first step is to collect all

the links in a text file. After this is complete, an automated Ruby/Ruby on

Rails script employs the Watir:Webdriver gem to extract every page of text

from the URLs. With some exceptions, each page is formatted the same way:

any time a new Congress Member begins to speak, a new paragraph is started

with ‘Mr.’, ‘Ms.’, or Mrs. followed by the person’s last name in upper case

letters. The following the people and instances are exceptions to the rule that

were periodically discovered and implemented:

• Officers (Speaker, Speaker Pro Tempore, President, President Pro Tempore, Clerk, Vice President, Acting Vice President, Chairman

38

• Hyphenated last names

• Multi-word last names

• Names shared by other legislators

• Prefixes in which one letter is lowercase

• Introducing Amendments

• Requesting additional time

• Roll Call Votes

• Speaking during readings

• Addressing the chair

When the Congressional Record is parsed into two-word phrases, I use a

Stemmer algorithm to strip words down to the root. Because the stemmer

cannot be reversed, it is necessary to also store the original text of a given

stem. To this end, a phrase being used for the first time will have both the

stem and the complete phrase stored in the database. From this point on the

complete text of the stem is set: additional uses of the stem do not affect the

associated complete phrase, so the newspapers are searched for the first

iteration of the stem found in the Congressional Record. Fortunately, many of

the phrases selected for analysis do no exhibit significant linguistic variation.

Because there is a possibility that certain phrases involving an extremely

common word are nonsensical with the common word removed, the

newspapers are not searched for only the phrases as they appear in the

database. Rather, the search constraint search for all combinations of the

words in the phrase that have no more than one word separating them.

39

Appendix B: Words Removed From Analysis

a about

absence absent

accordance act

act sec action

activities activity

add adding adjourn

administration

affairs after

agreed all

also am

amend amendment

america owes an

and any

appears appropirated appropriat

appropriations

april are

article as

ask asked

assemble assemebled

at august

authorize authorized

balance be

because of illness been

before bill

budget business to

come business

today but by can

carolina chair

chairman clause clerk

clinton close of

business colleague comments

commission committee concurrent

confere congress

congressional consent consider contents

could dakota

date day days dear

debate december definition

detail details district divided

do does due each

en bloc enact

enacted equally divided

establish established

et seq every man

except extend extent

february federal debt

first florida follow

followed following follows

for forth friday friend friends from

gentleman get

given permission

go going good has

have he

hearing hearings

her hereby

him his

house i

i ask if ii iii

immediate consideration

in include includes

insert inserted

into it its

january july june just

know ladies

gentleman later

leader leadership legislation legislative

lieu like look

madam majority

make man woman

child march may me

meet meet during meet session

member members mention minority minute

modified modify monday

more more morning business motion

my myself nays

new jersey new york

no not

november now

object objection october

of

on one only or

order ordered other ought our out

paragraph pargraph pending people

per per capita

percent period please

precede preceding president presiding presiding

Officer previously

pro tempore procedure procedures

proceed provision purpose

pursuant put back quorum

read recess

recognize reconsider

record reform reject

remark report

representative

request reserve

reserved reserving the

right

resolution resume revise

rhode island rollcall

rule said

saturday say

secretary section

see semicolon

senate senator senators

september session sessions

shall she

should sincerely

so some

speaker stand

statement statements

states stood

subdivision submit

subsection such

sunday support suspend

table table en

take technical tempore

testimony than

thank thank you

that the

their them

then there these they this

through thursday

time title to

today tomorrow

too total

tuesday unanimous unanimous

consent under united

up upon

us use later in

the day va

vote votes was we

wednesday west virginia

what when which who will with

withdraw would year yeas yet

yield you your

Documents

Phrasing and its Ideological Implications€¦ · Phrasing and its Ideological Implications: Using Natural Language Processing to Measure Political Affiliation Alexander Williamson