Quantifying Cultural Change: An Application to Misogyny in Music

Source: faculty.wharton.upenn.edu


Quantifying Cultural Change:

An Application to Misogyny in Music

Contribution Statement

Consumer researchers have long been interested in culture and cultural change, but

measurement has proved challenging. It is difficult to acquire data over time, and some things,

particularly the relationships between concepts (e.g., brand personality or cultural stereotypes), are

difficult to measure. This paper suggests that an emerging computational linguistics method,

word embeddings, can help address some of these challenges. In particular, we demonstrate how

this approach can be used to address a longstanding question: whether music lyrics are

misogynist (i.e., exhibit a dislike of, contempt for, or ingrained prejudice against women).

Natural language processing of a quarter of a million songs over 50 years provides an empirical

test of whether music is biased against women and how these biases have changed over time.

While both genders are equally likely to be objects of aggression, subtler machine learning

approaches paint a more complex picture. Compared to men, women are less likely to be

associated with desirable traits (i.e., competence). While this bias has decreased, it persists.

Ancillary analyses suggest that lyrics more broadly have become less gendered (though remain

gendered) and that temporal changes may be driven by male artists (as female artists were less

biased initially). Overall, the results shed light on subtle measures of bias and how natural

language processing can provide deeper insight into cultural change.

Abstract

While researchers have long been interested in culture and cultural change, quantification has

proven difficult. Many have argued that music is misogynistic, for example, but is that actually

true? And have any such biases changed over time? Natural language processing of a quarter of a

million songs over 50 years tries to address these questions. While both genders are equally

likely to be objects of aggression, subtler machine learning approaches paint a more complex

picture. Compared to men, women are less likely to be associated with desirable traits (i.e.,

competence). While this bias has decreased, it persists. Ancillary analyses suggest that lyrics

more broadly have become less gendered (though remain gendered) and that temporal changes

may be driven by male artists (as female artists were less biased initially). Overall, the results

shed light on subtle measures of bias and how natural language processing can provide deeper

insight into cultural change.

Keywords: Natural language processing, cultural evolution, music, stereotypes, misogyny.

Across a range of disciplines, researchers have long been interested in cultural change.

Consumer researchers want to understand the diffusion of innovations (Bass 1969), acceptance

of consumption practices (Humphreys 2010), and shifting of brand perceptions and personalities

over time. Management scholars are interested in the adoption of management practices and the

erosion of culinary categories (Rao, Monin, and Durand 2005). Sociologists want to understand

fashion cycles (Lieberson 2000; Simmel 1957), psychologists want to understand the persistence

of stereotypes (Schaller, Conway, and Tanchuk 2002; Schaller and Crandall 2003), and

sociolinguists want to understand the abandonment of dialects (Wolfram, Reaser, and Vaughn

2008).

But while cultural change is of broad interest, quantification is a key challenge.

Acquiring data over time is often difficult. Getting data on the popularity of a given fashion style

or linguistic marker is not trivial. Further, measurement is not always straightforward. It’s one

thing to measure new product adoption, baby name popularity, or the incidence of a particular

term in the news. Such discrete units either increase in usage or do not. But tracking more

complex things like brand personalities or the persistence of stereotypes is more difficult.

Studies usually rely on surveys, which work fine at individual time points, but are tough to

collect over longer time horizons. Further, while researchers can try to manually code archival

data, this method is often subjective and difficult to scale.

We suggest that an emerging machine learning approach may provide a powerful tool to

address some of these challenges. More and more textual data is available in a range of domains,

and natural language processing provides a useful key to unlock a range of insights (Humphreys

and Wang 2018). We examine how word2vec (Mikolov, Sutskever, Chen, Corrado, and Dean

2013), a well-adopted word vector representation technique, can shed light on questions

regarding cultural change that might otherwise be difficult or impossible to address.

In particular, we apply this technique to address an area of longstanding debate:

misogyny in music. While academics and cultural critics have long argued about whether music

lyrics are prejudiced against women (e.g., Cooper, Cooper and Faragher 1985), there have been

few empirical tests. Further, the empirical work that does exist is mostly based on small samples

or one genre over a short period of time (Armstrong 2001; Harding and Nett 1984) and relies on

somewhat subjective ratings.

To address these challenges, we collect more than a quarter of a million songs from six

music genres over more than 50 years. Then, we use word embeddings, as well as some simpler

natural language processing techniques, to investigate potential gender biases and whether they

have changed over time.

We make three main contributions. First, most narrowly, we speak to the longstanding

debate about misogyny in music. We provide the first large-scale analysis in this area, allowing

us to address (1) whether music is misogynist, (2) if this has varied over time, and (3) how this

may vary by genre.

Second, we showcase how an emerging machine learning method can shed light on

cultural change more generally. While our empirical application is misogyny in music, similar

methods could provide insight into changes in brand personality, public perception of politicians,

and the universality of emotions. We illustrate how the approach can be used, and then discuss

broader applications.

Third, we highlight the value of additional computational linguistic techniques for

consumer research. While more and more researchers are starting to use Linguistic Inquiry and

Word Count (LIWC, Pennebaker, Boyd, Jordan, and Blackburn 2015) and similar dictionary-

based methods, other approaches have received less attention. We discuss a number of ways

these approaches can shed light on a range of questions.

MISOGYNY IN MUSIC

Gender bias is pervasive (Carlana 2019; Garg, Schiebinger, Jurafsky, and Zou 2018;

Hamberg 2008; Kesebir 2017; Larivière, Ni, Gingras, Cronin, and Sugimoto 2013; Lundberg and

Stearns 2019; Moss-Racusin, Dovidio, Brescoll, Graham, and Handelsman 2012; Reuben,

Sapienza, and Zingales 2014; Shor, van de Rijt, Miltsov, Kulkarni, and Skiena, 2015 ). Across a

range of disciplines (e.g., business, science, and medicine) and outcomes (e.g., hiring, evaluation,

and recognition), women are often perceived less favorably and treated less fairly. Teachers’

gender stereotypes influence student performance (Carlana 2019), for example, and the same job

applicant is seen as more competent and offered a higher starting salary if they have a male

rather than a female name (Moss-Racusin et al. 2012).

One reason biases may be so sticky is that they are continually reinforced through culture.

Culture is often conceived of as cultural background, and indeed, the particular country or social

group people grow up in has an important impact on attitudes, cognitions, and behaviors (Markus

and Kitayama 1991). Compared to East Asians, for example, Americans tend to prefer more

unique options (Kim and Markus 1999) and are more persuaded by promotion-focused

information (Aaker and Lee 2001).

Beyond cross-cultural differences, though, individual cultural items themselves also

shape attitudes and cognitions (Kashima 2008). Songs, books, and other cultural tastes and

practices not only reflect the setting in which they were produced, but also shape the attitudes

and behaviors of the audiences that consume them (Anderson, Carnagey, and Eubanks 2003;

Berger et al. 2019; Brummert Lennings and Warburton 2011). Song lyrics that are aggressive

towards women, for example, or portray them negatively, increase anti-female attitudes and

misogynous behavior (Fischer and Greitemeyer 2006). Lyrics that espouse equality, however,

can boost attitudes towards women and encourage pro-female behavior (Greitemeyer,

Hollingdale, and Traut-Mattausch 2015). Consequently, one reason stereotypes and biases, as

well as attitudes more generally, may be so persistent is that they are continually reinforced by

the cultural items (e.g., songs, books, and advertisements) that consumers experience on an

everyday basis.

But while such cultural items clearly have impact, their actual nature is less transparent.

Consider music. Are song lyrics biased against women? Do any such biases vary across genres?

And have they changed over time?

Attempts to answer such questions have been hampered by issues of scale and

measurement. While researchers in a range of disciplines have argued about whether music lyrics

are misogynist (i.e., exhibit a dislike of, contempt for, or ingrained prejudice against women),

most perspectives are based on small samples or one genre over a short period of time (Adams

and Fuller 2006). Harding and Nett (1984), for example, examined 40 rock song covers and

Armstrong (2001) examined 13 rap artists. While a couple of papers have examined slightly larger

samples in one genre (e.g., 340 rap songs from 1979 to 1997, Herd 2009) or cross-genre samples

over a short time period (e.g., 600 songs over four years, Flynn, Craig, Anderson, and Holody

2016), without more comprehensive data it is difficult to draw strong conclusions. One genre

may not be representative of music as a whole, and without examining longer time horizons, it is

hard to know whether bias has shifted over time. Truly examining misogyny in music requires

analyzing at least tens of thousands of songs over multiple genres across multiple decades.

Further, even if one were able to compile such a dataset, measuring misogyny would be

challenging. It’s one thing to read and rate the lyrics of a few dozen songs, and multiple research

assistants could even rate a few hundred, but having people rate lyrics for tens of thousands of

songs would be prohibitive.

Finally, because existing analyses rely completely on human judgment, they are

susceptible to bias. The same lyrics, for example, may seem more or less misogynistic depending

on the gender of the person reading them.

THE CURRENT RESEARCH

To address these challenges, we take a different approach. Rather than having individuals

read song lyrics and manually rate them based on how misogynistic they seem, we use

automated textual analysis.

Recent work has begun to highlight the value of using automated textual analysis in

consumer research (Humphreys and Wang 2018; Netzer, Feldman, Goldenberg, and Fresko

2012; Netzer, Lemaire, and Herzenstein 2019, Moore and McFerran 2017; Packard, Moore, and

McFerran 2018; Rocklage and Fazio 2015; Rocklage, Rucker, and Nordgren 2018). Rather than

requiring individuals to manually read through text, these computerized approaches extract a

variety of textual features automatically. Researchers have used these approaches to measure

things like linguistic mimicry in online word of mouth (Moore and McFerran 2017), pronoun use

in customer service calls (Packard et al. 2018), and emotional language in persuasion (Rocklage

et al. 2018). Automated textual analysis is particularly useful because it can parse large quantities

of textual information, relatively quickly, and can do so in an objective manner.

Most research in this area has relied on a class of approaches that can be described as

word extraction. At a basic level, these approaches capture how often a given entity (e.g., word

or phrase) appears in a text. Dictionaries like Linguistic Inquiry and Word Count (LIWC,

Pennebaker et al. 2015), for example, allow researchers to measure how often first-person

pronouns, positive emotion words, or other types of language show up. Tools like Evaluative

Lexicon (Rocklage et al. 2018), Hedonometer (Dodds and Danforth 2010), and VADER (Hutto

and Gilbert 2014) use similar methods and researchers can even create their own customized

dictionaries to use in a given application (Humphreys and Wang 2018).
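To make the word-extraction idea concrete, the counting logic can be sketched in a few lines of Python. The mini-dictionaries here are invented for illustration and are far smaller than the actual LIWC categories; `dictionary_rates` is a hypothetical helper, not part of any of the tools above.

```python
import re
from collections import Counter

# Illustrative mini-dictionaries (not the actual LIWC categories).
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
POSITIVE_EMOTION = {"love", "happy", "sweet", "good"}

def dictionary_rates(text):
    """Return the share of words falling into each dictionary category."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    counts = Counter()
    for w in words:
        if w in FIRST_PERSON:
            counts["first_person"] += 1
        if w in POSITIVE_EMOTION:
            counts["positive_emotion"] += 1
    return {k: v / total for k, v in counts.items()} if total else {}

rates = dictionary_rates("I love my happy life, I really do")
```

Real dictionary tools work the same way at heart: normalize the text, match tokens against category word lists, and report per-category incidence rates.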

But while word extraction is useful in a variety of applications, it has some limitations.

Taken to the context of misogyny, for example, one could certainly measure the number of times

certain pejorative terms for women show up in songs over time, but that, by itself, wouldn’t tell

you how those terms are being used. The word b*tch, for example, could be used to refer to a

man, a woman, or potentially even an object.

Some work by marketing scholars has begun to leverage another class of approaches

called topic extraction. Similar to how factor analysis might be used to identify underlying

themes among survey items, tools like topic modeling (e.g., Latent Dirichlet Allocation or LDA,

Blei, Ng, and Jordan 2003) can identify the general topics or themes discussed in a body of text,

as well as the words that make up those themes. Country songs, for example, talk a lot about the

themes of “girls and cars” (e.g., words like car, drive, girl, and kiss) and “uncertain love” (e.g.,

words like love, ain’t, and can’t; Berger and Packard 2018). Toubia, Iyengar, Bunnell, and

Lemaire (2019) use a similar approach to find the main themes discussed in movies.

But while topic extraction is useful for extracting themes from a large number of

documents, it still doesn’t say much about the relationship between entities: whether women are

talked about differently than men, for example, or whether certain brands are often described as

having particular traits.

For these types of questions, an emerging computational linguistic approach called word

embeddings is particularly useful (Devlin, Chang, Lee, and Toutanova 2018; Mikolov et al.

2013; Pennington, Socher, and Manning 2014). As discussed in more detail in the methods, this

machine learning framework represents each word by a vector, and uses a high-dimensional

space to map each word or entity based on the other words with which it frequently appears. The

relationship between these vectors then captures the semantic relationship between words.

Researchers can use this space to understand the relationship between words and the context in

which they are used.

We use word embedding to examine whether (and how) men and women are talked about

differently in music and whether any such biases may have shifted over time. We start by

compiling a dataset of more than a quarter of a million songs from six music genres over more

than 50 years, orders of magnitude larger than any previous dataset used to examine misogyny.

Next, we use several natural language processing approaches to examine potential biases.

Before turning to the more complex word-embeddings approach, we begin with a simpler

analysis. Since some work has measured misogyny using aggression towards women (Fischer

and Greitemeyer 2006) we start by identifying verbs that express aggression (e.g., hate) and

measuring the degree to which each gender is the recipient of aggressive thoughts and actions.

Gender bias can also be more subtle. Compared to men, for example, women are often

described as warm (e.g., kind and supportive), but less likely to be described as competent (e.g.,

smart and ambitious, Fiske, Cuddy, Glick, and Xu 2002). Consequently, we use word

embeddings to measure this subtler form of misogyny. Not whether lyrics are explicitly

aggressive towards women, but whether women are less likely to be linked with desirable traits.

Finally, to begin to shed light on what might be driving any observed patterns, we

examine artist gender. We examine whether the prevalence of female artists has shifted over

time, whether male and female artists use different language, and how any shifts in their

language are related to overall changes in misogyny.

We close with a broader discussion of how word embeddings, and natural language

processing more generally, may be used to inform a wide range of consumer research questions.

EMPIRICAL ANALYSIS OF A QUARTER OF A MILLION SONGS

Data

First, we compiled songs and their lyrics. Given copyright issues, there are limited open-

access lyrics datasets, so we compiled information from different places. We started by scraping

the Billboard website, pulling all songs on each of the major charts (i.e., pop, rock, country,

R&B, dance, and rap) every 3 months from 1965 (or whenever a given chart started) to 2018. We

matched these songs with lyrics from SongLyrics.com. To collect additional songs, we gathered

all the songs and their lyrics from datasets on kaggle.com (Agarwal 2017; Kuznetsov 2017;

Mishra 2017) and scraped all available songs and their lyrics from a major lyrics website. Then,

we used the million songs dataset (Mahieux, Ellis, Whitman, and Lamere 2011) and freeDB

(“FreeDB” 2018) to append year and genre for any song that did not already include this

information. We removed any song with missing information and used artist and title

combination to remove duplicates in the data.
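The deduplication step might look something like the following sketch; the field names and the `dedupe_songs` helper are illustrative assumptions, since the paper does not describe its data schema.

```python
def dedupe_songs(songs):
    """Drop songs with missing information, then keep the first
    occurrence of each (artist, title) combination, as described above.
    The dictionary keys "artist"/"title" are assumed for illustration."""
    seen, unique = set(), []
    for song in songs:
        artist, title = song.get("artist"), song.get("title")
        if not artist or not title:
            continue  # remove songs with missing information
        key = (artist.strip().lower(), title.strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(song)
    return unique

catalog = [
    {"artist": "A", "title": "Song"},
    {"artist": "a", "title": "song "},   # duplicate after normalization
    {"artist": "B", "title": None},      # missing info, dropped
    {"artist": "B", "title": "Other"},
]
deduped = dedupe_songs(catalog)
```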

The final dataset includes 258,937 songs from 1965 to 2018. It includes pop, rock,

country, R&B, dance, electronic, rap, and hip-hop genres. Given data sparsity, and the similarity

between rap and hip-hop, as well as between dance and electronic, we combine each of these

pairs into a single genre.

Second, we cleaned the lyrics. Since we are focused on English language lyrics, we used

the langdetect package (version 1.0.7) in Python to remove any non-English lyrics. We also

removed non-informative texts in brackets, such as [Verse 1] or [Pre-Chorus 1]. Finally, we

shifted all lyrics to lower case so that variations of the same word are treated similarly (e.g., Man

and man).
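A minimal sketch of the bracket-removal and lowercasing steps described above (the langdetect language filter is omitted here to keep the sketch dependency-free, and `clean_lyrics` is a hypothetical name):

```python
import re

def clean_lyrics(raw):
    """Strip non-informative section tags like [Verse 1] and lowercase.
    (The paper also drops non-English songs using the langdetect
    package, a step omitted from this dependency-free sketch.)"""
    no_tags = re.sub(r"\[[^\]]*\]", "", raw)          # remove [Verse 1], [Pre-Chorus 1], ...
    collapsed = re.sub(r"\s+", " ", no_tags).strip()  # tidy leftover whitespace
    return collapsed.lower()

cleaned = clean_lyrics("[Verse 1] Man, oh Man [Pre-Chorus 1] she said MAN")
```

Lowercasing up front means variants like "Man" and "man" map to a single token before any counting or model training.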

Third, to identify words associated with different genders, we rely on word lists from

prior work. To identify words related to men (e.g., he and him) and women (e.g., she and her) we

used word lists from Garg et al. (2018). While LIWC (Pennebaker et al. 2015) includes a wider

list of words (e.g., godmother, schoolgirl, grandson, and son-in-law), many show up quite

infrequently in the data, so averaging across them would lead to inaccurate estimates.

Consequently, we used Garg et al.’s (2018) more focused list.

Data Preparation. To ensure we have enough data in each period to train the models, we

use 5-year time buckets. Bucket structure was selected to ensure enough songs per data point to

train the models (including at least half a million words), as well as enough overall data points to

examine changes over time.

One question is how to balance songs from different genres. Some genres happen to have

greater representation in the dataset, and we address this in two ways. First, to ensure equal

representation from each genre, most of our analyses use under-sampling, a common machine

learning practice when datasets include imbalanced class labels. Assume data from three genres

g1, g2, and g3, where g1 has the smallest number of songs. To generate a balanced sample for the

cross-genre analyses, we randomly select |g1| songs from genres g2 and g3 and perform the

analyses on that sample. We repeat this sampling and analysis 100 times, averaging across the

runs. Substantial numbers of dance and rap songs do not enter the data until 1980 and 1990,

respectively, so we only include them (and sample accordingly) after those time points.
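The under-sampling procedure described above might be sketched as follows; `undersample` is a hypothetical helper, and the toy genre lists stand in for the real song data.

```python
import random

def undersample(genre_songs, seed=None):
    """Balanced sample: draw |smallest genre| songs from every genre.
    genre_songs maps genre name -> list of songs (schema is illustrative)."""
    rng = random.Random(seed)
    n = min(len(songs) for songs in genre_songs.values())
    return {g: rng.sample(songs, n) for g, songs in genre_songs.items()}

# The paper repeats the draw 100 times and averages the analysis across runs,
# roughly: results = [analyze(undersample(data)) for _ in range(100)]
data = {"g1": list(range(10)), "g2": list(range(50)), "g3": list(range(80))}
balanced = undersample(data, seed=0)
```

Sampling without replacement down to the smallest class is the standard way to keep an over-represented genre from dominating a cross-genre estimate.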

Second, one could argue that certain genres are less popular and so equal weighting does

not actually reflect reality. To address this possibility, as a robustness check we collect data on

genre popularity over time and weight the genres by these estimates. As shown in the Robustness

Checks section, results remain the same.

Methods and Results

Methods for Object of Aggression. Some work has measured misogyny using aggression

towards women (Fischer and Greitemeyer 2006) so we start by identifying verbs that express

aggression (e.g., hate) and measuring the degree to which women and men are the recipient of

aggressive thoughts and actions. In phrases like “I hit her”, for example, “her” is the recipient

(i.e., object) of the aggressive verb “hit.” To identify verbs commonly used for expressing

aggression (e.g., hate and burn) we use the Violence Vocabulary Word List

(https://myvocabulary.com/word-list/violence-vocabulary/). We use the spaCy package (version

2.0.18) in Python to extract word dependencies in sentences and calculate the frequency of a

gender being used as an object of an aggressive verb.
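As a rough illustration only: the sketch below approximates the idea by counting gendered pronouns that immediately follow aggressive verbs. The actual analysis uses spaCy’s dependency parser and the full violence word list, so this adjacency shortcut is a simplification, and the word lists here are invented for the example.

```python
import re
from collections import Counter

# Small illustrative lists; the paper uses the full Violence Vocabulary
# Word List and true grammatical objects extracted with spaCy.
AGGRESSIVE_VERBS = {"hate", "hit", "hurt", "burn"}
FEMALE_OBJECTS = {"her"}
MALE_OBJECTS = {"him"}

def aggression_object_counts(lyrics):
    """Count how often each gender directly follows an aggressive verb,
    a rough stand-in for dependency-based object extraction."""
    words = re.findall(r"[a-z']+", lyrics.lower())
    counts = Counter()
    for verb, obj in zip(words, words[1:]):
        if verb in AGGRESSIVE_VERBS:
            if obj in FEMALE_OBJECTS:
                counts["female"] += 1
            elif obj in MALE_OBJECTS:
                counts["male"] += 1
    return counts

counts = aggression_object_counts("I hit her then I hate him and hurt her")
```

A dependency parse is needed in practice because the object of a verb is often not adjacent to it ("hit the girl I loved her" style constructions defeat simple adjacency).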

Results for Object of Aggression. While women are more likely to be recipients of

aggression in songs today than in the 1960s (β = 1.45, p < 0.001), this ignores the possibility that

aggressive acts could have generally increased over time and affected both genders equally.

Indeed, men are also more likely to be recipients of aggression in recent years (β = 1.70, p =

0.07). Looking over time shows that women (47.6%) and men (52.4%) are recipients of

aggression a similar amount overall (χ2(1) = 2.50, p = 0.11) and that this does not change

significantly over the years (β = -0.0026, p = 0.86).

One could wonder whether the results are somehow driven by the specific way in which

the outcome measure was calculated, so to test robustness, we also use an alternate approach. We

calculate the difference between women and men being recipients of aggression while controlling

for how often they are recipients of any thought or action (e.g., “I kissed him” or “I kissed her”).

Consistent with the main analysis, however, this difference between the ratios does not change

over time (β = -0.0001, p = 0.77).

Even misogynists might not explicitly sing about wanting to hurt women, however, for

fear that it would restrict their audience. More generally, if misogyny does exist it may be more

implicit.

To address these limitations, we next use a state-of-the-art machine learning approach to

examine a subtler form of misogyny. Not whether lyrics are explicitly aggressive towards

women, but whether women are less likely to be linked with desirable traits.

Methods for Gender-Trait Association. Competence and warmth are two universal

dimensions of social cognition (Fiske, Cuddy, and Glick 2007), but while women are often

described as warm (e.g., kind and supportive), they are less often described as competent (e.g.,

smart and ambitious).

We use word2vec (Mikolov et al. 2013), a well-adopted word vector representation

technique, to quantify how likely women (relative to men) are to be linked to competence, as

well as other traits (i.e., intelligence, warmth, masculine stereotypes, and feminine stereotypes),

over time.1

Word2vec is a two-layer neural network which receives a corpus of text and transforms it

into numerical vectors. This approach assigns each word a high-dimensional vector such that the

relationship between vectors captures the semantic relationship between the words. Words that

relate to one another such as fruit (e.g., apple and orange) or vehicles (e.g., car and truck) appear

close together, but words that are not as related (e.g., apple and truck) appear further apart.

To train word embeddings, word2vec takes into account word co-occurrence, distance,

and occurrence in similar contexts. If “dogs are smart” shows up frequently in a body of text, for

1 We also tested several other well-known methods on a small random subset of our data and found that word2vec was most appropriate. We compared truncated SVD (a.k.a. latent semantic analysis), Positive Pointwise Mutual Information (PPMI), GloVe (Pennington et al. 2014), and word2vec. The preliminary results and the consistency of word2vec’s output, along with previous research (Sahlgren and Lenci 2016), showed that word2vec using CBOW is the appropriate method for our question.

example, that would shrink the relative distance between the vector for dog and the vector for

smart. Beyond just incidence, though, the distance between occurrences of words also

matters. Even though both “dogs are smart” and “dogs love running and are also smart” contain

the words dog and smart, the first example has them closer together, which decreases the

vectors’ distance in word2vec space. Note that two words do not necessarily have to co-occur;

occurring in similar contexts also shapes similarity. If “dogs are animals” and “animals are

smart” both appear frequently, it decreases the distance between the vectors for “dogs” and

“smart” even if the two words never appear together.

The difference between such vectors has been shown to be a powerful tool for studying

language and human perceptions about the relationship between words. Bhatia (2017), for

example, demonstrated that vectors are closer together for words that are similar or used in

similar contexts. Consequently, recent work has begun to use similar methods to measure

stereotypes. Work using the Google Books corpus (Lin, Michel, Aiden, Orwant, Brockman, and

Petrov 2012), for example, found that word embedding associations (i.e., the distance between

male and female words and target words) captured human ratings of whether those target words

were more associated with men or women (r > 0.76, p < .001, Kozlowski, Taddy, and Evans

2019). Other research using that corpus (Garg et al. 2018) found that word embedding

associations between words related to women and different occupations words tracked the

percentage of women in each of those occupations over time (r = 0.70, p < .001). Consequently,

a great deal of research in a variety of contexts has shown that word embeddings are a powerful

method for capturing people’s perceptions of the relationship between words, and those

relationships over time.

We use a similar approach to measure gender bias in music. We quantify the relationship

between words related to each gender, and key dimensions of interest (e.g., competence), over

time. First, we used word lists from prior work (Garg et al. 2018) to identify words related to

men (e.g., “man,” “men,” and “he”) and women (e.g., “woman,” “women,” and “she”; see table A1

for full list). Second, we use word2vec to train a word embedding model per time-bucket and

generate 300-dimensional real-valued word vectors. That means for each word (e.g., woman) and

period in the dataset (e.g., 1965-1970), we use the set of song lyrics over that period to position

the vector for that word. Third, we take the word vector representation of words in our gender

word lists and average the vectors to obtain a single 300-dimensional vector for each gender.

Fourth, we use a similar approach to obtain vectors for key dimensions of interest (e.g.,

competence). In the case of competence, for example, we use a list of word from prior work

(Nicolas, Bai, and Fiske 2019) that are known to relate to competence (e.g., smart, persistent, and

knowledgeable, see table A1 for more examples).

Fifth, to measure the association, we capture the similarity between gender vectors and

key dimension vectors (e.g., competence) at any given period in time. We use the cosine similarity

metric because it is frequently used in the literature and in the original word2vec article

(Mikolov et al. 2013). If a gender word and a competence word occur in similar contexts, for

example, their vector representations are close in the word2vec space, which results in a larger

cosine similarity score and indicates that the two words are highly associated. Conversely, if the

words do not occur in similar contexts, the cosine similarity is smaller, and the words are

considered less associated.

To give a sense of how this approach works, Figure 1 provides a simplified version. We

only show two dimensions for the sake of illustration, but words are represented with 300-


dimensional vectors in our analyses.


FIGURE 1. ILLUSTRATION OF THE APPROACH IN A TWO-DIMENSIONAL

VECTOR SPACE

As shown in the figure, semantically similar words (e.g., the group related to men or the

group related to women) appear close to one another. As discussed, we create a single vector for

each gender and compare how close it is to words representing key dimensions of interest (e.g.,

the vector for the word “successful” which relates to competence). The figure shows that

compared to the female vector, the male vector is closer to “successful,” indicating that

successful is more associated with men.

Formally, to calculate similarity of dimension d (e.g., competence) to gender g, we

calculate the cosine similarity of each word vector in W_d to the gender vector V_g, given that W_d is a

matrix of n by 300, where n is the number of words in dimension d and 300 is the vector size.2 To

calculate how biased dimensions are, for each dimension word, we subtract cosine similarity

between the dimension word vector and the vector representing female from the cosine similarity

between the dimension word vector and the vector representing male. Positive values indicate

that dimension word is more associated with males, negative values indicate the opposite. This

2 While we create a vector for gender words following Garg et al. (2018), we do not do so for competence because, while gender words all capture similar semantics, competence words may not. Stubborn and ambitious are both on the competence list, for example, but have very different meanings. Averaging over gender word vectors gives us a vector representing that gender, but averaging over tens of dimension words would add noise.


difference score also removes any general time trends that affect both genders equally (e.g.,

maybe more recent songs talk about competence more in general):

$$d \in \{\text{competence},\ \text{intelligence},\ \text{warmth},\ \text{masculine stereotype},\ \text{feminine stereotype}\}$$

$$\text{bias}_d = \frac{1}{n}\sum_{i=1}^{n}\left(\text{cosine similarity}\left(\vec{W}_{d_i}, \vec{V}_{male}\right) - \text{cosine similarity}\left(\vec{W}_{d_i}, \vec{V}_{female}\right)\right)$$

In addition to prior work, the Validation section below demonstrates that such difference scores

capture gender biases.3
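The bias score can be sketched directly from its definition (a toy two-dimensional illustration with invented vectors, not trained embeddings):

```python
import math

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def dimension_bias(dimension_word_vectors, v_male, v_female):
    """bias_d: mean over dimension words of cos(w, V_male) - cos(w, V_female).
    Positive values -> the dimension is more associated with men."""
    diffs = [cos_sim(w, v_male) - cos_sim(w, v_female)
             for w in dimension_word_vectors]
    return sum(diffs) / len(diffs)

# Toy 2-dimensional vectors (invented for illustration).
v_male, v_female = [0.9, 0.1], [0.1, 0.9]
competence_vectors = [[0.8, 0.3], [0.7, 0.4]]  # stand-ins for "smart", "persistent"
bias = dimension_bias(competence_vectors, v_male, v_female)
print(bias > 0)  # True: these competence vectors sit closer to the male vector
```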

Finally, we analyze whether the associations change over time. We use linear mixed

effect models with time as the independent variable and bias as the dependent variable. Given

that association with gender may vary across words, we add a random effect for words. We run

separate linear mixed effect models for individual dimensions (e.g., competence) and genres

(e.g., rock).4
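As a simplified stand-in for these models (ordinary least squares on one series of bias scores, omitting the per-word random effects our actual models include), the time trend can be sketched as:

```python
def ols_slope(x, y):
    """Least-squares slope of y on x. A simplified stand-in for the fixed
    effect of time; the paper's mixed effect models also include a random
    intercept per word, which this sketch omits."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

# Toy data: a bias score per 5-year bucket, shrinking toward zero over time.
periods = [0, 1, 2, 3, 4]                    # e.g., 1965-70, 1970-75, ...
bias_scores = [0.40, 0.31, 0.22, 0.15, 0.10]
slope = ols_slope(periods, bias_scores)
print(slope < 0)  # True: the male-female gap narrows over these toy periods
```

A negative slope corresponds to the dimension becoming less associated with men over time.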

Results for Gender-Trait Association. Results suggest that while lyrics have generally

become less misogynistic over time, they remain biased.

3 Note that we do not analyze changes in the popularity of competence words over time but how connected competence words are to each gender. Rather than looking at whether more recent songs mention the word “smart,” for example, we are interested in whether, when songs use the word smart, they are using it to refer more to men or women.

4 One might wonder why we did not use word embeddings for the aggressive term analyses. Given that an aggressive act has both a subject and an object (e.g., “she hit him” or “he hit her”), being closer to aggression by itself does not distinguish whether women are the subject or object of such acts. Consequently, looking at subject or object directly seemed more appropriate.


FIGURE 2. ASSOCIATION BETWEEN GENDER AND COMPETENCE

Note: Dashed line represents equal association with either gender and values greater (less) than 0 indicate the dimension is more associated with men (women). Grey regions represent 95% confidence intervals around the estimates for each period.

Words related to competence (e.g., smart) have become more associated with women,

from strongly biased towards men to less so (β = -0.53, p < 0.001, β2 = 0.54, p < 0.001, figure

2). That said, gains level off in the late 1990s, and confidence intervals do not overlap with zero

in recent years suggesting that competence remains associated with men. Results remain similar

through a number of robustness checks and words related to intelligence (e.g., precocious and

inquisitive, Garg et al., 2018) show similar results (β = -0.40, p < 0.01, β2 = -0.59, p < 0.001,

figure A5).

One could wonder whether the leveling off is driven solely by the introduction of new

genres. Rap and dance show up later in the dataset, for example, so if they are more misogynistic

than other genres, that could lead to a leveling off. Ancillary analyses (Web Appendix),

however, show that this alone does not explain the pattern. While the introduction of rap and

dance in the 1980s contributed, even focusing on genres that existed previously (i.e., pop, rock,

and country) shows similar patterns. More generally, examining different genres separately (web

appendix) suggests that while competence has become more associated with women in pop,


rock, and country, such associations have not changed in R&B and dance. In rap lyrics,

competence has become less associated with women.

For completeness, we also perform the same type of analysis for warmth related words

(e.g., kind, friendly, and caring, Nicolas et al. 2019, see table A1 for more examples). Results

indicate that warmth has become less uniquely associated with women (β = 0.86, p < 0.001, β2 =

-0.53, p < 0.001, figure 3).

FIGURE 3. ASSOCIATION BETWEEN GENDER AND WARMTH, MASCULINE, AND FEMININE STEREOTYPES.

Note: Dashed line represents equal association with either gender and values greater (less) than 0 indicate the dimension is more associated with men (women). Grey regions represent 95% confidence intervals around the estimates for each period.

Further analysis of words associated with masculine stereotypes (e.g., leader and

ambitious) and feminine stereotypes (e.g., cooperate and trustworthy, Gaucher, Friesen, and Kay

2011, see table A1 for full lists) suggests that lyrics may have become somewhat less gendered

overall (figure 3).5 Words related to masculine stereotypes have become less associated with men

5 One might wonder whether lyrics becoming less gendered (i.e., less associated with either gender) is good. While a decrease in gendered associations is clearly good in some situations (e.g., competence moving from being more associated with men to less so), in others it is less obviously beneficial (e.g., warmth becoming less associated with women). Thus, whether it is perceived as good or bad seems to depend less on the change itself and more on whether the trait is seen as positive and originally associated with women or not. Consequently, we use the term gendered rather than biased to be agnostic about the valence of the change.


(β = -0.69, p < 0.001, β2 = 0.41, p < 0.001) and words related to feminine stereotypes have

become less associated with women (β = 0.17, p = 0.18, β2 = 0.26, p = 0.03). For both warmth

and masculine stereotypes, though, gains level off in the early 1990s (see web appendix for more

details).

Artists Gender Analysis. One might wonder what is driving the observed changes. Are

the linguistic shifts similar for male and female artists, for example? Or might they simply reflect

the inclusion of more female artists over time who may use less misogynistic and gendered

language?

To begin to address these questions, we analyze artist gender. To infer

artists’ gender, we use the gender_guesser package (version 0.4.0) in Python along with a national

name list (Kaggle 2017). For individual artists (e.g., Katy Perry or Billy Joel) we simply use their

name, but for artists that go by pseudonyms (e.g., Snoop Dogg) or bands that have multiple

members (e.g., Metallica or the Supremes) we first use their Wikipedia page to extract the

individual name(s). In cases where no such page is available, research assistants searched for this

information manually. Then, we use the same approach as for individual artists to detect gender and

assign the band a score based on the percentage of members who are female. Gender could

be assigned for 99.5% of the artists.6
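A minimal sketch of the band-level scoring (with a toy name lookup standing in for the gender_guesser package and the national name list we actually use; the names below are illustrative examples):

```python
# Toy lookup standing in for gender_guesser plus the national name list;
# both the names and genders here are illustrative.
NAME_GENDER = {"diana": "female", "mary": "female", "florence": "female",
               "james": "male", "lars": "male", "kirk": "male", "robert": "male"}

def female_share(member_first_names):
    """Band-level score: fraction of members detected as female.
    Returns None when no member's gender can be assigned."""
    genders = [NAME_GENDER.get(name.lower(), "unknown") for name in member_first_names]
    known = [g for g in genders if g != "unknown"]
    if not known:
        return None
    return sum(g == "female" for g in known) / len(known)

print(female_share(["Diana", "Mary", "Florence"]))        # 1.0 (all-female group)
print(female_share(["James", "Lars", "Kirk", "Robert"]))  # 0.0 (all-male group)
```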

Results cast doubt on the notion that the results are driven by more female artists over

time. There is no change in the percentage of female artists over time (23% female artists, β = -

0.0001, p = 0.82). Further, while one could argue that this is driven by new genres emerging,

even within genres there is no significant change (β pop = 0.0037, p = 0.06, all other genres p >

6 Note that there are three to four times more male artists in the dataset (figure S8), so the effects for male language may be more precisely estimated. Further, while there are enough solo female artists or all-female groups, and solo male artists or all-male groups, to estimate effects for each, it is more difficult to do so for mixed-gender bands. Less than 10% of songs were performed by mixed-gender groups, which provides limited data for estimating language changes.


0.2).7 This casts doubt on the possibility that shifts in language are driven by more female artists

over time.

Instead, results are more consistent with a shift in male artists’ language. For male

artists, words related to competence (β = -0.0004, p < 0.001), and masculine stereotypes (β = -

0.0017, p < 0.01) become more associated with women while warmth related words become

more associated with men (β = 0.0002, p < 0.001). Words related to feminine stereotypes show

no change (β = -0.0006, p = 0.37).

While female artists’ language shows less evidence of change (βcompetence = 0.0001, p =

0.35; βmasculine stereotypes = 0.0006, p = 0.31; β feminine stereotypes = 0.0012, p = 0.1; βwarmth = 0.0003, p <

0.01), deeper examination suggests that this may be driven by the fact that female artists’

language was less misogynistic to begin with (figure 4 and table A4).

FIGURE 4. LINGUISTIC DIFFERENCES BASED ON ARTISTS’ GENDER.

7 Rock and rap seem to have a lower percentage of female artists than other genres (pop = 37%, rock = 13%, country = 26%, R&B = 26%, dance = 23%, rap = 17%).


Note: Dashed line represents equal association with either gender and values greater (less) than 0 indicate the dimension is more associated with men (women). Grey regions represent 95% confidence intervals around the estimates for each period.

Female artists are more likely to associate women with competence overall (βgender = -

0.0144, p < 0.001), and the gender x time interaction (β = 0.0002, p < 0.001) indicates that the

gender difference has decreased over time as male artists started associating women more with

competence. The same holds for masculine stereotypes (βgender = -0.0586, p < 0.001, βgender x time =

0.0013, p < 0.01). Female artists also associate warmth less with women (βgender = 0.0082, p <

0.001) though the gender difference has not closed over time (βgender x time = 0.0000, p = 0.8).

There were no significant effects for feminine stereotypes (βgender = -0.0273, p = 0.08; βgender x time

= 0.0008, p = 0.11). Analysis of the most recent time period, however, suggests that artists of

both genders remain biased. Lyrics associate men more with competence and masculine

stereotypes, and women more with warmth.

One might wonder whether individual artists’ (e.g., U2) language has changed, but the

mechanics of word embeddings make answering such questions challenging. One would need

thousands of songs at different time points to make comparisons over time. While the corpus as

a whole contains that many songs, no individual artist does.

Validation


As noted above, an emerging body of research (e.g., Garg et al. 2018; Kozlowski et al.

2019) demonstrates that word embeddings can be used to accurately capture gender biases. That

said, to ensure that word embeddings are capturing misogyny and gender stereotypes in our

context, we conducted additional validation checks. First, we examined the relationship between

Kozlowski et al. (2019)’s human scores and the word embedding bias in our model. We find

a significant correlation (r = 0.63, p < .001).

Second, we asked three RAs to code competence words (e.g., think and knowledge)

based on how stereotypically masculine or feminine they were (on a -2 to 2 scale, from “is a

very feminine word” to “is a very masculine word”). We then compared these scores to the embedding

bias at the most recent time point (i.e., 2015-2018). Results demonstrate that the embedding bias

is related to human-coded stereotype scores (r = 0.41, p < .001).

Third, following Garg et al. (2018), we calculated the correlation between word

embedding bias in our models and human-coded stereotype scores used by Garg et al. (2018) at

two time points (1977 and 1990). Since we use 5-year time buckets, we trained models for the

1975-1979 and 1990-1994 periods, respectively. Results revealed a significant correlation between

word embedding bias and human-coded gender stereotype scores for both time periods (r = 0.36,

p < .001 and r = 0.18, p < .05).
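Each of these validation checks reduces to a Pearson correlation between human-coded scores and embedding-based bias scores; a minimal sketch with invented toy scores (not our actual data):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy scores: human-coded stereotype ratings vs. embedding bias per word.
human = [-2, -1, 0, 1, 2]
embedding_bias = [-0.30, -0.12, 0.05, 0.11, 0.28]
print(pearson_r(human, embedding_bias) > 0.9)  # True for these toy scores
```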

Taken together, prior work and our own additional empirical

tests support the notion that word embeddings are accurately capturing differences in gender

associations.

We also provide a more general test that our word embeddings are accurately capturing

the semantic relationship between words. As mentioned earlier, words that are similar or are used


in similar contexts have been shown to have closer word vectors in the word embedding space

(Bhatia 2017). We find the same thing in our data. Following Hamilton, Leskovec, and Jurafsky

(2016) we use a standard similarity task benchmark called MEN (Bruni, Boleda, Baroni, and

Tran 2012). The MEN dataset includes human judgment scores on the similarity of 3,000 word pairs. We

calculate the similarity score between these word pairs in our model. Then, we measure the

correlation between human-coded similarity scores and model-calculated similarity scores. The

average correlations for the cross-genre and within-genre analyses are 0.47 (p < .001) and 0.37 (p

< .001), respectively. These significant positive correlations show that word association in our

trained models is a good representation of word association in language as judged by humans.

Robustness Checks

While the results suggest that misogyny has decreased, one could wonder whether this

simply reflects the increasing complexity of song lyrics. This argument would suggest that while

song lyrics in the 1970s were rather simplistic (e.g., Let It Be), in 2010, song lyrics may be much

more complex and varied and this would somehow lead to our effect. But while such an

argument might suggest that the increased complexity would lead to an increased distance

between men and competence words, as well as women and competence words, it says less about

why the relative distance between women and men with competence words would decrease. Said

another way, it would suggest that the distance between both male and female words and

competence would increase, but complexity alone has difficulty explaining why the increase

would be greater for one gender than the other.


That said, additional empirical tests cast doubt on this alternative. The main results

persist even when controlling for any such shifts in language complexity over time. To attempt to

control for the possibility that lyrics may have become more complex, we measure how the

distance among words within each of the gender word lists (e.g., words related to men like

“man,” “he,” and “his”) has changed over time. This captures whether the similarity of even

related words (“man” and “he,” for example, or “woman” and “she”) has shifted over time. We

then take the average of the similarity among words related to men (e.g., “man,” “he,” and “his”)

and the similarity among words related to women (e.g., “woman,” “she,” and “her”) and normalize

our main competence measure by this average. Results are quite similar. Competence words

have become more associated with women over time, though they remain more associated with

men. Further, these gains have leveled off and even reversed as of late.
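A sketch of this normalization (toy two-dimensional vectors; the within-list similarity averages stand in for the quantities described above, and the raw bias value is invented):

```python
import math

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all pairs within one gender word list."""
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(cos_sim(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

# Toy vectors for the male and female word lists (invented values).
male_vecs = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.1]]
female_vecs = [[0.1, 0.9], [0.2, 0.8], [0.1, 0.85]]

# Within-gender similarity, averaged across the two lists, serves as the
# complexity proxy; the raw bias score below is invented for illustration.
within = (mean_pairwise_similarity(male_vecs)
          + mean_pairwise_similarity(female_vecs)) / 2
raw_bias = 0.12
normalized_bias = raw_bias / within
print(0 < within < 1)  # True for these toy vectors
```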

We also conducted a number of robustness checks to see whether the results are robust to

the songs, words, and method used.

Songs Used. Given that more popular songs may be more likely to have their lyrics

available, the dataset, especially in earlier years, may include more popular songs. That said,

because more popular songs are listened to more anyway, they should have a greater impact on

the population.

To ensure that equally weighting genres is not driving the results, we weight the genres

based on popularity. Billboard Hot 100 charts are the only consistent audience consumption data

available from the 1960s to today, so we rely on genre popularity scores collected from Billboard by

Beckwith (2016), indicating the percentage of songs in Hot 100 charts from each genre in each

time period. We weight things accordingly. For example, assume in 2000, country, rock, and pop

songs occupy 25%, 25%, and 50% of the Billboard charts respectively. If we have 1000 songs


from each genre in our original dataset, we select 1000 pop songs, and 500 songs from rock and

country genres (total of 2000 songs). The number of songs for the 1965-70 year-bucket did not meet

our threshold of half a million words and was dropped for the genre popularity analysis.
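The reweighting in the example above can be sketched as follows (the scaling rule, keeping all songs from the most popular genre and downsampling the rest proportionally, is our reconstruction of the described procedure; the shares match the hypothetical in-text example):

```python
def weighted_sample_sizes(available_per_genre, popularity_share):
    """Downsample genres so selected counts match Billboard popularity shares.
    Keeps all songs from the most popular genre and scales the rest
    proportionally (assumes equal per-genre pools, as in the in-text example)."""
    top_share = max(popularity_share.values())
    return {genre: round(available_per_genre[genre] * popularity_share[genre] / top_share)
            for genre in popularity_share}

# Hypothetical shares matching the in-text example for the year 2000.
available = {"country": 1000, "rock": 1000, "pop": 1000}
shares = {"country": 0.25, "rock": 0.25, "pop": 0.50}
print(weighted_sample_sizes(available, shares))
# {'country': 500, 'rock': 500, 'pop': 1000}
```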

Results are similar to the main analyses. Aggression verbs were equally likely to be directed

toward both genders (toward women 49% of the time and toward men 51% of the time, χ2(1) =

2.0, p = 0.2), and this did not change significantly over the years (β = -0.0029, p = 0.74). Words

related to competence (β = -0.66, p < 0.001, β2 = 0.58, p < 0.001) become less associated with

men over time, as do words related to intelligence (β = -0.47, p < 0.001, β2 = 0.45, p < 0.001).

Warmth has become

less associated with women (β = 0.69, p < 0.001, β2 = -0.16, p = 0.25), masculine stereotypes

have become less associated with men (β = -0.47, p < 0.01, β2 = 0.34, p = 0.01), and feminine

stereotypes are more equivocal (β = -0.09, p = 0.38, β2 = 0.23, p = 0.03).

Words Used. To further test robustness, we consider the words used in our analyses. First,

as in previous research (Mikolov et al. 2013), we drop words with frequency smaller than 5 in

each year-bucket and repeat our analyses. Results remain generally similar. Words related to

competence become less misogynistic (β = -0.13, p = 0.07, β2 = 0.27, p < 0.001) with a reversal

in recent years. Warmth (β = 0.30, p < 0.001, β2 = -0.04, p = 0.63) and feminine stereotypes (β =

0.29, p < 0.001, β2 = 0.09, p = 0.22) become less associated with women and masculine

stereotypes become less associated with men (β = -0.12, p = 0.07, β2 = 0.24, p < 0.001).
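The frequency cutoff used in this robustness check can be sketched as a simple filter over per-bucket counts (the counts below are invented for illustration):

```python
from collections import Counter

MIN_FREQUENCY = 5  # cutoff used in the robustness check (as in Mikolov et al. 2013)

def filter_rare_words(token_counts, min_frequency=MIN_FREQUENCY):
    """Keep only words appearing at least min_frequency times in a time bucket."""
    return {word for word, count in token_counts.items() if count >= min_frequency}

# Toy per-bucket counts (invented for illustration).
counts = Counter({"love": 120, "smart": 7, "precocious": 2})
print(sorted(filter_rare_words(counts)))  # ['love', 'smart']
```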

Second, we only consider words that occur at every time point. This is a

restrictive rule that guarantees the same predictors (i.e., words) are compared over the years. Results

remain similar. Words related to competence become less associated with men (β = -0.16, p =

0.04, β2 = 0.15, p = 0.05). Warmth (β = 0.36, p < 0.001, β2 = -0.03, p = 0.69) and feminine

stereotypes (β = 0.24, p < 0.01, β2 = 0.15, p = 0.07) become less associated with women, and


masculine stereotypes become less associated with men (β = -0.21, p < 0.01, β2 = 0.21, p <

0.01).

Methods Used. As noted previously, we tested several other well-known methods such as

truncated SVD (a.k.a. latent semantic analysis), Positive Pointwise Mutual Information

(PPMI), and GloVe (Pennington et al. 2014), and find similar results.

We also considered several alternate analyses. First, we considered testing whether there

is a shift in how men and women appear as subjects and objects. The subject of a sentence has an

active role as the agent of action (e.g., the word she in “she called him”), while the object takes a

more passive role as the recipient of action (e.g., the word him in “she called him”). We could

thus measure how often men and women appear as subjects (e.g., she and he) and objects (e.g.,

her and him) over time. The link between this and misogyny, however, is not completely

straightforward. Women could be the subject of a sentence, for example, but still talked about

negatively (e.g., she is dumb).

Second, we tried to examine changes in derogatory words (e.g., b*tch) over time, but

measurement proved challenging. For example, while “b*tch” originally referred to women,

its meaning shifted and it came to be used for both genders. Thus, it is difficult to accurately capture shifts in

reference to women.


GENERAL DISCUSSION

There is longstanding interest in culture and cultural dynamics, both within consumer

behavior, and across other disciplines more broadly. But measurement has been a key challenge.

It’s one thing to measure brand personality, perceptions of cultural issues, or stereotypes at a

particular time point, but tracking these aspects over time becomes challenging. Surveys are

difficult to use over long time horizons, and manual coding of archival data is difficult to scale.

We suggest that an emerging method from computational linguistics may help address

some of these challenges. To illustrate how this approach might be used, we applied it to the

question of misogyny in music: whether lyrics are misogynistic, whether this has changed over

time, and whether this varies across genres.

Analyzing over a quarter of a million songs over more than 50 years begins to shed light

on these questions. While there is little evidence of explicit misogyny (e.g., greater aggression

towards women), subtler forms of misogyny suggest a more complex picture. Lyrics have

become less biased against women over time, but bias remains. Women are more likely to be

associated with competence and intelligence than they were in the past, but these concepts are

still more associated with men. Lyrics have become less gendered more broadly, but they remain

gendered. Further, reductions in bias seem to have slowed and may even have reversed in some

cases. Deeper examination suggests that this is driven by both new genres (e.g., rap) and changes

in older ones (e.g., country). Finally, ancillary analyses suggest that rather than being driven by

more female artists, these results are driven by male artists shifting their language over time.

Researchers and cultural critics alike have long debated whether popular music is

misogynistic. These results not only provide empirical evidence, but allow for quantitative


comparison of where (i.e., which genres and artists) and when (i.e., time periods) such biases

may be larger.

The results also highlight the importance of considering subtler measures of bias. Across

disciplines, there is increased interest in more implicit measures of attitudes and biases (Payne,

Vuletich, and Brown-Iannuzzi 2019) and language provides another useful approach. Like a

fingerprint, words reflect things about the people and contexts that generate them (Pennebaker,

Mehl, and Niederhoffer 2003). Consequently, language can be a useful barometer in a wide

range of domains. Future work might examine whether the language used in films, for example,

has changed over time, and whether there have been shifts in the way women and minorities are

discussed.

Broader Applications

While our empirical work focused on misogyny in music, word embeddings can be used

to shed light on a variety of other questions.

To study brand personality, for example, researchers often ask people to fill out surveys

reporting whether particular brands are more or less exciting or sincere. While this works

reasonably well when looking at brand personality at one, or even a couple points in time,

measuring brand personality over a longer horizon becomes more challenging. Collecting survey

data over decades is often prohibitive, and even if prior researchers collected some related data,

they often used different measures. But word embeddings can help address these challenges.

Rather than asking people for their perceptions, researchers can use existing textual data (e.g.,

social media or online reviews) to automatically measure semantic relationships. Just as we


measured the relationship between words related to women and competence, researchers could

extract the distance between brands and each of the key personality dimensions, and do so over

time. Similar approaches could be used to track consumer attitudes towards social issues or

political candidates. The Google Books dataset (Lin et al. 2012) and New York Times corpus

(Sandhaus 2008) are valuable data sources for researchers interested in even longer time

horizons.

Word embeddings could also be used to address the ongoing debate in the emotions

literature about the universality of emotions. While some (e.g., Ekman 1992) have argued that

there are a discrete number of universal emotions that are consistent across cultures (e.g., anger,

sadness, and disgust), others have suggested that the meaning of emotions are more malleable

(Jack, Garrod, Yu, Caldara, and Schyns 2012). By examining language corpora from different

cultures, researchers could use word embeddings to test whether the same words really have

similar semantic meanings: whether anger is used to talk about the same things in Spanish, for

example, as it is in English.

Researchers studying goals, motivation, and self-regulation may also find word

embeddings useful. Work on desire and desire regulation (Hofmann, Vohs, and Baumeister 2012),

for example, notes that “the majority of research on self-regulation occurs in the laboratory” and

that “little is known about what types of urges are felt strongly (or only weakly), what urges

conflict with other goals, and how successfully people resist their urges” (p. 582). Word

embeddings, topic modeling, and similar approaches can allow researchers to examine these

questions in more naturalistic settings. Social media, blog posts, and even the Google Books

dataset could be used to examine the urges people have, how they conflict, and even how these

urges may have varied over time. What people want and feel like they should do today (e.g.,


balance work and family or consume less social media), for example, is likely very different

today than it was 20 years ago.

Related approaches may also be valuable for looking within cultural items themselves.

Researchers have long theorized about the nature of narrative. Books, movies, and songs, but

also online content, speeches, and even academic papers usually have beginnings, middles, and

ends, through which there is some sort of narrative flow. Why are some narratives more

engaging and impactful than others?

One possibility is how the plot or narrative develops or unfolds. Some narratives move

very quickly, while others plod along at a slower pace. Some repeatedly tread new ground while

others circle back to the same themes again and again. Natural language processing can be used

to measure the geometry of narrative and whether it shapes success.

Word embeddings examine the semantic relationship between words, but related

approaches have applied these ideas to larger chunks of linguistic information. Just as word2vec

represents words as a vector in a multidimensional space, techniques like Doc2vec (Le and

Mikolov 2014) and USE (Cer, Yang, Kong, Hua, Limtiaco, John, and Sung 2018) do something

similar for sentences or larger chunks of text. This would allow researchers to test, for example,

whether narratives that move at a faster pace (i.e., adjoining chunks are more distant from one

another) are more or less successful.

Research could also examine the relationship between cultural items. Whether certain

songs, baby names, or other cultural items become popular depends not only on the features of

those items themselves, but how similar those items are to the larger cultural context in which

they are embedded. Songs whose lyrics are more atypical, or more differentiated from their

genre, for example, are more popular on the Billboard charts (Berger and Packard 2018). Similar


results might hold for movies or even academic papers. Papers whose themes are quite

differentiated from the journals they are published in, for example, may be more novel and thus

cited more. Alternatively, one could argue that if papers are too different, they may be cited less.

Topic modeling approaches may be even more useful than word embeddings here, because they

can capture underlying thematic content of cultural items.
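One simple operationalization of atypicality, in the spirit of Berger and Packard (2018), is the cosine distance between an item's vector and the centroid of its genre. The sketch below is a minimal illustration; it assumes item and genre representations are already available as numeric vectors (embeddings or topic proportions), and the function name is ours, not from any cited work.

```python
import numpy as np

def atypicality(item_vector, genre_vectors):
    """Cosine distance between an item and its genre's average vector.

    An item whose representation sits far from the centroid of its
    genre (or journal, for academic papers) is more atypical.
    """
    item = np.asarray(item_vector, dtype=float)
    # Centroid of all items in the genre.
    centroid = np.mean(np.asarray(genre_vectors, dtype=float), axis=0)
    sim = np.dot(item, centroid) / (
        np.linalg.norm(item) * np.linalg.norm(centroid))
    # Cosine distance: 0 = perfectly typical, up to 2 = maximally atypical.
    return 1.0 - float(sim)
```

Regressing popularity (or citations) on this score, with appropriate controls, would test whether differentiation helps or hurts, and whether the relationship is curvilinear.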

Limitations

As with many projects involving field data, this work has a number of limitations. First,

one could wonder whether the dataset is representative of the music people are actually exposed

to. At a genre level, the fact that the results hold whether we weight each genre equally or weight genres by popularity casts doubt on the notion that genre weighting is driving the results.

At a song level, by collecting the Billboard charts each year, we certainly have a representative

sample of popular songs. We also sampled less popular songs, but it is harder to be certain about

whether that sample is representative because no complete list of all songs exists. Given

we collected all available sources on songs that included lyrics, genre, and year information,

however, and the dataset is orders of magnitude larger than what has been used in prior work,

hopefully it is closer to being representative. Further, given that the population is exposed to

popular songs much more frequently than unpopular ones, we believe we have a reasonable

sample of what the population was exposed to.

Second, the results are certainly shaped by the word lists used. We relied on lists

developed by prior work, shown to tap the key constructs of interest, but one could still have

additional questions. In our competence (and warmth) dictionary, there are words with a positive relationship to competence (e.g., "smart" with weight = 1) and words with a negative

relationship (e.g., “dumb” with weight = -1). That said, one could wonder about negations (e.g.,

“women are not smart”). While negations are not directly taken into account, if some lyrics say

“women are not smart,” it is likely that the same lyrics or other lyrics also say “women are

dumb.” Given enough representative data, there will be several other instances of “smart,”

“dumb,” and “women” appearing in different contexts which will shape the meaning and final

positioning of the words. As a result, "women" and "smart" will not end up close together just because of a handful of "women are not smart" instances.
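To make the weighted-dictionary logic concrete, the sketch below scores a target word's association with a trait by averaging its cosine similarity to each dictionary word, weighted by that word's sign. The function, the two-word dictionary, and the toy vectors are illustrative assumptions, not our trained embeddings or full dictionaries.

```python
import numpy as np

def trait_association(target_vec, trait_words, embeddings):
    """Weighted association between a target word and a trait dictionary.

    `trait_words` maps words to weights (e.g. {"smart": 1, "dumb": -1});
    `embeddings` maps words to vectors. A target whose vector sits closer
    to positively weighted words than to negatively weighted ones gets a
    positive score.
    """
    def cos(a, b):
        a, b = np.asarray(a, float), np.asarray(b, float)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Signed, weighted average of similarities to each dictionary word.
    total = sum(w * cos(target_vec, embeddings[word])
                for word, w in trait_words.items())
    return total / len(trait_words)
```

Comparing this score for "women" versus "men" against the same trait dictionary yields a simple gender-bias measure, and computing it on embeddings trained per decade tracks how the bias changes over time.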

There are alternate ways one could imagine measuring misogyny. As noted previously,

misogyny is defined as a dislike of, contempt for, or ingrained prejudice against women. This

could be operationalized in a multitude of ways. Indeed, looking at prior empirical work

shows that different papers often look at completely different things. Work has looked at

everything from the presence of violence (Herd 2009) and discussion of murder and assault

(Armstrong 2001), to whether women are talked about as objects (Harding and Nett 1984), and

as evil, as sex objects, or as physically attractive (Cooper et al. 1985). Consequently, it is

unlikely that any one paper could look at all aspects of misogyny. That said, as noted above, we

tried a number of different approaches, including objects of aggression, association with desirable traits,

and, in ancillary analyses, the use of derogatory words as well as how often men and women

appear as subjects and objects. Future work could consider additional directions.

Finally, while automated textual analysis has some benefits, it also has some clear

weaknesses. While it provides consistency across texts and scalability across large datasets, we

do not mean to suggest that automated methods are somehow better than more manual ones.

Close reading allows researchers to pick up nuances that automated methods might miss. Indeed,

before zeroing in on which aspects to focus on in this investigation, we manually read through

dozens of songs to get a sense of how language was used and how it discussed men and women.

In contrasting the differences between more automated and more manual approaches, Hart

(2001) uses the metaphor of trying to understand a city by walking the streets or viewing it from

a helicopter. Each provides a different but useful picture of what is going on. Consequently,

automated and manual methods can be complementary, and often provide different viewpoints to

the same problem.

Conclusion

In conclusion, word embeddings and other computational linguistic techniques provide a

powerful toolkit to study culture. Researchers have long been interested in quantifying cultural

dynamics, but measurement has been a key challenge. Natural language processing, however,

provides a reliable method of extracting features, and doing so at scale. These emerging

approaches have the potential to shed light on a range of interesting questions related to

consumer behavior.

REFERENCES

Aaker, J. L., Lee, A. Y. (2001). “I” seek pleasures and “we” avoid pains: The role of self-regulatory goals in information processing and persuasion. Journal of Consumer Research, 28(1), 33–49.

Adams, T. M., Fuller, D. B. (2006). The words have changed but the ideology remains the same: Misogynistic lyrics in rap music. Journal of Black Studies, 36(6), 938–957.

Agarwal, S. (2017). Every song you have heard (almost)! Retrieved December 1, 2018, from https://www.kaggle.com/artimous/every-song-you-have-heard-almost

Anderson, C. A., Carnagey, N. L. and Eubanks, J. (2003). Exposure to violent media: The effects of songs with violent lyrics on aggressive thoughts and feelings. Journal of Personality and Social Psychology, 84(5), 960–971.

Armstrong, E. G. (2001). Gangsta misogyny: A content analysis of the portrayals of violence against women in rap music, 1987–1993. Journal of Criminal Justice and Popular Culture, 8(2), 96–126.

Bass, F. (1969). A new product growth for model consumer durables. Management Science, 15(5), 215–227.

Beckwith, J. (2016). The evolution of music genre popularity. Retrieved July 7, 2019, from http://thedataface.com/2016/09/culture/genre-lifecycles

Berger, J., Humphreys, A., Ludwig, S., Moe, W. W., Netzer, O. and Schweidel, D. A. (2019). Uniting the tribes: Using text for marketing insight. Journal of Marketing, 84(1), 1–25.

Berger, J., Packard, G. (2018). Are atypical things more popular? Psychological Science, 29(7), 1178–1184.

Bhatia, S. (2017). Associative judgment and vector space semantics. Psychological Review, 124(1), 1.

Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993–1022.

Brummert Lennings, H. I. and Warburton, W. A. (2011). The effect of auditory versus visual violent media exposure on aggressive behaviour: the role of song lyrics, video clips and musical tone. Journal of Experimental Social Psychology, 47(4), 794-799.

Bruni, E., Boleda, G., Baroni, M. and Tran, N.-K. (2012). Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, 1, 136–145.

Carlana, M. (2019). Implicit stereotypes: Evidence from teachers' gender bias. The Quarterly Journal of Economics, 134(3), 1163–1224.

Cer, D., Yang, Y., Kong, S. Y., Hua, N., Limtiaco, N., John, R. S., ... & Sung, Y. H. (2018). Universal sentence encoder. arXiv preprint arXiv:1803.11175.

Cooper, C. L., Cooper, R. and Faragher, B. (1985). Stress and life event methodology. Stress Medicine, 1, 287–289.

Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv Preprint ArXiv:1810.04805.

Dodds, P. S. and Danforth, C. M. (2010). Measuring the happiness of large-scale written expression: Songs, blogs, and presidents. Journal of Happiness Studies, 11(4), 441–456.

Ekman, P. (1992). An argument for basic emotions. Cognition and Emotion, 6(3-4), 169–200.

Fischer, P. and Greitemeyer, T. (2006). Music and aggression: The impact of sexual-aggressive song lyrics on aggression-related thoughts, emotions, and behavior toward the same and the opposite sex. Personality and Social Psychology Bulletin, 32(9), 1165–1176.

Fiske, S. T., Cuddy, A. J. C. and Glick, P. (2007). Universal dimensions of social cognition: Warmth and competence. Trends in Cognitive Sciences, 11(2), 77–83.

Fiske, S. T., Cuddy, A. J. C., Glick, P. and Xu, J. (2002). A model of (often mixed) stereotype content: Competence and warmth respectively follow from perceived status and competition. Journal of Personality and Social Psychology, 82(6), 878–902.

Flynn, M. A., Craig, C. M., Anderson, C. N. and Holody, K. J. (2016). Objectification in popular music lyrics: An examination of gender and genre differences. Sex Roles, 75(3-4), 164–176.

FreeDB. (2018). Retrieved May 10, 2019, from http://www.freedb.org/en/

Garg, N., Schiebinger, L., Jurafsky, D. and Zou, J. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635–E3644.

Gaucher, D., Friesen, J. and Kay, A. C. (2011). Evidence that gendered wording in job advertisements exists and sustains gender inequality. Journal of Personality and Social Psychology, 101(1), 109.

Greitemeyer, T., Hollingdale, J. and Traut-Mattausch, E. (2015). Changing the track in music and misogyny: Listening to music with pro-equality lyrics improves attitudes and behavior toward women. Psychology of Popular Media Culture, 4(1), 56–67.

Hamberg, K. (2008). Gender bias in medicine. Women's Health, 4(3), 237–243.

Hamilton, W. L., Leskovec, J. and Jurafsky, D. (2016). Diachronic word embeddings reveal statistical laws of semantic change. ArXiv Preprint ArXiv:1605.09096.

Harding, D. and Nett, E. (1984). Women and rock music. Atlantis: Critical Studies in Gender, Culture & Social Justice, 10(1), 60–76.

Hart, R. P. (2001). Theory, Method, and Practice in Computer Content Analysis. Westport: Ablex Publishing.

Herd, D. (2009). Changing images of violence in rap music lyrics: 1979–1997. Journal of Public Health Policy, 30(4), 395–406.

Hofmann, W., Vohs, K. D. and Baumeister, R. F. (2012). What people desire, feel conflicted about, and try to resist in everyday life. Psychological Science, 23(6), 582–588.

Humphreys, A. (2010). Semiotic structure and the legitimation of consumption practices: The case of casino gambling. Journal of Consumer Research, 37(3), 490–510.

Humphreys, A. and Wang, R. J. H. (2018). Automated text analysis for consumer research. Journal of Consumer Research, 44(6), 1274–1306.

Hutto, C. J. and Gilbert, E. (2014). VADER: A parsimonious rule-based model for sentiment analysis of social media text. Eighth International AAAI Conference on Weblogs and Social Media.

Jack, R. E., Garrod, O. G. B., Yu, H., Caldara, R. and Schyns, P. G. (2012). Facial expressions of emotion are not culturally universal. Proceedings of the National Academy of Sciences, 109(19), 7241–7244.

Kaggle. (2017). US Baby Names, Version 2. Retrieved May 10, 2019, from https://www.kaggle.com/kaggle/us-baby-names

Kashima, Y. (2008). A social psychology of cultural dynamics: Examining how cultures are formed, maintained, and transformed. Social and Personality Psychology Compass, 2(1), 107–120.

Kesebir, S. (2017). Word order denotes relevance differences: The case of conjoined phrases with lexical gender. Journal of Personality and Social Psychology, 113(2), 262–279.

Kim, H., Markus, H. R. (1999). Deviance or uniqueness, harmony or conformity? A cultural analysis. Journal of Personality and Social Psychology, 77(4), 785–800.

Kozlowski, A. C., Taddy, M. and Evans, J. A. (2019). The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings. American Sociological Review, 84(5), 905–949.

Kuznetsov, S. (2017). 55000+ Song Lyrics. Retrieved December 1, 2018, from https://www.kaggle.com/mousehead/songlyrics

Larivière, V., Ni, C., Gingras, Y., Cronin, B. and Sugimoto, C. R. (2013). Bibliometrics: Global gender disparities in science. Nature, 504(7479), 211–213.

Le, Q., Mikolov, T. (2014). Distributed representations of sentences and documents. In International conference on machine learning, 1188–1196.

Lieberson, S. (2000). A matter of taste: How names, fashions, and culture change. New Haven: Yale University Press.

Lin, Y., Michel, J.-B., Aiden, E. L., Orwant, J., Brockman, W. and Petrov, S. (2012). Syntactic annotations for the Google Books ngram corpus. Proceedings of the ACL 2012 System Demonstrations, 169–174.

Lundberg, S., Stearns, J. (2019). Women in Economics: Stalled Progress. Journal of Economic Perspectives, 33(1), 3-22.

Mahieux, T. B., Ellis, D. P. W., Whitman, B. and Lamere, P. (2011). The million song dataset. ISMIR-11.

Markus, H. R., Kitayama, S. (1991). Culture and the self: Implications for cognition, emotion, and motivation. Psychological Review, 98(2), 224.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 3111–3119.

Mishra, G. (2017). 380,000+ lyrics from MetroLyrics. Retrieved December 1, 2018, from https://www.kaggle.com/gyani95/380000-lyrics-from-metrolyrics

Moore, S. G. and McFerran, B. (2017). She said, she said: Differential interpersonal similarities predict unique linguistic mimicry in online word of mouth. Journal of the Association for Consumer Research, 2(2), 229–245.

Moss-Racusin, C., Dovidio, J., Brescoll, V., Graham, M. and Handelsman, J. (2012). Science faculty's subtle gender biases favor male students. Proceedings of the National Academy of Sciences, 109(41), 16474–16479.

Netzer, O., Feldman, R., Goldenberg, J. and Fresko, M. (2012). Mine your own business: Market-structure surveillance through text mining. Marketing Science, 31(3), 521–543.

Netzer, O., Lemaire, A. and Herzenstein, M. (2019). When words sweat: Identifying signals for loan default in the text of loan applications. Journal of Marketing Research, 56(6), 960–980.

Nicolas, G., Bai, X. and Fiske, S. T. (2019). Development and validation of comprehensive stereotype content dictionaries through an automated method.

Packard, G., Moore, S. G. and McFerran, B. (2018). (Im)happy to help (you): The impact of personal pronoun use in customer–firm interactions. Journal of Marketing Research, 55(4), 541–555.

Payne, B. K., Vuletich, H. A. and Brown-Iannuzzi, J. L. (2019). Historical roots of implicit bias in slavery. Proceedings of the National Academy of Sciences, 116(24), 11693–11698.

Pennebaker, J. W., Boyd, R. L., Jordan, K. and Blackburn, K. (2015). The development and psychometric properties of LIWC2015. Austin, TX: University of Texas at Austin.

Pennebaker, J. W., Mehl, M. R. and Niederhoffer, K. G. (2003). Psychological aspects of natural language use: Our words, our selves. Annual Review of Psychology, 54(1), 547–577.

Pennington, J., Socher, R. and Manning, C. (2014). GloVe: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.

Rao, H., Monin, P. and Durand, R. (2005). Border Crossing: Bricolage and the Erosion of Categorical Boundaries in French Gastronomy. American Sociological Review, 70(6), 968–991.

Reuben, E., Sapienza, P. and Zingales, L. (2014). How stereotypes impair women's careers in science. Proceedings of the National Academy of Sciences, 111(12), 4403–4408.

Rocklage, M., Fazio, R. (2015). The evaluative lexicon: Adjective use as a means of assessing and distinguishing attitude valence, extremity, and emotionality. Journal of Experimental Social Psychology, 56, 214–227.

Rocklage, M., Rucker, D. and Nordgren, L. (2018). The evaluative lexicon 2.0: The measurement of emotionality, extremity, and valence in language. Behavior Research Methods, 50(4), 1327–1344.

Sahlgren, M., Lenci, A. (2016). The effects of data size and frequency range on distributional semantic models. ArXiv Preprint ArXiv:1609.08293.

Sandhaus, E. (2008). The new york times annotated corpus. Linguistic Data Consortium, Philadelphia, 6(12), e26752.

Schaller, M., Conway, L. G. III and Tanchuk, T. L. (2002). Selective pressures on the once and future contents of ethnic stereotypes: Effects of the communicability of traits. Journal of Personality and Social Psychology, 82(6), 861–877.

Schaller, M. and Crandall, C. S. (2003). The Psychological Foundations of Culture. Mahwah, NJ: Lawrence Erlbaum Associates.

Shor, E., van de Rijt, A., Miltsov, A., Kulkarni, V. and Skiena, S. (2015). A paper ceiling: Explaining the persistent underrepresentation of women in printed news. American Sociological Review, 80(5), 960–984.

Simmel, G. (1957). Fashion. American Journal of Sociology, 62(6), 541–558.

Toubia, O., Iyengar, G., Bunnell, R. and Lemaire, A. (2018). Extracting features of entertainment products: A guided latent Dirichlet allocation approach informed by the psychology of media consumption. Journal of Marketing Research, 56(1), 18–36.

Wolfram, W., Reaser, J. and Vaughn, C. (2008), Operationalizing Linguistic Gratuity: From Principle to Practice. Language and Linguistics Compass, 2, 1109-1134.