

Large-Scale Topical Analysis of Multiple Online News Sources with Media Cloud

Jason Chuang
University of Washington
[email protected]

Sands Fish
Harvard University
{sands, dlarochelle}@cyber.law.harvard.edu

David Larochelle
Harvard University

William P. Li
Massachusetts Institute of Technology
[email protected]

Rebecca Weiss
Stanford University
[email protected]

ABSTRACT

Identifying topics in news, tracking their temporal dynamics, and understanding how different media sources cover them have important theoretical and practical implications for journalism researchers, producers, and consumers. The explosive growth of online news sources, however, suggests that scalable approaches to topical analysis are needed. We introduce our ongoing efforts to enable large-scale topical analysis of the Media Cloud corpus, a repository of over 200 million online news articles. Our initial experiments with 90 days of articles from 21 top media sources suggest that statistical topic modeling can identify reasonable news-related topics and produce interesting early insights into the online media ecosystem. We are currently examining mixed-initiative approaches to automate the process of topic extraction and increase the quality of the extracted topics. Finally, we discuss our further research directions on large-scale news monitoring and measurement as well as analysis tools for news consumers and producers.

Authors are listed in alphabetical order.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
KDD ’14 New York City, NY USA
Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00.

1. INTRODUCTION

The importance of the news cannot be overstated; the news serves to inform us about the state of the world across a wide array of themes: from political and economic issues to entertainment and sports events. However, online news has grown increasingly diverse as the definition of a journalistic entity has evolved. News consumers now have the luxury of choice, and with low barriers to entry the quantity of news-supplying outlets has exploded, ranging from venerable print news organizations such as nationally distributed newspapers to the greater networked public sphere [1].

As a result of this growth, online news is produced at a massive scale, forcing news producers to remain viable in an increasingly competitive market. News producers must generate content that will increase readership, which may come at the expense of news quality. Mainstream news sources are challenged to retain high standards of journalistic integrity while also remaining commercially viable. There is growing concern about an impoverished news landscape where suppliers can only publish stories guaranteed to attract a generic type of reader. In short, the explosive growth of news suppliers has actually complicated the most basic use of the news: to inform readers about the events happening in the world around them. Readers, in turn, have lost confidence in the news, lacking the ability to extract meaningful and accurate insight.

We propose a solution to this problem on behalf of news producers and consumers based on the notion that categorization of news sources and summarization of news content hold the most promise for regaining tractability of online news as a valuable source of information. The ability to categorize news sources by relative attention to topics can reduce the burden of choice on consumers, making it easier to discover sources that cater to specific interests. This could increase both media trust and news consumption. Furthermore, summarizing large amounts of news in a meaningful, accurate way could further decrease the burden of choice by providing a snapshot of overall event and issue coverage along with additional information such as date or time of publication, source, or framing.

In this paper, we present our work-in-progress. First, we employ Media Cloud, an open news platform that provides public access to full-length historical news texts, which enables unprecedented access to the online news landscape. Second, we apply statistical topic modeling, a computational technique for extracting themes from text corpora, to three months’ worth of news. We highlight two values of the resulting model: exploratory analysis (e.g., correlating multiple variables including latent topics and metadata) and data cleaning (e.g., identifying outliers or unexpected trends such as the volume of a news source). Finally, we outline how topic modeling Media Cloud data can lead to commercial applications, as well as provide contributions to both social and computer science.

2. MEDIA CLOUD

Media Cloud is an open source, open data platform that allows researchers to answer questions about the content of online media. It is intended to allow researchers to analyze the online media ecosystem without having to incur the cost of discovering and collecting news content themselves.

2.1 Platform Description

The core idea of Media Cloud is simple: collect and archive the huge text dataset of news articles published online in a format useful and accessible to researchers. To that end, Media Cloud detects RSS feeds from media sources, polls them multiple times daily, downloads pages containing previously unseen stories, and extracts article text. Currently, the system contains over 20,000 unique media sources and 200 million archived stories. For each collected story, the system extracts article text and removes other “boilerplate” such as ads and navigation elements. As much of this data as is legally permissible is made available through an API; publicly available information includes URL, publication date, and word frequency counts for each article.

To date, applications of Media Cloud have focused on media coverage of specific issues. For example, it has been used to investigate the role of blogs in the Russian network public sphere as an alternative to Russian government information sources and the mainstream media [6], as well as coverage of the SOPA/PIPA debate in the United States from September 2010 through the end of January 2012 [2]. In addition, Media Cloud has been useful in combination with other online data sources: a comprehensive study of coverage of the Trayvon Martin story involved Media Cloud data in conjunction with Archive.org’s TV news archive, Google Searches, and Change.org signatures to understand the evolution of the story’s coverage [7].

2.2 Study Purpose and Dataset

Given the size and completeness of Media Cloud, it may be possible to use it to discover new insights about particular media sources or the online media ecosystem as a whole. In particular, having all of the articles from URLs published to the RSS feeds of multiple media sources for a non-trivial time period makes it possible to conduct large-scale topical analysis of news.

Using the Media Cloud API, we obtained data for all stories from 21 top mainstream media sources published within a 90-day period between 2014-03-05 and 2014-06-02. This set of 21 media sources, listed in full in the appendix, was defined in 2010 based on the 25 most popular sites in the news sources category according to Google AdPlanner; of these sources, 21 were still operational and were being captured by Media Cloud during our target time period. This dataset contains the per-article word-frequency counts for 429,042 articles. In addition to these counts, we captured metadata about each article, including its source and its time and date of publication.
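To make the shape of this dataset concrete, the following minimal sketch models one article as a record with a source, a publication timestamp, and word-frequency counts, and checks that the stated date range spans 90 days when both endpoints are included. The `Article` class and its field names are illustrative assumptions; they do not reflect the actual response format of the Media Cloud API.

```python
from dataclasses import dataclass
from datetime import date, datetime

# Study window stated in the text, treated here as inclusive on both ends.
START = date(2014, 3, 5)
END = date(2014, 6, 2)


@dataclass
class Article:
    """Hypothetical per-article record; field names are illustrative only."""
    source: str           # media source name, e.g. "CNET"
    published: datetime   # time and date of publication
    word_counts: dict     # word -> frequency within the article


def window_length_days(start: date, end: date) -> int:
    """Number of calendar days in the inclusive range [start, end]."""
    return (end - start).days + 1


def in_window(article: Article, start: date = START, end: date = END) -> bool:
    """True if the article was published inside the inclusive study window."""
    return start <= article.published.date() <= end


print(window_length_days(START, END))  # 90
```

Filtering a list of such records with `in_window` would then yield the subset of stories that falls inside the study period.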

3. STATISTICAL TOPIC MODELING

To find thematic patterns in the retrieved Media Cloud dataset, we apply statistical topic modeling, a probabilistic algorithm that seeks to uncover “topics” from a document collection, where each generated topic is a weighted list of words representing terms that frequently co-occur in the same set of documents. In turn, each document is represented as a weighted list of topics. Specifically, we apply latent Dirichlet allocation (LDA) [3], the most widely used topic modeling technique. A more thorough explanation of LDA and probabilistic generative modeling of text corpora can be found in [15].

Statistical topic modeling has been applied to various tasks, such as analyses of political text [8], effects of funding in biomedical research [16], identifications of temporal trends in news [11], and characterizations of themes in microblogs [13]. However, existing topic models are typically built by teams of machine learning experts, whereas our goal is to enable journalists and consumers to effectively perform topical analysis of news data.

Media Cloud is available at: http://www.mediacloud.org

Prior to final publication, all code used in this paper will be made available under a free and open source license.

3.1 Topic Modeling the Media Cloud Corpus

To apply LDA on the Media Cloud dataset, we used Gensim, an open-source Python library [14]. Gensim implements an online variant of LDA for computing model outputs to make memory requirements manageable [9].
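As an illustration of how per-article word-frequency counts can be handed to Gensim, the sketch below converts each article's counts into the bag-of-words form Gensim expects (a list of `(token_id, count)` pairs) and passes the result to `gensim.models.LdaModel`. The helper names `build_vocabulary`, `to_bow`, and `fit_lda` are hypothetical, and apart from the number of topics all model settings are left at Gensim's defaults, which is an assumption rather than a description of the configuration used in this study.

```python
def build_vocabulary(articles):
    """Map each distinct word to an integer id (sorted for determinism)."""
    words = sorted({word for counts in articles for word in counts})
    return {word: idx for idx, word in enumerate(words)}


def to_bow(counts, word2id):
    """Convert one article's word counts to sorted (token_id, count) pairs."""
    return sorted((word2id[w], c) for w, c in counts.items() if w in word2id)


def fit_lda(articles, num_topics=100):
    """Fit LDA with Gensim on a list of word-count dicts (requires gensim)."""
    # Imported lazily so the helpers above can be used without Gensim installed.
    from gensim.models import LdaModel

    word2id = build_vocabulary(articles)
    id2word = {idx: word for word, idx in word2id.items()}
    corpus = [to_bow(counts, word2id) for counts in articles]
    return LdaModel(corpus=corpus, id2word=id2word, num_topics=num_topics)


# Toy input: two articles given only as word-frequency counts.
toy_articles = [{"google": 3, "apple": 1}, {"apple": 2, "bank": 4}]
toy_vocab = build_vocabulary(toy_articles)
toy_corpus = [to_bow(counts, toy_vocab) for counts in toy_articles]
```

Because only word counts are needed, this route does not require access to the full article text.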

A practical challenge of applying LDA is that users must specify the number of topics beforehand. One approach for finding distinctive topics is to choose a relatively high number of topics; however, this can make understanding or visualizing all of the topics difficult. For our analysis, we set this parameter to 100 to balance these two considerations. On a single PC with 8 CPUs and 14 GB of memory, computing the model took approximately six hours. Upon completion, we used the source and date/time metadata in our dataset to generate time series plots of topics and the mean topic weights for our 21 media sources. These results are summarized and visualized in the next section and in the appendix.

3.2 Initial Results

The relative average topic weights for the 21 media sources are displayed in Figure 1. For this figure, we selected an interesting subset of the 100 topics that appeared to correspond to news themes or events and have substantial variability across media sources. Visually, a few “clusters” seem apparent: the British media sources (BBC, Daily Telegraph, Guardian, and Daily Mail) seem to have similar topic weights, while the American sources (Reuters to CNN) also exhibit similarity with each other. Meanwhile, Examiner.com, Forbes, and CNET each appear to be unique among media sources.

Figure 1 suggests two things. First, describing each media source as a set of topic weights, which we refer to as its topic “signature”, seems reasonable: it reflects the coverage of different themes or events relative to each news source. Second, it may not be necessary to use all 100 topic weights for media sources to have distinguishable signatures, since even a small subset of topics already seems to produce differences.
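A topic signature of this kind can be computed by averaging per-document topic weights within each media source. The sketch below is one minimal way to do so; the function name `topic_signatures` and the input layout (one `{topic_id: weight}` dict per document, with a parallel list of source names) are assumptions made for illustration.

```python
from collections import defaultdict


def topic_signatures(doc_topics, doc_sources, num_topics):
    """Average per-document topic weights by media source.

    doc_topics:  list of {topic_id: weight} dicts, one per document
    doc_sources: list of source names aligned with doc_topics
    Returns {source: [mean weight of topic 0, ..., topic num_topics - 1]}.
    """
    sums = defaultdict(lambda: [0.0] * num_topics)
    counts = defaultdict(int)
    for weights, source in zip(doc_topics, doc_sources):
        row = sums[source]  # creates the row even if `weights` is empty
        counts[source] += 1
        for topic_id, weight in weights.items():
            row[topic_id] += weight
    return {s: [total / counts[s] for total in sums[s]] for s in sums}


# Toy example with two documents from one hypothetical source "A".
toy_signatures = topic_signatures(
    [{0: 0.5, 1: 0.5}, {0: 1.0}], ["A", "A"], num_topics=2
)
```

Topics missing from a document's dict are treated as having zero weight, so sparse per-document outputs can be averaged directly.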

We chose a simple method of reducing topics: we selected the topics where the top five words were, in our human judgment, all highly related. This method resulted in keeping 42 out of 100 topics. The 42 topics and an example of their relative weights for one media source, CNET, are shown in Figure 2. In this case, it is clear that CNET focuses most on topic 97 (data, google, apple, mobile, technology), followed by topics 4 (company, market, price, business, sales) and 15 (million, billion, money, bank, cash). The appendix includes an expanded version of Figure 1 and the topic signatures of the other 20 media sources.

3.3 Interpretations of Initial Results

With these media-source signatures, numerous further analyses are possible. We chose two areas that relate to understanding individual media outlets and their similarities and differences to one another:

Figure 1: Media Sources vs. Selected Topics: This chart describes each media source as a combination of topics. The area of the circles represents the amount of topical weights assigned to each media source. Topics are labeled using the top five words belonging to each topic.

Media Source Topic Rankings. The 21 media sources differ in their weights on the 42 topics. Table 1 shows the top three organizations for selected topics. The corresponding topic weights, expressed as a percentage, can be interpreted as a measure of proportion of a media source’s total coverage focused on that topic. For instance, about 9.3% of all of CNET’s coverage was on topic 97. CNET is top-ranked for the technology topic, Forbes for business, and Washington Post for politics. Other top-ranked media sources are also interesting: health (CBS News), Donald Sterling (LA Times), Syria (FOX News), and baseball (New York Times).
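Given per-source signatures, a ranking like the one in Table 1 amounts to sorting sources by their weight on a single topic. The following sketch shows this with a hypothetical helper `top_sources`; the three weights for topic 97 are the values reported in Table 1, expressed as fractions rather than percentages.

```python
def top_sources(signatures, topic_id, k=3):
    """Return the k (source, weight) pairs with the highest weight on topic_id."""
    ranked = sorted(
        signatures.items(),
        key=lambda item: item[1][topic_id],
        reverse=True,
    )
    return [(source, weights[topic_id]) for source, weights in ranked[:k]]


# Topic 97 weights for three sources, taken from Table 1.
table1_topic97 = {
    "TIME.com": {97: 0.013},
    "CNET": {97: 0.093},
    "Forbes": {97: 0.030},
}
ranking = top_sources(table1_topic97, topic_id=97, k=3)
```

The same function works whether each signature is a dict keyed by topic id or a list indexed by topic id.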

Media Source Clustering. Using signatures based on the 42 selected topics, we applied k-means clustering to find groupings of media sources. Table 2 shows the results of automatically choosing five clusters. This method successfully recovered the visual groupings identified in Figure 1 and split the American media sources into two groups: cluster 1 tends to cover certain topics related to politics more than cluster 2. Average signatures of these clusters are included in the appendix.
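For readers unfamiliar with k-means, the sketch below implements the standard Lloyd iteration on a list of signature vectors. It is a toy illustration only: the number of clusters is fixed by the caller and the first k points serve as initial centroids, both of which are simplifying assumptions. The text does not specify the implementation or the procedure used to choose five clusters, and in practice a library implementation would normally be preferred.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns one cluster index per input point.

    Initial centroids are simply the first k points (an assumption made
    here for determinism, not a recommended initialization).
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment


# Toy data: two well-separated groups of two-dimensional "signatures".
toy_points = [[0.0, 0.0], [10.0, 10.0], [0.1, 0.0], [10.0, 10.1]]
toy_labels = kmeans(toy_points, k=2)
```

Applied to the 42-dimensional signatures, each media source would receive one cluster label in the same way.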

Table 1: Top Media Sources for Selected Topics

Topic                                            Top Media Sources
97: data, google, apple, mobile, technology      CNET (9.3%), Forbes (3.0%), TIME.com (1.3%)
15: million, billion, money, bank, cash          Forbes (8.8%), Reuters (8.1%), Daily Telegraph (3.4%)
16: obama, president, house, party, election     Washington Post (9.2%), USA Today (4.6%), Huffington Post (3.6%)
54: cancer, study, health, people, research      CBS News (2.0%), USA Today (1.8%), FOX News (1.8%)
12: sterling, jones, chamberlain, adam, carey    LA Times (0.48%), TIME.com (0.48%), MSNBC (0.46%)
21: attack, people, killed, government, syria    FOX News (2.8%), Reuters (2.4%), TIME.com (1.5%)
30: run, game, runs, inning, hit                 New York Times (3.1%), New York Post (2.50%), LA Times (2.2%)

Table 2: Clusters of Media Sources

Cluster  Media Sources
1        Examiner.com, CNN, LA Times, Daily News New York, New York Post
2        FOX News, MSNBC, San Francisco Chronicle, TIME.com, Huffington Post, USA Today, Washington Post
3        BBC, Daily Mail, Daily Telegraph, Guardian
4        CNET
5        Forbes, Reuters

These initial results suggest that statistical topic modeling may be an appropriate method for characterizing the Media Cloud corpus (and subsequently online news). Many of the automatically discovered topics seem distinctive and recognizable, meaning that comparing media sources in terms of their coverage of these topics led to reasonable and understandable insights. These insights were made possible by a mixed-initiative approach involving automatically discovered topics, human involvement to identify distinctive topics and groups of media sources, and computer-based ranking and clustering of media sources.

4. DISCUSSION

Topic modeling can increase the scale of news analysis and remove subjective bias, but the process also requires additional manual efforts to assess and interpret the model. We discuss our approach’s advantages and our ongoing work to improve the usefulness of topic modeling in the news domain.

4.1 Increasing Scalability and Reducing Bias

Our automatic approach, when extended to the full Media Cloud corpus, would allow us to examine millions of documents at once. As a point of comparison, civic watchdogs such as Media Matters [10] and the Pew Research Center’s Journalism Project [12] currently track news through manual coding. However, because the process is labor-intensive, typically only a sample of 500 to 1,000 articles is tagged per year. The capacity to characterize all of the articles published by media sources could give a more complete and accurate picture of the media ecosystem.

Figure 2: Topic Signature for CNET: A signature is the topic weights averaged over all documents belonging to an entity; in this case, all articles published by CNET.

Topic modeling may also sidestep various origins of bias that typically affect news content analyses. The topic extraction process does not depend on users’ a priori knowledge of the set of topics, which may be incomplete and partial, or on human judgment of the proportions to assign to different topics, which can be subjective and erroneous. As a result, topic modeling may uncover unspecified events and quantify topics in a more balanced manner than manual coding.

4.2 Research Directions

While our approach reduces the upfront efforts required in manually reading or coding the source documents, statistical topic modeling introduces additional pain points. People need to inspect and interpret topics generated by a model, select appropriate model parameters, and refine a model’s output to reveal clustering relationships and structures.

We are currently creating tools to help non-machine-learning experts more effectively build topic models, including visual analysis tools [4] for interactively comparing multiple models, aligning similar topics, and updating a model, as well as automatic algorithms to operationalize common tasks. In addition, we are also currently designing visualizations [5] to help everyday users better understand and trust topical analysis results. Visualizations can provide an at-a-glance quantitative and qualitative sense of the ways in which issues are being covered in the media; perhaps more importantly, the absence of a topic might also reveal what topics are not being discussed. Citizens may also be able to cross-reference extracted topics with known events, people, and other facts to gain more confidence or identify inconsistencies in either the model output or the underlying news corpus they’re investigating.

4.3 Potential Applications

Editorial Support and Content Management. Topic modeling provides an abstract view across very large amounts of content, media sources, and time, allowing for quantitative and qualitative assessments. For news producers, this could aid in evaluating their own content in comparison to others, allowing them to become a more diverse, unique, or well-balanced source of news.

When researching previous coverage of an issue, or trying to characterize a narrative leading up to a story, topic modeling could enable data journalists to analyze large amounts of published text on an issue. This combination of tools could be useful to understand discursive elements, at what times certain topics have been discussed, and potentially what frames are used to characterize a specific issue.

User Management. The topic “signature” generated as a result of fitting a topic model to Media Cloud data could be useful for recommending other news properties to interested consumers. For example, if a news consumer tends to follow media sources with a topic signature similar to Figure 2, a search could be performed for additional news sources with similar topic signatures. For news sources, which are often part of media conglomerates that own multiple news properties, such a comparison method could lead to the development of sophisticated recommendation engines.
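One simple way such a signature search could work is to rank candidate sources by the similarity of their signatures to a reference signature. The sketch below uses cosine similarity, which is an assumption chosen for illustration; the text does not commit to a particular similarity measure, and the names `cosine_similarity` and `most_similar` as well as the sources "A", "B", and "C" are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero signature as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(target, signatures, k=3):
    """Rank sources other than `target` by similarity to its signature."""
    scores = [
        (source, cosine_similarity(signatures[target], signature))
        for source, signature in signatures.items()
        if source != target
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:k]


# Toy two-topic signatures for three hypothetical sources.
toy_sources = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0]}
recommendations = most_similar("A", toy_sources, k=2)
```

Here the source whose coverage mix is closest to that of "A" is listed first, which is the behavior a recommendation feature of this kind would need.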

Media Literacy for News Consumers. In addition, this work could lead to tools for news consumers that report on a media source’s breadth and depth of event and issue coverage. Given the explosion of potential sources of news, being able to quickly evaluate different media sources could help news consumers better discover and assess their appropriateness. For example, if a consumer encounters a media source reporting on politics, but its topic signature indicates that this news source primarily covers entertainment, the consumer can assess the quality of that news source accordingly. For news consumers, topic modeling could help them understand how and why media outlets are addressing a specific issue in a given news cycle, how different news organizations relate to one another in terms of topic coverage, and what latent biases may be present.

5. CONCLUSION

We present our initial work on automating topical analysis of large-scale news data. Future work will involve applying the analysis across a more diverse set of content available in Media Cloud such as blogs, enabling insight across a broader media landscape. Further work is needed on visualizing our topic modeling efforts, as well as evaluating the results of our model a) across individual media sources; b) over longer lengths of time; c) on a specifically scoped set of documents composing a controversy; d) for an array of explicit text searches; and e) comparing collections of media sources (e.g., left- vs. right-wing blogs).

6. ACKNOWLEDGMENTS

The authors would like to thank Pamela Mishkin for valuable feedback on early drafts of the paper; Hal Roberts and Linas Valiukas who helped build and maintain the Media Cloud platform; and the Ford Foundation, the Open Society Foundations, the John S. and James L. Knight Foundation, and the Brown Institute for Media Innovation for their generous financial support.

7. REFERENCES

[1] Y. Benkler. The wealth of networks: How social production transforms markets and freedom. Yale University Press, 2006.

[2] Y. Benkler, H. Roberts, R. Faris, A. Solow-Niederman, and B. Etling. Social mobilization and the networked public sphere: Mapping the SOPA-PIPA debate. Berkman Center Research Publication, (2013-16), 2013.

[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[4] J. Chuang, S. Gupta, C. D. Manning, and J. Heer. Topic model diagnostics: Assessing domain relevance via topical alignment. In International Conference on Machine Learning (ICML), 2013.

[5] J. Chuang, C. D. Manning, and J. Heer. Termite: Visualization techniques for assessing textual topic models. In Advanced Visual Interfaces, 2012.

[6] B. Etling, H. Roberts, and R. Faris. Blogs as an alternative public sphere: The role of blogs, mainstream media, and TV in Russia’s media ecology. Berkman Center Research Publication, (2014-8), 2014.

[7] E. Graeff, M. Stempeck, and E. Zuckerman. The battle for ‘Trayvon Martin’: Mapping a media controversy online and off-line. First Monday, 19(2), 2014.

[8] J. Grimmer. A Bayesian hierarchical topic model for political texts: Measuring expressed agendas in Senate press releases. Political Analysis, 18(1):1–35, 2010.

[9] M. Hoffman, F. R. Bach, and D. M. Blei. Online learning for latent Dirichlet allocation. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 856–864. Curran Associates, Inc., 2010.

[10] Media Matters. Black and white and re(a)d all over: The conservative advantage in syndicated op-ed columns. http://mediamatters.org/research/oped/, 2014.

[11] D. Newman, C. Chemudugunta, P. Smyth, and M. Steyvers. Analyzing entities and topics in news articles using statistical topic models. In Intelligence and Security Informatics, pages 93–104. Springer, 2006.

[12] Pew Research Center’s Journalism Project. News coverage index methodology. http://www.journalism.org/newsindexmethodology/99/, 2014.

[13] D. Ramage, S. T. Dumais, and D. J. Liebling. Characterizing microblogs with topic models. ICWSM, 10:1–1, 2010.

[14] R. Rehurek and P. Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en.

[15] M. Steyvers and T. Griffiths. Probabilistic topic models. Handbook of Latent Semantic Analysis, 427(7):424–440, 2007.

[16] E. M. Talley, D. Newman, D. Mimno, B. W. Herr II, H. M. Wallach, G. A. Burns, A. M. Leenders, and A. McCallum. Database of NIH grants using machine-learned categories and graphical clustering. Nature Methods, 8(6):443–444, 2011.

APPENDIX

The following supplementary materials are included in the appendix:

A List of 21 online media sources
B Seriated topic weights by media source for full set of 100 topics
C Topics versus time (daily)
D Topics versus time (hourly)
E Topic signatures for 21 online media sources
F Cluster memberships for media sources and average topic signatures from k-means clustering

Appendix A: List of Online Media Sources

BBC
CBS News
CNET
CNN
Daily Mail
Daily Telegraph
Examiner.com
FOX News
Forbes
Guardian
LA Times
MSNBC
New York Times
Reuters
San Francisco Chronicle
TIME.com
The Daily News New York
The Huffington Post
The New York Post
USA Today
Washington Post

Appendix B: Seriated topic weights by media source for full set of 100 topics

Figure: topics vs source (seriated)

Rows (topic_index, topic_desc), in displayed order:

48  work, help, make, new, business
94  percent, year, average, rate, growth
4   company, market, price, business, sales
15  million, billion, money, bank, cash
57  said, president, united, states, group
16  obama, president, house, party, election
90  police, said, year, court, man
58  report, department, public, said, office
97  data, google, apple, mobile, technology
0   london, british, britain, european, party
81  club, league, football, players, season
56  family, old, mother, father, year
36  use, problem, problems, number, result
89  russia, ukraine, russian, pro, ukrainian
21  attack, people, killed, government, syria
45  game, team, nba, games, series
23  said, car, people, near, crash
30  run, game, runs, inning, hit
83  team, sports, detroit, season, year
87  june, day, event, summer, free
9   home, house, small, like, room
93  film, movie, story, character, series
8   world, cup, england, team, final
24  hair, red, white, fashion, look
54  cancer, study, health, people, research
29  black, star, rodger, year, wedding
60  school, students, university, court, college
59  music, stage, song, performance, musical
35  second, goal, ball, minutes, game
86  friday, saturday, sunday, day, thursday
27  said, round, got, year, play
85  government, tax, political, economic, party
5   band, love, fans, rock, singer
34  uri, food, sugar, cook, minutes
1   climate, cars, change, power, ballmer
53  new, city, york, jersey, mayor
14  health, care, state, insurance, medical
13  death, hospital, died, surgery, medical
41  french, brazil, france, italy, world
62  photo, twitter, facebook, social, images
7   airport, mexico, flight, coast, island
66  military, war, veterans, army, afghanistan
61  game, games, release, new, characters
68  children, parents, girls, young, child
63  water, energy, oil, gas, environmental
20  dog, animal, dogs, segment, pet
25  race, road, horse, racing, run
80  air, cnn, plane, space, pakistan
11  india, indian, israel, jewish, minister
26  art, museum, memorial, barbara, history
73  local, people, country, town, community
33  weight, food, diet, eating, fat
38  shinseki, brain, body, skin, pain
44  media, cbs, news, morning, season
31  star, season, stars, bryan, bell
77  cricket, africa, african, van, mae
71  god, path, life, christian, faith
75  book, los, angeles, books, ctm
17  american, world, topic, america, false
64  county, park, florida, river, state
65  clinton, prince, wine, blackhawks, nigeria
50  women, men, bst, woman, angelou
19  com, http, pid, bitrate, mpx
67  china, beijing, asian, malaysia, tiananmen
49  festival, open, beats, antonio, tennis
22  james, miami, heat, oklahoma, spurs
98  weather, snowden, mountain, lake, boston
92  church, paul, francis, pope, john
96  bergdahl, taliban, defense, hagel, trial
32  award, best, elliot, winning, awards
76  michelle, kin, justin, bottle, volume
47  moore, teeth, fuel, male, rugby
39  kings, ice, japan, stanley, japanese
40  org, medias, oregon, opera, photographer
72  min, playlist, train, height, div
37  chicago, phoenix, beer, illinois, vista
3   search, security, card, information, credit
82  south, north, korea, carolina, prisoners
42  glass, drink, taste, chef, cheese
2   gay, marriage, sex, davis, couples
69  thumbnail, williams, tour, australia, australian
99  slug, fight, froch, vegas, dek
51  video, camera, smoking, harris, footage
43  rangers, santa, louis, ryan, miller
18  bowe, ipad, device, cable, protesters
91  hub, smith, martin, ray, disney
55  san, california, francisco, download, bay
79  cbsnews, duration, marine, species, plant
10  maya, simon, soul, jazz, mad
46  gun, guns, camp, mount, owners
74  calif, stone, comic, racist, walls
70  workers, employees, carl, union, german
84  murray, turkey, christ, flowers, shelly
12  sterling, jones, chamberlain, adam, carney
95  lee, hall, jack, watson, bruce
28  brown, plants, marijuana, gordon, wood
88  foods, valley, insider, katz, silicon
78  donald, thomas, tom, massachusetts, patrick
6   johnson, michigan, lewis, charlie, lik

Columns (media_name), in displayed order:

Examiner.com, Forbes, Reuters, New York Times, Washington Post, MSNBC, USA Today, FOX News, The Huffington Post, TIME.com, The New York Post, San Francisco Chronicle, CBS News, The Daily News New Yo.., LA Times, CNN, CNET, BBC, Daily Telegraph, Guardian, Daily Mail

Size legend (% of Total value): 0.01%, 5.00%, 10.00%, 15.00%, 20.00%, 24.21%

% of Total value (size) broken down by topic_index and topic_desc vs. media_name. The view is filtered on topic_index, which keeps 99 of 100 members. Percents are based on each row of the table.
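The caption for Appendix B states that percents are based on each row of the table, that is, each topic's weights are normalized across the media sources. A minimal sketch of that normalization follows. The function name `row_percentages`, the nested-dictionary layout, and the example numbers are assumptions made for illustration; they are not taken from the paper or from the Media Cloud tooling.

```python
def row_percentages(weights):
    """Convert each topic's raw weights into percentages of that topic's row total.

    `weights` maps topic_index -> {media_name: raw weight} (hypothetical layout).
    """
    result = {}
    for topic, by_source in weights.items():
        total = sum(by_source.values())
        if total == 0:
            # An all-zero row has no meaningful percentages; report zeros.
            result[topic] = {source: 0.0 for source in by_source}
        else:
            result[topic] = {
                source: 100.0 * value / total
                for source, value in by_source.items()
            }
    return result


# Illustrative numbers only, not values read from the figure.
example_weights = {
    97: {"CNET": 6.0, "Forbes": 3.0, "BBC": 1.0},
    81: {"CNET": 0.0, "Forbes": 1.0, "BBC": 3.0},
}
example_percentages = row_percentages(example_weights)
```

Under this normalization each row sums to 100%, so a large value in a cell indicates that a source accounts for a large share of that topic, not that the topic dominates that source.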

Appendix C: Topics versus time (daily)

Figure: topics vs time (daily)

X-axis: Day of publish_date [2014], with ticks at Mar 15, Mar 30, Apr 14, Apr 29, May 14, May 29.

Rows (topic_index, topic_desc):

0   london, british, britain, european, party
1   climate, cars, change, power, ballmer
2   gay, marriage, sex, davis, couples
3   search, security, card, information, credit
4   company, market, price, business, sales
5   band, love, fans, rock, singer
6   johnson, michigan, lewis, charlie, lik
7   airport, mexico, flight, coast, island
8   world, cup, england, team, final
9   home, house, small, like, room
10  maya, simon, soul, jazz, mad
11  india, indian, israel, jewish, minister
12  sterling, jones, chamberlain, adam, carney
13  death, hospital, died, surgery, medical
14  health, care, state, insurance, medical
15  million, billion, money, bank, cash
16  obama, president, house, party, election
17  american, world, topic, america, false
18  bowe, ipad, device, cable, protesters
19  com, http, pid, bitrate, mpx
20  dog, animal, dogs, segment, pet
21  attack, people, killed, government, syria
22  james, miami, heat, oklahoma, spurs
23  said, car, people, near, crash
24  hair, red, white, fashion, look
25  race, road, horse, racing, run
26  art, museum, memorial, barbara, history
27  said, round, got, year, play
28  brown, plants, marijuana, gordon, wood
29  black, star, rodger, year, wedding
30  run, game, runs, inning, hit
31  star, season, stars, bryan, bell
32  award, best, elliot, winning, awards
33  weight, food, diet, eating, fat
34  uri, food, sugar, cook, minutes
35  second, goal, ball, minutes, game
36  use, problem, problems, number, result
37  chicago, phoenix, beer, illinois, vista
38  shinseki, brain, body, skin, pain
39  kings, ice, japan, stanley, japanese
40  org, medias, oregon, opera, photographer
41  french, brazil, france, italy, world
42  glass, drink, taste, chef, cheese
43  rangers, santa, louis, ryan, miller
44  media, cbs, news, morning, season
45  game, team, nba, games, series
46  gun, guns, camp, mount, owners
47  moore, teeth, fuel, male, rugby
48  work, help, make, new, business
49  festival, open, beats, antonio, tennis
50  women, men, bst, woman, angelou
51  video, camera, smoking, harris, footage
52  like, time, people, don, way
53  new, city, york, jersey, mayor
54  cancer, study, health, people, research
55  san, california, francisco, download, bay
56  family, old, mother, father, year
57  said, president, united, states, group
58  report, department, public, said, office
59  music, stage, song, performance, musical
60  school, students, university, court, college
61  game, games, release, new, characters
62  photo, twitter, facebook, social, images
63  water, energy, oil, gas, environmental
64  county, park, florida, river, state
65  clinton, prince, wine, blackhawks, nigeria
66  military, war, veterans, army, afghanistan
67  china, beijing, asian, malaysia, tiananmen
68  children, parents, girls, young, child
69  thumbnail, williams, tour, australia, australian
70  workers, employees, carl, union, german
71  god, path, life, christian, faith
72  min, playlist, train, height, div
73  local, people, country, town, community
74  calif, stone, comic, racist, walls
75  book, los, angeles, books, ctm
76  michelle, kin, justin, bottle, volume
77  cricket, africa, african, van, mae
78  donald, thomas, tom, massachusetts, patrick
79  cbsnews, duration, marine, species, plant
80  air, cnn, plane, space, pakistan
81  club, league, football, players, season
82  south, north, korea, carolina, prisoners
83  team, sports, detroit, season, year
84  murray, turkey, christ, flowers, shelly
85  government, tax, political, economic, party
86  friday, saturday, sunday, day, thursday
87  june, day, event, summer, free
88  foods, valley, insider, katz, silicon
89  russia, ukraine, russian, pro, ukrainian
90  police, said, year, court, man
91  hub, smith, martin, ray, disney
92  church, paul, francis, pope, john
93  film, movie, story, character, series
94  percent, year, average, rate, growth
95  lee, hall, jack, watson, bruce
96  bergdahl, taliban, defense, hagel, trial
97  data, google, apple, mobile, technology
98  weather, snowden, mountain, lake, boston
99  slug, fight, froch, vegas, dek

The trend of sum of Number of Records for publish_date Day broken down by topic_index and topic_desc. The data is filtered on publish_date, which keeps 379,900 of 379,901 members.
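The captions for the daily and hourly views describe a sum of the number of records per publish_date bucket, broken down by topic_index. A minimal sketch of that aggregation is given here, assuming each record is a (publish_date, topic_index) pair with an ISO 8601 timestamp. The function name `topic_counts`, the record layout, and the sample records are hypothetical and are not drawn from the Media Cloud data.

```python
from collections import Counter
from datetime import date, datetime


def topic_counts(records, by_hour=False):
    """Count records per (time bucket, topic_index).

    `records` is an iterable of (publish_date, topic_index) pairs, where
    publish_date is an ISO 8601 string (assumed format).
    """
    counts = Counter()
    for publish_date, topic_index in records:
        stamp = datetime.fromisoformat(publish_date)
        if by_hour:
            # Truncate to the start of the hour for the hourly view.
            bucket = stamp.replace(minute=0, second=0, microsecond=0)
        else:
            # Keep only the calendar day for the daily view.
            bucket = stamp.date()
        counts[(bucket, topic_index)] += 1
    return counts


# Hypothetical records for illustration only.
example_records = [
    ("2014-03-15T08:30:00", 89),
    ("2014-03-15T21:05:00", 89),
    ("2014-03-16T09:00:00", 89),
    ("2014-03-15T10:15:00", 45),
]
daily = topic_counts(example_records)
hourly = topic_counts(example_records, by_hour=True)
```

Plotting `daily` or `hourly` with one row per topic would give small-multiple trend lines of the kind the two appendices show.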

Appendix D: Topics versus time (hourly)

Figure: topics vs time (hourly)

X-axis: Hour of publish_date [2014], with ticks at Mar 15, Mar 30, Apr 14, Apr 29, May 14, May 29.

Rows (topic_index, topic_desc): the same 100 topics, indices 0 to 99, as listed for Appendix C.

The trend of sum of Number of Records for publish_date Hour broken down by topic_index and topic_desc. The data is filtered on publish_date, which keeps 379,900 of 379,901 members.

Appendix E: Topic signatures for 21 online media sources

 


Appendix F: Cluster memberships for media sources and average topic signatures from k-means clustering
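As a rough illustration of how cluster memberships and average topic signatures can be obtained, the sketch below runs plain Lloyd-style k-means over per-source topic signatures, treating each signature as a vector of topic weights. It is a minimal sketch under stated assumptions: the deterministic seeding (the first k source names in sorted order), the function names, and the toy three-topic signatures for placeholder sources are all illustrative, and the paper's actual clustering settings are not given in this appendix.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def k_means(signatures, k, max_iterations=100):
    """Cluster named vectors; returns (assignment, centroids).

    Seeding uses the first k names in sorted order (an assumption made
    here so that the example is deterministic).
    """
    names = sorted(signatures)
    centroids = [list(signatures[name]) for name in names[:k]]
    assignment = {}
    for _ in range(max_iterations):
        # Assignment step: attach each source to its nearest centroid.
        new_assignment = {
            name: min(
                range(k),
                key=lambda c: squared_distance(signatures[name], centroids[c]),
            )
            for name in names
        }
        if new_assignment == assignment:
            break  # Converged: memberships no longer change.
        assignment = new_assignment
        # Update step: each centroid becomes its members' average signature.
        for c in range(k):
            members = [signatures[n] for n in names if assignment[n] == c]
            if members:
                centroids[c] = mean_vector(members)
    return assignment, centroids


# Toy signatures over three hypothetical topics for placeholder sources.
example_signatures = {
    "source_a": [0.8, 0.1, 0.1],
    "source_b": [0.7, 0.2, 0.1],
    "source_c": [0.1, 0.1, 0.8],
    "source_d": [0.2, 0.1, 0.7],
}
assignment, centroids = k_means(example_signatures, k=2)
```

In this sketch the returned `assignment` plays the role of the cluster memberships and each entry of `centroids` is the average topic signature of one cluster.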

 
