Large-Scale Topical Analysis of Multiple Online News Sources with Media Cloud
Jason Chuang, University of Washington
Sands Fish, Harvard University, {sands, dlarochelle}@cyber.law.harvard.edu
David Larochelle, Harvard University
William P. Li, Massachusetts Institute of Technology
Rebecca Weiss, Stanford University
ABSTRACT
Identifying topics in news, tracking their temporal dynamics, and understanding how different media sources cover them have important theoretical and practical implications for journalism researchers, producers, and consumers. The explosive growth of online news sources, however, suggests that scalable approaches to topical analysis are needed. We introduce our ongoing efforts to enable large-scale topical analysis of the Media Cloud corpus, a repository of over 200 million online news articles. Our initial experiments with 90 days of articles from 21 top media sources suggest that statistical topic modeling can identify reasonable news-related topics and produce interesting early insights into the online media ecosystem. We are currently examining mixed-initiative approaches to automate the process of topic extraction and increase the quality of the extracted topics. Finally, we discuss our further research directions on large-scale news monitoring and measurement as well as analysis tools for news consumers and producers.
1. INTRODUCTION
The importance of the news cannot be overstated; the news serves to inform us about the state of the world across a wide array of themes: from political and economic issues to entertainment and sports events. However, online news has grown increasingly diverse as the definition of a journalistic entity has evolved. News consumers now have the luxury of choice, and with low barriers to entry the quantity of news-supplying outlets has exploded, ranging from venerable print news organizations such as nationally distributed newspapers to the greater networked public sphere [1].
As a result of this growth, online news is produced at a massive scale, forcing news producers to remain viable in an increasingly competitive market. News producers must generate content that will increase readership, which may come at the expense of news quality. Mainstream news sources are challenged to retain high standards of journalistic integrity while also remaining commercially viable. There is growing concern about an impoverished news landscape where suppliers can only publish stories guaranteed to attract a generic type of reader. In short, the explosive growth of news suppliers has actually complicated the most basic use of the news: to inform readers about the events happening in the world around them. Readers, in turn, have lost confidence in the news, lacking the ability to extract meaningful and accurate insight.

Authors are listed in alphabetical order.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
KDD ’14 New York City, NY USA
Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00.
We propose a solution to this problem on behalf of news producers and consumers based on the notion that categorization of news sources and summarization of news content hold the most promise for regaining tractability of online news as a valuable source of information. The ability to categorize news sources by relative attention to topics can reduce the burden of choice on consumers, making it easier to discover sources that cater to specific interests. This could increase both media trust and news consumption. Furthermore, summarizing large amounts of news in a meaningful, accurate way could further decrease the burden of choice by providing a snapshot of overall event and issue coverage along with additional information such as date or time of publication, source, or framing.
In this paper, we present our work-in-progress. First, we employ Media Cloud, an open news platform that provides public access to full-length historical news texts and thereby offers unprecedented access to the online news landscape. Second, we apply statistical topic modeling, a computational technique for extracting themes from text corpora, to three months’ worth of news. We highlight two values of the resulting model: exploratory analysis (e.g., correlating multiple variables including latent topics and metadata) and data cleaning (e.g., identifying outliers or unexpected trends such as the volume of a news source). Finally, we outline how topic modeling Media Cloud data can lead to commercial applications, as well as provide contributions to both social and computer science.
2. MEDIA CLOUD
Media Cloud is an open source, open data platform that allows researchers to answer questions about the content of online media. It is intended to allow researchers to analyze the online media ecosystem without having to incur the cost of discovering and collecting news content themselves.
2.1 Platform Description
The core idea of Media Cloud is simple: collect and archive the huge text dataset of news articles published online in a format useful and accessible to researchers. To that end, Media Cloud detects RSS feeds from media sources, polls them multiple times daily, downloads pages containing previously unseen stories, and extracts article text. Currently, the system contains over 20,000 unique media sources and 200 million archived stories. For each collected story, the system extracts article text and removes other “boilerplate” such as ads and navigation elements. As much of this data as is legally permissible is made available through an API; publicly available information includes URL, publication date, and word frequency counts for each article.
To date, applications of Media Cloud have focused on media coverage of specific issues. For example, it has been used to investigate the role of blogs in the Russian networked public sphere as an alternative to Russian government information sources and the mainstream media [6], as well as coverage of the SOPA/PIPA debate in the United States from September 2010 through the end of January 2012 [2]. In addition, Media Cloud has been useful in combination with other online data sources: a comprehensive study of coverage of the Trayvon Martin story involved Media Cloud data in conjunction with Archive.org’s TV news archive, Google searches, and Change.org signatures to understand the evolution of the story’s coverage [7].
2.2 Study Purpose and Dataset
Given the size and completeness of Media Cloud, it may be possible to use it to discover new insights about particular media sources or the online media ecosystem as a whole. In particular, having all of the articles from URLs published to the RSS feeds of multiple media sources for a non-trivial time period makes it possible to conduct large-scale topical analysis of news.
Using the Media Cloud API, we obtained data for all stories from 21 top mainstream media sources published within a 90-day period between 2014-03-05 and 2014-06-02. This set of 21 media sources, listed in full in the appendix, was defined in 2010 based on the 25 most popular sites in the news sources category according to Google AdPlanner; of these sources, 21 were still operational and were being captured by Media Cloud during our target time period. This dataset contains the per-article word-frequency counts for 429,042 articles. In addition to these counts, we captured metadata about each article, including its source and its time and date of publication.
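As a rough illustration of the data layout described above, per-article word-frequency counts can be turned into sparse lists of (word id, count) pairs, the bag-of-words form that topic modeling libraries such as Gensim accept as input. The record layout, the sample counts, and the helper names build_vocabulary and to_bag_of_words are assumptions made for this sketch; they do not reproduce the Media Cloud API's actual response format.

```python
# Hypothetical input: one dict of word -> count per article.
articles = [
    {"obama": 3, "election": 2, "party": 1},
    {"google": 4, "apple": 2, "election": 1},
]


def build_vocabulary(word_counts):
    """Map each distinct word to an integer id (sorted for reproducibility)."""
    words = sorted({word for counts in word_counts for word in counts})
    return {word: index for index, word in enumerate(words)}


def to_bag_of_words(counts, vocab):
    """Convert one article's counts into a sorted list of (word id, count)."""
    return sorted((vocab[word], n) for word, n in counts.items() if word in vocab)


vocab = build_vocabulary(articles)
corpus = [to_bag_of_words(counts, vocab) for counts in articles]
```

Keeping only the non-zero counts per article keeps memory use proportional to the text actually observed, which matters at the scale of several hundred thousand articles.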
3. STATISTICAL TOPIC MODELING
To find thematic patterns in the retrieved Media Cloud dataset, we apply statistical topic modeling, a probabilistic algorithm that seeks to uncover “topics” from a document collection, where each generated topic is a weighted list of words representing terms that frequently co-occur in the same set of documents. In turn, each document is represented as a weighted list of topics. Specifically, we apply latent Dirichlet allocation (LDA) [3], the most widely used topic modeling technique. A more thorough explanation of LDA and probabilistic generative modeling of text corpora can be found in [15].

Media Cloud is available at: http://www.mediacloud.org

Prior to final publication, all code used in this paper will be made available under a free and open source license.
Statistical topic modeling has been applied to various tasks, such as analyses of political text [8], effects of funding in biomedical research [16], identifications of temporal trends in news [11], and characterizations of themes in microblogs [13]. However, existing topic models are typically built by teams of machine learning experts, whereas our goal is to enable journalists and consumers to effectively perform topical analysis of news data.
3.1 Topic Modeling the Media Cloud Corpus
To apply LDA on the Media Cloud dataset, we used Gensim, an open-source Python library [14]. Gensim implements an online variant of LDA for computing model outputs to make memory requirements manageable [9].
A practical challenge of applying LDA is that users must specify the number of topics beforehand. One approach for finding distinctive topics is to choose a relatively high number of topics; however, this can make understanding or visualizing all of the topics difficult. For our analysis, we set this parameter to 100 to balance these two considerations. On a single PC with 8 CPUs and 14 GB of memory, computing the model took approximately six hours. Upon completion, we used the source and date/time metadata in our dataset to generate time series plots of topics and the mean topic weights for our 21 media sources. These results are summarized and visualized in the next section and in the appendix.
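The mean topic weights per media source mentioned above can be computed with a simple group-and-average step. The following is a minimal sketch, assuming each document's topic distribution is available as a sparse list of (topic index, weight) pairs alongside its source name; the function name topic_signatures and the toy values are hypothetical and use three topics instead of 100.

```python
from collections import defaultdict


def topic_signatures(doc_topics, doc_sources, num_topics):
    """Average per-document topic weights over all documents of each source."""
    sums = defaultdict(lambda: [0.0] * num_topics)
    counts = defaultdict(int)
    for weights, source in zip(doc_topics, doc_sources):
        counts[source] += 1
        for topic, weight in weights:  # topics absent from the list count as 0
            sums[source][topic] += weight
    return {
        source: [total / counts[source] for total in sums[source]]
        for source in sums
    }


# Toy, made-up topic distributions for three documents.
doc_topics = [
    [(0, 0.8), (1, 0.2)],
    [(0, 0.6), (2, 0.4)],
    [(1, 1.0)],
]
doc_sources = ["CNET", "CNET", "Forbes"]

signatures = topic_signatures(doc_topics, doc_sources, num_topics=3)
```

Grouping on publication date instead of source name would give the per-day totals needed for time series plots in the same way.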
3.2 Initial Results
The relative average topic weights for the 21 media sources are displayed in Figure 1. For this figure, we selected an interesting subset of the 100 topics that appeared to correspond to news themes or events and have substantial variability across media sources. Visually, a few “clusters” seem apparent: the British media sources (BBC, Daily Telegraph, Guardian, and Daily Mail) seem to have similar topic weights, while the American sources (Reuters to CNN) also exhibit similarity with each other. Meanwhile, Examiner.com, Forbes, and CNET each appear to be unique among media sources.
Figure 1 suggests two things. First, describing each media source as a set of topic weights, which we refer to as its topic “signature”, seems reasonable: it reflects the coverage of different themes or events relative to each news source. Second, it may not be necessary to use all 100 topic weights for media sources to have distinguishable signatures, since even a small subset of topics already seems to produce differences.
We chose a simple method of reducing topics: we selected the topics where the top five words were, in our human judgment, all highly related. This method resulted in keeping 42 out of 100 topics. The 42 topics and an example of their relative weights for one media source, CNET, are shown in Figure 2. In this case, it is clear that CNET focuses most on topic 97 (data, google, apple, mobile, technology), followed by topics 4 (company, market, price, business, sales) and 15 (million, billion, money, bank, cash). The appendix includes an expanded version of Figure 1 and the topic signatures of the other 20 media sources.
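The top-five-word labels that the human review above relies on can be derived from a topic's word weights by sorting. This is a minimal sketch; the function name topic_label and the numeric weights are made up for illustration, and only the five words of topic 97 come from the text.

```python
def topic_label(word_weights, n=5):
    """Return the n highest-weighted words of a topic as a comma-separated label."""
    # Sort by descending weight; break ties alphabetically for stable output.
    ranked = sorted(word_weights.items(), key=lambda item: (-item[1], item[0]))
    return ", ".join(word for word, _ in ranked[:n])


# Hypothetical word weights for one topic.
topic_97 = {
    "data": 0.05,
    "google": 0.04,
    "apple": 0.03,
    "mobile": 0.02,
    "technology": 0.015,
    "phone": 0.01,
}

label = topic_label(topic_97)
```

A reviewer would then judge whether the five words in each label are closely related, which is the manual step that reduced 100 topics to 42.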
Figure 1: Media Sources vs. Selected Topics: This chart describes each media source as a combination of topics. The area of the circles represents the topical weight assigned to each media source. Topics are labeled using the top five words belonging to each topic.

3.3 Interpretations of Initial Results
With these media-source signatures, numerous further analyses are possible. We chose two areas that relate to understanding individual media outlets and their similarities and differences to one another:
Media Source Topic Rankings. The 21 media sources differ in their weights on the 42 topics. Table 1 shows the top three organizations for selected topics. The corresponding topic weights, expressed as a percentage, can be interpreted as a measure of the proportion of a media source’s total coverage focused on that topic. For instance, about 9.3% of all of CNET’s coverage was on topic 97. CNET is top-ranked for the technology topic, Forbes for business, and Washington Post for politics. Other top-ranked media sources are also interesting: health (CBS News), Donald Sterling (LA Times), Syria (FOX News), and baseball (New York Times).
Media Source Clustering. Using signatures based on the 42 selected topics, we applied k-means clustering to find groupings of media sources. Table 2 shows the results of automatically choosing five clusters. This method successfully recovered the visual groupings identified in Figure 1 and split the American media sources into two groups: cluster 1 tends to cover certain topics related to politics more than cluster 2. Average signatures of these clusters are included in the appendix.
Table 1: Top Media Sources for Selected Topics

Topic                                                Top Media Sources
97: data, google, apple, mobile, technology          CNET (9.3%), Forbes (3.0%), TIME.com (1.3%)
15: million, billion, money, bank, cash              Forbes (8.8%), Reuters (8.1%), Daily Telegraph (3.4%)
16: obama, president, house, party, election         Washington Post (9.2%), USA Today (4.6%), Huffington Post (3.6%)
54: cancer, study, health, people, research          CBS News (2.0%), USA Today (1.8%), FOX News (1.8%)
12: sterling, jones, chamberlain, adam, carney       LA Times (0.48%), TIME.com (0.48%), MSNBC (0.46%)
21: attack, people, killed, government, syria        FOX News (2.8%), Reuters (2.4%), TIME.com (1.5%)
30: run, game, runs, inning, hit                     New York Times (3.1%), New York Post (2.5%), LA Times (2.2%)
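A ranking like the one in Table 1 follows from sorting the sources by their weight on a single topic. The sketch below assumes signatures are stored as a mapping from source name to a dict of topic index to weight; the function name top_sources and the entry "Source X" are hypothetical, while the other weights are taken from Table 1.

```python
def top_sources(signatures, topic, n=3):
    """Return the n sources with the largest weight on the given topic."""
    ranked = sorted(
        signatures.items(),
        key=lambda item: item[1].get(topic, 0.0),
        reverse=True,
    )
    return [(name, signature.get(topic, 0.0)) for name, signature in ranked[:n]]


signatures = {
    "CNET": {97: 0.093},
    "Forbes": {97: 0.030, 15: 0.088},
    "TIME.com": {97: 0.013},
    "Source X": {97: 0.002},  # hypothetical low-weight source
}

technology_ranking = top_sources(signatures, topic=97)
business_leader = top_sources(signatures, topic=15, n=1)
```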
Table 2: Clusters of Media Sources

Cluster   Media Sources
1         Examiner.com, CNN, LA Times, Daily News New York, New York Post
2         FOX News, MSNBC, San Francisco Chronicle, TIME.com, Huffington Post, USA Today, Washington Post
3         BBC, Daily Mail, Daily Telegraph, Guardian
4         CNET
5         Forbes, Reuters
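The grouping in Table 2 comes from k-means clustering of topic signatures. The sketch below shows the standard Lloyd iteration on toy two-topic signatures. The text does not state how centroids were initialized or how the number of clusters was chosen, so the deterministic farthest-point initialization, the fixed k, and the sample vectors are all assumptions of this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Cluster points into k groups; returns (labels, centroids)."""
    # Assumed initialization: start from the first point, then repeatedly
    # add the point farthest from all centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))

    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical two-topic signatures for four sources.
toy_signatures = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
labels, centroids = kmeans(toy_signatures, k=2)
```

With real signatures, each point would be a 42-dimensional vector, one weight per selected topic.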
These initial results suggest that statistical topic modeling may be an appropriate method for characterizing the Media Cloud corpus (and subsequently online news). Many of the automatically discovered topics seem distinctive and recognizable, meaning that comparing media sources in terms of their coverage of these topics led to reasonable and understandable insights. These insights were made possible by a mixed-initiative approach involving automatically discovered topics, human involvement to identify distinctive topics and groups of media sources, and computer-based ranking and clustering of media sources.
4. DISCUSSION
Topic modeling can increase the scale of news analysis and remove subjective bias, but the process also requires additional manual efforts to assess and interpret the model. We discuss our approach’s advantages and our ongoing work to improve the usefulness of topic modeling in the news domain.
4.1 Increasing Scalability and Reducing Bias
Our automatic approach, when extended to the full Media Cloud corpus, would allow us to examine millions of documents at once. As a point of comparison, civic watchdogs such as Media Matters [10] and the Pew Research Center’s Journalism Project [12] currently track news through manual coding. However, because the process is labor-intensive, typically only a sample of 500 to 1,000 articles is tagged per year. The capacity to characterize all of the articles published by media sources could give a more complete and accurate picture of the media ecosystem.

Figure 2: Topic Signature for CNET: A signature is the topic weights averaged over all documents belonging to an entity; in this case, all articles published by CNET.
Topic modeling may also sidestep various sources of bias that typically affect news content analyses. The topic extraction process does not depend on users’ a priori knowledge of the set of topics, which may be incomplete and partial, or on human judgment of the proportions to assign to different topics, which can be subjective and erroneous. As a result, topic modeling may uncover unspecified events and quantify topics in a more balanced manner than manual coding.
4.2 Research Directions
While our approach reduces the upfront efforts required in manually reading or coding the source documents, statistical topic modeling introduces additional pain points. People need to inspect and interpret topics generated by a model, select appropriate model parameters, and refine a model’s output to reveal clustering relationships and structures.
We are currently creating tools to help people who are not machine learning experts build topic models more effectively, including visual analysis tools [4] for interactively comparing multiple models, aligning similar topics, and updating a model, as well as automatic algorithms to operationalize common tasks. In addition, we are designing visualizations [5] to help everyday users better understand and trust topical analysis results. Visualizations can provide an at-a-glance quantitative and qualitative sense of the ways in which issues are being covered in the media; perhaps more importantly, the absence of a topic might also reveal what topics are not being discussed. Citizens may also be able to cross-reference extracted topics with known events, people, and other facts to gain more confidence or identify inconsistencies in either the model output or the underlying news corpus they are investigating.
4.3 Potential Applications
Editorial Support and Content Management. Topic modeling provides an abstract view across very large amounts of content, media sources, and time, allowing for quantitative and qualitative assessments. For news producers, this could aid in evaluating their own content in comparison to others, allowing them to become a more diverse, unique, or well-balanced source of news.
When researching previous coverage of an issue, or trying to characterize a narrative leading up to a story, topic modeling could enable data journalists to analyze large amounts of published text on an issue. This combination of tools could be useful to understand discursive elements, at what times certain topics have been discussed, and potentially what frames are used to characterize a specific issue.
User Management. The topic “signature” generated as a result of fitting a topic model to Media Cloud data could be useful for recommending other news properties to interested consumers. For example, if a news consumer tends to follow media sources with a topic signature similar to Figure 2, a search could be performed for additional news sources with similar topic signatures. For news sources, which are often part of media conglomerates that own multiple news properties, such a comparison method could lead to the development of sophisticated recommendation engines.
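One way such a signature search could work is to rank candidate sources by the similarity of their signatures to a target signature. The text does not name a similarity measure, so the cosine similarity used below is an assumption, as are the function names, the source names, and the toy three-topic signatures.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(target, signatures, n=1):
    """Rank all other sources by signature similarity to the target source."""
    scored = [
        (name, cosine_similarity(signatures[target], signature))
        for name, signature in signatures.items()
        if name != target
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:n]


# Hypothetical three-topic signatures for made-up sources.
signatures = {
    "tech_site_a": [0.7, 0.2, 0.1],
    "tech_site_b": [0.6, 0.3, 0.1],
    "sports_site": [0.05, 0.05, 0.9],
}

recommendations = most_similar("tech_site_a", signatures)
```

Cosine similarity compares the shape of two signatures and ignores their overall magnitude; other distance measures would be equally plausible choices here.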
Media Literacy for News Consumers. In addition, this work could lead to tools for news consumers that report on a media source’s breadth and depth of event and issue coverage. Given the explosion of potential sources of news, being able to quickly evaluate different media sources could help news consumers better discover and assess their appropriateness. For example, if a consumer encounters a media source reporting on politics, but its topic signature indicates that this news source primarily covers entertainment, the consumer can assess the quality of that news source accordingly. For news consumers, topic modeling could help them understand how and why media outlets are addressing a specific issue in a given news cycle, how different news organizations relate to one another in terms of topic coverage, and what latent biases may be present.
5. CONCLUSION
We present our initial work on automating topical analysis of large-scale news data. Future work will involve applying the analysis across a more diverse set of content available in Media Cloud such as blogs, enabling insight across a broader media landscape. Further work is needed on visualizing our topic modeling efforts, as well as evaluating the results of our model a) across individual media sources; b) over longer lengths of time; c) on a specifically scoped set of documents composing a controversy; d) for an array of explicit text searches; and e) comparing collections of media sources (e.g., left- vs. right-wing blogs).
6. ACKNOWLEDGMENTS
The authors would like to thank Pamela Mishkin for valuable feedback on early drafts of the paper; Hal Roberts and Linas Valiukas who helped build and maintain the Media Cloud platform; and the Ford Foundation, the Open Society Foundations, the John S. and James L. Knight Foundation, and the Brown Institute for Media Innovation for their generous financial support.
7. REFERENCES
[1] Y. Benkler. The wealth of networks: How social production transforms markets and freedom. Yale University Press, 2006.
[2] Y. Benkler, H. Roberts, R. Faris, A. Solow-Niederman, and B. Etling. Social mobilization and the networked public sphere: Mapping the SOPA-PIPA debate. Berkman Center Research Publication, (2013-16), 2013.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[4] J. Chuang, S. Gupta, C. D. Manning, and J. Heer. Topic model diagnostics: Assessing domain relevance via topical alignment. In International Conference on Machine Learning (ICML), 2013.
[5] J. Chuang, C. D. Manning, and J. Heer. Termite: Visualization techniques for assessing textual topic models. In Advanced Visual Interfaces, 2012.
[6] B. Etling, H. Roberts, and R. Faris. Blogs as an alternative public sphere: The role of blogs, mainstream media, and TV in Russia’s media ecology. Berkman Center Research Publication, (2014-8), 2014.
[7] E. Graeff, M. Stempeck, and E. Zuckerman. The battle for ‘Trayvon Martin’: Mapping a media controversy online and off-line. First Monday, 19(2), 2014.
[8] J. Grimmer. A Bayesian hierarchical topic model for political texts: Measuring expressed agendas in Senate press releases. Political Analysis, 18(1):1–35, 2010.
[9] M. Hoffman, F. R. Bach, and D. M. Blei. Online learning for latent Dirichlet allocation. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 856–864. Curran Associates, Inc., 2010.
[10] Media Matters. Black and white and re(a)d all over: The conservative advantage in syndicated op-ed columns. http://mediamatters.org/research/oped/, 2014.
[11] D. Newman, C. Chemudugunta, P. Smyth, and M. Steyvers. Analyzing entities and topics in news articles using statistical topic models. In Intelligence and Security Informatics, pages 93–104. Springer, 2006.
[12] Pew Research Journalism Project. News coverage index methodology. http://www.journalism.org/newsindexmethodology/99/, 2014.
[13] D. Ramage, S. T. Dumais, and D. J. Liebling. Characterizing microblogs with topic models. ICWSM, 10:1–1, 2010.
[14] R. Rehurek and P. Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en.
[15] M. Steyvers and T. Griffiths. Probabilistic topic models. Handbook of latent semantic analysis, 427(7):424–440, 2007.
[16] E. M. Talley, D. Newman, D. Mimno, B. W. Herr II, H. M. Wallach, G. A. Burns, A. M. Leenders, and A. McCallum. Database of NIH grants using machine-learned categories and graphical clustering. Nature Methods, 8(6):443–444, 2011.
APPENDIX
The following supplementary materials are included in the appendix:
A List of 21 online media sources
B Seriated topic weights by media source for full set of100 topics
C Topics versus time (daily)
D Topics versus time (hourly)
E Topic signatures for 21 online media sources
F Cluster memberships for media sources and averagetopic signatures from k-means clustering
Appendix A: List of Online Media Sources
BBC
CBS News
CNET
CNN
Daily Mail
Daily Telegraph
Examiner.com
FOX News
Forbes
Guardian
LA Times
MSNBC
New York Times
Reuters
San Francisco Chronicle
TIME.com
The Daily News New York
The Huffington Post
The New York Post
USA Today
Washington Post
Appendix B: Seriated Topic Weights by Media Source

[Figure: topics vs source (seriated). Rows: topic_index and topic_desc; columns: media_name; circle size: % of total value (legend range 0.01% to 24.21%). The view is filtered on topic_index, which keeps 99 of 100 members. Percents are based on each row of the table.]

Rows (topic_index, in seriated order; the top five words for each index are listed with the daily time series in Appendix C): 48, 94, 4, 15, 57, 16, 90, 58, 97, 0, 81, 56, 36, 89, 21, 45, 23, 30, 83, 87, 9, 93, 8, 24, 54, 29, 60, 59, 35, 86, 27, 85, 5, 34, 1, 53, 14, 13, 41, 62, 7, 66, 61, 68, 63, 20, 25, 80, 11, 26, 73, 33, 38, 44, 31, 77, 71, 75, 17, 64, 65, 50, 19, 67, 49, 22, 98, 92, 96, 32, 76, 47, 39, 40, 72, 37, 3, 82, 42, 2, 69, 99, 51, 43, 18, 91, 55, 79, 10, 46, 74, 70, 84, 12, 95, 28, 88, 78, 6.

Columns (media_name, in seriated order): Examiner.com, Forbes, Reuters, New York Times, Washington Post, MSNBC, USA Today, FOX News, The Huffington Post, TIME.com, The New York Post, San Francisco Chronicle, CBS News, The Daily News New Yo.., LA Times, CNN, CNET, BBC, Daily Telegraph, Guardian, Daily Mail.
Mar 15 Mar 30 Apr 14 Apr 29 May 14 May 29Day of publish_date [2014]
0 london, british, britain, european, party
1 climate, cars, change, power, ballmer
2 gay, marriage, sex, davis, couples
3search, security, card, information,credit
4company, market, price, business,sales
5 band, love, fans, rock, singer
6 johnson, michigan, lewis, charlie, lik
7 airport, mexico, flight, coast, island
8 world, cup, england, team, final
9 home, house, small, like, room
10 maya, simon, soul, jazz, mad
11 india, indian, israel, jewish, minister
12sterling, jones, chamberlain, adam,carney
13 death, hospital, died, surgery, medical
14 health, care, state, insurance, medical
15 million, billion, money, bank, cash
16obama, president, house, party,election
17 american, world, topic, america, false
18 bowe, ipad, device, cable, protesters
19 com, http, pid, bitrate, mpx
20 dog, animal, dogs, segment, pet
21attack, people, killed, government,syria
22 james, miami, heat, oklahoma, spurs
23 said, car, people, near, crash
24 hair, red, white, fashion, look
25 race, road, horse, racing, run
26art, museum, memorial, barbara,history
27 said, round, got, year, play
28 brown, plants, marijuana, gordon, wood
29 black, star, rodger, year, wedding
30 run, game, runs, inning, hit
31 star, season, stars, bryan, bell
32 award, best, elliot, winning, awards
33 weight, food, diet, eating, fat
34 uri, food, sugar, cook, minutes
35 second, goal, ball, minutes, game
36 use, problem, problems, number, result
37 chicago, phoenix, beer, illinois, vista
38 shinseki, brain, body, skin, pain
39 kings, ice, japan, stanley, japanese
40org, medias, oregon, opera,photographer
41 french, brazil, france, italy, world
42 glass, drink, taste, chef, cheese
43 rangers, santa, louis, ryan, miller
44 media, cbs, news, morning, season
45 game, team, nba, games, series
46 gun, guns, camp, mount, owners
47 moore, teeth, fuel, male, rugby
48 work, help, make, new, business
49 festival, open, beats, antonio, tennis
50 women, men, bst, woman, angelou
51 video, camera, smoking, harris, footage
52 like, time, people, don, way
53 new, city, york, jersey, mayor
54 cancer, study, health, people, research
55san, california, francisco, download,bay
56 family, old, mother, father, year
57 said, president, united, states, group
58 report, department, public, said, office
59music, stage, song, performance,musical
60school, students, university, court,college
61 game, games, release, new, characters
62 photo, twitter, facebook, social, images
63 water, energy, oil, gas, environmental
64 county, park, florida, river, state
65clinton, prince, wine, blackhawks,nigeria
66military, war, veterans, army,afghanistan
67china, beijing, asian, malaysia,tiananmen
68 children, parents, girls, young, child
69thumbnail, williams, tour, australia,australian
70workers, employees, carl, union,german
71 god, path, life, christian, faith
72 min, playlist, train, height, div
73 local, people, country, town, community
74 calif, stone, comic, racist, walls
75 book, los, angeles, books, ctm
76 michelle, kin, justin, bottle, volume
77 cricket, africa, african, van, mae
78 donald, thomas, tom, massachusetts, patrick
79 cbsnews, duration, marine, species, plant
80 air, cnn, plane, space, pakistan
81 club, league, football, players, season
82 south, north, korea, carolina, prisoners
83 team, sports, detroit, season, year
84 murray, turkey, christ, flowers, shelly
85 government, tax, political, economic, party
86 friday, saturday, sunday, day, thursday
87 june, day, event, summer, free
88 foods, valley, insider, katz, silicon
89 russia, ukraine, russian, pro, ukrainian
90 police, said, year, court, man
91 hub, smith, martin, ray, disney
92 church, paul, francis, pope, john
93 film, movie, story, character, series
94 percent, year, average, rate, growth
95 lee, hall, jack, watson, bruce
96 bergdahl, taliban, defense, hagel, trial
97 data, google, apple, mobile, technology
98 weather, snowden, mountain, lake, boston
99 slug, fight, froch, vegas, dek
topics vs time (daily)
The trend of sum of Number of Records for publish_date Day broken down by topic_index and topic_desc. The data is filtered on publish_date, which keeps 379,900 of 379,901 members.
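The aggregation this caption describes (a count of records per time bucket, broken down by topic) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the sample records are invented, the function name `count_by_bucket` is hypothetical, and the sketch assumes each article carries a single `topic_index`, which the source does not confirm.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records: (publish_date, topic_index) pairs, one per article.
# The values are illustrative only and are not taken from the corpus.
records = [
    ("2014-03-15 08:12:00", 16),
    ("2014-03-15 21:40:00", 16),
    ("2014-03-15 09:05:00", 89),
    ("2014-03-16 10:30:00", 89),
]

def count_by_bucket(rows, bucket_format):
    """Count records per (time bucket, topic_index).

    bucket_format is a strftime pattern that sets the bucket size,
    e.g. "%Y-%m-%d" for days or "%Y-%m-%d %H:00" for hours.
    """
    counts = Counter()
    for publish_date, topic_index in rows:
        stamp = datetime.strptime(publish_date, "%Y-%m-%d %H:%M:%S")
        counts[(stamp.strftime(bucket_format), topic_index)] += 1
    return counts

daily = count_by_bucket(records, "%Y-%m-%d")
hourly = count_by_bucket(records, "%Y-%m-%d %H:00")
```

Changing only the bucket pattern switches between the daily and hourly views, so both charts can be derived from the same records.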
topics vs time (hourly)
The trend of sum of Number of Records for publish_date Hour broken down by topic_index and topic_desc. The data is filtered on publish_date, which keeps 379,900 of 379,901 members. The horizontal axis is the hour of publish_date in 2014, with ticks from Mar 15 to May 29. The legend repeats topics 0 to 99 with the same topic descriptions as in the daily chart.
Appendix E Topic signatures for 21 online media sources
Appendix F Cluster memberships for media sources and average topic signatures from k-means clustering