DESCRIPTION
With online publication and social media taking the main role in the dissemination of news, and with the decline of traditional printed media, it has become necessary to devise ways to automatically extract meaningful information from the plethora of sources available and to make that information readily available to interested parties. In this paper we present a method for the automated analysis of the underlying structure of online newspapers based on Q-analysis and modularity. We show how the combination of the two strategies allows for the identification of well-defined news clusters that are free of noise (unrelated stories) and provides automated clustering of trending topics in news published online.
Identifying news clusters using Q-analysis and Modularity
David Rodrigues+ Centre for Complexity and Design
+The Open University, UK – [email protected]
complexityanddesign.com Thursday am – Room S11
Complexity & Design Workshop at ECCS13
Motivation
• Find structure in collections of text documents
• Create computer algorithms to automate this discovery with minimal human supervision
• Use hybrid methodologies to improve the quality of results
  – Topology-based approach describes the data
  – Clustering technique identifies modules
Problem Description
• Identify the structure of the news published online by The Guardian (among other newspapers)
  – Clustering?
  – Topology?
  – Topic Modelling?
  – Noise?
  – Novelty?
  – Change?
[Kohut, A. and Remez, M. (2008)]
Clustering Techniques in Topic Modelling
• Nearest neighbour classification
• Bayesian probabilistic techniques
• Decision trees
• Regression models
• Neural networks
• Support vector machines
• These are language dependent and need human intervention to define the categories for the training samples.
Clustering in Graphs is Community Detection
• Modularity-based techniques [majority]
• Spectral algorithms
• Synchronization-based techniques
• …
• [See Community detection in graphs, Fortunato, 2010, for a comprehensive review]
• Binary relations between nodes don't capture the multi-level structure of existing relations.
  – Move to n-ary relations and descriptions
Previously
• We used a sliding window over the time series of the news stories
• We used Variation of Information to measure changes in an evolving adaptive network of news [Meilă 2007, Rodrigues 2010]
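The Variation of Information between two clusterings of the same set of stories can be sketched as follows. This is a minimal illustration of the standard definition, VI(A, B) = H(A|B) + H(B|A), using natural logarithms; the function name and the label-list input format are assumptions for this example, not taken from the original work.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Return VI(A, B) = H(A|B) + H(B|A) in nats.

    labels_a[i] and labels_b[i] are the cluster labels given to item i
    by two clusterings of the same items (hypothetical input format).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    n = len(labels_a)
    count_a = Counter(labels_a)              # cluster sizes in A
    count_b = Counter(labels_b)              # cluster sizes in B
    joint = Counter(zip(labels_a, labels_b))  # overlap sizes
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        # -p_ab * log(p_ab / p_a) contributes to H(B|A),
        # -p_ab * log(p_ab / p_b) contributes to H(A|B).
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))
    return vi
```

Identical clusterings give a value of zero, and the value grows as the two clusterings disagree, which is what makes it usable for comparing consecutive time windows.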
Our Proposal
• Use a high-dimensional representation of the documents (simplicial complex)
• Use Q-analysis to describe the system constructed from the Documents x Tags incidence matrix
• Use Q-connected components to filter noise
• Use modularity optimisation to find communities in the resulting induced graphs
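The Q-connected components step can be sketched as below, under the usual Q-analysis convention that each document is a simplex whose vertices are its tags, and that two documents are q-near when they share at least q + 1 tags. The function name, the dictionary input format and the toy stories are hypothetical and serve only as an illustration.

```python
from itertools import combinations


def q_connected_components(doc_tags, q):
    """Group documents into q-connected components.

    doc_tags maps a document id to its collection of tags.
    Only documents with at least q + 1 tags exist at level q.
    """
    tags = {d: set(t) for d, t in doc_tags.items()}
    docs = [d for d in tags if len(tags[d]) >= q + 1]
    parent = {d: d for d in docs}  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge every pair of documents that share a q-dimensional face.
    for a, b in combinations(docs, 2):
        if len(tags[a] & tags[b]) >= q + 1:
            parent[find(a)] = find(b)

    components = {}
    for d in docs:
        components.setdefault(find(d), set()).add(d)
    return list(components.values())


# Toy data (hypothetical story ids and tags).
stories = {
    "s1": {"t1", "t2"},
    "s2": {"t2", "t3", "t5"},
    "s3": {"t2", "t5"},
    "s4": {"t1", "t5"},
    "s5": {"t4", "t5"},
}
# Number of q-connected components for q = 0, 1, 2.
structure_vector = [len(q_connected_components(stories, q)) for q in range(3)]
```

The pairwise comparison is quadratic in the number of documents, which is acceptable for a sketch but would likely need an inverted tag index for a full month of news.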
Noise?
• In the news context, we define noise as news stories that are only loosely related to the main topics published.
• We can filter these stories out by assuming that their Q-connectedness is very low.
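One way to make this filter concrete is sketched below. It assumes that a story counts as noise when it is not q-near any other story at a chosen threshold level, that is, it shares fewer than `q_min + 1` tags with every other story. The threshold, the function name and the toy data are assumptions for illustration; the slides do not fix an exact criterion.

```python
from itertools import combinations


def split_noise(doc_tags, q_min):
    """Split stories into (kept, noise) at connectivity level q_min.

    A story is kept when it shares at least q_min + 1 tags with
    at least one other story; otherwise it is treated as noise.
    """
    tags = {d: set(t) for d, t in doc_tags.items()}
    # Highest q at which each story is q-near some other story
    # (-1 means it shares no tag with anything).
    best_q = {d: -1 for d in tags}
    for a, b in combinations(tags, 2):
        shared_q = len(tags[a] & tags[b]) - 1
        best_q[a] = max(best_q[a], shared_q)
        best_q[b] = max(best_q[b], shared_q)
    kept = {d for d, q in best_q.items() if q >= q_min}
    noise = set(tags) - kept
    return kept, noise


# Toy data (hypothetical stories and tags).
stories = {
    "a": {"eurozone", "greece", "debt"},
    "b": {"eurozone", "greece", "debt", "banking"},
    "c": {"eurozone", "italy"},
    "d": {"gardening"},
}
kept, noise = split_noise(stories, q_min=1)
```

Raising `q_min` keeps only the most tightly related stories, at the cost of discarding more of the corpus.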
The Guardian
• Classifies news with useful metadata:
  – …
  – Section
  – Tags
  – …
http://www.theguardian.com/open-platform
Open Platform with API for application development.
3 years of data: 2010, 2011 and 2012
Pseudo code for the automated news clustering and filtering algorithm
Incidence Matrix
          TAG 1   TAG 2   TAG 3   TAG 4   TAG 5   …
NEWS 1      1       1       0       0       0     …
NEWS 2      0       1       1       0       1     …
NEWS 3      0       1       0       0       1     …
NEWS 4      1       0       0       0       1     …
NEWS 5      0       0       0       1       1     …
…           …       …       …       …       …     …
Documents x Tags
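As an illustration, the incidence matrix above can be built from per-story tag sets, and the shared-face matrix used in Q-analysis follows as M·Mᵀ − 1: entry (i, j) is the number of tags shared by stories i and j, minus one. The function names are hypothetical; the data are the five rows shown in the table.

```python
def incidence_matrix(doc_tags):
    """Return (docs, tags, matrix) with matrix[i][j] = 1 if doc i has tag j."""
    docs = list(doc_tags)
    tags = sorted({t for ts in doc_tags.values() for t in ts})
    matrix = [[1 if t in doc_tags[d] else 0 for t in tags] for d in docs]
    return docs, tags, matrix


def shared_face_matrix(matrix):
    """Return M * M^T - 1 as a list of lists.

    Off-diagonal entry (i, j) is the largest q at which documents
    i and j are q-near (-1 if they share no tag); diagonal entry
    (i, i) is the dimension of document i as a simplex.
    """
    return [
        [sum(x * y for x, y in zip(row_i, row_j)) - 1 for row_j in matrix]
        for row_i in matrix
    ]


# The five rows of the table above.
news = {
    "NEWS 1": {"TAG 1", "TAG 2"},
    "NEWS 2": {"TAG 2", "TAG 3", "TAG 5"},
    "NEWS 3": {"TAG 2", "TAG 5"},
    "NEWS 4": {"TAG 1", "TAG 5"},
    "NEWS 5": {"TAG 4", "TAG 5"},
}
docs, tags, m = incidence_matrix(news)
shared = shared_face_matrix(m)
```

In this small example only NEWS 2 and NEWS 3 share two tags, so they are the only pair that remains connected at q = 1.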
Results
Community detection on the 0-connected graph
1 month of news (November 2011): modularity = 0.48, 9 communities
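For reference, the modularity score of a given partition can be computed as in the sketch below, using the standard Newman-Girvan form Q = Σ_c [L_c/m − (d_c/2m)²], where L_c is the number of edges inside community c, d_c is the total degree of its nodes and m is the number of edges. This only scores a partition; the optimisation step that searches for a high-scoring partition is not shown, and the function name and input format are assumptions.

```python
def modularity(edges, community_of):
    """Modularity of a partition of a simple undirected graph.

    edges is a list of (u, v) pairs with no duplicates or self-loops;
    community_of maps each node to its community label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = {}    # edges with both ends in the community
    degree_sum = {}  # sum of node degrees per community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Two triangles joined by one bridge edge (toy graph).
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_partition = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
toy_q = modularity(toy_edges, toy_partition)
```

Putting every node in a single community gives a score of zero, so positive values indicate more internal edges than expected for the given degrees.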
A small fraction of vertices is highly connected
Giant component only in graphs with low connectedness
Modularity vs. connectedness
Number of nodes decreases quickly with Q
Number of nodes and Edge Density
November 2011
Average Clustering and Degree Assortativity
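Two of the graph measures named on these slides, edge density and average clustering, can be computed for a simple undirected graph as sketched below. These are the textbook definitions; the function names and the edge-list format are assumptions, and degree assortativity is left out to keep the example short.

```python
from itertools import combinations


def edge_density(n_nodes, edges):
    """Fraction of possible edges present in a simple undirected graph."""
    if n_nodes < 2:
        return 0.0
    return 2 * len(edges) / (n_nodes * (n_nodes - 1))


def average_clustering(nodes, edges):
    """Mean local clustering coefficient over all nodes.

    Nodes with fewer than two neighbours contribute a coefficient of 0.
    """
    neighbours = {v: set() for v in nodes}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    total = 0.0
    for v in nodes:
        k = len(neighbours[v])
        if k < 2:
            continue
        # Count edges among the neighbours of v.
        links = sum(
            1 for a, b in combinations(neighbours[v], 2) if b in neighbours[a]
        )
        total += 2 * links / (k * (k - 1))
    return total / len(nodes) if nodes else 0.0


# Toy graph: a triangle with one pendant node.
toy_nodes = [0, 1, 2, 3]
toy_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
```

Applied to the induced graph at each value of q, these functions would give one point per q, which is the kind of curve the slides plot.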
Number of Components and Modularity
Q=5 + Modularity
Examples Of Clusters (I)
Examples Of Clusters (II)
Developed Tools
• Theseus: a Python application for collecting, processing and visualising the textual dataset, https://github.com/sixhat/theseus
• Visualisation tool
Visualisation Tool
Conclusions
• Q-analysis gives a descriptive overview of the structure of the system in terms of the local connectivity of the news stories.
• Clustering (on top of the Q-analysis) gives a natural (highly modular) division of the resulting structures.
• This allows the identification of coherent news clusters and the filtering of noise news.
Generalisation of applicability
• Instead of human-tagged documents, one can apply this to any kind of text-based documents:
  – HTML web pages: use the keywords tag from the header, or extract keywords with topic modelling (LDA, for example)
  – Scientific documents: tag documents with topic modelling strategies like LDA and, instead of treating weakly connected stories as noise, explore the possibility that they might be emerging scientific trends.
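For the HTML case, the keywords could be read from the page header with the Python standard library, as in the sketch below. It assumes the page carries a `<meta name="keywords" content="...">` element with comma-separated values; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class KeywordsParser(HTMLParser):
    """Collect the comma-separated values of <meta name="keywords">."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "keywords":
            return
        content = attributes.get("content") or ""
        # Normalise each keyword so it can be used directly as a tag.
        self.keywords.extend(
            k.strip().lower() for k in content.split(",") if k.strip()
        )


def extract_keywords(html):
    """Return the set of keywords declared in the page header."""
    parser = KeywordsParser()
    parser.feed(html)
    return set(parser.keywords)


page = (
    '<html><head><meta name="Keywords" '
    'content="Eurozone, Greece , debt"></head><body></body></html>'
)
page_tags = extract_keywords(page)
```

The resulting set plays the same role as the human-assigned tags, so each page becomes one row of the Documents x Tags incidence matrix.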
Take home message
• Real complex systems are multi-dimensional. Community detection methods need to take those descriptions into account.
• Constructing descriptions with all the relations (hyper-simplices) gives qualitatively better results.
• In the newspaper case, this helps to filter out "noise" news (unrelated news).