Identifying news clusters using Q-analysis and Modularity

DESCRIPTION

With online publication and social media taking the main role in the dissemination of news, and with the decline of traditional printed media, it has become necessary to devise ways to automatically extract meaningful information from the plethora of sources available and to make that information readily available to interested parties. In this paper we present a method for the automated analysis of the underlying structure of online newspapers, based on Q-analysis and modularity. We show how the combination of the two strategies allows for the identification of well-defined news clusters that are free of noise (unrelated stories) and provides automated clustering of information on trending topics in news published online.

Citation preview

Page 1: Identifying news clusters using Q-analysis and Modularity

Identifying news clusters using Q-analysis and Modularity

David Rodrigues, Centre for Complexity and Design

The Open University, UK – [email protected]

Page 2: Identifying news clusters using Q-analysis and Modularity

Complexity & Design Workshop at ECCS13

complexityanddesign.com – Thursday am – Room S11

Page 3: Identifying news clusters using Q-analysis and Modularity

Motivation

• Find structure in collections of text documents.
• Create computer algorithms to automate this discovery with minimal human supervision.
• Use hybrid methodologies to improve the quality of results:
  – A topology-based approach describes the data.
  – A clustering technique identifies modules.

Page 4: Identifying news clusters using Q-analysis and Modularity

Problem Description

• Identify the structure of the news published online by The Guardian (among other newspapers):
  – Clustering?
  – Topology?
  – Topic modelling?
  – Noise?
  – Novelty?
  – Change?

[Kohut, A. and Remez, M. (2008)]

Page 5: Identifying news clusters using Q-analysis and Modularity

Clustering Techniques in Topic Modelling

• Nearest neighbour classification
• Bayesian probabilistic techniques
• Decision trees
• Regression models
• Neural networks
• Support vector machines

• Language dependent / human intervention in the definition of categories for training samples.

Page 6: Identifying news clusters using Q-analysis and Modularity

Clustering in Graphs is Community Detection

• Modularity-based techniques [majority]
• Spectral algorithms
• Synchronization-based techniques
• …
• [Community detection in graphs, Fortunato, 2010, for a comprehensive review]

• Binary relations between nodes don't capture the multi-level structure of existing relations.
  – Move to n-ary relations and descriptions.

Page 7: Identifying news clusters using Q-analysis and Modularity

Previously

• We used a sliding window over the time series of the news stories.
• We used Variation of Information to measure changes in an evolving adaptive network of news [Meilă 2007, Rodrigues 2010].
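
As a reminder of what Variation of Information measures, here is a minimal sketch that compares two clusterings of the same items, using VI(A, B) = H(A) + H(B) - 2 I(A; B). It assumes each clustering is given as a list of cluster labels, one per story; the function name and input layout are illustrative and are not taken from the authors' code.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Return VI(A, B) in nats for two clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        # VI = H(A|B) + H(B|A), summed over the joint distribution
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))
    return vi
```

Two identical clusterings give a distance of zero regardless of how the clusters are named, and the value grows as the two windows group the stories differently.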

Page 8: Identifying news clusters using Q-analysis and Modularity

Our Proposal

• Use a high-dimensional representation of the documents (simplicial complex).
• Use Q-analysis to describe the system constructed from the Documents x Tags incidence matrix.
• Use Q-connected components to filter noise.
• Use modularity optimisation to find communities in the resulting induced graphs.

Page 9: Identifying news clusters using Q-analysis and Modularity

Noise?

• In the news context, we define noise as news stories that are only loosely related to the main topics published.
• We can filter them out by assuming that the Q-connectedness of these stories is very low.
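
One possible reading of this filter, sketched under stated assumptions: a story counts as noise at level q when it shares fewer than q + 1 tags with every other story. The story names, tags and the `filter_noise` helper below are hypothetical, and the threshold rule is an assumption rather than the authors' exact criterion.

```python
from itertools import combinations


def filter_noise(doc_tags, q):
    """Keep documents that share at least q + 1 tags with some other document."""
    kept = set()
    for (doc_a, tags_a), (doc_b, tags_b) in combinations(doc_tags.items(), 2):
        if len(tags_a & tags_b) >= q + 1:
            kept.update((doc_a, doc_b))
    return kept


# Hypothetical stories and tags, for illustration only.
DOC_TAGS = {
    "story-1": {"economy", "eurozone", "greece"},
    "story-2": {"economy", "eurozone", "italy"},
    "story-3": {"economy", "football"},
    "story-4": {"gardening"},
}

core_stories = filter_noise(DOC_TAGS, q=1)  # stories sharing two or more tags
```

Raising q makes the filter stricter: in this toy data only the two eurozone stories survive at q = 1.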

Page 10: Identifying news clusters using Q-analysis and Modularity

The Guardian

• Classifies news with useful metadata:
  – …
  – Section
  – Tags
  – …

• Open Platform with API for application development: http://www.theguardian.com/open-platform
• 3 years of data: 2010, 2011 and 2012
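
A rough sketch of how tagged stories might be pulled from the Open Platform and turned into a document-to-tags mapping. The endpoint URL, the query parameter names and the JSON layout (`response` → `results` → `tags`) are assumptions based on the public Content API and should be checked against its documentation; the helper names are hypothetical, and no request is actually sent here.

```python
from urllib.parse import urlencode

# Assumed search endpoint of the Content API.
SEARCH_ENDPOINT = "https://content.guardianapis.com/search"


def build_search_url(api_key, from_date, to_date, page=1):
    """Build a search URL that asks for keyword tags on each result."""
    params = {
        "api-key": api_key,
        "from-date": from_date,
        "to-date": to_date,
        "show-tags": "keyword",
        "page-size": 50,
        "page": page,
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def tags_by_document(payload):
    """Map each article id to the set of its tag ids (assumed JSON layout)."""
    results = payload.get("response", {}).get("results", [])
    return {
        item["id"]: {tag["id"] for tag in item.get("tags", [])}
        for item in results
    }
```

The mapping returned by `tags_by_document` is all that the later steps need: each story becomes a row of the Documents x Tags incidence matrix.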

Page 11: Identifying news clusters using Q-analysis and Modularity

Pseudo code for the automated news clustering and filtering algorithm

Page 12: Identifying news clusters using Q-analysis and Modularity

Pseudo code for the automated news clustering and filtering algorithm (continued)

Page 13: Identifying news clusters using Q-analysis and Modularity

Incidence Matrix (Documents x Tags)

          TAG 1   TAG 2   TAG 3   TAG 4   TAG 5   …
NEWS 1      1       1       0       0       0     …
NEWS 2      0       1       1       0       1     …
NEWS 3      0       1       0       0       1     …
NEWS 4      1       0       0       0       1     …
NEWS 5      0       0       0       1       1     …
…           …       …       …       …       …     …
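
To make the link between this matrix and Q-analysis concrete, the sketch below computes the shared-face matrix for the five example rows shown on this slide: entry (i, j) is the number of tags two stories have in common, minus one. This is the standard Q-analysis construction; the variable and function names are illustrative, and the elided rows and columns ("…") are left out.

```python
# Rows are NEWS 1..5 and columns are TAG 1..5, copied from the slide.
INCIDENCE = [
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1],
]


def shared_face_matrix(incidence):
    """Entry (i, j) is (number of tags shared by documents i and j) - 1."""
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) - 1 for row_j in incidence]
        for row_i in incidence
    ]


SHARED = shared_face_matrix(INCIDENCE)
```

The diagonal gives the dimension of each story's simplex (NEWS 2 has three tags, so dimension 2). Off the diagonal, NEWS 2 and NEWS 3 share two tags and are therefore 1-near, while NEWS 1 and NEWS 5 share none and get -1.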

Page 14: Identifying news clusters using Q-analysis and Modularity

Results

Page 15: Identifying news clusters using Q-analysis and Modularity

Community detection on the 0-connected graph

• 1 month of news: November 2011
• Modularity = 0.48
• 9 communities
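
For readers unfamiliar with the modularity score quoted here, the following sketch evaluates the Newman-Girvan modularity of a given partition of an undirected, unweighted graph. It only scores a partition; it does not perform the optimisation itself, and it is not the authors' implementation. The toy graph and partition are hypothetical.

```python
def modularity(edges, community_of):
    """Newman-Girvan modularity: sum over communities of L_c/m - (d_c/2m)^2."""
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph with no edges")
    internal = {}  # number of edges with both ends inside the community
    degree = {}    # sum of node degrees inside the community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items()
    )


# Hypothetical toy graph: two triangles joined by a single edge.
EDGES = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
PARTITION = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

score = modularity(EDGES, PARTITION)  # 5/14, roughly 0.357
```

Putting every node in a single community gives a score of zero, which is why values well above zero are read as evidence of real community structure.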

Page 16: Identifying news clusters using Q-analysis and Modularity

Small fraction of vertices is highly connected

Page 17: Identifying news clusters using Q-analysis and Modularity

Giant component only for low connected graph

Page 18: Identifying news clusters using Q-analysis and Modularity

Modularity vs. connectedness

Page 19: Identifying news clusters using Q-analysis and Modularity

Number of nodes decreases quickly with Q

Page 20: Identifying news clusters using Q-analysis and Modularity

Number of nodes and Edge Density

November 2011
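
The quantities plotted on these slides can be reproduced in outline with a sweep over q. The sketch below assumes that the graph at level q keeps the stories with at least q + 1 tags and links two stories when they share at least q + 1 tags; that construction, the helper names and the five-story toy data (the example rows of the incidence matrix slide) are assumptions for illustration.

```python
from itertools import combinations


def q_graph(doc_tags, q):
    """Nodes: documents with >= q + 1 tags. Edges: pairs sharing >= q + 1 tags."""
    nodes = sorted(d for d, tags in doc_tags.items() if len(tags) >= q + 1)
    edges = [
        (a, b) for a, b in combinations(nodes, 2)
        if len(doc_tags[a] & doc_tags[b]) >= q + 1
    ]
    return nodes, edges


def count_components(nodes, edges):
    """Count connected components with a small union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(n) for n in nodes})


def sweep(doc_tags, max_q):
    """Return (q, nodes, edge density, components) for q = 0..max_q."""
    rows = []
    for q in range(max_q + 1):
        nodes, edges = q_graph(doc_tags, q)
        n = len(nodes)
        density = 2 * len(edges) / (n * (n - 1)) if n > 1 else 0.0
        rows.append((q, n, density, count_components(nodes, edges)))
    return rows


# The five example rows of the Documents x Tags matrix, as tag sets.
DOC_TAGS = {
    "n1": {"t1", "t2"},
    "n2": {"t2", "t3", "t5"},
    "n3": {"t2", "t5"},
    "n4": {"t1", "t5"},
    "n5": {"t4", "t5"},
}

ROWS = sweep(DOC_TAGS, 2)
```

Even on this tiny example the pattern matches the slide titles: at q = 0 the graph is one dense component, at q = 1 it fragments, and at q = 2 a single story remains.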

Page 21: Identifying news clusters using Q-analysis and Modularity

Average Clustering and Degree Assortativity

Page 22: Identifying news clusters using Q-analysis and Modularity

Number of Components and Modularity

Page 23: Identifying news clusters using Q-analysis and Modularity

Q = 5 + Modularity

Page 24: Identifying news clusters using Q-analysis and Modularity

Examples of Clusters (I)

Page 25: Identifying news clusters using Q-analysis and Modularity

Examples of Clusters (II)

Page 26: Identifying news clusters using Q-analysis and Modularity

Developed Tools

• Theseus – A Python application for collecting, processing and visualising the textual dataset: https://github.com/sixhat/theseus
• Visualisation tool

Page 27: Identifying news clusters using Q-analysis and Modularity

Visualisation Tool

Page 28: Identifying news clusters using Q-analysis and Modularity

Conclusions

• Q-analysis gives a descriptive overview of the structure of the system, in terms of the local connectivity of the news stories.
• Clustering (on top of the Q-analysis) gives a natural (highly modular) division of the resulting structures.
• This allows the identification of coherent news clusters and the filtering of noise news.

Page 29: Identifying news clusters using Q-analysis and Modularity

Generalisation of applicability

• Instead of human-tagged documents, one can apply this to any kind of text-based document:
  – HTML webpages: use the keywords tag from the header, or extract keywords with topic modelling (LDA, for example).
  – Scientific documents: tag documents with topic modelling strategies like LDA and, instead of treating low-connected documents as noise, explore the possibility that they might be emerging scientific trends.
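
For the HTML case, a minimal sketch of reading the keywords tag from a page header with Python's standard `html.parser` module. The class and function names are hypothetical, and treating the `content` attribute as a comma-separated list is an assumption about how pages fill in that tag.

```python
from html.parser import HTMLParser


class KeywordsParser(HTMLParser):
    """Collect the comma-separated content of <meta name="keywords"> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            self.keywords.update(
                k.strip().lower() for k in content.split(",") if k.strip()
            )


def extract_keywords(html_text):
    """Return the set of keywords declared in the page header."""
    parser = KeywordsParser()
    parser.feed(html_text)
    return parser.keywords
```

The resulting keyword sets can stand in for the human-assigned tags when building the Documents x Tags incidence matrix.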

Page 30: Identifying news clusters using Q-analysis and Modularity

Take home message

• Real complex systems are multi-dimensional. Community detection methods need to take those descriptions into account.
• Constructing descriptions with all the relations (hyper-simplices) gives qualitatively better results.
• In the newspapers case, this helps the filtering of "noise" news (unrelated news).