ICWSM’11 Tutorial: Exploratory Network Analysis with Gephi
Instructors: Sébastien Heymann, Julian [email protected], [email protected]
July 17, 2011 | 1 PM - 4 PM
Exploratory Network Analysis with Gephi
This tutorial is an introduction to Gephi, the open source graph network visualization and manipulation software.
Gephi aims to fulfill the complete chain from data importing to aesthetics refinements and interaction.
Users interact with the visualization and manipulate structures, shapes and colors to reveal hidden properties.
The goal is to help data analysts form hypotheses and intuitively discover patterns or errors in large data collections.
At the end, the participants will walk away with the practical knowledge enabling them to use Gephi for their own projects.
Exploratory Network Analysis with Gephi
The tutorial starts with a brief introduction to the network exploration process and a hands-on demonstration of the essential functionalities of Gephi.
Participants are guided step by step through the complete chain of representation, manipulation, layout, analysis and aesthetics refinements. Next, teams work on real datasets.
Finally, they present their preliminary results. The tutorial concludes with a general question and answer session.
Requirements
Bring your own laptop with Java and Gephi installed. Gephi should be updated (menu Help > Check for Updates).
Bring a mouse with a wheel.
Bring a dataset of your own if you want, and verify that it loads well in Gephi.[1]
[1] http://gephi.org/users/supported-graph-formats/
Workshop Schedule - Part I
Exploratory Network Analysis
• Exploratory Data Analysis
• Exploratory Network Analysis
• Looking for Orderness in Data
• Examples
• Guideline
Introduction to Gephi
• Approach and Community
• Networked Data
• Quick Start Demo
* 30 min break *
Workshop Schedule - Part II
Hands-On!
• Team Work on a Dataset
• Presentation of Preliminary Results
Q&A
Exploratory Data Analysis
“The greatest value of a picture is when it forces us to notice what we never expected to see”
started with John Tukey (1962)
Confirmatory: results
Exploratory: intuition
Serendipity: surprise
Exploratory Data Analysis
Non-linear processing chain of Ben Fry in Computational Information Design (2004)
Dummy Example
P2P file size distribution (Latapy et al., 2008)
Observation: visual saliences on specific file sizes
External knowledge: these sizes correspond to films
New hypothesis on data: films are highly exchanged, so the study might dig in this direction
Exploratory Network Analysis
1. See the network
1st graph viz tool: Pajek (1996), Vladimir Batagelj, Andrej Mrvar
2. Interact in real time
Gephi prototype (2008): group, filter, compute metrics...
3. Build a visual language
size by rank, color by partition, label, curved edges, thickness...
Looking for a “Simple Small Truth”?
Drew Conway, What Data Visualization Should Do:
1. Make complex things simple
2. Extract small information from large data
3. Present truth, do not deceive
http://www.dataists.com/2010/10/what-data-visualization-should-do-simple-small-truth/
Looking for Orderness in Data
Vary 3 cursors simultaneously to extract meaningful patterns:
• at different levels: MICRO level <-> MACRO level
• on multiple dimensions: 1 dimension <-> N dimensions
• at time scale: T+0 <-> T+N
“Zoom” cursor on Quantitative Data
Global
- connectivity
- density
- centralization
Local
- communities
- bridges between communities
- local centers vs periphery
Individual
- centrality
- distances
- neighborhood
- location
- local authority vs hub
MICRO level MACRO level
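To make the two ends of the "Zoom" cursor concrete, here is a minimal sketch in plain Python that computes one global metric (density) and one individual metric (degree centrality). The five-node directed graph is a toy example and the function names are illustrative; Gephi computes these and many other metrics itself, so this only shows what the numbers mean.

```python
from collections import defaultdict

# Toy directed network for illustration (not one of the tutorial datasets).
edges = [("a", "b"), ("a", "d"), ("b", "c"), ("e", "a"), ("c", "e")]
nodes = sorted({n for edge in edges for n in edge})


def density(n_nodes, n_edges):
    """Global level: share of all possible directed edges that exist."""
    return n_edges / (n_nodes * (n_nodes - 1))


def degree_centrality(nodes, edges):
    """Individual level: (in-degree + out-degree) / (n - 1) for each node."""
    degree = defaultdict(int)
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return {n: degree[n] / (len(nodes) - 1) for n in nodes}


global_density = density(len(nodes), len(edges))
centrality = degree_centrality(nodes, edges)
```

Here `global_density` describes the whole network with a single number, while `centrality` gives one value per node, which is the kind of quantity that can later drive node size.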
“Crossing” cursor on Qualitative Data
Social
- who with whom
- communities
- brokerage
- influence and power
- homophily
Semantic
- topics
- thematic clusters
Geographic
- spatial phenomena
1 dimension N dimensions
“Timeline” cursor on Temporal Data
Evolution of social ties
Evolution of communities
Evolution of topics
T+0 T+N
Mapping an Innovation Center
Collaborations on projects at Images et Réseaux
Themes and content
Actors
Territory
Franck Ghitalla & Ecole de Design de Nantes
Mapping Scientific Cooperations
Network Map: a Series of Choices
corpus
data
algorithms
thresholds
graphical operations
communication goals
Guideline
lists + edges in bonus, focus on qualitative data
How do attributes explain the structure?
• easy to read, “obvious” patterns
• focus on entities (in context)
• metrics are tools to describe the graph (centrality, bridging...)
• links help to build and interpret categories of entities
challenge: mix attribute crossing and connectivity
How does the structure explain attributes?
• hard to read, problem of “hidden signals”: track patterns with various layouts and filtering
• focus on structures
• metrics are tools to build the graph (cosine similarity...)
• categories help to understand the structure
challenge: pattern recognition
require high computational power
# nodes: 1 - 100 | 100 - 1,000 | 1,000 - 50,000 | > 50,000
Gephi now!
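The guideline mentions metrics used as tools to build the graph, with cosine similarity as the example. Here is a minimal sketch of that idea in plain Python: entities are described by attribute vectors, and an edge is created when two vectors are similar enough. The profiles, the threshold value and the function names are all hypothetical. The threshold is one of the "series of choices" that shape the resulting map.

```python
import math
from itertools import combinations

# Hypothetical term-count profiles per entity; names and values are illustrative only.
profiles = {
    "doc1": {"graph": 2, "layout": 1},
    "doc2": {"graph": 4, "layout": 2},
    "doc3": {"blog": 3},
}


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_similarity_edges(profiles, threshold):
    """Create an undirected weighted edge for each pair at or above the threshold."""
    edges = []
    for a, b in combinations(sorted(profiles), 2):
        weight = cosine_similarity(profiles[a], profiles[b])
        if weight >= threshold:
            edges.append((a, b, weight))
    return edges


edges = build_similarity_edges(profiles, threshold=0.5)
```

Raising or lowering `threshold` changes which edges exist at all, so the same corpus can yield quite different networks.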
Gephi in a Nutshell
« Like Photoshop™ for graphs. »
Helps data analysts reveal patterns and trends, highlight outliers, and tell stories with their data.
• Network visualization platform
• Open source, supported by a community
• Built for performance and usability
• Extensible by plug-ins
• Windows, MacOS X, Linux
Gephi Community
Communities
Contributors
Mathieu Bastian, Mathieu Jacomy, Eduardo Ramos Ibañez, Sébastien Heymann, Guillaume Ceccarelli, André Panisson, Antonio Patriarca, Cezary Bartosiak, Martin Škurla, Patrick McSweeney, Yi Du, Hélder Suzuki, Daniel Bernardes, Ernesto Aneiro, Keheliya Gallaba, Luiz Ribeiro, Urban Škudnik, Vojtech Bardiovsky, Yudi Xue
Nonprofit organization
Community Mission
Provide “sustainable” software
Maintain the technical ecosystem
Build a business ecosystem
Face cutting-edge technological challenges with a long-term vision
Distribute the software as open source
Community Values
Open innovation: ideas and features come from the entire community.
Decisions are taken with transparency.
We consider this technology a public good, and will keep it open source.
Diversity of Usages
business, leisure :-), communication, academic, art
Diversity of Network Encoding
Textual
V = { a, b, c, d, e }
E = { (a,b), (a,d), (b,c), (e,a), (c,e) }
Tabular
  a b c d e
a - 1 - 1 -
b - - 1 - -
c - - - - 1
d - - - - -
e 1 - - - -
XML
<graph>
  <nodes>
    <node id="a" />
    <node id="b" />
    <node id="c" />
    <node id="d" />
    <node id="e" />
  </nodes>
  <edges>
    <edge source="a" target="b" />
    <edge source="a" target="d" />
    <edge source="b" target="c" />
    <edge source="e" target="a" />
    <edge source="c" target="e" />
  </edges>
</graph>
Graphical
and many others...
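The encodings above all describe the same graph. Here is a minimal sketch in Python that converts the textual form (node list plus edge list) into the tabular form (adjacency matrix) and into the simplified XML structure shown on the slide. The function names are illustrative, and the XML mirrors the slide rather than a complete GEXF or GraphML document, which need additional headers and attributes.

```python
import xml.etree.ElementTree as ET

# Textual encoding from the slide: V and E.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("a", "d"), ("b", "c"), ("e", "a"), ("c", "e")]


def to_adjacency_matrix(nodes, edges):
    """Tabular encoding: cell [i][j] is 1 when an edge goes from node i to node j."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return matrix


def to_xml(nodes, edges):
    """XML encoding: simplified <graph> structure as on the slide, not full GEXF."""
    graph = ET.Element("graph")
    nodes_el = ET.SubElement(graph, "nodes")
    for name in nodes:
        ET.SubElement(nodes_el, "node", id=name)
    edges_el = ET.SubElement(graph, "edges")
    for source, target in edges:
        ET.SubElement(edges_el, "edge", source=source, target=target)
    return ET.tostring(graph, encoding="unicode")


matrix = to_adjacency_matrix(nodes, edges)
xml_text = to_xml(nodes, edges)
```

The matrix grows with the square of the node count, which is one reason edge lists and XML formats are usually preferred for large, sparse networks.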
Software I/O
Input
• file: CSV, Pajek NET, Guess GDF, GEXF, GraphML, Graphviz DOT, UCInet DL, Netdraw VNA, Tulip TLP, Excel Spreadsheet
• databases: MySQL, PostgreSQL, SQL Server, Neo4j
• graph streaming
• user input
Output
• file: CSV, Pajek NET, Guess GDF, GEXF, GraphML, Excel Spreadsheet, SVG, PDF, PNG
Choosing a File Format
Table of features supported by Gephi
Features (columns): Edge List/Matrix Structure, XML Structure, Edge Weight, Attributes, Visualization Attributes, Attribute Default Value, Hierarchical Graphs, Dynamics
Formats (rows): CSV, DL Ucinet, DOT Graphviz, GDF, GEXF, GML, GraphML, NET Pajek, TLP Tulip, VNA Netdraw, Spreadsheet*
* spreadsheets can be loaded in the Data Laboratory
Do you need...
Formats: GEXF, Spreadsheet, GraphML, Guess GDF, GML, UCINet DL, Netdraw VNA, Graphviz DOT, Pajek NET, CSV, Tulip TLP
Axis: Many features <-> Few features
File Type: XML, Tabular, Text
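When only structure and edge weights are needed, a tabular edge list is often the quickest format to produce. Here is a minimal sketch using Python's standard `csv` module. The column names `Source`, `Target` and `Weight` are an assumption based on the usual spreadsheet edge-table convention, so check them against the supported formats page for your Gephi version. The edges themselves are hypothetical.

```python
import csv
import io

# Hypothetical weighted, directed edges: (source, target, weight).
edges = [("a", "b", 2), ("a", "d", 1), ("e", "a", 3)]


def edges_to_csv(edges):
    """Serialize edges as CSV text with a header row (assumed column names)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buffer.getvalue()


csv_text = edges_to_csv(edges)
```

To produce a file rather than a string, the same writer can be pointed at `open("edges.csv", "w", newline="")`. Richer needs such as visualization attributes, hierarchy or dynamics call for one of the XML formats instead.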
Using Gephi
DEMO
Team work
1. Create a team of 2 to 3 people.
2. Choose a dataset.
3. Explore it for 1 hour.
4. Two teams present their preliminary findings.
Dataset #1: GitHub Software Repository
“GitHub is an application used by nearly a million people to store over two million code repositories, making GitHub the largest code host in the world.”
Started in 2008, it provides the features of an online social network and a software repository to lower the barriers to collaboration and make it easier to contribute code.
https://github.com
Dataset #1: GitHub Software Repository
Data extracted by Franck Cuny* at Linkfluence SAS
1st release in March 2010 -> this poster
2nd release in June 2011 -> your data
_____________Network of user profiles__________
Nodes: people with at least one repository who are followed by at least two other people
Edges: A follows B
_____________Network of repositories__________
Nodes: repositories
Edges: A shares a developer with B
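The repository network is a projection: two repositories are linked when at least one developer works on both. Here is a minimal sketch of such a projection in Python. The developer-to-repository mapping is hypothetical and is not the format of the tutorial data. Counting the shared developers as an edge weight is an added assumption, since the slide only states that an edge exists.

```python
from collections import Counter
from itertools import combinations

# Hypothetical developer -> repositories mapping (not the real dataset).
contributions = {
    "alice": ["repo1", "repo2"],
    "bob": ["repo2", "repo3"],
    "carol": ["repo1", "repo2", "repo3"],
}


def shared_developer_edges(contributions):
    """Return {(repo_a, repo_b): number of developers contributing to both}."""
    shared = Counter()
    for repos in contributions.values():
        # Each unordered pair of repositories of one developer yields one link.
        for pair in combinations(sorted(set(repos)), 2):
            shared[pair] += 1
    return dict(shared)


repo_edges = shared_developer_edges(contributions)
```

A developer with many repositories creates a link between every pair of them, so very prolific accounts can dominate a projected network.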
Very few research publications on this OSN!
Dataset #1: GitHub Software Repository
Data extracted by a crawl using the GitHub API
Seed: 10 well-known contributors in the Perl community
Networks by country: Japan, France, United States
Networks by language: Perl, PHP, Python, Ruby
Node attributes:
• user country
• number of followers
• main programming language
Edges:
• directed
• weight = number of projects A has forked from B
Dataset #1: GitHub Software Repository
Your mission (should you decide to accept it): find research hypotheses based on your exploration
Example question: are the Perl communities based on geography?
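One simple way to start on a question like this outside Gephi is to measure how many edges connect users who share an attribute value. Here is a minimal sketch in Python, in which the users, countries and edges are hypothetical and the function name is illustrative. A high share alone does not confirm a geographic effect, because it should be compared with what the country sizes would produce by chance. It is only a first indicator to set beside the visual exploration.

```python
# Hypothetical node attribute and directed "follows" edges (not the real dataset).
country = {"u1": "Japan", "u2": "Japan", "u3": "France", "u4": "France"}
follows = [("u1", "u2"), ("u2", "u1"), ("u3", "u4"), ("u1", "u3")]


def same_attribute_share(edges, attribute):
    """Fraction of edges whose two endpoints have the same attribute value.

    Edges with an endpoint missing from the attribute table are ignored.
    """
    known = [(s, t) for s, t in edges if s in attribute and t in attribute]
    if not known:
        return 0.0
    same = sum(1 for s, t in known if attribute[s] == attribute[t])
    return same / len(known)


share = same_attribute_share(follows, country)
```

In Gephi, the visual counterpart is to color nodes by country and check whether the layout groups same-colored nodes together.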
Dataset #2: The Irish Blogosphere
_______________Blogroll Network______________
Nodes: blogs with more than two blogroll links
Edges: blogroll link (in-link)
_______________Post-link Network_____________
Nodes: blogs with more than two blogroll links
Edges: hyperlink inside post from a blog to another (post-link)
“Identifying Representative Textual Sources in Blog Networks”. K. Wade, D. Greene, C. Lee, D. Archambault, P. Cunningham (2011) http://mlg.ucd.ie/blogs
Dataset #2: The Irish Blogosphere
Data extracted by a crawl at distance 2 from the seed for the in-links and Google Blog Search for the post-links.
Seed: 21 popular blogs, winners of the “2010 Irish Blog Awards”
Node attributes:
• post count = total number of posts by blog
• category = from the Irish blog index at www.irishblogdirectory.com, where available
• infomap_comm = community to which a node belongs (infomap algo)
• gce_comms = overlapping communities (GCE algo)
• moses_comms = overlapping communities (MOSES algo)
Edges:
• directed
• weight = number of hyperlinks in the Post-link network
crawl at distance 2 from the seed
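A crawl at distance 2 from a seed is a breadth-first traversal that stops expanding once a node is two hops away. Here is a minimal sketch in Python, in which the link table, the `get_links` callback and the function name are hypothetical stand-ins. The actual crawler and the link direction it followed are not described in these slides.

```python
from collections import deque

# Hypothetical link structure: blog -> blogs it is connected to.
links = {
    "seed": ["b1", "b2"],
    "b1": ["b3"],
    "b2": ["b1"],
    "b3": ["b4"],
    "b4": [],
}


def crawl(seeds, get_links, max_distance):
    """Breadth-first crawl; returns {node: distance from the nearest seed}."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        current = queue.popleft()
        if distance[current] == max_distance:
            continue  # Nodes at the limit are kept but not expanded further.
        for neighbor in get_links(current):
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    return distance


reached = crawl(["seed"], lambda blog: links.get(blog, []), max_distance=2)
```

With this toy table the crawl keeps `b3` at distance 2 but never reaches `b4`, which shows how the distance limit bounds the corpus and therefore the network that can be observed.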
Dataset #2: The Irish Blogosphere
Your mission: explore and try to confirm the official results
Hands-On!
Start:
• Load a graph
• Apply a layout
• Color the nodes by a qualitative variable in Partition Panel
• Size the nodes by a quantitative variable in Ranking Panel
• Start to explore... compute metrics, filter the network
End:
• Export maps to PDF in Preview Tab
• Save
Presentations
GitHub Repository Irish Blogosphere
Gephi Documentation
Web Site: http://gephi.org
Support: http://forum.gephi.org
Wiki: http://wiki.gephi.org
Source code: https://launchpad.net/gephi
Online Tutorials
http://gephi.org/users/quick-start/
http://gephi.org/users/tutorial-visualization/
http://gephi.org/users/tutorial-layouts/
http://wiki.gephi.org/index.php/Import_CSV_Data
http://wiki.gephi.org/index.php/Import_Dynamic_Data
Tutorial in Spanish
https://code.google.com/p/camon/wiki/Taller_Gephi
Supported Graph Formats
http://gephi.org/users/supported-graph-formats/
Thank You!
Caspar David Friedrich - Wanderer Above the Sea of Fog
Credits
[slide 11] images from Drew Conway
http://www.dataists.com/2010/10/what-data-visualization-should-do-simple-small-truth/
[slide 22 top left] Benoît Vidal at MFG Labs
[slide 22 bottom center] Franck Ghitalla at UTC
[slide 22 right] Studies in MA Digital Fashion at LCF by Peter Jeun Ho Tsang
http://jeunhotsang.com/blog/2010/12/07/prototype/
[slide 27] sketches from Ben Fry, Computational Information Design
Special thanks to Franck Ghitalla and Mathieu Jacomy for their insightful discussions.