mining the social web
Aristides Gionis, Michael Mathioudakis
Aalto University Spring 2015
social web
facebook twitter linkedin foursquare flickr instagram pinterest youtube ustream
github stackoverflow wikipedia
social web
websites and platforms that enable users to
• produce content
  – blog posts, ‘status’ messages, videos, pictures, podcasts
• consume content
  – read text (blog posts, ‘status’ messages)
  – listen to podcasts, watch videos
• interact with each other
  – comment on each other’s posts, ‘like’ or rate items
mining the social web
a lot of users... a lot of data...
what could we learn*?
* assuming we have the data - more on that later
gain insights into...
• social behavior
  – how many connections does an average person have?
  – do people connect with like-minded people?
• political sentiment
  – what do people think about current political issues?
• how we experience our cities
  – what’s the best neighborhood for food/nightlife?
• how we build our careers
  – how often do people change careers?
  – how beneficial is it to ‘network’ professionally?
• other?
mining the social web
there is already research that explores those questions
we will discuss some of it
now and in the next two lectures
twitter
• a social sensor
  – social network + news media
  – what is happening?
  – where, who?
  – trends
  – events
  – opinions
  – political views
  – sentiments
  – demographics
twitter studies
• finding news events and stories
• detecting trends
• predicting consumer behavior
• predicting stock market(!)
• disaster response
• rumor analysis and credibility assessment
• influence analysis
• political analysis
  – polarization, bias of news media
• sociology studies
  – sentiment vs. demographics, gender inequality
• photo sharing + social network
• photos contain additional information
  – tags
  – geolocation
  – comments, favorites
  – assigned to groups
Eric Fischer
recommend tourist itineraries
foursquare
• location-based social network
• users check in to different locations
• locations have types (hierarchy)
  – restaurant, sport venue, museum, college, …
• questions:
  – where do people hang out?
  – where do events take place?
  – do friends influence each other?
when/where people check in?
[figure, panel (a): hourly check-in frequency, hours 0 to 20 on the horizontal axis, for New York, London, Barcelona, Helsinki, and in total]
(a) Hourly check-in frequency during the day. Activity is at its lowest around […] a.m.; after that there are three peaks: one when people go to work in the morning, one in the middle of the day, and one at the end of the evening. Depending on the city, however, these peaks do not happen at the same time, nor with the same intensity. Therefore, instead of working directly with the raw feature values, we use the number of standard deviations from the mean, or z-score.
[figure, panel (b): check-in frequency per four-hour slot, Monday to Sunday, for New York, London, Barcelona, Helsinki, and in total]
(b) Four-hour check-in frequency during the week. We observe a day/night cycle, which becomes less distinct as the weekend approaches and disappears on Saturday.
Figure: Check-ins temporal pattern.
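The caption above replaces raw check-in counts with z-scores so that cities of very different sizes can be compared. A minimal sketch of that normalization, assuming the z-score is taken per city over its own series of counts (the slides do not say exactly how it was computed); the city names and counts are made up for illustration:

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return how many standard deviations each value lies from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # A constant series has no spread; report every point as "average".
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical check-in counts for four time slots (not the lecture data).
hourly_checkins = {
    "big_city": [10, 40, 90, 60],
    "small_city": [1, 4, 9, 6],
}

# Normalize each city against its own mean and spread.
normalized = {city: z_scores(counts) for city, counts in hourly_checkins.items()}
```

In this toy input the two cities differ only by a constant factor, so their z-scores coincide: the normalization keeps the shape of the daily pattern and discards the overall volume.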
[figure: two bar charts titled “[…] hours time clusters in Paris”, showing the percentage of check-ins (vertical axis) per hour range (horizontal axis)]
Figure: Venues clustered by time of check-ins.
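To cluster venues by time of check-in, each venue first needs a time-of-day profile. The sketch below builds one as the percentage of check-ins per time bin. The 4-hour bin width, the function name `time_profile`, and the sample timestamps are assumptions for illustration; the bins used in the figure are not recoverable from the slides.

```python
from collections import Counter
from datetime import datetime

BIN_HOURS = 4  # assumed bin width, not taken from the figure


def time_profile(checkin_times):
    """Return the percentage of check-ins falling in each time-of-day bin."""
    n_bins = 24 // BIN_HOURS
    counts = Counter(t.hour // BIN_HOURS for t in checkin_times)
    total = len(checkin_times)
    if total == 0:
        return [0.0] * n_bins
    return [100.0 * counts[b] / total for b in range(n_bins)]


# Hypothetical check-in timestamps for one venue.
cafe = [
    datetime(2015, 1, 12, 8, 30),
    datetime(2015, 1, 12, 9, 10),
    datetime(2015, 1, 13, 8, 45),
    datetime(2015, 1, 13, 13, 5),
]
profile = time_profile(cafe)
```

Profiles like this are fixed-length vectors, so they could be passed to a standard clustering algorithm such as k-means; which algorithm produced the clusters in the figure is not stated here.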
when/where people check in?

Lowest user entropy
City       Name                       Category             Entropy
Barcelona  Castellers de Barcelona    Non-Profit           0.0139
           Café de la Pompeu          Café                 0.0172
           Ràdio […]                  Radio Station        0.0176
Paris      Boutique Orange            Electronics Store    0.0099
           Métro Goncourt […]         Subway               0.0105
           Blue Acacia                Office               0.0112

Highest user entropy
City       Name                       Category             Entropy
Barcelona  Plaça de Catalunya         Plaza                0.5835
           Sants Estació              Train Station        0.6298
           Sagrada Família            Government Building  0.6309
           Camp Nou                   Stadium              0.6852
Paris      Gare SNCF : Gare de Lyon   Train Station        0.6725
           Gare SNCF : Paris Nord     Train Station        0.6911
           Musée du Louvre            Museum               0.6924
           Tour Eiffel                Government Building  0.7167

(a) Venues in Paris and Barcelona with lowest and highest user entropy.
[plot: user entropy H_u (vertical axis) against time entropy H_t (horizontal axis), with two fitted curves: 0.738x² […] 0.385x + 0.179 and 0.009e^([…]4.103x) + 0.1]
(b) Venue entropy in Paris with respect to user population and time of check-in during the day.
Figure: Extreme values of user entropy, and its relation with time entropy.
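The slides do not define user entropy. A minimal sketch, assuming it is the Shannon entropy of a venue's check-ins distributed over users, divided by the log of the number of distinct users so that values fall in [0, 1]. The normalization, the function name, and the sample data are all assumptions:

```python
from collections import Counter
from math import log


def normalized_entropy(labels):
    """Shannon entropy of the label distribution, scaled to [0, 1]."""
    counts = Counter(labels)
    total = sum(counts.values())
    if len(counts) <= 1:
        return 0.0  # no check-ins, or every check-in has the same label
    h = -sum((c / total) * log(c / total) for c in counts.values())
    return h / log(len(counts))


# Hypothetical venues: one dominated by a single regular, one visited once each
# by many different users.
regular_spot = ["alice"] * 98 + ["bob", "carol"]
landmark = ["u%d" % i for i in range(100)]

low = normalized_entropy(regular_spot)
high = normalized_entropy(landmark)
```

Under this assumption a venue dominated by a few regulars scores near 0 and a venue visited evenly by many users scores near 1, which matches the direction of the table (offices and cafés low, stations and landmarks high). Passing time bins instead of user ids gives a time entropy in the same way.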
data sources: less obvious
traffic sensors
detecting events with traffic sensors
project ideas: less obvious
your project
come up with a project idea
implement it!
report on your results and findings
types of projects
• form a hypothesis and set out to test it
  – are rich people happier?
• start with an interesting question
  – which are hipster neighborhoods in my city?
• start with a business idea
  – recommend relevant music to music listeners
  – recommend clothes to music listeners
• start with a problem that you (think) can solve
  – how to identify trends in space and time?
• start with a cool dataset and explore it
your project
• set a goal for your project (what’s the question you want to answer?)
• study related literature (what has / hasn’t been done already? or do you think you can do it better?)
• collect data (some data are more difficult to come by)
• analyze data
• results evaluation (have you answered the question asked originally? possible improvements? future work?)
coming up with a project idea
• conferences: SIGKDD, ICWSM, WWW, WSDM
• themes
  – urban computing, trend / event detection, social networks, political sentiment, privacy
  – other
• google scholar
• talk with us
office hours: Mon, 14:15-15:30 and by appointment
collecting the data
• what data are available?
  – different platforms share different data about their users’ activity
  – browse dev sites of social networks & find out about privacy policies and APIs
  – browse public data repositories
  – the data mining group has data for blog posts, twitter, google+, facebook, foursquare
• code
  – Mining the Social Web (github): https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition
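As a rough illustration of collecting data through a platform API, the sketch below reads a public user record from the GitHub REST API using only the standard library. The helper names, the `User-Agent` string, and the sample JSON are made up for illustration. Unauthenticated requests are rate-limited, so check the platform's developer documentation and terms before collecting data at scale.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def parse_user(json_text):
    """Keep only a few fields of a GitHub user record."""
    data = json.loads(json_text)
    return {
        "login": data["login"],
        "followers": data["followers"],
        "public_repos": data["public_repos"],
    }


def fetch_user(login):
    """Download and parse the public profile of one user (needs network access)."""
    request = urllib.request.Request(
        f"{API_ROOT}/users/{login}",
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "course-project-sketch",  # hypothetical client name
        },
    )
    with urllib.request.urlopen(request) as response:
        return parse_user(response.read().decode("utf-8"))


# Offline example with a made-up response body, so the sketch runs without network.
sample = '{"login": "octocat", "followers": 42, "public_repos": 8}'
summary = parse_user(sample)
```

Keeping the parsing separate from the download makes it possible to test the analysis code on saved responses, and to re-run it later without calling the API again.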
schedule
• Today: overview
• February 2nd: discuss literature (Aris)
• February 9th: discuss literature (Michael)
• February 16th & 23rd: present project proposals
• March 30th: students submit progress report
• March 30th & April 6th: intermediate presentations
• May 4th & May 11th: final presentations
• May 15th: final report due
final report
• introduction
• related work
• problem statement
• proposed technique (algorithms)
• data description
• empirical evaluation
  – results
  – comparison with state of the art
• future work
grading
• originality (has it been done before?)
• potential impact (how interesting it is & why)
• rigorousness of proposed technique
• reproducibility (public code)
• presentation

• teams of 2 are encouraged
• presentations & reports are required
• surveys of existing techniques are ok, too
until then... browse literature
• see papers posted on noppa for a sample
• conferences: KDD, ICWSM, WWW, WSDM
• google scholar
• dev websites, for example: https://dev.twitter.com, https://developers.facebook.com, https://developer.github.com/, https://developer.foursquare.com
• code samples: https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition
• data repositories: http://snap.stanford.edu/, http://icwsm.org/2013/datasets/datasets/, http://wadam-data.dis.uniroma1.it
and talk to us!
see you next week!
Aristides Gionis
Michael Mathioudakis contact: [email protected]
Office Hours: Mon, 14:15-15:30 and by appointment