Social Media DataSet, UPV, Aplicaciones de la Lingüística Computacional, April 22nd, 2010, http://www.slideshare.com/jccortizo

Social Media Dataset

DESCRIPTION

The goal of this presentation is to help researchers understand the possibilities of Social Media as a research area for fields related to NLP/IR/DM.

Page 1: Social Media Dataset

Social Media DataSet

UPV, Aplicaciones de la Lingüística Computacional

April 22nd, 2010

http://www.slideshare.com/jccortizo

Page 2: Social Media Dataset

José Carlos Cortizo

Twitter: @josek_net

Page 3: Social Media Dataset

Goal

‣ Understand the possibilities of Social Media for Research

Page 4: Social Media Dataset

Index

‣ About the Speaker

‣ Research at UEM

‣ SGP, Wipley, BrainSins

‣ Social Media statistics

‣ Social Media as research field

‣ Applications

‣ Academic research vs Enterprise

Page 5: Social Media Dataset

José Carlos Cortizo

2004 2005 2006 2007 2008 2009 2010

consultancy

NLP, DM

10 R&P projects, 4 workshops, 20 papers

Page 6: Social Media Dataset

My Company

SGP

Videogamers Social Network

2,600 reg. users, ~10K monthly users

SaaS software for SM & e-C

RecSys, Social Search...

Page 7: Social Media Dataset

SGP Funding

• €350K from CDTI (PID project)

• €100K from FFF

• still looking for extra €100K from BA

Page 8: Social Media Dataset

Web 2.0 vs. Social Web

Page 9: Social Media Dataset

What’s Web 2.0?

Page 10: Social Media Dataset

An evolution (of) and (based on) the Web...

Page 11: Social Media Dataset

...focused on users

http://www.flickr.com/photos/cayusa/431036565

Page 12: Social Media Dataset

Concept introduced by “Tim O’Reilly” in 2004

http://www.flickr.com/photos/thomashawk/153656919/

Page 13: Social Media Dataset

Users own the Information

They should manage the information

Page 14: Social Media Dataset

It’s not a technology, it’s a philosophical concept

Page 15: Social Media Dataset

Wikis (and Wikipedia), blogs, etc.

Page 16: Social Media Dataset

Evolution of the Web

Page 17: Social Media Dataset

Concepts

Page 18: Social Media Dataset

AJAX

• Web development techniques empowering Web 2.0

• Many developers misunderstand the concept of Web 2.0 and think it just means AJAX

Page 19: Social Media Dataset

Social Web

• Describes how people socialize with each other through the WWW

• 2 descriptions

• Web 2.0

• Proposal for a future network similar to WWW

Page 20: Social Media Dataset

Social Media

• Media designed to be disseminated through social interaction

• Internet forums, blogs, microblogging, wikis, podcasts, social networks, etc.

• More info: http://www.slideshare.net/jccortizo/taller-redes-sociales-presentation

Page 21: Social Media Dataset

Social Media Statistics

Page 22: Social Media Dataset

Facebook

Page 23: Social Media Dataset

Facebook stats.

• > 400M active users

• 50% log in to FB on any given day

• > 35M users update their status each day

• > 60M status updates per day

• 3 billion photos uploaded each month

• 5 billion pieces of content shared each week
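
A quick back-of-the-envelope check helps put these figures in proportion. The sketch below only uses the numbers on this slide; since most of them are stated as lower bounds (">"), the resulting ratios are rough estimates, and the variable names are chosen here for illustration.

```python
# Figures copied from the slide; most are lower bounds (the slide says ">").
ACTIVE_USERS = 400_000_000
DAILY_LOGIN_SHARE = 0.50
USERS_UPDATING_STATUS_PER_DAY = 35_000_000
STATUS_UPDATES_PER_DAY = 60_000_000
PHOTOS_PER_MONTH = 3_000_000_000

# Derived quantities (rough estimates, not official statistics).
daily_logins = ACTIVE_USERS * DAILY_LOGIN_SHARE
updates_per_updating_user = STATUS_UPDATES_PER_DAY / USERS_UPDATING_STATUS_PER_DAY
photos_per_active_user = PHOTOS_PER_MONTH / ACTIVE_USERS

print(f"Users logging in per day: {daily_logins / 1e6:.0f}M")
print(f"Status updates per updating user per day: {updates_per_updating_user:.2f}")
print(f"Photos uploaded per active user per month: {photos_per_active_user:.1f}")
```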

Page 24: Social Media Dataset

Facebook stats.

• Avg. user has 130 friends on FB

• Avg. user sends 8 friend req. per month

• Avg. user spends > 55 min. per day

• > 70% FB users are outside USA

• > 500K applications

Page 25: Social Media Dataset

Facebook stats.

• > 60M FB users use FB Connect on external websites

• > 100M accessing FB through mobile

• > 200 mobile operators in 60 countries deploying/promoting FB mobile products

http://www.facebook.com/press/info.php?statistics

Page 26: Social Media Dataset

Twitter

Page 27: Social Media Dataset

Twitter stats.

• > 105M. registered users

• 300K users sign up every day

• > 180M. unique visitors per month

• 75% of traffic comes from 3rd-party apps

• > 600M search queries on Twitter/day

• 37% of active users use mobile

http://www.readwriteweb.com/archives/just_the_facts_statistics_from_twitter_chirp.php

Page 28: Social Media Dataset

Facts

Page 29: Social Media Dataset

SM is the New Web

• Facebook traffic tops Google (for USA)

• FB > 7% of US traffic

• March 2010

• http://money.cnn.com/2010/03/16/technology/facebook_most_visited/

Page 30: Social Media Dataset

SM envisioning the Future

• Mobile Web

• Search

• Real-time search

• Social search

• Online identity

http://www.madrimasd.org/blogs/sistemas_inteligentes/2009/01/19/111413

Page 31: Social Media Dataset

Mobile Web

• No real mobile web until Social Media

• 25% of FB users and 37% of Twitter users access from mobile devices

• Trend: more mobile web users than “regular” ones within the next 5 years [1]

[1] J. C. Cortizo, L. I. Diaz, F. Carrero, B. Monsalve, “On the Future of Mobile Phones as the Heart of Community Built Databases” to appear in 2011

Page 32: Social Media Dataset

Real-Time Search

Page 33: Social Media Dataset

Social Search

Page 34: Social Media Dataset

Online Identity

• User identity is a real business

• Facebook Connect, OAuth, OpenID...

Page 35: Social Media Dataset

Social Media as Research Field

Page 36: Social Media Dataset

We need Data

• We need data to validate our research

• Why use “non-real”/small/“non-relevant”/old-fashioned datasets?

• UCI

• Reuters

• ...

Page 37: Social Media Dataset

SM, Huge DataSet

• Billions of users

• Billions of content items

• Textual, Multimedia (image, videos, etc.)

• Billions of connections

• Behaviors, preferences, trends...

Page 38: Social Media Dataset

SM Openness

• It’s easy to get data from SM

• SM based datasets

• Developer APIs

• Spidering the Web
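
As an illustration of the "spidering" route, the sketch below extracts the outgoing links from one page, which is the basic step a crawler repeats. It uses only the Python standard library; the sample HTML, the example.com/example.org URLs and the names `LinkExtractor` and `extract_links` are made up for this example and do not come from the talk.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content; a real spider would download it first.
SAMPLE_PAGE = """
<html><body>
  <a href="/users/alice">alice</a>
  <a href="http://example.org/posts/42">a post</a>
  <a name="no-href">anchor without a link</a>
</body></html>
"""

links = extract_links(SAMPLE_PAGE, "http://example.com/")
print(links)
```

A full spider would add a queue of URLs to visit, a set of already visited pages, and polite rate limiting on top of this step.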

Page 39: Social Media Dataset

Available DataSets

• Social Tagging (CiteULike, Bibsonomy, MovieLens, Delicious, Flickr, Last.FM...)

• http://kmi.tugraz.at/staff/markus/datasets/

• Yahoo! Firehose (750K ratings/day, 8K reviews/day, 150K comments/day, status updates, Flickr, Delicious...)

• http://developer.yahoo.net/blog/archives/2010/04/yahoo_updates_firehose.html
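
Social tagging datasets are commonly modelled as (user, resource, tag) assignments. The sketch below assumes that representation and a tiny made-up sample to show two typical first measurements, tag frequency and tag co-occurrence; the real dumps listed above each have their own file formats, which are not reproduced here.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical (user, resource, tag) assignments, for illustration only.
ASSIGNMENTS = [
    ("u1", "paper-a", "nlp"),
    ("u1", "paper-a", "twitter"),
    ("u2", "paper-a", "nlp"),
    ("u2", "paper-b", "recsys"),
    ("u3", "paper-b", "twitter"),
    ("u3", "paper-b", "recsys"),
]

# Group the distinct tags attached to each resource.
tags_by_resource = defaultdict(set)
for user, resource, tag in ASSIGNMENTS:
    tags_by_resource[resource].add(tag)

# How often each tag was assigned, over all users and resources.
tag_counts = Counter(tag for _, _, tag in ASSIGNMENTS)

# Number of resources on which two tags appear together.
cooccurrence = Counter()
for tags in tags_by_resource.values():
    for pair in combinations(sorted(tags), 2):
        cooccurrence[pair] += 1

print(tag_counts.most_common())
print(dict(cooccurrence))
```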

Page 42: Social Media Dataset

Key Benefits

• There’s a lot of data on SM

• It’s fun!

• You can work on a real-real domain

• Make (real) money with your research?

Page 43: Social Media Dataset

Where to publish?

• ICWSM: AAAI Conference on Weblogs and Social Media

• MSM/SMUC: Workshop on Search and Mining User-generated Contents

• WWW: 4 Social Networks sessions + 15 other S.M.-related papers

• ACM RecSys + Social Web workshop

Page 45: Social Media Dataset

Where to publish?

• Any other ‘typical’ conference from your research area

• Social web/search/mining/networks analysis workshops on almost any relevant conference

Page 46: Social Media Dataset

Other Research Uses

Twitter Lists

Page 47: Social Media Dataset

Other Research Uses

Twitter Searches

Page 48: Social Media Dataset

Other Research Uses

Twitter Users

Page 49: Social Media Dataset

Other Research Uses

Blogs!

Page 50: Social Media Dataset

Other Research Uses

Page 51: Social Media Dataset

Other Research Uses

Page 52: Social Media Dataset

• Don’t wait until the conferences to learn about advances

• Follow interesting researchers through Twitter and their blogs

• Peer-reviewing sucks!

• You can learn even more from failed attempts, or work in progress

• Open your mind...

Page 53: Social Media Dataset

Some Applications

Page 54: Social Media Dataset

Buzzer: Twitter RecSys

Page 55: Social Media Dataset

Buzzer

• O. Phelan (@phelo), K. McCarthy, B. Smyth, “Using Twitter to Recommend Real-Time Topical News”, ACM RecSys 2009

• Goal: News Recommendation

• Not using Reuters or similar datasets
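
To make the idea of recommending news from Twitter concrete, here is a toy sketch that ranks articles by how many of their terms are currently frequent in a set of tweets. It is a deliberately simplified term-overlap score, not the Buzzer system itself; the tweets, articles, stopword list and function names are all invented for this example.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "in", "of", "to", "is", "on", "and", "for"}


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def rank_articles(tweets, articles):
    """Order article titles by the summed tweet frequency of their distinct terms."""
    tweet_terms = Counter(t for tweet in tweets for t in tokenize(tweet))
    scores = {}
    for title, body in articles.items():
        terms = set(tokenize(title + " " + body))
        scores[title] = sum(tweet_terms[t] for t in terms)
    return sorted(scores, key=lambda title: scores[title], reverse=True)


# Hypothetical data, for illustration only.
TWEETS = [
    "Volcano ash cloud grounds flights across Europe",
    "Stuck at the airport, flights cancelled because of the volcano",
    "New phone released today",
]
ARTICLES = {
    "Ash cloud disrupts air travel": "Flights cancelled as volcano ash spreads over Europe",
    "Phone maker reports earnings": "The company released quarterly results",
    "Local team wins cup": "Fans celebrate in the city centre",
}

ranking = rank_articles(TWEETS, ARTICLES)
print(ranking)
```

A more realistic variant would weight terms with TF-IDF and restrict the tweets to a recent time window.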

Page 56: Social Media Dataset

Buzzer

Page 57: Social Media Dataset

Buzzer

Page 58: Social Media Dataset

Why use Twitter?

• “Typical” news sites are boring

• You’ll get compared to Google News

• You’re innovating just by using Twitter

• You’ll benefit from Twitter hype

• You get a real and interesting system to deploy under real conditions

Page 59: Social Media Dataset

FlickrBabel: Multilingual multimedia search

Page 60: Social Media Dataset

Machine Translation

• An open problem

• But the current state of the art is good enough for some applications

Page 61: Social Media Dataset

Idea

• Do we need a Spanish Metamap?

• F. Carrero, J. C. Cortizo, J. M. Gomez, M. de Buenaga “In the Development of a Spanish Metamap”, CIKM 2008

• F. Carrero, J. C. Cortizo, J. M. Gomez, “Testing Concept Indexing in Crosslingual Medical Text Classification”, ICDIM 2008

Page 62: Social Media Dataset

Idea

Page 63: Social Media Dataset

Idea

Page 64: Social Media Dataset

Idea

• Results show that using Google Translate is just enough for “simulating” a Spanish Metamap

• for classification purposes

Page 65: Social Media Dataset

Extending to SM

• We applied the same idea to FlickrBabel

Page 66: Social Media Dataset

Extending to SM

• We applied the same idea to FlickrBabel

• Search images from Flickr

• Babxel also searches YouTube

• Expands the query and improves recall
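
The effect of translation-based query expansion on recall can be shown with a toy example. The sketch below assumes a small hard-coded translation table in place of a machine-translation service and a handful of made-up tagged photos; it does not call Flickr, YouTube or Google Translate, and all names in it are hypothetical.

```python
# Hypothetical translation table standing in for a machine-translation service.
TRANSLATIONS = {
    "dog": {"es": "perro", "fr": "chien"},
    "beach": {"es": "playa", "fr": "plage"},
}

# Hypothetical photos with the tags their uploaders chose.
PHOTOS = {
    "photo1": {"dog", "park"},
    "photo2": {"perro", "playa"},
    "photo3": {"chien"},
    "photo4": {"cat"},
}


def expand_query(term):
    """Return the term together with its known translations."""
    return {term} | set(TRANSLATIONS.get(term, {}).values())


def search(terms):
    """Return the photos sharing at least one tag with the query terms."""
    return {photo for photo, tags in PHOTOS.items() if tags & terms}


def recall(retrieved, relevant):
    return len(retrieved & relevant) / len(relevant)


RELEVANT = {"photo1", "photo2", "photo3"}  # photos that actually show a dog

plain = search({"dog"})
expanded = search(expand_query("dog"))

print(f"recall without expansion: {recall(plain, RELEVANT):.2f}")
print(f"recall with expansion:    {recall(expanded, RELEVANT):.2f}")
```

In this sample the English-only query finds one of the three relevant photos, while the expanded query finds all three.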

Page 67: Social Media Dataset

FlickrBabel

Page 68: Social Media Dataset

FlickrBabel

• We got a lot of buzz/users from

• Mashable

• Loogic

• WwwhatsNew

• And thousands more blogs/sites

Page 69: Social Media Dataset

evri: entity based search engine (+ API)

Page 70: Social Media Dataset

Page 71: Social Media Dataset

evri

• Not a typical search engine

• Real-time

• Semantic

• Entity recognition

• Opinion Mining

Page 72: Social Media Dataset

Offers a great API

Page 73: Social Media Dataset

What’s good

• They integrate many state-of-the-art NLP/IR technologies into something usable

• The API can be used to develop evri-based products and applications

• If you have a good technology, build a good product/service around it

Page 74: Social Media Dataset

Entity recognition

Page 75: Social Media Dataset

Recommendations

Page 76: Social Media Dataset

Sentiment Analysis

Page 77: Social Media Dataset

Not good enough?

• There are no “not good enough” technologies

• There are useful or not useful products/services

• Show your technology to the world, they’ll be the best ‘reviewers’

Page 78: Social Media Dataset

Academic vs Enterprise

Page 79: Social Media Dataset

Academic:

· Too idealistic

· ‘Fantasy?’ world

· Too many assumptions

· Research

· Guided by public funds

· Non-applicable

Enterprise:

· Too pragmatic

· ‘Real?’ world

· Too few assumptions

· ‘Innovate’

· Guided by revenues

· Cuts innovation

Page 80: Social Media Dataset

There are a lot of opportunities in the middle!!

Page 81: Social Media Dataset

Entrepreneurship: the Research Way

Page 82: Social Media Dataset

• Choose a real-world problem (consider data availability, competitors and utility)

• Develop a great technology

• Test in a lab environment and publish

• Develop a prototype and grant access to beta testers

Page 83: Social Media Dataset

• Analyze the new results

• Write a (presentation based) business plan

• Get money from FFF

• Develop your product (out of beta)

• Get some clients/users

Page 84: Social Media Dataset

• Write a full business plan

• You can get help from your University/other institutions

• Get more funding from BAs and VCs

• Hire the best coders/employees you can get

• Monetize your product/service

Page 85: Social Media Dataset

• And don’t stop researching and innovating

Page 86: Social Media Dataset

José Carlos Cortizo Pérez

BY, 2010