Social Media Dataset

The goal of this presentation is to help researchers understand the possibilities of Social Media as a research field for areas related to NLP/IR/DM.

Social Media DataSet

UPV, Aplicaciones de la Lingüística Computacional

April 22nd, 2010

http://www.slideshare.com/jccortizo

José Carlos Cortizo

Twitter: @josek_net

Goal

‣ Understand the possibilities of Social Media for Research

Index

‣ About the Speaker

‣ Research at UEM

‣ SGP, Wipley, BrainSins

‣ Social Media statistics

‣ Social Media as research field

‣ Applications

‣ Academic research vs Enterprise

José Carlos Cortizo

2004 2005 2006 2007 2008 2009 2010

consultancy

NLP, DM

10 R&P projects, 4 workshops, 20 papers

My Company

SGP

Videogamers Social Network

2,600 reg. users, ~10K monthly users

SaaS software for SM & e-C

RecSys, Social Search...

SGP Funding

• €350K from CDTI (PID project)

• €100K from FFF

• still looking for extra €100K from BA

Web 2.0 vs. Social Web

What’s Web 2.0?

An evolution (of) and (based on) the Web...

...focused on users

http://www.flickr.com/photos/cayusa/431036565

Concept introduced by “Tim O’Reilly” in 2004

http://www.flickr.com/photos/thomashawk/153656919/

Users own the Information

They should manage the information

It’s not a technology, it’s a philosophical concept

Wikis (and Wikipedia), blogs, etc.

Evolution of the Web

Concepts

AJAX

• Web development techniques empowering Web 2.0

• Many developers misunderstand the concept of Web 2.0 and equate it with AJAX

Social Web

• Describes how people socialize with each other throughout the WWW

• 2 descriptions

• Web 2.0

• Proposal for a future network similar to WWW

Social Media

• Media designed to be disseminated through social interaction

• Internet forums, blogs, microblogging, wikis, podcasts, social networks, etc.

• More info: http://www.slideshare.net/jccortizo/taller-redes-sociales-presentation

Social Media Statistics

Facebook

Facebook stats.

• > 400M. active users

• 50% log in to FB on any given day

• > 35M users update their status each day

• > 60M. status updates per day

• 3 billion photos uploaded each month

• 5 billion pieces of content shared each week

Facebook stats.

• Avg. user has 130 friends on FB

• Avg. user sends 8 friend req. per month

• Avg. user spends > 55 min. per day

• > 70% FB users are outside USA

• > 500K applications

Facebook stats.

• > 60M FB users use FB Connect on external websites

• > 100M accessing FB through mobile

• > 200 mobile operators in 60 countries deploying/promoting FB mobile products

http://www.facebook.com/press/info.php?statistics

Twitter

Twitter stats.

• > 105M. registered users

• 300K users sign up every day

• > 180M. unique visitors per month

• 75% of traffic comes from 3rd-party apps

• > 600M search queries on Twitter/day

• 37% of active users use mobile

http://www.readwriteweb.com/archives/just_the_facts_statistics_from_twitter_chirp.php

Facts

SM is the New Web

• Facebook traffic tops Google (for USA)

• FB > 7% of US traffic

• March 2010

• http://money.cnn.com/2010/03/16/technology/facebook_most_visited/

SM envisioning the Future

• Mobile Web

• Search

• Real-time search

• Social search

• Online identity

http://www.madrimasd.org/blogs/sistemas_inteligentes/2009/01/19/111413

Mobile Web

• No real mobile web until Social Media

• 25% of FB users and 37% of Twitter users access from mobile devices

• Trend: more mobile web users than “regular” ones within the next 5 years [1]

[1] J. C. Cortizo, L. I. Diaz, F. Carrero, B. Monsalve, “On the Future of Mobile Phones as the Heart of Community Built Databases” to appear in 2011

Real-Time Search

Social Search

Online Identity

• User identity is a real business

• Facebook Connect, OAuth, OpenID...

Social Media as Research Field

We need Data

• We need data to validate our research

• Why use “non-real”/small/“non-relevant”/old-fashioned datasets?

• UCI

• Reuters

• ...

SM, Huge DataSet

• Billions of users

• Billions of content items

• Textual, Multimedia (image, videos, etc.)

• Billions of connections

• Behaviors, preferences, trends...

SM Openness

• It’s easy to get data from SM

• SM based datasets

• Developer APIs

• Spidering the Web
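The "spidering" option above can be sketched as a small breadth-first crawler. This is only an illustration under stated assumptions: the `crawl` function, the `fetch` callback and the in-memory `FAKE_WEB` pages are hypothetical names made up for this example, and no real site or API is contacted.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from `seed`; `fetch(url)` returns HTML or None."""
    seen = {seed}
    queue = deque([seed])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page unavailable, skip it
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return pages


# Hypothetical in-memory "site" standing in for real HTTP responses.
FAKE_WEB = {
    "http://example.org/": '<a href="/alice">alice</a> <a href="/bob">bob</a>',
    "http://example.org/alice": '<a href="/bob">bob</a>',
    "http://example.org/bob": '<a href="/">home</a>',
}

pages = crawl("http://example.org/", FAKE_WEB.get)
print(sorted(pages))
```

In a real crawl, `fetch` would wrap an HTTP client and should respect each site's terms of use, robots.txt and rate limits; where a developer API exists, it is usually the better source.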

Available DataSets

• Social Tagging (CiteULike, Bibsonomy, MovieLens, Delicious, Flickr, Last.FM...)

• http://kmi.tugraz.at/staff/markus/datasets/

• Yahoo! Firehose (750K ratings/day, 8K reviews/day, 150K comments/day, status updates, Flickr, Delicious...)

• http://developer.yahoo.net/blog/archives/2010/04/yahoo_updates_firehose.html
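As a rough idea of what working with a social tagging dataset looks like, the sketch below loads (user, item, tag) triples and counts tags per item. The tab-separated layout and the sample rows are assumptions made for illustration; each dataset listed above has its own format, which should be checked in its documentation.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical sample in an assumed "user<TAB>item<TAB>tag" layout.
SAMPLE = (
    "u1\tphoto9\tsunset\n"
    "u1\tphoto9\tbeach\n"
    "u2\tphoto9\tSunset\n"
    "u2\tsong3\trock\n"
)


def load_taggings(stream):
    """Yield (user, item, tag) triples from a tab-separated stream."""
    for row in csv.reader(stream, delimiter="\t"):
        if len(row) == 3:  # skip malformed rows
            yield tuple(row)


def tags_by_item(triples):
    """Map each item to a Counter of the (lowercased) tags assigned to it."""
    index = defaultdict(Counter)
    for _user, item, tag in triples:
        index[item][tag.lower()] += 1
    return index


index = tags_by_item(load_taggings(io.StringIO(SAMPLE)))
print(index["photo9"].most_common(1))
```

Replacing `io.StringIO(SAMPLE)` with an open file handle gives the same per-item tag counts for a downloaded dump, assuming it follows this layout.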

Key Benefits

• There’s a lot of data on SM

• It’s fun!

• You can work on a real-world domain

• Make (real) money with your research?

Where to publish?

• ICWSM: AAAI Conference on Weblogs and Social Media

• MSM/SMUC: Workshop on Search and Mining User generated Contents

• WWW: 4 Social Networks sessions + other 15 S.M. related papers

• ACM RecSys + Social Web workshop

Where to publish?

• Any other ‘typical’ conference from your research area

• Social web/search/mining/networks analysis workshops on almost any relevant conference

Other Research Uses

Twitter Lists

Other Research Uses

Twitter Searches

Other Research Uses

Twitter Users

Other Research Uses

Blogs!

Other Research Uses

• Don’t wait until the conferences to learn about advances

• Follow interesting researchers through Twitter and their blogs

• Peer-reviewing sucks!

• You can learn even more from failed attempts, or work in progress

• Open your mind...

Some Applications

Buzzer: Twitter RecSys

Buzzer

• O. Phelan (@phelo), K. McCarthy, B. Smith, “Using Twitter to Recommend Real-Time Topical News”, ACM RecSys 2009

• Goal: News Recommendation

• Not using Reuters or similar datasets
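To make the idea of recommending news from Twitter more concrete, here is a minimal sketch that ranks articles by how often their terms appear in recent tweets. It is not the Buzzer implementation; the scoring rule, the tiny stopword list and the sample tweets and articles are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "of", "in", "to", "and", "is", "for", "on"}


def terms(text):
    """Lowercase word tokens with stopwords removed."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def rank_articles(tweets, articles):
    """Order article titles by how often their terms occur in the tweets."""
    tweet_counts = Counter(t for tweet in tweets for t in terms(tweet))
    scored = []
    for title, body in articles.items():
        article_terms = set(terms(title + " " + body))
        score = sum(tweet_counts[t] for t in article_terms)
        scored.append((score, title))
    # Highest score first; ties broken alphabetically by title.
    return [title for score, title in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Hypothetical sample data.
tweets = [
    "Volcano ash grounds flights in Europe",
    "flights cancelled again, volcano still erupting",
]
articles = {
    "Volcano disrupts air travel": "Ash from the volcano has grounded flights.",
    "Local team wins cup": "The final ended two to one.",
}

ranking = rank_articles(tweets, articles)
print(ranking)
```

The point of the sketch is the data source: the ranking signal comes from what people are tweeting right now rather than from a static news corpus.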

Why use Twitter?

• “Typical” news sites are boring

• You’ll get compared to Google News

• You’re innovating just by using Twitter

• You’ll benefit from Twitter hype

• You get a real and interesting system to deploy under real conditions

FlickrBabel: Multilingual multimedia search

Machine Translation

• An open problem

• But the current state of the art is good enough for some applications

Idea

• Do we need a Spanish Metamap?

• F. Carrero, J. C. Cortizo, J. M. Gomez, M. de Buenaga “In the Development of a Spanish Metamap”, CIKM 2008

• F. Carrero, J. C. Cortizo, J. M. Gomez, “Testing Concept Indexing in Crosslingual Medical Text Classification”, ICDIM 2008

Idea

• Results show that using Google Translate is just enough for “simulating” a Spanish Metamap

• for classification purposes

Extending to SM

• We applied the same idea to FlickrBabel

• Search images from Flickr

• Babxel also searches YouTube

• Expands the query and improves the recall
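The recall effect of multilingual query expansion can be illustrated with a toy example. This is a sketch under stated assumptions, not FlickrBabel's code: the `TRANSLATIONS` table stands in for a machine-translation service, and the `IMAGES` index stands in for tagged photos that would really come from a photo-sharing API.

```python
# Toy translation table standing in for a machine-translation service.
TRANSLATIONS = {
    "playa": {"en": "beach", "fr": "plage"},
    "perro": {"en": "dog", "fr": "chien"},
}

# Hypothetical tagged images.
IMAGES = {
    "img1": {"playa", "verano"},
    "img2": {"beach", "sunset"},
    "img3": {"plage", "mer"},
    "img4": {"dog"},
}


def expand_query(term, languages=("en", "fr")):
    """Return the original term plus its translations into `languages`."""
    expanded = {term}
    for lang in languages:
        translated = TRANSLATIONS.get(term, {}).get(lang)
        if translated:
            expanded.add(translated)
    return expanded


def search(query_terms):
    """Return ids of images sharing at least one tag with the query."""
    return sorted(i for i, tags in IMAGES.items() if tags & set(query_terms))


monolingual = search({"playa"})
multilingual = search(expand_query("playa"))
print(monolingual, multilingual)
```

With the Spanish term alone only one image matches; the expanded query also reaches images tagged in English and French, which is the recall gain the slide refers to.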

FlickrBabel

• We got a lot of buzz/users from

• Mashable

• Loogic

• WwwhatsNew

• And thousands more blogs/sites

evri: entity based search engine (+ API)

evri

• Not a typical search engine

• Real-time

• Semantic

• Entity recognition

• Opinion Mining

Offers a great API

What’s good

• They integrate a lot of technologies from the state-of-the-art on NLP/IR into something usable

• The API can be used to develop evri based products and applications

• If you have a good technology, build a good product/service around it

Entity recognition

Recommendations

Sentiment Analysis

Not good enough?

• There are no “not good enough” technologies

• There are useful or not useful products/services

• Show your technology to the world, they’ll be the best ‘reviewers’

Academic vs Enterprise

Academic

• Too idealist

• ‘Fantasy?’ world

• Too many assumptions

• Research

• Guided by public funds

• Non-applicable

Enterprise

• Too pragmatic

• ‘Real?’ world

• Too few assumptions

• ‘Innovate’

• Guided by revenues

• Cuts innovation

There are a lot of opportunities in the middle!!

Entrepreneurship: the Research Way

• Choose a real world problem (take care of data availability, competitors and utility)

• Develop a great technology

• Test in a lab environment and publish

• Develop a prototype and grant access to beta testers

• Analyze the new results

• Write a (presentation based) business plan

• Get money from FFF

• Develop your product (out of beta)

• Get some clients/users

• Write a full business plan

• You can get help from your University/other institutions

• Get more funding from BAs and VCs

• Hire the best coders/employees you can get

• Monetize your product/service

• And don’t stop researching and innovating

José Carlos Cortizo Pérez

BY, 2010
