The goal of this presentation is to help researchers understand the possibilities of Social Media as a research field in areas related to NLP/IR/DM.
Social Media DataSet
UPV, Aplicaciones de la Lingüística Computacional
April 22nd, 2010
http://www.slideshare.com/jccortizo
José Carlos Cortizo
Twitter: @josek_net
Goal
‣ Understand the possibilities of Social Media for Research
Index
‣ About the Speaker
‣ Research at UEM
‣ SGP, Wipley, BrainSins
‣ Social Media statistics
‣ Social Media as research field
‣ Applications
‣ Academic research vs Enterprise
José Carlos Cortizo
2004 2005 2006 2007 2008 2009 2010
consultancy
NLP, DM
10 R&P projects, 4 workshops, 20 papers
My Company
SGP
Videogamers Social Network
2,600 registered users, ~10K monthly users
SaaS software for SM & e-C
RecSys, Social Search...
SGP Funding
• €350K from CDTI (PID project)
• €100K from FFF
• still looking for extra €100K from BA
Web 2.0 vs. Social Web
What’s Web 2.0?
An evolution (of) and (based on) the Web...
...focused on users
http://www.flickr.com/photos/cayusa/431036565
Concept popularized by Tim O’Reilly in 2004
http://www.flickr.com/photos/thomashawk/153656919/
Users own the Information
They should manage the information
It’s not a technology, it’s a philosophical concept
Wikis (and Wikipedia), blogs, etc.
Evolution of the Web
Concepts
AJAX
• Web development techniques empowering Web 2.0
• Many developers misunderstand the concept of Web 2.0 and equate it with AJAX
Social Web
• Describes how people socialize with each other through the WWW
• 2 descriptions
• Web 2.0
• Proposal for a future network similar to the WWW
Social Media
• Media designed to be disseminated through social interaction
• Internet forums, blogs, microblogging, wikis, podcasts, social networks, etc.
• More info: http://www.slideshare.net/jccortizo/taller-redes-sociales-presentation
Social Media Statistics
Facebook stats.
• > 400M active users
• 50% log in to FB on any given day
• > 35M users update their status each day
• > 60M status updates per day
• 3 billion photos uploaded each month
• 5 billion pieces of content shared each week
• Avg. user has 130 friends on FB
• Avg. user sends 8 friend req. per month
• Avg. user spends > 55 min. per day
• > 70% of FB users are outside the USA
• > 500K applications
• > 60M FB users use FB Connect on external websites
• > 100M accessing FB through mobile
• > 200 mobile operators in 60 countries deploying/promoting FB mobile products
http://www.facebook.com/press/info.php?statistics
Twitter stats.
• > 105M. registered users
• 300K users sign up every day
• > 180M. unique visitors per month
• 75% of traffic comes from 3rd-party apps
• > 600M search queries on Twitter/day
• 37% of active users use mobile
http://www.readwriteweb.com/archives/just_the_facts_statistics_from_twitter_chirp.php
Facts
SM is the New Web
• Facebook traffic tops Google (in the USA)
• FB > 7% of US traffic
• March 2010
• http://money.cnn.com/2010/03/16/technology/facebook_most_visited/
SM envisioning the Future
• Mobile Web
• Search
• Real-time search
• Social search
• Online identity
http://www.madrimasd.org/blogs/sistemas_inteligentes/2009/01/19/111413
Mobile Web
• No real mobile web until Social Media
• 25% of FB users and 37% of Twitter users access from mobile devices
• Trend: more mobile web users than “regular” ones within the next 5 years [1]
[1] J. C. Cortizo, L. I. Diaz, F. Carrero, B. Monsalve, “On the Future of Mobile Phones as the Heart of Community Built Databases” to appear in 2011
Real-Time Search
Social Search
Online Identity
• User identity is a real business
• Facebook Connect, OAuth, OpenID...
Social Media as Research Field
We need Data
• We need data to validate our research
• Why use “non-real”/small/“non-relevant”/old-fashioned datasets?
• UCI
• Reuters
• ...
SM, Huge DataSet
• Billions of users
• Billions of content items
• Textual, Multimedia (image, videos, etc.)
• Billions of connections
• Behaviors, preferences, trends...
SM Openness
• It’s easy to get data from SM
• SM based datasets
• Developer APIs
• Spidering the Web
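The "Spidering the Web" option can be illustrated with a minimal link-extraction sketch. It only shows the parsing step of a spider, using Python's standard library. The page content, the base URL and the names `LinkExtractor` and `extract_links` are assumptions made for illustration; a real spider would also download pages, respect robots.txt and rate limits, and queue the discovered links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in an HTML string, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content; a real spider would download it first.
SAMPLE_PAGE = """
<html><body>
  <a href="/user/alice">alice</a>
  <a href="http://example.org/post/1">a post</a>
  <a name="no-href">anchor only</a>
</body></html>
"""

links = extract_links(SAMPLE_PAGE, "http://example.com/")
print(links)
```

When a site offers a developer API or a published dataset, that is usually the simpler and more polite route; spidering is the fallback.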
Available DataSets
• Social Tagging (CiteULike, Bibsonomy, MovieLens, Delicious, Flickr, Last.FM...)
• http://kmi.tugraz.at/staff/markus/datasets/
• Yahoo! Firehose (750K ratings/day, 8K reviews/day, 150K comments/day, status updates, Flickr, Delicious...)
• http://developer.yahoo.net/blog/archives/2010/04/yahoo_updates_firehose.html
• MySpace data (real-time data, multimedia content, ...)
• http://blog.infochimps.org/2010/03/12/announcing-bulk-redistribution-of-myspace-data/
• Spinn3r Blog Dataset/JDPA Sentiment Corpus
• http://www.icwsm.org/data/
...and more
• http://delicious.com/pskomoroch/dataset
Key Benefits
• There’s a lot of data on SM
• It’s fun!
• You can work on a truly real-world domain
• Make (real) money with your research?
Where to publish?
• ICWSM: AAAI Conference on Weblogs and Social Media
• MSM/SMUC: Workshop on Search and Mining User-generated Contents
• WWW: 4 Social Networks sessions + 15 other S.M.-related papers
• ACM RecSys + Social Web workshop
• Any other ‘typical’ conference from your research area
• Social web/search/mining/network analysis workshops at almost any relevant conference
Other Research Uses
• Twitter Lists
• Twitter Searches
• Twitter Users
• Blogs!
• Don’t wait until the conferences to learn about advances
• Follow interesting researchers through Twitter and their blogs
• Peer-reviewing sucks!
• You can learn even more from failed attempts, or work in progress
• Open your mind...
Some Applications
Buzzer: Twitter RecSys
Buzzer
• O. Phelan (@phelo), K. McCarthy, B. Smyth, “Using Twitter to Recommend Real-Time Topical News”, ACM RecSys 2009
• Goal: News Recommendation
• Not using Reuters or similar datasets
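To make the idea of recommending news from Twitter concrete, here is a minimal sketch that ranks article titles by how often their terms appear in recent tweets. This is not the method from the Buzzer paper; it is a simple term-overlap illustration, and the stopword list, the sample tweets and articles, and the `rank_articles` function are all assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "for", "at"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def rank_articles(tweets, articles):
    """Order article titles by how often their terms appear in the tweets."""
    term_counts = Counter(t for tweet in tweets for t in tokenize(tweet))
    scored = []
    for title in articles:
        # Each distinct title term contributes its frequency in the tweets.
        score = sum(term_counts[t] for t in set(tokenize(title)))
        scored.append((score, title))
    # Highest score first; the sort is stable, so ties keep the input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored]


# Hypothetical data standing in for a Twitter stream and a news feed.
tweets = [
    "Volcano ash closes airports in Europe",
    "Stuck at the airport because of the volcano",
    "New phone released today",
]
articles = [
    "Phone maker reports earnings",
    "Volcano eruption disrupts Europe flights",
    "Local team wins the cup",
]

ranking = rank_articles(tweets, articles)
print(ranking)
```

A real system would add term weighting, stemming and a time window over the tweet stream; the sketch only shows where the Twitter signal enters the ranking.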
Why use Twitter?
• “Typical” news sites are boring
• You’ll get compared to Google News
• You’re innovating just by using Twitter
• You’ll benefit from Twitter hype
• You get a real and interesting system to deploy under real conditions
FlickrBabel: Multilingual multimedia search
Machine Translation
• An open problem
• But the current state of the art is good enough for some applications
Idea
• Do we need a Spanish Metamap?
• F. Carrero, J. C. Cortizo, J. M. Gomez, M. de Buenaga “In the Development of a Spanish Metamap”, CIKM 2008
• F. Carrero, J. C. Cortizo, J. M. Gomez, “Testing Concept Indexing in Crosslingual Medical Text Classification”, ICDIM 2008
• Results show that Google Translate is good enough to “simulate” a Spanish Metamap for classification purposes
Extending to SM
•We applied the same idea to FlickrBabel
• Search images from Flickr
• Babxel also searches YouTube
• Expands the query and improves the recall
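The recall gain from query expansion can be shown with a toy sketch. The translation table below stands in for a machine translation service, and the image index stands in for Flickr; both, along with the `expand_query` and `search` functions, are hypothetical and not FlickrBabel's actual implementation.

```python
# Hypothetical translation table standing in for a machine translation service.
TRANSLATIONS = {
    "perro": ["dog", "chien"],
    "playa": ["beach", "plage"],
}

# Toy index standing in for a photo site: image id -> set of tags.
IMAGES = {
    "img1": {"perro", "parque"},
    "img2": {"dog", "beach"},
    "img3": {"chien"},
    "img4": {"cat"},
}


def expand_query(term):
    """Return the original term followed by its known translations."""
    return [term] + TRANSLATIONS.get(term, [])


def search(terms):
    """Return the ids of images tagged with at least one of the terms."""
    wanted = set(terms)
    return sorted(img for img, tags in IMAGES.items() if tags & wanted)


plain = search(["perro"])
expanded = search(expand_query("perro"))
print(plain)
print(expanded)
```

The plain query only matches images tagged in the query language, while the expanded query also reaches images tagged in the other languages, which is the recall improvement the slide refers to.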
FlickrBabel
• We got a lot of buzz/users from
• Mashable
• Loogic
• WwwhatsNew
• And thousands more blogs/sites
evri: entity based search engine (+ API)
evri
• Not a typical search engine
• Real-time
• Semantic
• Entity recognition
• Opinion Mining
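As a rough illustration of how entity recognition and opinion mining can be combined, the sketch below spots known entity names and sums word polarity per sentence. It does not use or imitate the evri API; the entity list, the polarity lexicons and the `entity_opinions` function are assumptions, and real systems rely on far richer models.

```python
import re

# Hypothetical gazetteer and polarity lexicons, for illustration only.
ENTITIES = {"twitter", "facebook", "flickr"}
POSITIVE = {"great", "love", "useful"}
NEGATIVE = {"boring", "slow", "hate"}


def entity_opinions(text):
    """Return {entity: score}, summing word polarity over sentences that mention it."""
    scores = {}
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z]+", sentence)
        # Sentence polarity: positive word count minus negative word count.
        polarity = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        for w in words:
            if w in ENTITIES:
                scores[w] = scores.get(w, 0) + polarity
    return scores


result = entity_opinions(
    "I love Twitter. Facebook is slow and boring! Flickr has photos."
)
print(result)
```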
Offers a great API
What’s good
• They integrate a lot of technologies from the state-of-the-art on NLP/IR into something usable
• The API can be used to develop evri based products and applications
• If you have a good technology, build a good product/service around it
Entity recognition
Recommendations
Sentiment Analysis
Not good enough?
• There are no “not good enough” technologies
• There are useful or not useful products/services
• Show your technology to the world; they’ll be the best ‘reviewers’
Academic vs Enterprise
Academic
· Too idealist
· ‘Fantasy?’ world
· Too many assumptions
· Research
· Guided by public funds
· Non-applicable
Enterprise
· Too pragmatic
· ‘Real?’ world
· Too few assumptions
· ‘Innovate’
· Guided by revenues
· Cuts innovation
There are a lot of opportunities in the middle!!
Entrepreneurship: the Research Way
• Choose a real world problem (take care of data availability, competitors and utility)
• Develop a great technology
• Test in a lab environment and publish
• Develop a prototype and grant access to beta testers
• Analyze the new results
• Write a (presentation based) business plan
• Get money from FFF
• Develop your product (out of beta)
• Get some clients/users
• Write a full business plan
• You can get help from your University/other institutions
• Get more funding from BAs and VCs
• Hire the best coders/employees you can get
• Monetize your product/service
• And don’t stop researching and innovating
José Carlos Cortizo Pérez
BY, 2010