Upload
luz-rello
View
516
Download
0
Tags:
Embed Size (px)
DESCRIPTION
In this study we present an analysis of the lexical quality of social media in the Web, focusing on the Web 2.0, social networks, blogs and micro-blogs, multimedia and opinions. We find that blogs and social networks are the main players and also the main contributors to the bad lexical quality of the Web. We also compare our results with the rest of the Web finding that in general social media has worse lexical quality than the average Web and that their quality is one order of magnitude worse than high quality sites.
Citation preview
How Bad Do You Spell? The Lexical Quality of Social Media
Ricardo Baeza-Yates
Yahoo! Research & Web Research Group, Pompeu Fabra University, Barcelona, Spain
Web Research and NLP GroupsPompeu Fabra University, Barcelona, Spain
Luz Rello
FoSW 2011, Barcelona
Outline
— What & Why
— How
— Results
Outline
Ricardo Baeza-Yates and Luz Rello FoSW 2011, Barcelona The Lexical Quality of Social Media
word sample, criteria, measure
lexical quality
comparison of Social Media with the Web
OutlineWhat
Ricardo Baeza-Yates and Luz Rello FoSW 2011, Barcelona The Lexical Quality of Social Media
Quality
Social MediaUser-generated content: social networks, blogs, micro-blogs, multimedia & opinions.
representation:— legibility, spelling errors, grammatical errors, etc.
content:— accuracy, source reputation, objectivity, highly current, etc.
community feedback, user interactions, click counts...
Lexical Quality
OutlineWhat
Ricardo Baeza-Yates and Luz Rello FoSW 2011, Barcelona The Lexical Quality of Social Media
Lexical Quality Lexical quality mainly refers to the degree of excellence of words in a text.
A lexical representation has high quality to the ex t ent tha t i t ha s a fu l ly sp ec ified orthographic representation (a spelling) and redundant phonological representations (one from spoken language and one recoverable from orthographies-to-phonological mapping). (Perfetti and Hart, 2002)
OutlineWhy
Lexical Quality
Ricardo Baeza-Yates and Luz Rello FoSW 2011, Barcelona The Lexical Quality of Social Media
— Spam detection (Castillo, Donato, Gionis, Murdock and Silvestri 2007)
— Credibility determination (Fogg et al. 2001)
— Wikipedia vandalism detection (Potthast, Stein, and Gerling 2008)
used for
— There is a strong correlation between spelling errors and web data content quality
(Gelman and Barletta 2008)
useful metric
OutlineMethodology
(Baeza-Yates and Rello 2011)
Sample the Web using ameasure for lexical quality
Ricardo Baeza-Yates and Luz Rello FoSW 2011, Barcelona The Lexical Quality of Social Media
A set of words (errors)(Gelman and Barletta 2008)
Detailed classification of spelling errors in the Web
(Perfetti and Hart 2002)Theoretical reason
Practical reason(Ringlstetter, Schulz and Mihov 2006)
(Baeza-Yates and Rello 2011)
OutlineHow Many Kinds of Errors in the Web?
Regular spelling errors:
produced by non-impaired native English individuals, such as the transposition error, i.e. *recieve.
caused by the adjacency of letters in the keyboard, i.e. *teceive.
Regular typos:
Non-native speakers errors:
made by people who use English as a foreign language, i.e. *receibe.
Dyslexic errors:
errors commonly made made by dyslexics (i.e. unfinishedwords or letters, omitted words, inconsistent spaces between words and letters (Vellutino, 1979). *reiecve instead of receive.
due to letters of similar shape, such as *ieceive.OCR errors:
OutlineSample of Words
Ricardo Baeza-Yates and Luz Rello FoSW 2011, Barcelona The Lexical Quality of Social Media
Ten words extracted from a sample 50 words + variants with errors = 1,345 words.
e.g.: comparison errorsDyslexic error:
Spelling errors:
Typos:
OCR errors: Non-native speakers:
*comaprsion.
*comparision, *conparison and *coparison.
*vomparison, *xomparison, *cimparison, *cpmparison, *conparison, *co,parison, *comoarison, *com[arison, *comprison, *compsrison, *compaeison, *compatison, *comparuson, *comparoson, *compariaon,*comparidon, *comparisin, *comparispn, *comparisob and *comparisom.
*compaiison and *comparisom.
*comparition and *comparizon.
OutlineHow Did We Find/Generate the Errors?
Regular spelling errors:
Dyslexic errors:
Regular typos:
Non-native speakers errors:
OCR errors:
High frequency in query logs.
Generated by substituting each of the letters of the intended word with the letter situated immediately up, down, left and right from the intended letter.
Substituting typical letters which are usually mistaken.
Linguistic knowledge.
Texts written by dyslexic users and literature. (Pedler 2007)
*gobernment is a typical error made by Spanish learners of English. — Graphemes <b> and <v> are pronounced as /b/.—Phoneme /v/ does not exist in the standard Spanish phonemic system. — In Spanish is written with <b>.
Ricardo Baeza-Yates and Luz Rello FoSW 2011, Barcelona The Lexical Quality of Social Media
Outline
Ricardo Baeza-Yates and Luz Rello Estimating Dyslexia in the Web
Selection Criteria
— To avoid the overlap of different type of errors:
— To avoid the overlap of dyslexic errors and real words:
— E.g.: We consider only words written by dyslexics containing multi-errors, that is, the dyslexic word differs from the intended correct word by more than one letter. For example, the dyslexic word *konwlegde from knowledge.
— Errors which coincide with other existing words in English are omitted, i.e. *trust being the intended word truth. — Errors which give as a result a proper name are also filtered, for instance the typo *wirries from worries is also a proper name.
Ricardo Baeza-Yates and Luz Rello FoSW 2011, Barcelona The Lexical Quality of Social Media
— The errors related to word are unique and not ambiguous.
— Starting point was the selection of dyslexic errors.
Outline
album
always
around
because
enough
everything
having
problem
remember
working
*albun
*alwasy
*arround
*becuase
*enoguh
*everyhting
*haveing
*problen
*remenber
*workig
Sample of Words
Ricardo Baeza-Yates and Luz Rello FoSW 2011, Barcelona The Lexical Quality of Social Media
— Relatively long (an average length of 8.2 letters).
OutlineLexical Quality Measure
— Relative ratio of the misspells to the correct spellings averaged over our word sample:
— A lower value of LQ implies better lexical quality, being 0 perfect quality.— We estimate df by searching each word only in the English pages of a major search engine.— The relative order of the measure will hardly change as the size of the set grows.
Ricardo Baeza-Yates and Luz Rello FoSW 2011, Barcelona The Lexical Quality of Social Media
OutlineSocial Media Sites Classes
Blogger Foursquare LiveJournal Tumblr Twitter
Blogs & Micro-blogs (B)
Bebo Facebook Friendster Hi5 LinkedIn MySpace
Social Networks (S)
CiteuLike Digg Wikia Wikipedia Wikispaces
Collaboration Sites (C)
Flickr Fotolog Last.fm Picasa Youtube
Multimedia Sites (M)
EpinionsQuora Yelp Y! Answers
Opinions & Community Question-Answering Systems (O)
Ricardo Baeza-Yates and Luz Rello FoSW 2011, Barcelona The Lexical Quality of Social Media
Based on:— Number of users— Alexa ranking
OutlineThe Lexical Quality of Social Media
Range and average lexical quality in percentages(Values over the social media average are highlighted)
Ricardo Baeza-Yates and Luz Rello FoSW 2011, Barcelona The Lexical Quality of Social Media
— Relative size considering public content (# words according to a major search engine).
— 55% social networks — 23% multimedia — 20% blogs
— 62% Social Networks (S) — 19% Multimedia (M)— 47% Facebook— Almost 80% Facebook + Fotolog + Blogger + MySpace + Hi5
— Errors in the Social Media Web
— No clear order for the site classes: Collaborative sites are the best ones, followed by blogs, multimedia, social networks, and further away opinions.
— No correlation of public content and LQ.
OutlineThe Lexical Quality of Social Media
Ricardo Baeza-Yates and Luz Rello FoSW 2011, Barcelona The Lexical Quality of Social Media
— Just eight sites have lexical quality that is better than the average of the Web, but those account for less than 27% of the content.
— On average, social media classes have lexical quality worse than the Web itself.
— Compared to high quality sites, the quality of social media is one order of magnitude worse.
— The lower quality of social media impacts many more sites. For example we found that the community section of the NY Times is the main contributor to the decrease of their lexical quality. A similar effect occurs in CNN or Microsoft.
Range of percentages and average for a sample of frequent misspellings in several sets of Web sites
(Values over the Web average are highlighted)
OutlineGeographical Distribution of Web LQ
Ricardo Baeza-Yates and Luz Rello FoSW 2011, Barcelona The Lexical Quality of Social Media
— Lower LQ and higher Internet penetration in Ireland, United Kingdom, Australia, New Zealand and Canada. — Higher impact of social media in countries where Internet penetration is higher.
OutlineOn-going Work
Ricardo Baeza-Yates and Luz Rello FoSW 2011, Barcelona The Lexical Quality of Social Media
— These types of errors are quite word dependent, which explains the wide range of percentage values obtained.
— More left-right than up and down typos.
— The percentage of dyslexic errors is much lower than the corresponding number of dyslexic users (say 10%).
Range of percentages and average for the different error classes and its prevalence in the Web
Error Types in the Web
Absolute error rate for different typing errors
OutlineConclusions
• Correlation between popularity and perceived semantic quality and our defined lexical quality.
• Lexical quality of social media can be used to estimate the semantic quality of social media.
• Nevertheless, these estimations should be taken with a grain of salt, as they will change with a different sample of sites and/or words sample.
• Nevertheless, we believe that the main results will be maintained, e.g. that the lexical quality of social media is worse than the overall Web.
• Our results contribute to the difficult and still open problem of measuring the quality of content in social media and the Web in general.
Ricardo Baeza-Yates and Luz Rello FoSW 2011, Barcelona The Lexical Quality of Social Media
OutlineFuture Work
1 — Define new ways to measure LQ and compare them with these results to check for consistency.
2 — Increase the sample of social media sites studied.
3 — Increase the sample of words to measure the LQ.
4 — A linguistic model for improving accessibility to Web based textual material
Ricardo Baeza-Yates and Luz Rello FoSW 2011, Barcelona The Lexical Quality of Social Media
Outline
Thank you :-)
Any Questions?
Ricardo Baeza-Yates and Luz Rello FoSW 2011, Barcelona The Lexical Quality of Social Media