Does Size Matter? When Small is Good Enough

A.L. Gentile, A.E. Cano, A.-S. Dadzie, V. Lanfranchi and N. Ireson

The Oak Group, Department of Computer Science, The University of Sheffield




Outline

• Introduction

• Email Corpus

• Dynamic Topic Classification Of Short Texts

• Experiments

• Conclusions


Introduction - Settings

• Main Goal: observation of the influence of the size of documents on the accuracy of a defined text processing task

• Hypothesis: Results obtained using longer texts may be approximated by short texts, of micropost size, i.e., maximum length 140 characters

• Dataset: artificially generated comparable corpora, consisting of emails truncated at micropost size (140 characters) and at successive multiples thereof

• Methodology: corpus-driven topic extraction / document topic classification


Introduction - Micropost Services

• Mainly used for social information exchange, but also used in more formal (working) environments [Herbsleb et al., 2002; Isaacs et al., 2002]

• Sometimes perceived negatively in the workplace, as they may be seen to reduce productivity [TNS US Group, 2009], and/or pose threats to security and privacy

• Where restrictions on use are in place, alternatives that offer the same benefits are sought

- Same communication patterns in alternative media: email as a short message service for communication via, e.g., mailing lists


Introduction - Research Questions

• Statistical analysis of emails exchanged via the Oak mailing list (over a period of six months), to determine whether email is indeed used as a short messaging service.

• Content analysis of emails as microposts, to evaluate to what degree the knowledge content of truncated or abbreviated messages can be compared to the complete message.


Email Corpus - Oak Mailing List

Six-month period (July - January 2011), 659 emails


Topic Classification

• Goal: evaluate to what degree the knowledge content of a shorter message can be compared to that of a full message.

• Chosen task: text classification on non-predefined topics.

• Test bed: generated by preprocessing the Oak email corpus to obtain several fixed-size corpora.

• Method:

- Corpus-driven topic extraction: a number of topics are automatically extracted from a document collection; each topic is represented as a weighted vector of terms;

- Document topic classification: each document is labelled with the topic it is most similar to, and classified into the corresponding cluster.


Topic Classification - Topic Extraction: Proximity-based Clustering

• Document corpus D = {d1, ..., dk}

- each document di = {t1, ..., tv} is a vector of weighted terms

• Term clusters C = {c1, ..., ck} (clustering performed by using as feature space the inverted index of D)

- each cluster ck = {t1, ..., tn} is a vector of weighted terms

- each cluster ideally represents a topic in the document collection
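
The feature space mentioned above, the inverted index of D, can be sketched as follows. This is a minimal illustration only: the whitespace tokenisation, the use of raw term frequency as the weight, and the name `build_inverted_index` are assumptions, not details given in the slides, and the clustering step itself is not shown.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: weight}.

    Assumption: the weight is the raw term frequency; the slides do not
    say which weighting scheme was used.
    """
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        # Assumption: lower-casing and whitespace splitting as tokenisation.
        for term, freq in Counter(text.lower().split()).items():
            index[term][doc_id] = freq
    return dict(index)


# Toy documents, invented for illustration.
docs = ["meeting room booked", "room booked for demo", "demo code released"]
index = build_inverted_index(docs)
```

Each entry of `index` is one term's feature vector over the documents, so terms that occur in the same documents end up with similar vectors and are candidates for the same cluster.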


Topic Classification - Email Topic Classification

• sim(d, Ci): similarity between a document and a cluster (cosine similarity)

• labelDoc(D): each document d is mapped to the topic Ci which maximises the similarity sim(d, Ci)
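
A minimal sketch of the two steps above, assuming documents and clusters are stored as sparse dictionaries mapping terms to weights. The function names `cosine` and `label_doc`, the topic names, and the toy weights are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def label_doc(doc, clusters):
    """Return the name of the cluster that maximises sim(d, Ci)."""
    return max(clusters, key=lambda name: cosine(doc, clusters[name]))


# Hypothetical topics and weights, invented for illustration.
clusters = {
    "events": {"meeting": 1.0, "room": 0.5},
    "software": {"code": 1.0, "release": 0.8},
}
doc = {"meeting": 2.0, "room": 1.0}
label = label_doc(doc, clusters)
```

Because cosine similarity ignores vector length, a truncated email can still match the same topic as its full version as long as the retained terms point in a similar direction.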


Experiments - Dataset Preparation: Comparable Corpora

Corpus Name    Maximum text length of each document

Corpus 140     truncated at 140 characters if longer; full text otherwise
Corpus 280     truncated at 280 characters if longer; full text otherwise
Corpus 420     truncated at 420 characters if longer; full text otherwise
Corpus 560     truncated at 560 characters if longer; full text otherwise
Corpus 700     truncated at 700 characters if longer; full text otherwise
Corpus 840     truncated at 840 characters if longer; full text otherwise
Corpus 980     truncated at 980 characters if longer; full text otherwise
Main Corpus    full email body
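
The corpus construction in the table can be sketched as below. This assumes each email body is available as a plain string and that truncation is a simple character cut; the function name and parameters are hypothetical.

```python
def build_comparable_corpora(emails, unit=140, multiples=7):
    """Build one truncated copy of the corpus per multiple of `unit`.

    Emails shorter than the limit are kept in full, since slicing a short
    string returns it unchanged.
    """
    corpora = {
        f"Corpus {unit * k}": [body[: unit * k] for body in emails]
        for k in range(1, multiples + 1)
    }
    corpora["Main Corpus"] = list(emails)  # full email bodies
    return corpora


# Two synthetic email bodies of 100 and 300 characters.
emails = ["x" * 100, "y" * 300]
corpora = build_comparable_corpora(emails)
```

With the default arguments this yields the seven truncated corpora (140 to 980 characters) plus the main corpus.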


Experiments - Experimental Approach

• Corpus-driven topic extraction: performed on mainCorpus.

• Email classification: the labelDoc procedure is applied to each of the comparable corpora, including mainCorpus.


Experiments - Corpus Topics


Experiments - Results

Corpus        Precision   Recall   F-Measure

Corpus 140    0.80        0.63     0.69
Corpus 280    0.90        0.86     0.87
Corpus 420    0.93        0.92     0.92
Corpus 560    0.98        0.96     0.97
Corpus 700    0.95        0.94     0.94
Corpus 840    0.98        0.98     0.98
Corpus 980    0.99        0.99     0.99
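
For reference, scores of this kind can be computed as in the sketch below. Two assumptions are made that the slides do not state: the labels assigned on the main corpus serve as the reference, and the per-topic scores are macro-averaged. The function name and the toy labels are hypothetical.

```python
def macro_prf(reference, predicted):
    """Macro-averaged precision, recall and F-measure over all topics."""
    topics = sorted(set(reference) | set(predicted))
    pairs = list(zip(reference, predicted))
    precisions, recalls, f_scores = [], [], []
    for topic in topics:
        tp = sum(1 for r, p in pairs if r == topic and p == topic)
        fp = sum(1 for r, p in pairs if r != topic and p == topic)
        fn = sum(1 for r, p in pairs if r == topic and p != topic)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f_scores.append(f)
    n = len(topics)
    return sum(precisions) / n, sum(recalls) / n, sum(f_scores) / n


# Assumed setup: labels from the full emails vs. labels from a truncated corpus.
reference = ["a", "a", "b", "b"]
predicted = ["a", "b", "b", "b"]
p, r, f = macro_prf(reference, predicted)
```

Under macro-averaging the reported F-measure is the mean of the per-topic F-measures, so it need not equal the harmonic mean of the averaged precision and recall.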


Conclusions

• A fair portion of the emails exchanged, for the corpus generated from a mailing list, are very short, with approximately 40% falling within the single micropost size, and 65% within two microposts.

• For the text classification task described, the accuracy of classification for micropost-size texts is an acceptable approximation of classification performed on longer texts, with a decrease of only 5% for up to the second micropost block within a long e-mail.


Conclusions - Future Work

• Enriching the micro-emails with semantic information (e.g., concepts extracted from domain and standard ontologies) would improve the results obtained using unannotated text.

• Investigate the influence of other similarity measures.

• Application to expert finding tasks, exploiting dynamic topic extraction as a means to determine authors’ and recipients’ areas of expertise.

• Formal evaluation of topic validity will be required, including the human (expert) annotator in the loop.


References

[1] Herbsleb, J. D., Atkins, D. L., Boyer, D. G., Handel, M., and Finholt, T. A. (2002). Introducing instant messaging and chat in the workplace. In Proc., SIGCHI conference on Human factors in computing systems: Changing our world, changing ourselves, pages 171–178.

[2] Isaacs, E., Walendowski, A., Whittaker, S., Schiano, D. J., and Kamm, C. (2002). The character, functions, and styles of instant messaging in the workplace. In Proc., ACM conference on Computer supported cooperative work, pages 11–20.

[3] TNS US Group (2009). Social media exploding: More than 40% use online social networks. http://www.tns-us.com/news/social_media_exploding_more_than.php.