FeedWiz: Using Automated Document Clustering to “Map the Blogosphere”

David Schuff ([email protected]), Temple University

Ozgur Turetken ([email protected]), Ryerson University

The role of weblogs

Increasingly important mode of discourse

Is this really the “new media”?

The consequences

Proliferation of information

Easy self-publishing

Proliferation of content

Leads to a “silo effect”

Limited information diet of only a few blogs

Will tend to seek out confirmatory points of view

Our area of interest is news and political blogs. Not a blog about Paris Hilton (yes, there is one).

The consequences

(Strict) filtering is seen as a threat to public discourse and democracy (Sunstein 2004)

At least, the true potential of the blogosphere is not being realized

The power law distribution

A relationship in which one variable varies as a power of the other (y ∝ x^(-k))

Used to explain website popularity

On the right: number of inbound links by weblog (2002)

http://www.shirky.com/writings/powerlaw_weblog.html

The top 3% of the political blog sites accounted for 20% of the inbound links
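
As a rough illustration of how a power law concentrates inbound links, the sketch below assumes a hypothetical Zipf-like distribution in which the blog at rank r receives links in proportion to r^(-alpha). The function name, the exponent values, and the number of blogs are illustrative assumptions, not figures taken from the Shirky data.

```python
def top_share(n_blogs: int, alpha: float, top_fraction: float) -> float:
    """Share of all inbound links held by the top `top_fraction` of blogs,
    assuming links at rank r are proportional to r ** -alpha."""
    weights = [rank ** -alpha for rank in range(1, n_blogs + 1)]
    top_n = max(1, round(n_blogs * top_fraction))
    return sum(weights[:top_n]) / sum(weights)


if __name__ == "__main__":
    # Assumed exponents; a larger alpha means a steeper drop-off by rank.
    for alpha in (0.5, 1.0):
        share = top_share(n_blogs=1000, alpha=alpha, top_fraction=0.03)
        print(f"alpha={alpha}: top 3% of blogs hold {share:.0%} of links")
```

With alpha set to 0 every blog gets the same weight and the top 3% hold exactly 3% of the links; any positive exponent pushes that share higher.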

The decision support and information systems context

A key challenge is to create tools that help “filter, sort, and navigate” the blogosphere (Cayzer 2004)

Blogging is essentially a form of CMC (Tan et al. 2005)

Can facilitate “common understanding”

The formation of an opinion is essentially a decision-making issue

Research question

How can information presentation techniques be used to improve information consumption on the blogosphere?

Our proposition: This can be done by presenting information organized by content, not by author (or site)

What we’re drawing from

Chunking and semantic networks (Miller 1956, Mandler 1967, Quillian 1968, Collins and Quillian 1969)

Clustering of text-based documents (Chen et al. 1996, Chen et al. 1998, Pirolli et al. 1996, Spangler et al. 2003, Roussinov and Chen 2001, Turetken and Sharda 2004)

Information visualization

“Preattentive” extraction of information (Bray 1996)

Size and color (Shneiderman 1994)

FeedWiz (demo)

Live demo…

How it works…

Select/create a list of weblogs

Navigate clusters of blog entries

Browse the individual clusters

Study 1 design

Quasi-experiment (semi-controlled)

Two groups of subjects

Both given a list of weblogs

Group A: Given an ordered list of URLs

Group B: Given FeedWiz

Measuring effectiveness

Study how attitudes change (O X O design: observe, treat, observe)

First observation (O): Ask for subjects’ opinion on an issue (e.g., hybrid cars)

Treatment (X): Give subjects an hour to read the list of blogs

Second observation (O): Ask subjects again for their opinion on that issue

Measuring…

Opinion (agree/disagree and supporting rationale)

Level of conviction

Sources (blogs) used to form the opinion

Hypotheses

H1: In forming their opinions, FeedWiz users will use more sources than those who use an ordered list

H2: FeedWiz users will be more likely to change their opinions than those who use an ordered list

H3: FeedWiz users will be less likely to form strong opinions than those who use an ordered list
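
One way the three hypotheses could map onto per-group summary measures is sketched below. The record layout, the field names, the group labels, and the conviction scale are all hypothetical; the slides do not specify how responses are coded, and no study data is implied.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Response:
    """One subject's pre/post answers (hypothetical record layout)."""
    group: str             # "list" (ordered list of URLs) or "feedwiz"
    agree_before: bool     # opinion at the first observation
    agree_after: bool      # opinion at the second observation
    conviction_after: int  # assumed scale, e.g. 1 (weak) to 7 (strong)
    sources_used: int      # number of blogs used to form the opinion


def summarize(responses: list[Response], group: str) -> dict[str, float]:
    """Per-group summaries for H1 (sources), H2 (opinion change), H3 (conviction).

    Assumes the group has at least one response.
    """
    rows = [r for r in responses if r.group == group]
    return {
        "mean_sources": mean(r.sources_used for r in rows),
        "share_changed": mean(int(r.agree_before != r.agree_after) for r in rows),
        "mean_conviction": mean(r.conviction_after for r in rows),
    }
```

Comparing `summarize(responses, "feedwiz")` with `summarize(responses, "list")` gives the raw group differences; whether those differences are statistically meaningful would need a separate test.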

Study 2 design

Intensive data collection with small sample

Tracking of eye-movements

Recording verbal comments

Protocol analysis

For further insights on usability of tool

Expected contributions

Investigate how opinions are formed from blogs

Understand how information presentation techniques can influence information consumption

Implications for public discourse on the web

Creation of a highly usable tool which demonstrates those techniques

References

Bray, T. (1996). Measuring the Web. In Proceedings of the Fifth International World Wide Web Conference, Paris, France.

Cayzer, S. (2004). Semantic blogging and decentralized knowledge management. Communications of the ACM, 47(12), 47-52.

Chen, H., Nunamaker, J., Orwig, R.E., & Titkova, O. (1998). Information visualization for collaborative computing. IEEE Computer, 31(8), 75-82.

Chen, H., Schuffels, C., & Orwig, R.E. (1996). Internet categorization and search: A self-organizing approach. Journal of Visual Communication and Image Representation, 7(1), 88-102.

Collins, A.M. & Quillian, M.R. (1969). Retrieval time from semantic memory. Journal of Verbal Learning and Verbal Behavior, 8, 240-247.

Mandler, G. (1967). Organization and memory. In K. W. Spence, & J. T. Spence (Eds.), The Psychology of Learning and Motivation (pp. 327-372). New York, NY: Academic Press.

Miller, G.A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(2), 81-97.

Pirolli, P., Schank, P., Hearst, M., & Diehl, C. (1996). Scatter/gather browsing communicates the topic structure of a very large text collection. In Proceedings of the Conference on Human Factors in Computing Systems, New York, NY: ACM Press, 213-220.

Quillian, M.R. (1968). Semantic memory. In M. Minsky (Ed.), Semantic Information Processing (pp. 227-270), Cambridge, MA: The MIT Press.

References (continued)

Roussinov, D.G. & Chen, H. (2001). Information navigation on the web by clustering and summarizing query results. Information Processing and Management, 37(6), 789-817.

Shirky, C. (2003). Power laws, weblogs, and inequality. Accessed September 26, 2006 from http://www.shirky.com/writings/powerlaw_weblog.html.

Shneiderman, B. (1994). Dynamic queries for visual information seeking. IEEE Software, 11(6), 70.

Spangler, S., Kreulen, J.T., & Lessler, J. (2003). Generating and browsing multiple taxonomies over a document collection. Journal of Management Information Systems, 19(4), 191-212.

Sunstein, C.R. (2004). Democracy and filtering. Communications of the ACM, 47(12), 57-59.

Tan, C., Goswami, S., Chan, Y., & Zhong, Y. (2005). Conceptual evaluation of weblog as a computer-mediated communication application. In Proceedings from the 11th Americas Conference on Information Systems, Omaha, NE, 2361-2367.

Turetken, O. & Sharda, R. (2004). Development of a fisheye-based information search processing aid (FISPA) for managing information overload in the web environment. Decision Support Systems, 37(3), 415-434.

Appendix: How FeedWiz Works

FeedWiz Application Architecture

[Diagram] FeedWiz Application Server: Hierarchical Clustering Module (Intelligent Miner for Text), Feed Aggregation Module, .NET Web Service (C#)

Weblog sites (RSS feeds)

FeedWiz Client: Flash application

Exchanged between client and server: list of blog URLs; hierarchy (XML) and individual posts
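
To make the feed aggregation step more concrete, here is a minimal sketch of pulling individual posts out of an RSS 2.0 feed. It is written in Python for illustration rather than the C# the slides name for the web service, and the function name, dictionary keys, and sample feed are assumptions, not FeedWiz code.

```python
import xml.etree.ElementTree as ET


def parse_rss_items(rss_xml: str) -> list[dict[str, str]]:
    """Extract title, link and description from each <item> of an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "text": item.findtext("description", default=""),
        })
    return items


# Hypothetical feed content, used only to exercise the parser.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example blog</title>
  <item><title>First post</title><link>http://example.com/1</link>
        <description>Hybrid cars and fuel prices.</description></item>
  <item><title>Second post</title><link>http://example.com/2</link>
        <description>Election coverage roundup.</description></item>
</channel></rss>"""

if __name__ == "__main__":
    for post in parse_rss_items(SAMPLE_FEED):
        print(post["title"], "->", post["link"])
```

A real aggregator would also need to fetch each feed URL, handle Atom feeds and namespaces, and cope with malformed XML; none of that is shown here.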

Appendix: How the documents are clustered

Blog posts are saved as text files on the FeedWiz server

Those files are grouped into clusters based on similarity

An output file is generated that describes the hierarchy
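
The three steps above can be sketched in a few dozen lines. This is not how Intelligent Miner for Text works internally; it swaps in a simple bag-of-words cosine similarity with agglomerative merging of summed term counts, and the `<cluster>`/`<post>` element names are a hypothetical stand-in for the actual hierarchy file format. The sample file names and texts are made up, and the sketch assumes at least one post.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter


def vectorize(text: str) -> Counter:
    """Bag-of-words term counts for one post."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(posts: dict[str, str]):
    """Agglomerative clustering: repeatedly merge the two most similar clusters.

    Each cluster is (tree, vector); a tree is a post name or a pair of trees.
    Returns the tree covering the whole collection.
    """
    clusters = [(name, vectorize(text)) for name, text in posts.items()]
    while len(clusters) > 1:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = max(pairs, key=lambda p: cosine(clusters[p[0]][1], clusters[p[1]][1]))
        # Summing term counts keeps the merged cluster's centroid direction.
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


def to_xml(tree) -> ET.Element:
    """Describe the hierarchy as nested <cluster>/<post> elements."""
    if isinstance(tree, str):
        return ET.Element("post", name=tree)
    node = ET.Element("cluster")
    for child in tree:
        node.append(to_xml(child))
    return node


if __name__ == "__main__":
    demo = {
        "a.txt": "hybrid cars save fuel",
        "b.txt": "hybrid cars and fuel prices",
        "c.txt": "election polls and candidates",
    }
    print(ET.tostring(to_xml(cluster(demo)), encoding="unicode"))
```

The pairwise search is quadratic in the number of clusters at every merge, which is fine for a sketch but would need a smarter strategy for a large collection of posts.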

[Diagram] Hierarchical Clustering Module. Labels: Original collection; 1st, 2nd, 3rd, …, nth iteration; Documents
