
Discovering and Visualizing the Social Structure of …


1 Introduction

A question professors often hear from undergraduate students is “Where should I go to graduate school?” As academic disciplines become more interdisciplinary, as they grow and change rapidly, as boundaries become more permeable, some students are not even sure which field they wish to study. This simple student question is part of a larger problem of discovering the social structure of an academic discipline. Many disciplines are poorly understood by newcomers to the field, and imperfectly understood even by experts in the field. Understanding a discipline is difficult because there is such an abundance of content in each discipline, there are few clear boundaries between disciplines, and many disciplines are in a continuous state of change as new research emerges.

Computational systems may be more effective than people at processing an abundance of content, at providing tools for determining boundaries, and at modeling phenomena in a rapid state of flux. The proposed research seeks to develop and deploy novel text mining techniques to discover the social structure of academic disciplines, and to present this social content to both experts and non-experts using novel visualization techniques. The objective is to present content that enables people to develop their own understandings of academic disciplines more effectively. This research will develop techniques for automatically determining the social structure of an academic discipline and presenting this social content interactively in a way that is useful for different groups of stakeholders.

This research takes a human-centered approach to the problem of specifying academic disciplines. By understanding similarities in content between individual researchers, collaborations among researchers and geographical correlations among groups of researchers, this research will reveal the human substrate from which research content arises.

The specific approach has two parts. First, we will develop a set of techniques for analyzing and presenting the content within an established discipline to different groups of stakeholders. We have chosen the field of computer science as an initial discipline for exploration based on the investigators’ existing expertise in that area. We will use innovative text mining techniques, building on previous work by Dr. Smyth and his colleagues [7, 25, 26, 31], on a corpus of computing articles to discover the social structure of the various sub-disciplines. We have an agreement with Rexa, a leading digital library of computing publications, to provide the full text of their collection as the initial corpus for the research.

The discovered social structure will be presented interactively to several groups of users, drawing on Dr. Tomlinson’s expertise in real-time animation and interactive visualization [34, 35, 37, 39]. Exploring the discovery and visualization of the social content implicit within a given discipline’s collection of journal articles will enable that body of content to be understood in a new way. We will assess the utility of the presentations to several constituencies of users, drawing on Dr. Nardi’s experience in studying the impacts of digital technologies on user practices [18, 20, 21, 23, 24, 48].

By building a system with an understanding of the social structure of academic disciplines, the proposed research will go beyond keyword search to facilitate engagement between communities of researchers and other interested groups. The proposed research will provide an immediate result in the field of computer science that is very likely to be of value to that community, and a longer-term effort that has the potential for a substantial improvement in the way science is conducted as a whole.

2 Constituencies of Users

The proposed research will engage four constituencies of users who could benefit from discovering and visualizing the social structure of an academic discipline. These four are not the only ones that would find the proposed technology useful, but they ground the research through real problems in defined user populations. We believe these constituencies will suggest the general utility of the technology across diverse users. They will enable us to assess the impact of the tools on the practices of specific groups of users. The research aims to link technological innovation in data mining and visualization with an assessment of use in the context of real users.

K-12: Person + Discovery

The astonishing popularity of social websites such as MySpace and Facebook [3, 8] suggests the congeniality of a social view of science for the younger generation. To connect science to students, to give them a human-centered view not just of giants such as Darwin or Einstein, but also of contemporary scientists, the technology we develop will reveal science as a social space of collaboration, changes in scientific interests and questions, funding activities, trajectories of publications, friendships, and relationships. Science is conducted by people, though textbooks eliminate the human element by presenting science as accumulated facts. Existing connections to grade-school-aged students [36, 38] and UCI’s Education program [43] will provide a vital testbed in which to examine the impact of the proposed technology with this group of stakeholders.

Given that our focus in this project is in the area of computer science, a possible side-benefit of our system will be to encourage high-school students to choose to pursue degree programs in computer science (and also to encourage the undergraduate constituency (below) to pursue graduate degrees). By helping to illustrate the wide diversity of research topics and human aspects of modern computer science, the proposed system could provide a step towards re-igniting interest in computer science career paths among K-12 students.

Undergrads: Person + Topic

As mentioned earlier, undergraduate students are often uncertain about how to match their interests to actual professors they might work with and programs that would suit their needs for graduate study. What academic areas are available to prospective students? Where do their potential professors publish? With whom do they collaborate? Do they work on small or large projects? Where are they located? What kind of funding do they seek? Which conferences do they attend? Are there faculty and students with similar interests in a nearby geographical area? How have faculty interests changed over time? Do faculty conduct interdisciplinary work? Can a student see ways to find a niche related to a faculty member’s work? In short, students need a way to characterize the academic disciplines in which they might meaningfully participate.

Such questions can be partially answered with Google or other search engines through a painstaking search process. But the webpages of individual faculty members are not always up to date. More fundamentally, relationships such as collaborators over time or changing interests are extremely difficult to discern through standard search engine technology. The proposed technology will provide not lists of documents but visualizations of connected people, depictions of change over time in such areas as conferences attended and size of research projects, and other information to help one discover characteristics of an academic discipline.

Early career researchers: Person + Field

As interdisciplinary work becomes increasingly important, finding collaborators outside one’s field (and even within one’s field) is challenging for early career researchers who do not yet have wide social networks. Early career researchers need to develop relationships to find reviewers for conference papers, learn about granting opportunities, and understand the leading edge of the scientific arena in which they are working, or new areas into which they might move. For this constituency, gaining an understanding of the forward trajectory of a discipline is especially important. Sources such as conference workshops, conference paper abstracts or full texts, keynote speeches, program committees, and the like may provide important clues. Again, a handcrafted keyword search would be useful but not ideal. A goal of our research is to help early career researchers understand the dynamics of a field more easily using innovative text mining techniques.

Deans and university administrators: Person + Organization

Deans and university administrators need to understand academic disciplines for long-range planning, for recruitment, for review, for developing interdisciplinary research units within the university, for finding thought leaders for invited talks, for visiting scholars, for cross-university collaborations, and many other tasks associated with guiding a university to attain its full potential. This is a daunting task in need of computational tools that would allow university leadership to assess a field as a whole, to see relationships between scholars’ work, and to find connections within and beyond their own universities. Our proposed technology will reveal the social structure of academic disciplines that will make such understandings easier and more complete.

3 Research Questions

There are several areas in which we see significant research problems. In the area of text mining, research questions include:

How can accurate models of an individual's interests over time be inferred from multiple noisy text sources related to that individual?

Can networks of links between individuals be leveraged to better understand the topical interests of individuals?

In the area of human-centered visualization, research questions include:

How can visualizations draw on existing cultural interaction paradigms to make this information available to different constituencies in ways with which they already have experience?

How can temporal factors contribute to an understanding of academic disciplines?

In terms of assessing impact on practice, research questions include:


What are the specific ways in which the technology is useful across different constituencies of users?

In what ways can the technology be extended to meet the needs of the constituencies?

How might different constituencies benefit from accessing the same information about academic disciplines in different ways?

4 Technical Background

The primary data that will let us discover and visualize the social structure of the academic discipline of computer science are research papers publicly available on the web. While it is easy to crawl the web and harvest millions of research papers, it is more difficult to identify papers relevant to computer science and extract specific pieces of information from these papers. Research papers or academic articles contain various pieces of information including: author(s), title, abstract (and possibly the full text of the article), date, and journal or conference. Metadata such as keywords and subject headings are sometimes available, as is contact information such as the author’s affiliation, email and mailing address. Identifying and accurately extracting this information can be difficult, and is still an active area of research.

Since our focus is to develop new ways of discovering and visualizing the structure of computer science research, we will leverage other work that has centered on the time-consuming tasks of web crawling, information extraction, and entity resolution. We have permission to obtain text and metadata from the Rexa digital library (http://rexa.info), which contains over 7,000,000 paper references, 380,000 full text papers, and names of 880,000 authors in computer science research. Rexa was funded by the National Science Foundation, under Information Technology Research (ITR) grant IIS-0326249. Researchers on the Rexa project have developed tools that automatically extract from research papers various pieces of information, and attempt to resolve the many mentions of authors in the entire collection. We will obtain periodic snapshots of the abstracts plus metadata behind the Rexa digital library. We may also periodically obtain content from the CiteSeer scientific literature digital library (http://citeseer.ist.psu.edu/, also an NSF-sponsored project), whose 760,000 documents include coverage of computer science. CiteSeer’s metadata (title, authors, abstract, date, etc. for each document) is publicly available.

At this point we will have a rich and large collection of papers and authors, along with metadata associated with those papers and authors. Having the abstract – or the full text – of papers is extremely useful for researchers wishing to understand exactly what experiments were performed and what the results were. But our goal is not to create a digital library. Instead, we would like to summarize this text so that we can infer the research interests of a particular author, group or department.

We will use statistical topic modeling to model the collection of research papers and automatically summarize the text [2, 4-6, 11, 16, 42]. The topic model is an unsupervised statistical language model that is based on the idea that each document is made up of one or more topics, where each topic is a thematic distribution over words. More specifically, the topic model is a mixture model, where each document is represented as a finite mixture of topics, i.e., each document is a multinomial distribution over topics, where each topic is a multinomial distribution over words. The model automatically learns topics by finding groups of words that tend to co-occur in documents. While not all papers have equal impact on or importance to a discipline, the topic model produces sensible topics without knowing this information because it learns from a huge amount of data.
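As a rough sketch of how such a model is learned, a collapsed Gibbs sampler for a toy two-topic corpus might look like the following (the corpus, topic count, and hyperparameter values are illustrative assumptions, not the project's actual configuration):

```python
import random
from collections import defaultdict

# Toy corpus: three tiny "abstracts" (made up for illustration).
docs = [
    "cache memory processor latency cache".split(),
    "user interface usability interaction user".split(),
    "memory cache interface user".split(),
]
K, alpha, beta = 2, 0.5, 0.01        # topics and Dirichlet hyperparameters
vocab = sorted({w for d in docs for w in d})
V = len(vocab)

random.seed(0)
# z[d][i] is the topic assigned to the i-th word of document d
z = [[random.randrange(K) for _ in d] for d in docs]
ndk = [[0] * K for _ in docs]              # document-topic counts
nkw = [defaultdict(int) for _ in range(K)]  # topic-word counts
nk = [0] * K                                # total words per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1

for _ in range(200):                        # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]                     # remove current assignment
            ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
            # conditional p(k | rest) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ)
            weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + V * beta)
                       for j in range(K)]
            k = random.choices(range(K), weights)[0]
            z[d][i] = k
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1

for k in range(K):
    # most likely words per topic, analogous to the lists shown below
    print(f"topic {k}:", sorted(nkw[k], key=nkw[k].get, reverse=True)[:3])
```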

To help make concrete this idea of learned topics, we list below selected topics from our topic model run on a collection of 520,000 CiteSeer abstracts:

[probabilistic models] regression estimator variance estimation model bias nonparametric ...
[computer hardware] memory cache access performance caching locality caches prefetching data latency processor ...
[human computer interaction] interface user user_interface graphical interaction interactive usability ...
[logic] formula boolean formulas sat clauses satisfiability instances clause propositional ...
[clustering] cluster hierarchical clustering hierarchy hierarchies data clustering_algorithm grouping ...
[graph algorithms] tree trees nodes spanning node algorithm binary leaves leaf rooted root minimum traversal ...

Each topic lists words in order of likelihood (most likely word first), with the ellipsis indicating that there is no hard cutoff to the number of words in a topic. Each list of words clearly conveys a specific theme, which is encapsulated by the human-assigned label in square brackets. We focus on topics because they will be the basis in our project for characterizing papers, researchers, groups and departments.

The author-topic model, originally developed at UC Irvine, is an extension of the topic model that is useful for modeling collections of text documents where each document is written by one or more authors [31]. The author-topic model is particularly valuable for modeling collections of academic literature (e.g., the collection proposed for this study), where individual papers are usually written by several authors. The model not only learns topics discussed in the collection, it also learns the topics about which authors write. The topic representation has also been shown to provide a powerful mechanism for searching through large document sets by providing a flexible concept language to help users navigate at a level that is both more general and more accurate than searching directly on keywords [7, 25, 26, 30, 45].

Once a topic model or author-topic model is learned, the underlying probabilistic framework allows great flexibility in the range of queries that can be answered. We can answer simple queries such as who are the most prolific researchers in a particular topic, or what research papers are most relevant to a given topic. We can also answer more complex queries such as which researchers are most similar (topically) to a given researcher; what institution has a group similar to MIT’s Media Lab; and on what topics does Carnegie Mellon’s computer science department tend to focus? Figure 1 shows an example of the list of research topics and related faculty that were automatically inferred for a particular faculty member based on a prototype topic-browsing system that we built for UC Irvine and UCSD faculty. The system crawled the Web pages for 460 faculty members, downloaded 12,000 publications, extracted bags of words, and ran a statistical topic model with 200 topics. The resulting topic models were manually named and a browser (using PHP and MySQL) was built that automatically constructs ranked lists of topics by faculty members, allowing a user to navigate among faculty and topics in an intuitive manner; the system is online at http://yarra.calit2.uci.edu/calit2.

Figure 1: Example of a faculty profile automatically derived using topic models.
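For instance, the "most similar researcher" query reduces to comparing inferred author-topic distributions. A minimal sketch, with made-up names and distributions, might use the Hellinger distance:

```python
import math

# Hypothetical author-topic distributions (each sums to 1); in the real
# system these would be inferred by the author-topic model.
theta = {
    "researcher_a": [0.70, 0.20, 0.10],
    "researcher_b": [0.65, 0.25, 0.10],
    "researcher_c": [0.05, 0.15, 0.80],
}

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

def most_similar(name):
    """Rank all other researchers by topical closeness to `name`."""
    return sorted((n for n in theta if n != name),
                  key=lambda n: hellinger(theta[name], theta[n]))

print(most_similar("researcher_a"))  # researcher_b ranks before researcher_c
```

Other distribution distances (e.g., symmetrized KL divergence) would slot into the same ranking scheme.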

5 Novelty of the Proposed Research

In order to develop the underlying text mining technology to support the proposed visualization system, we will need to develop novel statistical models and algorithms for inferring researchers' topical interests from text. Specifically, given the nature of the data available, we will develop new techniques for simultaneous statistical topic modeling of multiple text sources, new algorithms for inferring network-based representations of researchers, and new techniques for learning researchers' trajectories over time.

From the point of view of the visualization system, the most closely-related work we know is Rexa. While our system will be based on information provided by Rexa, the user experience will be dramatically different. Rexa is an excellent tool for researchers. Through its text-based display and two main views (author and paper), researchers can efficiently discover information, learn about the latest research in a particular field, see an author’s list of publications and list of co-authors, and see who has cited a particular paper.

We propose to build a complementary system to Rexa that will have several differences. First, we will develop a system to serve a broader range of users: in addition to researchers, we will also focus on user experience for our three other constituencies: K-12 students, undergraduate students, and deans and university administrators. Second, we will go beyond purely text-based displays to increasingly graphical displays of information. The new visualization techniques are one of the main focuses of this proposal. Third, the user interface will explicitly include dimensions such as time and geography.

6 Text-Mining Research

Creating a Computer Science Researcher Database

For simplicity, and to concentrate on the needs of our constituencies of users, we will restrict our focus to computer science faculty and researchers currently working at US and Canadian academic institutions and major research labs. This restriction will eliminate certain authors from our system, such as graduate students, postdoctoral researchers, and researchers at institutions outside the US and Canada. We realize that omitting overseas researchers (for example) limits the scope of the proposed system; focusing on the US and Canada will nonetheless produce a broadly useful system while also substantially increasing the likelihood of success given the resources being requested for the project.


The Open Directory Project at dmoz.org currently lists 281 US academic computer science departments (with Web pages for most). Estimating 15 to 30 faculty researchers per department, and including researchers at better-known industry research labs such as Microsoft Research and IBM Research, results in approximately 5,000 to 10,000 computer science researchers in total.

While acknowledging that departmental Web sites are fallible (for example, faculty may still be listed after they have retired or moved), they are nonetheless likely to be a much more reliable source of current information than many alternatives (such as crawling the Web directly for information based on a particular name). We thus expect that entity-resolution will not be a major issue given that we will be starting from a relatively rich set of initial information (e.g., an individual’s name, current university affiliation, and list of publications). This is in contrast to the more usual and difficult case of trying to resolve between entities based on their name and co-author associations but not knowing their affiliation or list of published papers.

We will start our research by developing a database containing a rich set of attributes for our set of US-Canadian computer science researchers, including (for each researcher): job title, department, school (if applicable), university, home page URL, picture, list of papers (identified by paper IDs), list of research topics, list of collaborators and coauthors, affiliations with conference program committees and journals, list of graduate students (for faculty), email address, phone number, fax, and mailing address. These data will be gathered in an automated fashion, by developing relatively standard Web crawling algorithms that will gather relevant information from the following sources:

university department and industry lab Web sites, including text from individual researcher Web pages such as publication lists, contact information, pictures of the individual, text from Web pages that a home-page points to (e.g., class pages, project pages), text from a CV if available, geographic location of the department and university;

the Rexa system at the University of Massachusetts (by prior arrangement with Professor Andrew McCallum, who has provided an accompanying letter of support), including for each researcher a list of published papers. For each paper, Rexa has extracted a rich set of attributes including abstract text, title, list of authors and affiliations, source (venue the paper was published in), date published, and topic distribution. In addition, Rexa will also provide, per researcher, their institution, department, and topic distribution, as inferred by algorithms currently being developed by the Rexa project;

other sources of relevant information about researchers, such as conference and journal Web sites that list program committees, invited speakers, and editorial board members.

The information extraction portion of the work will then combine this information to populate a database for the 5,000 to 10,000 researchers of interest. The names obtained from departmental and research lab Web sites will be assumed to be the most reliable initial source of information, and the Rexa and other data will then be parsed to associate different papers and other information with the appropriate individual (e.g., by matching a reference to a paper on an individual’s publications list on their home page with the list of papers known to Rexa for persons with the same name).
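A minimal sketch of that title-matching step might look like the following (the normalization scheme, scraped titles, and candidate records are illustrative assumptions, not Rexa's actual data or matching algorithm):

```python
import re

def normalize(title):
    """Lowercase a title and strip punctuation so near-identical strings match."""
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

# Hypothetical inputs: titles scraped from a home page, and candidate
# Rexa records for persons sharing the same name.
homepage_titles = ["Topic Models for Digital Libraries.",
                   "A Survey of Gibbs Sampling"]
rexa_candidates = {
    "person_1": ["topic models for digital libraries", "learning author profiles"],
    "person_2": ["protein folding dynamics"],
}

def best_match(titles, candidates):
    """Pick the candidate whose paper list overlaps most with the home page."""
    scraped = {normalize(t) for t in titles}
    scores = {pid: len(scraped & {normalize(t) for t in papers})
              for pid, papers in candidates.items()}
    return max(scores, key=scores.get)

print(best_match(homepage_titles, rexa_candidates))  # person_1
```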


Statistical Text Mining and Querying

As discussed earlier, topic models provide a useful statistical framework for linking words across documents. Based on prior work with modeling portions of the CiteSeer digital library, we estimate that 1,000 topics will be needed to adequately represent the full breadth of computer science. The statistical topic models will be stored in a SQL relational database as sets of tables, with a topic-word table for each topic (containing the word-distribution for that topic) and author-topic distributions for authors. The distributions can be efficiently stored as sparse sets of counts from word-topic and author-word assignments from multiple Gibbs samples [30, 31]. The sparse counts can be smoothed and normalized for inference purposes at query time.
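The query-time smoothing and normalization of sparse counts can be sketched as follows (the counts, vocabulary size, and smoothing constant are invented for illustration):

```python
# Sparse topic-word counts as they might be stored in the database
# (one entry per (topic, word) pair); values here are made up.
sparse_counts = {("t1", "cache"): 50, ("t1", "memory"): 30, ("t1", "latency"): 20}
vocab_size = 10_000
beta = 0.01  # Dirichlet smoothing parameter (an assumed value)

def word_prob(topic, word, counts, V, beta):
    """Smoothed p(word | topic) from sparse counts: (n_kw + β) / (n_k + Vβ)."""
    n_kw = counts.get((topic, word), 0)
    n_k = sum(c for (t, _), c in counts.items() if t == topic)
    return (n_kw + beta) / (n_k + V * beta)

p_seen = word_prob("t1", "cache", sparse_counts, vocab_size, beta)
p_unseen = word_prob("t1", "prefetching", sparse_counts, vocab_size, beta)
# An unseen word still gets a small nonzero probability
print(p_seen, p_unseen)
```

Storing only nonzero counts keeps the tables small; the smoothing term guarantees every vocabulary word has positive probability under every topic at query time.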

The resulting topic models will allow a user to interactively explore faculty, topics, groups, departments, universities, and geographic regions, and to answer questions such as:

Q: What topics does Jon Kleinberg at Cornell work on?
Q: What topics does the ACM SIGIR 2007 program committee span?
Q: What is the most similar department, topically, to Stanford in the area of databases?
Q: What universities are most active in the area of computer security in New England?

Because the topic model is a probabilistic model (in effect a very large Bayesian network defined over words, documents, topics, and authors) queries such as those above can be formulated as probabilistic inference, namely the calculation of conditional probabilities for events of interest given conditioning information. Thus, for example, finding universities doing research on computer security in New England requires that we first infer an aggregate topic distribution for the set of researchers at each university in New England. In turn, finding the topic distribution for a set of researchers amounts to calculating a mixture of topic distributions over documents authored by members of the set – this mixture distribution can be quickly computed by summing appropriate counts in the document-topic matrix. The resulting topic distributions for each university can then be ranked in terms of how much probability mass they put on words or topics related to computer security. From the viewpoint of the user interface the underlying probability calculations will be hidden. Real-time query-answering takes place by parsing high-level user queries in the visualization system, sending the appropriate SQL queries to the database to generate the relevant sets of counts, and then aggregating, smoothing, and ranking the results.
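A hedged sketch of that aggregate-and-rank computation follows, with invented per-document counts and a made-up topic index standing in for "computer security":

```python
# Illustrative document-topic counts for each university's researchers;
# the "security" topic index and all numbers are assumptions.
SECURITY = 2
doc_topic_counts = {
    "university_a": [[5, 1, 14], [2, 0, 18]],   # one row per document
    "university_b": [[10, 8, 2], [7, 11, 2]],
}

def topic_distribution(rows):
    """Aggregate per-document topic counts into one normalized distribution."""
    totals = [sum(col) for col in zip(*rows)]
    n = sum(totals)
    return [t / n for t in totals]

def rank_by_topic(counts, topic):
    """Rank universities by probability mass on the given topic."""
    return sorted(counts,
                  key=lambda u: topic_distribution(counts[u])[topic],
                  reverse=True)

print(rank_by_topic(doc_topic_counts, SECURITY))  # university_a first
```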

Research on Multi-Source Topic Models and Topic Networks

The standard topic modeling framework is intended for documents from a single corpus, e.g., a set of technical abstracts from CiteSeer. In contrast, we will have available multiple different types of text information related to each individual, including text from their papers, text from their home page, text from Web pages linked to their home page (such as course pages or project pages), information about their Ph.D. advisor, their research group, and close collaborators, abstracts from seminar talks they have given, information from their resume (if available), and so on. In this context we will investigate new algorithms for multi-source topic modeling, namely, inferring topic models using multiple sets of text-based information. A baseline strategy will be to first learn a topic model from technical papers (the most reliable source) in the standard manner and "freeze" the resulting topic representation. We can then infer (using Gibbs sampling) how to represent each author in this topic space using not only the words from the technical papers but also using additional words associated with that author from resumes, seminar talks, Web pages, etc. We will compare this with a more ambitious approach where the topic-word and author-topic models are learned from multiple sources in a single pass, using a hierarchical graphical model (e.g., similar in spirit to that of [33]). We will explore techniques that allow the different sources to have different weights in terms of their reliability and relevance to the technical topics, including weights assigned manually based on human heuristics as well as probabilistic source weights that are inferred during the topic learning. The different approaches will be quantitatively compared using standard language modeling measures such as perplexity scores on test documents.
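One simple version of manually weighted source combination could look like this sketch (the frozen topic-word probabilities, source weights, and word counts are all illustrative, and the real system would use Gibbs sampling rather than this direct scoring):

```python
# A frozen topic-word model p(word | topic) and per-source reliability
# weights; all values are invented for illustration.
phi = {
    "hardware": {"cache": 0.4, "memory": 0.4, "teaching": 0.0, "hiking": 0.0},
    "hci":      {"cache": 0.0, "memory": 0.0, "teaching": 0.3, "hiking": 0.1},
}
source_weight = {"papers": 1.0, "homepage": 0.3}  # papers trusted most

author_words = {
    "papers":   {"cache": 10, "memory": 6},
    "homepage": {"teaching": 4, "hiking": 2},
}

def author_topics(words_by_source, phi, weights):
    """Heuristic weighted score for p(topic | author) across all sources."""
    scores = {t: 0.0 for t in phi}
    for src, counts in words_by_source.items():
        for w, n in counts.items():
            for t in phi:
                scores[t] += weights[src] * n * phi[t].get(w, 0.0)
    z = sum(scores.values()) or 1.0
    return {t: s / z for t, s in scores.items()}

print(author_topics(author_words, phi, source_weight))
```

Down-weighting the home-page words keeps hobby vocabulary from dominating the inferred research profile, which is the intuition behind the source weights.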

To leverage the linked aspect of the data we will investigate a new approach called topic networks for analyzing a set of linked individuals and their research interests. Topic networks are statistical models in which the locations of individual researchers (nodes) are embedded in a latent Euclidean space and relations among individuals are represented by edges (directed or undirected depending on the nature of the relation). Researchers with similar interests are likely to be closer in this space than researchers with dissimilar interests. The probability of a relation (edge) between two individuals will be a parametrized logistic function of the attributes of each of the individuals - this type of statistical modeling has been shown to be broadly useful for learning about general patterns of relations among individuals in a social network in a manner that includes both individual attribute information and relations between individuals [12]. Because the underlying model is probabilistic it can be evaluated predictively, e.g., by predicting the probability of edges between new test nodes and nodes that were used to train the model. We will investigate the efficacy of different ways to represent an individual's (node's) attributes including informative keywords from an individual's homepage, binary variables representing membership in a society or program committee, and topic distributions as inferred by the standard (non-network-based) author-topic model. Developing algorithms for inferring both the network embedding and the topics simultaneously is likely to be beyond the scope of this project. Thus, our basic strategy for learning a network embedding will be a 2-stage process of first learning an author-topic model and then using the inferred topics, plus additional information about authors and their relations.
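The logistic edge model can be sketched as follows (the positions and intercept are invented; in the actual topic-network model the embeddings would be learned from data):

```python
import math

# Hypothetical 2-D latent positions for three researchers; in the model
# these embeddings would be inferred, not fixed by hand.
position = {"a": (0.0, 0.0), "b": (0.3, 0.1), "c": (2.5, 2.0)}
bias = 1.0  # intercept of the logistic link (an assumed value)

def edge_prob(u, v):
    """p(edge u-v) = logistic(bias - distance): closer pairs link more often."""
    dist = math.dist(position[u], position[v])
    return 1.0 / (1.0 + math.exp(-(bias - dist)))

print(edge_prob("a", "b"), edge_prob("a", "c"))  # nearby pair is more likely
```

Node attributes (keywords, committee memberships, topic distributions) would enter as additional terms inside the logistic function alongside the distance.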

Learning about a Researcher's Trajectory over Time

The techniques we have described above implicitly have no notion of time – for example, a researcher’s topical interests are based on all papers they have written (rather than just recent papers), and their affiliation and geographic location will be their most recent ones. To add time to our models we plan to investigate (in years 2 and 3 of the project) automated inference of researchers' timelines.

Our strategy here will be to first learn topic models without any explicit representation of time by aggregating documents over time. We will then post-process the resulting representations to estimate individual topic trajectories by using word-topic assignments from the Gibbs sampler to infer topic distributions for a specific time period. We are aware that topic models that explicitly represent time can also be inferred (e.g., [41]) but given the resources of this project we will limit the scope to the simpler post-processing approach, which we have found to work well [31].
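A minimal sketch of this post-processing step, assuming that each per-token topic assignment from the Gibbs sampler has been tagged with the year of the paper it came from (the data layout here is hypothetical):

```python
from collections import Counter, defaultdict

def topic_trajectory(token_assignments, n_topics):
    """Post-process Gibbs word-topic assignments into per-year topic
    distributions for one researcher.

    token_assignments : iterable of (year, topic) pairs, one pair per
                        token drawn from that researcher's papers
    Returns {year: [p(topic 0), ..., p(topic K-1)]}.
    """
    counts = defaultdict(Counter)
    for year, topic in token_assignments:
        counts[year][topic] += 1
    trajectory = {}
    for year, c in counts.items():
        total = sum(c.values())
        # normalize token counts within each year into a distribution
        trajectory[year] = [c[k] / total for k in range(n_topics)]
    return trajectory
```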

We will also infer institutional affiliations over time for individuals. Preliminary investigations (using the Internet Archive, http://www.archive.org, for example) have shown that historical snapshots of faculty listings on department web pages can be used as a useful starting point for determining past affiliations. Using an archived faculty webpage from UMass's computer science department, we were able to determine that Paul Cohen – who is currently at USC Information Sciences Institute – was previously at UMass Amherst. Similarly, we determined that Michael Pazzani – currently at Rutgers – was previously at UC Irvine. These snapshots can also be used to help reconstruct the growth of departments.

Table 1. Affiliation information over time for Paul Cohen and Michael Pazzani, from the Internet Archive.
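A sketch of how such snapshots could be post-processed once fetched (the snapshot data below is synthetic; a real pipeline would retrieve archived faculty pages from the Internet Archive and extract names from the HTML):

```python
import re

def affiliations_from_snapshots(snapshots, name):
    """Infer which years a researcher appeared on a department's
    faculty listing, given archived page snapshots.

    snapshots : {year: html_text} for one department's faculty page
    name      : researcher name to look for
    Returns the sorted list of years in which the name appears.
    """
    pattern = re.compile(re.escape(name), re.IGNORECASE)
    return sorted(year for year, html in snapshots.items()
                  if pattern.search(html))
```

Running this over snapshots of several departments' pages would yield, for each researcher, the year ranges in which each department listed them, which is exactly the raw material for the affiliation timelines in Table 1.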

In the general context of inferring topics and affiliations for authors over time, we will investigate simple Markov-based models to constrain our inferences about authors, leveraging the fact that the primary topical interests and institutional associations of authors typically have persistence over time. Small sets of manually-labeled data (for affiliations) will be used to evaluate the accuracy of such approaches.
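As a hypothetical sketch of such a Markov constraint, the following Viterbi-style decoder prefers affiliation sequences that persist from year to year; with a high self-transition probability, a single noisy year of evidence is smoothed away (institution names echo the example above, but the probabilities are invented):

```python
import math

def smooth_affiliations(obs_scores, states, stay=0.9):
    """Most likely affiliation sequence under a simple Markov
    persistence prior: staying at the same institution from one
    year to the next has probability `stay`.

    obs_scores : list over years of {institution: p(evidence | institution)}
    states     : list of candidate institutions
    """
    switch = (1.0 - stay) / max(len(states) - 1, 1)
    # initialize with the first year's evidence (uniform prior over states)
    V = [{s: math.log(obs_scores[0].get(s, 1e-12)) for s in states}]
    back = []
    for t in range(1, len(obs_scores)):
        col, ptr = {}, {}
        for s in states:
            best_prev = max(
                states,
                key=lambda p: V[-1][p] + math.log(stay if p == s else switch))
            col[s] = (V[-1][best_prev]
                      + math.log(stay if best_prev == s else switch)
                      + math.log(obs_scores[t].get(s, 1e-12)))
            ptr[s] = best_prev
        V.append(col)
        back.append(ptr)
    # backtrack from the best final state
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

With `stay=0.9`, a middle year whose evidence weakly favors a different institution is decoded as a continuation of the surrounding affiliation; with `stay=0.5` (no persistence prior) the noisy evidence is taken at face value.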

The resulting time-based information will allow us to generalize the query language and will allow a user to pose questions such as:

Q: What research topics did Michael Jordan work on in the time-period 1995-2000?
Q: Where was Ramesh Jain working in 2000?
Q: Who was the most prolific researcher in computer security in 1995-1996?
Q: Where do UCI graduates in human-computer interaction take jobs when they graduate?
Q: Which universities have been recruiting in ubiquitous computing over the last five years?
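Queries of the "Where was X working in year Y?" form could be answered directly from time-stamped affiliation records; the records in this sketch are illustrative (the year ranges are invented, not inferred data):

```python
def affiliation_in_year(records, name, year):
    """Answer 'Where was researcher `name` working in `year`?' over
    time-stamped affiliation records.

    records : list of (name, institution, start_year, end_year) tuples
    Returns the matching institution, or None if no record covers
    that year.
    """
    for who, inst, start, end in records:
        if who == name and start <= year <= end:
            return inst
    return None

# Illustrative records only -- the years here are hypothetical.
records = [("Paul Cohen", "UMass Amherst", 1990, 2003),
           ("Paul Cohen", "USC ISI", 2004, 2007)]
```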

7 Visualization Research

The project will involve four distinct visualization techniques, one for each of the four groups of users. These visualizations will draw on two main sources for the look-and-feel and interactivity elements – current cultural interaction paradigms such as those in social networking sites and online mapping software, and current research in interactive visualization (e.g., [9, 13, 40, 44]). These four visualizations will be designed to transition smoothly into one another, as described below.

For the K-12 group, the visualization will be inspired by social web sites such as Friendster, Facebook, MySpace, and Orkut. Each scientist will have a personal page that is dynamically generated with his or her picture, contact information, research topics, and pictures of collaborators. With each of these items, it will be possible to click to go deeper into that area – the picture and contact information link to the scientist’s home page or other relevant site, the research topics bring up lists of other researchers with similar interests, and the pictures of collaborators link to those researchers’ pages in the system. Students will be able to maintain their own pages, creating lists of researchers they find interesting. By enabling students to have sites that are similar to those of the scientists, the system will help students to see themselves as similar to scientists.

For undergraduates, the focus will be on the geographic location of different researchers, and on change over time (see Figure 2). For example, students would be able to pick all professors interested in a specific research topic (or pick a specific researcher and click “more like this,” which would select a collection of other similar researchers) and then watch a visualization of where those researchers are located on a map of the earth. Below the map, there would be a timeline that allows for scrolling through time to see movements and originations of researchers in a given field. So, if a certain university has graduated numerous scholars specializing in a certain subfield, then moving the timeline forward should allow a user to see a cascade effect as individuals (represented by glowing dots) leap from that university to take faculty positions at other universities and research labs. This process would allow students to understand the movements of researchers around the world and to locate sources and sinks of relevant researchers. In addition, it will be tied to the K-12 visualization style by having each researcher’s dot be clickable, thus taking the viewer to the personal page described above.

Figure 2: An example of a possible undergraduate view of the visualization. Building on Google Maps, the system would allow users to enter one or more search terms, and a depth (i.e., number of researchers). The top N researchers for the search term would then be displayed. By moving the Time slider bar, users would view the change in location of researchers over time. Clicking on any of the researcher images would bring the user to the MySpace-style representation of that researcher, with his or her contact information, list of research areas, collaborators, and other information.

For junior faculty, the visualization would be similar to the undergraduate visualization, showing people’s movement over time, but with a shorter time horizon. Specific conferences will be shown on the visualization, so that groups of researchers can be seen coalescing in a common location. Author lists of conference papers would enable authors to be correlated with conferences. Note that all authors will appear to “travel” to a given conference, even if only a subset of them actually attended. By interacting with a visualization of researchers’ locations both at home institutions and at conferences, junior faculty can develop an understanding of who is working with whom, what communities exist, and from where they draw their audiences. As with the undergraduate visualization described above, each element would link to a MySpace-style page about that researcher, institution or conference.

For administrators, the visualization will focus on the institution. The basic view will be the world map described above, but with institutions highlighted rather than individuals. Just as people have collaborations, so too can institutions be seen to have collaborations with each other, based, for example, on the number of cross-institution publications. While the primary mechanism by which administrators work with the system will be at the institutional level, it will be possible to transition smoothly to the level of the individual researcher to see the component parts of a cross-institution collaboration.

We have sketched potential visualizations in broad outline. Our research will focus on how to create novel visualizations to present academic disciplines as they are discovered through text mining. These four visualizations will be designed to leverage the benefits of being created together. During the implementation phase, these four visualizations will be developed in parallel to enable them each to adapt to the needs of the others.

8 Assessment Research

For each constituency, we will identify users and deploy and test the technology, assessing its usefulness. (Our focus will not be on usability except inasmuch as it is necessary for a testable prototype.) We will utilize qualitative and quantitative methodologies following those set forth in [1, 15, 28]. The assessment will occur in three phases and has four key questions:

1. Is this system better than a conventional keyword search?
2. What value do users find in the system?
3. How can the system be extended and improved to more effectively meet user needs?
4. How would a data mining system to discover academic disciplines change practice in the four constituencies?

In general, our approach is a variant of “patchwork prototyping” [14] in which high-fidelity prototypes are tested with users for “requirements gathering which is not purely need-based, but also opportunity- and creativity-based” [10]. Given the nature of the system we are developing, iteration is likely to be somewhat slower than under ideal patchwork prototyping conditions. But we subscribe to the general philosophy of involving users with functioning systems to which they can respond with reflective comments and feedback, rather than simply paper prototypes or other typical participatory design methods (see [10]). Floyd et al. point out that when users can provide meaningful feedback in a timely manner based on real usage, the result is more likely to be useful guidance that can be vital in improving the technology for effective deployment.

Phase One of the assessment will comprise qualitative assessments from users. We want to know what the experience of using the two technologies (Google and our system) is like for discovering academic disciplines, what users like and dislike about each, how they would extend our technology if they could, if they see the technologies as complementary or if one is clearly better than the other, what changes to our technology they would recommend, and what actions they took as a result of using the technologies. We will not attempt to impose strict experimental conditions but simply ask users to use both systems in their own fashion. We will make observations of actual use and interview users about their experiences, asking them to conduct some inquiries with the systems.

Nardi has performed qualitative and quantitative assessments of Internet technologies such as instant messaging, video, blogging, gaming, and a system she and colleagues designed, ContactMap [19-24, 48]. Each user will participate in an in-depth audiotaped semi-structured interview after using the technologies for at least two weeks. This trial will enable us to understand users’ experiences with a keyword search engine and with our technology. We will provide technical support to make sure that participants can use our technology with ease.

In Phase One we will deploy the technology with local users. All of the user communities are readily available to us locally. We can work with local grade schools, local undergraduate students, computer scientists in our university as well as at the many universities in southern California, and local deans and administrators as well as leaders in organizations such as the California Institute for Telecommunications and Information Technology who have tasks similar to those of deans and administrators.

In Phase Two we will modify the technology based on the qualitative assessment from Phase One. We will then make the technology available on a website so that we can gather quantitative data from a larger sample. We will advertise the technology in relevant places for each of our communities to attract users, such as appropriate listservs and forums. For some constituencies such as deans we will personally contact potential users to make sure we have an adequate sample.

The survey instrument will be designed in accordance with what is learned in the qualitative assessment to reflect users’ interests, concerns, and vocabulary. The surveys will be tailored to each constituency. Analysis will focus on simple descriptive statistics. We can then present these findings to our constituencies for discussion. This interaction is another means of getting feedback from users and engaging them in the design process. Such users will include Phase One users and possibly some who take the survey.

In Phase Three we will identify a small group of users in each constituency who wish to continue using the technology. We will interview them on a regular basis to see how use of the system changes their practice. Such interviews will take place in various media (e.g., phone, face to face, email) to reach a wide range of users (rather than just local users whom we can interview face to face). Such opportunistic methods are characteristic of ethnography, in which Nardi is trained as an anthropologist (see [32]). We believe we can best get detailed understandings of users’ experiences with the proposed technology through a range of methods, from in-depth qualitative interviews to large-scale surveying.

Question 4 regarding changing practice with use of the proposed system is the most difficult to answer. Within the time frame of the proposed research we can provide preliminary answers regarding how our technology would change practice. We will pay careful attention to the actions taken as a result of the use of the technology. Phase Three will allow us to follow some users over an extended period of time to gain deeper understandings about the technology and its impact on different kinds of users.

Approval of the use of human subjects will be obtained from the University of California, Irvine Institutional Review Board before the work commences.

9 Project Output and Dissemination

This project will create a publicly available system for visualizing computer science disciplines and sub-disciplines. This project has educational outreach as one of its core tenets. The goal of the project is to enable different constituencies to understand the social structure of academic disciplines. Since K-12 students and undergraduates are two of our key user groups, this project has a substantial educational outreach component embedded throughout the research. The research will be disseminated to the research community through publications and interactive demonstrations at conferences and journals such as AAAI, IJCAI, IEEE Viz, ACM CHI, and ACM TOCHI. In addition to the direct dissemination of this work to these intellectual communities, the tools we will develop in this work are broadly applicable outside computer science, in principle to any academic discipline where digital text is available (i.e., presumably most or all disciplines in the next 10 or 20 years).

10 Research Team and Management Plan

The PI and Co-PIs each have existing research groups of graduate and undergraduate student researchers who will contribute to the implementation, evaluation and dissemination of the research. The investigators will have weekly meetings to coordinate the ongoing research. Drs. Tomlinson and Smyth have experience leading software projects successfully in academia, and all three investigators have extensive experience working in interdisciplinary/collaborative team research.

11 Timeline

Months 0-6:
  Text Mining: Create initial version of text mining system; Obtain data from Rexa; Find current faculty; Identify groupings within departments; Run topic model.
  Visualization: Begin working with Google Maps; Design interface for visualization.
  Assessment: Preliminary research via interviews with users in the four constituencies.
  Dissemination: None.

Months 6-12:
  Text Mining: Find employment history for faculty; Begin work with time windows; Develop multi-source topic models; Harvest text from faculty home pages and associated/linked pages; Harvest vitae/resumes and extract information.
  Visualization: Create static version of Google Maps visualization; Flesh out three different map-based versions; Implement K-12 “MySpace”-style visualization; Develop PHP/MySQL browser.
  Assessment: None.
  Dissemination: None.

Months 12-18:
  Text Mining: Develop temporal aspects of text mining system; Develop models for topic networks; Find known relations among faculty (collaborators, students, advisors).
  Visualization: Design representation of time in visualization; Implement temporal functionality in Google Maps – slider bar, change over time; Implement system for displaying movement arcs on Google Maps.
  Assessment: Assess initial implementation with the four constituencies locally.
  Dissemination: Write and submit papers on core text mining and visualization research.

Months 18-24:
  Text Mining: Integrate temporal and static elements of system; Test predictive power of topic networks; Infer network embeddings.
  Visualization: Revise static visualizations based on assessment findings; Implement timeline and integrate with Google Maps.
  Assessment: Design survey instrument.
  Dissemination: None.

Months 24-30:
  Text Mining: Integrate conference information into visualization.
  Visualization: None.
  Assessment: Conduct online survey; Follow local users for in-depth study; Analyze results.
  Dissemination: Write and submit papers on temporal aspects of text mining and visualization; Write and submit papers on user response to technology.

Months 30-36:
  Text Mining: None.
  Visualization: Prepare for public release – test browser compatibility, etc.
  Assessment: None.
  Dissemination: Write and submit papers on full system, with accompanying assessment results; Supervise public release.

12 Summary of Prior NSF Support

Padhraic Smyth has served as PI and Co-PI on several past NSF grants, the most relevant one for this current proposal being NSF IIS-0083489, Data Mining of Digital Behavior. Under this award, several new techniques and algorithms for statistical text mining of large text corpora were developed, including the author-topic model [29, 31] for learning topical interests of authors, the entity-topic model [25, 26] for learning models that relate named entities and topics, and the special-words model [7] for matching queries and documents using both topics and keywords. In addition, several novel algorithms for analysis and prediction of network data over time (such as co-author networks) were developed, including local ranking algorithms [46], node clustering [47], and techniques for link prediction over time [27].

Bonnie Nardi has served as a PI on two small NSF grants (IIS-0424909, IIS-438848). The first grant resulted in a panel on participatory democracy at the ACM CSCW conference. The second resulted in a successful workshop held at UC Irvine in October 2004 and led to two publications [17, 49].


References:
[1] Bernard, R. Social Research Methods: Qualitative and Quantitative Approaches. Sage Publications, 2000.

[2] Blei, D., Ng, A. and Jordan, M.I. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3. 2003. 993-1022.

[3] boyd, d. and Heer, J., Profiles as Conversation: Networked Identity Performance on Friendster. in Proceedings of the Hawai'i International Conference on System Sciences (HICSS-39), (Kauai, HI, 2006), IEEE Computer Society.

[4] Buntine, W., Löfström, J., Perkiö, J., Perttu, S., Poroshin, V., Silander, T., Tirri, H., Tuominen, A. and Tuulos, V., A Scalable Topic-Based Open Source Search Engine. in Proceedings of the IEEE/WIC/ACM Conference on Web Intelligence (WI 2004), (Beijing, China, 2004), 228-234.

[5] Buntine, W., Löfström, J., Perttu, S. and Valtonen, K., Topic-Specific Scoring of Documents for Relevant Retrieval. in W4: Learning in Web Search, at the 22nd International Conference on Machine Learning, (Bonn, Germany, 2005).

[6] Buntine, W., Löfström, J., Perttu, S. and Valtonen, K., Topic-Specific Link Analysis using Independent Components for Information Retrieval. in AAAI-05 Workshop on Link Analysis (LinkAnalysis-2005) held at AAAI-2005, the Twentieth National Conference on Artificial Intelligence (AAAI-05), (Pittsburgh, Pennsylvania, 2005).

[7] Chemudugunta, C., Smyth, P. and Steyvers, M. Modeling general and specific aspects of documents with a probabilistic topic model. Advances in Neural Information Processing Systems, 19. 2006.

[8] Ellison, N., Steinfeld, C. and Lampe, C., Spatially Bounded Online Social Networks and Social Capital: The Role of Facebook. in Proceedings Annual Conference of the International Communication Association (ICA), (Dresden, Germany, 2006).

[9] Fisher, D., Ask Not for Whom the Visualization is Rendered; It is Rendered for Thee. in Social Visualization: Exploring Text, Audio, and Video Interactions; Workshop at CHI 2006, (Montréal, 2006).

[10] Floyd, I.R., Jones, M.C., Rathi, D. and Twidale, M.B., Web Mash-ups and Patchwork Prototyping: User-driven technological innovation with Web 2.0 and Open Source Software. To appear in Proceedings of the 40th Annual Hawaii International Conference on System Sciences (HICSS’07), (2007).

[11] Griffiths, T. and Steyvers, M. Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101. 2004. 5228-5235.

[12] Hoff, P.D., Raftery, A.E. and Handcock, M.S. Latent Space Approaches to Social Network Analysis. Journal of the American Statistical Association, 97 (460). 2002. 1090-1098.

[13] Hsieh, T.J., Kuester, F. and Hutchinson, T.C., Visualization of Large-Scale Seismic Field Data. in Proceedings of the ISC High Performance Computing Symposium (HPC 2003), (2003), Society for Computer Simulation.

[14] Jones, M.C. and Floyd, I.R. Patchworks of Open-Source Software: High-Fidelity Low-cost Prototypes in The Handbook of Research on Open Source Software. St. Amant, K. and Still, B. eds., Idea Group, Inc., 2007.

[15] Lofland, J. and Lofland, L. Analyzing Social Settings: A Guide to Qualitative Observation and Analysis. Wadsworth Publishing, 1994.


[16] McCallum, A., Corrada-Emmanuel, A. and Wang, X., Topic and Role Discovery in Social Networks. in International Joint Conference on Artificial Intelligence, (Edinburgh, Scotland, 2005).

[17] Nardi, B. Placeless Organizations: Collaboration for Transformation. Forthcoming in Journal of Mind, Culture and Activity. 2007.

[18] Nardi, B., Kuchinsky, A., Whittaker, S., Leichner, R. and Schwarz, H. Video-as-data: Technical and social aspects of a collaborative multimedia application. The Journal of Computer-supported Cooperative Work, 4. 1996. 73–100.

[19] Nardi, B., Ly, S. and Harris, J., Learning Conversations in World of Warcraft. in Proceedings of the Hawaii International Conference on Systems Science, (Big Island, Hawaii, 2007).

[20] Nardi, B., Schiano, D., Gumbrecht, M. and Swartz, L. Why We Blog. Communications of the Association for Computing Machinery, December. 2004. 41-46.

[21] Nardi, B., Whittaker, S., Isaacs, E., Creech, M., Johnson, J. and Hainsworth, J. ContactMap: Integrating communication and information through visualizing personal social networks. Communications of the Association for Computing Machinery, April. 2002. 89-95.

[22] Nardi, B.A. and Harris, J., Strangers and Friends: Collaborative Play in World of Warcraft. in Proceedings of the ACM Conference on Computer Supported Cooperative Work, (Banff, Alberta, Canada, 2006), ACM Press.

[23] Nardi, B.A., Schiano, D.J. and Gumbrecht, M., Blogging as social activity, or, would you let 900 million people read your diary? in Proceedings of the ACM Conference on Computer Supported Cooperative Work, (Chicago, Illinois, 2004), ACM Press, 222-231.

[24] Nardi, B.A., Whittaker, S. and Bradner, E. Interaction and outeraction: instant messaging in action. in Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work, ACM Press, 2000, 79--88.

[25] Newman, D., Chemudugunta, C., Smyth, P. and Steyvers, M. Analyzing entities and topics in news articles using statistical topic models. Springer Lecture Notes in Computer Science (LNCS) series -- IEEE International Conference on Intelligence and Security Informatics. 2006.

[26] Newman, D., Chemudugunta, C., Smyth, P. and Steyvers, M., Statistical entity-topic models. in Proceedings of the 12th ACM SIGKDD Conference, (2006).

[27] O'Madadhain, J., Hutchins, J. and Smyth, P. Prediction and ranking algorithms for event-based network data. ACM SIGKDD Explorations: Special Issue on Link Mining, 7 (2). 2006. 23-30.

[28] Patton, M. Qualitative Evaluation and Research Methods, 2nd Ed. Newbury Park, CA: Sage. 1990.

[29] Rosen-Zvi, M., Griffiths, T., Steyvers, M. and Smyth, P., The Author-Topic Model for Authors and Documents. in 20th Conference on Uncertainty in Artificial Intelligence, (Banff, Canada, 2004).

[30] Steyvers, M. and Griffiths, T. Probabilistic topic models. in Landauer, T., McNamara, D., Dennis, S. and Kintsch, W. eds. Latent Semantic Analysis: A Road to Meaning, Laurence Erlbaum, 2006 (in press).

[31] Steyvers, M., Smyth, P., Rosen-Zvi, M. and Griffiths, T., Probabilistic Author-Topic Models for Information Discovery. in The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (2004).


[32] Strathern, M. Commons and Borderlands: Working Papers on Interdisciplinarity, Accountability and the Flow of Knowledge. Sean Kingston Publishing, 2003.

[33] Teh, Y.W., Jordan, M.I., Beal, M.J. and Blei, D.M. Hierarchical Dirichlet Processes. Journal of the American Statistical Association. 2006.

[34] Tomlinson, B., Designing Affective Interaction Paradigms for Animated Characters. in Human Computer Interaction Consortium, (Fraser, CO, 2005).

[35] Tomlinson, B. From Linear to Interactive Animation. ACM Computers in Entertainment, 3 (1). 2005. 1-20.

[36] Tomlinson, B., A Heterogeneous Animated Platform for Educational Participatory Simulations. in 10th Computer Supported Collaborative Learning Conference (CSCL05), (Taipei, Taiwan, 2005).

[37] Tomlinson, B., Downie, M., Berlin, M., Gray, J., Lyons, D., Cochran, J. and Blumberg, B., Leashing the AlphaWolves: Mixing User Direction with Autonomous Emotion in a Pack of Semi-Autonomous Virtual Characters. in Proceedings of the 2002 ACM SIGGRAPH Symposium on Computer Animation, (San Antonio, TX, 2002).

[38] Tomlinson, B., Yau, M.L., Baumer, E., Goetz, S., Carpenter, L., Pratt, R., Young, K. and May-Tobin, C., The EcoRaft Project: A Multi-Device Interactive Graphical Exhibit for Learning About Restoration Ecology. in ACM Conference On Human Factors In Computing Systems (CHI 2006), Work in Progress, (Montreal, Canada, 2006).

[39] Tomlinson, W. Interactivity and Emotion through Cinematography, Masters Thesis, Media Arts & Sciences, MIT, Cambridge, MA, 1999.

[40] Viégas, F.B., Wattenberg, M. and Dave, K., Studying cooperation and conflict between authors with history flow visualizations. in Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI-04), (Vienna, Austria, 2004), ACM Press, 575-582.

[41] Wang, X. and McCallum, A., Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends. in ACM Conference on Knowledge Discovery and Data Mining (SIGKDD), (2006).

[42] Wang, X., Mohanty, N. and McCallum, A., Group and Topic Discovery from Relations and Text. in KDD Workshop on Link Discovery: Issues, Approaches and Applications (LinkKDD), (Chicago, Illinois, 2005).

[43] Warschauer, M. Laptops and Literacy. Teachers College Press, New York, 2006.

[44] Wattenberg, M., Visual exploration of multivariate graphs. in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, (Montreal, Quebec, Canada, 2006), ACM Press, 811-819.

[45] Wei, X. and Croft, W.B., LDA-based document models for ad-hoc retrieval. in Proceedings of the 29th SIGIR Conference, (2006), 178-185.

[46] White, S. and Smyth, P., Algorithms for Discovering Relative Importance In Graphs. in Proceedings of Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (Washington D.C., 2003).

[47] White, S. and Smyth, P., A spectral approach to finding communities in graphs. in SIAM 2005 Conference on Data Mining, (2005).

[48] Whittaker, S., Jones, Q., Nardi, B., Creech, M., Terveen, L., Isaacs, E. and Hainsworth, J. ContactMap: Organizing Communication in a Social Desktop. ACM Transactions on Computer Human Interaction, 11. 2004. 445 – 471.


[49] Zimmerman, A. and Nardi, B., Whither or Whether HCI: Requirements Analysis for Multi-sited, Multi-user Cyberinfrastructures. in Proceedings of the ACM Conference on Human Factors in Computer Systems, (2006), ACM Press.

[1] Bernard, R. Social Research Methods: Qualitative and Quantitative Approaches. Sage Publications, 2000.

[2] Blei, D., Ng, A. and Jordan, M.J. Latent Dirichlet Allocation. Journal of Machine Learning Research, 1. 2003. 993-1022.

[3] boyd, d. and Heer, J., Profiles as Conversation: Networked Identity Performance on Friendster. in Proceedings of the Hawai'i International Conference on System Sciences (HICSS-39), (Kauai, HI, 2006), IEEE Computer Society.

[4] Buntine, W., Löfström, J., Perkiö, J., Perttu, S., Poroshin, V., Silander, T., Tirri, H., Tuominen, A. and Tuulos, V., A Scalable Topic-Based Open Source Search Engine. in Proceedings of the IEEE/WIC/ACM Conference on Web Intelligence (WI 2004), (Beijing, China, 2004), 228-234.

[5] Buntine, W., Löfström, J., Perttu, S. and Valtonen, K., Topic-Specific Scoring of Documents for Relevant Retrieval. in W4: Learning in Web Search, at the 22nd International Conference on Machine Learning, (Bonn, Germany, 2005).

[6] Buntine, W., Löfströom, J., Perttu, S. and Valtonen, K., Topic-Specific Link Analysis using Independent Components for Information Retrieval. in AAAI-05 Workshop on Link Analysis (LinkAnalysis-2005) held at AAAI-2005, the Twentieth National Conference on Artificial Intelligence (AAAI-05), (Pittsburgh, Pennsylvania, 2005).

[7] Chemudugunta, C., Smyth, P. and Steyvers, M. Modeling general and specific aspects of documents with a probabilistic topic model. Advances in Neural Information Processing Systems, 19. 2006 (to appear).

[8] Ellison, N., Steinfeld, C. and Lampe, C., Spatially Bounded Online Social Networks and Social Capital: The Role of Facebook. in Proceedings Annual Conference of the International Communication Association (ICA), (Dresden, Germany, 2006).

[9] Fisher, D., Ask Not for Whom the Visualization is Rendered; It is Rendered for Thee. in Social Visualization: Exploring Text, Audio, and Video Interactions; Workshop at CHI 2006, (Montréal, 2006).

[10] Floyd, I.R., Jones, M.C., Rathi, D. and Twidale, M.B., Web Mash-ups and Patchwork Prototyping: User-driven technological innovation with Web 2.0 and Open Source Software. in To appear in Proceedings of the 40th Annual Hawaii International Conference on System Sciences (HICSS’07), (2007).

[11] Griffiths, T. and Steyvers, M. Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101. 2004. 5228-5235.

[12] Hoff, P.D., Raftery, A.E. and Handcock, M.S. Latent Space Approaches to Social Network Analysis. Journal of the American Statistical Association, 97 (460). 2002. 1090-1098.

[13] Hsieh, T.J., Kuester, F. and Hutchinson, T.C., Visualization of Large-Scale Seismic Field Data. in Proceedings of the ISC High Performance Computing Symposium (HPC 2003), (2003), Society for Computer Simulation.

[14] Jones, M.C. and Floyd, I.R. Patchworks of Open-Source Software: High-Fidelity Low-cost Prototypes in The Handbook of Research on Open Source Software. in St. Amant, K. and Still, B. eds., Idea Group, Inc., 2007.

20

[15] Lofland, J. and Lofland, L. Analyzing Social Settings: A Guide to Qualitative Observation and Analysis. Wadsworth Publishing, 1994.

[16] McCallum, A., Corrada-Emmanuel, A. and Wang, X., Topic and Role Discovery in Social Networks. in International Joint Conference on Artificial Intelligence, (Edinburgh, Scotland, 2005).

[17] Nardi, B. Placeless Organizations: Collaboration for Transformation. Forthcoming in Journal of Mind, Culture and Activity. 2007.

[18] Nardi, B., Kuchinsky, A., Whittaker, S., Leichner, R. and Schwarz, H. Video-as-data: Technical and social aspects of a collaborative multimedia application. The Journal of Computer-supported Cooperative Work 4: 73–100. Reprinted in Video-mediated Communication, Finn, K., Sullen, A., Wilbur, S. (eds.) 1997. Hillsdale, NJ: Lawrence Erlbaum. 1996.

[19] Nardi, B., Ly, S. and Harris, J., Learning Conversations in World of Warcraft. in Proceedings Hawaii International Conference on Systems Science., (Big Island, Hawaii, 2007).

[20] Nardi, B., Schiano, D., Gumbrecht, M. and Swartz, L. Why We Blog. Communications of the Association for Computing Machinery, December. 2004. 41-46.

[21] Nardi, B., Whittaker, S., Isaacs, E., Creech, M., Johnson, J. and Hainsworth, J. ContactMap: Integrating communication and information through visualizing personal social networks. Communications of the Association for Computing Machinery, April. 2002. 89-95.

[22] Nardi, B.A. and Harris, J., Strangers and Friends: Collaborative Play in World of Warcraft. in Computer Supported Cooperative Work, (Banff, Alberta, Canada, 2006), ACM Press.

[23] Nardi, B.A., Schiano, D.J. and Gumbrecht, M., Blogging as social activity, or, would you let 900 million people read your diary? in ACM Conference on Computer Supported Cooperative Work, (Chicago, Illinois, 2004), ACM Press, 222-231.

[24] Nardi, B.A., Whittaker, S. and Bradner, E. Interaction and outeraction: instant messaging in action. in Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work, ACM Press, 2000, 79-88.

[25] Newman, D., Chemudugunta, C., Smyth, P. and Steyvers, M. Analyzing entities and topics in news articles using statistical topic models. Springer Lecture Notes in Computer Science (LNCS) series -- IEEE International Conference on Intelligence and Security Informatics. 2006.

[26] Newman, D., Chemudugunta, C., Smyth, P. and Steyvers, M., Statistical entity-topic models. in Proceedings of the 12th ACM SIGKDD Conference, (2006).

[27] O'Madadhain, J., Hutchins, J. and Smyth, P. Prediction and ranking algorithms for event-based network data. ACM SIGKDD Explorations: Special Issue on Link Mining, 7 (2). 2006. 23-30.

[28] Patton, M. Qualitative Evaluation and Research Methods, 2nd Ed. 1990.

[29] Rosen-Zvi, M., Griffiths, T., Steyvers, M. and Smyth, P., The Author-Topic Model for Authors and Documents. in 20th Conference on Uncertainty in Artificial Intelligence, (Banff, Canada, 2004).

[30] Steyvers, M. and Griffiths, T. Probabilistic topic models. in Landauer, T., McNamara, D., Dennis, S. and Kintsch, W. eds. Latent Semantic Analysis: A Road to Meaning, Laurence Erlbaum, 2006 (in press).


[31] Steyvers, M., Smyth, P., Rosen-Zvi, M. and Griffiths, T., Probabilistic Author-Topic Models for Information Discovery. in The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (2004).

[32] Strathern, M. Commons and Borderlands: Working Papers on Interdisciplinarity, Accountability and the Flow of Knowledge. Sean Kingston Publishing, 2003.

[33] Teh, Y.W., Jordan, M.I., Beal, M.J. and Blei, D.M. Hierarchical Dirichlet Processes. Journal of the American Statistical Association. 2006.

[34] Tomlinson, B., Designing Affective Interaction Paradigms for Animated Characters. in Human Computer Interaction Consortium, (Fraser, CO, 2005).

[35] Tomlinson, B. From Linear to Interactive Animation. ACM Computers in Entertainment, 3 (1). 2005. 1-20.

[36] Tomlinson, B., A Heterogeneous Animated Platform for Educational Participatory Simulations. in 10th Computer Supported Collaborative Learning Conference (CSCL05), (Taipei, Taiwan, 2005).

[37] Tomlinson, B., Downie, M., Berlin, M., Gray, J., Lyons, D., Cochran, J. and Blumberg, B., Leashing the AlphaWolves: Mixing User Direction with Autonomous Emotion in a Pack of Semi-Autonomous Virtual Characters. in Proceedings of the 2002 ACM SIGGRAPH Symposium on Computer Animation, (San Antonio, TX, 2002).

[38] Tomlinson, B., Yau, M.L., Baumer, E., Goetz, S., Carpenter, L., Pratt, R., Young, K. and May-Tobin, C., The EcoRaft Project: A Multi-Device Interactive Graphical Exhibit for Learning About Restoration Ecology. in ACM Conference On Human Factors In Computing Systems (CHI 2006), Work in Progress, (Montreal, Canada, 2006).

[39] Tomlinson, W. Interactivity and Emotion through Cinematography. Masters Thesis, Media Arts & Sciences, MIT, Cambridge, MA, 1999.

[40] Viégas, F.B., Wattenberg, M. and Dave, K., Studying cooperation and conflict between authors with history flow visualizations. in ACM Conference on Human Factors in Computing Systems (CHI-04), (Vienna, Austria, 2004), ACM Press, 575-582.

[41] Wang, X. and McCallum, A., Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends. in ACM Conference on Knowledge Discovery and Data Mining (SIGKDD), (2006).

[42] Wang, X., Mohanty, N. and McCallum, A., Group and Topic Discovery from Relations and Text. in KDD Workshop on Link Discovery: Issues, Approaches and Applications (LinkKDD), (Chicago, Illinois, 2005).

[43] Warschauer, M. Laptops and Literacy. Teachers College Press, New York, in press.

[44] Wattenberg, M., Visual exploration of multivariate graphs. in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, (Montreal, Quebec, Canada, 2006), ACM Press, 811-819.

[45] Wei, X. and Croft, W.B., LDA-based document models for ad-hoc retrieval. in Proceedings of the 29th SIGIR Conference, (2006), 178-185.

[46] White, S. and Smyth, P., Algorithms for Discovering Relative Importance In Graphs. in Proceedings of Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (Washington D.C., 2003).

[47] White, S. and Smyth, P., A spectral approach to finding communities in graphs. in SIAM 2005 Conference on Data Mining, (2005).


[48] Whittaker, S., Jones, Q., Nardi, B., Creech, M., Terveen, L., Isaacs, E. and Hainsworth, J. ContactMap: Organizing Communication in a Social Desktop. ACM Transactions on Computer Human Interaction, 11. 2004. 445 – 471.

[49] Zimmerman, A. and Nardi, B., Whither or Whether HCI: Requirements Analysis for Multi-sited, Multi-user Cyberinfrastructures. in Proceedings of the ACM Conference on Human Factors in Computer Systems, (2006), ACM Press.
