You are a document too: Web mining and IR for next-generation information literacy. Bettina Berendt, K.U. Leuven, Belgium, www.berendt.de


1

You are a document too:

Web mining and IR

for next-generation

information literacy

Bettina Berendt, K.U. Leuven, Belgium

www.berendt.de

2

From

IR / WM tools for solving a (information-getting) problem

to

IR / WM as cognitive tools for thinking about what the problem is

“Another grand challenge“

3

For whom is that interesting?

Researchers

Instructors

Practitioners

Citizens

4

About me: My public (and mine-able) profile

: Information Systems: Computer Science / Cognitive Science: Artificial Intelligence: Business Science: Economics

: Computer Science

5

Agenda

Outlook

(Some) Questions

Goal

Concepts

(Some) answers

Goal: IR / WM for teaching and learning Information Literacy

Concepts: From information to communication & privacy

(Some) answers: IR / WM tools elucidate communication patterns

6

(Some) Questions Q

7

Should an unknown Web user get this news ...

... or this?

[Genderlens. See Liu & Mihalcea, Proc. ICWSM 2007]

8

... and why? (Because they are male or female)

clicked on

9

Is this lady interested in singles bars?

10

What do spamming and reading someone‘s diary have in common?

11

Why is atomic energy safe?

12

How do embattled politicians minimize their responsibility?

I acknowledge that mistakes were made here (just not by me)

13

Goal

14

The goal: to use Information Retrieval / Web Mining for teaching and learning Information Literacy

Information literacy:

a set of competencies that an informed citizen of an information society ought to possess to participate intelligently and actively in that society

“the ability to recognize when information is needed and to locate, evaluate and use effectively the needed information”

15

Information Literacy

“the ability to recognize when information is needed and to locate, evaluate and use effectively the needed information”

1. Task Definition
   1.1 Define the information problem
   1.2 Identify information needed

2. Information Seeking Strategies
   2.1 Determine all possible sources
   2.2 Select the best sources

3. Location and Access
   3.1 Locate sources (intellectually and physically)
   3.2 Find information within sources

4. Use of Information
   4.1 Engage (e.g., read, hear, view, touch)
   4.2 Extract relevant information

5. Synthesis
   5.1 Organize from multiple sources
   5.2 Present the information

6. Evaluation
   6.1 Judge the product (effectiveness)
   6.2 Judge the process (efficiency)

16

a set of competencies that an informed citizen of an information society ought to possess to participate intelligently and actively in that society

Using Information Retrieval / Web Mining for teaching and learning Information Literacy

1. Task Definition
   1.1 Define the information problem
   1.2 Identify information needed

2. Information Seeking Strategies
   2.1 Determine all possible sources
   2.2 Select the best sources

3. Location and Access
   3.1 Locate sources (intellectually and physically)
   3.2 Find information within sources

4. Use of Information
   4.1 Engage (e.g., read, hear, view, touch)
   4.2 Extract relevant information

5. Synthesis
   5.1 Organize from multiple sources
   5.2 Present the information

6. Evaluation
   6.1 Judge the product (effectiveness)
   6.2 Judge the process (efficiency)

Information: get, produce, communicate IR / WM

IR / WM

17

Concepts

18

Information retrieval and (Web) data mining

19

Some history / motivations

Information Retrieval

1940s : US military confronts problems of indexing and retrieval of wartime scientific research documents captured from Germans

1950s : Growing concern in the US for a “science gap” with the USSR motivates mechanized literature searching systems

Information Literacy

1983 report A Nation at Risk: The Imperative for Educational Reform : a “rising tide of mediocrity” is eroding the very foundations of the American educational system

Data Mining

1990s : ‘If a business knew more about its customers, they wouldn’t run away to competitors’

All motivated by combinations of scarcity and abundance

20

Information retrieval as a communication process

Authors

Documents

Users with tasks and

goals

Information needs

Queries

Document representations

Found documents

Evaluation of the documents

Kuropka, Advances in Inf. Systems and Mgt. Science, 2004 (simplified)
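
The pipeline on this slide (documents, document representations, queries, found documents) can be made concrete with a toy example. The following is a minimal sketch and not the model from Kuropka (2004): it assumes a bag-of-words representation, TF-IDF weights and cosine ranking, and the documents, queries and function names are all invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def build_index(docs):
    """Document representations: one sparse TF-IDF vector (dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / df[term]) + 1.0 for term in df}
    vectors = [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, vectors, idf):
    """Found documents: indices of documents with non-zero similarity, best first."""
    q = {term: count * idf.get(term, 0.0)
         for term, count in Counter(tokenize(query)).items()}
    scored = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in ranked if score > 0]


# Invented mini-collection.
DOCS = [
    "web mining finds patterns in web usage data",
    "information literacy means locating and evaluating information",
    "privacy risks of publishing search query logs",
]
VECTORS, IDF = build_index(DOCS)
```

Calling `search("information literacy", VECTORS, IDF)` returns the index of the second document only; the user's evaluation of the found documents, the last step in the diagram, remains outside the code.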

21

... this assumes “mutually wanted“ communication

wants to disclose info

wants to get info

information owner (subject)

information seeker

Information

Intention

Intention

22

4 cases

wants to disclose info

wants to get info

does not want to disclose

wants to get info

does not want to disclose

does not want to get info

does not want to get info

wants to disclose info

23

(Some) Answers A

24

Case 1: Example 1

wants to disclose info

wants to get info

25

Case 1: Example 2

“... I want to make three brief points about the resignations of the eight United States' attorneys, a topic that I know is foremost in your minds. First, those eight attorneys deserved better. ... Each is a fine lawyer and dedicated professional. I regret how they were treated, and I apologize to them and to their families for allowing this matter to become an unfortunate and undignified public spectacle. I accept full responsibility for this. Second, I want to address allegations that I have failed to tell the truth about my involvement in these resignations. These attacks on my integrity have been very painful to me. ...“

wants to disclose info

wants to get info

26

... but ...

does not want to disclose

wants to get info

... but ... [Method: Learning a trie from the string sequences]

http://services.alphaworks.ibm.com/manyeyes/view/SgoRsIsOtha6bhEf6arzI2-
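
To illustrate what "learning a trie from the string sequences" can mean, here is a minimal sketch that builds a word-level trie with counts and lists the continuations of a prefix. It is not the Many Eyes implementation; the function names are invented, and the example sentences are shortened fragments of the statement quoted on slide 25.

```python
def build_trie(sentences):
    """Insert each sentence word by word; every node counts the sentences passing through it."""
    root = {"count": 0, "children": {}}
    for sentence in sentences:
        node = root
        node["count"] += 1
        for word in sentence.lower().split():
            node = node["children"].setdefault(word, {"count": 0, "children": {}})
            node["count"] += 1
    return root


def continuations(trie, prefix):
    """Return {next_word: count} for the words that follow the given prefix."""
    node = trie
    for word in prefix.lower().split():
        node = node["children"].get(word)
        if node is None:
            return {}
    return {word: child["count"] for word, child in node["children"].items()}


# Shortened fragments of the quoted statement (slide 25).
SENTENCES = [
    "I want to make three brief points",
    "I want to address allegations",
    "I regret how they were treated",
    "I accept full responsibility for this",
]
TRIE = build_trie(SENTENCES)
```

Here `continuations(TRIE, "I")` groups the sentences by the verb that follows "I", which is the kind of branching view that makes recurring phrasings (and what is left unsaid) visible.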

27

Going further in analysing implicit messages: What differentiates news sources?

[Fortuna, Galleguillos, & Cristianini, in press]

[Method: Nearest neighbour / best reciprocal hit for document matching; Kernel Canonical Correlation Analysis and vector operations for finding topics and characteristic keywords]

wants to disclose info

wants to get info

? ... depends

28

Case 2: Example 1

does not want to disclose

wants to get info

29

Her queries (sample from an anonymized search-query log)

http://www.nytimes.com/2006/08/09/technology/09aol.html

30

clicked on

Case 2: Example 2

does not want to disclose

wants to get info

31

Input data and prediction problem

Informal observations of correlations between browsing behaviour and demographic attributes (gender, age)

Problem:

How to predict a user’s gender from the Web pages s/he clicked on

Basic idea:

users → (user-to-page matrix) → pages → (document-to-term matrix) → terms

[Jian Hu, Hua-Jun Zeng, Hua Li, Cheng Niu, Zheng Chen (2007). Demographic Prediction Based on User’s Browsing Behavior. In Proc. WWW 2007]

32

[Method: Learn a classifier]

1. Define the gender tendency of a Web page
   Proportion of requests for the page by male/female (c) users, relative to all requests (R: user-to-page matrix)

2. Learn the gender tendency of Web pages
   Pages: those with variance on gender ≥ threshold
   Linear form of support-vector machine regression
   Features: content words with highest information gain
   Target attribute: gender tendency

3. Predict the user’s gender
   Naive Bayes
   Features: visited pages
   Target attribute: gender (and some more optimization)
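
Steps 1 and 3 above can be sketched on a toy click log. This is only an illustration under stated assumptions, not the pipeline of Hu et al.: step 2 (the SVM regression that generalizes tendencies to unseen pages) is left out, the Naive Bayes variant (multinomial with Laplace smoothing) is one possible choice, and the click log, user IDs and page IDs are invented.

```python
import math
from collections import defaultdict

# Hypothetical click log: (user, known gender, page). All values are invented.
CLICKS = [
    ("u1", "f", "p_news"), ("u1", "f", "p_garden"),
    ("u2", "f", "p_garden"), ("u2", "f", "p_cars"),
    ("u3", "m", "p_cars"), ("u3", "m", "p_news"),
    ("u4", "m", "p_cars"),
]
GENDERS = ("f", "m")


def gender_tendency(clicks):
    """Step 1: per page, the share of requests made by users of each gender."""
    counts = defaultdict(lambda: {g: 0 for g in GENDERS})
    for _user, gender, page in clicks:
        counts[page][gender] += 1
    return {
        page: {g: c[g] / sum(c.values()) for g in GENDERS}
        for page, c in counts.items()
    }


def predict_gender(visited_pages, clicks, alpha=1.0):
    """Step 3: Naive Bayes with visited pages as features (Laplace smoothing alpha)."""
    pages = {page for _user, _gender, page in clicks}
    users_by_gender = defaultdict(set)
    page_counts = {g: defaultdict(int) for g in GENDERS}
    for user, gender, page in clicks:
        users_by_gender[gender].add(user)
        page_counts[gender][page] += 1
    n_users = sum(len(users) for users in users_by_gender.values())
    scores = {}
    for g in GENDERS:
        total = sum(page_counts[g].values())
        # Log prior: share of training users with this gender.
        score = math.log(len(users_by_gender[g]) / n_users)
        # Log likelihood of each visited page given the gender.
        for page in visited_pages:
            score += math.log((page_counts[g][page] + alpha) / (total + alpha * len(pages)))
        scores[g] = score
    return max(scores, key=scores.get)


TENDENCY = gender_tendency(CLICKS)
```

In this toy log, `p_garden` is requested only by female users, so a visitor who clicked only on `p_garden` is classified as "f", even though that visitor never stated a gender.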

33

Results

A kind of analogue of the BOW used for predicting gender from produced content (words over all visited pages)

35

Case 2: Example 3

does not want to disclose

wants to get info

36

How was the information sent?

37

Tracing anonymous edits

38

[Method: Attribute matching]

39

Results (an example)

40

Case 3: Example 1

does not want to get info

wants to disclose info

41

[Typical method: learn a classification model (usually with at least some features being words)]

[Ntoulas et al., Proc. WWW 2006]

42

Case 3: Example 2

does not want to get info

wants to disclose info

Solution approach?: Learn classification models

A blog reader:“I don‘t mind personal blogs, but if they get to really really personal stuff, like if they‘re going to start talking about suicide, it‘s not something that you wanna share ... I avoid reading content that I consider too personal.“

[Baumer, Sueyoshi, & Tomlinson. In Proc. ICWSM 2008]

43

Case 4

Do these people beat their kids?

does not want to disclose

does not want to get info

Move this communication out of the long tail ?!

<a picture of a happy family>

44

The meta level: Information-related activities become data / documents

45

[Method: Information visualization / history flow – ex. visualizing conflict, here: “edit wars”]

[Viégas, Wattenberg, & Dave, Proc. CHI 2004]

46

The bone of contention ...

47

Web mining for articulation and reflection

Repetition Organisation Elaboration

[Berendt, in Neues Handbuch Hochschullehre, 2006; BRMIC ‘01]

Proxy server

Logfile

ASP

[Methods: Usage tracking, semantic graph coarsening]

48

Outlook

49

Challenge 1: Understanding and keeping up with the communications arms race

“membership - or 'log in' - is the new anonymous.“

[Digital Methods Initiative (2007). Comparison between Anonymous Palestinian and Israeli Wikipedia Edits. wiki2.issuecrawler.net/twiki/bin/view/Dmi/ComparisonBetweenAnonymousPalestinianAndIsraeliWikipediaEdits]

50

Challenge 2: Network effects (1): Ownership?

•A. Colleague

51

Network effects (2): Ownership?

tags

friends

friends

A

B

C

Can see feature ?!

Can see bug ?!

52

Network effects (3): Inferences

Friendship is generally symmetric

If A wants to hide her friendships, but B shows that “A is my friend”, then B has disclosed private information of A.

(More elaborate problems follow from this ...)

For a discussion, see

Preibusch, S., Hoser, B., Gürses, S., & Berendt, B. (2007). Ubiquitous social networks - opportunities and challenges for privacy-aware user modelling. In Proceedings of the Workshop on Data Mining for User Modelling at UM 2007, Corfu, Greece, June 2007.
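
The inference on this slide can be written down in a few lines. The sketch below assumes that friendship is symmetric and that each user either publishes a friend list or hides it; the data and the function name are invented for illustration.

```python
def inferred_friends(public_lists, hidden_users):
    """For each user who hides her friend list, collect the friends
    that other users' public lists reveal (friendship assumed symmetric)."""
    leaked = {user: set() for user in hidden_users}
    for owner, friends in public_lists.items():
        for friend in friends:
            if friend in leaked:
                leaked[friend].add(owner)
    return leaked


# B and C publish their friend lists; A hides hers.
PUBLIC_LISTS = {"B": {"A", "C"}, "C": {"B"}}
HIDDEN_USERS = {"A"}
```

`inferred_friends(PUBLIC_LISTS, HIDDEN_USERS)` returns `{"A": {"B"}}`: A's hidden friendship with B is recovered from B's list alone, without any action by A.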

53

Network effects (4): Requirements interact

[Preibusch, S., Hoser, B., Gürses, S., & Berendt, B. (2007). Ubiquitous social networks - opportunities and challenges for privacy-aware user modelling. In Proceedings of the Workshop on Data Mining for User Modelling at UM 2007, Corfu, Greece, June 2007.]

54

Challenge 3: Countermeasures against re-identification and their effect on democracy (and other things)

Is this the same person?

55

Keeping identities apart – the basic setting

Paper published by the MovieLens team (collaborative-filtering movie ratings) who were considering publishing a ratings dataset, see http://movielens.umn.edu/

Public dataset: users mention films in forum posts

Private dataset (may be released e.g. for research purposes): users‘ ratings

Film IDs can easily be extracted from the posts

Observation: Every user will talk about items from a sparse relation space (those – generally few – films s/he has seen)

[Frankowski, D., Cosley, D., Sen, S., Terveen, L., & Riedl, J. (2006). You are what you say: Privacy risks of public mentions. In Proc. SIGIR‘06]

56

[Method: Compute similarities between people (films as features)]

Given a target user t from the forum users, find similar users (in terms of which items they related to) in the ratings dataset

Rank these users u by their likelihood of being t

Evaluate:

If t is in the top k of this list, then t is k-identified

Count percentage of users who are k-identified

E.g. measure likelihood by TF.IDF (m: item)
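
The procedure above can be tried out on toy data. The sketch below is one plausible reading of "measure likelihood by TF.IDF" and not the exact scoring of Frankowski et al.: films are binary features weighted by inverse user frequency, and candidates are ranked by cosine similarity. All user and film IDs are invented, and forum users share IDs with rating users only so that the result can be evaluated.

```python
import math

# Hypothetical toy data; user and film IDs are invented.
RATINGS = {   # private dataset: user -> films rated
    "r1": {"m1", "m2", "m3"},
    "r2": {"m1", "m4"},
    "r3": {"m1", "m5", "m6"},
}
MENTIONS = {  # public dataset: forum user -> films mentioned
    "r1": {"m2", "m3"},
    "r2": {"m4"},
    "r3": {"m1"},
}


def idf(ratings):
    """Inverse user frequency per film: rarely rated films carry more weight."""
    n_users = len(ratings)
    films = set().union(*ratings.values())
    return {
        m: math.log(n_users / sum(m in items for items in ratings.values()))
        for m in films
    }


def norm(items, weights):
    """Euclidean length of a binary item set under the given film weights."""
    return math.sqrt(sum(weights.get(m, 0.0) ** 2 for m in items))


def score(mentioned, rated, weights):
    """IDF-weighted cosine similarity between mentioned and rated films."""
    dot = sum(weights.get(m, 0.0) ** 2 for m in mentioned & rated)
    denom = norm(mentioned, weights) * norm(rated, weights)
    return dot / denom if denom else 0.0


def rank_candidates(target, mentions, ratings):
    """Rank rating-dataset users by their similarity to the target forum user."""
    weights = idf(ratings)
    return sorted(
        ratings,
        key=lambda u: (-score(mentions[target], ratings[u], weights), u),
    )


def k_identified_share(k, mentions, ratings):
    """Fraction of forum users who appear in the top k of their own ranking."""
    hits = sum(t in rank_candidates(t, mentions, ratings)[:k] for t in mentions)
    return hits / len(mentions)
```

In this toy setting two of the three forum users are 1-identified. The third mentions only `m1`, a film everyone has rated, which carries zero weight and therefore does not single anyone out.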

57

Results

58

What do you think helps?

60

Summary and conclusions

Information-related activities involve disclosing and withholding.

Each information-related activity has (at least) one source, one manifestation as data/document, one user and one stakeholder; network effects abound.

The dichotomy of information-seeking users and information-containing data/documents has vanished.

This calls for a new operationalisation of information literacy: getting, producing, communicating, … information.

IR/WM tools can support this type of information literacy.

For whom is that interesting? Researchers, instructors, practitioners, citizens.

Who can do something about this? Researchers, instructors, practitioners, anyone who funds such work ...

61

Thank you!

62

Picture and some more literature credits

pp. 1 and 61: http://farm2.static.flickr.com/1062/932116791_490db77985_m.jpg

pp.8 and 30: http://www.theage.com.au/news/World/Charles-coronation-to-move-with-the-times/2004/12/26/1103996438678.html

pp. 9 and 28: http://www.nytimes.com/2006/08/09/technology/09aol.html

pp. 10 and 40: http://seiplecandis.googlepages.com/exerterton.jpg

pp. 10 and 42: http://www.crowncombo.com/articles/2005/100205_monster/monster04.jpg

pp. 11 and 35: http://www.radarmagazine.com/features/images/2006/12/atomic-energy-lab-01.jpg

pp. 12 and 25: http://graphics8.nytimes.com/images/2007/07/24/us/24gonzales-2-600.jpg, with inspiration by http://www.lifeclever.com/wp-content/uploads/2007/03/gonzales_passive.jpg

p. 13: http://www.ffc-turbine.de/graphs/news/070303_nadineangerer.jpg

p. 14: based on http://en.wikipedia.org/wiki/Information_literacy , „yellow definition“ quoted from there and based on Shapiro, J.J. & Hughes, S.K. (1996). Information Literacy as a Liberal Art. Enlightenment proposals for a new curriculum. Educom Review, 31 (2), http://www.educause.edu/pub/er/review/reviewarticles/31231.html; „light blue definition“ based on Presidential Committee on Information Literacy. 1989, p. 1 (see Wikipedia page)

p. 15: http://www.big6.com/what-is-the-big6%E2%84%A2/

p. 19 uses input from en.wikipedia.org/wiki/Information_literacy and en.wikipedia.org/wiki/Information_retrieval

p. 24: http://media.mcclatchydc.com/smedia/2007/10/08/16/854-8web-clinton-obama-minor.standalone.prod_affiliate.91.jpg

p. 25 (text): http://services.alphaworks.ibm.com/manyeyes/static-resources/data/89ade5ae14e1dd2c0114ff78100c0b61.txt

p. 36: from the Wikipedia page (some editing done for illustration)

p. 37: http://wikiscanner.virgil.gr/

p. 39: http://de.wikipedia.org/wiki/Wikipedia:Wikiscanner

p. 46: http://www.surrealcoconut.com/surrealism_gallery/coulage/chocolate1.html

p. 51: http://eu.inmagine.com/img/imagewerksrf/iwf06015/iwf019005.jpg, http://img.timeinc.net/time/time100/2007/images/queen_elizabeth.jpg, http://www.barmala.de/wp-content/uploads/2005/02/spam.jpg