Investigating Language Skills and Field of Knowledge on Multilingual Information Access in Digital Libraries

Paul Clough* and Irene Eleta ([email protected])

Department of Information Studies, University of Sheffield,

Regent Court, 211 Portobello Street, Sheffield S1 4DP UK.

Tel: +44(0) 114 2222664 Fax: +44(0) 114 2780300

Abstract

Digital libraries remove physical barriers to accessing information, but the language

barrier remains owing to multilingual collections and the linguistic diversity of users.

This study aims to understand the effect of users’ language skills and field of

knowledge on their language preferences when searching for information online, and to

provide new insights into access to multilingual digital libraries. Both quantitative and

qualitative data were gathered using a questionnaire and results show that the language

skills and the field of knowledge have an impact on the language choice for searching

online. These factors also determine the interest in cross-language information retrieval:

language-related fields constitute the best potential group of users, followed by the Arts

and Humanities and the Social Sciences.

Keywords: digital libraries, multilingual information access, cross-language information

retrieval, language skills.

1. Introduction

The curators of digital libraries and their users are being confronted with large quantities

of digital material, increasingly diverse in nature: multi-media, multi-cultural and

multi-language (Borgman, 1997; Crane, 2006). A fundamental goal of digital libraries is to

provide universal access to the information being managed (Association of Research

Libraries, 1995), but this can only be realised if digital content is made more accessible

and usable over time within online environments (Chen, 2007). For example, the

European i2010 Digital Libraries initiative aims to make cultural, audiovisual and

scientific heritage accessible to all: “the initiative combines cultural diversity,

multilingualism and technological progress” (European Commission Information Society

and Media, 2006). In Europe two major initiatives are The European Library1 (TEL) and

more recently Europeana2. The European Library (TEL) offers access to digital resources

(books, posters, maps, sound recordings and videos) and bibliographic content from 48

national libraries in 35 languages. Europeana, the European digital library, museum and

archive, aims to provide users with access to around 2 million digital objects, including

photos, paintings, sounds, maps, manuscripts, books, newspapers and archival papers.

Both digital libraries offer access to multilingual content and Europeana plans to provide

a multilingual interface and offer multilingual access to users. More widely, UNESCO

has officially launched the World Digital Library3, an Internet-based library that aims to

display and explain the wealth of all human cultures. Of course, universal access is as

applicable to smaller and more specialised digital libraries as it is to the larger national

and international ones. However, although digital libraries can remove physical and

spatial barriers in accessing information, the language barrier still remains due to

multilingual collections and the linguistic diversity of users. Previous research has shown

that language has an impact on the structure of the web (Kralisch & Mandl, 2006) and

that the power relations of languages on the Internet have cultural implications (Flammia

& Saunders, 2007), causing a digital divide. Language represents a clear barrier to

accessing information online, which is dominated by the English language and

Anglo-American values. This is the context in which digital libraries must operate, and

they are therefore also subject to this digital divide. A key factor in the future success of digital

libraries is the provision of appropriate multilingual services to allow users to find,

explore and work with content in multiple languages (European Commission Information

Society and Media, 2006).

1 http://www.theeuropeanlibrary.org [site accessed: 23/06/2009]

2 http://dev.europeana.eu/outcomes.php [site accessed: 23/06/2009]

3 http://www.wdl.org/en/ [site accessed: 23/06/2009]

In this paper we present a study investigating the potential impact of language

and field/domain of knowledge on searching for information online in general, and on using

digital libraries in particular. The context of our study has been constrained to users predominantly

within a university setting because: (1) this group makes regular use of digital libraries,

(2) it is increasingly common to find members of this population exhibiting a wide range

of language skills and abilities, (3) members of this group have specialised areas of

knowledge, and (4) members of this community were easily accessible to us. The

methodology follows an inductive approach and both quantitative and qualitative data

have been gathered using a questionnaire that resulted in 514 responses. Although the

majority of participants belong to The University of Sheffield (UK) and the Universidad

Autonoma de Madrid (Spain), an effort has been made to include respondents with other

native languages in addition to English and Spanish. Also, a range of fields is studied,

with particular attention to the language-related fields. The specific objectives of this

study are to:

• Explore the effect of users’ language skills and professional/study field on their

language choice when searching online;

• Investigate whether or not users would like to use cross-language information

retrieval (CLIR), its utility and how this relates to language skills and field;

• Investigate the preference of users for certain tools and functionalities that support

searching in digital libraries, as well as the most criticised aspects of digital

libraries.

The paper is organised as follows: firstly, Section 2 describes related literature;

Section 3 explains the methodology used in this study; Section 4 presents the results,

which are further discussed in Section 5; Section 6 concludes the paper and discusses

areas for future work.

2. Literature Review

2.1. People, Languages and the Internet

With users’ activities involving increasing levels of online interaction, it is important to

study language with regard to information access. Miniwatts Marketing Group (2008)

shows in its Internet World Statistics that already 17.3% of Internet users are from China,

above the 15% from the USA and followed by Japan (6.4%), India (4.1%), Germany

(3.6%) and Brazil (3.4%). Also, their numbers show that the percentage of English

speakers online has decreased to 30.5% in June 2008, while Chinese speakers represent

20.4% of Internet world users and Spanish speakers, 6.8%. However, these statistics do

not take into account the number of speakers that know English as a second language,

which Graddol (2006) estimated to be higher than the number of native English speakers.

Also, many people are bilingual or multilingual, but Miniwatts Marketing Group only

assigns one language per person.

Although there is a clear trend that the linguistic diversity of content on the Web

is increasing, many languages are still underrepresented and the character encoding

systems used privilege some languages over others (Flammia and Saunders, 2007).

Mikami and Suzuki (2004) raised the issue of the lack of statistics about the number of

web pages by language, script and character set. Halavais (2000) demonstrated that

national borders have an imprint in the topology of the Web: websites are more likely to

link to another site of the same country (and in this study to websites in the USA).

Kralisch and Mandl (2006) also studied the hyperlink structure of the Web and confirmed

the tendency to link to websites not only in the same country, but written in the same

language.

Previous research has highlighted a digital divide online due to language barriers

(Paolillo, 2005; Kralisch, 2005; Kralisch & Mandl, 2006; Flammia & Saunders, 2007;

Berendt & Kralisch, 2009). An overarching theme throughout the past work has been a

language bias towards English and this has been reflected in the analysis carried out. For

example, Kralisch (2005) reduced the choice of language to two options: the user’s

mother tongue and English. Kralisch (ibid.) observed that non-native speakers performed

searches in their domain of knowledge as well as native speakers did, and concluded that

both the language and domain of knowledge affect the amount of cognitive effort

involved in the use of a certain search option. Laufer (1998) observed that learners’

passive language skills (listening and reading) develop earlier than the active ones

(writing and speaking). This observation has consequences when using a foreign

language for searching. For instance, a user might be able to understand documents in a

foreign language but unable to write queries in that language to find them.

Previous research has also investigated the language skills of users, providing

suitable frameworks in which to study language ability. Unknown and native languages

are the two endpoints of a spectrum of language knowledge; foreign language ability can

vary greatly within these two extremes. While Ringbom (2007) points out the distinction

between passive and active ability, Laufer & Goldstein (2004) suggest that this

dichotomy is too simplistic, and propose a continuum of knowledge strength that also

includes recall and recognition. According to Gibson & Hufeisen (2003), prior

knowledge of a language has been shown to assist understanding of an unfamiliar but

related one (e.g. German and Swedish). Marlow et al. (2008) demonstrated the effects of

language skill on a series of information retrieval tasks using Google Translate and the

requirement for various functionality to support users searching in their native language,

a passive language and an unknown language.

As part of the Web, digital libraries might be subject to this “linguistic divide”.

Although multilingual access to digital libraries was already explored in 1997 (Borgman;

Oard; Peters and Picchi), the present situation of the impact of language barriers in digital

libraries remains unclear.

2.2. Multilingual Information Access

Multilingual information access allows users to search for information written in a variety

of languages without having to formulate their search query in each language (Oard &

Diekema, 1998; Peters & Sheridan, 2001; Jones, 2005; Gey et al., 2005; Kishida, 2005;

Gey et al., 2006; Yang & Lam, 2006). This is particularly useful when a user has

sufficient competence in a language to be able to use retrieved documents but is not

skilled enough to formulate an effective query. Also, some users will have information

needs which can be satisfied without them having to read through a document in detail

and the fact that a document exists may be enough (Oard et al., 2008). As argued by

Gonzalo (2002), there are two different situations relating to a user’s language skills that

carry different design implications for cross-language systems. If a user is monolingual,

full translation assistance is needed in a CLIR context (e.g. back translation of query

terms and document translation). If the user has some passive language skills, then

document translation is less likely to be used or desired. Language ability, therefore, is an

important variable to consider when designing a system that will cater to a range of users

with different needs. Multilingual information retrieval is particularly interesting from the

perspective of the user because the need for search assistance is substantially higher than

in monolingual information retrieval: normally, the user can quickly adapt to

characteristics of the system, but not to an unknown target language (see, e.g. Oard &

Gonzalo, 2002).

Providing multilingual information access to document collections can range from

adapting existing information for use by local communities to providing cross-language

search. Research has focused on aspects such as the design and usability of multilingual

websites (Del Galdo & Nielsen, 1996; Yunker, 2003) and the provision of multilingual

search functionalities (Oard, 1997). Cross-language information retrieval involves

translating the query (in the source language) into the language of the document

collection (target language), the documents into the query language or translating both

queries and documents into a common language (McCarley, 1999; Clough, 2005). At the

simplest level, cross-language retrieval involves the combination of “standard” IR

methods and translation. This has been proven effective for many European languages as

demonstrated in the non-English tasks of the Cross-Language Evaluation Forum (CLEF),

a major comparative evaluation exercise for multilingual information access (see, e.g.

Savoy, 2004; Peters & Braschler, 2004). To bridge the language gap, three major

approaches for CLIR have emerged: (1) machine translation, (2) machine-readable

bilingual dictionaries, and (3) comparable or parallel corpora to develop translation

resources (Voorhees and Harman, 2000).
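
The query-translation route described above (translate the query into the target language, then apply "standard" IR) can be illustrated with a minimal sketch. It assumes a tiny hand-made bilingual dictionary and a simple term-overlap ranking in place of a real retrieval engine; the names `BILINGUAL_DICT`, `translate_query` and `search`, as well as the sample documents, are hypothetical and are not taken from any of the systems cited here.

```python
from collections import Counter

# Hypothetical Spanish-to-English dictionary; one source term may have
# several candidate translations (e.g. different word senses).
BILINGUAL_DICT = {
    "biblioteca": ["library"],
    "digital": ["digital"],
    "mapa": ["map", "chart"],
}


def translate_query(query, dictionary):
    """Replace each source-language term with all of its known translations.

    Terms missing from the dictionary are kept untranslated.
    """
    translated = []
    for term in query.lower().split():
        translated.extend(dictionary.get(term, [term]))
    return translated


def search(query_terms, documents):
    """Rank documents by the number of query-term occurrences they contain."""
    scores = {}
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical English-language collection.
DOCUMENTS = {
    "d1": "a digital library of historical map collections",
    "d2": "weather chart for sailors",
    "d3": "cooking recipes",
}

english_terms = translate_query("mapa biblioteca digital", BILINGUAL_DICT)
ranking = search(english_terms, DOCUMENTS)
print(english_terms)
print(ranking)
```

Keeping every candidate translation, as done here, is only one possible design choice; an interactive system might instead let the user select among the alternatives.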

Providing effective access to multilingual document collections also involves

challenges for the designers of interactive retrieval systems. In particular, deciding how

best to support interaction within the search process can involve enabling: query

formulation (e.g. offering the user additional query terms to refine their search such as

synonyms), query translation (e.g. enabling the user to select from multiple query

translations such as different word senses), document selection from search results (e.g.

providing useable summaries for users to make informed decisions) and document

examination (e.g. providing translated versions of documents for use by the end users)

(Oard, 1997; Petrelli et al., 2006; Oard et al., 2008).

2.3. Multilingualism and Digital Libraries

Supporting Multi-Lingual Information Access (MLIA) and Cross-Language Information

Retrieval (CLIR) in digital libraries has long been recognised as important in providing

universal access to digital content (Oard, 1997; Borgman, 1997; Bian & Chen, 2000;

Peters & Sheridan, 2001). However, much of the focus has been on the technical issues of

providing multilingual access, and less attention has been paid to users and the effects of

language ability. Exceptions have been the studies by Bilal and Bachir (2007a, 2007b) on

the usability of the International Children’s Digital Library with Arabic-speaking

children; Duncker (2002) who described the cultural issues of digital libraries for the

Maori people; and Pavani (2001) who focused on usability issues of the Maxwell

multilingual digital library in Brazil.

However, what is crucial in designing such services is an understanding of the

user and their profile, their interaction with information and search systems, and their

typical information needs: “Most information retrieval (IR) systems are used by people

and we cannot design effective IR systems without some knowledge of how users interact

with them” (Robins, 2000, p. 57). This is as true for digital library systems as it is for

information retrieval systems. The present study aims to complement this previous

work while drawing more general conclusions.

3. Methodology

This study uses an inductive approach with quantitative and qualitative data gathered

using a questionnaire on searching habits, preferences and language selection. Most of

the questions were structured and closed-class using a 5-point Likert scale; others were

open-ended to gather respondents’ feedback and comments. The questionnaire was

organised and presented in 6 sections: user profile, language skills, language selection,

digital libraries, preferred features and crossing the language barrier.

A pilot test of the questionnaire was carried out before embarking on the main

study. A paper version of the questionnaire was tested with a sample of sixteen

postgraduate students (in Translation Studies) from the University of Sheffield with

varying multilingual skills. The results of this pilot test were used to improve the

questionnaire and develop the final version (produced in English and Spanish). A

purposive (non-probability) sampling approach was used to target potential users of

digital libraries and university communities constituted a good starting point for this

research because they are users of university and research libraries. Due to time and

resource constraints, the online questionnaire was distributed mainly within academic

communities from the University of Sheffield and Universidad Autonoma de Madrid.

However, participants from these communities exhibited a range of language skills and

abilities from various academic disciplines.

The sample was divided into English native speakers and non-English speakers to

analyse the relationship between these two populations. Languages that participants

spoke were coded and named L1 (native language), L2 (second language), L3 (third

language), etc. To distinguish the language profile of participants from their preferred

language(s) for searching online, a new set of languages called “search languages” was

created and coded as SL1 (preferred search language 1), SL2 (preferred search language

2), etc. The independent variables were the user’s language profiles and field/domain of

knowledge; location of participants was not asked or taken into account for the analysis,

except when they selected the option of location as a reason for their language choice.

To study the relationship between foreign language skills and use of several

languages when searching online, two new variables were created: the “polyglot value”

and the “polyglot search value”. The polyglot value is computed by summing the mean

of reading level (1=“only understand everyday words and phrases” to 6=“understand

virtually everything, but not like my first language”) and writing level (1=“cannot write”

to 6=“can write fluently and precisely about specialised subjects”) in every foreign

language a participant knows (i.e. a higher score indicates a higher degree of proficiency

in the languages reported). The polyglot search value is computed by summing the

frequencies with which people search in each foreign language specified (from 1=“hardly

ever” to 5=“as much as my native search language”), a higher score indicating more

frequent searching in non-native languages.
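
The two derived variables can be sketched as follows. This is one reading of the definitions above, not the authors' own code: it assumes that, for each foreign language, the reading level and writing level (each on the 1 to 6 scale) are averaged and the averages are then summed, and that a respondent with no foreign languages scores 0. The function names and the sample respondent are hypothetical.

```python
def polyglot_value(foreign_languages):
    """Sum, over all foreign languages, of the mean of reading and writing level.

    Each level is assumed to be on the 1-6 scale described in the text.
    """
    return sum(
        (lang["reading"] + lang["writing"]) / 2 for lang in foreign_languages
    )


def polyglot_search_value(search_frequencies):
    """Sum of the search frequencies (1-5 scale) over foreign search languages."""
    return sum(search_frequencies.values())


# Hypothetical respondent with two foreign languages.
respondent = {
    "foreign_languages": [
        {"name": "French", "reading": 5, "writing": 4},
        {"name": "German", "reading": 3, "writing": 2},
    ],
    "search_frequencies": {"French": 3, "German": 1},
}

pv = polyglot_value(respondent["foreign_languages"])
psv = polyglot_search_value(respondent["search_frequencies"])
print(pv, psv)
```

Under this reading, both values grow with the number of foreign languages as well as with the level reached in each, so a respondent with several intermediate-level languages can score as highly as one with a single near-native language.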

4. Results

4.1. Description of the Sample

A total of 514 responses were obtained from the questionnaire. Of the respondents, 92%

belonged to a university community (80% from the University of Sheffield):

undergraduate (31.5%) and postgraduate (23%) students, research and academic staff

(20%) and administrative staff (18%); the remaining 8% were not associated with

academia. Table 1 shows the fields of knowledge represented by the sample, ranging

from arts and humanities, social sciences, science and technology, and health and life

sciences.

Field                                                             Count   Percent
Medicine, Dentistry                                                  50      9.7%
Art, History, Archaeology, Philosophy, Literature, Librarianship     49      9.5%
Social Science, Anthropology, Politics, International Relations      49      9.5%
Applied Physics, Chemistry, Geology, Computer Science                46      8.9%
Engineering                                                          43      8.4%
Languages, Linguistics, Translation, Interpretation                  41      8%
Biology                                                              41      8%
Mathematics, Theoretical Physics, Theoretical Informatics            22      4.3%
Economics, Business Management, Tourism                              20      3.9%
Psychology, Cognitive Science                                        17      3.3%
Other                                                               135     26.5%

Table 1. Fields of knowledge represented in the sample.

Thirty-two native languages are represented (58 in total including foreign

languages) with English the dominant language (71.7%), followed by Spanish (10.2%),

German (4.1%), French (2.5%) and other languages (11.5%). The skew towards English

and Spanish native languages reflects the sample populations (a UK and Spanish

University). Participants exhibit a wide range of languages known and used for searching

including those from Europe (e.g. Ancient Greek, Basque, Dutch, Polish and Welsh),

Asia (Cantonese, Tamil, Urdu and Vietnamese), the Middle-East (e.g. Persian, Hebrew

and Arabic), Africa (e.g. Afrikaans, Zulu and Swahili) and South America (e.g.

Papiamentu and Quechua).

The language skills of participants vary widely: 36% are monolingual native

English speakers, 35% are native English speakers with knowledge of at least one other

foreign language; 28% are non-English speakers with knowledge of more than one

language and 0.4% (2 respondents) are monolingual Spanish speakers (this low result is

due to the requirement that a foreign language is known for university entry in Spain).

When searching online, participants can be classified in three main groups according to

their language profile and their search language 1 (SL1): (1) native English speakers

searching mainly in English (60%); (2) speakers of other languages searching mainly in

their native language (23%) and (3) speakers of other languages searching mainly in

English (17%).

When the search language 2 (SL2) is taken into account, the groups subdivide as

illustrated in Table 2 (see footnote 4). Results show that 34.4% of participants search in at least two

languages (the grey areas) and 62.2% search in their native language only. Also, when

more search languages are considered, results show that 15% of the sample search in at

least three languages.

Group                                         Total   Second search language (SL2)   Count
Native English speakers                         365   No SL2                           306
                                                      SL2                               59
Non-English native speakers (SL1≠English)        52   No SL2                             7
                                                      SL2 = English                     41
                                                      SL2 ≠ English                      4
Non-English native speakers (SL1=English)        86   No SL2                            17
                                                      SL2 = L1                          61
                                                      SL2 ≠ L1                           8

Table 2. Categories of respondents by native language (L1), first (SL1) and second (SL2) search language.

4 Not all respondents specified a second search language and this accounts for the difference in total responses and the numbers in Table 2.
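
The percentages quoted before Table 2 can be re-derived from its counts. The sketch below is only an arithmetic check: the assignment of cells to "at least two languages" is an assumption (the grey shading of the original table is not available here), chosen as every cell with a second search language, and the dictionary keys are hypothetical labels.

```python
# Counts copied from Table 2; the keys are hypothetical labels for its cells.
TABLE_2 = {
    "native_english": {"no_sl2": 306, "sl2": 59},
    "non_english_sl1_not_english": {
        "no_sl2": 7,
        "sl2_english": 41,
        "sl2_not_english": 4,
    },
    "non_english_sl1_english": {"no_sl2": 17, "sl2_l1": 61, "sl2_not_l1": 8},
}

# Total number of respondents covered by the table.
total = sum(sum(cells.values()) for cells in TABLE_2.values())

# Assumption: "at least two languages" means every cell with a second
# search language (any key other than "no_sl2").
two_or_more = sum(
    count
    for cells in TABLE_2.values()
    for key, count in cells.items()
    if key != "no_sl2"
)

# Native language only: no SL2 and SL1 equal to the native language.
native_only = (
    TABLE_2["native_english"]["no_sl2"]
    + TABLE_2["non_english_sl1_not_english"]["no_sl2"]
)

pct_two_or_more = round(100 * two_or_more / total, 1)
pct_native_only = round(100 * native_only / total, 1)
print(total, pct_two_or_more, pct_native_only)
```

With this grouping the table total is 503 and the two shares come out at 34.4% and 62.2%, matching the reported figures; the remaining 17 respondents are non-English native speakers who search only in English.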

The following points summarise the reasons participants provided to explain their

choice of language for searching:

• Their native language is easier for searching or is the only one they can

understand;

• They perceive that a significant part of the information in their field of expertise is

in English or they find more information in English in general;

• They search in the language of the university where they study/work;

• They search in several languages to widen the coverage of the search;

• They search in the local language if they want to find local news, cultural,

historical, linguistic, tourist and geographical information of particular countries;

• Translators and other language-related professionals search in several languages;

• They do not search in English when they want to find information in certain

fields, like Archaeology, Museums (French), Philosophy (German and French),

etc.

4.2. Field of Knowledge and Language Abilities

A wide variety of languages are represented by the respondents (as described in Section

4.1). The “polyglot value” and “polyglot search value” for each participant were

calculated to reflect their language abilities and frequency of use of other languages for

searching online respectively. An interesting question in this particular context (a

university setting), is whether the field of knowledge or study is also related to (or has an

impact on) language skills/abilities. Table 3 shows the mean of these values across the

most represented fields of knowledge in the sample.

The results from Table 3 show, as one might expect, clear differences between the

language skills exhibited by members of different academic groups: in the upper rows the

language-related fields, social sciences, anthropology, politics and international relations

(on average) have members who are more proficient and search more frequently for

information in multiple languages. Respondents from language-related fields typically

know at least three languages and have a proficiency or bilingual level in one of them.

This contrasts with fields (in the lower rows of Table 3) such as medicine, dentistry and

biology: respondents in these fields typically have an intermediate level in one foreign language

and rarely use more than one language to search online.

Field                                                             Mean of          Mean of polyglot
                                                                  polyglot value   search value
Languages, Linguistics, Translation, Interpretation                  11.19            4.87
Social Science, Anthropology, Politics, International Relations       7.51            2.98
Engineering                                                           4.94            0.88
Art, History, Archaeology, Philosophy, Literature, Librarianship      4.93            1.94
Applied Physics, Chemistry, Geology, Computer Science                 4.29            1.02
Medicine, Dentistry                                                   3.91            0.28
Biology                                                               3.59            0.64

Table 3. Fields of knowledge and “polyglot value” and “polyglot search value”

Using the Pearson correlation coefficient, a significant correlation was found between

the polyglot values and polyglot search values for language-related fields (r=0.717;

p<0.01) and Social Science, Anthropology, Politics and International Relations (r=0.795;

p<0.01) indicating that, on average, having a greater proficiency in multiple foreign

languages is likely to result in more frequent searches in multiple languages (for these

subject areas).
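
For readers unfamiliar with the statistic, the Pearson correlation coefficient between two paired lists of scores can be computed as in the following minimal sketch. The data are invented for illustration and are not the study's responses; the function name `pearson_r` is hypothetical, and the significance test (the reported p-values) is not covered.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented example: polyglot values and polyglot search values
# for five hypothetical respondents from one field.
polyglot_values = [2.0, 4.5, 7.0, 9.5, 12.0]
polyglot_search_values = [0, 1, 3, 4, 6]

r = pearson_r(polyglot_values, polyglot_search_values)
print(round(r, 3))
```

A value of r near 1 indicates that respondents with higher polyglot values also tend to have higher polyglot search values, which is the pattern reported for the two fields above.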

4.3. Potential for Cross-Language Information Retrieval

In the questionnaire, respondents were specifically asked whether having CLIR (being

able to type a query in one language to find information written in other languages)

would be useful to them. In total, 54% stated that CLIR would be helpful, 23% that it would not be helpful,

and 23% did not know (or did not comment). Table 4 shows the respondents’ specified interest in

CLIR (positive responses) based on their field of knowledge/study. Similar to the results

in Section 4.2, the potential utility of CLIR technologies in a university setting may vary

depending on the field of knowledge/study. This is interesting as it highlights the fields on

which to focus when developing multilingual information access.

Field                                                             Respondents interested in CLIR
Languages, Linguistics, Translation, Interpretation                  91%
Art, History, Archaeology, Philosophy, Literature, Librarianship     79%
Social Science, Anthropology, Politics, International Relations      77%
Engineering                                                          70%
Applied Physics, Chemistry, Geology, Computer Science                64%
Biology                                                              62%
Medicine, Dentistry                                                  61%

Table 4. Interest in CLIR by field of knowledge.

In addition to field of knowledge, Table 5 (grey areas highlighting participants

searching in at least two languages) shows that language ability may also affect

who would benefit most from cross-language search, highlighting the groups

of users for whom access to cross-language search may be most valuable.

Results indicate that CLIR functionality may be most helpful in supporting non-English

speakers who search in their native language with English as a second choice, and native

English speakers who search in languages other than English. As expected, CLIR is of least

interest to those who do not wish to search across multiple languages (e.g. native English

speakers searching only in English).

Group                                         Total   Second search language (SL2)   Respondents   Interested in CLIR
Native English speakers                         268   No SL2                                 213   60%
                                                      SL2                                     55   89%
Non-English native speakers (SL1≠English)        52   No SL2                                   6   100%
                                                      SL2 = English                           37   94.5%
                                                      SL2 ≠ English                            4   75%
Non-English native speakers (SL1=English)        86   No SL2                                  12   66.6%
                                                      SL2 = L1                                55   73%
                                                      SL2 ≠ L1                                 7   100%

Table 5. Interest in CLIR in relation to native language and search language(s).

A qualitative analysis of participants’ answers illustrated the potential

applications of CLIR in assisting with search. In general, respondents

reported that CLIR would be most useful to them for the following:

• To widen the coverage of the search and save time by performing “one search per

query instead of one search per language”;

• To include different perspectives from different cultures, to widen the spectrum of

authors and perform resource comparison to have “a picture of the whole”;

• “It is easier to formulate the query in my language (difficult in others) but one

might be interested in finding articles in a different language”.

In addition, specific applications of CLIR may include:

• In language-related fields: “to find translations of terms”, “to find parallel texts

for corpus creation and for translation” and “to discover if an American idea has

diffused into continental linguistics and vice-versa”.

• Arts and Humanities: “Research in my field is published in other European

languages”, “for historical research”, “lots of my research has documents in other

languages and I have to translate them and only then do I realise if they are

relevant” and “it is useful to compare the British situation with other European

Societies”.

• Personal interest: “for tourism”, “to check foreign news”, “to find pictures” and

“to improve reading skills in other languages”.

4.4. Digital Libraries

Overall, 84% of respondents reported using digital libraries as part of their online

activities, with 63% using digital libraries for academic/work purposes and 30% for

personal interest. In total, 97% of postgraduate students used digital libraries, 96% of

research/academic staff and 83% of undergraduate students; administrative staff made the

least use of digital libraries (64%). 13.4% of respondents reported using digital libraries

on a daily basis, 23% three times or fewer per week, 18% three times or fewer per month and

10% once every two months. The reasons for using digital libraries varied: 74% of

respondents reported using digital libraries because they wanted digitised versions of

documents, 57% indicated they used digital libraries because they were considered a

reliable source of information, 45% because they wanted to locate items in a physical

library, and 25% because they wanted to access digitised versions of historical documents

where the original is not publicly accessible.

Respondents provided the names of 88 digital libraries they used. The most

popular were the University of Sheffield Library [5] (as expected), JSTOR [6], ISI Web of

Knowledge [7] (including Web of Science) and PubMed [8]. Other popular digital libraries

used included national libraries (e.g. The British Library, The Library of Congress,

Deutsche Nationalbibliothek and Bibliothèque nationale de France) and online libraries of

publishers (e.g. Elsevier, Nature, Wiley, SAGE journals online, Kluwer Law Library and

Emerald). Regarding the use of search functionality provided by digital libraries,

respondents reported using: search by key words (70.4%), search by author (63.4%),

search by title (58.3%), search by topic (58%), browse by subject (20.4%), browse by

related items (12%) and browse by collection (5.5%). To overcome language difficulties

when searching in multilingual digital libraries, participants selected one or more features

from Table 6.

Feature                                              No. of respondents
Online dictionary                                    183
List of suggested terms related to the query         146
Machine translation of the documents found           129
List of documents related to those found relevant    127
Machine translation of the query                     114

Table 6. Most popular tools to cross the language barrier.

[5] http://www.shef.ac.uk/library/ [site visited: 22/6/09]
[6] http://www.jstor.org/ [site visited: 22/6/09]
[7] http://apps.isiknowledge.com/ [site visited: 22/6/09]
[8] http://www.ncbi.nlm.nih.gov/pubmed/ [site visited: 22/6/09]

The features in Table 6 are commonly found in CLIR systems and include

functionalities to translate the user’s query (using machine translation or bilingual

dictionaries), improve the query (query expansion) and translate retrieved documents.

The results in Section 4.3 indicate that the functionality required by users is likely

influenced by their language abilities and field of knowledge (we found in previous work

that language ability had a direct impact on the provision of functionalities for

multilingual search (Marlow et al., 2008)).
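
As a minimal sketch of two of the functionalities mentioned above (query expansion and dictionary-based query translation), the following Python example may help to make the idea concrete. It is illustrative only and does not describe any system used in this study: the toy English-to-Spanish dictionary, the list of related terms and all function names are assumptions introduced here.

```python
from typing import Dict, List

# Hypothetical toy English-to-Spanish dictionary (an assumption for
# illustration; a real system would load a full bilingual lexicon).
BILINGUAL_DICT: Dict[str, List[str]] = {
    "digital": ["digital"],
    "library": ["biblioteca"],
    "history": ["historia"],
    "archive": ["archivo"],
}

# Hypothetical related terms used for query expansion.
RELATED_TERMS: Dict[str, List[str]] = {
    "library": ["archive"],
}


def expand_query(terms: List[str], related: Dict[str, List[str]]) -> List[str]:
    """Append related terms to the query, skipping duplicates."""
    expanded = list(terms)
    for term in terms:
        for extra in related.get(term, []):
            if extra not in expanded:
                expanded.append(extra)
    return expanded


def translate_query(terms: List[str], dictionary: Dict[str, List[str]]) -> List[str]:
    """Replace each term with its dictionary translations.

    Terms missing from the dictionary (e.g. proper nouns) are kept unchanged.
    """
    translated: List[str] = []
    for term in terms:
        for candidate in dictionary.get(term, [term]):
            if candidate not in translated:
                translated.append(candidate)
    return translated


def cross_language_query(query: str) -> List[str]:
    """Expand a source-language query, then translate it term by term."""
    terms = query.lower().split()
    return translate_query(expand_query(terms, RELATED_TERMS), BILINGUAL_DICT)
```

Under these assumptions, calling `cross_language_query("Digital library history")` would yield the Spanish terms for the original query plus the translation of the expanded term "archive". Translating retrieved documents, the third functionality in Table 6, would normally require a machine translation system and is not sketched here.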

When investigating further the digital libraries used by respondents, it was

noticeable that very few offered any kind of cross-language support beyond the

localisation of selected pages of content. To explore this further, we carried out a

“competitor analysis” of the two digital libraries most widely used by participants, both of

which are examples of multilingual digital libraries: JSTOR and Web of Knowledge. In particular,

an informal features analysis, which “provides a snapshot of the competition’s services

and features from a customer standpoint” (Goto and Cotler, 2005, p. 260), was carried out

to compare the sites based on functionality, content and multilingual accessibility.

Interestingly, publications written in English represented approximately 90% of the

content, whilst other languages were clearly underrepresented. At the same time, access

to resources not published in English was not facilitated, either by linguistic tools or

by a search tool that supported several languages (JSTOR offers a search for four

languages in addition to English, but requires a special syntax). There are definitely

improvements which could be made to help users access the content in these digital

libraries, in particular for non-native English users who must gather material in English

as part of their research or programme of study.

5. Discussion

To fully assess the impact and utility of multilingual digital libraries, studies must be

carried out to investigate the role of language in accessing information online, and in

particular for developing effective multilingual digital libraries. The complexity of

language choice used to access information is determined by many factors; in this study

we have focused on two important factors in the context of (mainly) university

communities: fluency in foreign languages and field of knowledge/study.

The first observation to make concerns people’s language skills: it is clear from

our study that in a multicultural and multilingual society the language skills that users

possess can be very complex. This confirms the need for an appropriate framework in

which to capture and analyse language knowledge (e.g. similar to Laufer & Goldstein,

2004; Marlow et al., 2008; Berendt & Kralisch, 2009). Statistics about

Internet usage by language typically assume that users search in just one language: the

official language of their country. This view is too simplistic: firstly, there are many

countries with more than one official language and many regional languages that are

not considered “official”; secondly, there are more speakers of English as a second

language than native English speakers (Graddol, 2006), which would indicate that many

online users might search in English and their native language. This study shows (with no

aim of generalising the numbers) that 34.4% of the sample uses at least two languages for

searching online and 17% of participants prefer English to their native language for

searching online.

Our second observation concerns the effect of users’ language skills and

professional/study field on their language choice when searching online. We have found

that users from different fields of study are likely, on average, to have varying language

proficiency and skills, which is likely to affect the utility of multilingual information

access across different academic disciplines. This suggests that even in one context or

search scenario there are likely to be very different use cases and suggests that more

detailed studies within domains should be conducted, as well as across different

domains. This study has shown that within language-related fields, Arts and Humanities

and Social Sciences, as users’ proficiency in foreign languages increases, they are

more likely to search using more languages (no correlation was found in other fields). We

also observed that in some fields, e.g. Computer Science and IT, English is the

predominant language and, therefore, professionals are required to know English in

addition to their native language. This might suggest that digital libraries should be

developed for or adapted to the needs of specific user groups (rather than providing

unnecessary functionality for all users). This aligns with Berendt & Kralisch (2009) who

propose deployment strategies for language tools based on multilingual content, user

behaviours, the user group’s domain expertise, site type and user’s attitudes towards the

availability of first-language content.

Our third observation concerns users’ interest in cross-language information

retrieval. Results from this sample group show that the language-related fields constitute

the best potential group of users, followed by the Arts and Humanities and the Social

Sciences. Also, non-English speakers who search only in their native language might be

the best potential group of users, but this group is too small in this sample to be

statistically significant, so conclusions cannot be drawn. Apart from these users, non-English speakers who

search in their native language and in English (as a second choice) are also a potential

group of interest. English speakers who use other languages for searching in addition to

English constitute the third group most interested in CLIR.

It should be noted that translators and language professionals suggested many

applications of CLIR systems and, in general, “to widen the coverage of the search” was

a very popular answer. Also, some participants expressed their need for tools that help to

formulate queries in foreign languages they can understand. This finding is supported by

Laufer’s study (1998), which indicates that learners’ passive language skills (listening

and reading) develop earlier than the active ones (writing and speaking). The

consequence of this observation for information search is that a user might be able to

understand documents in a foreign language but be unable to formulate queries. To

overcome this difficulty, online dictionaries were the most popular tools among

participants of the questionnaire. Therefore, monolingual users may need help

formulating queries in foreign languages and require document translation, but users with

passive language abilities may not require tools to support document translation.

Our final observation concerns multilingualism and digital libraries. This study

highlights the many opportunities for multilingual information access in digital libraries,

particularly within a university community: many people have multiple language skills or

work in areas that require searching for content in multiple languages. However, a brief

analysis of two commonly used digital libraries (JSTOR and ISI Web of Knowledge)

highlights two issues: that content is predominantly in English (as found in other areas of

the Web (Berendt & Kralisch, 2009)), and that the provision for multilingual search and

browse is minimal (especially with regard to cross-language search). This certainly

would not seem to support the notion of making content accessible to all, especially as

previous research has shown that users prefer to search in languages they are familiar

with (Berendt & Kralisch, 2009), and prefer to use documents written in languages they

can read (Michos et al., 1999). The provision of multilingual information access will

involve more than implementing cross-language search; it requires localisation of

existing material and support for multilingual browsing (see, e.g. (Eurescom, 2000)). The

use of CLIR would facilitate access to multilingual information in a digital library, but

before designing such functionality a study of the linguistic profile of the library users is

necessary. In addition, the specific domain or field of knowledge of the library is a factor

that determines the utility of CLIR.

6. Conclusions

This study has investigated the effects of language skills and field of knowledge/study on

users’ preferred languages when searching for information online, including digital

libraries, and within a university context. Both quantitative and qualitative data have been

gathered using a questionnaire and 514 responses were obtained from a range of people at

the University of Sheffield (UK) and the Universidad Autónoma de Madrid (Spain).

Respondents have varying degrees of language abilities and approximately one third of

the respondents use at least two languages for searching online. Language skills and the

frequency with which multiple languages are searched were found to vary between

different fields of knowledge/study, with a significant correlation between the polyglot

value (indicating language proficiency in foreign languages) and polyglot search value

(indicating the frequency of searching in foreign languages) for language-related fields,

Social Science and Arts and Humanities.

The analysis of interest in CLIR by field of knowledge showed that the language–

related fields constitute the best potential group of users, followed by the Arts and

Humanities and Social Sciences. The results also highlight the utility of cross-language

search for specific language groups (e.g. non-native English users searching for English

content). To develop effective multilingual digital libraries, it is clear that further work

should be carried out to better understand the users, their profiles and the context in

which they access digital libraries. Language skills and field of knowledge in

a university community are just two factors within one domain; multiple contexts and

factors should be investigated. It is vital that the development of multilingual digital libraries

focuses as much on the user and tasks as it does on the technical issues required to

implement multilingualism.

In future work we plan to study groups that were not statistically significant in this

sample with respect to language ability and search languages (e.g. doctors and biologists

who are monolingual speakers of a language other than English). Adding a parameter

for geographical location would also enrich the study in combination with the language

profile. Finally, a more thorough evaluation of representative digital libraries should be

carried out to provide a more complete view of the state of multilingual access to digital

libraries and potential avenues for future developments.

Acknowledgements

The work reported has been partially supported by the TrebleCLEF Coordination Action,

within FP7 of the European Commission, Theme ICT-1-4-1 Digital Libraries and

Technology Enhanced Learning (Contract 215231). The views and opinions expressed in

this paper reflect those of the authors.

References

Association of Research Libraries (1995). Realizing Digital Libraries. Appendix II:

Definition and Purposes of a Digital Library. Retrieved June 26, 2009, from

http://www.arl.org/resources/pubs/mmproceedings/126mmappen2

Bian, G., & Chen, H. (2000). Cross Language Information Access to Multilingual

Collections on the Internet. Journal of the American Society for Information Science &

Technology, Special Issue on Digital Libraries, 51(3), 281-296.

Berendt, B., & Kralisch, A. (2009). A user-centric approach to identifying best

deployment strategies for language tools: the impact of content and access language

on Web user behaviour and attitudes. Information Retrieval, 12(3), 380-399.

Bilal, D., & Bachir, I. (2007). Children’s interaction with cross-cultural and multilingual

digital libraries: I. Understanding interface design representations. Information

Processing and Management, 43, 47–64.

Bilal, D., & Bachir, I. (2007). Children’s interaction with cross-cultural and multilingual

digital libraries II: Information seeking, success, and affective experience.

Information Processing and Management, 43, 65–80.

Borgman, C.L. (1997). Multi-Media, Multi-Cultural and Multi-Lingual Digital Libraries:

or How do we exchange data in 400 languages? D-Lib Magazine. Retrieved June 26,

2009, from http://dlib.ukoln.ac.uk/dlib/june97/06borgman.html

Chen, C. (2007, August). Delivery of Web-based Multilingual Digital Collections and

Services to Multicultural Populations: The Case of Global Memory Net. Paper

presented at IFLA Government Libraries Section in cooperation with the Library

Services to Multicultural Populations Section at the IFLA Meeting in Durban, South

Africa. Retrieved June 26, 2009, from

http://archive.ifla.org/IV/ifla73/papers/097-Chen-en.pdf [last accessed 23/06/2009]

Clough, P. (2005). Caption vs. Query Translation for Cross-Language Image Retrieval. In

Peters, C., Clough, P., Gonzalo, J., Jones, G., Kluck, M. and Magnini, B. (Eds.),

Multilingual Information Access for Text, Speech and Images: Results of the Fifth

CLEF Evaluation Campaign (pp. 614-625). Lecture Notes in Computer Science

(LNCS), Springer, Heidelberg, Germany, Volume 3491/2005.

Crane, G. (2006). What do you do with a million books? D-Lib Magazine, 12(3).

Retrieved June 26, 2009, from http://www.dlib.org/dlib/march06/crane/03crane.html

Del Galdo, E.M., & Nielsen, J. (1996). International User Interfaces. New York: John

Wiley & Sons.

Duncker, E. (2002). Cross-Cultural Usability of the Library Metaphor. In Proceedings of

the ACM & IEEE Joint Conference on Digital Libraries (JCDL '02), Portland, OR,

USA.

European Commission Information Society and Media (2006). i2010: Digital Libraries.

Published by the European Communities (Luxembourg: Office for Official

Publications of the European Communities). Retrieved June 26, 2009, from

http://ec.europa.eu/information_society/activities/digital_libraries/doc/brochures/dl_brochure_2006.pdf

Eurescom (2000). Multi-Lingual Web Sites: Best Practice Guidelines and Architecture

(P923). Eurescom project report. Retrieved June 26, 2009, from

http://www.eurescom.de/Public/projectresults/P900-series/923d1.asp

Flammia, M., & Saunders, C. (2007). Language as Power on the Internet. Journal of the

American Society for Information Science and Technology, 58(12), 1899-1903.

Gey, F. C., Kando, N., & Peters, C. (2005). Introduction to Special issue on Cross-

Language Information Retrieval. Information Processing and Management, 41(3),

413-722.

Gey, F. C., Kando, N., Lin, C., & Peters, C. (2006). New directions in multilingual

information access. SIGIR Forum, 40(2), 31-39.

Gibson, M., & Hufeisen, B. (2003). Investigating the role of prior foreign language

knowledge: Translating from an unknown into a known foreign language. In: Cenoz,

J., Hufeisen, B., & Jessner, U. (Eds.), The Multilingual Lexicon (pp. 87-102).

Netherlands: Springer.

Gonzalo, J. (2002). Scenarios for interactive cross-language information retrieval

systems. Paper presented at the SIGIR 2002 Workshop on Cross-Language IR.

Goto, K., & Cotler, E. (2005). Web Re-Design 2.0: Workflow that works (2nd edition).

New Riders.

Graddol, D. (2006). English Next. Published by the British Council. Retrieved June 26,

2009, from http://www.britishcouncil.org/learning-research-english-next.pdf

Halavais, A. (2000). National Borders on the World Wide Web. New Media and Society,

2(1), 7-28.

Jones, G.J.F. (2005). Beyond English Text: Multilingual and Multimedia Information

Retrieval. In Charting a New Course: Natural Language Processing and Information

Retrieval. Essays in Honour of Karen Sparck Jones (pp. 81-98). Netherlands:

Springer.

Kishida, K. (2005). Technical issues of cross-language information retrieval: a review.

Information Processing & Management, 41(3), 433-455.

Kralisch, A. (2005). The impact of culture and language on the use of the internet:

empirical analyses of behaviour and attitudes. Masters dissertation, Humboldt-

Universität zu Berlin.

Kralisch, A., & Mandl, T. (2006). Barriers to Information Access across Languages on

the Internet: Network and Language Effects. In Proceedings of the 39th Hawaii

International Conference on System Sciences-IEEE (pp. 1-10).

Laufer, B. (1998). The development of passive and active vocabulary in a second

language: Same or different? Applied Linguistics, 19, 255–271.

Laufer, B., & Goldstein, Z. (2004). Testing vocabulary knowledge: Size, strength and

computer adaptiveness. Language Learning, 54(3), 399-436.

Marlow, J., Clough, P., Cigarrán Recuero, J., & Artiles, J. (2008). Exploring the Effects

of Language Skills on Multilingual Web Search. In Proceedings of the 30th European

Conference on IR Research (pp. 126-137). Glasgow, UK, April 2008.

McCarley, J. S. (1999). Should we Translate the Documents or the Queries in Cross-

language Information Retrieval? In Proceedings of the 37th Annual Meeting of the

Association for Computational Linguistics (pp. 208 – 214). College Park, Maryland,

USA.

Michos, S.E., Stamatatos, E., & Fakotakis, N. (1999). Supporting Multilinguality in

Library Automation Systems Using AI Tools. Applied Artificial Intelligence, 13(7),

679-703.

Mikami, Y., & Suzuki, I. (2004). The Language Observatory Project and its Experiment:

Cyber Census Survey. Published by the European Language Resources Association.

Retrieved June 26, 2009, from

http://www.elda.org/en/proj/scalla/SCALLA2004/mikami.pdf

Miniwatts Marketing Group (2008). Internet World Statistics. Retrieved June 26, 2009,

from http://www.internetworldstats.com/

Oard, D. W. (1997). Serving Users in Many Languages: Cross-language retrieval for

digital libraries. D-Lib Magazine. Retrieved June 26, 2009, from

http://dlib.ukoln.ac.uk/dlib/december97/oard/12oard.html

Oard, D.W., & Diekema, A. (1998). Cross-Language Information Retrieval. Annual

Review of Information Science and Technology, 33, 223-256.

Oard, D., & Gonzalo, J. (2002). The CLEF 2001 Interactive Track. Evaluation of Cross-

Language Information Retrieval Systems (pp. 308-319). Netherlands: Springer-Verlag

LNCS 2406.

Oard, D., He, D., & Wang, J. (2008). User-assisted query translation for interactive cross-

language information retrieval. Information Processing & Management, 44(1), 181-

211.

Paolillo, J. (2005). Language Diversity on the Internet. Measuring Linguistic Diversity,

43-89. UNESCO Publications for the World Summit on the Information Society.

Retrieved June 26, 2009, from

http://unesdoc.unesco.org/images/0014/001421/142186e.pdf

Pavani, A.M.B. (2001). A model of multilingual digital library. Ci. Inf., Brasília, 30(3),

73-81.

Peters, C., & Picchi, E. (1997). Across Languages, Across Cultures: Issues in

Multilinguality and Digital Libraries. D-Lib Magazine. Retrieved June 26, 2009, from

http://dlib.ukoln.ac.uk/dlib/may97/peters/05peters.html

Peters, C., & Sheridan, P. (2001). Multilingual information access. In M. Agosti, F.

Crestani, and G. Pasi (Eds.), Lectures on Information Retrieval (pp. 51-80). New

York: Springer LNCS 1980.

Peters, C., & Braschler, M. (2004). Introduction to Special Issue on CLEF. Information

Retrieval, 7(1-2).

Petrelli, D., Levin, S., Beaulieu, M., & Sanderson, M. (2006). Which User Interaction for

Cross-Language Information Retrieval? Design issues and Reflections. Journal of the

American Society for Information Science and Technology, 57(5), 709-722.

Ringbom, H. (2007). Cross-linguistic similarity in foreign language learning. Clevedon,

UK: Multilingual Matters.

Robins, D. (2000). Interactive Information Retrieval: Context and basic Notions.

Informing Science, 3(2), 57-61. Retrieved June 26, 2009, from

http://inform.nu/Articles/Vol3/v3n2p57-62.pdf

Savoy, J. (2004). Combining Multiple Strategies for Effective Monolingual and Cross-

Language Retrieval. Information Retrieval, 7(1-2), 121-148.

The Unesco’s Recommendation concerning the Promotion and Use of Multilingualism

and Universal Access to Cyberspace (2003). Retrieved June 26, 2009, from

http://portal.unesco.org/ci/en/files/13475/10697584791Recommendation-Eng.pdf/Recommendation-Eng.pdf

Voorhees, E.M., & Harman, D. (2000). Overview of the sixth text retrieval conference

(TREC-6). Information Processing & Management, 36(1), 3-35.

Yang, C., & Lam, W. (2006). Introduction to Special issue on Multilingual Information

Systems. Journal of the American Society for Information Science and Technology,

57(5).

Yunker, J. (2003). Beyond borders - Web globalization strategies. Indianapolis, IN: New

Riders Publishing.

Biographies

Dr. Paul Clough is a lecturer in Information Systems in the Department of Information

Studies, University of Sheffield (UK). He received his BEng from the University of York

in Computer Science while also working for British Telecommunications Plc. as a

software engineer. He received his PhD from the Department of Computer Science,

University of Sheffield and has since worked as a researcher on a range of language

engineering and information access projects. Clough is a member of the Information

Retrieval (IR) group and his core research interests are information retrieval

(Geographical IR, multimedia IR and evaluation), computational text analysis (plagiarism

detection, authorship attribution and creating corpora) and human-computer interaction.

He has over 60 peer-reviewed publications in his research area and a US patent for an

information management system.