Investigating Language Skills and Field of Knowledge on Multilingual Information Access in Digital Libraries
Paul Clough* and Irene Eleta ([email protected])
Department of Information Studies, University of Sheffield,
Regent Court, 211 Portobello Street, Sheffield S1 4DP UK.
Tel: +44 (0) 114 2222664 Fax: +44 (0) 114 2780300
Abstract
Digital libraries remove physical barriers to accessing information, but the language
barrier remains because of multilingual collections and the linguistic diversity of users.
This study aims to understand the effect of users’ language skills and field of
knowledge on their language preferences when searching for information online, and to
provide new insights into access to multilingual digital libraries. Both quantitative and
qualitative data were gathered using a questionnaire and results show that the language
skills and the field of knowledge have an impact on the language choice for searching
online. These factors also determine the interest in cross-language information retrieval:
language–related fields constitute the best potential group of users, followed by the Arts
and Humanities and the Social Sciences.
Keywords: digital libraries, multilingual information access, cross-language information
retrieval, language skills.
1. Introduction
The curators of digital libraries and their users are being confronted with large quantities
of digital material, increasingly diverse in nature: multi-media, multi-cultural and multi-
language (Borgman, 1997; Crane, 2006). A fundamental goal of digital libraries is to
provide universal access to the information being managed (Association of Research
Libraries, 1995), but this can only be realised if digital content is made more accessible
and usable over time within online environments (Chen, 2007). For example, the
European i2010 Digital Libraries initiative aims to make cultural, audiovisual and
scientific heritage accessible to all: “the initiative combines cultural diversity,
multilingualism and technological progress” (European Commission Information Society
and Media, 2006). In Europe two major initiatives are The European Library1 (TEL) and
more recently Europeana2. The European Library (TEL) offers access to digital resources
(books, posters, maps, sound recordings and videos) and bibliographic content from 48
national libraries in 35 languages. Europeana - the European digital library, museum and
archive – aims to provide users access to around 2 million digital objects, including
photos, paintings, sounds, maps, manuscripts, books, newspapers and archival papers.
Both digital libraries offer access to multilingual content and Europeana plans to provide
a multilingual interface and offer multilingual access to users. More widely, UNESCO
has officially launched the World Digital Library3, an Internet-based library that aims to
display and explain the wealth of all human cultures. Of course, universal access is as
applicable to smaller and more specialised digital libraries as it is to the larger national
and international ones. However, although digital libraries can remove physical and
spatial barriers in accessing information, the language barrier still remains due to
multilingual collections and the linguistic diversity of users. Previous research has shown
that language has an impact on the structure of the web (Kralisch & Mandl, 2006) and
that the power relations of languages on the Internet have cultural implications (Flammia
& Saunders, 2007), causing a digital divide. Language represents a clear barrier to
accessing information online, which is dominated by the English language and Anglo-
American values. This is the context in which digital libraries must operate and are
thereby subject to this digital divide also. A key factor to the future success of digital
libraries is the provision of appropriate multilingual services to allow users to find,
explore and work with content in multiple languages (European Commission Information
Society and Media, 2006).
1 http://www.theeuropeanlibrary.org [site accessed: 23/06/2009]
2 http://dev.europeana.eu/outcomes.php [site accessed: 23/06/2009]
3 http://www.wdl.org/en/ [site accessed: 23/06/2009]
In this paper we present a study investigating the potential impact of language
and field/domain of knowledge on searching for information online in general, and on using
digital libraries in particular. The context of our study has been constrained to users predominantly
within a university setting because: (1) this group makes regular use of digital libraries,
(2) it is increasingly common to find members of this population exhibiting a wide range
of language skills and abilities, (3) members of this group have specialised areas of
knowledge, and (4) members of this community were easily accessible to us. The
methodology follows an inductive approach and both quantitative and qualitative data
have been gathered using a questionnaire that resulted in 514 responses. Although the
majority of participants belong to The University of Sheffield (UK) and the Universidad
Autonoma de Madrid (Spain), an effort has been made to include respondents with other
native languages in addition to English and Spanish. Also, a range of fields is studied,
with particular attention to the language-related fields. The specific objectives of this
study are to:
• Explore the effect of users’ language skills and professional/study field on their
language choice when searching online;
• Investigate whether or not users would like to use cross-language information
retrieval (CLIR), its utility and how this relates to language skills and field;
• Investigate the preference of users for certain tools and functionalities that support
searching in digital libraries, as well as the most criticised aspects of digital
libraries.
The paper is organised as follows: Section 2 describes related literature;
Section 3 explains the methodology used in this study; Section 4 presents the results,
which are further discussed in Section 5; Section 6 concludes the paper and outlines
areas for future work.
2. Literature Review
2.1. People, Languages and the Internet
With users’ activities involving increasing levels of online interaction, it is important to
study language with regard to information access. Miniwatts Marketing Group (2008)
shows in its Internet World Statistics that already 17.3% of Internet users are from China,
above the 15% from the USA and followed by Japan (6.4%), India (4.1%), Germany
(3.6%) and Brazil (3.4%). Also, their numbers show that the percentage of English
speakers online has decreased to 30.5% in June 2008, while Chinese speakers represent
20.4% of Internet world users and Spanish speakers, 6.8%. However, these statistics do
not take into account the number of speakers that know English as a second language,
which Graddol (2006) estimated to be higher than the number of native English speakers.
Also, many people are bilingual or multilingual, but Miniwatts Marketing Group only
assigns one language per person.
Although there is a clear trend that the linguistic diversity of content on the Web
is increasing, many languages are still underrepresented and the character encoding
systems used privilege some languages over others (Flammia and Saunders, 2007).
Mikami and Suzuki (2004) raised the issue of the lack of statistics about the number of
web pages by language, script and character set. Halavais (2000) demonstrated that
national borders have an imprint in the topology of the Web: websites are more likely to
link to another site of the same country (and in this study to websites in the USA).
Kralisch and Mandl (2006) also studied the hyperlink structure of the Web and confirmed
the tendency to link to websites not only in the same country, but written in the same
language.
Previous research has highlighted a digital divide online due to language barriers
(Paolillo, 2005; Kralisch, 2005; Kralisch & Mandl, 2006; Flammia & Saunders, 2007;
Berendt & Kralisch, 2009). An overarching theme throughout the past work has been a
language bias towards English and this has been reflected in the analysis carried out. For
example, Kralisch (2005) reduced the choice of language to two options: the user’s
mother tongue and English. Kralisch (ibid.) observed that non-native speakers performed
searches in their domain of knowledge as well as native speakers did, and concluded that
both the language and domain of knowledge affect the amount of cognitive effort
involved in the use of a certain search option. Laufer (1998) observed that learners’
passive language skills (listening and reading) develop earlier than the active ones
(writing and speaking). This observation has consequences when using a foreign
language for searching. For instance, a user might be able to understand documents in a
foreign language but be unable to write queries in that language to find them.
Previous research has also investigated the language skills of users, providing
suitable frameworks in which to study language ability. Unknown and native languages
are the two endpoints of a spectrum of language knowledge; foreign language ability can
vary greatly within these two extremes. While Ringbom (2007) points out the distinction
between passive and active ability, Laufer & Goldstein (2004) suggest that this
dichotomy is too simplistic, and propose a continuum of knowledge strength that also
includes recall and recognition. According to Gibson & Hufeisen (2003), prior
knowledge of a language has been shown to assist understanding of an unfamiliar but
related one (e.g. German and Swedish). Marlow et al. (2008) demonstrated the effects of
language skill on a series of information retrieval tasks using Google Translate and the
requirement for various functionality to support users searching in their native language,
a passive language and an unknown language.
As part of the Web, digital libraries might be subject to this “linguistic divide”.
Although multilingual access to digital libraries was already explored in 1997 (Borgman;
Oard; Peters and Picchi), the present situation of the impact of language barriers in digital
libraries remains unclear.
2.2. Multilingual Information Access
Multilingual information access allows users to search for information written in a variety
of languages without having to formulate their search query in each language (Oard &
Diekema, 1998; Peters & Sheridan, 2001; Jones, 2005; Gey et al., 2005; Kishida, 2005;
Gey et al., 2006; Yang & Lam, 2006). This is particularly useful when a user has
sufficient competence in a language to be able to use retrieved documents but is not
skilled enough to formulate an effective query. Also, some users will have information
needs which can be satisfied without them having to read through a document in detail
and the fact that a document exists may be enough (Oard et al., 2008). As argued by
Gonzalo (2002), there are two different situations relating to a user’s language skills that
carry different design implications for cross-language systems. If a user is monolingual,
full translation assistance is needed in a CLIR context (e.g. back translation of query
terms and document translation). If the user has some passive language skills, then
document translation is less likely to be used or desired. Language ability, therefore, is an
important variable to consider when designing a system that will cater to a range of users
with different needs. Multilingual information retrieval is particularly interesting from the
perspective of the user because the need for search assistance is substantially higher than
in monolingual information retrieval: normally, the user can quickly adapt to
characteristics of the system, but not to an unknown target language (see, e.g. Oard &
Gonzalo, 2002).
Providing multilingual information access to document collections can range from
adapting existing information for use by local communities to providing cross-language
search. Research has focused on aspects such as the design and usability of multilingual
websites (Del Galdo & Nielsen, 1996; Yunker, 2003) and the provision of multilingual
search functionalities (Oard, 1997). Cross-language information retrieval involves
translating the query (in the source language) into the language of the document
collection (target language), the documents into the query language or translating both
queries and documents into a common language (McCarley, 1999; Clough, 2005). At the
simplest level, cross-language retrieval involves the combination of “standard” IR
methods and translation. This has been proven effective for many European languages as
demonstrated in the non-English tasks of the Cross-Language Evaluation Forum (CLEF),
a major comparative evaluation exercise for multilingual information access (see, e.g.
Savoy, 2004; Peters & Braschler, 2004). To bridge the language gap, three major
approaches for CLIR have emerged: (1) machine translation, (2) machine-readable
bilingual dictionaries, and (3) comparable or parallel corpora to develop translation
resources (Voorhees and Harman, 2000).
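As a minimal sketch of the second approach (query translation with a machine-readable bilingual dictionary), the Python fragment below translates a source-language query term by term and matches it against target-language documents. The dictionary entries, the documents and the function names are hypothetical and chosen only for illustration; an operational CLIR system would also need word-sense disambiguation, stemming and a proper ranking model.

```python
# Toy Spanish-to-English dictionary; the entries are illustrative assumptions.
BILINGUAL_DICT = {
    "biblioteca": ["library"],
    "digital": ["digital"],
    "idioma": ["language", "tongue"],
}

# Toy target-language (English) collection; also hypothetical.
DOCUMENTS = {
    "d1": "the digital library offers maps and books",
    "d2": "language skills of users",
    "d3": "weather report for Sheffield",
}


def translate_query(query, dictionary):
    """Replace each source term with all of its dictionary translations.

    Terms missing from the dictionary are kept unchanged (e.g. proper nouns).
    """
    translated = []
    for term in query.lower().split():
        translated.extend(dictionary.get(term, [term]))
    return translated


def search(query_terms, documents):
    """Rank documents by the number of distinct query terms they contain."""
    wanted = set(query_terms)
    scored = []
    for doc_id, text in documents.items():
        score = len(wanted & set(text.lower().split()))
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties are broken by document id.
    return [doc_id for score, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


target_terms = translate_query("biblioteca digital", BILINGUAL_DICT)
results = search(target_terms, DOCUMENTS)
```

Keeping every translation of an ambiguous term (as for "idioma") is the simplest policy; it widens recall at the cost of precision, which is why interactive systems often let the user choose among candidate translations.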
Providing effective access to multilingual document collections also involves
challenges for the designers of interactive retrieval systems. In particular, deciding how
best to support interaction within the search process can involve enabling: query
formulation (e.g. offering the user additional query terms to refine their search such as
synonyms), query translation (e.g. enabling the user to select from multiple query
translations such as different word senses), document selection from search results (e.g.
providing useable summaries for users to make informed decisions) and document
examination (e.g. providing translated versions of documents for use by the end users)
(Oard, 1997; Petrelli et al., 2006; Oard et al., 2008).
2.3. Multilingualism and Digital Libraries
Supporting Multi-Lingual Information Access (MLIA) and Cross-Language Information
Retrieval (CLIR) in digital libraries has long been recognised as important in providing
universal access to digital content (Oard, 1997; Borgman, 1997; Bian & Chen, 2000;
Peters & Sheridan, 2001). However, much of the focus has been on the technical issues of
providing multilingual access and less attention focused on users and the effects of
language ability. Exceptions have been the studies by Bilal and Bachir (2007a, 2007b) on
the usability of the International Children’s Digital Library with Arabic-speaking
children; Duncker (2002) who described the cultural issues of digital libraries for the
Maori people; and Pavani (2001) who focused on usability issues of the Maxwell
multilingual digital library in Brazil.
However, what is crucial in designing such services is an understanding of the
user and their profile, their interaction with information and search systems, and their
typical information needs: “Most information retrieval (IR) systems are used by people
and we cannot design effective IR systems without some knowledge of how users interact
with them” (Robins, 2000, p. 57). This is as true for digital library systems as it is for
information retrieval systems. The present study aims to complement this previous
work while drawing more general conclusions.
3. Methodology
This study uses an inductive approach with quantitative and qualitative data gathered
using a questionnaire on searching habits, preferences and language selection. Most of
the questions were structured and closed-class using a 5-point Likert scale; others were
open-ended to gather respondents’ feedback and comments. The questionnaire was
organised and presented in 6 sections: user profile, language skills, language selection,
digital libraries, preferred features and crossing the language barrier.
A pilot test of the questionnaire was carried out before embarking on the main
study. A paper version of the questionnaire was tested with a sample of sixteen
postgraduate students (in Translation Studies) from the University of Sheffield with
varying multilingual skills. The results of this pilot test were used to improve the
questionnaire and develop the final version (produced in English and Spanish). A
purposive (non-probability) sampling approach was used to target potential users of
digital libraries and university communities constituted a good starting point for this
research because they are users of university and research libraries. Due to time and
resource constraints, the online questionnaire was distributed mainly within academic
communities from the University of Sheffield and Universidad Autonoma de Madrid.
However, participants from these communities exhibited a range of language skills and
abilities from various academic disciplines.
The sample was divided into English native speakers and non-English speakers to
analyse the relationship between these two populations. Languages that participants
spoke were coded and named L1 (native language), L2 (second language), L3 (third
language), etc. To distinguish the language profile of participants from their preferred
language(s) for searching online, a new set of languages called “search language” were
created and coded as SL1 (preferred search language 1), SL2 (preferred search language
2), etc. The independent variables were the user’s language profiles and field/domain of
knowledge; location of participants was not asked or taken into account for the analysis,
except when they selected the option of location as a reason for their language choice.
To study the relationship between foreign language skills and use of several
languages when searching online, two new variables were created: the “polyglot value”
and the “polyglot search value”. The polyglot value is computed from summing the mean
of reading level (1=“only understand everyday words and phrases” to 6=“understand
virtually everything, but not like my first language”) and writing level (1=“cannot write”
to 6=“can write fluently and precisely about specialised subjects”) in every foreign
language a participant knows (i.e. a higher score indicates a higher degree of proficiency
in the languages reported). The polyglot search value is computed by summing the
frequencies with which people search in each foreign language specified (from 1=“hardly
ever” to 5=“as much as my native search language”), a higher score indicating more
frequent searching in non-native languages.
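The two derived variables can be sketched in Python as follows. This is one reading of the definitions above, not the authors' actual analysis script: the record layout, the function names and the sample participant are assumptions made for illustration.

```python
def polyglot_value(foreign_languages):
    """Sum, over all foreign languages, the mean of reading and writing level.

    Each level is assumed to be an integer on the 1-6 scale described above.
    """
    return sum((lang["reading"] + lang["writing"]) / 2 for lang in foreign_languages)


def polyglot_search_value(search_frequencies):
    """Sum the search frequency (1-5 scale) over every foreign search language."""
    return sum(search_frequencies.values())


# Hypothetical participant with two foreign languages.
participant = {
    "foreign_languages": [
        {"name": "French", "reading": 5, "writing": 4},
        {"name": "German", "reading": 3, "writing": 2},
    ],
    "search_frequencies": {"French": 3, "German": 1},
}

pv = polyglot_value(participant["foreign_languages"])            # 4.5 + 2.5
psv = polyglot_search_value(participant["search_frequencies"])   # 3 + 1
```

Under this reading, a monolingual participant scores 0 on both variables, and each additional foreign language can only raise the polyglot value, so the score reflects both the number of languages known and the proficiency in each.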
4. Results
4.1. Description of the Sample
A total of 514 responses to the questionnaire were obtained. Of the respondents, 92%
belonged to a university community (80% from the University of Sheffield):
undergraduate (31.5%) and postgraduate (23%) students, research and academic staff
(20%) and administrative staff (18%); the remaining 8% were not associated with
academia. Table 1 shows the fields of knowledge represented by the sample, which span
arts and humanities, social sciences, science and technology, and health and life
sciences.
Field                                                              Count   Percent
Medicine, Dentistry                                                50      9.7%
Art, History, Archaeology, Philosophy, Literature, Librarianship   49      9.5%
Social Science, Anthropology, Politics, International Relations    49      9.5%
Applied Physics, Chemistry, Geology, Computer Science              46      8.9%
Engineering                                                        43      8.4%
Languages, Linguistics, Translation, Interpretation                41      8%
Biology                                                            41      8%
Mathematics, Theoretical Physics, Theoretical Informatics          22      4.3%
Economics, Business Management, Tourism                            20      3.9%
Psychology, Cognitive Science                                      17      3.3%
Other                                                              135     26.5%
Table 1. Fields of knowledge represented in the sample.
Thirty-two native languages are represented (58 in total including foreign
languages) with English the dominant language (71.7%), followed by Spanish (10.2%),
German (4.1%), French (2.5%) and other languages (11.5%). The skew towards English
and Spanish native languages reflects the sample populations (a UK and Spanish
University). Participants exhibit a wide range of languages known and used for searching
including those from Europe (e.g. Ancient Greek, Basque, Dutch, Polish and Welsh),
Asia (Cantonese, Tamil, Urdu and Vietnamese), the Middle-East (e.g. Persian, Hebrew
and Arabic), Africa (e.g. Afrikaans, Zulu and Swahili) and South America (e.g.
Papiamentu and Quechua).
The language skills of participants vary widely: 36% are monolingual native
English speakers, 35% are native English speakers with knowledge of at least one other
foreign language; 28% are non-English speakers with knowledge of more than one
language and 0.4% (2 respondents) are monolingual Spanish speakers (this low result is
due to the requirement that a foreign language is known for university entry in Spain).
When searching online, participants can be classified in three main groups according to
their language profile and their search language 1 (SL1): (1) native English speakers
searching mainly in English (60%); (2) speakers of other languages searching mainly in
their native language (23%) and (3) speakers of other languages searching mainly in
English (17%).
When the search language 2 (SL2) is taken into account, the groups subdivide as
illustrated in Table 2 (see footnote 4). Results show that 34.4% of participants search in at least two
languages (the grey areas) and 62.2% search in their native language only. Also, when
more search languages are considered, results show that 15% of the sample search in at
least three languages.
Group                                        Total   Second search language   Count
Native English speakers                      365     No SL2                   306
                                                     SL2                      59
Non-English native speakers (SL1≠English)    52      No SL2                   7
                                                     SL2 = English            41
                                                     SL2 ≠ English            4
Non-English native speakers (SL1=English)    86      No SL2                   17
                                                     SL2 = L1                 61
                                                     SL2 ≠ L1                 8
Table 2. Categories of respondents by native language (L1), first (SL1) and second (SL2)
search language.
4 Not all respondents specified a second search language and this accounts for the difference in total responses and the numbers in Table 2.
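The percentages quoted for Table 2 can be reproduced from its counts. The short Python check below is an illustrative sketch: the dictionary layout and variable names are assumptions, and "native language only" is taken to mean respondents with no SL2 whose first search language is their native language.

```python
# Counts copied from Table 2; keys use ASCII "!=" in place of the table's symbol.
table2 = {
    "Native English speakers": {"No SL2": 306, "SL2": 59},
    "Non-English native speakers (SL1 != English)": {
        "No SL2": 7, "SL2 = English": 41, "SL2 != English": 4,
    },
    "Non-English native speakers (SL1 = English)": {
        "No SL2": 17, "SL2 = L1": 61, "SL2 != L1": 8,
    },
}

# Respondents covered by the table (fewer than the 514 questionnaire responses).
total = sum(sum(group.values()) for group in table2.values())

# Anyone who named a second search language searches in at least two languages.
two_or_more = sum(
    count
    for group in table2.values()
    for label, count in group.items()
    if label != "No SL2"
)

# Native language only: no SL2 and SL1 equal to the native language,
# i.e. the first two groups' "No SL2" cells.
native_only = (
    table2["Native English speakers"]["No SL2"]
    + table2["Non-English native speakers (SL1 != English)"]["No SL2"]
)

pct_two_or_more = round(100 * two_or_more / total, 1)
pct_native_only = round(100 * native_only / total, 1)
```

With these counts the table covers 503 respondents, of whom 173 search in at least two languages and 313 search in their native language only, which matches the 34.4% and 62.2% reported in the text.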
The following points summarise the reasons participants provided to explain their
choice of language for searching:
• Their native language is easier for searching or is the only one they can
understand;
• They perceive that a significant part of the information in their field of expertise is
in English or they find more information in English in general;
• They search in the language of the university where they study/work;
• They search in several languages to widen the coverage of the search;
• They search in the local language if they want to find local news, cultural,
historical, linguistic, tourist and geographical information of particular countries;
• Translators and other language-related professionals search in several languages;
• They do not search in English when they want to find information in certain
fields, like Archaeology, Museums (French), Philosophy (German and French),
etc.
4.2. Field of Knowledge and Language Abilities
A wide variety of languages are represented by the respondents (as described in Section
4.1). The “polyglot value” and “polyglot search value” for each participant were
calculated to reflect their language abilities and frequency of use of other languages for
searching online respectively. An interesting question in this particular context (a
university setting), is whether the field of knowledge or study is also related to (or has an
impact on) language skills/abilities. Table 3 shows the mean of these values across the
most represented fields of knowledge in the sample.
The results from Table 3 show, as one might expect, clear differences between the
language skills exhibited by members of different academic groups: in the upper rows the
language-related fields, social sciences, anthropology, politics and international relations
(on average) have members who are more proficient and search more frequently for
information in multiple languages. Respondents from language-related fields typically
know at least three languages and have a proficiency or bilingual level in one of them.
This contrasts with fields (in the lower rows of Table 3) such as medicine, dentistry and
biology: these typically represent people with intermediate levels in one foreign language
and rarely use more than one language to search online.
Field                                                              Mean of polyglot value   Mean of polyglot search value
Languages, Linguistics, Translation, Interpretation                11.19                    4.87
Social Science, Anthropology, Politics, International Relations    7.51                     2.98
Engineering                                                        4.94                     0.88
Art, History, Archaeology, Philosophy, Literature, Librarianship   4.93                     1.94
Applied Physics, Chemistry, Geology, Computer Science              4.29                     1.02
Medicine, Dentistry                                                3.91                     0.28
Biology                                                            3.59                     0.64
Table 3. Fields of knowledge and “polyglot value” and “polyglot search value”
Using the Pearson correlation coefficient, a significant correlation was found between
the polyglot values and polyglot search values for language-related fields (r=0.717;
p<0.01) and Social Science, Anthropology, Politics and International Relations (r=0.795;
p<0.01) indicating that, on average, having a greater proficiency in multiple foreign
languages is likely to result in more frequent searches in multiple languages (for these
subject areas).
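For readers unfamiliar with the statistic, the Pearson coefficient r between two paired variables can be computed as in the Python sketch below. The per-participant scores are invented for illustration (the study's raw data are not reproduced here), and the sketch computes r only, not the significance level p.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a sample is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical polyglot values and polyglot search values for five participants.
polyglot_values = [2.0, 4.5, 7.0, 9.5, 12.0]
polyglot_search_values = [0, 1, 3, 4, 6]

r = pearson_r(polyglot_values, polyglot_search_values)
```

A value of r close to 1 means that participants with higher polyglot values also tend to have higher polyglot search values; as with any correlation, it does not by itself show that proficiency causes more frequent multilingual searching.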
4.3. Potential for Cross-Language Information Retrieval
In the questionnaire, respondents were specifically asked whether having CLIR – being
able to type a query in one language to find information written in other languages -
would be useful to them. In total, 54% stated CLIR would be helpful; 23% not helpful
and 23% did not know (or comment). Table 4 shows the respondents’ specified interest in
CLIR (positive responses) based on their field of knowledge/study. Similar to the results
in Section 4.2, the potential utility of CLIR technologies in a university setting may vary
depending on the field of knowledge/study. This is interesting as it highlights the fields in
which to focus and develop multilingual information access.
Field                                                              Respondents interested in CLIR
Languages, Linguistics, Translation, Interpretation                91%
Art, History, Archaeology, Philosophy, Literature, Librarianship   79%
Social Science, Anthropology, Politics, International Relations    77%
Engineering                                                        70%
Applied Physics, Chemistry, Geology, Computer Science              64%
Biology                                                            62%
Medicine, Dentistry                                                61%
Table 4. Interest in CLIR by field of knowledge.
In addition to field of knowledge, Table 5 (grey areas highlighting participants
searching in at least two languages) shows that language ability may also have an impact
on who would most benefit from cross-language search. This highlights potential groups
of users who may benefit more than others in having access to cross-language search.
Results indicate that CLIR functionality may be most helpful in supporting non-English
speakers that search in their native language with English as a second choice, and native
English speakers who search in languages other than English. As expected, CLIR is not
useful to those who do not wish to search across multiple languages (e.g. native English
speakers searching only in English).
Group                                        Total   Second search language (n)   Interested in CLIR
Native English speakers                      268     No SL2 (213)                 60%
                                                     SL2 (55)                     89%
Non-English native speakers (SL1≠English)    52      No SL2 (6)                   100%
                                                     SL2 = English (37)           94.5%
                                                     SL2 ≠ English (4)            75%
Non-English native speakers (SL1=English)    86      No SL2 (12)                  66.6%
                                                     SL2 = L1 (55)                73%
                                                     SL2 ≠ L1 (7)                 100%
Table 5. Interest in CLIR in relation to native language and search language(s).
A qualitative analysis of participants’ answers illustrated the potential
applications of CLIR in assisting with search. For example, in general respondents
reported that CLIR would be most useful to them for the following:
• To widen the coverage of the search and save time by performing “one search per
query instead of one search per language”;
• To include different perspectives from different cultures, to widen the spectrum of
authors and perform resource comparison to have “a picture of the whole”;
• “It is easier to formulate the query in my language (difficult in others) but one
might be interested in finding articles in a different language”.
In addition, specific applications of CLIR may include:
• In language-related fields: “to find translations of terms”, “to find parallel texts
for corpus creation and for translation” and “to discover if an American idea has
diffused into continental linguistics and vice-versa”.
• Arts and Humanities: “Research in my field is published in other European
languages”, “for historical research”, “lots of my research has documents in other
languages and I have to translate them and only then do I realise if they are
relevant” and “it is useful to compare the British situation with other European
Societies”.
• Personal interest: “for tourism”, “to check foreign news”, “to find pictures” and
“to improve reading skills in other languages”.
4.4. Digital Libraries
Overall, 84% of respondents reported using digital libraries as part of their online
activities, with 63% using digital libraries for academic/work purposes and 30% for
personal interest. In total, 97% of postgraduate students used digital libraries, 96% of
research/academic staff and 83% of undergraduate students; administrative staff made the
least use of digital libraries (64%). 13.4% of respondents reported using digital libraries
on a daily basis, 23% three times or less per week, 18% three or less times per month and
10% once every two months. The reasons for using digital libraries varied: 74% of
respondents reported using digital libraries because they wanted digitised versions of
documents, 57% indicated they used digital libraries because they were considered a
reliable source of information, 45% because they wanted to locate items in a physical
library, and 25% because they wanted to access digitised versions of historical documents
where the original is not publicly accessible.
Respondents provided the names of 88 digital libraries they used. The most
popular were the University of Sheffield Library5 (as expected), JSTOR6, ISI Web of
Knowledge7 (including Web of Science) and PubMed8. Other popular digital libraries
used included national libraries (e.g. The British Library, The Library of Congress,
Deutsche Nationalbibliothek and Bibliothèque nationale de France) and online libraries of
publishers (e.g. Elsevier, Nature, Wiley, SAGE journals online, Kluwer Law Library and
Emerald). Regarding the use of search functionality provided by digital libraries,
respondents reported using: search by key words (70.4%), search by author (63.4%),
search by title (58.3%), search by topic (58%), browse by subject (20.4%), browse by
related items (12%), browse by collection (5.5%). To overcome the language difficulties
when searching in multilingual digital libraries, participants selected one or more features
from Table 6.
Feature                                             No. of respondents
Online dictionary                                   183
List of suggested terms related to the query        146
Machine translation of the documents found          129
List of documents related to those found relevant   127
Machine translation of the query                    114
Table 6. Most popular tools to cross the language barrier.
5 http://www.shef.ac.uk/library/ [site visited: 22/6/09]
6 http://www.jstor.org/ [site visited: 22/6/09]
7 http://apps.isiknowledge.com/ [site visited: 22/6/09]
8 http://www.ncbi.nlm.nih.gov/pubmed/ [site visited: 22/6/09]
The features in Table 6 are commonly found features of CLIR and include
functionalities to translate the users’ query (using machine translation or bilingual
dictionaries), improve the query (query expansion) and translate retrieved documents.
The results in Section 4.3 indicate that the functionality required by users is likely
influenced by their language abilities and field of knowledge (we found in previous work
that language ability had a direct impact on the provision of functionalities for
multilingual search (Marlow et al., 2008)).
When investigating further the digital libraries used by respondents, it was
noticeable that very few offered any kind of cross-language support beyond the
localisation of selected pages of content. To explore this further, we carried out a
“competitor analysis” of the two digital libraries most widely used by participants, taken
as examples of multilingual digital libraries: JSTOR and Web of Knowledge. In particular
an informal features analysis, which “provides a snapshot of the competition’s services
and features from a customer standpoint” (Goto and Cotler, 2005, p. 260) was carried out
to compare the sites based on functionality, content and multilingual accessibility.
Interestingly, publications written in English represented approximately 90% of the
content, whilst other languages were clearly underrepresented. At the same time, access
to resources not published in English was not facilitated, either with linguistic tools or
with a search tool that supported several languages (JSTOR offers search in 4
languages in addition to English, but this requires a special syntax). Improvements
could clearly be made to help users access the content in these digital libraries, in
particular non-native English users who must gather material in English as part of their
research or programme of study.
5. Discussion
To fully assess the impact and utility of multilingual digital libraries, studies must be
carried out to investigate the role of language in accessing information online; such
studies are particularly important for developing effective multilingual digital libraries.
The choice of language used to access information is complex and is determined by many
factors; in this study we have focused on two important factors in the context of (mainly)
university communities: fluency in foreign languages and field of knowledge/study.
The first observation to make concerns people’s language skills: it is clear from
our study that in a multicultural and multilingual society the language skills that users
possess can be very complex. This confirms the need for an appropriate framework in
which to capture and analyse language knowledge (e.g. similar to (Laufer & Goldstein,
2004; Marlow et al., 2008; Berendt & Kralisch, 2009)). Statistics about Internet usage by
language typically assume that users search in just one language: the official language
of their country. This view is too simplistic: firstly, there are many countries with more
than one official language and many regional languages that are not considered
“official”; secondly, there are more speakers of English as a second language than native
English speakers (Graddol, 2006), which would indicate that many online users might
search in both English and their native language. This study shows (with no aim of
generalising the numbers) that 34.4% of the sample use at least two languages for
searching online and 17% of participants prefer English to their native language for
searching online.
Our second observation concerns the effect of users’ language skills and
professional/study field on their language choice when searching online. We have found
that users from different fields of study are likely, on average, to have varying language
proficiency and skills, which is likely to affect the utility of multilingual information
access across different academic disciplines. This suggests that even in one context or
search scenario there are likely to be very different use cases, and that more detailed
studies should be conducted within domains as well as across different domains. This
study has shown that within language-related fields, Arts and Humanities and Social
Sciences, as users’ proficiency in foreign languages increases, they are more likely to
search using more languages (no correlation was found in other fields). We also observed
that in some fields, e.g. Computer Science and IT, English is the predominant language
and, therefore, professionals are required to know English in addition to their native
language. This might suggest that digital libraries should be developed for or adapted to
the needs of specific user groups (rather than providing unnecessary functionality for all
users). This aligns with Berendt & Kralisch (2009), who propose deployment strategies
for language tools based on multilingual content, user behaviours, the user group’s
domain expertise, site type and users’ attitudes towards the availability of first-language
content.
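The per-field relationship between proficiency and search behaviour described above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the text here does not say which correlation coefficient was used, so Spearman's rank correlation is assumed; the rows in `responses` are invented values, not questionnaire data; and all function and variable names are hypothetical. No significance test is included.

```python
from collections import defaultdict
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented rows: (field, polyglot value, polyglot search value).
responses = [
    ("Language-related", 2, 1),
    ("Language-related", 5, 4),
    ("Language-related", 8, 6),
    ("Language-related", 9, 9),
    ("Computer Science and IT", 3, 2),
    ("Computer Science and IT", 6, 2),
    ("Computer Science and IT", 4, 3),
    ("Computer Science and IT", 7, 1),
]

# Group the score pairs by field, then correlate within each field.
by_field = defaultdict(list)
for field, polyglot, polyglot_search in responses:
    by_field[field].append((polyglot, polyglot_search))

rho_by_field = {}
for field, pairs in by_field.items():
    xs, ys = zip(*pairs)
    rho_by_field[field] = spearman(xs, ys)
    print(f"{field}: rho = {rho_by_field[field]:.2f}")
```

A rank-based coefficient is used in this sketch because scores of this kind are ordinal; with real data a significance test would also be needed before reading anything into the per-field values.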
Our third observation concerns users’ interest in cross-language information
retrieval. Results from this sample group show that the language-related fields constitute
the best potential group of users, followed by the Arts and Humanities and the Social
Sciences. Non-English speakers who search only in their native language might also be
the best potential group of users, but this group was not statistically significant in this
sample, so conclusions cannot be drawn. Apart from these users, non-English speakers
who search in their native language and in English (as a second choice) are also a
potential group of interest. English speakers who use other languages for searching in
addition to English constitute the third group most interested in CLIR.
It should be noted that translators and language professionals suggested many
applications of CLIR systems and, in general, “to widen the coverage of the search” was
a very popular answer. Some participants also expressed their need for tools that help to
formulate queries in foreign languages they can understand. This finding is supported by
Laufer’s study (1998), which indicates that learners’ passive language skills (listening
and reading) develop earlier than the active ones (writing and speaking). The
consequence of this observation for information search is that a user might be able to
understand documents in a foreign language but be unable to formulate queries in it. To
overcome this difficulty, online dictionaries were the most popular tool among
participants of the questionnaire. Therefore, monolingual users may need help
formulating queries in foreign languages and require document translation, but users with
passive language abilities may not require tools to support document translation.
Our final observation concerns multilingualism and digital libraries. This study
highlights the many opportunities for multilingual information access in digital libraries,
particularly within a university community: many people have multiple language skills or
work in areas that require searching for content in multiple languages. However, a brief
analysis of two commonly used digital libraries (JSTOR and ISI Web of Knowledge)
highlights two issues: that content is predominantly in English (as found in other areas of
the Web (Berendt & Kralisch, 2009)), and that the provision for multilingual search and
browse is minimal (especially with regard to cross-language search). This would not
seem to support the notion of making content accessible to all, especially as
previous research has shown that users prefer to search in languages they are familiar
with (Berendt & Kralisch, 2009), and prefer to use documents written in languages they
can read (Michos et al., 1999). The provision of multilingual information access will
involve more than implementing cross-language search; it requires localisation of
existing material and support for multilingual browsing (see, e.g. (Eurescom, 2000)). The
use of CLIR would facilitate access to multilingual information in a digital library, but
before designing such functionality a study of the linguistic profile of the library users is
necessary. In addition, the specific domain or field of knowledge covered by the library
is a factor that determines the utility of CLIR.
6. Conclusions
This study has investigated the effects of language skills and field of knowledge/study on
users’ preferred languages when searching for information online, including in digital
libraries, within a university context. Both quantitative and qualitative data have been
gathered using a questionnaire and 514 responses were obtained from a range of people at
the University of Sheffield (UK) and the Universidad Autónoma de Madrid (Spain).
Respondents have varying degrees of language ability and approximately one third of
the respondents use at least two languages for searching online. Language skills and the
frequency with which multiple languages are searched were found to vary between
different fields of knowledge/study, with a significant correlation between the polyglot
value (indicating proficiency in foreign languages) and the polyglot search value
(indicating the frequency of searching in foreign languages) for language-related fields,
Social Sciences and Arts and Humanities.
The analysis of interest in CLIR by field of knowledge showed that the
language-related fields constitute the best potential group of users, followed by the Arts
and Humanities and Social Sciences. The results also highlight the utility of cross-language
search for specific language groups (e.g. non-native English users searching for English
content). To develop effective multilingual digital libraries, it is clear that further work
should be carried out to better understand the users, their profiles and the context in
which they access digital libraries. Language skills and field of knowledge in a university
community are just two factors within one domain; multiple contexts and factors should
be investigated. It is vital that the development of multilingual digital libraries
focuses as much on the user and tasks as it does on the technical issues involved in
implementing multilingualism.
In future work we plan to study groups that were not statistically significant in this
sample with respect to language ability and search languages (e.g. doctors and biologists
who are monolingual speakers of a language other than English). Adding a parameter
for geographical location would also enrich the study in combination with the language
profile. Finally, a more thorough evaluation of representative digital libraries should be
carried out to provide a more complete view of the state of multilingual access to digital
libraries and potential avenues for future developments.
Acknowledgements
The work reported has been partially supported by the TrebleCLEF Coordination Action,
within FP7 of the European Commission, Theme ICT-1-4-1 Digital Libraries and
Technology Enhanced Learning (Contract 215231). The views and opinions expressed in
this paper reflect those of the authors.
References
Association of Research Libraries (1995). Realizing Digital Libraries. Appendix II:
Definition and Purposes of a Digital Library. Retrieved June 26, 2009, from
http://www.arl.org/resources/pubs/mmproceedings/126mmappen2
Bian, G., & Chen, H. (2000). Cross Language Information Access to Multilingual
Collections on the Internet. Journal of American Society for Information Science &
Technology, Special Issue on Digital Libraries, 51(3), 281-296.
Berendt, B., & Kralisch, A. (2009). A user-centric approach to identifying best
deployment strategies for language tools: the impact of content and access language
on Web user behaviour and attitudes. Information Retrieval, 12(3), 380-399.
Bilal, D., & Bachir, I. (2007). Children’s interaction with cross-cultural and multilingual
digital libraries: I. Understanding interface design representations. Information
Processing and Management, 43, 47–64.
Bilal, D., & Bachir, I. (2007). Children’s interaction with cross-cultural and multilingual
digital libraries II: Information seeking, success, and affective experience.
Information Processing and Management, 43, 65–80.
Borgman, C.L. (1997). Multi-Media, Multi-Cultural and Multi-Lingual Digital Libraries:
or How do we exchange data in 400 languages? D-Lib Magazine. Retrieved June 26,
2009, from http://dlib.ukoln.ac.uk/dlib/june97/06borgman.html
Chen, C. (2007, August). Delivery of Web-based Multilingual Digital Collections and
Services to Multicultural Populations: The Case of Global Memory Net. Paper
presented at IFLA Government Libraries Section in cooperation with the Library
Services to Multicultural Populations Section at the IFLA Meeting in Durban, South
Africa. Retrieved June 26, 2009, from
http://archive.ifla.org/IV/ifla73/papers/097-Chen-en.pdf
Clough, P. (2005). Caption vs. Query Translation for Cross-Language Image Retrieval. In
Peters, C., Clough, P., Gonzalo, J., Jones, G., Kluck, M. and Magnini, B. (Eds.),
Multilingual Information Access for Text, Speech and Images: Results of the Fifth
CLEF Evaluation Campaign (pp. 614-625). Lecture Notes in Computer Science
(LNCS), Springer, Heidelberg, Germany, Volume 3491/2005.
Crane, G. (2006). What do you do with a million books? D-Lib Magazine, 12(3).
Retrieved June 26, 2009, from http://www.dlib.org/dlib/march06/crane/03crane.html
Del Galdo, E.M., & Nielsen, J. (1996). International User Interfaces. New York: John
Wiley & Sons.
Duncker, E. (2002). Cross-Cultural Usability of the Library Metaphor. In Proceedings of
the ACM & IEEE Joint Conference on Digital Libraries (JCDL '02), Portland, OR,
USA.
European Commission Information Society and Media (2006). i2010: Digital Libraries.
Published by the European Communities (Luxembourg: Office for Official
Publications of the European Communities). Retrieved June 26, 2009, from
http://ec.europa.eu/information_society/activities/digital_libraries/doc/brochures/dl_brochure_2006.pdf
Eurescom (2000). Multi-Lingual Web Sites: Best Practice Guidelines and Architecture
(P923). Eurescom project report. Retrieved June 26, 2009, from
http://www.eurescom.de/Public/projectresults/P900-series/923d1.asp
Flammia, M., & Saunders, C. (2007). Language as Power on the Internet. Journal of the
American Society for Information Science and Technology, 58(12), 1899-1903.
Gey, F. C., Kando, N., & Peters, C. (2005). Introduction to Special issue on Cross-
Language Information Retrieval. Information Processing and Management, 41(3),
413-722.
Gey, F. C., Kando, N., Lin, C., & Peters, C. (2006). New directions in multilingual
information access. SIGIR Forum, 40(2), 31-39.
Gibson, M., & Hufeisen, B. (2003). Investigating the role of prior foreign language
knowledge: Translating from an unknown into a known foreign language. In: Cenoz,
J., Hufeisen, B., & Jessner, U. (Eds.), The Multilingual Lexicon (pp. 87-102).
Netherlands: Springer.
Gonzalo, J. (2002). Scenarios for interactive cross-language information retrieval
systems. Paper presented at the SIGIR 2002 Workshop on Cross-Language IR.
Goto, K., & Cotler, E. (2005). Web Re-Design 2.0: Workflow that works (2nd edition).
New Riders.
Graddol, D. (2006). English Next. Published by the British Council. Retrieved June 26,
2009, from http://www.britishcouncil.org/learning-research-english-next.pdf
Halavais, A. (2000). National Borders on the World Wide Web. New Media and Society,
2(1), 7-28.
Jones, G.J.F. (2005). Beyond English Text: Multilingual and Multimedia Information
Retrieval. In Charting a New Course: Natural Language Processing and Information
Retrieval. Essays in Honour of Karen Sparck Jones (pp. 81-98). Netherlands:
Springer.
Kishida, K. (2005). Technical issues of cross-language information retrieval: a review.
Information Processing & Management, 41(3), 433-455.
Kralisch, A. (2005). The impact of culture and language on the use of the internet:
empirical analyses of behaviour and attitudes. Masters dissertation, Humboldt-
Universität zu Berlin.
Kralisch, A., & Mandl, T. (2006). Barriers to Information Access across Languages on
the Internet: Network and Language Effects. In Proceedings of the 39th Hawaii
International Conference on System Sciences-IEEE (pp. 1-10).
Laufer, B. (1998). The development of passive and active vocabulary in a second
language: Same or different? Applied Linguistics, 19, 255–271.
Laufer, B., & Goldstein, Z. (2004). Testing vocabulary knowledge: Size, strength and
computer adaptiveness. Language Learning, 54(3), 399-436.
Marlow, J., Clough, P., Cigarrán Recuero, J., & Artiles, J. (2008). Exploring the Effects
of Language Skills on Multilingual Web Search, In Proceedings of the 30th European
Conference on IR Research (pp. 126-137). Glasgow, UK, April 2008.
McCarley, J. S. (1999). Should we Translate the Documents or the Queries in
Cross-language Information Retrieval? In Proceedings of the 37th Annual Meeting of the
Association for Computational Linguistics (pp. 208-214). College Park, Maryland,
USA.
Michos, S.E., Stamatatos, E., & Fakotakis, N. (1999). Supporting Multilinguality in
Library Automation Systems Using AI Tools. Applied Artificial Intelligence, 13(7),
679-703.
Mikami, Y., & Suzuki, I. (2004). The Language Observatory Project and its Experiment:
Cyber Census Survey. Published by the European Language Resources Association.
Retrieved June 26, 2009, from
http://www.elda.org/en/proj/scalla/SCALLA2004/mikami.pdf
Miniwatts Marketing Group (2008). Internet World Statistics. Retrieved June 26, 2009,
from http://www.internetworldstats.com/
Oard, D. W. (1997). Serving Users in Many Languages: Cross-language retrieval for
digital libraries. D-Lib Magazine. Retrieved June 26, 2009, from
http://dlib.ukoln.ac.uk/dlib/december97/oard/12oard.html
Oard, D.W., & Diekema, A. (1998). Cross-Language Information Retrieval. Annual
Review of Information Science and Technology, 33, 223-256.
Oard, D., & Gonzalo, J. (2002). The CLEF 2001 Interactive Track. Evaluation of Cross-
Language Information Retrieval Systems (pp. 308-319). Netherlands: Springer-Verlag
LNCS 2406.
Oard, D., He, D., & Wang, J. (2008). User-assisted query translation for interactive cross-
language information retrieval. Information Processing & Management, 44(1), 181-
211.
Paolillo, J. (2005). Language Diversity on the Internet. Measuring Linguistic Diversity,
43-89. UNESCO Publications for the World Summit on the Information Society.
Retrieved June 26, 2009, from
http://unesdoc.unesco.org/images/0014/001421/142186e.pdf
Pavani, A.M.B. (2001). A model of multilingual digital library. Ci. Inf., Brasília, 30(3),
73-81.
Peters, C., & Picchi, E. (1997). Across Languages, Across Cultures: Issues in
Multilinguality and Digital Libraries. D-Lib Magazine. Retrieved June 26, 2009, from
http://dlib.ukoln.ac.uk/dlib/may97/peters/05peters.html
Peters, C., & Sheridan, P. (2001). Multilingual information access. In M. Agosti, F.
Crestani, and G. Pasi (Eds.), Lectures on information Retrieval (pp. 51-80). New
York: Springer LNCS 1980.
Peters, C., & Braschler, M. (2004). Introduction to Special Issue on CLEF. Information
Retrieval, 7(1-2).
Petrelli, D., Levin, S., Beaulieu, M., & Sanderson, M. (2006). Which User Interaction for
Cross-Language Information Retrieval? Design issues and Reflections. Journal of the
American Society for Information Science and Technology, 57(5), 709-722.
Ringbom, H. (2007). Cross-linguistic similarity in foreign language learning. Clevedon,
UK: Multilingual Matters.
Robins, D. (2000). Interactive Information Retrieval: Context and basic Notions.
Informing Science, 3(2), 57-61. Retrieved June 26, 2009, from
http://inform.nu/Articles/Vol3/v3n2p57-62.pdf
Savoy, J. (2004). Combining Multiple Strategies for Effective Monolingual and Cross-
Language Retrieval. Information Retrieval, 7(1-2), 121-148.
The Unesco’s Recommendation concerning the Promotion and Use of Multilingualism
and Universal Access to Cyberspace (2003). Retrieved June 26, 2009, from
http://portal.unesco.org/ci/en/files/13475/10697584791Recommendation-Eng.pdf/Recommendation-Eng.pdf
Voorhees, E.M., & Harman, D. (2000). Overview of the sixth text retrieval conference
(TREC-6). Information Processing & Management, 36(1), 3-35.
Yang, C., & Lam, W. (2006). Introduction to Special issue on Multilingual Information
Systems. Journal of the American Society for Information Science and Technology,
57(5).
Yunker, J. (2003). Beyond borders - Web globalization strategies. Indianapolis, IN: New
Riders Publishing.
Biographies
Dr. Paul Clough is a lecturer in Information Systems in the Department of Information
Studies, University of Sheffield (UK). He received his BEng from the University of York
in Computer Science while also working for British Telecommunications Plc. as a
software engineer. He received his PhD from the Department of Computer Science,
University of Sheffield and has since worked as a researcher on a range of language
engineering and information access projects. Clough is a member of the Information
Retrieval (IR) group and his core research interests are information retrieval
(Geographical IR, multimedia IR and evaluation), computational text analysis (plagiarism
detection, authorship attribution and creating corpora) and human-computer interaction.
He has over 60 peer-reviewed publications in his research area and a US patent for an
information management system.