


ELSEVIER Computer Communications 20 (1997) 927-936

Standard specification extensions for provision of language and character enabled server

Borka Jerman-Blažič (a,*), Konstantin Chuguev (b), Andrej Gogala (a)

(a) Jožef Stefan Institute¹, Ljubljana, Slovenia

(b) Chelyabinsk Technical University², South Ural Regional Center of FREEnet, Chelyabinsk, Russia

Abstract

Currently, the major problems for content providers, especially in Central and Eastern Europe, are the multitude of character sets in use and the proper presentation of content to users outside this area. The problem can be reduced with extensions to the existing standards specifying the WWW service. This paper describes two such solutions, one already in active use in Russia (the FREEnet network) and the other being developed within the EU-funded Multilingual Application Interface for Telematic Services (MAITS) project. The new RFC 2070 for internationalization of HTML, which introduces the Unicode coded character set as the document character set, solves many of the problems related to language and culture support. However, legacy coded character sets will continue to exist on hosts connected to the Internet. The solutions described in this paper provide a smooth migration path to a fully Unicode-supported WWW server, with easy use of documents and information prepared in different languages. © 1997 Elsevier Science B.V.

Keywords: MAITS; RFC 2070; HTML; WWW server

1. Introduction

The globalization of the Internet is leading to a number of new requirements. These include the need for multimedia communication containing textual components across nationalities and cultures. Images and sounds in multimedia services help and provide user friendliness, but the textual component still represents the core of such services. Human beings are used to communicating in their native language, either orally or in writing. Ideally, each individual should be able to work or play efficiently, effectively and easily with remote users via the Internet. In this context, the World Wide Web is certainly a ubiquitous Internet service that urgently needs a solution to the problems of multilinguality. The basic issue in providing multilinguality is the proper representation and processing of the coded character sets in use. Coded character sets are ISO standards or manufacturer-developed code tables that enable the presentation and communication of textual data in natural languages.

* Corresponding author.
¹ Supported by Slovenian Ministry of Science and Technology.
² Supported by Open Society Institute and Russian Foundation for Basic Research.

0140-3664/97/$17.00 © 1997 Elsevier Science B.V. All rights reserved
PII S0140-3664(97)00085-6

WWW is a service provided through the implementation of two Internet standards: the HTTP protocol for communication between the client and the server [5], and the HTML language for representation of WWW multimedia documents [6]. In the first version of the WWW service, the HTTP protocol offered no possibility for the server and the client to negotiate the content of the exchanged data, and the HTML standard restricted the character set of HTML documents to the ISO eight-bit coded character set known as ISO 8859-1 [1]. The repertoire of this code table is restricted to the alphabets used in Western European languages. Very soon after the appearance of the WWW, many nationalities, mainly from Central and Eastern Europe and Asia, faced the problem of properly representing text documents and home pages in their own languages. Recently the major drawbacks of the first WWW-oriented Internet standards were removed, and the first applications followed. Negotiation of the content of the exchanged data is part of the new HTTP standard [12], and the internationalization of HTML is defined in an RFC document [13] that specifies as the document character set the Universal Coded Character Set standard ISO 10646, the Basic Multilingual Plane [4], which is identical in content to the Unicode Consortium [3] coded character set standard known as Unicode 1.1. The Unicode repertoire caters for all known written languages in the world. Every character is coded with 16 bits, which allows about 65 000 (2^16 = 65 536) different characters to be coded.

In this paper we present an extension of the existing standard specifications [5] that enables proper handling of the different character sets used in different regions of Europe. These character sets are seven- or eight-bit coded, and are defined by the International Organization for Standardization (ISO), by manufacturers, or by national standardization bodies, as with the Russian eight-bit coded character set KOI-8. The proposed extension solves many problems related to legacy systems and legacy software based on eight-bit coding, which will remain in use in Europe until good, user-friendly tools (browsers, editors and other input/output tools) for the full Unicode repertoire appear. Despite the universal solution provided by Unicode, its wide implementation in Internet services will certainly be delayed, and for that reason the solution presented in this paper represents a migration path to fully Unicode-supported WWW services. Another argument in favor of the proposed extension is that a universal character set is needed only when information has to be provided in several languages in the same document (dictionary pages, foreign language manuals, etc.). Moreover, keeping monolingual documents in an eight-bit character set allows them to be stored on a server more compactly. That is why the server has to support not only a universal character set, but also the eight-bit ones used today.

[Figure: home page of the Government Centre for Informatics of the Republic of Slovenia, with the same text in Slovene and in English.]

Fig. 1. A typical Slovenian server.

2. Description of the problem

In Central and Eastern Europe today there is a multitude of coded character sets in use. For example, in Slovenia there are four coded character sets in use (ISO 8859/2, CP 852, MS 1250 and the national version of ISO 646 IRV), although Slovene uses only three characters that are not present in the English alphabet (US-ASCII). So, a typical WWW server with pages in Slovene usually offers its documents in four ways. The four versions are not stored on the server; the appropriate coding is produced by on-the-fly conversion in CGI scripts. The user selects the preferred encoding, or the only possible one (the one present on his equipment), on the first page, and then follows the entire tree using the same encoding. This requires the user to know which character sets are supported by his browser, and to select the right one. For an illustration of the approach see Fig. 1.

This solution has two major drawbacks:

• since conversion is usually done with CGI scripts, there is a performance cost;

• it assumes a knowledgeable user.
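
As an illustration, the on-the-fly conversion described above can be sketched in a few lines of Python. This is a minimal sketch and not the CGI scripts used on the Slovenian servers: the storage character set, the `SUPPORTED` table and the function name `convert_page` are assumptions made for the example, and the national ISO 646 variant is left out because Python has no built-in codec for it.

```python
# Hypothetical sketch: one stored copy of a page, re-encoded per request.
STORED_CHARSET = "iso8859_2"  # assumption: pages are kept in ISO 8859-2

# Charset label shown to the user -> Python codec name.
SUPPORTED = {
    "iso-8859-2": "iso8859_2",
    "cp852": "cp852",
    "windows-1250": "cp1250",
}


def convert_page(stored: bytes, requested: str) -> bytes:
    """Re-encode a stored page into the character set the user selected."""
    codec = SUPPORTED[requested.lower()]
    text = stored.decode(STORED_CHARSET)
    return text.encode(codec)


page = "Državne ustanove na strežniku".encode(STORED_CHARSET)
for name in SUPPORTED:
    # The same text yields different byte sequences in each character set.
    print(name, convert_page(page, name).hex(" "))
```

The conversion runs on every request, which is the performance cost noted in the first drawback, and the `requested` value still has to come from a choice the user makes.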


Fig. 2. Conversions at levels 0 and 1.

In Russia, the Russian language is written in Cyrillic script. Currently five different coded character sets are used in Russia:

• CP866 (also known as the DOS alternative Cyrillic charset);

• CP1251: the MS Windows Cyrillic character set;

• KOI8-R [11]: a national eight-bit standard, the de facto standard on Unix systems and in the Russian part of the Internet;

• ISO-8859-5 [2]: the only Cyrillic charset supported in MIME and by large Unix manufacturers such as Sun Microsystems;

• MacOS Cyrillic: used on Macintosh computers.

The solution presented in this paper helps Russian users to read and prepare documents in both Russian and English, and was developed by the Russian academic network FREEnet.
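
To make the problem concrete, the following Python sketch encodes one Russian word in the five character sets listed above. It is only an illustration: the sample word and the variable names are chosen for the example, and the codec names are those of the Python standard library rather than anything defined in this paper.

```python
# Label used in the text -> Python standard-library codec name.
CHARSETS = {
    "CP866": "cp866",
    "CP1251": "cp1251",
    "KOI8-R": "koi8_r",
    "ISO-8859-5": "iso8859_5",
    "MacOS Cyrillic": "mac_cyrillic",
}

word = "Россия"  # sample word chosen for the illustration
encoded = {label: word.encode(codec) for label, codec in CHARSETS.items()}
for label, data in encoded.items():
    print(f"{label:15} {data.hex(' ')}")

# A client that assumes the wrong character set sees unreadable text:
# here KOI8-R bytes are interpreted as if they were CP1251.
garbled = encoded["KOI8-R"].decode("cp1251")
print(garbled)
```

Each character set produces a different byte sequence for the same word, so a server that sends bytes without knowing what the client expects will often deliver unreadable text.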

3. Evaluation of different options for provision of character and language enabled server

3.1. WWW specification requirements

FREEnet is the Russian network for research, education and engineering. One of the main tasks of the network is to create an infrastructure of services and to disseminate information for Russian scientific and educational organizations. The World Wide Web is the most widespread and one of the most powerful tools for creating an information infrastructure. The key part of the FREEnet dissemination service is the WWW server. The most popular software used in this service is the Apache WWW server, which is open to modification and can be tuned to the needs of a particular network, in contrast to WWW client software, which is much less flexible.

Most of the FREEnet servers are Unix machines. Therefore, FREEnet decided first to specify and implement the requirements for a WWW server.

One of the main requirements for a WWW server is the ability to keep versions of the same document in different languages. This makes information resources available to both Russian and foreign users. The second requirement is that documents can be prepared in the native coded character sets, which makes the documents easy to use for users with legacy and inflexible browser architectures. A cost-effective server solution implies that documents are kept in one character set for each language and that the conversion to the coding requested by the client is done on-the-fly.

In developing the solution, the limitations of the network itself were also taken into account. The capacity of the FREEnet communication lines is insufficient, so optimization of the information streams on the network infrastructure was also considered. For the WWW this means employing caching proxy servers, which are in turn part of the common information infrastructure. This situation is also typical of many European networks, which are currently working together on the development of a common hierarchical cache structure. This consideration gave rise to another requirement: the server should work with several languages and character sets through caching proxies.

The next requirement arose from the diversity of browser technology in use: the server should support the language and character set of an end user even when the WWW client software lacks negotiation capabilities. This is important for users who are not using one of the following browsers: Netscape Navigator, Microsoft Internet Explorer or Alis Tango (because of insufficient hardware resources, lack of on-line access, etc.).

A very useful feature for a multilingual WWW server is that the structure and content of HTML documents are independent of the number of languages and character sets supported. For a character and language enabled server, this implies that existing Web pages need not change when new character sets or languages are added. As a consequence, if a document version in a particular language is not available, the user can continue navigating through the server simply by choosing to view the document in any other available language.

3.2. Options for language specifications

After identifying the requirements, we can explore the options available within the framework of the existing standards.

According to the existing standards, WWW servers can use different methods to let the user choose the language.

The first is the language negotiation mechanism defined in HTTP/1.1 [12]: the client sends the Accept-Language field with a list of acceptable languages, and the server returns the name of the chosen language in the Content-Language header field of the response. The standard does not explicitly specify how the server decides which language is chosen.

This solution enables servers to use the same reference (URL [7]) for all language versions of the same document. The value sent in the Accept-Language field can be set in the browser just before the requested document is loaded. This approach has some advantages:

• the document hierarchy and file structure on the server do not depend on the languages supported by the server; in other words, the presence of different versions of a document has no influence on the file structure. A user can change the document language at any place in the hierarchy, not only on the main page. After requesting another language version of a document, the user remains at the same point of the file hierarchy and can continue navigating from there;

• there is no information about the language in hypertext references to other documents. The server determines the language every time it receives a request from the browser. If the document version in the requested language is missing, the list of all available language versions of the document is presented to the user.

Although this method has many advantages, not all client and server software available on the networks in some regions of Europe supports it. Furthermore, proxy servers cache documents by URL only, ignoring the Content-Language field in the header, which makes this approach inapplicable to Web services based on caching proxies.
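
The negotiation step can be sketched as follows. Since the standard leaves the server's selection rule open, the rule used here (take the acceptable language with the highest q-value that is available, otherwise report the available versions) is only one possible choice, and the function names `parse_accept` and `negotiate_language` are hypothetical.

```python
def parse_accept(header: str) -> list[str]:
    """Return the tags of an Accept-* header, ordered by descending q-value."""
    weighted = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        if q > 0:
            weighted.append((q, parts[0].lower()))
    # sorted() is stable, so equal q-values keep their header order.
    return [tag for q, tag in sorted(weighted, key=lambda w: -w[0])]


def negotiate_language(accept_language: str, available: list[str]) -> dict:
    """Pick a language version, or list the available ones if none matches."""
    for tag in parse_accept(accept_language):
        if tag in available:
            return {"Content-Language": tag}
    return {"available": available}


print(negotiate_language("ru, en;q=0.8", ["en", "ru"]))
print(negotiate_language("fr", ["en", "ru"]))
```

Because the URL is the same for every language version, a proxy that caches by URL alone cannot tell the responses apart, which is the limitation described above.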

The other approach specified in the standard is language specification within the URL. Evaluation of this option has shown that it can be implemented in five different ways:

(a) explicit specification of the language in the URL

In this case the top-level directories contain information about the available documents together with the languages used in the documents. Examples follow:

<http://www.server.org/english/info/main.html>
<http://www.server.org/russian/info/main.html>
<http://www.server.org/french/info/main.html>

The shortcoming of this method is the difficulty of providing flexibility for language change. The user is tied to the place in the server's document hierarchy of the document he/she is reading. Navigation on the server is realized, as a rule, by means of "relative references", which enable a change of language only for documents situated near the root (top level) of the server document hierarchy.

Since language names are in the root, a user who wants to see another version of the document needs to know the absolute reference of that version. This makes the browsing of the Web server sub-trees cumbersome.

(b) each directory has sub-directories structured according to the language used

In this case the specification of the language comes after the specification of the directory, e.g. "info":

<http://www.server.org/info/english/main.html>
<http://www.server.org/info/russian/main.html>
<http://www.server.org/info/french/main.html>

(c) all language versions of a document are kept in a separate directory

In this case the specification of the language comes after the specification of the sub-directory, e.g.:

<http://www.server.org/info/main/en.html>
<http://www.server.org/info/main/ru.html>
<http://www.server.org/info/main/fr.html>

(d) language name is in the file suffix

In this case the specification of the language comes in the file suffix, like:

<http://www.server.org/info/main.en.html>
<http://www.server.org/info/main.ru.html>
<http://www.server.org/info/main.fr.html>

(e) language name is in the suffix but with another suffix order

In this case the specification of the language is in the file suffix, but in another order:

<http://www.server.org/info/main.html.en>
<http://www.server.org/info/main.html.ru>
<http://www.server.org/info/main.html.fr>

The advantage of all five methods presented above is that they allow navigation through caching proxy servers. The shortcoming is the difficulty of providing references from one language version of a document to the other language versions, because all versions need to be updated when a new language or new document version is added.
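
The difference between the five schemes shows up in the relative reference needed to reach another language version of the same document. The Python sketch below builds the example URLs for two languages and computes that reference; the helper names `variant_urls` and `relative_link` are hypothetical, and the host name and paths are simply the ones used in the examples above.

```python
import posixpath
from urllib.parse import urlsplit

BASE = "http://www.server.org"


def variant_urls(lang: str, short: str) -> dict:
    """URLs of info/main.html in one language under schemes (a) to (e)."""
    return {
        "a": f"{BASE}/{lang}/info/main.html",
        "b": f"{BASE}/info/{lang}/main.html",
        "c": f"{BASE}/info/main/{short}.html",
        "d": f"{BASE}/info/main.{short}.html",
        "e": f"{BASE}/info/main.html.{short}",
    }


def relative_link(from_url: str, to_url: str) -> str:
    """Relative reference from the document at from_url to to_url (same host)."""
    source_dir = posixpath.dirname(urlsplit(from_url).path)
    return posixpath.relpath(urlsplit(to_url).path, start=source_dir)


english = variant_urls("english", "en")
russian = variant_urls("russian", "ru")
for scheme in "abcde":
    print(scheme, relative_link(english[scheme], russian[scheme]))
```

Under scheme (a) the reference has to climb back to the root of the hierarchy, and it grows with the depth of the document, whereas under schemes (c), (d) and (e) it is a file name in the same directory. In every scheme the link must still be written into each language version by hand, which is the maintenance cost noted above.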

3.3. Options for character set specification in the server

After exploring the options available for language specification, we now look at the current options for specifying the coded character set.

The choice of which coded character set will be provided to the client depends on the server capabilities. There are several methods: some use the negotiation facilities, if they exist, and some use an exact specification of the coded character set in the URL. The most attractive is certainly negotiation. Two different approaches can be implemented.

In the first, the client sends a list of preferred coded character sets in the Accept-Charset field and the server answers with the character set name written in the Content-Type field, like:

Content-Type: text/html; charset=iso-8859-5

Usually, the same URL is used for different character set versions of the document; that is, the URL does not contain information about the coded character set. The value sent in the Accept-Charset field is specified according to the browser options, and the information is provided just before the document is loaded. At present this method is supported by an even smaller number of software packages than the language negotiation method, and by none of the existing caching proxy servers. This method has the same advantages and disadvantages as the language negotiation method.
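
A minimal Python sketch of this exchange is given below. It assumes that the document is stored in KOI8-R and that the server takes the first character set in the Accept-Charset list that it can produce, ignoring q-values for brevity; the `CODECS` table and the function name `respond` are hypothetical.

```python
STORED = "koi8_r"  # assumption: Russian documents are stored in KOI8-R

# MIME charset name -> Python codec name.
CODECS = {
    "koi8-r": "koi8_r",
    "iso-8859-5": "iso8859_5",
    "windows-1251": "cp1251",
}


def respond(stored_body: bytes, accept_charset: str):
    """Return (headers, body) in the first requested charset we can produce."""
    for item in accept_charset.split(","):
        name = item.split(";")[0].strip().lower()
        if name in CODECS:
            text = stored_body.decode(STORED)
            # Characters missing from the target set are replaced by "?".
            body = text.encode(CODECS[name], errors="replace")
            return {"Content-Type": f"text/html; charset={name}"}, body
    # No acceptable charset is known: send the document as stored.
    return {"Content-Type": "text/html; charset=koi8-r"}, stored_body


headers, body = respond("Привет".encode(STORED), "iso-8859-5, koi8-r;q=0.5")
print(headers["Content-Type"])
```

The response header names the character set actually used, so a client that supports the mechanism can decode the body without any further choice by the user.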

Let’s see the possible extensions of the existing standard negotiation methods:


We have explored three of them. They can work:

(a) by means of the content type negotiation

The client sends the Accept field in the request header to the server, and indicates a "pseudotype" among its preferred content types. A pseudotype is defined as the most frequently used type of coded character set on a particular browser (e.g. text/x-charset-koi8). The preferred character set is defined within the pseudotype definition. Many client programs do not allow the Accept field to be set, which makes this method impossible to use as a general solution.

(b) by guessing the preferred coded character set of the client through information about the client's operating system

Usually, coded character set names can be linked to a specific language if the region of use is known. This is true only for large countries, such as Russia, where the use of MS 1251 more or less implies Russian, or France, where ISO 8859/1 implies French, or for some legacy coded character sets such as the national versions of ISO 646 (ASCII). These legacy character sets are registered in the International Register of escape sequences for coded character sets according to the ISO 2375 standard, which is maintained by ECMA [14].

The server can identify or guess the client's operating system and character set, or some language-related features, by examining the User-Agent field, and provide the corresponding character set. For example, for the Russian language, if the User-Agent field indicates Windows, the server can always send the document in the MS 1251 character set, and in the case of "DOS" the expected document character set is CP 866, or in MIME notation "x-ibm866", for systems used in Russia. Unfortunately, the contents of the User-Agent field do not have a standardized format, and in addition different coded character sets are used on the same operating system in a particular region (e.g. ISO-8859-5 and KOI8-R for Russia on Unix systems). Another difficulty lies in user behavior: the user can use a character set which is not standard for a given language and operating system. We may conclude that this approach is not very effective.

(c) by use of the client's address

The server can keep information about the pair "client host (IP address or domain name), coded character set" per user. In this approach the clients interact with the database containing this information through HTML forms located on the same server. The server is always aware of the IP address of its peer and the character set in use. This is useful if the clients are not accessing the server via proxies; otherwise the server will be aware of the proxy address only. So, this approach is even less applicable than the previous one.

If we look at the method where explicit indication of the character set in the URL is required, we may conclude that this approach also offers three possible solutions:

(a) use of pseudo directories with information about the coded character set

This approach specifies directories with information about coded character sets, e.g.:

<http://www.server.org/iso/info/main.html>
<http://www.server.org/koi/info/main.html>
<http://www.server.org/win/info/main.html>

The content of the directories is generated according to the requested coded character sets. The information about the coded character sets (in the example above, iso = iso-8859-5, koi = koi8-r, win = windows-1251) tells the user that the content of the original document can be converted to content coded in one of the specified coded character sets.

(b) use of the domain names

Though DNS names are short and clear for the end user, this information is of very restricted use for configuring a character enabled server, as it is applicable only to servers that work with a single language. Such an assumption is almost unacceptable, as today all servers work with at least two languages: English and the local language of the region. References that use this type of specification may have the following form:

<http://iso.www.server.org/info/main.html>
<http://koi.www.server.org/info/main.html>
<http://win.www.server.org/info/main.html>

(c) use of different server ports for different charsets

This method is suitable for a server that uses just one language and a few character sets (four to five). The specification will have the following form:

<http://www.server.org:8080/info/main.html>
<http://www.server.org:8081/info/main.html>
<http://www.server.org:8082/info/main.html>

As the above examples show, all methods have certain shortcomings. The best approach for a server is to support more than one method. More methods guarantee a more universal service, compliant with the needs of different users. Considering the current situation, in which very little client software supports the HTTP/1.1 specifications for negotiation, it appears that the optimal solution for a multilingual server supporting multiple coded character sets is a server that supports different character sets and languages by both standard and non-standard negotiation methods, and that also allows the client to use a language or character set specification provided explicitly in the URL.

Such extensions and the implementation of the approach are described in the next section.
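
The three URL-based forms listed in this section can be recognized with a short Python sketch. The prefix table follows the example (iso, koi, win), while the assignment of ports 8080 to 8082 to particular character sets and the function name `charset_from_url` are assumptions made for the illustration.

```python
from urllib.parse import urlsplit

PREFIXES = {"iso": "iso-8859-5", "koi": "koi8-r", "win": "windows-1251"}
# Hypothetical assignment: the text gives the ports but not their charsets.
PORTS = {8080: "iso-8859-5", 8081: "koi8-r", 8082: "windows-1251"}


def charset_from_url(url: str):
    """Return (charset or None, real document path) for a requested URL."""
    parts = urlsplit(url)
    segments = parts.path.lstrip("/").split("/")
    if segments[0] in PREFIXES:  # pseudo directory
        return PREFIXES[segments[0]], "/" + "/".join(segments[1:])
    label = (parts.hostname or "").split(".")[0]
    if label in PREFIXES:  # domain name
        return PREFIXES[label], parts.path
    if parts.port in PORTS:  # server port
        return PORTS[parts.port], parts.path
    return None, parts.path  # fall back to negotiation


for url in (
    "http://www.server.org/koi/info/main.html",
    "http://win.www.server.org/info/main.html",
    "http://www.server.org:8081/info/main.html",
    "http://www.server.org/info/main.html",
):
    print(url, "->", charset_from_url(url))
```

In all three forms the character set is part of the URL, so each variant is cached separately by a proxy, and a request that carries no such indication can still be handled by one of the negotiation methods.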

4. Specification extension for provision of language and character enabled server

4.1. Specification

This section describes the specification of the extension used to operate the character enabled multilingual server. These specifications were part of the MultiWeb project, carried out at the South Ural Regional Center of FREEnet with the support of the Open Society Institute and the Russian Foundation for Basic Research, and in cooperation with the TERENA WG on Internationalization.

The specifications were developed to meet the user and operator requirements specified in Section 3.1.

The MultiWeb system is built from two relatively independent parts:

1. Application Portable Interface (API) for coded character sets conversion.

2. Patches for an existing Web-server.

The API provides character set conversion according to the request issued by the server. It contains three modules:

1. Library module, libcharset. This module is very simple and provides conversion of eight-bit coded character sets only. Experience has shown that the repertoire of these coded character sets is sufficient for provision of character enabled servers for almost all European languages.

2. Character set database. This module contains a description of each coded character set used in the system. Each coded character set is described in the simple format specified in RFC 1345 (for more information see Ref. [8]). This format was chosen because of its simple representation and because it enables representation of every character from the required repertoire. The conversion tables are built by a library function from the specifications of the two character sets, the source and the target. The information about the source and the target allows reconstruction of the original text when this is required; in other words, this approach allows reversible conversion.

3. A set of utilities used for different purposes, such as:

• output of the converted data from one charset to another in various formats (useful, for example, for including a table as an array in C source code);

• conversion from standard input (or any file) to standard output (this utility is used in the FTP server for on-the-fly conversion of Russian text files: see <ftp://ftp.urc.ac.ru/>);

• provision of a conversion filter for transparent coded character set conversion during terminal work, for cases when the text is in one coded character set and the terminal supports another one. The FREEnet center staff uses this utility during telnet sessions from MS DOS or Windows operating systems (with the CP866 and MS 1251 coded character set tables, respectively) to a Unix server with KOI8-R.
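The table-building step described above can be illustrated with a minimal sketch. It assumes that Python's built-in codecs stand in for the RFC 1345 charset descriptions used by the original library; the names build_table and convert are hypothetical and are not part of libcharset.

```python
# Sketch: build a 256-entry byte-to-byte table between two eight-bit
# coded character sets (assumption: Python codecs replace the RFC 1345
# descriptions; build_table/convert are hypothetical names).

def build_table(source: str, target: str, fallback: int = ord("?")) -> bytes:
    """Map every byte of the source charset to a byte of the target charset."""
    table = bytearray(256)
    for code in range(256):
        try:
            char = bytes([code]).decode(source)
            out = char.encode(target)
            table[code] = out[0] if len(out) == 1 else fallback
        except UnicodeError:
            # The character is missing from one of the two repertoires.
            table[code] = fallback
    return bytes(table)


def convert(data: bytes, source: str, target: str) -> bytes:
    """Convert data by a single table lookup per byte."""
    return data.translate(build_table(source, target))


text = "Привет, мир"
koi8 = text.encode("koi8_r")
dos = convert(koi8, "koi8_r", "cp866")    # KOI8-R to CP866
back = convert(dos, "cp866", "koi8_r")    # and back again
```

In this sketch the round trip is lossless only for characters present in both repertoires; any other byte is replaced by the fallback character and cannot be restored.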

The beta version of this API is currently available at <http://www.urc.ac.ru/~joy/projects/> (as Charset Library).

The second part of the system is a Web server. It is a slightly modified version of the Apache Web Server [9] (version 1.1.1). The Apache server was used because of the following properties:

• it is public domain software;

• it is one of the fastest Web servers among commercial, shareware and public domain ones;

• it is very stable;

• it provides content type and language negotiation methods;

• it has a modular structure: the server is built up from a kernel and separate modules, each of them implementing strictly defined functions. Such a structure allows easy addition of new server features by creating separate modules, without any change to the server's kernel or to other modules.

There are 29 modules included in Apache version 1.1.1, and dozens of others can be found on the Internet. There are claims that the Apache server is the most widespread Web server on the Internet, especially in its academic part.

Unfortunately, the original implementation of Apache version 1.1.1 does not allow the multi character set support extension to be provided without modification of the kernel. Therefore, about 50 lines had to be added to the kernel source code.

The main changes were made for language specification (in the URL) and character set specification. The added extension allows negotiation of the content type, and explicit specification of the language and character set in the URL. The extension is provided as a new separate module, named mod_charset. The module calls a CGI script with all the arguments necessary to present a list of available languages or character sets to the user. The CGI script also provides a friendly user interface that enables easy choice of the language or character set. This interface can be modified by the Web master if inclusion of pages of any other Web server is requested.

The module defines several options, which can be set either in the server's main configuration file or in the per-directory ones.

The options provided in the original Apache server source code are:

AddLanguage (from the mod_mime module) allows the specification of the language of a document in a file suffix. With the use of content negotiation the server can provide the browser with a file in the language the browser user understands. The language suffix used can differ from the standard two-character ID for a language as specified in the ISO 639 standard. This allows the suffix to deviate from the standard ID where circumstances require it, e.g. for documents in Polish the suffix can be "po" and not, as expected, "pl", to avoid ambiguity with the suffix commonly used for Perl scripts. Several examples follow:

AddLanguage de .de
AddLanguage en .en
AddLanguage es .es
AddLanguage fr .fr
AddLanguage it .it
AddLanguage pl .po
AddLanguage pt .pt
AddLanguage ru .ru

LanguagePriority (from the mod_negotiation module) allows specification of a precedence for languages in case of a tie during content negotiation. This is done with a list of the languages in decreasing order of preference, e.g.

LanguagePriority en es ru pt de fr it po
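The tie-breaking rule can be sketched as follows. This is only an illustration of the ordering idea, not Apache's actual negotiation code; break_tie is a hypothetical name, and languages absent from the list are assumed to rank last.

```python
# Sketch of a LanguagePriority tie-break (hypothetical helper).
LANGUAGE_PRIORITY = ["en", "es", "ru", "pt", "de", "fr", "it", "po"]


def break_tie(candidates):
    """Pick the candidate that appears earliest in the priority list."""
    rank = {lang: i for i, lang in enumerate(LANGUAGE_PRIORITY)}
    # Assumption: unlisted languages are ranked after all listed ones.
    return min(candidates, key=lambda lang: rank.get(lang, len(rank)))


chosen = break_tie(["fr", "ru"])
```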

4.2. The added extensions

The new instructions, which represent the innovation, were designed and added in the module mod_charset. They are:

AddCharset defines a character set name and its aliases and loads the character set into the description table, e.g.:

AddCharset iso-8859-1 iso_8859-1 latin1
AddCharset iso-8859-2 iso_8859-2 latin2
AddCharset iso-8859-3 iso_8859-3 latin3
AddCharset iso-8859-4 iso_8859-4 latin4
AddCharset iso-8859-5 iso_8859-5 iso-ir-144 iso-cyr cyrillic iso
AddCharset iso-8859-9 iso_8859-9 latin5
AddCharset x-cp1250 win-ee ms-ee win-1250 cp1250
AddCharset x-cp1251 win-cyr win-1251 cp1251
AddCharset x-cp1252 win-ansi win-1252 cp1252
AddCharset x-cp1253 win-greek ms-greek win-1253 cp1253
AddCharset x-cp1254 win-turk ms-turk win-1254 cp1254
AddCharset ibm437 cp437
AddCharset ibm850 cp850
AddCharset ibm852 cp852
AddCharset ibm855 cp855
AddCharset ibm857 cp857
AddCharset ibm860 cp860
AddCharset ibm861 cp861 cp-is
AddCharset ibm863 cp863
AddCharset ibm865 cp865
AddCharset x-ibm866 cp866 x-cp866 ibm866 dos-rus-alt
AddCharset ibm869 cp869 cp-gr
AddCharset koi8-r koi-8 koi8 koi
AddCharset x-ru-mac MacOS_Cyrillic macOS-cyr
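How such alias lines could be turned into a lookup table is sketched below. The parser is hypothetical (it is not the mod_charset code), and case-insensitive matching of names is an assumption.

```python
# Sketch: resolve charset aliases from AddCharset lines
# (hypothetical parser; case-insensitive matching is an assumption).
ADD_CHARSET = """
AddCharset koi8-r koi-8 koi8 koi
AddCharset x-cp1251 win-cyr win-1251 cp1251
AddCharset x-ibm866 cp866 x-cp866 ibm866 dos-rus-alt
"""


def load_aliases(config):
    """Return a dict mapping every name and alias to the canonical name."""
    aliases = {}
    for line in config.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[0] != "AddCharset":
            continue
        canonical = parts[1].lower()
        for name in parts[1:]:
            aliases[name.lower()] = canonical
    return aliases


aliases = load_aliases(ADD_CHARSET)
canonical = aliases["CP866".lower()]
```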

AddUAType binds substrings of the string received by the server in the User-Agent field to a User Agent Type, e.g.:

AddUAType DOS-OS2 DosLynx WebExplorer
AddUAType KOI-Unix Linux FreeBSD "via PBD" Arena Ariadna Lynx
AddUAType ISO-Unix X11 "X Window"
AddUAType Windows Win AIR-Mosaic IWENG/1 MSIE
AddUAType Macintosh Macintosh
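A minimal sketch of the substring binding follows. The matching order (first rule that matches wins) and case-sensitive comparison are assumptions; the source does not state how mod_charset resolves overlapping substrings such as "Lynx" and "DosLynx".

```python
# Sketch: map a User-Agent string to a User Agent Type by substring match.
# Assumptions: rules are tried in configuration order, first match wins,
# and the comparison is case-sensitive.
UA_TYPES = [
    ("DOS-OS2", ["DosLynx", "WebExplorer"]),
    ("KOI-Unix", ["Linux", "FreeBSD", "via PBD", "Arena", "Ariadna", "Lynx"]),
    ("ISO-Unix", ["X11", "X Window"]),
    ("Windows", ["Win", "AIR-Mosaic", "IWENG/1", "MSIE"]),
    ("Macintosh", ["Macintosh"]),
]


def ua_type(user_agent):
    """Return the first User Agent Type whose substring occurs in the header."""
    for name, needles in UA_TYPES:
        if any(needle in user_agent for needle in needles):
            return name
    return None


detected = ua_type("Mozilla/3.0 (Win95; I)")
```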

AddLangCS sets the unique character set for the given language and User Agent Type (the <UAT>:<charset> form), and also defines all other character sets that can be used for the particular language (the <charset> form). The first coded character set name encountered for the particular language defines the coded character set used on the server itself for this language, e.g.:

AddLangCS de iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-9
AddLangCS de x-cp1250 Windows:x-cp1252 x-cp1254
AddLangCS de DOS-OS2:ibm850 ibm852 ibm857
AddLangCS en iso-8859-1 Windows:x-cp1252
AddLangCS es iso-8859-1 iso-8859-9 Windows:x-cp1252 x-cp1254
AddLangCS es DOS-OS2:ibm850 ibm857 ibm860
AddLangCS fr iso-8859-1 iso-8859-3 Windows:x-cp1252 x-cp1254
AddLangCS fr DOS-OS2:ibm850 ibm857 ibm863
AddLangCS it iso-8859-1 Windows:x-cp1252 DOS-OS2:ibm850 ibm857
AddLangCS pl iso-8859-2 Windows:x-cp1250 DOS-OS2:ibm852
AddLangCS pt iso-8859-1 iso-8859-9 Windows:x-cp1252 x-cp1254
AddLangCS pt DOS-OS2:ibm850 ibm857 ibm860
AddLangCS ru KOI-Unix:KOI8-R ISO-Unix:ISO-8859-5 Windows:x-cp1251
AddLangCS ru DOS-OS2:x-ibm866 ibm855 Macintosh:x-ru-mac
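The semantics described for AddLangCS can be sketched as a small parser. It is a hypothetical reading of the directive, not the module's code: the first charset seen for a language is taken as the server-side charset, the <UAT>:<charset> items fill a per-type table, and every charset named is recorded as allowed.

```python
# Sketch: interpret AddLangCS lines (hypothetical parser and data layout).
ADD_LANG_CS = """
AddLangCS ru KOI-Unix:koi8-r ISO-Unix:iso-8859-5 Windows:x-cp1251
AddLangCS ru DOS-OS2:x-ibm866 ibm855 Macintosh:x-ru-mac
"""


def load_lang_cs(config):
    """Build {language: {"server", "by_ua", "allowed"}} from the config text."""
    table = {}
    for line in config.splitlines():
        parts = line.split()
        if len(parts) < 3 or parts[0] != "AddLangCS":
            continue
        entry = table.setdefault(
            parts[1], {"server": None, "by_ua": {}, "allowed": []})
        for item in parts[2:]:
            uat, sep, charset = item.partition(":")
            if not sep:                  # plain <charset> form
                charset, uat = item, None
            charset = charset.lower()
            if entry["server"] is None:  # first charset seen for the language
                entry["server"] = charset
            if uat is not None:          # <UAT>:<charset> form
                entry["by_ua"][uat] = charset
            if charset not in entry["allowed"]:
                entry["allowed"].append(charset)
    return table


def charset_for(table, lang, ua):
    """Charset bound to the User Agent Type, else the server-side charset."""
    entry = table[lang]
    return entry["by_ua"].get(ua, entry["server"])


table = load_lang_cs(ADD_LANG_CS)
```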

AddLocale sets the locale for the particular language(s) or cultural environment for use in CGI scripts, and provides information for calls from server parsed HTML files. This feature may be useful for language and culture dependent search and other word processing tasks.

AddLocale de de
AddLocale en en
AddLocale es es
AddLocale fr fr
AddLocale it it
AddLocale pt pt
AddLocale ru RU ru

LangChoiceHandler and CSChoiceHandler set the paths to CGI scripts which allow the user to choose any available language or coded character set. The same script can serve both directives, as the kind of choice (language or coded character set) can be determined from the script parameters.

LangChoiceHandler /cgi-bin/avail_choice
CSChoiceHandler /cgi-bin/avail_choice

4.3. Implementation

The extensions described above were implemented in the Apache Web server as the module mod_charset. Together with the existing Apache features it enables the character and language enabled Web server to operate as follows:

• Document language detection: if the document URL does not contain the language specification, then the language negotiation method from the mod_mime and mod_negotiation modules is triggered. Otherwise, the language is determined from the specification in the URL. If both methods provide no result, no information about the language is passed to the client. As a consequence, no conversion is performed and no information about the character set used is provided to the user.

• Detection of the coded character set required by the client: if the URL does not contain a specification of the coded character set, then the server tries to guess it by several methods in consecutive order. If there is no result, or if the requested character set is not used for the given language, the server sends the document in the character set in which the document is kept on the server, i.e. no conversion is done. The methods used are:

(a) the standard coded character set negotiation method (the Accept-Charset field);

(b) the extension of the negotiation method (the text/x-charset-<name> specification in the Accept field);

(c) the same with another specification, text/x-cyrillic-<name> (for compatibility with Mironov's Cyrillic enabled Web server, widespread in FREEnet);

(d) client software type detection from the User-Agent field.
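The consecutive detection methods (a) to (d) can be sketched as one function. Several points are assumptions of this sketch rather than documented behaviour: quality values in Accept-Charset are ignored, the first candidate that is allowed for the language wins, and the function name and arguments are hypothetical.

```python
import re


def requested_charset(url_cs, headers, allowed, ua_charset):
    """Return the charset to convert to, or None when no conversion is done.

    url_cs     -- charset given explicitly in the URL, or None
    headers    -- dict of request header fields
    allowed    -- charsets usable for the document language (lower case)
    ua_charset -- charset bound to the detected User Agent Type, or None
    """
    candidates = []
    if url_cs:
        # Explicit specification: negotiation is not called.
        candidates.append(url_cs)
    else:
        # (a) standard negotiation; q-values are ignored in this sketch
        for item in headers.get("Accept-Charset", "").split(","):
            name = item.split(";")[0].strip()
            if name:
                candidates.append(name)
        # (b) and (c) non-standard media types in the Accept field
        accept = headers.get("Accept", "")
        candidates += re.findall(
            r"text/x-(?:charset|cyrillic)-([\w.-]+)", accept)
        # (d) charset derived from the User-Agent field
        if ua_charset:
            candidates.append(ua_charset)
    for name in candidates:
        if name.lower() in allowed:
            return name.lower()
    return None  # document is sent as stored on the server


allowed = {"koi8-r", "x-cp1251"}
picked = requested_charset(
    None, {"Accept-Charset": "x-cp1251, koi8-r;q=0.5"}, allowed, None)
```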

The language or coded character set can also be provided explicitly in the URL in the following format:

http://<server-name>[:<port>][/LANG=<language>][/CS=<charset>]/<path>

where <language> is the language tag according to RFC 1766 [10] and <charset> is a coded character set tag according to RFC 1700 [11]. In this case, the negotiation method is not called. The explicit language or coded character set specifications are used by client programs which are not able to use any of the other existing negotiation methods.
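Splitting such a request path into its parts might look like the sketch below. The regular expression and the function name are hypothetical; the sketch assumes that LANG, when present, precedes CS, as in the format given above.

```python
import re

# Sketch: split "/LANG=<language>/CS=<charset>/<path>" into its parts
# (hypothetical helper; assumes LANG comes before CS when both are given).
PREFIX = re.compile(
    r"^(?:/LANG=(?P<lang>[^/]*))?(?:/CS=(?P<cs>[^/]*))?(?P<path>/.*)$")


def split_request_path(path):
    """Return (language, charset, document path); absent parts are None."""
    match = PREFIX.match(path)
    if match is None:
        raise ValueError("request path must start with '/'")
    return match.group("lang"), match.group("cs"), match.group("path")


parts = split_request_path("/LANG=ru/CS=koi8-r/info/main.html")
```

An empty string (as opposed to None) marks a specification that is present without a name, which matters for the special variants of LANG and CS.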

There are special variants of the LANG and CS specifications, used for language and coded character set lists generated by the server, that should be mentioned:

The "*" sign before or instead of a language or coded character set name (e.g. <http://www.server.org/LANG=*/info/main.html>) indicates that the server will give the list of languages or coded character sets (respectively) available for the current document. To do this the server calls the CGI script (set by the Web master) with the following arguments: "/<path>?lang=...&flang=...&alang=...&cs=...&scs=...&fcs=...&acs=...", where:

<path> is the current document path (written after the LANG and CS specifications);

lang is the language chosen for the current document;

flang is the explicit specification of the language, if any (written after "LANG=");

alang is the list of available languages (the language names separated by spaces);

cs is the character set chosen for the current document;

scs is the character set used for the given language on the server;

fcs is the explicit specification of the character set, if any (written after "CS=");

acs is the list of available charsets for the given language (separated by spaces).
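Assembling the argument string for the choice script could be sketched as follows. The script path and the function name are hypothetical, and the percent-encoding produced by urlencode (spaces as "+", "*" as "%2A") is an assumption; the source does not say how the server encodes the values.

```python
from urllib.parse import parse_qs, urlencode


def choice_script_url(script, path, lang, flang, alang, cs, scs, fcs, acs):
    """Build "<script>/<path>?lang=...&flang=...&...&acs=..." for the CGI call."""
    query = urlencode({
        "lang": lang,
        "flang": flang,
        "alang": " ".join(alang),   # space-separated list of languages
        "cs": cs,
        "scs": scs,
        "fcs": fcs,
        "acs": " ".join(acs),       # space-separated list of charsets
    })
    return f"{script}{path}?{query}"


url = choice_script_url(
    "/cgi-bin/choice",              # hypothetical script location
    "/info/main.html",
    lang="ru", flang="*", alang=["en", "ru"],
    cs="koi8-r", scs="koi8-r", fcs="", acs=["koi8-r", "x-cp1251"],
)
args = parse_qs(url.split("?", 1)[1], keep_blank_values=True)
```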

After the language or coded character set is chosen by the user, the document is sent to him or her. If the user specifies both keywords (/LANG=*/CS=*/...), then first the languages and afterwards the character sets are presented to the user; finally the document is presented with the chosen characteristics.

There are two options for providing this. One is by an HTML reference to the document (<A Href="...">...</A>), and the other is by a server parsed document. In the latter, the path of the initial document (an SHTML document, or the one from which the reference is made) is specified after the LANG and CS specification. This enables the user to return to the document after making a choice. It implies that the document name will be contained in the document itself, and as a consequence a name change will require modification of the document content.

To avoid this, the following approach was implemented: the dot (.) sign is used before or instead of the language or character set name (e.g. http://www.server.org/LANG=./). The user receives the language or character set list, but the document (i.e. the one which the user receives after making a choice) is found automatically, because it is either the document from which the reference to the choice list is made, or the SHTML document of which the list is a part.

In this approach the LANG and CS specifications are located in the URL before the document's real path, which means that the explicit language and character set specification stays in place when the user navigates to another document through a relative reference. There is still a problem with absolute references. When the user is at the URL http://www.server.org/LANG=ru/CS=koi8-r/info/main.html and follows an <A Href="/second.html">...</A> reference, then the new URL is http://www.server.org/second.html, and the information about the explicit language and charset specification is lost. To preserve this information the following approach was applied:

Whenever the "LANG=" or "CS=" specification is present without the language or charset name (e.g. <A Href="/CS=/second.html">...</A>), the language and the character set are inherited from the current document. This method works even when a user navigates among different multiple character set enabled servers.
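The inheritance rule can be sketched as below. The function is hypothetical, and the source does not say how the server learns the values of the current document, so they are simply passed in; None stands for "not specified" and an empty string for a specification without a name.

```python
def inherit(current_lang, current_cs, href_lang, href_cs):
    """Resolve language and charset for a followed reference.

    An empty "LANG=" or "CS=" in the reference (empty string) makes both
    values fall back to those of the current document.
    """
    if href_lang == "" or href_cs == "":
        return (href_lang or current_lang, href_cs or current_cs)
    # No empty specification: the reference is taken as written.
    return href_lang, href_cs


# Following <A Href="/CS=/second.html"> from a ru / koi8-r document
resolved = inherit("ru", "koi8-r", None, "")
```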

5. Additional enhancement

5.1. MAITS API

The current API used in the FREEnet solution is rather limited, and for that reason an enhancement was proposed. This enhancement provides much richer functionality and can be considered a good migration step towards a fully Unicode (UCS BMP of ISO 10646) enabled server; Unicode is proposed as the document character set in RFC 2070, which addresses the internationalization of HTML. The API was developed within the MAITS project (see http://www.dkuug.dk/maits). The core of the API is based on the C3 (conversion of coded character sets) tools developed within the TERENA Task Force of WG-i18n [15].

MAITS was a consortium formed to develop an Applications Portable Interface (API) for multilingual applications in telematic services. MAITS is a project from the Language Engineering part of the Telematics programme within the CEC 4th Framework Programme.

The API covers four levels of language processing:

0 Character set conversion

1 Transliteration and locales

2 Simple translation of stored text strings

3 Access to machine translation

All levels of the API use UCS-2 or UCS-4 (depending on a compile-time setting) for their input. The purpose of Level 0 is to convert text from various character sets into and from UCS-2 (or UCS-4), and thus provide access to the other levels. In the case of the character set and language enabled server, API level 1 and level 2, which implement almost all the C3 features, are used (see Fig. 2).

Since the UCS repertoire is larger than that of other character sets, conversion from UCS must cope with missing characters. The API is designed to provide several fallback mechanisms for missing characters:

• SGML entities

• RFC 1345 mnemonics

• a predefined fallback character.

The API provides a rich set of statistics about the performed conversion, such as the number of converted characters, the number of SGML entities used, etc. This enables choice of the best destination character set when several options are available.
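The interplay of fallback and statistics can be illustrated with a short sketch. It does not use the MAITS API: Python codecs stand in for the conversion tables, SGML numeric character references (&#NNN;) stand in for the entity fallback, and the function names are hypothetical.

```python
# Sketch: convert from Unicode with an SGML fallback and count the results
# (assumptions: Python codecs, numeric character references, ASCII-compatible
# eight-bit targets; not the MAITS API).

def convert_with_stats(text, charset):
    """Encode text, replacing missing characters and counting both cases."""
    out = bytearray()
    stats = {"converted": 0, "entities": 0}
    for ch in text:
        try:
            out += ch.encode(charset)
            stats["converted"] += 1
        except UnicodeEncodeError:
            # Character missing from the target repertoire.
            out += f"&#{ord(ch)};".encode("ascii")
            stats["entities"] += 1
    return bytes(out), stats


def best_charset(text, candidates):
    """Choose the candidate that needs the fewest fallback entities."""
    return min(candidates,
               key=lambda cs: convert_with_stats(text, cs)[1]["entities"])


data, stats = convert_with_stats("Žaba и кот", "koi8_r")
choice = best_charset("Привет", ["iso8859_1", "koi8_r"])
```

Here the statistics drive the selection directly: the charset with the fewest fallbacks is preferred, so the fallback mechanism is used only when no candidate covers the whole text.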


The API also supports different names for the same coded character set. Thus, "ISO-8859-2" and "Latin-2" designate the same coded character set. Since different names are allowed only in some contexts, the API supports the notion of domains for names. For example, "CP 912" is a valid identifier for a DOS code page, but it is not registered by IANA and is therefore not in the IANA domain.

The API is designed to be used in multithreaded environments, as the major use of the API is in the server environment. This is achieved by using the conversion stream, an opaque object that stores all conversion settings. It also remembers the state of the conversion between two conversion invocations, thus enabling the API to handle stateful character sets.
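The idea of a conversion stream that keeps its state between invocations can be illustrated by analogy with Python's incremental encoders. This is not the MAITS API; make_stream is a hypothetical name, and ISO-2022-JP is used only as a familiar example of a stateful coded character set.

```python
import codecs


def make_stream(charset):
    """Create an independent conversion stream (one per thread or request)."""
    return codecs.getincrementalencoder(charset)()


stream = make_stream("iso2022_jp")
chunks = [
    stream.encode("日"),             # emits an escape sequence, changes state
    stream.encode("本"),             # state is remembered: no new escape
    stream.encode("", final=True),   # final call returns to the initial state
]
result = b"".join(chunks)
```

Because all settings and state live in the stream object rather than in global variables, separate threads can convert different documents at the same time, each with its own stream.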

5.2. MAITS enhanced WWW server

While retaining the extensions for character set and language specification described in Section 4, the original API is replaced with the MAITS API. The conversion statistics provided by the API enable the server to choose the output character set that is best suited to the given client or user. With this approach, use of the fallback mechanism will be marginal.

As more and more browsers on the market accept Unicode (ALIS, Accent, Netscape, Amaya), the enhanced server with the MAITS API represents a good migration solution towards a fully Unicode supported server.

6. Conclusion

The basic design for the character enabled WWW server was developed within the FREEnet project MultiWeb, based on the Apache server (version 1.1.1), one of the most popular and widespread Web servers on the Internet and one of the richest in functional capabilities. The implementation was carried out with the financial support of the Open Society Institute and the Russian Foundation for Basic Research. Later, the server specifications were elaborated within TERENA WG-i18n together with the provision of the enhanced API.

The basic server features are:

• support of the HTTP/1.1 language and character set negotiation mechanism;

• support of non-standard negotiation methods;

• support of setting the language and/or character set in the URL, allowing the server to work correctly with client software that does not support the HTTP/1.1 protocol;

• support of user work through caching proxy servers, which is extremely important for users on low bandwidth communication lines (and this will certainly be the reality for many Internet users);

• user friendly methods of language and character set choice through automatically generated HTML pages, for cases when the client software is not capable of negotiation;

• separation of the multiple language and coded character set support from the information content of the WWW server, which allows the Web master to manage languages and character sets easily (adding, changing, deleting) without the need to change the content of HTML documents on the server.

The server is installed and is being used successfully on the South Ural Regional node of FREEnet. There are several bilingual (Russian and English) WWW servers: www.urc.ac.ru, www.tu-chel.ac.ru, scholar.urc.ac.ru; and a multilingual demonstration server (URL: http://multiweb.urc.ac.ru/demo.html). The character enabled server has been enhanced with a much richer API. The major benefit of the new API is the possibility of handling Unicode coded documents.

Acknowledgements

The authors would like to thank the Open Society Institute and the Russian Foundation for Basic Research for their support. Without it, the beginning of the project (within the framework of FREEnet) would have been impossible. The support of TERENA WG-i18n is also acknowledged.

References

[1] A. Chernov, Registration of a Cyrillic character set. RFC 1489, RELCOM Development Team. KOI8-R code page, http://www.alis.com:8085/langues/codage/iso8859/koi8-r.gif, July 1993.

[2] ISO-8859-5 code page, http://www.alis.com:8085/langues/codage/iso8859/8859-5.gif.

[3] The Unicode Consortium Home Page, http://www.stonehand.com/unicode.html. The Unicode Standard, Second Edition. ISBN: 0-201-48345-9.

[4] ISO/IEC JTC1/SC2/WG2. Multi-Octet Coded Character Set. Roadmap to the BMP of ISO 10646. Unofficial, revised HTML version: http://www.indigo.ie/egt/standards/iso10646/bmp-roadmap.html.

[5] W3C. Hypertext Transfer Protocol. http://www.w3.org/pub/WWW/Protocols/.

[6] W3C. HyperText Markup Language (HTML). http://www.w3.org/pub/WWW/MarkUp/.

[7] W3C. Names and Addresses, URIs, URLs, URNs, URCs. http://www.w3.org/pub/WWW/Addressing/.

[8] K. Simonsen, Character mnemonics & character sets. RFC 1345, Rationel Almen Planlaegning, June 1992.

[9] Apache HTTP Server Home Page, http://www.apache.org/.

[10] H. Alvestrand, Tags for the identification of languages. RFC 1766, UNINETT, March 1995.

[11] J. Reynolds, J. Postel, Assigned Numbers. STD 2, RFC 1700, USC/ISI, October 1994.

[12] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, T. Berners-Lee, Hypertext Transfer Protocol-HTTP/1.1. RFC 2068, January 1997.

[13] F. Yergeau, G. Nicol, G. Adams, M. Duerst, Internationalization of the Hypertext Markup Language. RFC 2070, January 1997.

[14] B. Jerman-Blažič, A. Gogala, D. Gabrijelčič, Transparent language processing: A solution for internationalization of Internet services. LISA Forum Newsletter Vol. V, 1996.

[15] B. Jerman-Blažič, Tool supporting the internationalization of the generic network services. Computer Networks and ISDN Systems 27 (1994) 429-435.