Using regular expressions to handle non-ASCII text
- Slide 1
- Using regular expressions to handle non-ASCII text
- Slide 2
- A motivating example
- Slide 3
- A program which puts data into a database. Create a simple MySQL table. Write a program which accepts a string from a form and appends it to the database table. We will use it on the next few slides.
- Slide 4
- First interaction. We use the form to submit the string Fred is here. Checking the database shows that the string was correctly stored.
- Slide 5
- Second interaction. We use the form to submit the string ‘Fred is here’ said Tom. The program claims it handled the string correctly. But checking the database shows that the slanted apostrophes look funny. The problem stems from the way the slanted apostrophes are encoded. The confusion arises because the slanted apostrophe ’ is not a standard ASCII character. It is not the same as the basic apostrophe ', which is a standard ASCII character.
- Slide 6
- Third interaction. The problem is even worse if we are developing a website to support customers who use languages besides English. Suppose we use the form to submit a Chinese string. The program claims it handled the string correctly. But checking the database shows something strange: the Chinese characters have been converted to HTML entity numbers.
- Slide 7
- Fourth interaction. Actually, the treatment of Chinese is not as bad as what happens if we use the program to handle other Latin-script languages. Suppose we use the form to submit the Polish word znaków. The program claims it handled the string correctly. But checking the database shows something strange about the way the letter ó is handled.
- Slide 8
- An interlude To see the root of the problem, we need to
understand how characters are handled We will return to the use of
regular expressions in website programming, but first we must look
at character encoding
- Slide 9
- Character encoding
- Slide 10
- A file containing a Polish word (part 1) Let's use Notepad to create a new file containing the Polish word znaków (which means symbols, signs or characters).
- Slide 11
- A file containing a Polish word (part 2) Notepad allows us to save the file in different formats, which it calls ANSI, Unicode, Unicode big-endian and UTF-8.
- Slide 12
- Comparing the formats. We can use XVI-32 to examine the different files. The ANSI file contains 6 bytes. The so-called Unicode file contains 14 bytes; although Microsoft calls the format used in this file 'Unicode', the proper name for the format is UTF-16LE, where LE means 'little-endian'. The so-called Unicode big-endian file also contains 14 bytes; the proper name for the format used in this file is UTF-16BE, where BE means 'big-endian'. The UTF-8 file contains 10 bytes. ANSI was developed for English script. UTF-16LE, UTF-16BE and UTF-8 are implementations of an approach called Unicode, which was developed to support all language scripts. Let's examine these four formats.
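The byte counts on this slide can be reproduced with a short Python sketch. (Python is not used in these slides; 'cp1252', 'utf-16-le', 'utf-16-be' and 'utf-8-sig' are Python's names for the formats Notepad calls ANSI, Unicode, Unicode big-endian and UTF-8.)

```python
# Reproduce the byte counts that XVI-32 shows for the four Notepad formats.
word = "znaków"  # the Polish word saved in the file

# "ANSI" here is the Windows-1252 code page: one byte per character.
ansi = word.encode("cp1252")

# Notepad's "Unicode" files start with a 2-byte byte order mark,
# followed by two bytes per character.
utf16le = b"\xff\xfe" + word.encode("utf-16-le")
utf16be = b"\xfe\xff" + word.encode("utf-16-be")

# "utf-8-sig" prepends the 3-byte UTF-8 byte order mark, as Notepad does.
utf8 = word.encode("utf-8-sig")

print(len(ansi), len(utf16le), len(utf16be), len(utf8))  # 6 14 14 10
```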
- Slide 13
- The ANSI format
- Slide 14
- Viewing the ANSI file in XVI32. The file contains 6 bytes, one for each character. The English characters z, n, a, k and w are encoded using ASCII codes - byte values in the range 00 to 7F. But the code for ó is based on an extension to ASCII called Windows-1252, which uses byte values in the range 80 to FF; thus, ó is represented as F3. Extensions to ASCII which use the values 00 to FF for various purposes are often called "code pages", and Windows-1252 (also known as Microsoft Windows Latin-1) is often called CP-1252. By the way, Windows-1252 is often confused with a similar, but slightly different, character code, ISO 8859-1 (a.k.a. ISO Latin-1).
- Slide 15
- Code pages. The CP-1252 or Microsoft Windows Latin-1 "code page" is only one of many different ways of using byte values in the range 80 through FF. Different code pages support different languages. CP-1251, for example, uses byte values 80 through FF for Cyrillic, the alphabet used in Russian, Bulgarian, Serbian, Macedonian, Kazakh, Kyrgyz, Tajik,... When I lived in Thailand, the computers all used CP-874; this supports the Thai alphabet. In CP-874, the byte value which represents the Cyrillic letter т in CP-1251 actually represents the symbol ๒ (the Thai numeral for two - it is pronounced 'sawng'). Using different code pages was OK when files generated in one culture were never used outside that culture. But it's no good when a file generated in a country whose computers use one code page is opened in a country where computers use another code page. It is also a problem when one needs to deal with different languages in one document. This motivated the development of Unicode.
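A short Python sketch of this ambiguity (not part of the slides; the byte value F2 is inferred from the slide's description, and 'cp1251', 'cp874' and 'cp1252' are Python's names for these code pages):

```python
# The same byte value means different things under different code pages.
b = b"\xf2"
print(b.decode("cp1251"))  # т  - Cyrillic letter te, under CP-1251
print(b.decode("cp874"))   # ๒  - Thai numeral two, under CP-874
print(b.decode("cp1252"))  # ò  - under CP-1252 (Windows Latin-1)
```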
- Slide 16
- Unicode
- Slide 17
- Code points. Unicode is an abstract code; as we shall see later, Unicode can be implemented in various ways. In Unicode, each symbol is represented by an abstract code point. A code point is usually written in the form U+ followed by a sequence of hex digits, for example U+007A. The U+ is actually meant to remind us of the set union symbol, ∪, referring to the fact that Unicode is meant to be a union of character sets. Unicode provides enough code points for 1,114,112 symbols. However, most of these code points are still unused, which is why its promoters are reasonably confident that it will always provide enough code points to support all symbols likely to be developed or, at least, all symbols developed by members of our species!
- Slide 18
- Planes. Unicode is intended to cope with all symbols existing or likely to be developed. It provides enough code points for 1,114,112 symbols. This huge set of code points is divided into 17 "planes", each of which contains 65,536 (2^16) code points. Plane 0, the Basic Multilingual Plane (BMP), contains code points for almost (but not quite) all symbols used in current languages. Plane 1, the Supplementary Multilingual Plane (SMP), contains historic scripts (hieroglyphs, cuneiform, Minoan Linear B), musical notation, mathematical alpha-numerics, emoticons and game symbols (playing cards, dominoes). Plane 2, the Supplementary Ideographic Plane (SIP), is used for some Chinese, Japanese and Korean symbols that are not in Plane 0. Planes 3-13 are still unused. Plane 14, the Supplementary Special-purpose Plane (SSP), contains special-purpose non-graphical characters. Planes 15 and 16, the Supplementary Private Use Areas, are available for use by entities outside the Unicode Consortium.
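Code points and planes can be explored directly in Python (a sketch, not part of the slides; ord() and chr() convert between characters and code points):

```python
import unicodedata

# ord() gives a character's code point; chr() goes the other way.
assert ord("z") == 0x007A and chr(0x007A) == "z"
print(unicodedata.name("z"))  # LATIN SMALL LETTER Z

# The plane is the code point divided by 65,536 (0x10000).
print(ord("z") // 0x10000)             # 0 - the BMP
print(ord("\U0001D11E") // 0x10000)    # 1 - MUSICAL SYMBOL G CLEF, in the SMP

# 17 planes of 65,536 code points give 1,114,112 code points in total.
assert 17 * 65536 == 1114112
```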
- Slide 19
- Writing code points Code points in the Basic Multilingual Plane
(BMP) are written as U+ followed by four hex digits for example,
the code point for the letter z is written as U+007A Code points
outside the BMP are written using U+ followed by five or six hex
digits, as required, for example, the LANGUAGE-TAG character in
Plane 14 is written as U+E0001 while one private-use character in
Plane 16 is written as U+10FFFD
- Slide 20
- Blocks Within the Basic Multilingual Plane, code points are
grouped in contiguous ranges called blocks Each block has its own
unique and descriptive name Example blocks: Basic Latin, Latin-1
Supplement, Greek and Coptic, Cyrillic, Armenian, Hebrew, Arabic,
Arabic Supplement, Tibetan, Ogham Blocks contain contiguous code
points but may be of different sizes While the Basic Latin block
contains 128 code points, the Cyrillic block contains 256 code
points but the Armenian block contains only 96 and the Ogham block
contains only 32 code points
- Slide 21
- Where to find details of these blocks Unicode.org maintains a
list of all blocks at http://www.unicode.org/charts/ Clicking on a
block name gives you a PDF file for the block For example, the next
slide shows the PDF file for the Ogham block
- Slide 22
- Example PDF file for a Unicode block The PDF file for a Unicode
block gives the following information for each symbol in the block
a picture of the symbol its code point a descriptive name for the
symbol
- Slide 23
- Backward compatibility Unicode was designed to be compatible
with ASCII Thus, the Basic Latin block contains all 128 ASCII
standard characters Each ASCII code maps directly to a Unicode code
point in this block For example, the letter z, whose ASCII code is
7A, has the code point U+007A The letter n, whose ASCII code is 6E,
has the code point U+006E And so on
- Slide 24
- Backward compatibility (contd.) The Latin-1 Supplement block also contains 128 code points. Some, but not all, of these code points are similar to the codes in the Windows-1252 (Microsoft Windows Latin-1) code page. Those code points in the Latin-1 Supplement block which do map directly to Windows-1252 codes include the code points for Latin letters with accents and other common diacritical marks such as umlauts. Thus, the accented letter ó, which has the Windows-1252 code of F3, has the Unicode code point U+00F3.
- Slide 25
- Implementations of Unicode Unicode is an abstract code Various
implementations include UTF-32 UTF-16 UTF-8
- Slide 26
- UTF-32 UTF-32 is a fixed-length encoding of Unicode Every code
point is directly encoded using 32 bits, or four bytes
- Slide 27
- UTF-16. UTF-16 is a 16-bit encoding of Unicode. Unlike UTF-32, it is a variable-length encoding: code points are encoded with one or two 16-bit code units; that is, in UTF-16 a code point is encoded as either two or four bytes.
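A quick Python check of these lengths (a sketch, not part of the slides; 'utf-16-le' and 'utf-32-le' are Python's codec names):

```python
# BMP code points need one 16-bit code unit (two bytes); code points
# outside the BMP need a surrogate pair (four bytes).
assert len("z".encode("utf-16-le")) == 2            # U+007A, in the BMP
assert len("\U0001D11E".encode("utf-16-le")) == 4   # U+1D11E, outside the BMP

# UTF-32, by contrast, always uses four bytes per code point.
assert len("z".encode("utf-32-le")) == 4
```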
- Slide 28
- UTF-8. Like UTF-16, UTF-8 is a variable-length encoding. It uses different numbers of bytes for different code points. Code points for the most common characters, the English letters, are represented as single bytes. Less common characters are represented as two bytes, and rarer characters are represented as three or more bytes. This means that, for text dominated by English letters, UTF-8 is the most space-efficient representation of Unicode. We will see it in more detail later.
- Slide 29
- Examining the Notepad formats To put some flesh on this, let's
examine the various formats in which Notepad stores the small file
we saw earlier
- Slide 30
- The so-called Unicode format in Notepad As we shall see, this
format is actually a form of UTF-16 Its proper name is
UTF-16LE
- Slide 31
- Viewing the "Unicode" file in XVI32 (part 1) The file contains
14 bytes The first two bytes contain a byte order mark (BOM), which
will be explained on a later slide Then, each character is encoded
as two bytes
- Slide 32
- Viewing the "Unicode" file in XVI32 (part 2) The byte order
mark is stored at the start of a file to tell programs whether the
file is written in little-endian or big-endian format The Unicode
code point for the byte order mark is U+FEFF Note that the BOM is
actually stored is our file as FFFE FFFE is the little-endian
version of FEFF, so it tells us that the file is stored in
little-endian format So we know that each character in the rest of
the file is encoded in little-endian format
- Slide 33
- Viewing the "Unicode" file in XVI32 (part 3) The first two
bytes of the file, the BOM, tell us that file is stored in
little-endian format Then, each character is encoded as two bytes,
in little-endian format The Unicode code point for z is U+007A But,
because the file is in little- endian format, the code point for z
is stored in the file as 7A 00 The Unicode code point for n is
U+006E but this little-endian file stores it as 6E 00 And so on for
the other characters
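The little-endian byte layout described above can be confirmed in Python (a sketch, not part of the slides):

```python
# In UTF-16LE the low-order byte of each 16-bit code unit comes first,
# so U+007A (z) is stored as 7A 00 and U+00F3 (ó) as F3 00.
data = "znaków".encode("utf-16-le")
print(data.hex(" "))  # 7a 00 6e 00 61 00 6b 00 f3 00 77 00

assert "z".encode("utf-16-le") == b"\x7a\x00"
assert "ó".encode("utf-16-le") == b"\xf3\x00"
```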
- Slide 34
- Viewing the "Unicode" file in XVI32 (part 4) Unicode was
designed to be compatible with ASCII, so the Basic Latin block
contains all 128 ASCII standard characters, each ASCII code mapping
directly to a Unicode code point z, whose ASCII code is 7A, has the
code point U+007A and appears as 7A 00 in this little-endian file
n, whose ASCII code is 6E, has the code point U+006E and appears as
6E 00 in this little-endian file and so on for a, k and w
- Slide 35
- Viewing the "Unicode" file in XVI32 (part 5) The Latin-1
Supplement block also contains 128 code points Some, but not all,
of these code points are similar to the codes in the Windows-1252
(Microsoft Windows Latin-1) code page The Windows-1252 codes for
common Latin letters with accents or other diacritical marks do map
directly to code points in the Latin-1 Supplement block Thus, the
code point for , which has the Windows-1252 code of F3, has the
code point U+00F3 and appears as F3 00 in this little-endian
file
- Slide 36
- Big-endian Unicode
- Slide 37
- Viewing the Unicode big-endian file in XVI32. The proper name for this format is UTF-16BE. The file has 14 bytes. The first two bytes contain the byte order mark and, then, each of the six characters is encoded as two bytes. The fact that the byte order mark, U+FEFF, is stored as FE FF tells us that the file is in big-endian format. Thus, the code point for z, U+007A, is stored as 00 7A; the code point for n, U+006E, is stored as 00 6E; and so on.
- Slide 38
- UTF-8
- Slide 39
- A UTF-8 file represents characters using a space-efficient representation of Unicode code points. It uses different numbers of bytes for different code points. Code points for the most common characters, the English letters, are represented as single bytes; less common characters are represented as two bytes; rarer characters are represented as three or more bytes.
- Slide 40
- UTF-8 (contd.) Single-byte codes are used for the Unicode
points U+0000 through U+007F Thus, the UTF-8 codes for these
characters are exactly the same as the corresponding ASCII codes.
As we shall see, these single-byte codes can be easily
distinguished from the first bytes of multi-byte codes The
high-order bit of the single-byte codes is always 0 As we shall
see, the high-order bit in the first byte of a multi-byte code is
always 1
- Slide 41
- UTF-8 (contd.) Each of the first 128 characters in UTF-8 needs only one byte. This covers all ASCII (English) characters. Each of the next 1920 characters needs two bytes. This covers the remainder of almost all Latin-derived alphabets. It also covers the Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Maldivian alphabets. It also covers the so-called Combining Diacritical Marks, which can be used to construct new letters as well as providing an alternative way of specifying the standard accented letters that are already covered above. Each of the rest of the 65,536 characters in the Basic Multilingual Plane (which contains nearly all characters used in living languages) needs three bytes. Four bytes are needed for characters in the other Unicode planes; these include less-common characters in the Chinese, Japanese and Korean scripts as well as various historic scripts and mathematical symbols.
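These byte counts are easy to check in Python (a sketch, not part of the slides):

```python
# One byte for ASCII, two for Latin-1 Supplement, three for the rest
# of the BMP, four for code points outside the BMP.
assert len("z".encode("utf-8")) == 1            # U+007A, ASCII
assert len("ó".encode("utf-8")) == 2            # U+00F3, Latin-1 Supplement
assert len("我".encode("utf-8")) == 3           # U+6211, BMP (CJK)
assert len("\U0001F600".encode("utf-8")) == 4   # an emoticon, outside the BMP
```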
- Slide 42
- UTF-8 (contd.) As seen before, the high-order bit of a
single-byte code is always 0 A multi-byte code consists of a
leading byte and one or more continuation bytes. The leading byte
has two or more high-order 1s followed by a 0, while continuation
bytes all have '10' in the high-order position. The number of
high-order 1s in the leading byte of a multi-byte sequence
indicates the number of bytes in the sequence so the length of the
sequence can be determined without examining the continuation
bytes. The remaining bits of the encoding are used for the bits of
the code point being encoded, padded with high-order 0s if
necessary. The high-order bits go in the leading byte, lower-order
bits in succeeding continuation bytes. The number of bytes in the
encoding is the minimum required to hold all the significant bits
of the code point. We shall see an example of multi-byte coding in
a later slide
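The encoding rules above can be written out as a small Python function (a sketch for illustration only; in practice you would simply use chr(cp).encode('utf-8')):

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by hand, following the rules on this slide."""
    if cp < 0x80:        # single byte: high-order bit is 0
        return bytes([cp])
    if cp < 0x800:       # two bytes: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp < 0x10000:     # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    # four bytes: 11110xxx followed by three continuation bytes
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0b111111),
                  0b10000000 | ((cp >> 6) & 0b111111),
                  0b10000000 | (cp & 0b111111)])

print(utf8_encode(0x00F3).hex())  # c3b3 - the encoding worked through on a later slide
print(utf8_encode(0xFEFF).hex())  # efbbbf - the UTF-8 byte order mark
```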
- Slide 43
- Viewing the UTF-8 file in XVI32 (part 1). As we shall see later, the first three bytes in this file, EF BB BF, form a "byte order mark", although this mark is not needed or recommended by the Unicode standard. The next four bytes, 7A 6E 61 6B, look like ASCII codes - they contain the single-byte UTF-8 encodings of U+007A, U+006E, U+0061 and U+006B (znak). The next two bytes, C3 B3, contain a two-byte encoding of ó, as explained on the next slide. The last byte contains a single-byte encoding of U+0077 (w).
- Slide 44
- Viewing the UTF-8 file in XVI32 (part 2). The two-byte UTF-8 encoding for ó, which has the code point U+00F3, is as follows. Since there are two bytes in the code, the leading byte is of the form 110x xxxx and the continuation byte has the form 10xx xxxx. U+00F3 has the following bits: 0000 0000 1111 0011. The significant bits in this code point are 1111 0011. There is room for 6 bits in the continuation byte, so it can contain the six low-order bits 11 0011; this byte therefore becomes 1011 0011, which is B3. The two high-order bits, 11, will be placed in the leading byte. But there is room for 5 bits in the leading byte, so these two bits must be padded with three high-order 0s. So the leading byte becomes 110 00011, that is 1100 0011, which is C3. So the UTF-8 code for ó is C3 B3.
- Slide 45
- Viewing the UTF-8 file in XVI32 (part 3). We can now see that the first three bytes in this file, EF BB BF, are the UTF-8 encoding of the Unicode byte order mark U+FEFF. Since there are three bytes in the code, the leading byte is of the form 1110 xxxx and each of the two continuation bytes has the form 10xx xxxx. So the bytes are 1110 xxxx 10xx xxxx 10xx xxxx. All bits in the U+FEFF code point are significant: 1111 1110 1111 1111. There is room for 6 bits in the last byte, so it can contain the six lowest-order bits 11 1111; this byte therefore becomes 1011 1111, which is BF. The next six bits, 1110 11, will be placed in the middle byte, so this becomes 1011 1011, which is BB. The leading byte gets the highest-order bits, becoming 1110 1111, which is EF. So the UTF-8 code for the byte order mark is EF BB BF.
- Slide 46
- Let's check our understanding by considering some other
languages
- Slide 47
- A web page in Hebrew. Consider this page: http://www.haaretz.co.il/news/politics/1.2151492 Let's copy the first word in the headline, ראש (it's pronounced 'rosh' and means head, leader, boss, chief).
- Slide 48
- Let's save this word in a UTF-8 file Start a new document in
Notepad Paste the word we have just copied And save the file using
the UTF-8 format
- Slide 49
- Inspecting the UTF-8 Open the file with XVI32 The first three
bytes, EF BB BF, are familiar They are the UTF-8 encoding of the
Unicode code point for the byte order mark
- Slide 50
- Inspecting the UTF-8 (contd.) There are six remaining bytes in the file, D7 A8 D7 90 D7 A9. So we suspect there are two bytes per character, but let's check. Look at the first byte, D7, in binary format: 1101 0111. It must be a leading byte in a multi-byte code, because its first bit is a 1. Indeed, it must be the first byte in a two-byte code, because its first bits are 110. So the first character in the file has a two-byte UTF-8 code, D7 A8. Let's compute the Unicode code point and see the character.
- Slide 51
- Inspecting the UTF-8 (contd.) The first character in the file
has a two-byte UTF-8 code, D7 A8 In binary, this is 1101 0111 1010
1000 Using the colour code in the figure 1101 0111 1010 1000 Thus,
the data bits are 1 0111 10 1000 Organized into nibbles, this is
101 1110 1000 So the code point is 0000 0101 1110 1000 That is,
U+05E8 On the next slide, we will check what character this is
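The bit-splicing done by hand on this slide can be sketched in Python and checked against Python's own decoder (not part of the slides):

```python
# Decode the two-byte UTF-8 sequence D7 A8 by hand.
b = bytes([0xD7, 0xA8])

# Strip the 110 prefix from the leading byte and the 10 prefix from the
# continuation byte, then splice the data bits together.
cp = ((b[0] & 0b00011111) << 6) | (b[1] & 0b00111111)
print(hex(cp))  # 0x5e8

# Python's decoder agrees: D7 A8 is U+05E8.
assert chr(cp) == b.decode("utf-8") == "\u05e8"
```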
- Slide 52
- Checking the character (part 1) Use the code point, U+05E8, in
a HTML entity number, Save the HTML file And view the file in a
browser...
- Slide 53
- Checking the character (part 2). The first two bytes after the byte order mark were D7 A8. They were the UTF-8 encoding of the Unicode code point U+05E8. When used in an HTML entity number, this is rendered as the Hebrew letter ר (pronounced resh), the first letter in the word ראש (pronounced rosh). Notice that Notepad is clever enough to display this letter on the right, even though it is the first letter in the file. This is because Notepad recognizes that these letters are from an alphabet in which words are written right-to-left.
- Slide 54
- Exercise. As an exercise, check the last four bytes in the file. That is, check that D7 90 is the UTF-8 encoding of the code point for the Hebrew letter א (pronounced aleph) and that D7 A9 is the UTF-8 encoding of the code point for the Hebrew letter ש (pronounced shin).
- Slide 55
- A word in Arabic. Consider the word حسن (pronounced hasan - it means 'good'). Let's save it in a UTF-8 file.
- Slide 56
- Inspecting the UTF-8 Open the file with XVI32 Again, the first
three bytes, EF BB BF, encode the byte order mark The next byte,
D8, is the binary 1101 1000 Since its first bits are 110, it must
be the leading byte of a two-byte code So the code is D8 AD In
binary, this is 1101 1000 1010 1101 So the data bits are 110 0010
1101 That is 0000 0110 0010 1101 So the code point is U+062D
- Slide 57
- Checking the character (part 1) Use the code point, U+062D, in
a HTML entity number, Save the HTML file And view the file in a
browser...
- Slide 58
- Checking the character (part 2). This does not look right - the letter ح does not look like the first (right-most) letter in the word حسن. However, this is simply a result of the fact that Notepad is clever enough to recognize that the Arabic letter, ha, has several forms. When ha is written as an isolated letter, it is written ح. When ha is written at the start of a word, it is written حـ. The letter also has two other forms, for when it appears in the middle of a word and when it appears at the end of a word.
- Slide 59
- Inspecting the UTF-8 (contd.) The sixth byte in the file, D8,
is the binary 1101 1000 Since its first bits are 110, it must be
the leading byte of a two-byte code So the code is D8 B3 In binary,
this is 1101 1000 1011 0011 So the data bits are 110 0011 0011 That
is 0000 0110 0011 0011 So the code point is U+0633
- Slide 60
- Checking the character (part 1) Use the code point, U+0633, in
a HTML entity number, Save the HTML file And view the file in a
browser...
- Slide 61
- Checking the character (part 2). Again, this may not look perfect, but it is correct: the letter س does not look exactly like the middle letter in the word حسن. Again, this is simply a result of the fact that Notepad is clever enough to recognize that the Arabic letter, sin, has several forms. When sin is written as an isolated letter, it is written س. When sin is written in the middle of a word, it is written ـسـ. The letter also has two other forms, for when it appears at the start of a word and when it appears at the end of a word. The next slide shows how clever Firefox is at rendering Arabic.
- Slide 62
- Checking a character sequence (part 1) Use the two code points,
U+062D and U+0633, in a HTML file And view the file in a
browser...
- Slide 63
- Checking a character sequence (part 2) Notice that Firefox now
renders the letter ha using its initial-position form, so it looks
like it does in Notepad The letter sin still looks different in
Firefox than in Notepad This is because Notepad shows the
medial-position form of the letter while Firefox shows the
final-position form of the letter Let's see what happens if we
encode all three Arabic letters of the word hasan in a HTML
file
- Slide 64
- Inspecting the UTF-8 (contd.) The eighth byte in the file, D9,
is the binary 1101 1001 Since its first bits are 110, it must be
the leading byte of a two-byte code So the code is D9 86 In binary,
this is 1101 1001 1000 0110 So the data bits are 110 0100 0110 That
is 0000 0110 0100 0110 So the code point is U+0646
- Slide 65
- Checking a character sequence (part 1) Use the three code
points, U+062D, U+0633 and U+0646, in a HTML file And view the file
in a browser...
- Slide 66
- Checking a character sequence (part 2). Notice that Firefox renders all three letters exactly as they appear in Notepad. Firefox uses the initial-position version of ha, the medial-position version of sin and the final-position version of nun.
- Slide 67
- Now, let's try Chinese. The sentence 我是爱尔兰人 means I am Irish or, literally, I am Ireland person. It is pronounced wo shi ai er lan ren. Let's use Notepad to save it in a UTF-8 file.
- Slide 68
- Inspecting the UTF-8 (contd.) Again, the first three bytes
encode the byte order mark The fourth byte in the file, E6, is the
binary 1110 0110 Since its first bits are 1110, it must be the
leading byte of a three-byte code So the code is E6 88 91 In
binary, this is 1110 0110 1000 1000 1001 0001 So the data bits are
0110 0010 0001 0001 So the code point is U+6211
- Slide 69
- Checking the character (part 1) Use the code point, U+6211, in
a HTML file And view the file in a browser...
- Slide 70
- Checking the character (part 2). We can now see that U+6211 is the code point for the Chinese character 我, which means I or me.
- Slide 71
- Exercise. As an exercise, check the last fifteen bytes in the file: E6 98 AF E7 88 B1 E5 B0 94 E5 85 B0 E4 BA BA. There are five further Chinese characters in the file. Are all of their code points encoded as three-byte codes? Or do some of them have shorter, and others longer, codes? What is the code point for the character 人 (pronounced ren; it means person)?
- Slide 72
- The same characters can be specified in different ways
- Slide 73
- Consider the two HTML pages shown below ABCD1.html looks the
same as ABCD2.html But...
- Slide 74
- The two identical-looking pages have different source code The
text, ABCD, is specified directly in the source code for ABCD1.html
But The same text, ABCD, is specified using HTML entity numbers in
the source code for ABCD2.html
- Slide 75
- Bear this in mind... When we are looking at the next few web
pages, remember that the same characters can be specified in
different ways
- Slide 76
- Encoding characters in HTML form data
- Slide 77
- A simple form-submission program The program, formInput1.php,
which is shown below, displays a form Then...
- Slide 78
- A simple form-submission program The program, formInput1.php,
which is shown below, displays a form Then, when the user submits a
string, it...
- Slide 79
- A simple form-submission program The program, formInput1.php,
which is shown below displays a form Then, when the user submits a
string, it...... reports the string it received
- Slide 80
- Another simple form-submission program The program,
formInput2.php, which is shown below, also displays a form
Then...
- Slide 81
- Another simple form-submission program The program,
formInput2.php, which is shown below, also displays a form Then,
when the user submits a string, it...
- Slide 82
- Another simple form-submission program The program,
formInput2.php, which is shown below, also displays a form Then,
when the user submits a string, it...... also reports the string it
received
- Slide 83
- Both reports look similar But...
- Slide 84
- ... the source texts for the two report pages are
different
- Slide 85
- What's the cause of the difference? The difference between the source texts for the two reports must lie in some difference between the source codes of the two programs, formInput1.php and formInput2.php. Let's compare the two programs.
- Slide 86
- Difference between the two programs. The only difference between the two programs is that...... formInput2.php encodes every page it delivers in UTF-8. This means that its form also encodes the user's input in UTF-8. It also means that the source text for the report page is encoded in UTF-8... which means that the non-ASCII character appears correctly.
- Slide 87
- Let's see how formInput1.php handles Chinese input The form
cannot encode the user's Chinese string in UTF-8 So it encodes the
characters in the string as HTML entity numbers we can see this in
the source code for the report page
- Slide 88
- Now see how formInput2.php handles Chinese input The form is
able to encode the user's Chinese string in UTF-8 So the Chinese
characters, themselves, appear in the source text for the report
page
- Slide 89
- Checking that the string received really is UTF-8 This program,
formInput3.php, creates a new file called userString and writes a
copy of the received string into it We can then download this new
file and examine its contents with XVI32
- Slide 90
- Notice that the new file, userString, contains 18 bytes These
are exactly the same as the bytes after the byte order mark in
chinese.txt, the file that was created by Notepad when we stored
the same Chinese string So our program does indeed receive (and
store) characters encoded in UTF-8
- Slide 91
- The moral is... If there is any possibility that users of your web pages will enter non-English characters in their form submissions, make sure that you encode your web pages in UTF-8. Indeed, you should always encode your web pages in UTF-8.
- Slide 92
- Diacritical marks in Unicode
- Slide 93
- Diacritical marks. The writing systems for many languages use symbols that are intended to modify other symbols. They are usually called diacritical marks (from a Greek word, διακριτικός, diakritikos, which means distinguishing). Irish uses such marks to distinguish long vowels from short vowels, for example, á from a. Other European languages use different diacritical marks on vowels, for example, to distinguish accented letters such as à, ä and å from a. Some European languages use diacritical marks on consonants: French distinguishes ç from c, Spanish distinguishes ñ from n, and Slavic languages distinguish letters such as š from s. In fact, the Irish writing system used diacritical marks on consonants until recently, indicating lenited consonants with a diacritical dot, as in ḃ, ċ, ḋ, ḟ, ġ, ṁ, ṗ, ṡ, ṫ. Many non-European writing systems also use diacritical marks, for various purposes.
- Slide 94
- Interlude Unicode and Insular Script
- Slide 95
- Examples of old Irish script. Old Irish script is still visible on street signs in Cork city. The photographs below were taken in November 2013. The first letter in the second word in the first photograph shows a consonant with a diacritical mark. Both photographs show several special letter forms from the Insular Script, which was developed in the 600s and was still in use when I was in school. For more info, see http://en.wikipedia.org/wiki/Insular_script These letter forms are all supported by Unicode.
- Slide 96
- Unicode support for Insular Script. The special letter forms of Insular Script are supported in the Latin Extended-D block. Details about the individual letters can be found at codepoints.net. For example, http://codepoints.net/U+A779 gives details about U+A779 LATIN CAPITAL LETTER INSULAR D. Note that although Unicode supports Insular Script, not all fonts do - one that does is Quivira: http://www.quivira-font.com/downloads.php
- Slide 97
- Using special fonts in HTML The HTML file below specifies the
code point for the capital D in Insular Script But Firefox does not
display the letter properly This is not a fault in Firefox It
simply means that the client machine on which Firefox is running
does not have a font which supports this Unicode code point We can
overcome this limitation by using a feature of CSS3
- Slide 98
- Using the @font-face command CSS3 provides the @font-face
command for telling a browser to load a special font which may not
be available on the client machine Below, the browser is told to
load the Quivira font which is available in a file called
Quivira.ttf in the same server directory as the HTML file The
browser can now render the Insular Capital D correctly
- Slide 99
- Back to Diacritical marks in Unicode
- Slide 100
- Diacritical marks in digital typography. Many computerized typography systems treat diacritically-marked characters (such as á, ñ or š) as completely distinct from base characters (such as a, n or s). However, some systems do not provide separate symbols for diacritically-marked characters. Instead, they provide special diacritical symbols (~ ` etc.) which may be combined with base symbols to produce the same visual appearance to the reader. Unicode tries to subsume these different types of system. Thus, for example, Unicode provides two different ways of encoding à. It provides U+00E0, the code point for the composite character à. But it also allows us to achieve the same visual appearance by appending U+0300, the code point for the combining grave accent `, to U+0061, the code point for a. In Unicode jargon, code points for diacritical marks, like U+0300, are called combining marks. As we shall see later, there is a special regular expression notation for dealing with combining marks.
- Slide 101
- An experiment Let's see, close-up, the effect of appending
U+0300, the code point for `, to U+0061, the code point for a We
will create a UTF-8 encoded binary file using XVI32 and open it
with Notepad Since Notepad seems to like an initial byte order mark
in UTF-8 format, the first three bytes in our file will be EF BB BF
Since U+0061 belongs to the Basic Latin block, this code point is
encoded in UTF-8 as one byte, namely 61 So far, then, our file
contains four bytes EF BB BF 61 Now we must encode U+0300 in UTF-8
Hex 03 00 is 0000 0011 0000 0000 in binary Its ten significant
bits are 11 0000 0000 A two-byte code, 110xxxxx 10xxxxxx, provides
space for eleven bits Padding the ten bits with a leading 0, we get
1100 1100 1000 0000, that is CC 80 in hex So our complete file is
EF BB BF 61 CC 80 Let's create it with XVI32 and then open it in
Notepad
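The byte arithmetic above can be confirmed in a few lines of Python (a sketch for checking our work; the slides themselves use XVI32 and Notepad, not Python):

```python
import codecs

# The UTF-8 byte order mark that Notepad likes to see first
bom = codecs.BOM_UTF8
assert bom == b"\xef\xbb\xbf"

# U+0061 (a) followed by U+0300 (combining grave accent)
text = "\u0061\u0300"
encoded = text.encode("utf-8")

# a encodes as the single byte 61; U+0300 encodes as CC 80
assert encoded == b"\x61\xcc\x80"

# So the complete file is the six bytes EF BB BF 61 CC 80
assert bom + encoded == b"\xef\xbb\xbf\x61\xcc\x80"
```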
- Slide 102
- An experiment (contd.) A UTF-8 file containing a byte order
mark followed by U+0061 (code point for the letter a) and U+0300
(the code point for the combining mark `), contains six bytes EF BB
BF 61 CC 80
- Slide 103
- An experiment (contd.) A UTF-8 file containing a byte order
mark followed by U+0061 (code point for the letter a) and U+0300
(the code point for the combining mark `), contains six bytes EF BB
BF 61 CC 80 Creating this with XVI32, we get
- Slide 104
- An experiment (contd.) A UTF-8 file containing a byte order
mark followed by U+0061 (code point for the letter a) and U+0300
(the code point for the combining mark `), contains six bytes EF BB
BF 61 CC 80 Creating this with XVI32, we get Saving this in a file
called aGrave.txt, we get
- Slide 105
- An experiment (contd.) A UTF-8 file containing a byte order
mark followed by U+0061 (code point for the letter a) and U+0300
(the code point for the combining mark `), contains six bytes EF BB
BF 61 CC 80 Creating this with XVI32, we get Saving this in a file
called aGrave.txt, we get Opening this file in Notepad, we get
- Slide 106
- List of Unicode combining marks The complete (as of 2013) list
of Unicode combining marks is available at
http://www.unicode.org/charts/PDF/U0300.pdf We are allowed to
combine these in any order we like, although not all rendering
software may display them correctly
- Slide 107
- List of Unicode combining marks The complete (as of 2013) list
of Unicode combining marks is available at
http://www.unicode.org/charts/PDF/U0300.pdf We are allowed to
combine these in any order we like, although not all rendering
software may display them correctly Let's try combining a few of
them
- Slide 108
- List of Unicode combining marks The complete (as of 2013) list
of Unicode combining marks is available at
http://www.unicode.org/charts/PDF/U0300.pdf We are allowed to
combine these in any order we like, although not all rendering
software may display them correctly Let's try combining a few of
them Let's try appending both U+0300 and U+0333 to U+0061
- Slide 109
- Another experiment Let's see how Notepad handles the result of
appending two combining marks, U+0300 and U+0333, to U+0061, the
code point for a We will create a UTF-8 encoded binary file using
XVI32 and open it with Notepad We already know that the byte
sequence for an initial byte order mark followed by U+0061 and
U+0300 is EF BB BF 61 CC 80 Now we must encode U+0333 in UTF-8 Hex
03 33 is 0000 0011 0011 0011 in binary Its ten significant
bits are 11 0011 0011 A two-byte code, 110xxxxx 10xxxxxx, provides
space for eleven bits Padding the ten bits with a leading 0, we get
1100 1100 1011 0011, that is CC B3 in hex So our complete file is
EF BB BF 61 CC 80 CC B3
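Again, Python can confirm the encoding arithmetic (a sketch, not part of the slides' XVI32/Notepad workflow):

```python
# U+0061 (a), U+0300 (combining grave), U+0333 (combining double low line)
text = "\u0061\u0300\u0333"
assert text.encode("utf-8") == b"\x61\xcc\x80\xcc\xb3"

# Check the two-byte UTF-8 encoding of U+0333 by hand:
cp = 0x0333
byte1 = 0b11000000 | (cp >> 6)        # 110xxxxx -> 0xCC
byte2 = 0b10000000 | (cp & 0b111111)  # 10xxxxxx -> 0xB3
assert bytes([byte1, byte2]) == "\u0333".encode("utf-8")
```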
- Slide 110
- Another experiment Let's see how Notepad handles the result of
appending two combining marks, U+0300 and U+0333, to U+0061, the
code point for a We will create a UTF-8 encoded binary file using
XVI32 and open it with Notepad We already know that the byte
sequence for an initial byte order mark followed by U+0061 and
U+0300 is EF BB BF 61 CC 80 Now we must encode U+0333 in UTF-8 Hex
03 33 is 0000 0011 0011 0011 in binary Its ten significant
bits are 11 0011 0011 A two-byte code, 110xxxxx 10xxxxxx, provides
space for eleven bits Padding the ten bits with a leading 0, we get
1100 1100 1011 0011, that is CC B3 in hex So our complete file is
EF BB BF 61 CC 80 CC B3 Creating it with XVI32 and opening it in
Notepad, we see that Notepad cannot render the result
perfectly
- Slide 111
- Yet another experiment But Firefox can render perfectly the
result of appending two combining marks, U+0300 and U+0333, to
U+0061, the code point for a Let's create a UTF-8 file without
the (unnecessary) byte order mark Its contents will be 61 CC 80 CC
B3 Let's call the file aNovelWithoutBOM.txt and upload it to a
server The PHP program below delivers the content of this file to
Firefox, which can render it very well
- Slide 112
- Character Equivalence We have seen that Unicode provides
different ways of encoding diacritically-marked characters Such a
character may be encoded using different code point sequences a
sequence containing a single code point (for a composite character)
a sequence of several code points (one for a base character,
followed by one or more further code points for combining marks)
Some way is needed to determine whether or not two code point
sequences represent the same diacritically-marked character That
is, some way is needed to determine whether or not two code point
sequences are equivalent, so that programs can compare sequences,
organize them alphabetically and search for them The Unicode
standard defines two kinds of sequence equivalence: canonical
equivalence, and a weaker notion called compatibility equivalence The
standard also defines normalization algorithms which, for each type
of equivalence, produce a unique code point sequence from all
equivalent sequences For more details, see
http://unicode.org/reports/tr15/
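The slides work in PHP, but canonical equivalence and normalization are easy to demonstrate with Python's standard unicodedata module (a language-neutral sketch):

```python
import unicodedata

composed = "\u00e0"     # à as a single composite code point
decomposed = "a\u0300"  # a followed by the combining grave accent

# The two code point sequences differ as strings...
assert composed != decomposed

# ...but they are canonically equivalent: normalization produces a
# unique sequence from all equivalent ones, so programs can compare them.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```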
- Slide 113
- XML and UTF-8
- Slide 114
- Several versions of the same document Consider this short XML
document François Hollande Nicolas Paul Stéphane Sarközy de Nagy-Bocsa
I won! We will store it in two formats, ANSI and UTF-8, and see how
Firefox handles the result Later, we will develop a slightly better
version of the document and also store it in both formats
- Slide 115
- ANSI version of the document We have stored the document in an
ANSI-encoded file called memorandumStoredInANSIformat.xml Firefox
objects to the ç character
- Slide 116
- UTF-8 version of the document We have stored the document in a
UTF-8-encoded file called memorandumStoredInUTF8format.xml Firefox
handles the file properly It has no trouble with the ç, é, or ö
characters So, XML files which contain non-ASCII characters should
be stored in UTF-8 So make sure you use a text editor which
supports UTF-8 encoding
- Slide 117
- Improved version of the document Consider this version of the
XML document François Hollande Nicolas Paul Stéphane Sarközy de
Nagy-Bocsa I won! It has an encoding attribute We will store it in
two formats, ANSI and UTF-8, and see how Firefox handles the
result
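The slides show the improved document only as a screenshot; a reconstruction might look like the following (the tag names here are assumptions, not taken from the slides):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<memorandum>
  <from>François Hollande</from>
  <to>Nicolas Paul Stéphane Sarközy de Nagy-Bocsa</to>
  <body>I won!</body>
</memorandum>
```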
- Slide 118
- ANSI version of the document We have stored the document in an
ANSI-encoded file called memorandumUTF8storedInANSI.xml There is a
discrepancy between the encoding attribute and the actual encoding
used in the file Not surprisingly, Firefox objects to the ç
character
- Slide 119
- UTF-8 version of the document We have stored the document in a
UTF-8-encoded file called memorandumUTF8storedInUTF8.xml Firefox
handles the file properly Summary: The actual encoding matters more
than the encoding attribute, but it is better to use the
attribute
- Slide 120
- Another example Here is a memo from Xi Jinping to Bo Xilai We
have stored the document in a UTF-8-encoded file Firefox handles
the file properly With UTF-8, your XML documents can contain text
in any script
- Slide 121
- Yet another example Indeed, with UTF-8, even your tag names can
be in any script For example, here is another version of the memo
from Xi Jinping to Bo Xilai While the content is in Chinese, the
tag names are in Cyrillic You should always use UTF-8
- Slide 122
- Unicode and operating systems
- Slide 123
- Modern operating systems support Unicode When an operating
system supports Unicode, file names can contain any Unicode code
point However, even though an operating system may support Unicode
in file names, this does not mean that all applications will
display them properly For example, below, Firefox shows that two
files in a folder on a Linux server have Chinese characters in
their names But these are not rendered correctly if we use the
Linux ls command
- Slide 124
- Unicode implementations in operating systems Different
operating systems provide different implementations of Unicode in
file names Unix/Linux uses UTF-8 in file names Windows NTFS uses
UTF-16
- Slide 125
- UTF-8 and URLs
- Slide 126
- Any script can be used in URLs You already know that any ASCII
symbol can be encoded in a URL using the % character followed by
two hexadecimal digits giving the ASCII code of the symbol In fact,
this use of ASCII is just a special case Any Unicode symbol can be
url-encoded in URLs This means that a URL can contain any script
Below, see the English Wikipedia page for Bo Xilai The URL contains
his name in English On the next slide we will see the corresponding
Chinese page
- Slide 127
- Chinese characters in URLs Below, see the Chinese Wikipedia
page for Bo Xilai The URL contains his name in Chinese, url-encoded
using UTF-8 Firefox displays the Chinese characters in the URL box
But, even though the URL contains Chinese characters, Internet
Explorer (at least MSIE 8, the version on my desktop) does not show
them in the URL box Instead it shows the url-encoded UTF-8, three
bytes for each character, as we would expect:
https://zh.wikipedia.org/wiki/%E8%96%84%E7%86%99%E6%9D%A5
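The url-encoding shown above can be reproduced with Python's standard library (a sketch; any language with a percent-encoding routine would give the same bytes):

```python
from urllib.parse import quote, unquote

name = "薄熙来"  # Bo Xilai: three Chinese characters

# Each character becomes three UTF-8 bytes; each byte one %XX escape.
encoded = quote(name)
assert encoded == "%E8%96%84%E7%86%99%E6%9D%A5"

# Decoding the escapes recovers the original characters.
assert unquote(encoded) == name
```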
- Slide 128
- Handling Unicode in regular expressions
- Slide 129
- Handling combining marks with regular expressions As we shall
see, regular expressions can handle UTF-8 text, although treating
combining marks can be tricky PHP, for example, supports Unicode in
regular expressions when a new modifier, the u modifier, is appended
to a pattern
- Slide 130
- The dot operator and Unicode Regular expression engines treat a
single Unicode code-point as a single character Thus the dot meta
character will match any single Unicode code-point (except the
line-break character)
- Slide 131
- The dot operator and combining marks Remember that the character
à can be represented as one code point (U+00E0) or, using a
combining mark, as two (U+0061 U+0300) The dot operator will not
treat a character followed by a combining mark as one character It
will match only the basic character Instead, the meta-sequence \X
will match any Unicode character, be it a single code point or a
sequence
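The slides describe PHP's PCRE behaviour; the same point about the dot operator can be illustrated with Python's standard re module (note that stdlib re does not support \X, which is available in PCRE and in the third-party Python regex package):

```python
import re

composed = "\u00e0"     # à as one code point
decomposed = "a\u0300"  # a plus combining grave accent: two code points

# The dot matches exactly one code point, not one user-visible character.
assert len(re.findall(r".", composed)) == 1
assert len(re.findall(r".", decomposed)) == 2

# So a single dot matches the whole composed form,
# but not the whole decomposed form.
assert re.fullmatch(r".", composed) is not None
assert re.fullmatch(r".", decomposed) is None
```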
- Slide 132
- To be continued See
http://www.regular-expressions.info/unicode.html
- Slide 133
- Unicode in XML and other Markup Languages
http://www.w3.org/TR/unicode-xml/
Regular Expression Matching in XSLT 2
http://www.xml.com/pub/a/2003/06/04/tr.html
Support for XSLT 2.0: browsers so far don't support XSLT 2.0
natively, but it is possible to use it via Saxon-B (Java) or, more
recently, Saxon-CE, an attempt to port Saxon 9 to client-side
JavaScript so that it can be used within browsers See
http://www.saxonica.com/ce/index.xml for the open source Saxon-CE