UTF-8, Perl and You By Rafael Almeria

Page 1: UTF-8, Perl and You

UTF-8, Perl and You

By Rafael Almeria

Page 2: UTF-8, Perl and You

Chapter 1: Introduction

Page 3: UTF-8, Perl and You

1 - Introduction

This talk does not deal with the motivation for using UTF-8.

Page 4: UTF-8, Perl and You

1 - Introduction

This talk is about:

Implementation details.

Understanding UTF-8.

Converting your data.

Knowing how to fix common problems.

Page 5: UTF-8, Perl and You

1 - Introduction

Some assumptions:

Language: Perl

Unix Operating System

Input encoded as: ASCII, ISO-8859-1/Latin-1 or Windows-1252.

Output encoded as: UTF-8

Page 6: UTF-8, Perl and You

1 - Introduction

What we’ll cover in this talk:

A primer on character encoding

A simplifying principle

UTF-8

Perl & UTF-8

Making the Browser Happy

Encoding Hell

Page 7: UTF-8, Perl and You

Chapter 2: A Very Brief Primer on Character Encoding.

Page 8: UTF-8, Perl and You

2 - A Very Brief Primer on Character Encoding.

What is a character encoding?

Page 9: UTF-8, Perl and You

2 - A Very Brief Primer on Character Encoding.

It’s a specific way to represent the characters in a given character set.

Page 10: UTF-8, Perl and You

2 - A Very Brief Primer on Character Encoding.

A character set may have a numerical ordering on it for use with a given character encoding.

Page 11: UTF-8, Perl and You

2 - A Very Brief Primer on Character Encoding.

The number given to a specific character in an ordered character set is its code point.

Page 12: UTF-8, Perl and You

2 - A Very Brief Primer on Character Encoding.

Do not confuse the character’s code point with its representation!

Page 13: UTF-8, Perl and You

2 - A Very Brief Primer on Character Encoding.

It may be the same for ASCII, ISO-8859-1 and Windows-1252 and…

Page 14: UTF-8, Perl and You

2 - A Very Brief Primer on Character Encoding.

it may be the same for 1-byte UTF-8 but…

Page 15: UTF-8, Perl and You

2 - A Very Brief Primer on Character Encoding.

it’s definitely not true for multi-byte UTF-8.

Page 16: UTF-8, Perl and You

2 - A Very Brief Primer on Character Encoding.

It’s a common problem. So don’t confuse them!
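To make the distinction concrete, here is a minimal sketch using Perl's core Encode module. The helper name hex_bytes is made up for this illustration. It prints the code point of "ñ" next to the bytes that represent it in two encodings.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the encoded bytes of a string as hex, e.g. "c3 b1".
# (hex_bytes is a hypothetical helper, not part of Encode.)
sub hex_bytes {
    my ($chars, $encoding) = @_;
    my $bytes = encode($encoding, $chars);
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
}

my $n_tilde = "\x{f1}";    # the character ñ, code point 0xF1

printf "code point : 0x%02x\n", ord($n_tilde);
printf "ISO-8859-1 : %s\n", hex_bytes($n_tilde, 'iso-8859-1');
printf "UTF-8      : %s\n", hex_bytes($n_tilde, 'UTF-8');
```

In ISO-8859-1 the single byte f1 equals the code point. In UTF-8 the same character becomes the two bytes c3 b1.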

Page 17: UTF-8, Perl and You

Chapter 3: A Simplifying Principle

Page 18: UTF-8, Perl and You

3 - A Simplifying Principle

If all of our data is encoded using only the following encodings (code point ranges are in parentheses):

ASCII (0x00 - 0x7F)

ISO-8859-1/Latin-1 (0x00 - 0xFF)

Windows-1252 (0x00 - 0xFF)

Page 19: UTF-8, Perl and You

3 - A Simplifying Principle

and if we only care about printable content then

ASCII ⊂ ISO-8859-1 ⊂ Windows-1252

Page 20: UTF-8, Perl and You

3 - A Simplifying Principle

We can treat everything as Windows-1252!

Page 21: UTF-8, Perl and You

3 - A Simplifying Principle

This should be OK if we are sure that every document uses one of these three encodings, even when we are not sure which one a given document uses.
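The principle can be checked with a short sketch, assuming the core Encode module and its cp1252 and iso-8859-1 encoding names. The helper names are invented for this example. It decodes every printable byte value under both encodings and counts disagreements.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode one byte value under the named encoding and return the code point.
sub code_point_for {
    my ($byte_value, $encoding) = @_;
    return ord(decode($encoding, chr($byte_value)));
}

# Count printable byte values that ISO-8859-1 and Windows-1252 decode differently.
sub count_mismatches {
    my $mismatches = 0;
    for my $byte (0x20 .. 0x7E, 0xA0 .. 0xFF) {
        $mismatches++
            if code_point_for($byte, 'iso-8859-1') != code_point_for($byte, 'cp1252');
    }
    return $mismatches;
}

print "printable bytes decoded differently: ", count_mismatches(), "\n";
printf "byte 0x80 as Windows-1252: U+%04X\n", code_point_for(0x80, 'cp1252');
```

The two encodings only differ in the range 0x80 to 0x9F, where ISO-8859-1 has control codes and Windows-1252 has printable characters such as the euro sign.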

Page 22: UTF-8, Perl and You

Chapter 4: UTF-8. A Brave New World

Page 23: UTF-8, Perl and You

4 - UTF-8. A Brave New World

It supports every language you’ll probably ever need.

Page 24: UTF-8, Perl and You

4 - UTF-8. A Brave New World

No need for Windows-1252 this and Windows-1253 that.

Page 25: UTF-8, Perl and You

4 - UTF-8. A Brave New World

It can encode code points from 0x00 to 0x10FFFF.

Page 26: UTF-8, Perl and You

4 - UTF-8. A Brave New World

It uses a variable (1 to 4) byte encoding.

Page 27: UTF-8, Perl and You

4 - UTF-8. A Brave New World

1-byte UTF-8 is used for code points in the range 0x00 to 0x7F.

Page 28: UTF-8, Perl and You

4 - UTF-8. A Brave New World

1-byte UTF-8 = ASCII

MSBit is 0

code point = representation

Page 29: UTF-8, Perl and You

4 - UTF-8. A Brave New World

Examples of 1-byte UTF-8:

“A” -> 0100 0001

“&” -> 0010 0110

“5” -> 0011 0101
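The bit patterns above can be reproduced with a small sketch. The helper bits_of is a name chosen for this illustration.

```perl
use strict;
use warnings;

# Format an ASCII character's code point as two groups of four bits.
sub bits_of {
    my ($char) = @_;
    my $bits = sprintf('%08b', ord($char));
    return substr($bits, 0, 4) . ' ' . substr($bits, 4);
}

printf "%s -> %s\n", $_, bits_of($_) for 'A', '&', '5';
```

Each result starts with a 0 bit, which is what marks a 1-byte UTF-8 sequence.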

Page 30: UTF-8, Perl and You

4 - UTF-8. A Brave New World

2-byte UTF-8 is used for code points in the range 0x0080 to 0x07FF.

Page 31: UTF-8, Perl and You

4 - UTF-8. A Brave New World

2-byte UTF-8

code point != representation

Page 32: UTF-8, Perl and You

4 - UTF-8. A Brave New World

The code point is broken apart into two pieces.

Page 33: UTF-8, Perl and You

4 - UTF-8. A Brave New World

The code point is written as 11 bits. The five MSBits are assigned to the first byte and the six LSBits are assigned to the second byte.

Page 34: UTF-8, Perl and You

4 - UTF-8. A Brave New World

For the first byte of 2-byte UTF-8

The three MSBits are set to 110

The remaining bits are the five MSBits of the code point.

Page 35: UTF-8, Perl and You

4 - UTF-8. A Brave New World

For the second byte of 2-byte UTF-8

The two MSBits are set to 10

The remaining bits are the six LSBits of the code point.
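The 2-byte rules can be written out by hand as a sketch. The function name encode_two_bytes is made up for this example, and in real code you would let Perl or Encode do this for you.

```perl
use strict;
use warnings;

# Encode a code point in the range 0x80 .. 0x7FF as two UTF-8 byte values.
sub encode_two_bytes {
    my ($cp) = @_;
    die "code point out of 2-byte range\n" if $cp < 0x80 || $cp > 0x7FF;
    my $first  = 0b1100_0000 | ($cp >> 6);      # 110 + five MSBits
    my $second = 0b1000_0000 | ($cp & 0x3F);    # 10  + six LSBits
    return ($first, $second);
}

printf "U+%04X -> %02X %02X\n", 0xF1, encode_two_bytes(0xF1);
```

For "ñ" (code point 0xF1) this prints `U+00F1 -> C3 B1`.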

Page 36: UTF-8, Perl and You

4 - UTF-8. A Brave New World

3-byte UTF-8 is used for code points in the range 0x0800 to 0xFFFF.

Page 37: UTF-8, Perl and You

4 - UTF-8. A Brave New World

3-byte UTF-8

code point != representation

Page 38: UTF-8, Perl and You

4 - UTF-8. A Brave New World

The code point is broken apart into three pieces.

Page 39: UTF-8, Perl and You

4 - UTF-8. A Brave New World

The four MSBits of the code point are assigned to the first byte.

The middle six bits are assigned to the second byte.

The six LSBits are assigned to the third byte.

Page 40: UTF-8, Perl and You

4 - UTF-8. A Brave New World

For the first byte of 3-byte UTF-8

The four MSBits are set to 1110

The remaining bits are the four MSBits of the code point.

Page 41: UTF-8, Perl and You

4 - UTF-8. A Brave New World

For the second byte of 3-byte UTF-8

The two MSBits are set to 10

The remaining bits are the six middle bits of the code point.

Page 42: UTF-8, Perl and You

4 - UTF-8. A Brave New World

For the third byte of 3-byte UTF-8

The two MSBits are set to 10

The remaining bits are the six LSBits of the code point.
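The same idea extends to three bytes. This sketch uses a hypothetical encode_three_bytes function and treats the code point as 16 bits split 4 + 6 + 6.

```perl
use strict;
use warnings;

# Encode a code point in the range 0x800 .. 0xFFFF as three UTF-8 byte values.
# (This sketch does not reject the surrogate range 0xD800 .. 0xDFFF.)
sub encode_three_bytes {
    my ($cp) = @_;
    die "code point out of 3-byte range\n" if $cp < 0x800 || $cp > 0xFFFF;
    my $first  = 0b1110_0000 | ($cp >> 12);             # 1110 + four MSBits
    my $second = 0b1000_0000 | (($cp >> 6) & 0x3F);     # 10   + six middle bits
    my $third  = 0b1000_0000 | ($cp & 0x3F);            # 10   + six LSBits
    return ($first, $second, $third);
}

printf "U+%04X -> %02X %02X %02X\n", 0x20AC, encode_three_bytes(0x20AC);
```

For the euro sign (code point 0x20AC) this prints `U+20AC -> E2 82 AC`.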

Page 43: UTF-8, Perl and You

4 - UTF-8. A Brave New World

4-byte UTF-8 is used for code points in the range 0x10000 to 0x10FFFF.

Page 44: UTF-8, Perl and You

4 - UTF-8. A Brave New World

4-byte UTF-8

code point != representation

Page 45: UTF-8, Perl and You

4 - UTF-8. A Brave New World

The code point is broken apart into four pieces.

Page 46: UTF-8, Perl and You

4 - UTF-8. A Brave New World

The three MSBits of the code point are assigned to the first byte.

The next six MSBits are assigned to the second byte.

The six bits after those are assigned to the third byte.

The six LSBits are assigned to the fourth byte.

Page 47: UTF-8, Perl and You

4 - UTF-8. A Brave New World

For the first byte of 4-byte UTF-8

The five MSBits are set to 11110

The remaining bits are the three MSBits of the code point.

Page 48: UTF-8, Perl and You

4 - UTF-8. A Brave New World

For the second byte of 4-byte UTF-8

The two MSBits are set to 10

The remaining bits are the next six middle bits of the code point.

Page 49: UTF-8, Perl and You

4 - UTF-8. A Brave New World

For the third byte of 4-byte UTF-8

The two MSBits are set to 10

The remaining bits are the next six middle bits of the code point.

Page 50: UTF-8, Perl and You

4 - UTF-8. A Brave New World

For the fourth byte of 4-byte UTF-8

The two MSBits are set to 10

The remaining bits are the six LSBits of the code point.
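Putting the four cases together, here is a sketch of a complete encoder that picks the sequence length from the code point range and compares its result with the core Encode module. The name utf8_bytes is invented for this example, and the sketch makes no attempt to reject surrogates.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode one code point (0x00 .. 0x10FFFF) into a list of UTF-8 byte values.
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp <= 0x7F;
    return (0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F)) if $cp <= 0x7FF;
    return (0xE0 | ($cp >> 12), 0x80 | (($cp >> 6) & 0x3F), 0x80 | ($cp & 0x3F))
        if $cp <= 0xFFFF;
    die "code point out of range\n" if $cp > 0x10FFFF;
    return (
        0xF0 | ($cp >> 18),              # 11110 + three MSBits
        0x80 | (($cp >> 12) & 0x3F),     # 10    + next six bits
        0x80 | (($cp >> 6) & 0x3F),      # 10    + next six bits
        0x80 | ($cp & 0x3F),             # 10    + six LSBits
    );
}

for my $cp (0x41, 0xF1, 0x20AC, 0x1F600) {
    my $manual  = join ' ', map { sprintf('%02X', $_) } utf8_bytes($cp);
    my $library = join ' ', map { sprintf('%02X', ord($_)) }
                  split //, encode('UTF-8', chr($cp));
    printf "U+%04X -> %-11s (%s)\n", $cp, $manual,
        $manual eq $library ? 'matches Encode' : 'differs from Encode';
}
```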

Page 51: UTF-8, Perl and You

Chapter 5: Perl & UTF-8

Page 52: UTF-8, Perl and You

5 - Perl & UTF-8

If you want to create UTF-8 strings in your Perl code, all you have to do is use the following notation, where codepoint is the character's code point in hexadecimal:

\x{codepoint}

Page 53: UTF-8, Perl and You

5 - Perl & UTF-8

For example, to create the string “niño”:

my $str = "ni\x{f1}o";
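One thing worth seeing for yourself: the string holds characters, and the bytes only appear when it is encoded. The sketch below uses the core Encode module, and the helper name char_and_byte_counts is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the number of characters in a string and the number of bytes
# it occupies once encoded as UTF-8.
sub char_and_byte_counts {
    my ($text) = @_;
    return (length($text), length(encode('UTF-8', $text)));
}

my $str = "ni\x{f1}o";    # four characters: n, i, ñ, o
printf "characters: %d, UTF-8 bytes: %d\n", char_and_byte_counts($str);
```

This prints `characters: 4, UTF-8 bytes: 5`, because "ñ" takes two bytes in UTF-8.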

Page 54: UTF-8, Perl and You

5 - Perl & UTF-8

To write this string to STDOUT you might do this:

binmode STDOUT, ":utf8";
print $str;

Page 55: UTF-8, Perl and You

5 - Perl & UTF-8

To undo it, do this:

binmode STDOUT;
print $str;

Page 56: UTF-8, Perl and You

5 - Perl & UTF-8

Or to write UTF-8 data to disk, you could do this:

open(my $ofile, ">:utf8", $filename) or die "Cannot open $filename: $!";
print $ofile $str;

Page 57: UTF-8, Perl and You

5 - Perl & UTF-8

To read UTF-8 data from disk, you could do this:

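open(my $ifile, "<:utf8", $filename) or die "Cannot open $filename: $!";   # ":encoding(UTF-8)" is the stricter layer
my $str = <$ifile>;   # reads the first line only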
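Here is a sketch of a full write and read round trip. It uses a temporary file from the core File::Temp module and the stricter `:encoding(UTF-8)` layer, which validates its input, in place of `:utf8`. The function name round_trip is invented for this example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a character string to $filename as UTF-8, then read the first line back.
sub round_trip {
    my ($str, $filename) = @_;

    open(my $out, '>:encoding(UTF-8)', $filename) or die "cannot write $filename: $!";
    print {$out} $str;
    close($out) or die "cannot close $filename: $!";

    open(my $in, '<:encoding(UTF-8)', $filename) or die "cannot read $filename: $!";
    my $line = <$in>;
    close($in);
    return $line;
}

my (undef, $filename) = tempfile(UNLINK => 1);    # removed when the program exits
my $copy = round_trip("ni\x{f1}o", $filename);
printf "file size: %d bytes, string %s\n", -s $filename,
    $copy eq "ni\x{f1}o" ? 'preserved' : 'changed';
```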

Page 58: UTF-8, Perl and You

5 - Perl & UTF-8

To convert Windows-1252 to UTF-8, you could do something like this:

use Text::Iconv;
use Encode;
my $utf8_str = Text::Iconv->new("WINDOWS-1252", "UTF-8")->convert($str);
Encode::_utf8_on($utf8_str);   # mark the converted bytes as UTF-8 characters
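As an alternative sketch, not the approach shown on the slide, the same conversion can be done with the core Encode module alone, with no need to touch the UTF-8 flag by hand. The function name cp1252_to_utf8 is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert Windows-1252 bytes to UTF-8 bytes by way of a Perl character string.
sub cp1252_to_utf8 {
    my ($cp1252_bytes) = @_;
    my $chars = decode('cp1252', $cp1252_bytes);    # bytes -> characters
    return encode('UTF-8', $chars);                 # characters -> UTF-8 bytes
}

my $utf8_bytes = cp1252_to_utf8("ni\xF1o \x80");    # "niño €" in Windows-1252
print join(' ', map { sprintf('%02X', ord($_)) } split //, $utf8_bytes), "\n";
```

This prints `6E 69 C3 B1 6F 20 E2 82 AC`. If you want a character string to work with inside the program, keep the result of `decode` and skip the `encode` step.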

Page 59: UTF-8, Perl and You

Chapter 6: Making the Browser Happy

Page 60: UTF-8, Perl and You

6 - Making the Browser Happy

All the efforts up to now will be for naught if the browser doesn’t understand how the page is encoded.

Page 61: UTF-8, Perl and You

6 - Making the Browser Happy

To make the browser aware of the encoding, either send this HTTP header…

Page 62: UTF-8, Perl and You

6 - Making the Browser Happy

Content-type: text/html; charset=utf-8

Page 63: UTF-8, Perl and You

6 - Making the Browser Happy

or if you want to tag each document…

Page 64: UTF-8, Perl and You

6 - Making the Browser Happy

for XML add this declaration at the top of the document:

<?xml version="1.0" encoding="utf-8"?>

Page 65: UTF-8, Perl and You

6 - Making the Browser Happy

for HTML add this declaration at the top of the <head> section of the document:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Page 66: UTF-8, Perl and You

6 - Making the Browser Happy

for XHTML add this declaration at the top of the <head> section of the document:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
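For the HTTP header option, here is a minimal sketch that assumes a plain CGI-style script which prints its own headers. The function name html_header is made up for this example. The output layer has to match the declared charset, otherwise the header and the bytes disagree.

```perl
use strict;
use warnings;

# Build the header block a CGI-style script prints before the page body.
sub html_header {
    return "Content-Type: text/html; charset=utf-8\r\n\r\n";
}

binmode STDOUT, ':encoding(UTF-8)';    # encode the body to match the header
print html_header();
print "<p>ni\x{f1}o</p>\n";
```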

Page 67: UTF-8, Perl and You

Chapter 7: Encoding Hell

Page 68: UTF-8, Perl and You

7 - Encoding Hell

So now we think we understand UTF-8…

Page 69: UTF-8, Perl and You

7 - Encoding Hell

…and we think we understand how to process this data in Perl but…

Page 70: UTF-8, Perl and You

7 - Encoding Hell

there is still SO MUCH OPPORTUNITY for things to go wrong!

Page 71: UTF-8, Perl and You

7 - Encoding Hell

The Byte Order Mark (0xFEFF code point) is one of them.

Page 72: UTF-8, Perl and You

7 - Encoding Hell

The intention is probably good but it can cause much grief.

Page 73: UTF-8, Perl and You

7 - Encoding Hell

The solution is to strip the byte sequence EF BB BF from the beginning of the document.
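A minimal sketch of that fix, assuming the document is held as a byte string that has not yet been decoded. The function name strip_utf8_bom is invented for this example.

```perl
use strict;
use warnings;

# Remove a leading UTF-8 byte order mark (EF BB BF) from a byte string, if present.
sub strip_utf8_bom {
    my ($bytes) = @_;
    $bytes =~ s/\A\xEF\xBB\xBF//;
    return $bytes;
}

my $with_bom = "\xEF\xBB\xBFhello";
printf "%d bytes before, %d bytes after\n",
    length($with_bom), length(strip_utf8_bom($with_bom));
```

This prints `8 bytes before, 5 bytes after`. The `\A` anchor makes sure only a mark at the very start of the data is removed.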

Page 74: UTF-8, Perl and You

7 - Encoding Hell

Encoded Gibberish.

(It takes several forms)

Page 75: UTF-8, Perl and You

7 - Encoding Hell

All Gibberish

Page 76: UTF-8, Perl and You

7 - Encoding Hell

If it’s all gibberish then maybe the data is ok but you’re looking at it using the wrong pair of glasses. Change the document encoding declaration. Or try changing your browser’s or application’s encoding setting.

Page 77: UTF-8, Perl and You

7 - Encoding Hell

Partially Gibberish

(Two Cases)

Page 78: UTF-8, Perl and You

7 - Encoding Hell

First Case: What does it look like?

Niño vs Ni?o

Niño vs Ni o

Page 79: UTF-8, Perl and You

7 - Encoding Hell

You likely have the dreaded “mixed encoding” nightmare. Probably someone has poured ISO-8859-1 or Windows-1252 into a UTF-8 document or vice-versa. You will need to figure out which bytes are which and clean the document up to make it pure UTF-8.
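One way to approach the cleanup is sketched below, under a strong assumption: any high byte that is not part of a well-formed UTF-8 sequence is taken to be Windows-1252. The function names are made up for this example, and the pattern is simplified (it does not reject overlong forms or surrogates). A genuine Windows-1252 pair that happens to look like UTF-8 would be left alone, so check the result by eye.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Re-encode a single stray byte, assuming it is Windows-1252.
sub cp1252_byte_to_utf8 {
    my ($byte) = @_;
    return encode('UTF-8', decode('cp1252', $byte));
}

# Keep ASCII and well-formed multi-byte UTF-8 sequences as they are;
# convert every other high byte from Windows-1252 to UTF-8.
sub repair_mixed {
    my ($bytes) = @_;
    my $utf8_seq = qr/[\xC2-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF4][\x80-\xBF]{3}/;
    $bytes =~ s{($utf8_seq)|([\x80-\xFF])}{ defined($1) ? $1 : cp1252_byte_to_utf8($2) }ge;
    return $bytes;
}

my $mixed = "ni\xF1o and ni\xC3\xB1o";    # first word Windows-1252, second UTF-8
my $clean = repair_mixed($mixed);
print join(' ', map { sprintf('%02X', ord($_)) } split //, $clean), "\n";
```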

Page 80: UTF-8, Perl and You

7 - Encoding Hell

Second Case: What does it look like?

niño (viewed in UTF-8 mode)

niño (viewed in Windows-1252 mode)

Page 81: UTF-8, Perl and You

7 - Encoding Hell

You likely have the double encoding problem. Sometimes some of the data gets encoded as UTF-8 twice! Again, you’ll need to look at the bytes and fix it.
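The sketch below reproduces the problem and reverses it. It assumes the first round of UTF-8 bytes was mistaken for Windows-1252 before being encoded again. The function names are invented for this example. Only apply the fix to data you know was encoded twice.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Simulate the bug: UTF-8 bytes are mistaken for Windows-1252 and encoded again.
sub double_encode {
    my ($chars) = @_;
    my $once = encode('UTF-8', $chars);
    return encode('UTF-8', decode('cp1252', $once));
}

# Undo one extra layer of encoding, assuming the input really was encoded twice.
sub undo_double_encoding {
    my ($bytes) = @_;
    return encode('cp1252', decode('UTF-8', $bytes));
}

# Render a byte string as space-separated hex.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

my $broken = double_encode("ni\x{f1}o");
print "broken: ", hex_dump($broken), "\n";
print "fixed:  ", hex_dump(undo_double_encoding($broken)), "\n";
```

The broken form is `6E 69 C3 83 C2 B1 6F`, which a UTF-8 viewer shows as "niÃ±o". The fixed form is `6E 69 C3 B1 6F`.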

Page 82: UTF-8, Perl and You

7 - Encoding Hell

Now some odds and ends…

Page 83: UTF-8, Perl and You

7 - Encoding Hell

HTML::Entities::decode_entities doesn’t always do what you think. Sometimes it returns ISO-8859-1 instead of UTF-8.

Caveat programmer!

Page 84: UTF-8, Perl and You

7 - Encoding Hell

Be careful if you’re using the encode or decode routines from Encode.pm: they may not set the string’s UTF-8 flag the way you expect.
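A quick way to see what the flag looks like after each call is sketched here with the built-in `utf8::is_utf8` function. The helper name describe is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Describe a string by its length and the state of its internal UTF-8 flag.
sub describe {
    my ($s) = @_;
    return sprintf('length %d, flag %s', length($s), utf8::is_utf8($s) ? 'on' : 'off');
}

my $chars = decode('UTF-8', "ni\xC3\xB1o");    # bytes in, characters out
my $bytes = encode('UTF-8', $chars);           # characters in, bytes out

print "decoded: ", describe($chars), "\n";
print "encoded: ", describe($bytes), "\n";
```

The decoded string has four characters with the flag on. The encoded string has five bytes with the flag off, even though it holds valid UTF-8.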

Page 85: UTF-8, Perl and You

7 - Encoding Hell

And as a checklist of sorts when you’re debugging…

Page 86: UTF-8, Perl and You

7 - Encoding Hell

When debugging, make sure that:

The data has been encoded properly.

The data has been flagged as UTF-8.

The data has been written out properly.

The document has the appropriate encoding declaration.

Your terminal or browser has been set to the correct encoding.
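For the first item on the checklist, here is a sketch of a validity check using strict decoding from the core Encode module. The function name is_valid_utf8 is invented for this example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Return 1 if $bytes is well-formed UTF-8, 0 otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;    # work on a copy: FB_CROAK may modify its input
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}

for my $sample ("ni\xC3\xB1o", "ni\xF1o") {
    printf "%s: %s\n",
        join(' ', map { sprintf('%02X', ord($_)) } split //, $sample),
        is_valid_utf8($sample) ? 'valid UTF-8' : 'not valid UTF-8';
}
```

The first sample is proper UTF-8. The second is the Windows-1252 form of the same word and fails the check.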

Page 87: UTF-8, Perl and You

Conclusion

Page 88: UTF-8, Perl and You

Conclusion

Navigating the transition from traditional encodings to UTF-8 is not easy, but with perseverance it is doable. We have covered the common encodings, how to process data in them with Perl, and how to tackle the common problems that might arise.

Page 89: UTF-8, Perl and You

References

Page 90: UTF-8, Perl and You

References

http://www.utf8-chartable.de/unicode-utf8-table.pl?htmlent=1 A nice list of UTF-8 characters, their character entities, code points and representation.

http://en.wikipedia.org/wiki/UTF-8

http://en.wikipedia.org/wiki/Replacement_character#Replacement_character

http://en.wikipedia.org/wiki/Character_encoding

http://en.wikipedia.org/wiki/Byte-order_mark

Page 91: UTF-8, Perl and You

References

http://en.wikipedia.org/wiki/Windows-1252

http://en.wikipedia.org/wiki/ISO/IEC_8859-1

http://en.wikipedia.org/wiki/ASCII

http://www.w3.org/International/O-charset

http://www.w3.org/International/O-HTTP-charset

http://www.w3.org/International/tutorials/tutorial-char-enc/

Page 92: UTF-8, Perl and You

References

http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode

http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF

http://www.joelonsoftware.com/articles/Unicode.html

http://unicode.org/