UTF-8, Perl and You

By Rafael Almeria

Chapter 1: Introduction

1 - Introduction

This talk does not deal with the motivation for using UTF-8.

This talk is about:

Implementation details.

Understanding UTF-8.

Converting your data,

And knowing how to fix common problems.

Some assumptions:

Language: Perl

Unix Operating System

Input encoded as: ASCII, ISO-8859-1/Latin-1 or Windows-1252.

Output encoded as: UTF-8

What we’ll cover in this talk:

A primer on character encoding

A simplifying principle

Perl & UTF-8

Making the Browser Happy

Encoding Hell

Chapter 2: A Very Brief Primer on Character Encoding

2 - A Very Brief Primer on Character Encoding.

What is a character encoding?

It’s a specific way to represent the characters in a given character set.

A character set may have a numerical ordering on it for use with a given character encoding.

The number given to a specific character in an ordered character set is its code point.

Do not confuse the character’s code point with its representation!

It may be the same for ASCII, ISO-8859-1 and Windows-1252 and…

it may be the same for 1-byte UTF-8 but…

it’s definitely not true for multi-byte UTF-8.

It’s a common problem. So don’t confuse them!
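
The distinction can be made concrete with a minimal Perl sketch. It assumes the core Encode module; the variable names are illustrative only. The same character keeps one code point, while its byte representation changes with the encoding chosen.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character, written by its code point: U+00F1 (n with tilde).
my $char = "\x{F1}";

# The code point is just the number assigned to the character.
my $code_point = ord($char);

# The representation is the byte sequence a given encoding produces.
my $latin1_hex = uc(unpack('H*', encode('ISO-8859-1', $char)));
my $utf8_hex   = uc(unpack('H*', encode('UTF-8', $char)));

printf "code point:       0x%X\n", $code_point;
print  "ISO-8859-1 bytes: $latin1_hex\n";
print  "UTF-8 bytes:      $utf8_hex\n";
```

In ISO-8859-1 the single byte happens to equal the code point; in UTF-8 the same character takes two bytes, neither of which is 0xF1.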

Chapter 3: A Simplifying Principle

3 - A Simplifying Principle

If all of our data is encoded using only the following encodings (code point ranges are in parentheses):

ASCII (0x00 - 0x7F)

ISO-8859-1/Latin-1 (0x00 - 0xFF)

Windows-1252 (0x00 - 0xFF)

and if we only care about printable content then

ASCII ⊂ ISO-8859-1 ⊂ Windows-1252

We can treat everything as Windows-1252!

This should be OK if we are sure that every document uses one of these three encodings, even when we don’t know which one a given document uses.
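
A minimal sketch of the principle, assuming the core Encode module (which knows Windows-1252 under the name "cp1252"). The helper name decode_as_cp1252 is hypothetical, not something from the talk.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode raw bytes as Windows-1252, whichever of the three
# encodings the input was nominally written in.
sub decode_as_cp1252 {
    my ($bytes) = @_;
    return decode('cp1252', $bytes);
}

my $from_ascii  = decode_as_cp1252("nino");      # pure ASCII input
my $from_latin1 = decode_as_cp1252("ni\xF1o");   # 0xF1 is n-tilde in both 8-bit sets
my $from_cp1252 = decode_as_cp1252("\x80");      # printable only in Windows-1252

printf "U+%04X\n", ord($from_cp1252);            # code point of the euro sign
```

The result is a Perl character string in every case, so the rest of the program no longer needs to know which of the three encodings the input used.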

Chapter 4: UTF-8. A Brave New World

4 - UTF-8. A Brave New World

It supports every language you’ll probably ever need.

No need for Windows-1252 this and Windows-1253 that.

Its code point range is from 0x00 to 0x10FFFF.

It uses a variable (1 to 4) byte encoding.

1-byte UTF-8 is used for code points in the range 0x00 to 0x7F.

1-byte UTF-8 = ASCII

The MSBit is 0.

code point = representation

Examples of 1-byte UTF-8:

“A” -> 0100 0001

“&” -> 0010 0110

“5” -> 0011 0101
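
The bit patterns above can be reproduced with a short Perl sketch; byte_bits is a hypothetical helper name used only for this illustration.

```perl
use strict;
use warnings;

# Render a one-byte character as two groups of four bits.
sub byte_bits {
    my ($char) = @_;
    my $bits = sprintf('%08b', ord($char));
    return substr($bits, 0, 4) . ' ' . substr($bits, 4);
}

for my $char ('A', '&', '5') {
    printf "%s -> %s\n", $char, byte_bits($char);
}
```

Each pattern starts with 0, which is what marks the byte as a complete one-byte character.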

2-byte UTF-8 is used for code points in the range 0x0080 to 0x07FF.

2-byte UTF-8: code point != representation

The code point is broken apart into two pieces.

The five MSBits of the code point are assigned to the first byte and the six LSBits are assigned to the second byte.

For the first byte of 2-byte UTF-8

The three MSBits are set to 110

The remaining bits are the five MSBits of the code point.

For the second byte of 2-byte UTF-8

The two MSBits are set to 10

The remaining bits are the six LSBits of the code point.
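
The two-byte layout can be sketched in Perl as follows. The function name encode_2byte is hypothetical and the code is meant as an illustration of the bit layout, not as a replacement for a real encoder.

```perl
use strict;
use warnings;

# Encode a code point in the range 0x80 to 0x7FF as two UTF-8 bytes.
sub encode_2byte {
    my ($cp) = @_;
    die "code point outside the 2-byte range\n" if $cp < 0x80 || $cp > 0x7FF;
    my $first  = 0b1100_0000 | ($cp >> 6);     # 110 followed by the five MSBits
    my $second = 0b1000_0000 | ($cp & 0x3F);   # 10 followed by the six LSBits
    return ($first, $second);
}

my @bytes = encode_2byte(0xF1);                # U+00F1, n with tilde
printf "%02X %02X\n", @bytes;                  # prints "C3 B1"
```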

3-byte UTF-8 is used for code points in the range 0x0800 to 0xFFFF.

The code point is broken apart into three pieces.

The four MSBits of the code point are assigned to the first byte.

The middle six bits are assigned to the second byte.

The six LSBits are assigned to the third byte.

For the first byte of 3-byte UTF-8

The four MSBits are set to 1110

The remaining bits are the four MSBits of the code point.

For the second byte of 3-byte UTF-8

The two MSBits are set to 10

The remaining bits are the six middle bits of the code point.

For the third byte of 3-byte UTF-8

The two MSBits are set to 10

The remaining bits are the six LSBits of the code point.
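
Reading the three-byte layout in the other direction recovers the code point. A minimal sketch, with decode_3byte as a hypothetical name; it checks only the marker bits and does not reject overlong encodings.

```perl
use strict;
use warnings;

# Reassemble a code point from the three bytes of a 3-byte UTF-8 sequence.
sub decode_3byte {
    my ($b1, $b2, $b3) = @_;
    die "not a 3-byte UTF-8 sequence\n"
        unless ($b1 & 0xF0) == 0xE0     # first byte starts with 1110
            && ($b2 & 0xC0) == 0x80     # second byte starts with 10
            && ($b3 & 0xC0) == 0x80;    # third byte starts with 10
    return (($b1 & 0x0F) << 12)         # four MSBits
         | (($b2 & 0x3F) << 6)          # six middle bits
         |  ($b3 & 0x3F);               # six LSBits
}

my $euro = decode_3byte(0xE2, 0x82, 0xAC);
printf "U+%04X\n", $euro;               # prints "U+20AC"
```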

4-byte UTF-8 is used for code points in the range 0x10000 to 0x10FFFF.

The code point is broken apart into four pieces.

The three MSBits of the code point are assigned to the first byte.

The next six bits are assigned to the second byte.

The following six bits are assigned to the third byte.

The six LSBits are assigned to the fourth byte.

For the first byte of 4-byte UTF-8

The five MSBits are set to 11110

The remaining bits are the three MSBits of the code point.

For the second byte of 4-byte UTF-8

The two MSBits are set to 10

The remaining bits are the next six bits of the code point.

For the third byte of 4-byte UTF-8

The two MSBits are set to 10

The remaining bits are the following six bits of the code point.

For the fourth byte of 4-byte UTF-8

The two MSBits are set to 10

The remaining bits are the six LSBits of the code point.
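
Putting the four cases together gives a small hand-rolled encoder. This is a sketch for study only: utf8_bytes is a hypothetical name, surrogate code points are not treated specially, and in real code the core Encode module should do this work. The loop compares the hand-rolled result with Encode for one code point from each length class.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the UTF-8 byte values for one code point, by bit manipulation.
sub utf8_bytes {
    my ($cp) = @_;
    die "code point out of range\n" if $cp < 0 || $cp > 0x10FFFF;
    return ($cp) if $cp <= 0x7F;
    return (0xC0 | ($cp >> 6),
            0x80 | ($cp & 0x3F)) if $cp <= 0x7FF;
    return (0xE0 | ($cp >> 12),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp <= 0xFFFF;
    return (0xF0 | ($cp >> 18),
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));
}

for my $cp (0x41, 0xF1, 0x20AC, 0x1F600) {
    my $by_hand = pack('C*', utf8_bytes($cp));
    my $by_lib  = encode('UTF-8', chr($cp));
    printf "U+%04X %s %s\n", $cp, uc(unpack('H*', $by_hand)),
        ($by_hand eq $by_lib ? 'matches Encode' : 'DIFFERS from Encode');
}
```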

Chapter 5: Perl & UTF-8

5 - Perl & UTF-8

If you want to create UTF-8 strings in your Perl code then all you have to do is use the following notation:

\x{codepoint}

For example, to create the string “niño”:

my $str = "ni\x{f1}o";
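
A quick way to check what such a string contains is to count its characters and list their code points. A minimal sketch; the variable names are illustrative.

```perl
use strict;
use warnings;

my $str = "ni\x{f1}o";

# length() counts characters, not bytes.
my $length = length($str);

# List the code point of every character in the string.
my @code_points = map { sprintf 'U+%04X', ord($_) } split //, $str;

print "$length characters: @code_points\n";
```

The string holds four characters, so the two-byte UTF-8 form of U+00F1 only appears once the string is written out through an encoding layer.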

To write this string to STDOUT you might do this:

binmode STDOUT, ":utf8";
print $str;

To undo it, do this:

binmode STDOUT;
print $str;
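
To see what the layer changes without touching the terminal, the same string can be printed to in-memory filehandles with and without ":utf8". This is a sketch; written_bytes is a hypothetical helper, and it assumes no PERL_UNICODE setting or "use open" pragma is adding default layers.

```perl
use strict;
use warnings;

my $str = "ni\x{f1}o";

# Print $text through a handle opened with the given layer and
# return the bytes that were written, as uppercase hex.
sub written_bytes {
    my ($text, $layer) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer) or die "Cannot open in-memory file: $!";
    print {$fh} $text;
    close($fh) or die "Cannot close in-memory file: $!";
    return uc(unpack('H*', $buffer));
}

my $with_layer    = written_bytes($str, ':utf8');
my $without_layer = written_bytes($str, '');

print "with :utf8:    $with_layer\n";
print "without layer: $without_layer\n";
```

With the layer, U+00F1 goes out as the two bytes C3 B1; without it, Perl writes the single byte F1, which is Latin-1 rather than UTF-8.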

Or to write UTF-8 data to disk, you could do this:

open(OFILE, ">:utf8", $filename) or die "Cannot write $filename: $!";
print OFILE $str;

To read UTF-8 data from disk, you could do this:

open(IFILE, "<:utf8", $filename) or die "Cannot read $filename: $!";
my $str = <IFILE>;
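
The write and read steps can be combined into a round trip. The sketch below is a variation on the slides rather than the talk's own code: it uses lexical filehandles, a temporary file from the core File::Temp module, and the ":encoding(UTF-8)" layer, which validates its input where ":utf8" does not.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $str = "ni\x{f1}o";

# Create a scratch file that is removed when the program exits.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close($tmp_fh);

# Write the character string; the layer encodes it as UTF-8.
open(my $out, '>:encoding(UTF-8)', $filename) or die "Cannot write $filename: $!";
print {$out} $str;
close($out) or die "Cannot close $filename: $!";

# Read it back; the layer decodes the bytes into characters again.
open(my $in, '<:encoding(UTF-8)', $filename) or die "Cannot read $filename: $!";
my $read_back = <$in>;
close($in);

my $size_on_disk = -s $filename;
printf "characters: %d, bytes on disk: %d\n", length($read_back), $size_on_disk;
```

Four characters occupy five bytes on disk because U+00F1 is stored as two bytes.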

To convert Windows-1252 to UTF-8, you could do something like this:

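use Text::Iconv;
use Encode;
# Convert the Windows-1252 bytes in $str to UTF-8 bytes.
my $utf8_str = Text::Iconv->new("WINDOWS-1252", "UTF-8")->convert($str);
# Tell Perl that those bytes are UTF-8 character data.
Encode::_utf8_on($utf8_str);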
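
The same conversion can also be sketched with the core Encode module alone. This is an alternative to the Text::Iconv approach above, not the talk's own method, and cp1252_to_utf8 is a hypothetical name. Unlike the version above, it returns plain UTF-8 bytes with the UTF-8 flag off, which is the form suited to writing to a handle that has no encoding layer.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert Windows-1252 bytes to UTF-8 bytes in two explicit steps.
sub cp1252_to_utf8 {
    my ($bytes) = @_;
    my $chars = decode('cp1252', $bytes);   # bytes -> characters
    return encode('UTF-8', $chars);         # characters -> UTF-8 bytes
}

my $utf8_bytes = cp1252_to_utf8("ni\xF1o \x80");
print uc(unpack('H*', $utf8_bytes)), "\n";
```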

Chapter 6: Making the Browser Happy

6 - Making the Browser Happy

All the efforts up to now will be for naught if the browser doesn’t understand how the page is encoded.

To make the browser aware of the encoding, either send this HTTP header…

Content-type: text/html; charset=utf-8

or if you want to tag each document…

for XML add this declaration at the top of the document:

<?xml version="1.0" encoding="utf-8" ?>

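for HTML add this declaration at the top of the <head> section of the document:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

for XHTML add this declaration at the top of the <head> section of the document:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />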
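
For a page generated by a Perl script, the HTTP header route might look like the sketch below. It assumes a CGI-style setup in which the script prints the header block followed by the body to STDOUT; http_response is a hypothetical helper, not part of any framework.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Build a CGI-style response: charset header, byte length, encoded body.
sub http_response {
    my ($html) = @_;
    my $body = encode('UTF-8', $html);             # characters -> UTF-8 bytes
    return "Content-Type: text/html; charset=utf-8\r\n"
         . 'Content-Length: ' . length($body) . "\r\n"
         . "\r\n"
         . $body;
}

my $response = http_response("<p>ni\x{f1}o</p>");

binmode STDOUT;      # the body is already encoded, so write raw bytes
print $response;
```

Encoding the body explicitly keeps the declared charset and the bytes actually sent in agreement, and makes the Content-Length a byte count rather than a character count.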

Chapter 7: Encoding Hell

7 - Encoding Hell

So now we think we understand UTF-8…

…and we think we understand how to process this data in Perl but…

there is still SO MUCH OPPORTUNITY for things to go wrong!

The Byte Order Mark (0xFEFF code point) is one of them.

The intention is probably good but it can cause much grief.

The solution is to strip the byte sequence EF BB BF from the beginning of the document.
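
A minimal Perl sketch of that step, operating on the raw bytes of the document before any decoding; strip_utf8_bom is a hypothetical name.

```perl
use strict;
use warnings;

# Remove a UTF-8 byte order mark, but only at the very start of the data.
sub strip_utf8_bom {
    my ($bytes) = @_;
    $bytes =~ s/\A\xEF\xBB\xBF//;
    return $bytes;
}

my $cleaned = strip_utf8_bom("\xEF\xBB\xBF<html>");
print "$cleaned\n";    # prints "<html>"
```

Anchoring the pattern with \A leaves the same three bytes alone if they occur later in the document.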

Encoded Gibberish.

(It takes several forms)

All Gibberish

If it’s all gibberish then maybe the data is ok but you’re looking at it using the wrong pair of glasses. Change the document encoding declaration. Or try changing your browser’s or application’s encoding setting.
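
The "wrong pair of glasses" effect is easy to reproduce: the bytes stay the same and only the decoding changes. A small sketch assuming the core Encode module; the helper name code_points is illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 bytes for "nino" with an n-tilde: 6E 69 C3 B1 6F.
my $bytes = "ni\xC3\xB1o";

my $as_utf8   = decode('UTF-8', $bytes);    # the right glasses
my $as_cp1252 = decode('cp1252', $bytes);   # the wrong glasses

# Show the code points each reading produces.
sub code_points {
    my ($str) = @_;
    return join(' ', map { sprintf 'U+%04X', ord($_) } split //, $str);
}

print 'as UTF-8:        ', code_points($as_utf8), "\n";
print 'as Windows-1252: ', code_points($as_cp1252), "\n";
```

Read as Windows-1252, the two bytes C3 B1 become the two characters U+00C3 and U+00B1, which is the familiar "Ã±" gibberish. The data itself is intact.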

Partially Gibberish

(Two Cases)

First Case: What does it look like?

Niño vs Ni?o

Niño vs Ni o

You likely have the dreaded “mixed encoding” nightmare. Probably someone has poured ISO-8859-1 or Windows-1252 into a UTF-8 document or vice-versa. You will need to figure out which bytes are which and clean the document up to make it pure UTF-8.
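
One possible clean-up heuristic, sketched below, is to walk the bytes, keep every well-formed multi-byte UTF-8 sequence as UTF-8, and treat any other high byte as Windows-1252. This is an assumption about how such a repair could be done, not the talk's own procedure; repair_mixed is a hypothetical name, and the heuristic guesses wrong when Windows-1252 text happens to contain a byte pair that is also valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One well-formed multi-byte UTF-8 sequence.
my $utf8_seq = qr/
      [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Turn a byte string of mixed UTF-8 and Windows-1252 into characters.
sub repair_mixed {
    my ($bytes) = @_;
    my $chars = '';
    while ($bytes =~ /\G(?: ($utf8_seq) | ([\x80-\xFF]) | ([\x00-\x7F]+) )/gcx) {
        $chars .= defined $1 ? decode('UTF-8', "$1")     # valid UTF-8 sequence
                : defined $2 ? decode('cp1252', "$2")    # stray 8-bit byte
                :              $3;                       # plain ASCII run
    }
    return $chars;
}

# The first word is UTF-8, the second is Windows-1252.
my $repaired = repair_mixed("ni\xC3\xB1o and ni\xF1o");
print join(' ', map { sprintf '%02X', ord($_) } split //, $repaired), "\n";
```

The result is a character string; writing it through a UTF-8 layer then produces a pure UTF-8 document.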

Second Case: What does it look like?

niÃ±o (viewed in UTF-8 mode)

niÃƒÂ±o (viewed in Windows-1252 mode)

You likely have the double encoding problem. Sometimes some of the data gets encoded as UTF-8 twice! Again, you’ll need to look at the bytes and fix it.
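
When the second encoding pass treated the UTF-8 bytes as Windows-1252, the damage can be reversed by running the steps backwards. A sketch under that assumption; undo_double_encoding is a hypothetical name, and the function should be applied only to data known to be double encoded, since it will mangle healthy text.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Reverse one extra round of "read as Windows-1252, write as UTF-8".
sub undo_double_encoding {
    my ($bytes) = @_;
    my $once  = decode('UTF-8', $bytes);    # characters such as A-tilde, plus-minus
    my $inner = encode('cp1252', $once);    # back to the original UTF-8 bytes
    return decode('UTF-8', $inner);         # the intended characters
}

# n-tilde encoded twice: C3 B1 became C3 83 C2 B1.
my $fixed = undo_double_encoding("ni\xC3\x83\xC2\xB1o");
print join(' ', map { sprintf 'U+%04X', ord($_) } split //, $fixed), "\n";
```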

Now some odds and ends…

HTML::Entities::decode_entities doesn’t always do what you think. Sometimes it returns ISO-8859-1 instead of UTF-8.

Caveat programmer!

Be careful if you’re using the encode or decode routines from Encode.pm: they may not set the string’s UTF-8 flag the way you expect.
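
The flag can be inspected directly with the built-in utf8::is_utf8 function. A small sketch of what encode and decode return for a string containing one non-ASCII character; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $chars   = "ni\x{F1}o";
my $bytes   = encode('UTF-8', $chars);    # UTF-8 octets
my $decoded = decode('UTF-8', $bytes);    # characters again

my $bytes_flag   = utf8::is_utf8($bytes)   ? 1 : 0;
my $decoded_flag = utf8::is_utf8($decoded) ? 1 : 0;

printf "encoded: length %d, flag %d\n", length($bytes),   $bytes_flag;
printf "decoded: length %d, flag %d\n", length($decoded), $decoded_flag;
```

The encoded result is a five-byte string with the flag off, while the decoded result is a four-character string with the flag on. Code that expects flagged UTF-8 after calling encode will be surprised.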

And as a checklist of sorts when you’re debugging…

When debugging, make sure that:

The data has been encoded properly.

The data has been flagged as UTF-8.

The data has been written out properly.

The document has the appropriate encoding declaration.

Your terminal or browser has been set to the correct encoding.
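
For the first two checks, a small inspection helper can save a lot of guessing. The sketch below uses a hypothetical function named describe that reports the flag, the length and the value of every element of a string, so the output is safe to print on any terminal.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Summarize a string: UTF-8 flag, length, and each value in hex.
sub describe {
    my ($str) = @_;
    return sprintf('flag=%s length=%d values=%s',
        utf8::is_utf8($str) ? 'on' : 'off',
        length($str),
        join(' ', map { sprintf '%02X', ord($_) } split //, $str));
}

my $raw     = "ni\xC3\xB1o";              # UTF-8 bytes as read from a raw handle
my $decoded = decode('UTF-8', $raw);      # the same data as characters

print describe($raw), "\n";
print describe($decoded), "\n";
```

A length that is larger than the number of visible characters, together with values such as C3 B1, suggests the data is still raw bytes and has not been decoded yet.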

Conclusion

Navigating the transition from traditional encodings to UTF-8 is not easy, but with perseverance it is doable. We have covered the common encodings, how to process data in this environment with Perl, and how to tackle the common problems that may arise.

References

http://www.utf8-chartable.de/unicode-utf8-table.pl?htmlent=1 A nice list of UTF-8 characters, their character entities, code points and representation.

http://en.wikipedia.org/wiki/UTF-8

http://en.wikipedia.org/wiki/Replacement_character#Replacement_character

http://en.wikipedia.org/wiki/Character_encoding

http://en.wikipedia.org/wiki/Byte-order_mark

http://en.wikipedia.org/wiki/Windows-1252

http://en.wikipedia.org/wiki/ISO/IEC_8859-1

http://en.wikipedia.org/wiki/ASCII

http://www.w3.org/International/O-charset

http://www.w3.org/International/O-HTTP-charset

http://www.w3.org/International/tutorials/tutorial-char-enc/

http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode

http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF

http://www.joelonsoftware.com/articles/Unicode.html

http://unicode.org/
