The 9th Bit: Encodings in Ruby 1.9 Norman Clarke

Talk on Ruby 1.9's Encoding API given at RubyConf Brasil, 2010.

Page 1: The 9th Bit: Encodings in Ruby 1.9

The 9th Bit: Encodings in Ruby 1.9

Norman Clarke

Page 2: The 9th Bit: Encodings in Ruby 1.9

Encoding API

One of the most visible changes to Ruby in 1.9

Page 3: The 9th Bit: Encodings in Ruby 1.9

invalid multibyte char (US-ASCII)

invalid byte sequence in US-ASCII/UTF8

`encode': "\xE2\x80\xA6" from UTF-8 to ISO-8859-1

Page 4: The 9th Bit: Encodings in Ruby 1.9

Today’s Topics

• Character Encodings

• Ruby’s Encoding API

• Avoiding problems with UTF-8

Page 5: The 9th Bit: Encodings in Ruby 1.9

Why should I care?

Page 6: The 9th Bit: Encodings in Ruby 1.9

Ruby 1.9.2 is much faster than 1.8.7

Page 7: The 9th Bit: Encodings in Ruby 1.9

but forces you to be aware of encodings

Page 8: The 9th Bit: Encodings in Ruby 1.9

Encodings are boring, but you can't ignore them forever

Page 9: The 9th Bit: Encodings in Ruby 1.9

flickr.com/photos/29213152@N00/2410328364/

Character Encoding

Algorithm for interpreting a sequence of bytes as characters in a written language

Page 10: The 9th Bit: Encodings in Ruby 1.9

  0 nul   1 soh   2 stx   3 etx   4 eot   5 enq   6 ack   7 bel
  8 bs    9 ht   10 nl   11 vt   12 np   13 cr   14 so   15 si
 16 dle  17 dc1  18 dc2  19 dc3  20 dc4  21 nak  22 syn  23 etb
 24 can  25 em   26 sub  27 esc  28 fs   29 gs   30 rs   31 us
 32 sp   33 !    34 "    35 #    36 $    37 %    38 &    39 '
 40 (    41 )    42 *    43 +    44 ,    45 -    46 .    47 /
 48 0    49 1    50 2    51 3    52 4    53 5    54 6    55 7
 56 8    57 9    58 :    59 ;    60 <    61 =    62 >    63 ?
 64 @    65 A    66 B    67 C    68 D    69 E    70 F    71 G
 72 H    73 I    74 J    75 K    76 L    77 M    78 N    79 O
 80 P    81 Q    82 R    83 S    84 T    85 U    86 V    87 W
 88 X    89 Y    90 Z    91 [    92 \    93 ]    94 ^    95 _
 96 `    97 a    98 b    99 c   100 d   101 e   102 f   103 g
104 h   105 i   106 j   107 k   108 l   109 m   110 n   111 o
112 p   113 q   114 r   115 s   116 t   117 u   118 v   119 w
120 x   121 y   122 z   123 {   124 |   125 }   126 ~   127 del

ASCII

Page 11: The 9th Bit: Encodings in Ruby 1.9

ASCII: 7 bits

a = 97: 0110 0001

Page 12: The 9th Bit: Encodings in Ruby 1.9

Latin1: 8 bits

ã = 227: 1110 0011
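The two bit patterns above can be reproduced in Ruby. This is a minimal sketch, assuming a UTF-8 source file; `bit_patterns` is a hypothetical helper name, not something from the talk.

```ruby
# encoding: utf-8

# Hypothetical helper: transcode +string+ to the named encoding and
# return each resulting byte as an 8-digit binary string.
def bit_patterns(string, encoding_name)
  string.encode(encoding_name).bytes.map { |byte| byte.to_s(2).rjust(8, "0") }
end

ascii_a  = bit_patterns("a", "US-ASCII")    # byte 97, top bit clear
latin1_a = bit_patterns("ã", "ISO-8859-1")  # byte 227, top bit set

# ASCII has no code for ã, so transcoding to it fails.
ascii_failure = begin
  bit_patterns("ã", "US-ASCII")
rescue Encoding::UndefinedConversionError => error
  error.class
end
```

The leading bit is the difference: every ASCII byte starts with 0, and Latin1 uses the values with a leading 1 for its extra characters.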

Page 13: The 9th Bit: Encodings in Ruby 1.9

wikipedia.org/wiki/ISO/IEC_8859-1

Page 14: The 9th Bit: Encodings in Ruby 1.9

Other 8-bit Encodings

Work for most languages

Page 16: The 9th Bit: Encodings in Ruby 1.9

fanpop.com/spots/gandalf/images/7018563/title/gandalf-vs-el-balrog-fanart

Page 17: The 9th Bit: Encodings in Ruby 1.9

8-bit overlap

202

Ê Ę Ъ ت Κ ส Ź
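A sketch of the overlap: the same byte, 202, relabelled with seven different 8-bit encodings. The slide does not name the encodings, so the list below is an assumption about which tables produce each letter.

```ruby
# encoding: utf-8

# Assumed list of encodings; the slide shows only the resulting letters.
encoding_names = %w[ISO-8859-1 ISO-8859-2 ISO-8859-5 ISO-8859-6
                    ISO-8859-7 ISO-8859-11 ISO-8859-13]

byte = [202].pack("C")  # a one-byte binary string

# Relabel the byte with each encoding, then transcode to UTF-8 for display.
letters = encoding_names.map do |name|
  byte.dup.force_encoding(name).encode("UTF-8")
end

puts letters.join(" ")
```

Nothing in the byte itself says which table to use, which is why text in an 8-bit encoding is unreadable without knowing the encoding it was written in.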

Page 18: The 9th Bit: Encodings in Ruby 1.9

Unicode: An Improbable Success

¡cn:中文!

Page 19: The 9th Bit: Encodings in Ruby 1.9

Used internally by Perl, Java, Python 3, Haskell and others

Page 20: The 9th Bit: Encodings in Ruby 1.9

Unicode in Japan: not as popular

Page 21: The 9th Bit: Encodings in Ruby 1.9

Ruby 1.9: Character Set Independence

Page 22: The 9th Bit: Encodings in Ruby 1.9
Page 23: The 9th Bit: Encodings in Ruby 1.9

Ruby’s Encoding API

• Source code

• String

• Regexp

• IO

• Encoding

Page 24: The 9th Bit: Encodings in Ruby 1.9

# coding: utf-8
class Canção
  GÊNEROS = [:forró, :carimbó, :afoxé]
  attr_accessor :gênero
end

asa_branca = Canção.new
asa_branca.gênero = :forró
p asa_branca.gênero

Source

Page 25: The 9th Bit: Encodings in Ruby 1.9

Warnings

• Breaks syntax highlighting

• #inspect, #p don’t work as of 1.9.2

• Some editors/programmers will probably mess up your code

• Just because you can, doesn’t mean you should

Page 26: The 9th Bit: Encodings in Ruby 1.9

# encoding: utf-8
string = "ã"
string.length      #=> 1
string.bytesize    #=> 2
string.bytes.to_a  #=> [195, 163]
string.encode! "ISO-8859-1"
string.length      #=> 1
string.bytesize    #=> 1
string.bytes.to_a  #=> [227]

String

Page 30: The 9th Bit: Encodings in Ruby 1.9

puts a1                         # prints "ã"
puts a2                         # prints "ã"
a1.encoding                     #=> "ASCII-8BIT"
a2.encoding                     #=> "UTF-8"
a1.bytes.to_a == a2.bytes.to_a  #=> true
a1 == a2                        #=> false

String
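The slide does not show how `a1` and `a2` were created. One way to get the same situation, as a sketch that assumes a UTF-8 source file, is to relabel a copy of the string with `force_encoding`:

```ruby
# encoding: utf-8

a2 = "ã"                                  # UTF-8, taken from the source encoding
a1 = a2.dup.force_encoding("ASCII-8BIT")  # same bytes, labelled as binary

same_bytes = a1.bytes.to_a == a2.bytes.to_a
equal      = a1 == a2

# Relabelling the binary copy as UTF-8 makes the two strings equal again.
equal_after_relabel = a1.dup.force_encoding("UTF-8") == a2
```

Here `same_bytes` is true while `equal` is false: when strings contain non-ASCII bytes, Ruby compares the encoding label as well as the bytes.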

Page 31: The 9th Bit: Encodings in Ruby 1.9

# vim: set fileencoding=utf-8
pat = /ã/
pat.encoding                   #=> "UTF-8"
pat.encode! "ISO-8859-1"       #=> FAIL
pat = "ã".encode "ISO-8859-1"
regexp = Regexp.new(pat)       #=> OK

Regexp
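A runnable sketch of the same idea, assuming a UTF-8 source file. It also shows why the pattern's encoding matters: a UTF-8 pattern with non-ASCII characters cannot be matched against a Latin1 string.

```ruby
# encoding: utf-8

utf8_pattern  = /ã/
can_transcode = utf8_pattern.respond_to?(:encode!)  # Regexp has no encode!

# Build the pattern from a string that is already in the target encoding.
latin1_text    = "ação".encode("ISO-8859-1")
latin1_pattern = Regexp.new("ã".encode("ISO-8859-1"))

match_index = latin1_pattern =~ latin1_text

# Mixing encodings raises instead of silently failing to match.
mismatch = begin
  utf8_pattern =~ latin1_text
rescue Encoding::CompatibilityError => error
  error.class
end
```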

Page 34: The 9th Bit: Encodings in Ruby 1.9

f = File.open("file.txt", "r:ISO-8859-1")
data = f.read
data.encoding  #=> "ISO-8859-1"

IO

Page 35: The 9th Bit: Encodings in Ruby 1.9

f = File.open("file.txt", "rb:UTF-16BE:UTF-8")
data = f.read
data.encoding  #=> "UTF-8"

IO

Page 36: The 9th Bit: Encodings in Ruby 1.9

f = File.open("file.txt", "r:BINARY")  # (or "rb")
data = f.read
data.encoding  #=> "ASCII-8BIT"
data.force_encoding "UTF-8"

IO
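The three ways of opening a file can be compared side by side. This sketch writes its own small Latin1 file into a temporary directory (the file contents and variable names are assumptions made for the example) and reads it back with each mode.

```ruby
# encoding: utf-8
require "tmpdir"

tagged, transcoded, raw = Dir.mktmpdir do |dir|
  path = File.join(dir, "file.txt")
  # "joão" as four Latin1 bytes; 227 is ã.
  File.open(path, "wb") { |f| f.write([106, 111, 227, 111].pack("C*")) }

  [
    File.open(path, "r:ISO-8859-1")       { |f| f.read },  # label only
    File.open(path, "r:ISO-8859-1:UTF-8") { |f| f.read },  # label, then transcode
    File.open(path, "rb")                 { |f| f.read }   # raw bytes
  ]
end

# force_encoding only changes the label; it never checks the bytes.
valid_as_utf8   = raw.dup.force_encoding("UTF-8").valid_encoding?
valid_as_latin1 = raw.dup.force_encoding("ISO-8859-1").valid_encoding?
```

Since `force_encoding` accepts any label, checking `valid_encoding?` afterwards is a cheap way to catch a wrong guess. Here the Latin1 bytes are not valid UTF-8.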

Page 39: The 9th Bit: Encodings in Ruby 1.9

Encoding.list.size  #=> 95
Encoding.default_external = "ISO-8859-1"
Encoding.default_internal = "UTF-8"
File.open("latin1.txt", "r") do |file|
  p file.external_encoding  #=> ISO-8859-1
  data = file.read
  p data.encoding           #=> UTF-8
end

Encoding
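The defaults are process-wide, so changing them affects every file opened afterwards. A sketch that applies them only for the duration of a block; `with_default_encodings` is a hypothetical helper, and the file is created by the example itself.

```ruby
# encoding: utf-8
require "tmpdir"

# Hypothetical helper: set both defaults, run the block, then restore them.
def with_default_encodings(external, internal)
  saved = [Encoding.default_external, Encoding.default_internal]
  Encoding.default_external = external
  Encoding.default_internal = internal
  yield
ensure
  Encoding.default_external, Encoding.default_internal = saved
end

external, data = Dir.mktmpdir do |dir|
  path = File.join(dir, "latin1.txt")
  File.open(path, "wb") { |f| f.write([227].pack("C")) }  # ã in Latin1

  with_default_encodings("ISO-8859-1", "UTF-8") do
    # No encoding in the mode string: the defaults apply.
    File.open(path, "r") { |file| [file.external_encoding, file.read] }
  end
end
```

The file is read as ISO-8859-1 and the string handed back is UTF-8, matching the slide, without leaving the changed defaults in place for the rest of the program.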

Page 43: The 9th Bit: Encodings in Ruby 1.9

UTF-8: a Unicode Encoding

Unicode, UTF-8, UTF-16, UTF-32, UCS-2, etc.

Page 44: The 9th Bit: Encodings in Ruby 1.9

UTF-8

Backwards-compatible with ASCII
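A quick check of what backwards compatibility means in practice, assuming a UTF-8 source file: ASCII-only text has exactly the same bytes in both encodings, and everything outside ASCII is built from bytes above 127.

```ruby
# encoding: utf-8

plain    = "Ruby 1.9"
accented = "João"

# Pure ASCII text is byte-for-byte identical in US-ASCII and UTF-8.
identical = plain.encode("US-ASCII").bytes.to_a == plain.encode("UTF-8").bytes.to_a

plain_is_ascii    = plain.ascii_only?
accented_is_ascii = accented.ascii_only?

# The non-ASCII character is the only source of bytes above 127.
high_bytes = accented.bytes.to_a.select { |byte| byte > 127 }
```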

Page 45: The 9th Bit: Encodings in Ruby 1.9

Use UTF-8 unless you have a good reason not to

Page 46: The 9th Bit: Encodings in Ruby 1.9

UTF-8 and HTML

<meta http-equiv="content-type" content="text/html;charset=UTF-8" />

Page 47: The 9th Bit: Encodings in Ruby 1.9

UTF-8 and HTML

日本語

Page 49: The 9th Bit: Encodings in Ruby 1.9

UTF-8 and HTML

<form action="/" accept-charset="UTF-8">

Page 50: The 9th Bit: Encodings in Ruby 1.9

UTF-8 and HTML

f.html?l=日本語

Page 51: The 9th Bit: Encodings in Ruby 1.9

UTF-8 and HTML

f.html?l=%26%2326085%3B%26%2326412%3B

%26%2335
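To see what that query string actually contains, it can be decoded in two steps. This sketch uses only the complete part of the value shown above; the variable names are assumptions.

```ruby
require "uri"

query_value = "%26%2326085%3B%26%2326412%3B"

# Step 1: undo the percent-encoding. What is left is not text but
# HTML numeric character references.
unescaped = URI.decode_www_form_component(query_value)

# Step 2: turn each reference back into the character it stands for.
text = unescaped.gsub(/&#(\d+);/) { [$1.to_i].pack("U") }
```

After step 1 the value is `&#26085;&#26412;`, and after step 2 it is the first two characters of 日本語. The server receives HTML escapes rather than the characters themselves, so it has to undo both layers.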

Page 52: The 9th Bit: Encodings in Ruby 1.9

...here's where things get kind of strange.

Page 53: The 9th Bit: Encodings in Ruby 1.9

"JOÃO".downcase  #=> "joÃo"
"joão".upcase    #=> "JOãO"

Case Folding
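These results come from case mapping that only knows the 26 ASCII letters. A sketch of that behaviour using `String#tr`; the helper names are hypothetical, and this illustrates the effect rather than how Ruby implements `downcase`. Ruby 2.4 and later map non-ASCII letters in `downcase` and `upcase` by default, so calling those methods directly gives different results there.

```ruby
# encoding: utf-8

# Hypothetical helpers: change case for A-Z / a-z only, leave the rest alone.
def ascii_downcase(string)
  string.tr("A-Z", "a-z")
end

def ascii_upcase(string)
  string.tr("a-z", "A-Z")
end

lower = ascii_downcase("JOÃO")  # Ã is not in A-Z, so it stays upper case
upper = ascii_upcase("joão")    # ã is not in a-z, so it stays lower case
```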

Page 54: The 9th Bit: Encodings in Ruby 1.9

# Unicode
Unicode.downcase("JOÃO")

# Active Support
"JOÃO".mb_chars.downcase

Case Folding

Page 55: The 9th Bit: Encodings in Ruby 1.9

# NOT always true
"João" == "João"

Equivalence

Page 56: The 9th Bit: Encodings in Ruby 1.9

"ã" or "a" + "~"

Two ways to represent many characters

Page 57: The 9th Bit: Encodings in Ruby 1.9

Composed

a = Unicode.normalize_C("ã")
a.bytes.to_a  #=> [195, 163]

Page 58: The 9th Bit: Encodings in Ruby 1.9

Decomposed

a = Unicode.normalize_D("ã")
a.bytes.to_a  #=> [97, 204, 131]

Page 59: The 9th Bit: Encodings in Ruby 1.9

Why?

dec = Unicode.normalize_D("ã")
dec =~ /a/   # match

comp = Unicode.normalize_C("ã")
comp =~ /a/  # no match
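The same comparison can be run without the Unicode gem by writing both forms with escapes. A sketch; the escapes spell out the two code point sequences explicitly.

```ruby
# encoding: utf-8

composed   = "\u00E3"   # ã as a single code point
decomposed = "a\u0303"  # a followed by COMBINING TILDE

composed_bytes   = composed.bytes.to_a
decomposed_bytes = decomposed.bytes.to_a

same = composed == decomposed  # they render alike but are different strings

decomposed_match = decomposed =~ /a/  # the plain "a" is really there
composed_match   = composed =~ /a/    # no "a" anywhere in this string
```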

Page 60: The 9th Bit: Encodings in Ruby 1.9

Normalize string keys!!!

Page 61: The 9th Bit: Encodings in Ruby 1.9

{ "João" => "authorized", "João" => "not authorized"}

You have been warned
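A sketch of key normalization using `String#unicode_normalize`, which is built into Ruby 2.2 and later and is therefore not available on the 1.9 series the talk covers; on 1.9 the Unicode gem's `normalize_C` plays the same role. `normalize_key` is a hypothetical helper name.

```ruby
# encoding: utf-8

# Hypothetical helper: bring every key to composed form (NFC) before use.
def normalize_key(key)
  key.unicode_normalize(:nfc)
end

composed   = "Jo\u00E3o"   # ã as one code point
decomposed = "Joa\u0303o"  # a + combining tilde

# Without normalization these are two distinct keys.
raw_size = { composed => "authorized", decomposed => "not authorized" }.size

permissions   = { normalize_key(composed) => "authorized" }
naive_lookup  = permissions[decomposed]                 # misses
normal_lookup = permissions[normalize_key(decomposed)]  # hits
```

Normalizing on the way in and on the way out is what matters. Which form is chosen matters less than using it consistently.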

Page 62: The 9th Bit: Encodings in Ruby 1.9

Some libraries

• Unicode

• Active Support

• Java’s stdlib

Page 63: The 9th Bit: Encodings in Ruby 1.9

Cleaning up bad data: avoid Iconv

Page 64: The 9th Bit: Encodings in Ruby 1.9

require "active_support"
require "active_support/multibyte/unicode"

include ActiveSupport::Multibyte
Unicode.tidy_bytes(@bad_string)

Tidy Bytes
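Where Active Support is not available, a much cruder cleanup can be sketched with the standard library alone. This is not the tidy_bytes algorithm: it decides per string rather than per byte, and it assumes that anything that is not valid UTF-8 is Latin1. `utf8_or_latin1` is a hypothetical name.

```ruby
# encoding: utf-8

# Hypothetical stand-in for a cleanup step: keep the bytes if they are
# already valid UTF-8, otherwise treat them as ISO-8859-1 and transcode.
def utf8_or_latin1(bytes)
  candidate = bytes.dup.force_encoding("UTF-8")
  return candidate if candidate.valid_encoding?

  bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

latin1_bytes = [106, 111, 227, 111].pack("C*")       # "joão" in Latin1
utf8_bytes   = [106, 111, 195, 163, 111].pack("C*")  # "joão" in UTF-8

from_latin1 = utf8_or_latin1(latin1_bytes)
from_utf8   = utf8_or_latin1(utf8_bytes)
```

Both inputs come out as the same valid UTF-8 string. Strings that mix the two encodings need the byte-by-byte approach.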

Page 65: The 9th Bit: Encodings in Ruby 1.9

MySQL

Set encoding options early

Page 66: The 9th Bit: Encodings in Ruby 1.9

# 1: decompose
@s = Unicode.normalize_D(@s)

# 2: delete accent marks
@s.gsub!(/[^\x00-\x7F]/, '')

# 3: FAIL

Approximating ASCII: "João" => "joao"

Page 67: The 9th Bit: Encodings in Ruby 1.9

OK

ã á ê ü à ç
a a e u a c

Page 68: The 9th Bit: Encodings in Ruby 1.9

FAIL

ß ø œ æ
"" "" "" ""
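Both outcomes can be reproduced with the decompose-and-strip steps. This sketch swaps in `String#unicode_normalize` (Ruby 2.2 and later) for the Unicode gem's `normalize_D`; `strip_accents` is a hypothetical helper name.

```ruby
# encoding: utf-8

# Decompose, then delete every non-ASCII character.
def strip_accents(string)
  string.unicode_normalize(:nfd).gsub(/[^\x00-\x7F]/, "")
end

# Letters that decompose into an ASCII base plus a combining mark survive.
ok = %w[ã á ê ü à ç].map { |char| strip_accents(char) }

# Letters with no decomposition are deleted outright.
failed = %w[ß ø œ æ].map { |char| strip_accents(char) }

joao = strip_accents("João").downcase
```

The failure is silent: the characters vanish instead of raising an error, so a name like "Søren" loses a letter without any warning.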

Page 69: The 9th Bit: Encodings in Ruby 1.9

Use instead:

• Active Support’s Inflector.transliterate

• I18n.transliterate

• Babosa

Page 70: The 9th Bit: Encodings in Ruby 1.9

To Sum Up...

Page 71: The 9th Bit: Encodings in Ruby 1.9

Ruby is weird

Page 72: The 9th Bit: Encodings in Ruby 1.9

Use UTF-8

Page 73: The 9th Bit: Encodings in Ruby 1.9

Normalize UTF-8 keys

Page 74: The 9th Bit: Encodings in Ruby 1.9

Configure MySQL properly for UTF-8

Page 75: The 9th Bit: Encodings in Ruby 1.9

THANKS!

github.com/norman/enc

@compay
