Anders Karlsson: What character is THAT?



Character sets and collations are an important part of the database setup. In this presentation I show you the history of character sets and how they are used today, how UTF-8 works and how MySQL handles all this.


Page 1: What character is that

Anders Karlsson

What character is THAT?

Page 2: What character is that

Agenda

• About Anders Karlsson

• Part 1 - The gruesome background

• The history of character sets and collations

• The “classic” 7 and 8 bit ASCII character sets

• Part 2 – UNICODE Rocks!

• What is UNICODE and encodings

• Why UTF-8 is smart. Or not so smart

• Part 3 - MySQL and UNICODE

• Questions? Answers?

Page 3: What character is that

About Anders Karlsson

• Senior Sales Engineer at SkySQL

• Former Database Architect at Recorded Future, Sales Engineer and Consultant with Oracle, Informix, TimesTen, MySQL / Sun / Oracle etc.

• Has been in the RDBMS business for 20+ years

• Has also worked as Tech Support engineer, Porting Engineer and in many other roles

• Outside SkySQL I build websites (www.papablues.com), develop Open Source software (MyQuery, mycleaner etc), am a keen photographer, have an affection for English Real Ales and a great interest in computer history

10/04/2023 SkySQL Ab 2011 Confidential 3

Page 4: What character is that

Part 1 – The history which we are not to ignore (but which has already been ignored several times)

Page 5: What character is that

The history of Characters Sets and collations

• At first there were no characters, only numbers

• Then on the 7th day we realized characters and words were a good thing, but that computers can only handle numbers, so we needed a way of representing characters as numbers

• So we got different mappings from characters to numbers: ASCII, EBCDIC, FIELDATA, Baudot etc, in different variations (in particular EBCDIC)

Page 6: What character is that

ASCII – The mother of character sets

• For anyone not being a masochist (i.e. anyone not using a mainframe), the character set of choice soon became 7-bit ASCII (American Standard Code for Information Interchange), first published in 1963

• 7 bits were enough for US English characters and control characters, with some legroom (note that ASCII is US English, not UK English, centric)

• The 8th bit was used for parity in transmission

Page 7: What character is that

All ASCII hell breaks loose

• As the original 7-bit US ASCII didn’t support anything but US English, variations started to appear.

• Any decent computer supported 8-bit characters, but the assumption was still that bit 8 was a parity bit.

• So 7-bit local variations were developed, Swedish 7-bit ASCII for example (anyone coding in C knows and hates this)

Page 8: What character is that

And then we get 8-bit ASCII hell!

• Extended 8-bit ASCII solves a few problems, but also introduces a few new ones. Most of the new problems came from attempts to make 8-bit Extended ASCII compatible with 7-bit ASCII variations

• The Extended 8-bit “ASCII” character sets are largely standardized as ISO 8859 (with variations). Most common is ISO 8859-1 (latin-1)

• 8859-15 is a not so popular 8859-1 update, including a Euro sign among a few other things. Whether the Euro sign really is a useful addition is yet to be determined

• Another 8859-1 variation is Windows CP1252, which is an enhanced 8859-1 character set

Page 9: What character is that

Oh, then we have collations!

• A “collation” determines how characters in a character set are to be sorted!

• 7-bit ASCII was great (numeric order same as character order)

– Or was it? Really? Upper / Lower case?

• 7-bit localized ASCII was not so great. To say the least. Swedish 7-BIT ASCII was not correctly sorted (å last in the alphabet, after ä and ö)

• 8-bit Extended ASCII didn’t help much (Swedish again being in the wrong order, but not the same wrong order as with 7-bit “Swedish ASCII”)

Page 10: What character is that

Collation basics

• Don’t ever think that the character set determines the sorting!

– The same character set used in different countries may be sorted differently

– Different sorting models may be used in the same country (A good example is case sensitivity)

• Also, collations are not only about sorting, they are also about comparisons and a few other things
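To make the point concrete, here is a minimal Python sketch (not from the original slides) that sorts the same words twice: once by raw code point, which is what a purely binary ordering gives you, and once with a toy Swedish ordering. The `SWEDISH_ALPHABET` string and the `swedish_key` helper are made up for this illustration and only handle the letters listed; real collations are far more elaborate.

```python
# Toy illustration: the character set does not decide the sort order.
# SWEDISH_ALPHABET and swedish_key are hypothetical helpers for this sketch.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"


def swedish_key(word):
    """Map each letter to its position in the Swedish alphabet, ignoring case."""
    return [SWEDISH_ALPHABET.index(ch) for ch in word.lower()]


words = ["öl", "Zebra", "ära", "apa", "åka"]

# Plain sorted() compares code points: upper case first, and ä before å.
binary_order = sorted(words)

# The toy Swedish key ignores case and puts å before ä before ö.
swedish_order = sorted(words, key=swedish_key)

print(binary_order)   # ['Zebra', 'apa', 'ära', 'åka', 'öl']
print(swedish_order)  # ['apa', 'Zebra', 'åka', 'ära', 'öl']
```

Same characters, same character set, two different orders: the only thing that changed is the collation rule.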

Page 11: What character is that

Interoperating with ASCII

• As long as we were all using 1 single computer or a bunch of similar computers in a LAN, the issues were limited

• As usual, the Internet turned this beautiful environment into something truly evil!

• Internet got started in the US

– Which means, again, that the founders were convinced that 7-bit ASCII would be OK. That this had been an incorrect assumption 30 years before Facebook came around made no difference. Of course not.

Page 12: What character is that

Interoperability necessities

• For us to be able to communicate we need to be able to tell what character set we expect here at the client side, the server has to tell what it delivers, and then we need a way to align all this.

• The trick: <meta http-equiv=Content-Type content="text/html; charset=iso-8859-1">

– Or maybe not? This tells what I get, but doesn’t allow me to say what I want!

• Actually, this didn’t help as much as we hoped

Page 13: What character is that

Part 1 Conclusion

• The many different local variations of characters served us well, for a while

• Now we have a global IT environment with many different character sets and collations, and we can’t deal with multiple local versions anymore

• And we have languages whose character set will not fit in 8 bits anyway

• And then we need to sort and compare all this!

Page 14: What character is that

Part 2 – UNICODE and Ken Thompson save the world, without Batman and not by tracking down the Penguin

Page 15: What character is that

UNICODE – One Character set for all

• Yes, that is what UNICODE (or ISO/IEC 10646) sets out to do – A common character set for ALL languages (close to 240,000 characters are defined in UNICODE 4.1 today, MySQL is somewhat at UNICODE 3.0). Sort of.

• This means that UNICODE has character codes that cannot fit in 1 byte. This is a big surprise to anyone on the other side of the pond, but there you go

• But there is a remedy: UNICODE Encodings!

Page 16: What character is that

UNICODE Encodings

• A UNICODE encoding is a standardized way of representing a character in the UNICODE character set

• Some UNICODE encodings (UCS-2, for example) represent only select parts of the full UNICODE character set

• UNICODE encodings are part of the UNICODE standard itself (and this is a VERY good thing! If this wasn’t the case, both Apple and Microsoft would have invented their own encodings I’m sure)

Page 17: What character is that

UNICODE Encodings

• Among the UNICODE encodings are

– UCS-2 – 2 bytes wide (i.e. only 64k different characters can be represented)

– UTF-16 – 2 or 4 bytes wide. This is then a variable length scheme with a very complex setup. When only 2 bytes are used, they are the same as UCS-2

– UTF-32 – 4 bytes fixed size

• To be honest, besides UTF-16 / UCS-2 that is common in Windows and related frameworks (like COM), none of these are very popular
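As a rough illustration (not part of the original slides), the Python sketch below encodes a few sample characters and counts the bytes each encoding needs. The `encoded_sizes` helper and the sample characters are choices made for this example. Python has no separate UCS-2 codec, so UTF-16 stands in for it here; for characters inside the 64k range the two give the same 2 bytes. The `-be` codec variants are used so that no byte order mark is added to the count.

```python
# Compare how many bytes each encoding needs for a single character.
# encoded_sizes is a hypothetical helper written for this sketch.
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")


def encoded_sizes(ch):
    """Return a dict of encoding name -> number of bytes used for ch."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


# A (ASCII), å (Latin-1 range), € (elsewhere in the 64k range),
# and U+1D11E (a musical symbol beyond the 64k range).
samples = ["A", "å", "€", "\U0001D11E"]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

UTF-32 always reports 4 bytes, UTF-16 reports 2 until the character falls outside the 64k range, and UTF-8 grows from 1 to 4 bytes.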

Page 18: What character is that

UTF-8 – Some smart dudes at work!

• The problem that UNICODE has is that it has to represent all those characters. This should break some applications for sure.

• Well, Encodings solve that too, and the mother of all encodings is UTF-8. Invented not by Albert Einstein or Batman but by Ken Thompson!

• Let’s now have a round of applause for Ken Thompson!

Page 19: What character is that

The details of UTF-8

• UNICODE characters 0 – 127 are the same as in standard 7-bit ASCII (remember that?)

• UTF-8 works the same: For characters 0 – 127, the most significant (first) bit of the first (and only) byte is 0

• Beyond 7-bit ASCII characters, the number of “leading” 1’s in the first byte tells how many bytes make up the character

• All other bytes start with a 1 and a 0

• And the rest of the bits make up the character

Page 20: What character is that

The details of UTF-8

• So in the first byte, it is one of two things:

– A leading 0 meaning a single byte character

– A number of 1’s (at least 2, as 1 byte characters are indicated by a leading 0) followed by a 0

• This means that the first byte in a character NEVER starts with the sequence 10

– All other bytes start with 10

– 1 UTF-8 byte can contain up to 7 bits of data

– 2 UTF-8 bytes contain from 8 to 11 bits of data

– 3 UTF-8 bytes contain from 12 to 16 bits of data

– 4 UTF-8 bytes contain from 17 to 21 bits of data
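The bit layout described above can be sketched as a small hand-written encoder. This is an illustration only, not code from the slides; the function name `utf8_encode` is made up here, and in practice Python's built-in `str.encode("utf-8")` does the same job. The sketch is handy for seeing where the 7, 11, 16 and 21 bit limits come from.

```python
def utf8_encode(code_point):
    """Encode one UNICODE code point as UTF-8 bytes (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of UNICODE range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")

    if code_point < 0x80:
        # 0xxxxxxx: 7 bits of data in a single byte
        return bytes([code_point])
    if code_point < 0x800:
        # 110xxxxx 10xxxxxx: 5 + 6 = 11 bits of data
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 = 16 bits of data
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 3 + 6 + 6 + 6 = 21 bits of data
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Cross-check the sketch against Python's own encoder.
for ch in ["A", "å", "€", "\U0001D11E"]:
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")

print(utf8_encode(0x20AC).hex())  # e282ac, the Euro sign
```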

Page 21: What character is that

Some useful aspects of UTF-8

• You can always find the leading byte of a character in a word, starting from any byte

– Just move “backward” until a byte not having a leading 10 is found

• Byte values 0 – 127 are ONLY present as character values 0 – 127, nowhere else!

– All other byte values have the highest bit set

– So strlen(), strcmp() etc. still work, but on a byte by byte, not character by character, level
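A minimal Python sketch of the "move backward" trick, assuming the input is valid UTF-8. The helper name `char_start` is invented for this example. It also shows the byte count versus character count difference that strlen()-style functions run into.

```python
def char_start(data, index):
    """Return the index of the leading byte of the character covering data[index].

    Continuation bytes look like 10xxxxxx, so we step backward while the
    top two bits are exactly 10. Assumes data is valid UTF-8.
    """
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


text = "aå€"
data = text.encode("utf-8")   # 1 + 2 + 3 = 6 bytes

print(len(text))              # 3 characters
print(len(data))              # 6 bytes, which is what a byte-level strlen() sees
print(char_start(data, 2))    # 1: byte 2 is the second byte of å
print(char_start(data, 5))    # 3: byte 5 is the last byte of €
```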

Page 22: What character is that

So, are we all OK with UTF-8 now?

• Let’s see. Using UTF-8 we can represent binary values with up to 21 bits, which is 2,097,152 characters! Which is more than enough! (But 640K RAM was ALSO more than enough)

• If we limit ourselves to 3 bytes UTF-8 we can represent 65,536 different characters, the same as if we use UCS-2 (which is a fixed 2-byte format). 65,536 characters is what is in the UNICODE Basic Multilingual Plane

Page 23: What character is that

Why we actually need 4-bytes UTF-8

• Beyond the BMP come a couple of other “planes”. The one that causes most issues is the one that adds a bunch of Chinese, Japanese and Korean characters

• For these, we need to go beyond the BMP and hence beyond the nice and cosy 65,536 characters. Duh!

• And this is why the MySQL assumption that UTF-8 means a maximum of 3 bytes might not be such a good idea after all
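A short Python sketch (an illustration, not from the slides) of what "beyond the BMP" means in bytes. The helper `fits_in_three_byte_utf8` is a made-up name; it simply checks whether every character has a code point of at most 0xFFFF, which is the most a 3-byte UTF-8 sequence can hold. The sample character U+20000 is the first ideograph in the CJK Extension B block.

```python
BMP_MAX = 0xFFFF  # highest code point in the Basic Multilingual Plane


def fits_in_three_byte_utf8(text):
    """True if every character can be stored in at most 3 UTF-8 bytes."""
    return all(ord(ch) <= BMP_MAX for ch in text)


euro = "€"                      # U+20AC, inside the BMP
cjk_extension_b = "\U00020000"  # U+20000, outside the BMP

print(len(euro.encode("utf-8")))             # 3
print(len(cjk_extension_b.encode("utf-8")))  # 4
print(fits_in_three_byte_utf8(euro))             # True
print(fits_in_three_byte_utf8(cjk_extension_b))  # False
```

A check like this is one way to spot, before inserting, data that a 3-byte-only UTF-8 column would not be able to hold.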

Page 24: What character is that

Part 3 – MySQL and UNICODE

Page 25: What character is that

So, how does MySQL handle all this?

• MySQL supports a whole range of UNICODE Encodings and collations! Good!

• MySQL understands the case when we have one character set stored in a column in a table and another one on the client side, and nicely does a conversion for us! Good!

• Not all UNICODE Encodings are valid on the Client side! Not so good

• Actually, anything beyond UTF-8, when it comes to UNICODE on the client side, is troublesome

Page 26: What character is that

Lessons in MySQL and UNICODE

• Lesson 1: Learn about UNICODE and understand how it works

• Lesson 2: Stick with UTF-8! Most others do that too, including Java, JavaScript, JSON, the web and many, many others!

• Lesson 3: UCS-2 may seem like a good idea, it is fixed length after all. It’s not (a good idea that is, fixed length it is)

• Lesson 4: Don’t forget about collations! They are important!

Page 27: What character is that

Collations – The Sequel

• Collations determine how strings are sorted

– Order by

– Indexes

– WHERE col1 > ‘Über’

• Collations determine how strings are compared

– Is A = Ä or not? Y = Ü?

• Watch in particular for COLLATIONS used for PRIMARY KEYs
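Why primary keys deserve attention can be shown with a toy model in Python. This is a sketch under loud assumptions: `toy_ci_ai_key` is a made-up stand-in for a case-insensitive, accent-insensitive collation and is NOT how MySQL implements any of its collations (it will not reproduce language-specific rules such as Y = Ü). It only shows the principle that two different strings can compare as equal and therefore collide in a unique key.

```python
import unicodedata


def toy_ci_ai_key(text):
    """Hypothetical comparison key: strip accents, then ignore case."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def find_key_collisions(values):
    """Return (first_seen, duplicate) pairs that compare equal under the toy key."""
    seen = {}
    collisions = []
    for value in values:
        key = toy_ci_ai_key(value)
        if key in seen:
            collisions.append((seen[key], value))
        else:
            seen[key] = value
    return collisions


print(toy_ci_ai_key("A") == toy_ci_ai_key("Ä"))   # True under this toy rule
print(find_key_collisions(["Uber", "Über", "uber", "Yber"]))
# [('Uber', 'Über'), ('Uber', 'uber')]
```

Under a binary comparison all four strings would be distinct; under the toy rule two of them would be rejected as duplicates.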

Page 28: What character is that

Storing UTF-8 data in MySQL

• Most Storage Engines are happy to use utf-8

• The MySQL Interpretation of UTF-8 is 1 – 3 bytes, or 65,536 different characters!

– This means that a CHAR(10) column requires 30 bytes fixed space, and a VARCHAR(10) column is limited to 30 bytes

• MySQL 5.5 and up also supports 4-byte UTF-8 by using the character set utf8mb4

Page 29: What character is that

Storing UTF-8 data in MySQL

• VARCHAR columns are actually fixed in some Storage Engines, most notably those engines developed sometime around the time of the American Civil War, when variable length data was still in its infancy

• UTF-8 can potentially waste A LOT of space

• Extra space for UTF-8 also affects byte size limits, such as VARCHAR and INDEX sizes

• UTF-8 data sorting is way more complex than a simple binary sort (so in some ways, things were better in the old 7-bit ASCII days)
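The space arithmetic is simple enough to sketch in Python. The helper names are invented for this example, and the 767-byte figure used below is an illustrative assumption (it matches a well-known index prefix limit in older InnoDB row formats, but check the limits of your own version and storage engine).

```python
# Worst-case bytes per character for a few MySQL character sets.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}


def max_column_bytes(char_length, charset):
    """Worst-case byte size of the character data in a CHAR/VARCHAR(char_length)."""
    return char_length * MAX_BYTES_PER_CHAR[charset]


def max_chars_within(byte_limit, charset):
    """How many characters are guaranteed to fit within a byte limit."""
    return byte_limit // MAX_BYTES_PER_CHAR[charset]


print(max_column_bytes(10, "utf8"))      # 30, the CHAR(10) case from the slide
print(max_column_bytes(10, "utf8mb4"))   # 40

ASSUMED_INDEX_LIMIT = 767  # illustrative byte limit, see the note above
for charset in MAX_BYTES_PER_CHAR:
    print(charset, max_chars_within(ASSUMED_INDEX_LIMIT, charset))
```

The same byte limit that allows 767 latin1 characters only guarantees room for 255 characters in 3-byte UTF-8 and 191 in utf8mb4.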

Page 30: What character is that

Some simple demos, Questions and Answers. And I haven’t even begun to talk about byte ordering and byte order marks.

Page 31: What character is that

THANK YOU!

Anders Karlsson

karlssonondatabases.blogspot.com