What character is that

Anders Karlsson

What character is THAT?

Agenda

• About Anders Karlsson

• Part 1 - The gruesome background

– The history of character sets and collations

– The “classic” 7 and 8 bit ASCII character sets

• Part 2 – UNICODE Rocks!

– What is UNICODE and encodings

– Why UTF-8 is smart. Or not so smart

• Part 3 - MySQL and UNICODE

• Questions? Answers?

About Anders Karlsson

• Senior Sales Engineer at SkySQL

• Former Database Architect at Recorded Future, Sales Engineer and Consultant with Oracle, Informix, TimesTen, MySQL / Sun / Oracle etc.

• Has been in the RDBMS business for 20+ years

• Has also worked as Tech Support engineer, Porting Engineer and in many other roles

• Outside SkySQL I build websites (www.papablues.com), develop Open Source software (MyQuery, mycleaner etc), am a keen photographer, have an affection for English Real Ales and a great interest in computer history

10/04/2023 SkySQL Ab 2011 Confidential 3

Part 1 – The history which we are not to ignore (but which has already been ignored several times)

The history of Character Sets and collations

• At first there were no characters, only numbers

• Then on the 7th day we realized characters and words were a good thing, but that computers can only handle numbers, so we needed a way of representing characters as numbers

• So we got different mappings from characters to numbers: ASCII, EBCDIC, FIELDATA, Baudot etc, in different variations (in particular EBCDIC)

ASCII – The mother of character sets

• For anyone not being a masochist (i.e. anyone not using a mainframe), the character set of choice soon became 7-bit ASCII (American Standard Code for Information Interchange), first published in 1963

• 7-bits was enough for US English characters and control characters, with some legroom (note that ASCII is US English, not UK English, centric)

• The 8th bit was used for parity in transmission
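To make the parity idea concrete, here is a minimal Python sketch. It assumes even parity (the slide does not say which kind was used), and the helper name `with_even_parity` is made up for this illustration.

```python
def with_even_parity(ch: str) -> int:
    """Return the 8-bit value of a 7-bit ASCII character plus a parity bit.

    Assumption: even parity, i.e. the 8th bit is chosen so that the total
    number of 1-bits in the byte is even.
    """
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not a 7-bit ASCII character")
    ones = bin(code).count("1")   # 1-bits among the 7 data bits
    parity = ones % 2             # 1 if one more 1-bit is needed to make it even
    return code | (parity << 7)


# Show the 7 data bits and the resulting 8-bit value for two characters.
for ch in "AC":
    print(ch, format(ord(ch), "07b"), format(with_even_parity(ch), "08b"))
```

With the 8th bit busy doing this job, there was simply no room left for extra characters.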

All ASCII hell breaks loose

• As the original 7-bit US ASCII didn’t support anything but US English, variations started to appear.

• Any decent computer supported 8-bit characters, but the assumption was still that bit 8 was a parity bit.

• So 7-bit local variations were developed, Swedish 7-bit ASCII for example (anyone coding in C knows and hates this)
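For those who never had the pleasure, here is a rough Python sketch of what C source looked like on a Swedish 7-bit terminal. It assumes the six replacements of the Swedish national variant that I am sure of ([ \ ] became Ä Ö Å and { | } became ä ö å); the function name is invented for the example and the real standard has further variants.

```python
# Assumed mapping (Swedish 7-bit variant): [ \ ] -> Ä Ö Å and { | } -> ä ö å.
SWEDISH_7BIT = str.maketrans(
    "[\\]{|}",
    "\u00c4\u00d6\u00c5\u00e4\u00f6\u00e5",
)


def as_seen_on_swedish_terminal(source: str) -> str:
    """Show ASCII text the way a Swedish 7-bit device would display it."""
    return source.translate(SWEDISH_7BIT)


# Brackets, braces and the pipe all turn into Swedish letters.
print(as_seen_on_swedish_terminal("if (a[i] || b) { x = y; }"))
```

The bytes are identical in both worlds; only the interpretation of six code values differs, which is exactly the problem.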

And then we get 8-bit ASCII hell!

• Extended 8-bit ASCII solves a few problems, but also introduces a few new ones. Most of the new problems came from attempts to make 8-bit Extended ASCII compatible with 7-bit ASCII variations

• The Extended 8-bit “ASCII” character sets are largely standardized as ISO 8859 (with variations). Most common is ISO 8859-1 (Latin-1)

• 8859-15 is a not so popular 8859-1 update, including a Euro-sign among a few other things. Whether the Euro-sign really is a useful addition is yet to be determined

• Another 8859-1 variation is Windows CP1252, which is an enhanced 8859-1 character set
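A small Python sketch of how the three character sets disagree about the Euro sign. It only uses the codecs that ship with Python; the choice of the Euro sign as the test character is mine.

```python
euro = "\u20ac"  # the Euro sign

# Try to encode the Euro sign in each of the three 8-bit character sets.
for codec in ("iso-8859-1", "iso-8859-15", "cp1252"):
    try:
        print(codec, "->", euro.encode(codec).hex())
    except UnicodeEncodeError:
        print(codec, "-> cannot represent the Euro sign")

# The byte that means Euro in 8859-15 is the generic currency sign in 8859-1.
print(b"\xa4".decode("iso-8859-15"), b"\xa4".decode("iso-8859-1"))
```

So the same byte value is a different character depending on which "8-bit ASCII" you believe you are reading.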

Oh, then we have collations!

• A “collation” determines how characters in a character set are to be sorted!

• 7-bit ASCII was great (numeric order same as character order)

– Or was it? Really? Upper / Lower case?

• 7-bit localized ASCII was not so great. To say the least. Swedish 7-bit ASCII was not correctly sorted (å last in the alphabet, after ä and ö)

• 8-bit Extended ASCII didn’t help much (Swedish again being in the wrong order, but not the same wrong order as with 7-bit “Swedish ASCII”)
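The two wrong orders can be reproduced with a few lines of Python. This is a sketch that assumes the lowercase Swedish letters sat on the codes of { | } in Swedish 7-bit ASCII; the variable names are mine.

```python
# å, ä, ö in correct Swedish alphabetical order.
letters = ["\u00e5", "\u00e4", "\u00f6"]

# Assumption: Swedish 7-bit ASCII put ä, ö, å on the codes of { | }.
swedish_7bit_code = {"\u00e4": 0x7B, "\u00f6": 0x7C, "\u00e5": 0x7D}

# Sort by numeric code value in each character set.
by_7bit = sorted(letters, key=swedish_7bit_code.get)
by_latin1 = sorted(letters, key=lambda ch: ch.encode("iso-8859-1"))

print("correct:", letters)
print("7-bit  :", by_7bit)
print("8859-1 :", by_latin1)
```

Neither numeric order matches the alphabet, and the two do not even agree with each other.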

Collation basics

• Don’t ever think that the character set determines the sorting!

– The same character set used in different countries may be sorted differently

– Different sorting models may be used in the same country (A good example is case sensitivity)

• Also, collations are not only about sorting, they are also about comparisons and a few other things
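As a rough illustration of "same character set, different collation", the Python sketch below sorts and compares the same four strings in two ways. The two "collations" here are just plain code point order and a case-folded key; real database collations are considerably more elaborate.

```python
words = ["b", "A", "a", "B"]

# "Binary" collation: compare raw code points, so all capitals sort first.
case_sensitive = sorted(words)

# Case-insensitive collation: compare a case-folded key instead.
case_insensitive = sorted(words, key=str.casefold)

print(case_sensitive)
print(case_insensitive)

# Comparison changes too, not just sort order.
print("a" == "A", "a".casefold() == "A".casefold())
```

Nothing about the stored characters changed between the two runs, only the rule used to order and compare them.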

Interoperating with ASCII

• As long as we were all using 1 single computer or a bunch of similar computers in a LAN, the issues were limited

• As usual, the Internet turned this beautiful environment into something truly evil!

• Internet got started in the US

– Which means, again, that the founders were convinced that 7-bit ASCII would be OK. That this had been an incorrect assumption 30 years before Facebook came around made no difference. Of course not.

Interoperability necessities

• For us to be able to communicate we need to be able to tell what character set we expect here at the client side, the server has to tell what it delivers, and then we need a way to align all this.

• The trick: <meta http-equiv=Content-Type content="text/html; charset=iso-8859-1">

– Or maybe not? This tells what I get, but doesn’t allow me to say what I want!

• Actually, this didn’t help as much as we hoped
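What happens when the two sides do not agree can be sketched in a few lines of Python. The scenario (a server sending UTF-8 while the client assumes ISO 8859-1, and the reverse) is my own example, not something from a particular system.

```python
text = "\u00e4"  # the letter ä

# Server sends UTF-8, client assumes ISO 8859-1: one letter becomes two.
sent = text.encode("utf-8")
garbled = sent.decode("iso-8859-1")
print(len(text), len(garbled), garbled)

# The other way around is not even valid input for a UTF-8 decoder.
try:
    text.encode("iso-8859-1").decode("utf-8")
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err.reason)
```

The bytes arrive intact in both cases; it is the missing agreement on the character set that ruins the text.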

Part 1 Conclusion

• The many different local variations of characters served us well, for a while

• Now we have a global IT environment with many different character sets and collations, and we can’t deal with multiple local versions anymore

• And we have languages whose character set will not fit in 8 bits anyway

• And then we need to sort and compare all this!

Part 2 – UNICODE and Ken Thompson saves the world, without Batman and not by tracking down the penguin

UNICODE – One Character set for all

• Yes, that is what UNICODE (or ISO/IEC 10646) sets out to do – A common character set for ALL languages (close to 240.000 characters are defined in UNICODE 4.1 today, MySQL is somewhat at UNICODE 3.0). Sort of.

• This means that UNICODE has character codes that cannot fit in 1 byte. This is a big surprise to anyone on the other side of the pond, but there you go

• But there is a remedy: UNICODE Encodings!

UNICODE Encodings

• A UNICODE encoding is a standardized way of representing a character in the UNICODE character set

• Some UNICODE encodings (UCS-2, for example) can represent only select parts of the full UNICODE character set

• UNICODE encodings are part of the UNICODE standard itself (and this is a VERY good thing! If this wasn’t the case, both Apple and Microsoft would have invented their own encodings I’m sure)

UNICODE Encodings

• Among the UNICODE encodings are

– UCS-2 – 2 bytes wide (i.e. only 64k different characters can be represented)

– UTF-16 – 2 or 4 bytes wide. This is then a variable length scheme with a very complex setup. When only 2 bytes are used, they are the same as UCS-2

– UTF-32 – 4 bytes fixed size

• To be honest, besides UTF-16 / UCS-2 that is common in Windows and related frameworks (like COM), none of these are very popular
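To get a feel for the sizes, this Python sketch encodes four sample characters of my own choosing and counts the bytes. Python has no separate UCS-2 codec, so UTF-16 stands in for it; the `-be` codec variants are used only to keep a byte order mark out of the count, and `encoded_sizes` is a name made up for the example.

```python
def encoded_sizes(ch: str) -> tuple:
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be"))


samples = {
    "A": "A",
    "a with diaeresis": "\u00e4",
    "Euro sign": "\u20ac",
    "U+20000 (outside the BMP)": "\U00020000",
}

for name, ch in samples.items():
    utf8, utf16, utf32 = encoded_sizes(ch)
    print(f"{name:28} utf-8={utf8} utf-16={utf16} utf-32={utf32}")
```

Only the last character needs 4 bytes in UTF-16, which is precisely the case that UCS-2 cannot handle.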

UTF-8 – Some smart dudes at work!

• The problem that UNICODE has is that it has to represent all those characters. This should break some applications for sure.

• Well, Encodings solve that too, and the mother of all encodings is UTF-8. Invented not by Albert Einstein or Batman but by Ken Thompson!

• Let’s now have a round of applause for Ken Thompson!

The details of UTF-8

• UNICODE characters 0 – 127 are the same as in standard 7-bit ASCII (remember that?)

• UTF-8 works the same: For characters 0 – 127, the most significant (first) bit of the first (and only) byte is 0

• Beyond 7-bit ASCII characters, the number of “leading” 1’s in the first byte tells how many bytes make up the character

• All other bytes start with a 1 and a 0

• And the rest of the bits make up the character
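The bit patterns are easy to look at with Python's built-in UTF-8 codec. A minimal sketch, with a helper name (`utf8_bits`) and sample characters that are mine:

```python
def utf8_bits(ch: str) -> str:
    """Return the UTF-8 bytes of a character as space-separated bit strings."""
    return " ".join(format(byte, "08b") for byte in ch.encode("utf-8"))


# One, two and three byte characters: A, a with diaeresis, Euro sign.
for ch in ("A", "\u00e4", "\u20ac"):
    print(f"U+{ord(ch):04X}", utf8_bits(ch))
```

Read the output from the left: a first byte starting with 0 stands alone, 110 announces two bytes, 1110 announces three, and every following byte starts with 10.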

The details of UTF-8

• So in the first byte, it is one of two things:

– A leading 0 meaning a single byte character

– A number of 1’s (at least 2, as 1 byte characters are indicated by a leading 0) followed by a 0

• This means that the first byte in a character NEVER starts with the sequence 10

– All other bytes start with 10

– 1 UTF-8 byte can contain up to 7 bits of data

– 2 UTF-8 bytes contain from 8 to 11 bits of data

– 3 UTF-8 bytes contain from 12 to 16 bits of data

– 4 UTF-8 bytes contain from 17 to 21 bits of data
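For the curious, the rules fit in a small hand-written encoder. This is a sketch for illustration, not how any particular library does it: `utf8_encode` is a made-up name, and it refuses surrogates and values above U+10FFFF because UNICODE itself stops there even though 21 bits could hold more.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one UNICODE code point as UTF-8 by following the bit layout."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable UNICODE code point")
    if code_point < 0x80:        # up to 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare against Python's built-in codec for one character of each length.
for ch in ("A", "\u00e4", "\u20ac", "\U00020000"):
    print(f"U+{ord(ch):04X}", utf8_encode(ord(ch)).hex(), ch.encode("utf-8").hex())
```

The payload bits per length (7, 5+6, 4+6+6 and 3+6+6+6) are what give the 7, 11, 16 and 21 bit limits.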

Some useful aspects of UTF-8

• You can always find the leading byte of a character in a word, starting from any byte

– Just move “backward” until a byte not having a leading 10 is found

• Byte values 0 – 127 are ONLY present as character values 0 – 127, nowhere else!

– All other byte values have the highest bit set

– So strlen(), strcmp() etc. still work, but on a byte by byte, not character by character, level
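Both points can be tried out in Python. The sketch below assumes well-formed UTF-8 input; `char_start` is a name invented for the example, and Python's `len()` on bytes plays the role of `strlen()`.

```python
def char_start(data: bytes, index: int) -> int:
    """Return the index of the leading byte of the character covering data[index]."""
    # Continuation bytes look like 10xxxxxx, so step back while we see one.
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


text = "a\u00e4\u20ac"          # 1-byte, 2-byte and 3-byte characters
data = text.encode("utf-8")     # 61 | c3 a4 | e2 82 ac

print([char_start(data, i) for i in range(len(data))])
print(len(text), "characters,", len(data), "bytes")
```

A byte-oriented function sees 6 units where a human sees 3 characters, which is fine for copying and comparing for equality but not for counting or truncating.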

So, are we all OK with UTF-8 now?

• Let’s see. Using UTF-8 we can represent binary values with up to 21 bits, which is 2.097.152 characters! Which is more than enough! (But 640K RAM was ALSO more than enough)

• If we limit ourselves to 3 bytes UTF-8 we can represent 65.536 different characters, the same as if we use UCS-2 (which is a fixed 2-byte format). 65.536 characters is what is in the UNICODE Basic Multilingual Plane

Why we actually need 4-bytes UTF-8

• Beyond the BMP come a couple of other “planes”. The one that causes most issues is the one that adds a bunch of Chinese, Japanese and Korean characters

• For these, we need to go beyond the BMP and hence beyond the nice and cosy 65.536 characters. Duh!

• And this is why the MySQL assumption that UTF-8 means a maximum of 3 bytes might not be such a good idea after all
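A quick way to check whether a string would survive a 3-byte-only UTF-8 is to look for code points above U+FFFF. A minimal Python sketch; the function name and the two sample ideographs (U+4E2D inside the BMP, U+20000 outside it) are my choices.

```python
def needs_four_byte_utf8(text: str) -> bool:
    """True if any character lies outside the Basic Multilingual Plane."""
    return any(ord(ch) > 0xFFFF for ch in text)


inside_bmp = "\u4e2d"        # a common CJK ideograph, 3 bytes in UTF-8
outside_bmp = "\U00020000"   # a CJK ideograph beyond the BMP, 4 bytes in UTF-8

for ch in (inside_bmp, outside_bmp):
    print(f"U+{ord(ch):05X}", len(ch.encode("utf-8")), needs_four_byte_utf8(ch))
```

Anything for which this returns True cannot be represented in an encoding that stops at 3 bytes per character.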

Part 3 – MySQL and UNICODE

So, how does MySQL handle all this?

• MySQL supports a whole range of UNICODE Encodings and collations! Good!

• MySQL understands the case when we have one character set stored in a column in a table and another one on the client side, and nicely does a conversion for us! Good!

• Not all UNICODE Encodings are valid on the Client side! Not so good

• Actually, anything beyond UTF-8, when it comes to UNICODE on the client side, is troublesome

Lessons in MySQL and UNICODE

• Lesson 1: Learn about UNICODE and understand how it works

• Lesson 2: Stick with UTF-8! Most others do that too. Including Java, JavaScript, JSON, the web and many, many others!

• Lesson 3: UCS-2 may seem like a good idea, it is fixed length after all. It’s not (a good idea that is, fixed length it is)

• Lesson 4: Don’t forget about collations! They are important!

Collations – The Sequel

• Collations determine how strings are sorted

– Order by

– Indexes

– WHERE col1 > ‘Über’

• Collations determine how strings are compared

– Is A = Ä or not? Y = Ü?

• Watch in particular for COLLATIONS used for PRIMARY KEYs
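Why this matters for keys can be sketched in Python with a toy accent- and case-insensitive comparison. This is not how MySQL implements any of its collations; `loose_key` is a made-up helper that strips combining marks after UNICODE decomposition, and a real collation has its own rules about which letters are equal.

```python
import unicodedata


def loose_key(text: str) -> str:
    """Toy comparison key: ignore accents and case (an assumption, not a MySQL rule)."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()


a_umlaut = "\u00c4"   # Ä
u_umlaut = "\u00dc"   # Ü

print(loose_key("A") == loose_key(a_umlaut))   # are A and Ä the same key?
print(loose_key("Y") == loose_key(u_umlaut))   # are Y and Ü the same key?

# Two "different" values that collide under this rule, as they would in a unique key.
names = ["\u00dcber", "uber"]
print(len(names), len({loose_key(n) for n in names}))
```

Under this toy rule Ä equals A, and Ü equals U rather than Y. Which answer you get in a real database depends entirely on the collation chosen, so two values you think of as different may collide in a PRIMARY KEY.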

Storing UTF-8 data in MySQL

• Most Storage Engines are happy to use utf-8

• The MySQL Interpretation of UTF-8 is 1 – 3 bytes, or 65.536 different characters!

– This means that

• A CHAR(10) column requires 30 bytes fixed space!

• A VARCHAR(10) column is limited to 30 bytes

• MySQL 5.5 and up also supports 4-byte UTF-8 by using the character set utf8mb4
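The arithmetic behind those numbers, as a small Python sketch. The per-character maximums for utf8 and utf8mb4 come from the slides; the latin1 entry and all the names in the code are added for comparison and illustration only.

```python
# Maximum bytes per character for a few MySQL character sets.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}


def worst_case_bytes(n_chars: int, charset: str) -> int:
    """Bytes needed to hold n_chars characters in the worst case."""
    return n_chars * MAX_BYTES_PER_CHAR[charset]


# The CHAR(10) / VARCHAR(10) example for each character set.
for charset in MAX_BYTES_PER_CHAR:
    print(f"CHAR(10) in {charset}: up to {worst_case_bytes(10, charset)} bytes")
```

Moving a column from utf8 to utf8mb4 therefore raises the worst case for the same declared length from 30 to 40 bytes.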

Storing UTF-8 data in MySQL

• VARCHAR columns are actually fixed in some Storage Engines, most notably those engines developed sometime around the time of the American Civil War, when variable length data was still in its infancy

• UTF-8 can potentially waste A LOT of space

• Extra space for UTF-8 also affects byte size limits, such as VARCHAR and INDEX sizes

• UTF-8 data sorting is way more complex than a simple binary sort (so in some ways, things were better in the old 7-bit ASCII days)
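How a byte limit turns into a character limit can be sketched the same way. The 767-byte figure below is used as an example of an index key limit (it is the value commonly quoted for older InnoDB row formats, so treat it as an assumption and check your own engine and version); the function name is mine.

```python
def max_indexable_chars(limit_bytes: int, max_bytes_per_char: int) -> int:
    """How many characters fit if every one may need the maximum number of bytes."""
    return limit_bytes // max_bytes_per_char


EXAMPLE_INDEX_LIMIT = 767  # assumed example limit in bytes, see the note above

# The same byte limit expressed in characters for each character set.
for name, width in (("latin1", 1), ("utf8", 3), ("utf8mb4", 4)):
    print(name, max_indexable_chars(EXAMPLE_INDEX_LIMIT, width))
```

The limit is counted in bytes, so the wider the worst-case character, the fewer characters you are allowed to index.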

Some simple demos, Questions and Answers. And I haven’t even begun to talk about byte ordering and byte order marks.

THANK YOU!

Anders Karlsson

karlssonondatabases.blogspot.com
