PyParis 2017 / Unicode and bytes demystified, by Boris Feld

ℙƴ☂ℌøἤ ⒝⒴⒯⒠⒮⒝⒴⒯⒠⒮

DΣMYƧƬIFIΣD

Boris FELD - PyParis, Paris - 2017

http://lothiraldan.github.io/

http://pyparis.org/

Boris FELD

Python developer

Mercurial and Python consultant at Octobus

@lothiraldan

/me

https://octobus.net/

https://lothiraldan.github.io/

https://twitter.com/lothiraldan

Unicode is [emoji]!

Let's test it!

What is the length of this Unicode string in Python 2?

    len(u'😎')

1

2

3

4

1. Unicode length

It depends on your Python:

    DOCKER_IMAGE=quay.io/pypa/manylinux1_x86_64
    $> docker run -t -i $DOCKER_IMAGE /opt/python/cp27-cp27mu/bin/python \
        -c "print len(u'\U0001f60e')"
    1

But it can also be:

    DOCKER_IMAGE=quay.io/pypa/manylinux1_x86_64
    $> docker run -t -i $DOCKER_IMAGE /opt/python/cp27-cp27m/bin/python \
        -c "print len(u'\U0001f60e')"
    2

Unicode length

When could you see this error message?

    UnicodeEncodeError: 'ascii' codec can't encode character

When doing .encode('ascii')

When doing .decode('ascii')

When doing .decode('utf-8')

In all of these situations

2. UnicodeEncodeError

In all of these situations!

    >>> x = u'é'
    >>> x.encode('ascii')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
    >>> x.decode('ascii')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
    >>> x.decode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

UnicodeEncodeError

When should you use chr and unichr?

You should always use chr.

You should always use unichr.

You should use chr for ASCII and unichr for Unicode.

3. Chr vs unichr

Prefer using unichr for everything.

Chr vs unichr

Skeptical dog is skeptical

We have to go back!

The 60s

Apollo 11

Woodstock

Something important

Something huge

ASCII was born

In the 1960s, the American Standards Association wanted to answer the question:

How to represent text digitally?

The important question

Problem: computers only speak bits. How do we transform text into bits?

Problem

We know how to convert an integer to binary:

    0   = 0000000
    1   = 0000001
    2   = 0000010
    3   = 0000011
    ...
    127 = 1111111

Let's assign each character an integer from 0 to 127, called its "code point".

Pretty simple solution
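
As a small illustration of the table above, here is a sketch in Python 3 that writes a code point as 7 bits. The helper name to_7_bits is made up for this example.

```python
def to_7_bits(code_point):
    """Return the 7-bit binary form of an ASCII code point (0 to 127)."""
    if not 0 <= code_point <= 127:
        raise ValueError("ASCII code points range from 0 to 127")
    # "07b" means: binary digits, zero-padded to a width of 7.
    return format(code_point, "07b")


for n in (0, 1, 2, 3, 127):
    print(n, "=", to_7_bits(n))
```

Anything above 127 is rejected here on purpose: it simply does not fit in the 7 bits ASCII uses.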

ASCII with Python

Let's take a string:

    "pyparis"

A string is a sequence of characters:

    assert list("pyparis") == ['p', 'y', 'p', 'a', 'r', 'i', 's']

What is a string?

    assert type("pyparis"[0]) == str
    assert len("pyparis"[0]) == 1

A character (from the Greek χαρακτήρ "engraved or stamped mark" on coins or seals, "branding mark, symbol") is a sign or symbol.

—Wikipedia

A character is basically anything. It could be a letter, a digit or even an emoji.

What is a character

https://en.wikipedia.org/wiki/Character_(symbol)

To retrieve the ASCII code point of a character, we can use ord:

    assert ord("p") == 112

To reverse the process we can use chr:

    assert chr(112) == "p"

Code point in Python

    Character    p    y    p    a    r    i    s
    Code point   112  121  112  97   114  105  115

Code points

    Character    p        y        p        a        r        i        s
    Code point   112      121      112      97       114      105      115
    Binary       1110000  1111001  1110000  1100001  1110010  1101001  1110011

    code point --encode--> binary
    code point <--decode-- binary

ASCII encoding
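
The table can be rebuilt with a few lines of Python 3. This is only a sketch of the idea: it uses ord, chr and format, and the variable names are chosen for this example.

```python
text = "pyparis"

# Character -> code point -> 7-bit binary string.
code_points = [ord(char) for char in text]
binary = [format(code_point, "07b") for code_point in code_points]

for char, code_point, bits in zip(text, code_points, binary):
    print(char, code_point, bits)

# And back again: binary string -> code point -> character.
decoded = "".join(chr(int(bits, 2)) for bits in binary)
print(decoded)
```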

encode is meant to transform a string into some bytes:

    from binascii import hexlify, unhexlify

    string = 'abc'
    data = string.encode('ascii')
    assert hexlify(data) == b'616263'

decode is meant to transform some bytes into a string:

    data = unhexlify(b'616263')
    string = data.decode('ascii')
    assert string == 'abc'

Each of these methods accepts an encoding parameter: the name of the conversion algorithm to use.

Encode vs Decode

Everything is awesome...

...right?

Small problem

ASCII solved the problem for the USA but not for everyone else.

Not everyone speaks English

ASCII only uses the 7 lower bits of a byte: 01100001

But on most computers a byte is actually 8 bits, so we can support more characters.

And so new standards were born...

Other standards

Some were based on ASCII and used the 8th bit to add support for accents, for example Latin-1, which defines the character É with the code point 201.

Some others were not compatible at all, like EBCDIC, used on IBM mainframes, where the code point 1001011 (75) represents the punctuation mark "." while in ASCII it represents "K".

Of course they were not all cross-compatible...

Other standards

It was a mess

    Initial text            a         b         ã         é
    Latin-1 code point      97        98        227       233
    Latin-1 encoding        01100001  01100010  11100011  11101001
    ASCII decoding          a         b         ERROR     ERROR
    Mac OS Roman decoding   a         b         „         È
    EBCDIC decoding         /         ERROR     T         Z

Example
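
The mess is easy to reproduce in Python 3. The sketch below encodes the text once as Latin-1, then decodes the same bytes with other codecs. The codec names mac_roman and cp037 are the ones shipped with Python; cp037 is only one EBCDIC variant, picked here as an assumption, so it may not match the variant used for the table.

```python
text = "abãé"
data = text.encode("latin-1")  # the four bytes 61 62 e3 e9

# Same bytes, three different readings.
as_mac_roman = data.decode("mac_roman")
as_ebcdic = data.decode("cp037")
# ASCII cannot decode bytes above 127; 'replace' swaps them for U+FFFD
# instead of raising UnicodeDecodeError.
as_ascii = data.decode("ascii", errors="replace")

print(repr(as_mac_roman))
print(repr(as_ebcdic))
print(repr(as_ascii))
```

Nothing in the bytes themselves says which reading is the right one.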

Here comes our savior!

One Standard to rule them all,

One Standard to find them,

One Standard to bring them all

and in the greater good bind them

Unicode the savior

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.

—Wikipedia

It all started in 1987-1988 as a coordination between Joe Becker from Xerox and Lee Collins and Mark Davis from Apple.

Fortunately for us, the Unicode code points are ASCII compatible.

What is Unicode?

https://en.wikipedia.org/wiki/Unicode
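
A quick way to see the ASCII compatibility in Python 3: every ASCII character keeps its code point in Unicode, and UTF-8 encodes it to the very same byte. The variable names below are just for this sketch.

```python
import string

ascii_text = string.printable  # digits, letters, punctuation, whitespace

# Every code point is below 128, exactly as in the ASCII table.
all_below_128 = all(ord(char) < 128 for char in ascii_text)

# Encoding as ASCII or as UTF-8 produces identical bytes.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

print(all_below_128, same_bytes)
```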

The latest version of Unicode contains a repertoire of 128,237 characters covering 135 modern and historic scripts, as well as multiple symbol sets.

—Wikipedia

ASCII defined 128 characters, so Unicode defines about 1000 times more characters.

It defines several blocks:

Basic Latin: a b ... X Y Z

Greek, Aramaic, Cherokee: Δ ע Ꮧ

Right-to-left scripts, Cuneiform, hieroglyphs

Mahjong Tiles, Domino Tiles, Playing cards

Emoticons, Musical notations

Unicode size

https://en.wikipedia.org/wiki/Unicode

Remember the ASCII table?

Unicode vs ASCII

Unicode with Python

Let's take a Unicode character: €.

First, declare the encoding of your Python source file as UTF-8:

    # -*- coding: utf-8 -*-

Then, you can write it this way:

    u'€'

Or:

    u'\u20AC'

Its code point is 8364:

    ord(u'€') == 8364

How to write Unicode in Python

Let's convert the code point into binary:

    Character          €
    Code point         8364
    Naive conversion   00100000 10101100

Problem

It doesn't fit into 1 byte.

The problems when you start using more than 1 byte are multiple and annoying:

How to order the bytes? Big and little endian problems, anyone?

How to recognize which byte you are reading in a file or stream?

How to detect and correct transmission errors where some bytes went missing?

8364 in binary takes two bytes. Unicode code points go well beyond 1,000,000 (many of them are not allocated yet), taking up to 3 bytes.

Multi-bytes
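
The byte-order question can be made concrete in Python 3 with int.to_bytes and int.from_bytes. This is a sketch of the naive conversion only, not of any real Unicode encoding.

```python
code_point = ord("€")  # 8364, i.e. 0x20AC

# Two possible ways to lay out the same two bytes.
big = code_point.to_bytes(2, "big")        # most significant byte first
little = code_point.to_bytes(2, "little")  # least significant byte first

# Reading little-endian bytes as if they were big-endian gives another number,
# hence another character.
misread = int.from_bytes(little, "big")

# The highest code point Unicode allows is 0x10FFFF.
bits_needed = (0x10FFFF).bit_length()

print(big, little, misread, bits_needed)
```

21 bits is more than 2 bytes, which is why the naive conversion needs up to 3 bytes.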

As ASCII was simple, transforming ASCII code points into binary was straightforward.

But the presence of high code point characters in Unicode complicates the process. There are multiple ways of doing it, called encodings:

UTF-8

UTF-16

UTF-32

Multiple encodings

If you are not sure, use UTF-8: it is compatible with every character, works well most of the time and solves the multi-byte problems elegantly.

If you process more Asian characters than Latin ones, use UTF-16 so you use less space and memory.

If you need to interact with another program, use the other program's default encoding (CSV anyone?).

Comparison of Unicode encodings - Wikipedia

Choose an encoding

https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
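
To get a feeling for the space trade-off, here is a Python 3 sketch that measures the encoded size of a Latin sample and a Japanese sample. The helper encoded_sizes and the sample strings are assumptions made for this example; the -le codec variants are used so that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Return the number of bytes text takes in each Unicode encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {encoding: len(text.encode(encoding)) for encoding in encodings}


for sample in ("pyparis", "日本語"):
    print(sample, encoded_sizes(sample))
```

The Latin sample is smallest in UTF-8, while the Japanese sample is smallest in UTF-16.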

UTF-8 Everywhere Manifesto

UTF-8 everywhere

http://utf8everywhere.org/

    Character          A                                     €
    Code point         65                                    8364
    Naive conversion   01000001                              00100000 10101100
    UTF-8              01000001                              11100010 10000010 10101100
    UTF-16             00000000 01000001                     00100000 10101100
    UTF-32             00000000 00000000 00000000 01000001   00000000 00000000 00100000 10101100

What are the differences?
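
The rows of this table can be checked with a short Python 3 sketch. The helper name bits is made up for this example, and the big-endian codec variants (utf-16-be, utf-32-be) are chosen so that the output has no byte order mark and matches the byte order shown in the table.

```python
def bits(text, encoding):
    """Encode text and return its bytes as space-separated groups of 8 bits."""
    return " ".join(format(byte, "08b") for byte in text.encode(encoding))


for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    for char in ("A", "€"):
        print(encoding, char, bits(char, encoding))
```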

Let's clarify something:

encode is meant to transform a Unicode string into some bytes:

    from binascii import hexlify, unhexlify

    hexlify(u'é'.encode('utf-8')) == b'c3a9'

decode is meant to transform some bytes into a Unicode string:

    unhexlify(b'c3a9').decode('utf-8') == u'é'

Encode vs Decode

Python 2

Counting the length of an ASCII string is easy: count the number of bytes!

But it's much harder with Unicode strings.

Python 2 tries hard to get you a correct answer.

Let's go back to our example: 😎. Its code point is 128526.

1. String length

Python 2 comes in several flavors, and two of them are related to Unicode: it's either a narrow build or a wide build. This basically changes how Python stores its strings.

For code points up to 65535, everything works the same: Python stores each character separately, as a single character.

For code points above 65535, it differs. The wide build character size is big enough for all Unicode code points. But the narrow build character size is not big enough for code points above 65535, so it stores them as a pair of characters (a UTF-16 surrogate pair).

The narrow build uses less memory, but this explains why it returns 2 for len(u'😎'): Python 2 actually stores two characters.

Multiple flavors of Python 2
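
Narrow builds are hard to find nowadays, but what they store can be approximated in Python 3 by counting UTF-16 code units. The two helpers below (utf16_code_units and surrogate_pair) are hypothetical names for this sketch; they emulate the storage, they are not how CPython is implemented.

```python
def utf16_code_units(text):
    """Count the 16-bit units needed to store text, as a narrow build would."""
    return len(text.encode("utf-16-le")) // 2


def surrogate_pair(code_point):
    """Split a code point above 0xFFFF into its high and low surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


emoji = "\U0001f60e"
print(len(emoji))               # what a wide build (and Python 3) reports
print(utf16_code_units(emoji))  # what a narrow build reports
print([hex(unit) for unit in surrogate_pair(ord(emoji))])
```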

Remember the meaning of encode and decode?

Encode transforms a Unicode string into some bytes.

Decode transforms some bytes into a Unicode string.

2. Encoding/Decoding in Python 2

Python has always had a string type; the unicode type was introduced in Python 2.0.

The Python 2 str type is badly named, as it's basically a bag of bytes. When you display it, Python will try to decode it for you. So for ASCII-only strings, encode and decode return the same value:

    x = 'abc'
    assert x.encode('ascii') == x
    assert x.decode('ascii') == x

Python 2 type system

Python is a strongly typed language, meaning that Python shouldn't coerce types behind your back:

    >>> '012' + 3
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: cannot concatenate 'str' and 'int' objects

But it does not respect this property with strings. Remember that decode converts bytes into a Unicode string?

    x = u'é'
    x.decode('utf-8')

Here decode is called on a unicode instance, which isn't bytes. So Python tries to make some bytes out of the string first and does:

    x = u'é'
    x.encode('ascii').decode('utf-8')

That's why you can see a UnicodeEncodeError while trying to decode a Unicode string in Python 2.

Python 2 type coercion
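
Python 3 no longer does this, but the hidden step can be emulated there to watch it fail. The function py2_unicode_decode below is a hypothetical stand-in for what Python 2 does implicitly, written for this sketch only.

```python
def py2_unicode_decode(text, encoding):
    """Emulate Python 2's unicode.decode: implicit ASCII encode, then decode."""
    return text.encode("ascii").decode(encoding)


# Pure ASCII text survives the hidden encode step...
print(py2_unicode_decode("abc", "utf-8"))

# ...but anything else fails while ENCODING, even though we asked to decode.
message = None
try:
    py2_unicode_decode("é", "utf-8")
except UnicodeEncodeError as error:
    message = str(error)
print(message)
```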

You can use chr to get the character of a code point:

    assert chr(65) == 'A'

But it only works for code points below 256!

    >>> chr(8364)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: chr() arg not in range(256)

For Unicode you need to use unichr:

    assert unichr(8364) == u'€'

3. Python 2 chr vs unichr

Python 3 ♥ ♥ ♥ ♥

Python 3 now always stores its strings the same way, and len returns the right answer no matter what:

    x = '😎'
    assert len(x) == 1

1. Python 3 single flavor

Python 3's biggest change was to the string type system.

                Byte strings   Unicode strings
    Python 2    str            unicode
    Python 3    bytes          str

2. Python 3 big change

Now that Python 3 has separate types for bytes and strings, we can no longer mess up encode and decode:

    >>> string = ''
    >>> string.decode('ascii')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'str' object has no attribute 'decode'

Decoding a Unicode string never made sense anyway.

    >>> data = b''
    >>> data.encode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'bytes' object has no attribute 'encode'

So you always know which types you are dealing with.

2. Python 3 coherent type system
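
The same separation can be checked programmatically in Python 3. This sketch only uses hasattr and a try/except; the variable names are picked for the example.

```python
text = "é"                   # str: a sequence of code points
data = text.encode("utf-8")  # bytes: a sequence of integers from 0 to 255

# Each type only offers the conversion that makes sense for it.
print(hasattr(text, "encode"), hasattr(text, "decode"))
print(hasattr(data, "encode"), hasattr(data, "decode"))

# Mixing the two types is refused instead of being silently coerced.
mixing_error = None
try:
    text + data
except TypeError as error:
    mixing_error = error
print(mixing_error)
```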

Unicode strings are now the norm, so Python 3 dropped the u prefix for Unicode strings and replaced it with a b prefix for bytes, so you directly write:

    x = '😎'

Python 3.3 reintroduced the u prefix for codebases that need to be compatible with both Python 2 and Python 3, so this also works:

    x = u'😎'

2. No more u prefix

Python 3 no longer has separate chr and unichr functions, just use chr:

    assert chr(65) == 'A'
    assert chr(8364) == '€'

3. Python 3 chr

Pain relief tips

Thanks to the new type system, it is now easier to identify which parts of the code need to encode strings and decode bytes.

    Outside world     bytes
          |
    Library           decode
          |           unicode
    Business logic
          |           unicode
    Library           encode
          |
    Outside world     bytes

1. Unicode sandwich

Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end.

— Python doc on Unicode

Unicode sandwich

https://docs.python.org/3/howto/unicode.html
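
Here is what a very small sandwich could look like in Python 3. The function shout and its "business logic" (upper-casing) are invented for this sketch; the point is only where decode and encode sit.

```python
def shout(raw, encoding="utf-8"):
    """Upper-case some encoded text, decoding first and encoding last."""
    text = raw.decode(encoding)     # bytes -> str, as soon as data comes in
    result = text.upper()           # the business logic only ever sees str
    return result.encode(encoding)  # str -> bytes, right before data goes out


incoming = "café".encode("utf-8")
print(shout(incoming))
```

The encoding is passed in explicitly, so the same logic works unchanged when the outside world speaks Latin-1 instead of UTF-8.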

You cannot infer the encoding of bytes:

    Content-Type: text/html; charset=ISO-8859-4

    <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />

    <?xml version="1.0" encoding="UTF-8"?>

    # -*- coding: iso8859-1 -*-

If you really really really really need to guess the encoding, you can use chardet, but remember, it's a best-effort scenario.

2. Use declared encoding

https://github.com/chardet/chardet
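
For the first kind of declaration, the HTTP header, the standard library can already extract the charset. The sketch below leans on email.message.Message from Python 3; the helpers declared_charset and decode_body are hypothetical names, and refusing to decode without a declaration is a choice made for this example.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset declared in a Content-Type value, or None."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset()  # lower-cased, None if absent


def decode_body(body, content_type):
    """Decode body with the declared charset; refuse to guess otherwise."""
    charset = declared_charset(content_type)
    if charset is None:
        raise ValueError("no charset declared, refusing to guess")
    return body.decode(charset)


print(declared_charset("text/html; charset=ISO-8859-4"))
print(decode_body("€".encode("utf-8"), "text/html; charset=utf-8"))
```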

encode and decode accept a second argument for error handling. By default it is set to strict, which means crash:

    >>> x = u'abcé'
    >>> x.encode('ascii', errors='strict')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3...

You can also use replace to replace invalid characters with ?:

    assert x.encode('ascii', errors='replace') == 'abc?'

Or you can simply ignore them:

    assert x.encode('ascii', errors='ignore') == 'abc'

Finally you can replace them with their XML character reference:

    assert x.encode('ascii', errors='xmlcharrefreplace') == 'abc&#233;'

3. Error handling
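
The same handlers exist in Python 3, where the results are bytes objects. The sketch below repeats the examples there and also shows the errors parameter on the decode side; the variable names are chosen for this example.

```python
text = "abcé"

# 'strict' is the default: an unencodable character raises UnicodeEncodeError.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

replaced = text.encode("ascii", errors="replace")
ignored = text.encode("ascii", errors="ignore")
xml_refs = text.encode("ascii", errors="xmlcharrefreplace")

# On decode, 'replace' turns each invalid byte into U+FFFD.
decoded = b"abc\xe9".decode("ascii", errors="replace")

print(strict_failed, replaced, ignored, xml_refs, repr(decoded))
```

Every handler except strict loses or alters information, so they fit best at the output edge of the sandwich.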

Use Unicode whenever possible.

Use Python 3.

Explicitly encode and decode your strings in Python 2: it might solve bugs in your code and ease the conversion to Python 3.

Unicode sandwich.

Never guess an encoding!

Use error handling.

Conclusion

    for c in range(0x1F410, 0x1F4f0):
        print (r"\U%08x" % c).decode("unicode-escape"),

Python fun

Thank you!

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!): https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

Pragmatic Unicode: https://nedbatchelder.com/text/unipain.html

Unicode In Python, Completely Demystified: http://farmdev.com/talks/unicode/

What every programmer absolutely, positively needs to know about encodings and character sets to work with text: http://kunststube.net/encoding/

Holy batman: http://www.manuel-strehl.de/publications/holy-batman/presentation

Reddit on unicode: https://www.reddit.com/r/Unicode/

References
