ℙƴ☂ℌøἤ ⒝⒴⒯⒠⒮⒝⒴⒯⒠⒮
DΣMYƧƬIFIΣD
Boris FELD - PyParis, Paris - 2017
Boris FELD
Python developer
Mercurial and Python consultant at Octobus
https://lothiraldan.github.io/
@lothiraldan
/me
Unicode is [emoji]!
Let's test it!
What is the length of this Unicode string in Python 2?
len(u'😎')
1
2
3
4
1. Unicode length
It depends on your Python build:
DOCKER_IMAGE=quay.io/pypa/manylinux1_x86_64
$> docker run -t -i $DOCKER_IMAGE /opt/python/cp27-cp27mu/bin/python \
    -c "print len(u'\U0001f60e')"
1
But it can also be:
DOCKER_IMAGE=quay.io/pypa/manylinux1_x86_64
$> docker run -t -i $DOCKER_IMAGE /opt/python/cp27-cp27m/bin/python \
    -c "print len(u'\U0001f60e')"
2
Unicode length
When could you see this error message?
UnicodeEncodeError: 'ascii' codec can't encode character
When doing .encode('ascii')
When doing .decode('ascii')
When doing .decode('utf-8')
In all of these situations
2. UnicodeEncodeError
In all of these situations!
>>> x = u'é'
>>> x.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
>>> x.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
>>> x.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
UnicodeEncodeError
When should you use chr and unichr?
You should always use chr.
You should always use unichr.
You should use chr for ASCII and unichr for Unicode.
3. Chr vs unichr
Prefer using unichr for everything.
Chr vs unichr
Skeptical dog is skeptical
We have to go back!
The 60s
Apollo 11
Woodstock
Something important
Something huge
ASCII was born
In the 1960s, the American Standards Association wanted to answer the question:
How to represent text digitally?
The important question
Problem: computers only speak bits. How do we transform text into bits?
Problem
We know how to convert integers to binary:
0 = 0000000
1 = 0000001
2 = 0000010
3 = 0000011
...
127 = 1111111
Let's assign each character an integer from 0 to 127, named a "code point".
Pretty simple solution
ASCII with Python
Let's take a string:
"pyparis"
A string is a sequence of characters:
assert list("pyparis") == ['p', 'y', 'p', 'a', 'r', 'i', 's']
What is a string?
assert type("pyparis"[0]) == str
assert len("pyparis"[0]) == 1
A character (from the Greek χαρακτήρ "engraved or stamped mark" on coins or seals, "branding mark, symbol")
is a sign or symbol.
—Wikipedia
A character is basically anything. It could represent a letter, a digit or even an emoji.
What is a character
To retrieve the ASCII code point of a character, we can use ord:
assert ord("p") == 112
To reverse the process we can use chr:
assert chr(112) == "p"
Code point in Python
            p        y        p        a        r        i        s
Code point  112      121      112      97       114      105      115
Code points
            p        y        p        a        r        i        s
Code point  112      121      112      97       114      105      115
Binary      1110000  1111001  1110000  1100001  1110010  1101001  1110011
code point --encode--> binary
binary --decode--> code point
ASCII encoding
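The table above can be rebuilt in a few lines. This is only a sketch, written for Python 3 (an assumption, since the slides mostly show Python 2); the variable names are illustrative.

```python
# Rebuild the code point / binary table for "pyparis" (ASCII only).
text = "pyparis"

code_points = [ord(ch) for ch in text]               # character -> code point
binary = [format(cp, "07b") for cp in code_points]   # code point -> 7 bits

for ch, cp, bits in zip(text, code_points, binary):
    print(ch, cp, bits)

# Reverse the process: 7-bit groups -> code points -> characters.
decoded = "".join(chr(int(bits, 2)) for bits in binary)
assert decoded == text
```

Seven bits are enough here because every ASCII code point is below 128.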
encode is meant to transform a string into some bytes:
import binascii
string = 'abc'
data = string.encode('ascii')
# hexlify shows the raw bytes as hexadecimal digits
assert binascii.hexlify(data) == b'616263'
decode is meant to transform some bytes into a string:
import binascii
data = binascii.unhexlify('616263')
string = data.decode('ascii')
assert string == 'abc'
Each of these methods accepts an encoding parameter naming the conversion algorithm to use.
Encode vs Decode
Everythingisawesome...
...right?
Small problem
ASCII solved the problem for the USA but not for everyone else.
Not everyone speaks English
ASCII only uses the 7 lower bits of a byte. 01100001
But on most computers a byte is actually 8 bits, so we can support more characters.
And so new standards were born...
Other standards
Some were based on ASCII and used the 8th bit to add support for accents, for example Latin-1, which defines the character É with the code point 201.
Some others were not compatible at all, like EBCDIC, used on IBM mainframes, where the 1001011 (code point 75) code point represents the punctuation mark "." while in ASCII it represents "K".
Of course they were not all cross-compatible...
Other standards
It was a mess
Initial text            a         b         ã         é
Latin-1 code point      97        98        227       233
Latin-1 encoding        01100001  01100010  11100011  11101001
ASCII decoding          a         b         ERROR     ERROR
Mac OS Roman decoding   a         b         „         È
EBCDIC decoding         /         ERROR     T         Z
Example
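The mess in the table can be reproduced with the codecs shipped with Python. A sketch for Python 3; cp037 is picked here as one EBCDIC code page among many (an assumption, the slide does not say which variant it used), and since cp037 maps every byte, its second column may come out as a character rather than an error.

```python
# The four characters of the table, encoded once as Latin-1 bytes.
raw = "ab\u00e3\u00e9".encode("latin-1")   # 'a', 'b', 'ã', 'é'
assert raw == b"\x61\x62\xe3\xe9"

# Decoding with the codec that was used to encode gives the text back.
assert raw.decode("latin-1") == "ab\u00e3\u00e9"

# ASCII has no meaning for bytes above 127, so decoding fails.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Mac OS Roman decodes without error, but into different characters.
as_mac_roman = raw.decode("mac_roman")

# cp037: one EBCDIC variant, chosen for illustration only.
as_ebcdic = raw.decode("cp037")

# ascii() keeps the output printable on any terminal.
print(ascii(as_mac_roman), ascii(as_ebcdic))
```

The same four bytes give three different texts, and nothing in the bytes themselves says which reading is the right one.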
Here comes our savior!
One Standard to rule them all,
One Standard to find them,
One Standard to bring them all
and in the greater good bind them
Unicode the savior
Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in
most of the world's writing systems.
—Wikipedia
It all started in 1987-1988 as a coordination between Joe Becker from Xerox and Lee Collins and Mark Davis from Apple.
The Unicode code points are, fortunately for us, ASCII compatible.
What is Unicode?
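The ASCII compatibility can be checked directly. A small Python 3 sketch (variable names are illustrative): the first 128 code points are the ASCII ones, and UTF-8 encodes each of them as the same single byte.

```python
# Every ASCII code point keeps the same single-byte value in UTF-8.
for code_point in range(128):
    ch = chr(code_point)
    assert ch.encode("ascii") == ch.encode("utf-8") == bytes([code_point])

# So ASCII-encoded bytes can be decoded as UTF-8 unchanged.
ascii_bytes = "pyparis".encode("ascii")
assert ascii_bytes.decode("utf-8") == "pyparis"
```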
The latest version of Unicode contains a repertoire of 128,237 characters covering 135 modern and historic
scripts, as well as multiple symbol sets.
—Wikipedia
ASCII defined 128 characters, so Unicode defines about 1000 times more characters.
It defines several blocks:
Basic Latin: ab...XYZ
Greek, Aramaic, Cherokee: ΔעᏗ
Right-to-left scripts, Cuneiform, hieroglyphs:
Mahjong Tiles, Domino Tiles, Playing cards:
Emoticons, Musical notations:
Unicode size
Remember the ASCII table?
Unicode vs ASCII
Unicode with Python
Let's take a Unicode character: €.
First, declare the encoding of your Python source file as UTF-8:
# -*- coding: utf-8 -*-
Then, you can write it this way:
u'€'
Or:
u'\u20AC'
Its code point is 8364:
ord(u'€') == 8364
How to write Unicode in Python
Let's convert the code point into binary:
                  €
Code point        8364
Naive conversion  0010000010101100
Problem
It doesn't fit into 1 byte.
The problems when you start using more than 1 byte are multiple and annoying:
How to order the bytes? Big- and little-endian problems, anyone?
How to recognize which byte you are reading in a file or stream?
How to detect and correct transmission errors where only some bytes went missing?
8364 in binary takes two bytes. Unicode code points go well beyond 1,000,000 (the range includes values that are not allocated yet), taking up to 3 bytes.
Multi-bytes
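The byte-order question can be made concrete with integers alone. A Python 3 sketch using int.to_bytes and int.from_bytes; the variable names are illustrative.

```python
code_point = 8364  # the euro sign

# 14 significant bits, so one byte (8 bits) is not enough.
assert code_point.bit_length() == 14

big = code_point.to_bytes(2, "big")        # most significant byte first
little = code_point.to_bytes(2, "little")  # least significant byte first
assert big == b"\x20\xac"
assert little == b"\xac\x20"

# Reading the bytes back with the wrong order gives another number.
assert int.from_bytes(big, "little") == 0xAC20

# The highest Unicode code point, U+10FFFF, needs 21 bits, hence 3 bytes.
assert (0x10FFFF).bit_length() == 21
```

Both byte strings are valid ways to store 8364; a reader has to know which order the writer picked.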
As ASCII was simple, transforming ASCII code points into binary was straightforward.
But the presence of high code point characters in Unicode complicates the process. There are multiple ways of doing it, called encodings:
UTF-8
UTF-16
UTF-32
Multiple encodings
If you are not sure, use UTF-8: it is compatible with every character, works well most of the time and solves multi-byte-related problems elegantly.
If you process more Asian characters than Latin ones, use UTF-16 so you use less space and memory.
If you need to interact with another program, use the other program's default encoding (CSV anyone?).
Comparison of Unicode encodings - Wikipedia
Choose an encoding
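The space trade-off between the encodings can be measured. A Python 3 sketch; the two sample strings are arbitrary examples chosen for illustration, not taken from the talk.

```python
samples = {
    "latin": "hello",
    "japanese": "\u65e5\u672c\u8a9e",  # three CJK characters
}

sizes = {}
for name, text in samples.items():
    sizes[name] = {
        # The -le variants are used so that no byte order mark is added.
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

print(sizes)
```

For these samples UTF-8 is the smallest for the Latin text and UTF-16 the smallest for the Japanese text; real data, which often mixes scripts with ASCII markup, may behave differently.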
                  A                                 €
Code point        65                                8364
Naive conversion  01000001                          0010000010101100
UTF-8             01000001                          111000101000001010101100
UTF-16            0000000001000001                  0010000010101100
UTF-32            00000000000000000000000001000001  00000000000000000010000010101100
What are the differences?
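The rows of the table can be regenerated from the codecs themselves. A Python 3 sketch; bits is a hypothetical helper written for this example.

```python
def bits(data):
    """Render bytes as a string of 0s and 1s, 8 per byte."""
    return "".join(format(byte, "08b") for byte in data)

euro = "\u20ac"

# Big-endian variants are used so the output matches the table
# and no byte order mark is prepended.
assert bits("A".encode("utf-8")) == "01000001"
assert bits("A".encode("utf-16-be")) == "0000000001000001"
assert bits(euro.encode("utf-8")) == "111000101000001010101100"
assert bits(euro.encode("utf-16-be")) == "0010000010101100"
assert bits(euro.encode("utf-32-be")) == "00000000000000000010000010101100"
```

UTF-8 spends three bytes on the euro sign because part of each byte marks its position in the sequence (the leading 1110 and 10 bits), which is how a reader can tell which byte of a character it is looking at.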
Let's clarify something:
encode is meant to transform a Unicode string into some bytes:
u'é'.encode('utf-8') == b'\xc3\xa9'
decode is meant to transform some bytes into a Unicode string:
b'\xc3\xa9'.decode('utf-8') == u'é'
Encode vs Decode
Python 2
Counting the length of an ASCII string is easy: count the number of bytes!
But it's much harder with Unicode strings.
Python 2 tries hard to get you a correct answer.
Let's go back to our example: 😎. Its code point is 128526.
1. String length
Python 2 comes in several flavors, two of which are related to Unicode: it's either a narrow build or a wide build. This basically changes how Python stores its strings.
For code points <= 65535, everything works the same: Python stores each character separately, as exactly one character.
For code points > 65535, it differs. The wide build character size is big enough for all Unicode code points. But the narrow build character size is not big enough for code points > 65535, so it stores higher code points as a pair of characters.
The narrow build uses less memory, but this explains why it returns 2 for len(u'😎'): Python 2 actually stores two characters.
Multiple flavors of Python 2
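A narrow build is not needed to see where the 2 comes from. The Python 3 sketch below assumes that a narrow build stores strings as 16-bit UTF-16 code units, with code points above 65535 split into a surrogate pair; narrow_len and surrogate_pair are hypothetical helpers written for this example.

```python
def narrow_len(text):
    """Length as a Python 2 narrow build would report it (UTF-16 code units)."""
    return len(text.encode("utf-16-le")) // 2

def surrogate_pair(code_point):
    """Split a code point above 0xFFFF into its two UTF-16 surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

emoji = "\U0001f60e"               # code point 128526
assert ord(emoji) == 128526
assert len(emoji) == 1             # Python 3, like a Python 2 wide build
assert narrow_len(emoji) == 2      # what a narrow build stores
assert narrow_len("\u20ac") == 1   # code points up to 65535 fit in one unit
assert surrogate_pair(0x1F60E) == (0xD83D, 0xDE0E)
```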
Remember the meaning of encode and decode?
Encode transforms a Unicode string into some bytes.
Decode transforms some bytes into a Unicode string.
2. Encoding/Decoding in Python 2
Python has always had a string type; the unicode type was added in Python 2.0.
Python 2 str is badly named, as it's basically a bag of bytes. When you display it, Python will try to decode it for you. So for ASCII-only strings, encode and decode will return the same value.
x = 'abc'
assert x.encode('ascii') == x
assert x.decode('ascii') == x
Python 2 type system
Python is a strongly typed language, meaning that Python shouldn't coerce types behind your back:
>>> '012' + 3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: cannot concatenate 'str' and 'int' objects
But it doesn't respect this property with strings. Remember that decode converts bytes into a Unicode string in Python?
x = u'é'
x.decode('utf-8')
As decode is called on a Unicode instance, which isn't bytes, Python first tries to make some bytes out of the string and actually does:
x = u'é'
x.encode('ascii').decode('utf-8')
That's why you can see a UnicodeEncodeError while trying to decode a Unicode string in Python 2.
Python 2 type coercing
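Python 3 no longer does this, but the hidden step is easy to spell out. The sketch below is a Python 3 emulation of the behavior described above, not Python 2's actual implementation; py2_style_decode is a hypothetical name.

```python
def py2_style_decode(text, encoding):
    """Mimic Python 2's unicode.decode(): implicit ASCII encode, then decode."""
    implicit_bytes = text.encode("ascii")   # the hidden step
    return implicit_bytes.decode(encoding)

# ASCII-only text survives the round trip, which hides the problem.
assert py2_style_decode("abc", "utf-8") == "abc"

# Anything else fails in the hidden encode, before decode even runs.
try:
    py2_style_decode("\u00e9", "utf-8")
    error_name = None
except UnicodeError as exc:
    error_name = type(exc).__name__

assert error_name == "UnicodeEncodeError"
```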
You can use chr to get the character of a code point:
assert chr(65) == 'A'
But it only works for code points below 256, so not for most Unicode characters!
>>> chr(8364)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(256)
For Unicode you need to use unichr:
assert unichr(8364) == u'€'
3. Python 2 chr vs unichr
Python 3 ♥ ♥ ♥ ♥
Python 3 now always stores its strings the same way and len returns the right answer no matter what:
x = '😎'
assert len(x) == 1
1. Python 3 single flavor
Python 3's biggest change was to the type system of strings.
           Byte strings  Unicode strings
Python 2   str           unicode
Python 3   bytes         str
2. Python 3 big change
Now that Python 3 has separate types for bytes and strings, we can no longer mix up encode and decode:
>>> string = ''
>>> string.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
Decoding a Unicode string never made sense anyway.
>>> data = b''
>>> data.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'encode'
So you always know what types you are dealing with.
2. Python 3 coherent type system
Unicode strings are now the norm, so Python 3 dropped the u prefix for Unicode strings and uses a b prefix for bytes instead, so you directly write:
x = '😎'
Python 3.3 reintroduced the u prefix for codebases that need to be compatible with both Python 2 and Python 3, so this also works:
x = u'😎'
2. No more u prefix
Python 3 no longer has separate chr and unichr functions, just use chr.
assert chr(65) == 'A'
assert chr(8364) == '€'
3. Python 3 chr
Pain relief tips
Thanks to the new type system, it is now easier to identify which part of the code needs to encode strings and decode bytes.
bytes     Outside world
decode    Library
unicode
Business logic
unicode
encode    Library
bytes     Outside world
1. Unicode sandwich
Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding
the output only at the end.
—Python doc on Unicode
Unicode sandwich
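A minimal sketch of the sandwich in Python 3. The three functions and the sample data are hypothetical, made up for this example; the point is only where decode and encode sit.

```python
def read_names(raw, encoding):
    """Bottom slice: decode bytes from the outside world as early as possible."""
    return raw.decode(encoding).splitlines()

def greet(names):
    """Filling: business logic that only ever sees Unicode strings."""
    return ["Bonjour " + name.title() for name in names]

def write_lines(lines, encoding):
    """Top slice: encode only when handing data back to the outside world."""
    return "\n".join(lines).encode(encoding)

# Latin-1 bytes, as declared by the (hypothetical) sender.
incoming = b"zo\xe9\nboris"
outgoing = write_lines(greet(read_names(incoming, "latin-1")), "utf-8")

assert outgoing == "Bonjour Zo\u00e9\nBonjour Boris".encode("utf-8")
```

Input and output encodings differ here, and the business logic never has to know about either of them.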
You cannot infer the encoding of bytes, so look for where it is declared:
Content-Type: text/html; charset=ISO-8859-4
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<?xml version="1.0" encoding="UTF-8"?>
# -*- coding: iso8859-1 -*-
If you really really really really need to guess the encoding, you can use chardet, but remember, it's a best-effort scenario.
2. Use declared encoding
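For the HTTP header case, the standard library can already parse the declaration. A Python 3 sketch using email.message.Message; declared_charset is a hypothetical helper, and the sample body is made up for illustration.

```python
from email.message import Message

def declared_charset(content_type, default=None):
    """Return the charset parameter of a Content-Type header value, if any."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset(default)

charset = declared_charset("text/html; charset=ISO-8859-4")
assert charset == "iso-8859-4"   # the name comes back lowercased

# Decode with the declared encoding instead of guessing.
body = b"caf\xe9"
assert body.decode(charset) == "caf\u00e9"

# No declaration: nothing to rely on, the caller has to decide.
assert declared_charset("text/html") is None
```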
encode and decode accept a second argument for error handling. By default it is set to strict, which means crash:
>>> x = u'abcé'
>>> x.encode('ascii', errors='strict')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3...
You can also use replace to replace invalid characters with ?:
assert x.encode('ascii', errors='replace') == 'abc?'
Or you can simply ignore them:
assert x.encode('ascii', errors='ignore') == 'abc'
Finally you can replace them by their XML character reference:
assert x.encode('ascii', errors='xmlcharrefreplace') == 'abc&#233;'
3. Error handling
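The same handlers apply when decoding. A Python 3 sketch of the decode side, which the slide does not show; note that backslashreplace for decoding needs Python 3.5 or later.

```python
data = b"abc\xe9"   # 'abcé' encoded as Latin-1, not valid ASCII

# strict (the default) raises.
try:
    data.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

replaced = data.decode("ascii", errors="replace")          # U+FFFD marker
ignored = data.decode("ascii", errors="ignore")            # byte dropped
escaped = data.decode("ascii", errors="backslashreplace")  # Python 3.5+

assert strict_failed
assert replaced == "abc\ufffd"
assert ignored == "abc"
assert escaped == "abc\\xe9"
```

Every handler except strict loses or alters information, so they fit best at the edges of a program, for logging or display.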
Use Unicode anytime possible.
Use Python 3.
Explicitly encode unicode and decode str in Python 2; it might solve bugs in your code and ease Python 3 conversions.
Unicode sandwich.
Never guess an encoding!
Use error handling.
Conclusion
# Python 2 only: print a run of emoji, one code point at a time
for c in range(0x1F410, 0x1F4f0):
    print (r"\U%08x" % c).decode("unicode-escape"),
Python fun
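A possible Python 3 equivalent of the snippet above, as a sketch. The print call is left commented out because the result depends on the terminal and font.

```python
# Python 3: chr() accepts any code point, no unicode-escape round trip needed.
emoji_run = "".join(chr(c) for c in range(0x1F410, 0x1F4F0))

assert len(emoji_run) == 0x1F4F0 - 0x1F410   # 224 characters
assert emoji_run[0] == "\U0001f410"

# Printing needs a terminal (and font) able to show these characters:
# print(emoji_run)
```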
Thank you!