Upload
others
View
14
Download
0
Embed Size (px)
Citation preview
U n i v e r s a l A c c e p t a n c e S t e e r i n g G r o u p
IntroductiontoUniversalAcceptanceMarkSvancarekandLuisaVilla
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 2
AboutThisDocumentPurpose
TheInternet’stechnologies,includingitsnamingcomponents,areundercontinualevolutionandchange.In
recentyears,agreatnumberofnewTLDswithASCIIcharactersandIDNtop-leveldomainshavebeenreleased
by ICANN.Examples include.nyc,.hsbc,.eco,and.. However, the response to the change in the naming
landscapehasnotbeen fast enough.Manyapplicationsand
services are not being updated to manage new TLDs. This
affectstheuserexperience.Forexample:
• Validemailaddressesarenotbeingaccepted
• Domain names are mistakenly treated as search
termsintheaddressbarofthebrowser.
Unlesssoftwarerecognizesandcanprocessthenewdomains,astateknownasUniversalAcceptance,itwillnotbepossibletoprovideaconsistentandpositiveexperienceforInternetusers.Thisdocument,therefore,
providesabroadintroductiontoUniversalAcceptancetoassistinthedevelopmentofUniversalAcceptance-
readysoftware.
TargetAudience
• SoftwareDevelopers
• ChiefTechnicalOfficers(CTOs)
• Thetechnicalcommunityingeneral
DocumentStructure
Part1
BaselineconceptsofUniversalAcceptancesuchaswhatisaDomainNameandtheDomain
NameSystem,ASCIIandUnicode,Punycode,EmailAddressInternationalization,andother
basicconcepts.
Part2 The fivecriteriaofUniversalAcceptanceaswellas thegoodpractices foreachof thesecriteria. Also contains user scenarios and nonconformance practices to UniversalAcceptance,technicalrequirementsandcurrentchallenges.
Part3 Advanced topics such as right-to-left scripts, the Bidi algorithm,Normalization and Case
Folding.
Part4 Containstheglossaryandusefulonlineresources.
Needmoreinformation?
The UASG and the community are available to provide advice to software developers and
implementersonwhatisneeded.
• Contactustoshareyourideasandsuggestionsonthetopicatinfo@uasg.tech
• JointheUniversalAcceptancediscussionathttp://tinyurl.com/ua-discuss
• Tolearnmoreabouttheeffort,visithttp://www.icann.org/universalacceptance
Manyapplicationsandservicesarenotbeingupdatedto
managethesenewTLDs.Thisaffectstheuserexperience.
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 3
Contents
Introduction.......................................................................................................................................................5
ABriefHistoryofDomainNameInternationalization...................................................................................5
TheNeedforUniversalAcceptance...............................................................................................................5
Part1:BaselineConceptsofUniversalAcceptance...........................................................................................6
DomainName............................................................................................................................................6
DomainNameSystem(DNS).....................................................................................................................6
TopLevelDomains(TLDs)..........................................................................................................................6
GenericTopLevelDomains(gTLDs)..........................................................................................................7
CharacterSetsandScripts.........................................................................................................................7
ASCIIandUnicode.....................................................................................................................................7
InternationalizedDomainNames(IDNs)andPunycode...........................................................................8
Email..........................................................................................................................................................9
AddressesandEmailAddressInternationalization(EAI)...........................................................................9
DynamicLinkGeneration(Linkification)..................................................................................................10
Part2:UniversalAcceptanceinAction............................................................................................................11
FiveCriteriaofUniversalAcceptance..........................................................................................................11
UserScenarios.............................................................................................................................................12
NonconformancetoUniversalAcceptancePractices..................................................................................14
TechnicalRequirementsforUAReadiness......................................................................................................15
HighlevelRequirements..............................................................................................................................15
DeveloperConsiderations............................................................................................................................15
AGuidingPrincipleforAchievingUniversalAcceptance:Postel’sLaw...................................................16
GoodPracticesforDevelopingandUpdatingSoftwaretoAchieveUA-Readiness.................................16
AuthoritativeSourcesforDomainNames...................................................................................................22
DNSRootZone.........................................................................................................................................22
PublicSuffixList.......................................................................................................................................22
OtherChallenges..............................................................................................................................................23
General........................................................................................................................................................23
IDN-StyleEmailandWhyItIsNottheSameasEAI.....................................................................................23
LinkificationandItsChallenges....................................................................................................................24
Part3:AdvancedTopics...................................................................................................................................26
ComplexScripts...........................................................................................................................................26
RighttoLeftLanguagesandUnicodeConformance...............................................................................26
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 4
TheBidiAlgorithm...................................................................................................................................26
TheBidiRuleforDomainNames.............................................................................................................27
Joiners......................................................................................................................................................27
HomoglyphandConfusinglySimilarCharacters......................................................................................28
NormalizationandCaseFolding..................................................................................................................29
Normalization..........................................................................................................................................29
CaseFolding.............................................................................................................................................30
Part4:GlossaryandOtherResources..............................................................................................................32
Glossary.......................................................................................................................................................32
RFCs..............................................................................................................................................................34
KeyStandards..............................................................................................................................................36
OnlineResources.........................................................................................................................................37
Acknowledgements..........................................................................................................................................39
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 5
IntroductionABriefHistoryofDomainNameInternationalizationIn the 1970s, the characters available for registering domain names were limited to a subset of ASCIIcharacters(lettersa-z,digits0-9andthehyphen“-“).Sincetheearliest.comregistration,symbolics.com,in
1985, the number and characteristics of domain names have expanded to reflect the needs of the ever-
increasingglobaluseoftheInternetasacommunalresource.Today,themajorityofInternetusersarenon-
English speakers. However, the dominant language used on the Internet is English. To help with theinternationalization of the Internet, in 2003, the Internet
EngineeringTaskForce(IETF)startedreleasingstandardsproviding
technical guidelines for the deployment of InternationalizedDomainNames(IDN)throughatranslationmechanismtosupport
non-ASCII representations of domain names in geographically
diverselocalscripts(e.g., - . , ua-test. ,etc.).
The Board of Directors of the Internet Corporation for Assigned
NamesandNumbers (ICANN)approvedtheprocessto introduce
new IDN Country Code Top Domain Names (ccTLDs) in October
2009,withthefirstIDNccTLDsaddedtotherootzoneinMay2010.
InJune2011,theBoardapprovedandauthorizedthelaunchofthe
new Generic Top Level Domain (gTLD) program, which included
newASCII aswell as IDNTLDs. The first batchof TLDs from this
programwasadded to the root zone in2013.Theadditionof IDNccTLDsandnewTLDshasdramatically
increasedthepaceatwhichTLDsareaddedtotherootzone.
AdecadeaftertheIETFreleaseditsIDN-relatedguidelines,andthankstotheICANNNewTLDProgram,more
thanonethousandnewTLDshavenowbeenreleased.Inspiteofalltheseefforts,however,muchsoftware
and many applications are still not Universal Acceptance-ready. This causes problems to Internet users,
includingthosewhoselanguagesarewritteninscriptsthatincludenon-ASCIIcharacters.
TheNeedforUniversalAcceptanceTokeeppacewiththisnewTLDworld,newsoftwaremustbebuiltandoldsoftwareandapplicationsmustbe
updated.ThestateofsuccessfullycomplyingwiththisnewworldofTLDsiscalledUniversalAcceptance.
UniversalAcceptanceisthestatewhereallvaliddomainnamesandemailaddressesareaccepted,validated,stored,processedanddisplayedcorrectlyandconsistentlybyallInternet-enabledapplications,devicesandsystems. Inotherwords, every validwebaddress resolves to theexpectedwebsiteandevery validemail
addressdeliversmailtotheexpecteddestination.Duetotherapidlychangingdomainnamelandscape,many
systemsdonotrecognizeorappropriatelyprocessnewdomainnames,primarilybecausetheymaybeina
non-ASCIIformat,becausethesoftwareisnotawareofthenewlyreleasedTLD,orbecausethelengthoftheir
TLDvariesinlength.Thesameistrueforemailaddressesthatincorporatethesenewextensions.
TheUniversalAcceptanceSteeringGroup(UASG),acommunity-led,industry-wideinitiativethatissupported
by ICANN, isworkingoncreatingawareness, identifyingandresolvingproblemsassociatedwithUniversal
AcceptanceofDomainNamestohelpensureaconsistentandpositiveexperienceforInternetusersglobally.
UniversalAcceptanceisthestatewhereallvaliddomain
namesandemailaddressesareaccepted,validated,stored,processedanddisplayed
correctlyandconsistentlybyallInternet-enabledapplications,
devicesandsystems.
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 6
Part1:BaselineConceptsofUniversalAcceptanceThissectioncontainsanoverviewofthebasictermsandconceptsnecessarytounderstandbeforereading
themoreadvancedsectionsofthisdocument.
DomainNameAdomainnameisadottedtextstringusedasahuman-friendlytechnicalidentifierforcomputers
andnetworksontheInternet.Forexample:
www.domain.tld
Howtoreadadomainname:
• EachdotrepresentsalevelintheDomainNameSystem(DNS)hierarchy.
• ATop-LevelDomain(TLD)isoftencalledthesuffixattheendofadomainname.
• Theindividualwordsorcharactersbetweenthedotsarecalledlabels.Forthoselanguagesorscriptsthatarewrittenfromlefttoright(LTR),1thelabelfurthestrightrepresentsthetop-leveldomain.
• Thesecondlabelfromtheendrepresentsthesecond-leveldomain.
• Any labels thatcomebefore thesecond-leveldomainareconsideredsubdomainsof thesecond-leveldomain(sometimescalledthird-leveldomains).
DomainNameSystem(DNS)EachresourceontheInternetisassignedanaddresstobeusedbytheInternetProtocol(IP).Since
IP addresses are difficult to remember, the DNS provides amapping between IP addresses and
human-readable domain names. Servers collectively providing a public DNS exist at well-known
addressesontheInternet.
TopLevelDomains(TLDs)Humanreadabledomainnamesaremanagedbyorganizationsknownasregistries.Whenadomain
name is registered, it consists ofmultiple text strings representingmultiple domain levels, each
separatedbya“.”character.InLTRscripts,theright-mostdomainlevelisthetop-leveldomain(TLD).
SomeTLDsaredelegated to specific countriesor territories.ThesearecalledCountryCodeTLDs(ccTDs).
1Languagesorscriptswrittenfromrighttoleft(RTL)willbediscussedlaterinthisdocument.
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 7
GenericTopLevelDomains(gTLDs)Starting in 2013, ICANN (the organization responsible for the creation and maintenance of TLD
assignments) has approved the creation of a large number of new TLDs. These new TLDs can
represent brands, communities of interest, geographic communities (cities, regions) and more
genericconcepts.Collectively,allofthesenewTLDsareknownasGenericTopLevelDomains(gTLDs).
CharacterSetsandScriptsLanguagesarewrittenusingwritingsystems.Mostwritingsystemsuseonescript,whichisasetof
graphiccharactersusedforthewrittenformofoneormorelanguages.Asmallnumberofwriting
systemsemploymorethanonescriptatthesametime.Thesecharactersorscriptscanberecognized
byhumans.However,theyarenotusefultocomputers. Instead,acomputerneedsascripttobe
encodedinawaythatitcanprocess(forexample,toresolveawebaddress).Themechanismforthis
iscalledacharactermappingorcodedcharacterset(CCS),oracodepage.2Acharactermapping
associatescharacterswithspecificnumbers.Manydifferentcodepageshavebeencreatedovertime
fordifferentpurposes,butforthistopicwewillfocusononlytwo:ASCIIandUnicode.
ASCIIandUnicodeIntheexamplesofTLDsabove,allofthetextstringsarerepresentedusingtheLatincharacterset.
ThischaractersetisincludedintheAmericanStandardCodeforInformationInterchange(ASCII,or
US-ASCII) character-encoding scheme. ASCII is an older encoding scheme andwas based on the
Englishlanguage.Forhistoricalreasons,itbecamethestandardcharacterencodingschemeonthe
Internet.ASCIIusesonly7bitspercharacter,whichlimitsthesetto128characters,notallofwhich
canbeusedindomainnames.DomainnamesarelimitedtothecharactersA-Z,thenumbers0-9,
andhyphen“-“.
2TherearesubtletiestothetermsthatarenotdirectlyrelevanttothetopicofUniversalAcceptance.Ifyou
are interested in more information about the terminology, a useful starting point is:
https://tools.ietf.org/html/rfc6365
ExamplesofcommonTLDs ExamplesofccTLDs ExamplesofnewgTLDs
.com
.gov
.info
.org
China=.cn Germany=.deUnitedStates=.us
.app
.lawyer
.shopping
.panasonic
.osaka
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 8
ASCII-ISO8859-1(Latin-1)Table3
BecausemostwritingsystemsdonotusetheLatincharacterset,alternateencodingshavealsobeen
adopted.Unicode,alsoknownastheUniversalCodedCharacterSet(UCS),iscapableofencodingmorethan1millioncharacters.EachoftheseUnicodecharactersisacalledacodepoint.Themost
commonformofUnicodeiscalledUniversalCodedCharacterSetTransformFormat8-bit(UTF-8).
ToseeallUnicodecharactercodecharts,goto:http://unicode.org/charts
InternationalizedDomainNames(IDNs)andPunycodeTheuseofUnicodeenablesdomainnamestocontainnon-ASCIIcharacters.Asnotedearlierinthis
document,domainnamesthatusenon-ASCIIcharactersarecalledInternationalizedDomainNames
(IDNs).4Theinternationalizedportionofadomainnamecanbeinanylevel–notjusttheTLDbut
alsotheotherlabels.
SincetheDNSitselfpreviouslyonlyusedASCII,5itwasnecessarytocreateanadditionalencodingto
allownon-ASCIIUnicodecodepointstobeconvertedintoASCIIstrings,andviceversa.Thealgorithm
thatimplementsthisUnicode-to-ASCIIencodingiscalledPunycode;theoutputstringsarecalledA-Labels.A-LabelscanbedistinguishedfromanordinaryASCIIlabelbecausetheyalwaysstartwiththe
followingfourcharacters:
• xn--
ThesecharactersarecalledtheACEprefix.6
ThePunycode transformation is reversible: it can transformfromUnicode toanA-Labelandalso
fromanA-labelbacktoUnicode(knownasaU-Label).
TheonlyRFC-defined7useofthePunycodealgorithmis forexpressing internationalizeddomains.
However, rather than implement Unicode, some developers choose to apply Punycode to other
scenarios.
3Source:CaliforniaStateUniversity.1997.ASCII-ISO8859-1(Latin-1)TablewithHTMLEntityNames.http://web.calstatela.edu/faculty/jchen13/Docs/CS120/Lectures/ASCIITable_with_HTML_Entity_Names.ht
m4Notethatnoteverynon-ASCIIcharacterisanIDN.
5Forcurrentstatus,seehttp://tools.ietf.org/html/rfc6055#section-3
6ASCIICompatibleEncoding(ACE)prefixisusedtodistinguishPunycode-encodedlabelsfromordinaryASCII
labels.7RFC:RequestforComments.SeetheGlossaryofterminPart4ofthisdocumentformoreinformation.
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 9
AddressesandEmailAddressInternationalization(EAI)Emailaddressescontaintwoparts:
1. Alocalpart(theusername,beforethe“@” character)
2. Adomain(afterthe“@”character)
ThedomainpartcancontainanyTLD,includinganewTLD.BothportionsmaybeUnicodeU-labels.
NOTE:Anadditionalformat,IDN-StyleEmailAddresses,willbediscussedbelow.
EmailAddressInternationalization(EAI)requirestheuseofUnicodeinallpartsoftheemailaddress.
EachoftheexamplesabovecouldbeexpressedasEAI,andthisisthepreferredformat.
Examplesof(imaginary)IDNs
example. (Punycode encoding = example.xn--q9jyb4c)
.info (Punycode encoding = xn--uesx7b.info)
. � (Punycode encoding = xn--q9jyb4c.xn--uesx7b)
Tolearnmore,seetheIDNFAQ:http://unicode.org/faq/idn.html
Examplesof(imaginary)EmailAddressesincludingIDNs
user@example.
user@ .info
@example.lawyer
(UsesinternationalizedTLD)
(Usesinternationalized2ndleveldomain)
(UsesinternationalizedusernameandnewgTLD)
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 10
DynamicLinkGeneration(Linkification)Modernsoftware,suchaspopularwordprocessingorspreadsheetapplications,sometimesallowsa
usertocreateahyperlinksimplybytypinginastringthatlookslikeawebaddress,emailaddressor
networkpath.Forexample,typing“www.icann.org”intoanemailmessagemayresultinaclickable
linktohttp://www.icann.org beingautomaticallycreatediftheapptreats“www.”asaspecial
prefixor“.org”asaspecialsuffix.
Linkificationshouldworkconsistentlyforallwell-formedwebaddresses,emailaddressesornetwork
paths.
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 11
Part2:UniversalAcceptanceinActionFiveCriteriaofUniversalAcceptanceAs described in the section, Universal Acceptance is the state where all valid domain names and email
addressesareaccepted,validated,stored,processedanddisplayedcorrectlyandconsistentlybyallInternet-enabledapplications,devicesandsystems.Thesefivecriteriaaredescribedbelow.
1.Accept8
Acceptingoccurswheneveranemailaddressoradomainnameisreceivedasastringofcharactersfromauserinterface,fromafile,orfromanAPIusedbyasoftwareapplicationoronlineservice.
Applicationsandservicesallowdomainnamesandemailaddressestobe:
• Enteredintouserinterfaces,AND/OR • ReceivedfromotherapplicationsandservicesviaAPIs
2.Validate9
Validationmayoccur inmanyplaceswheneveranemailaddressoradomainnameiseitherreceivedoremittedasastringofcharactersbyanapplicationoronlineservice.
Validationisintendedtoensurethattheenteredinformationiseithervalidorat
least definitely not invalid. In other words, validation ensures the syntax
correctnessofthegiveninformation.
Fordomainnamesandemailaddresses,manyprogrammershavebeenusingsome
heuristics (for example, checking that a TLD has the “correct” number of
characters,orthatthecharactersarefromtheASCIIcharacterset).However,these
heuristicsarenolongerapplicablebecausetheInternetischanging:
• DomainnamesandemailaddressescannowincludeUnicode(non-ASCII)
characters
• ThelistofTLDsisgrowing
• ATLDcanbeupto63characterslong
3.Store
The Storage process occurswhenever an email address or a domain name isstoredasastringofcharactersinadatabaseorfileusedbyasoftwareapplicationoronlineservice.
Applications and services might require long-term and/or transient storage of
domainnamesandemailaddresses.Regardlessofthelifetimeofthedata,itmust
bestoredin:
• RFC-definedformats,OR
• AlternateformatsthatcanbeeasilytranslatedtoandfromRFC-defined
formats(thisismuchlessdesirable)
8AcceptingistreatedasdistinctfromValidatinginthisdocument.Inpractice,theabilitiesmayoverlap.
9AcceptingandProcessingaretreatedasdistinctfromValidatinginthisdocument.Inpractice,theabilities
mayoverlap.
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 12
Although RFCs require the use ofUTF-8, other formatsmay be encountered in
legacycode.Seethe“GoodPractices”sectionbelow.
4.Process10
Processingoccurswheneveranemailaddressoradomainnameisusedbyanapplicationorservicetoperformanactivity(forexample,searchingorsortingalist),ortransformedintoanalternateformat(suchasstoringASCIIasUnicode).
Processingmeansusingdomainnamesandemailstringsinafeature.Additional
validationmayoccurduringprocessing.Thereisnolimittothenumberofways
thatdomainnamesandemailaddressescouldbeprocessed(examples:“Identify
allthepeopleassociatedwithNewZealandbecausetheyhaveanamewitha.nzccTLD”; “Identify all the pharmacists because they have a
[email protected] email address”; “Identify firewalls that might filter
DNSrequeststhatdon’tapplytotheirpolicies”).
5.Display
TheDisplay process occurs whenever an email address or a domain name isrenderedwithinauserinterface.
Displayingdomainnamesandemailaddressesisusuallystraightforwardwhenthe
scriptsusedaresupportedintheunderlyingoperatingsystemandwhenthestrings
are stored in Unicode. If these conditions are not met, application-specific
transformationsmayberequired.
UserScenariosThe examples and definitions above may give the impression that Universal Acceptance is only about
computersystemsandonlineservices.Thereality,however, is that it’salsoabout thepeopleusing those
systemsandservices.
BelowaresomeexamplesofactivitiesthatrequireUniversalAcceptance:
RegisteringanewTLD
An organization adopts a “brand” TLD to offer its customers a differentiated
customerexperiencebyprovidingemailaddressesintheformat,[email protected].
UniversalAcceptancemeans:
• Web apps accept these new “@example.brand” email addresses as
validastheywouldwithTLDssuchas.com,.net,.org.
AccessingagTLD Auseraccessesawebsite,whosedomainnamecontainsanewTLD,bytypingan
addressintoabrowserorclickingalinkinadocument.
UniversalAcceptancemeans:
• EventhoughtheTLDisnew,anybrowsertheuserwishestousedisplays
the web address in its native form and accesses the site as the user
10ProcessingistreatedasdistinctfromValidatinginthisdocument.Inpractice,theabilitiesmayoverlap.
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 13
expects.ThebrowserdoesnotdisplayPunycodedtexttotheuserunless
itbenefitstheuserinsomeway.
UsinganemailaddresscontaininganewgTLDasanonlineidentity
AuseracquiresanemailaddresswiththedomainportionusinganewgTLD,and
usesthisemailaddressastheiridentityforaccessingtheirbankandairlineloyalty
accounts.
UniversalAcceptancemeans:
• Eventhoughthedomainusedintheemailaddress isnew,thebankor
airline siteaccepts theaddressexactlyas if itwereanestablishedTLD
suchas.bizor.eu.
AccessinganIDN
AuseraccessesanIDNURL,bytypinganaddressintoabrowserorclickingalink
inadocument.
UniversalAcceptancemeans:
• Evenifthedomainnamecontainscharactersdifferentthanthelanguage
settingsontheuser’scomputer,anybrowsertheuserwishestousewill
displaythewebaddressasexpectedandaccessthesitesuccessfully.
Usinganinternationalizedemailaddressforemail
A user has acquiredmultiple email addresses, some are internationalized (e.g.
īnfo@ - . ).
UniversalAcceptancemeans:
• Theusercansendtoandreceivefromanyemailaddressandusingany
emailclient.
Usinganinternationalizedemailaddressasanonlineidentity
AuseracquiresanEAIemailaddress,andusesthisemailaddressastheiridentity
foraccessingtheirbankandairlineloyaltyaccounts.
UniversalAcceptancemeans:
• ThebankorairlinesiteacceptstheEAIidentityexactlyasifitwereany
otheremailidentity.
DynamicallycreatingaHyperlinkinanApplication
Ausertypesawebaddressintoadocumentoremailmessage.
UniversalAcceptancemeans:
• Therulesusedbytheapplicationtoautomaticallygenerateahyperlink
arethesameeveniftheaddressisanEAIorcontainsanewTLD.
DevelopinganApplication
Adeveloperwritesanappthataccesseswebresources.
UniversalAcceptancemeans:
• ThetoolsusedbythedevelopersincludelibrariesthatenableUniversal
AcceptancebysupportingUnicode,IDNsandEAI.
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 14
NonconformancetoUniversalAcceptancePracticesThefollowingareconsideredtobepoorpractice:
✖DisplayingPunycodedtexttotheuserwithoutacorrespondinguserbenefit.
Forexample,toshowthemappingbetweenaU-labelandaA-label.
✖RequiringausertoenterPunycodedtextwhensigningupforanewemailaddressorrequiringauser
toenterPunycodedtextwhensigningupforanewhosteddomain.
✖Validatingthesyntaxofdomainnameoremailaddressusingoutofdatecriteriaornon-authoritative
onlinedomainnameresources.
✖ UsinganoutdatedlistofTLDseventhoughnewTLDsareregularlybeingadded.
✖ ExposinginternaluseofPunycodedtexttousers.
Forexample,convertingfromEAItoanIDN-styleemailaddresswhenreplyingtoanEAIuser.
✖ Treatingsomedomainnamesassearchtermsratherthanasdomainnamesbecausetheapplication
doesnotrecognizethemassuch.
✖ SettingspamblockerstoautomaticallyblockentireTLDs.
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 15
TechnicalRequirementsforUAReadinessHighlevelRequirementsAnapplicationorservicethatsupportsuniversalacceptance(UA):
1. Supportsalldomainnamesregardlessoflengthorcharacterset.
SeeRFC5892.
2. Allowsmultiplecharactersetsthatarevalidfordomainnamesandemailaddresses.
Thatis,permitsUnicodecodepoints.
3. CancorrectlyrenderallcodepointsinUnicodestrings.
SeeRFC3490.
4. Cancorrectlyrenderright-to-left(RTL)stringssuchasthoseinArabicandHebrew.
ForinformationaboutRTLscripts,seeRFC5893.
5. CancommunicatedatabetweenapplicationsandservicesinformatsthatsupportUnicodeandareconvertibleto/fromUTF-8.
ForinformationaboutUTF-8,seeRFC3629.
6. OfferspublicAPIsthatsupportUnicode&UTF-8.
7. OffersprivateAPIsthatsupportUnicode&UTF-8.
PrivateAPIsapplyonlytointer-servicecallsbythesamevendor.
8. StoresuserdatainformatsthatsupportUnicodeandisconvertibleto/fromUTF-8.
Suchconversionswouldbevisibleonlytotheproduct/serviceowner.
9. Supports all domain name strings in the authoritative ICANN TLD list and the community-servedPublicSuffixListregardlessoflengthorcharacterset.
Seehttps://newgtlds.icann.org/en/program-status/delegated-strings.
10. Cansendemailtoandreceivefromrecipientsregardlessofdomainnameorcharacterset.
SeeRFC6530.
11. TreatsEAIaddressesthesamewayastheirPunycodedequivalents(IDNemailformat).
DeveloperConsiderationsSincemanyexistingsoftwaresystemscontainhardcodedassumptionsaboutdomainsandemailaddresses,
codechangesmayberequiredtorecognizeIDNsandnewTLDs.Thissectiondiscusseshowdeveloperscan
incorporatecodechangesthatwillenableUniversalAcceptanceofallnewTLDs.
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 16
AGuidingPrincipleforAchievingUniversalAcceptance:Postel’sLawIn RFC 793, Jon Postel formulated the Robustness Principle, now known as Postel's Law, as animplementation guideline for the then-new TCP. In computing, the Robustness Principle is a general
designguidelineforsoftware:
"Beconservativeinwhatyoudo,beliberalinwhatyouacceptfromothers."
Inotherwords,beconservativeinwhatyousendandbeliberalinwhatyouaccept.Thisisalsoagood
approach when dealing with the vagaries of Universal Acceptance currently implemented in the
ecosystem.
GoodPracticesforDevelopingandUpdatingSoftwaretoAchieveUA-Readiness
Accept
✔
AlwaysofferUnicodeequivalents.
Usersshouldbeallowed,butnotrequired,toenterASCIICompatibleEncoded(or“Punycoded”)
text in place of its Unicode equivalent. However, Unicode should be shown by default, with
Punycodedtextonlyshowntotheuseronlywhenitprovidesabenefit.
! Don’tgenerateIDN-Styleemailaddresses,butdobeabletohandlethemifpresentedbysomeone
else’ssoftware.
✔
Anyuserinterfaceelementrequiringausertotypeadomainnameoremailaddressmustsupport
Unicode,labelsupto63characters,anddomainnamestringsupto253characters.
• SeeRFC1035.
Validate
✔
Validateonlytotheminimumextendnecessary.
Validateonly if it is required for theoperationof theapplicationor service. This is themost
reliablewaytoensurethatallvaliddomainnamesareacceptedintoyoursystems.
✔ Recognizethatsyntacticallycorrectinputsmaynotrepresentdomainnamesoremailaddresses
currentlyinuseontheInternet.
!
Ifyoumustvalidate,considerthefollowing:
• VerifytheTLDportionofadomainnameagainstanauthoritativetable.Examplesofsome
authoritativetablesthatyoucanuseare:
o http://www.internic.net/domain/root.zone
o http://www.dns.icann.org/services/authoritative-dns/index.html
o http://data.iana.org/TLD/tlds-alpha-by-domain.txt
Seealso:https://www.icann.org/en/system/files/files/sac-070-en.pdf
• QuerythedomainnameagainsttheDNS
o ConsiderusingtheGETDNSAPI(http://getdnsapi.net/)
• Requirerepeatedentryofanemailaddresstoprecludetypingerrors
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 17
• ValidatethecharactersinlabelsonlytotheextentofdeterminingthattheU-labeldoes
notcontain"DISALLOWED"codepointsorcodepointsthatareunassignedinitsversion
ofUnicode
o SeeRFC5892
• Limitvalidationoflabelsitselftoasmallnumberofwhole-labelrulesdefinedintheRFCs
o SeeRFC5894
• IfastringresemblingadomainnamecontainstheArabicfullstopcharacter“۔”(U+06D4),or the ideographic full stop character “ ” (U+002E), convert it to the full stop “.”
(U+002E).
• Doensurethattheproductorfeaturehandlesnumberscorrectly
o For example: ASCII numerals and Asian ideographic number representations
shouldallbetreatedasnumbers
Store
✔ ApplicationsandservicesshouldsupporttheappropriateUnicodestandards.
✔
InformationshouldbestoredintheUTF-8(UnicodeTransformationFormat)wheneverpossible.
SomesystemsmayrequiresupportforUTF-16aswell,butgenerallyUTF-8ispreferred.UTF-7and
UTF-32shouldbeavoided.
!
Consider all end-to-end scenarios before converting A-Labels (Punycode) to U-Labels and vice
versawhenstoring.
ItmaybedesirabletomaintainonlyU-Labelsinafileordatabase,becauseitsimplifiessearching
and sorting.However, conversionmayhave implicationswhen interoperatingwitholder, non-
Unicode-enabledapplicationsandservices.Considerstoringandindexingbothformats.
✔
Clearlymarkemailaddressesanddomainnamesduringstorageforeasieraccess.
Instanceswhereemailaddressesanddomainnameshavebeenfiledunderthe“author”fieldofa
documentor“contactinfo”inalogfilehaveledtothelossoftheoriginaladdress.
✔ Ifyoudon’tstoreinUnicode,youmustbeabletomatchstringsinmultipleformats.
Forexample,asearchforexample. shouldalsofindexample.xn--q9jyb4c.
Process
✔ EnsureallserverresponseshaveUnicodespecifiedinthecontenttype.
✔
SpecifyUnicodeinthewebserverhttpheaderanddirectlyinawebfile.
• EverywebfileshouldincludetheUTF-8charset
• Itisimportanttoensurethattheencodingisspecifiedoneveryresponse
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 18
!
Consider all end-to-end scenarios before converting A-Labels (Punycode) to U-Labels and vice
versaduringprocessing.
ItmaybedesirabletomaintainonlyU-Labelsinafileordatabaseasitsimplifiessearchingand
sorting. However, conversion may have implications when interoperating with older, non-
Unicode-enabledapplicationsandservices.Considerstoringinbothformats.
✔ Ensure that the product or feature handles sort order, searches, and collation according to
locale/languagespecifications,andthatitaddressesmultilanguagesearchingandsorting.
✖
Don’tuseURL-encodingfordomainnames:
• example. iscorrect
• example.%E3%81%BF%E3%82%93%E3%81%AA isnotcorrect
✔
SincetheUnicodestandardiscontinuallyexpanding,codepointsnotdefinedwhentheapplication
orservicewascreatedshouldbecheckedtoensuretheywillnot“break”theuserexperience.
Missing fonts in the underlying operating system may result in non-displayable characters
(frequentlythe“o”characterisusedtorepresentthese),butthissituationshouldnotresultina
fatalcrash.
✔ UsesupportedUnicode-enabledAPIs.
✔
Use the latest Internationalized Domain Names in Applications (IDNA) Protocol and Tables
documentsforIDNs:
• RFC5891
• RFC5892
✔ ProcessinUTF-8formatwhereverpossible.
✔
Upgradeapplicationsandservers/servicestogether.
IftheserverisUnicodeandclientisnon-Unicode,orviceversa,thedatawillneedtobeconvertedtoeachcodepageeverytimethedatatravelsbetweenserverandclient.
✔ Performcodereviewstoavoidbufferoverflowattacks.
Whendoingcharactertransformation,textstringsmaygroworshrinksubstantially.
Display
✔
DisplayallUnicodecodepointsthataresupportedbytheunderlyingoperatingsystem.
Ifanapplicationmaintainsitsownfontsets,comprehensiveUnicodesupportshouldbeofferedto
thecollectionoffontsavailablefromtheoperatingsystem.
✔ Whendevelopinganapporaserviceconsiderthelanguagessupportedandmakesureoperating
systemsandapplicationscoverthoselanguages.
✔ Convertnon-UnicodedatatoUnicodebeforedisplay.
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 19
For example, the end user should see “example. ” as opposed to “example.xn--q9jyb4c”. (ThisconversionisanexampleofUA-readyprocessing).
✔ DisplayUnicodebydefault.
DisplayPunycodedtexttotheuseronlywhenitprovidesabenefit.
!
Beawarethatmixed-scriptaddresseswillbecomemorecommon.
• Some Unicode characters may look the same to the human eye, but different to
computers
• Don’t assume that mixed-script strings are intended for malicious purposes, such as
phishing
• Iftheuserinterfacecallsthestringstotheuser’sattention,besurethatitdoessoina
waywhichisnotprejudicialtousersofnon-Latinscripts
LearnmoreaboutUnicodeSecurityConsiderationsat:http://unicode.org/reports/tr36
✔ UseUnicodeIDNACompatibilityProcessinginordertomatchuserexpectations.
Tolearnmore,goto:http://unicode.org/reports/tr46
✔ Beawareofunassignedanddisallowedcharactersfordomainnames.
• SeeRFC5892
Unicode
✔ UsesupportedUnicode-enabledAPIs.
✖
Don’tbuildyourownAPIsfor:
• Stringformatconversions
• Determiningwhichscriptcomprisesastring
• Determiningifastringcontainsamixofscripts
• Unicodenormalization/decomposition
✖
Don’tuseUTF-7orUTF-32.
• UTF-7 is generallynotusedasanative representationwithinapplicationsas it is very
awkwardtoprocess.DespiteitssizeadvantageoverthecombinationofUTF-8witheither
quoted-printableorbase64,theInternetMailConsortiumrecommendsagainstitsuse.
• ThemaindisadvantageofUTF-32isthatitisspaceinefficient,usingfourbytespercode
point.Non-BMPcharactersaresorareinmosttexts[citationneeded],theymayaswell
beconsiderednon-existentforsizingissues,makingUTF-32uptotwicethesizeofUTF-
16anduptofourtimesthesizeofUTF-8.
✔ UseUnicodeincookiessotheycanbereadcorrectlybyapplications.
✔ UseIDNA2008ProtocolandTablesdocuments:
• RFC5891
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 20
• RFC5892
✖ Don’tuseIDNA2003;innearlyallcasesithasbeensupersededbyIDNA2008.
✖ DonotautomaticallyassumethatexternalAPIscanconsumedatathathasbeenNFKC11converted.
!
MaintainIDNAandUnicodetablesthatareconsistentwithregardtoversions.
For example, unless the application actually executes the classification rules in the Tables
document (RFC 5892), its IDNA tables must be derived from the version of Unicode that is
supportedmoregenerallyonthesystem.Aswithregistration,thetablesdonotneedtoreflect
thelatestversionofUnicode,buttheymustbeconsistent.
! ValidatethecharactersinlabelsonlytotheextentofdeterminingthattheU-labeldoesnotcontain
“DISALLOWED”12codepointsorcodepointsthatareunassignedinitsversionofUnicode.
✔
Limitvalidationoflabelsitselftoasmallnumberofwhole-labelrules:
• Noleadingcombiningmarks
• Bidirectionalconditionsaremetifright-to-leftcharactersappear
• Any contextual rules that are associated with joiner characters (and CONTEXTJ13
charactersmoregenerally)aretested
!
Don’tuseUTF-16exceptwhereitisexplicitlyrequired(asincertainWindowsAPIs).
WhenusingUTF-16,notethat16bitscanonlycontaintherangeofcharactersfrom0x0to0xFFFF,
andadditionalcomplexityisusedtostorevaluesabovethisrange(0x10000to0x10FFFF).Thisis
doneusingpairsofcodeunitsknownassurrogates.Ifhandlingofsurrogatepairsisnotthoroughly
tested,itmayleadtotrickybugsandpotentialsecurityholes.
Linkification
✔ Ifastringresemblingadomainnamecontainstheideographicfullstopcharacter“ ”(U+3002),
doacceptitandtransformitto“.”.
General
✔ Useauthoritativeresourcestovalidatedomainnames.
Donotmakeheuristicassumptions,suchas“allTLDsare2,3,4,or6charactersinlength”.
11NFKC(NormalizationFormCompatibilityComposition):Charactersaredecomposedbycompatibility,then
recomposedbycanonicalequivalence.See:http://unicode.org/reports/tr15
12DISALLOWED:CodepointsthatshouldnotbeincludedinIDNs.See:https://tools.ietf.org/html/rfc5892
13CONTEXTJ:ContextualRuleforJoincontrols.See:https://tools.ietf.org/html/rfc5892
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 21
✔
Ensurethattheproductorfeaturehandlesnumberscorrectly.
Forexample,ASCIInumeralsandAsianideographicnumberrepresentationsshouldallbetreated
asnumbers.
!
Lookformailaddressesinunexpectedplaces:
• Artist/author/photographer/copyrightmetadata
• Fontmetadata
• DNScontactrecords
• Binaryversioninformation
• Supportinformation
• OEMcontactinformation
• Registration,feedback,andotherforms
! LookforpotentialIRI
14pathsinunexpectedplaces:
• Single-labelmachinenamesregardlessofloadedsystemcodepage
• Fully-qualifiedmachinenamesregardlessofloadedsystemcodepage
✔ UseGB18030(China)forChineselanguagesupport15inadditiontoUTF-8.
!
Restrictthecodepointsallowedwhengeneratingnewdomainnamesandemailaddresses:
All products that use email addressesmust accept internationalized email addresses, allowing
characters>U+007f.Thatis,nocharacters>U+007faredisallowed.However,anapporservice
neednotallowallofthesecharacterswhenausercreatesanewIDNoremailaddress.Useonly
thislistofallowedcharactersforIDNs:http://unicode.org/reports/tr36/idn-chars.txt
PreventingcertainIDNsoremailaddressesfrombeingcreatedinthefirstplacecanmitigatesome
likelysecurityandaccessibilityconcerns.(NOTE:Postel’sLawwouldstillrequiresoftwaretoaccept
suchstringsifpresented.)
!
BeawarethatUniversalAcceptancecannotalwaysbemeasuredthroughautomatedtestcases
alone.
Forexample,testinghowanapporprotocolhandlesnetworkresourcemaynotalwaysbepossible
and sometimes it is best to verify the compliance through functional spec review and design
review.
!
Don’tautomaticallyassumethatbecauseacomponentdoesnotdirectlycallname-resolutionAPIs,
ordirectlyuseemailaddresses,itdoesnotmeanthattheydonotaffectit.
Understandhownetworknamesareobtainedbythecomponent; it isnotalwaysthroughuser
interaction.Thefollowingaresomeexamplesonhowthecomponentcangetanetworkname:
• Grouppolicy
• LDAPquery
• Configurationfiles
14IRI:InternationalizedResourceIdentifiers.See:https://www.ietf.org/rfc/rfc3987.txt
15GB18030-2000 is a Chinese government standard that specifies an extended codepage for use in the
Chinesemarket.See:http://icu-project.org/docs/papers/unicode-gb18030-faq.html
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 22
• WindowsRegistry
• Transferredto/fromanothercomponent/feature
✔
Performcodereviewstoavoidbufferoverflowattacks.
• InUnicode,stringsmayexpandincasing:Fluß→FLUSS→fluss
• Whendoingcharacterconversion,textmaygroworshrinksubstantially
AuthoritativeSourcesforDomainNamesDNSRootZoneTherearea fewoptions for theauthoritative listofTLDs.The firstoption is theDNSrootzone itself. It is
DNSSEC-signed,sothelistisproperlyauthenticated.Youcanobtaintherootzonefromanyofthefollowing
links:
• http://www.internic.net/domain/root.zone
• http://www.dns.icann.org/services/authoritative-dns/index.html
• http://data.iana.org/TLD/tlds-alpha-by-domain.txt
PublicSuffixListThePublicSuffixList (PSL),managedbyvolunteersof theMozillaFoundation,providesanaccurate listof
domainnamesuffixes.ThislistisasetofDNSnamesorwildcardsconcatenatedwithdotsandencodedusing
UTF-8.IfyouneedtousethePSLasanauthoritativesourcefordomainnames,yoursoftwaremustregularly
receivePSLupdates.DonotbakestaticcopiesofthePSLintoyoursoftwarewithnoupdatemechanism.You
canusethelinkbelowtomakeyourappdownloadanupdatedlistperiodically.Thelistgetsupdatedonceper
dayfromGithub:
• https://publicsuffix.org/list/public_suffix_list.dat
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 23
OtherChallengesGeneral
VariableencodingofIDNs
Insomeapplications,IDNsareencoded:
• InPunycode,asperIDNA,ifthenameisidentifiedasanInternetname,
BUT
• InUTF-8, if thenameis identifiedasanameonthe localareanetwork
(“intranet”)
Mechanismtodetectandconvertcharsets
Someolderemailapplicationswereencodedinalocalcodepageanddidnothave
a set mechanism for detecting and converting charset as needed. This was
especiallytruefortheemailheader(TO,CC,BCC,Subject).
Failuretohandlenon-DNSprotocols
SomeapplicationsthatdoIDNA(forexample,IE7+)breakfornon-DNSprotocols. Thiscouldaffectaccessingresourcesusingnon-DNSprotocols.
Mechanismtomanagemultipleemailaddressesintoasingleuseridentity
Whenauserisaliasingmultipleemailaddressesitmaybetrickytomanagethese
addressesasasingleuseridentity.
Email programs can direct traffic to such aliases to the samemailbox, but the
applicationmaystillperceivetheseemailstopertaintodifferentidentities.
Tipforsoftwaredevelopers
✔
Whenallowingausertogenerateadomainnameoremailaddress,consideravoidingtheuseof
visuallyconfusingcharacterstopreventhomographattacks.Useonlythislistofallowedcharacters
forIDN:http://unicode.org/reports/tr36/idn-chars.txt
IDN-StyleEmailandWhyItIsNottheSameasEAIEAI isdefinedasusingUnicodeonly;A-Labels (Punycode)arenotallowed.Nevertheless,developershave
sometimesadaptedemailsoftwareandservicestohandleIDN-Styleemailaddressesratherthanmakeafull
conversiontoUnicode.
BecauseIDNscanbePunycodeencoded,someexistingsoftwareallowstheIDNportionofanemailaddress
toberepresented inASCIIorUnicode.Forexample,somesoftwarewill treatthesetwo“IDN-Styleemail”
addressesequivalentlyforallpurposes(sending,receiving,andsearching):
However,somesoftwarewillnotrobustlytreattheseaddressesasequivalent,eventhougharebothvalid,
because there isno requirement for software toprocessanA-label (i.e. “xn--q9jyb4c”) into itsU-labelequivalent (i.e. “ ”) before comparing. This can result in unpredictable user experience. The user
NotallsoftwarewilltreatthesetwoIDN-Styleemailsasfunctionallyequivalent
user@example. = [email protected]
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 24
experience may become especially confusing if some software converts U-labels into A-labels for
“compatibility”;asmessagesarereplied-toorforwarded,theaddresseswhicharevisiblydifferenttoauser,
orwhichfailtosearchandsortasexpected,mayincrease.
Intheexamplebelow,somesoftwaremayattempttoconverteventhelocalpartoftheemailaddressusing
Punycode,creatingsomethingthatlookslikeanA-LABELinthelocalpartoftheaddress.Thisisnotallowed
under the existingRFCs, and is very likely to result in failures to receive email by certain systems and to
generatesearchingandsortingdifficultiesasexplainedabove.
RobustUA-readysoftwareandservicesmaybeabletohandleandtreatalltheseformatsidentically,even
thosewhich are not RFC-compliant.Nevertheless,UA-ready software should not generate true EAI email
addressesonly.
LinkificationandItsChallengesModernsoftwaresometimesallowsausertoautomaticallycreateahyperlinksimplybytypinginastringthat
lookslikeawebaddress,emailnameornetworkpath.Forexample,typing“www.icann.org”intoanemail
message may result in a clickable link to http://www.icann.org being automatically created if the
applicationtreats“www.”asaspecialprefixor“.org”asaspecialsuffix.
Linkificationshouldworkconsistentlyforallwell-formedwebaddresses,emailnamesornetworkpaths.
Linkificationistheactionwhereanapplicationacceptsastringanddynamicallydetermineswhetheritshould
createahyperlinktoanInternetLocation(URL)oranemailaddress(mailto:)
Linkificationusesalgorithmsandrulescreatedbysoftwaredeveloperstodeterminewhether astringshould
be deemed a link – or not. Related to this is how people can identify a string as a domain name.While
browsers,emailclientsandwordprocessorsareobviousplaces,therearemanymoreapplicationsthatmake
thesedecisions.
Good Practice Recommendations
1. Attempttolinkifybasedonexplicitprotocolprefixes(e.g.“http://”,ftp://”,“mailto:”)butonlycomplete
theactioniftherestofthestringiswellformed
ExampleString ExpectedBehavior/Result
example.com Nolinkificationbecauseprotocolisabsentandnotinferred.
http://example.com Createhyperlinkbecauseprotocolisexplicit
http:example.com Nolinkificationbecauseofbadsyntax(missing//)
NeverconvertthelocalpartofanemailaddressusingPunycode
✔ 戶@example.
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 25
ExampleString ExpectedBehavior/Result
http://example.a NolinkificationbecauseICANNPoliciesrequireTLDtobeatleasttwo
characters. NB: This syntax could be supported within an internal
network.
http://example..ab Nolinkificationbecauseofbadsyntax(consecutivedots)
http:// - . Createhyperlinkbecauseprotocolisexplicit.
2. Attempttolinkifybasedonimplicitprotocolprefixes(e.g.“www”infers“http://www”)
ExampleString ExpectedBehavior/Result
www.example.com Createhyperlinkbecauseprotocolisimplied16
[email protected] Createmailto:[email protected].
3. MaptheIdeographicFullStop“”(U+3002)toFullStop“.”(U+002E)(e.g.http:// comàhttp://
.com)ifstringisotherwisewellformed.
4. IfTLDsareusedasa‘specialsuffix’todeterminelinkability,thenallTLDsmustbeincluded.Alistof
validTLDsshouldbeupdateddynamicallyonafrequentbasis.
16Note:itmightbethecasethattheactualwebsiterequiresthatenduserstypehttps://insteadofhttp://.If
thisisthecase,thenthehyperlinkmaynotresolveormayreturnanerrorpage.
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 26
Part3:AdvancedTopicsComplexScriptsThedetailsofcomplexscriptsmaynotbeofinteresttothosewhoarenotdeveloperscreatingtheirownstring
parsing libraries. Nevertheless, a summary is included here to ensure that all readers have sufficient
awarenesstorecognizecodebugsrelatedtothesescriptswhenencounteredinuserexperiences.
RighttoLeftLanguagesandUnicodeConformanceMostscriptsdisplaycharactersfromlefttorightwhentextispresentedinhorizontallines.However,
therearealso several scripts, suchasArabicorHebrew,where theorderingofhorizontal text in
displayisfromrighttoleft.Thetextcanalsobebidirectional(lefttoright–righttoleft)whenaright-
to-leftscriptusesdigitsthatarewrittenfromlefttorightorwhen itusesembeddedwordsfrom
Englishorotherscripts.
Challengesandambiguitiescanoccurwhenthehorizontaldirectionofthetext isnotuniform.To
solvethisissue,thereisanalgorithmtodeterminethedirectionalityforbidirectionalUnicodetext.
Thereisasetofrulesthatshouldbeappliedbytheapplicationtoproducethecorrectorderatthe
timeofdisplaywhicharedescribedbytheUnicodeBidirectionalAlgorithm.Wegenerallyreferto
thisasthe“Bidialgorithm”.
TheBidiAlgorithmTheBidialgorithmdescribeshowsoftwareshouldprocesstextthatcontainsbothleft-to-right(LTR)
and right-to-left (RTL) sequences of characters. Thebase direction17 assigned to the phrasewilldeterminetheorderinwhichtextisdisplayed.
Toknowifasequenceis left-to-rightorright-to-left,eachcharacter inUnicodehasanassociated
directionalproperty.Mostlettersarestronglytyped(strongcharacters)asLTR(left-to-right).Lettersfromright-to-leftscriptsarestronglytypedasRTL(right-to-left).Asequenceofstrongly-typedRTL
characterswillbedisplayedfromrighttoleft.Thisisindependentofthesurroundingbasedirection.
Forexample:
(LTR)example-مثال(RTL).
Textwithdifferentdirectionalitycanbemixedinline.Insuchcases,theBidialgorithmproducesa
separatedirectionalrunoutofeachsequenceofcontiguouscharacterswiththesamedirectionality.
SpacesandpunctuationarenotstronglytypedaseitherLTRorRTLinUnicodebecausetheymaybe
used in either type of script. They are therefore classified asneutral orweak characters.Weak
charactersarethosewithvaguedirectionality.Examplesofthistypeofcharacterinclude:
• Europeandigits
• EasternArabic-Indicdigits
• Arithmeticsymbols,andcurrencysymbols
• Punctuationsymbolsthatarecommontomanyscripts,suchasthecolon,comma,full-stop,
andtheno-break-space
Thedirectionalityofneutralcharactersisindeterminatewithoutcontext.Someexamplesinclude:
17InHTMLthebasedirectioniseitherinheritedfromthedefaultdirectionofthedocument,whichisleft-to-
right,orexplicitlysetbythenearestparentelementthatusesthedirattribute.
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 27
• Tabs
• Paragraphseparators
• Mostotherwhitespacecharacters
Whenaneutralcharacterisbetweentwostronglytypedcharactersthathavethesamedirectional
type, it will also assume that directionality. For example, a neutral character between two RTLcharacters will be treated as a RTL character itself, and will have the effect of extending the
directionalrun:
نطاق.مثال •
Evenifthereareseveralneutralcharactersbetweenthetwostronglytypedcharacters,theywillall
betreatedinthesameway.
When a space or punctuation falls between two strongly typed characters that have different
directionality, the neutral character (or characters) will be treated as if they have the same
directionalityastheprevailingbasedirection.Forexample:
• example. مثال
Unlessadirectionaloverrideispresentnumbersarealwaysencoded(andentered)big-endian18,andthenumeralsrenderedLTR.Theweakdirectionalityonlyappliestotheplacementofthenumberin
itsentirety.
ToseetheBidialgorithmindetail,goto:http://unicode.org/reports/tr9/tr9-11.html
TheBidiRuleforDomainNamesABididomainnameisonethatcontainsatleastoneRTLlabel.Thereisarulethatdeterminesthe
conditionstobemetforthelabelsinBididomainnames.ThisrulecanbefoundonSection2ofRFC
5893:https://tools.ietf.org/html/rfc5893
JoinersSomelanguagesusealphabeticscriptsinwhichsinglephonemesarewrittenusingtwocharacters
calledadigraph.Inotherwords,adigraphisagroupoftwosuccessivelettersthatrepresentasinglesound(orphoneme).
Somedigraphsarefullyjoinedasligatures.Inwritingandtypography,aligaturehappenswheretwoormoregraphemesorlettersarejoinedasasingleglyph.Anexampleistheampersandcharacter
(&),whichevolvedfromtheadjoinedLatinletterseandt(“et”means“and”).
18“Big-endianandlittle-endianaretermsthatdescribetheorderinwhichasequenceofbytesarestoredin
computermemory.Big-endianisanorderinwhichthe‘bigend’(mostsignificantvalueinthesequence)is
storedfirst(attheloweststorageaddress).Little-endianisanorderinwhichthe‘littleend’(leastsignificant
valueinthesequence)isstoredfirst.”
Source:http://searchnetworking.techtarget.com/definition/big-endian-and-little-endian
ExamplesofdiagraphsinEnglish
ch(asinchurch)ph(phone)
th(then)th(think)
sh(shoe)
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 28
Ifligaturesanddigraphshavethesameinterpretationinalllanguagesthatuseagivenscript,Unicode
normalizationgenerallyresolvesthedifferencesandmakesthemmatch.Whentheyhavedifferent
interpretations,matchingmustusealternativemethods,likelychosenattheregistrylevel,orusers
mustbeeducatedtounderstandthatmatchingwillnotoccur.Anexampleofdifferentinterpretation
canbefoundinSection4.3ofRFC5894:https://tools.ietf.org/html/rfc5894
TheUnicodeConsortiumliststwomainstrategiestodeterminethejoiningbehaviorofaparticular
characterafterapplyingtheBidialgorithm:
• “Whenshaping,an implementationcan referback to theoriginalbacking store to see if
therewereadjacentZWNJorZWJ19characters.
• Alternatively,theimplementationcanreplaceZWJandZWNJbyanout-of-bandcharacter
property associated with those adjacent characters, so that the information does not
interferewiththeBidialgorithmandtheinformationispreservedacrossrearrangementof
thosecharacters.OncetheBidialgorithmhasbeenapplied,thatout-of-bandinformation
canthenbeusedforpropershaping.”20
Intheabsenceofcarebyregistriesabouthowstringsthatcouldhavedifferentinterpretationsunder
IDNA2003andthecurrentspecificationarehandled,itispossiblethatthedifferencescouldbeused
asacomponentofname-matchingorname-confusionattacks.Suchcareisthereforeappropriate.
Tolearnmoreaboutjoiners,seeSection4.3ofRFC5894:https://tools.ietf.org/html/rfc5894
HomoglyphandConfusinglySimilarCharactersHomoglyphsarecharactersthat,duetosimilaritiesinsizeandshape,mightappearidenticalatfirst
glance.
Topreventconfusinglylookingdomainnamesbeingregistered,registriescanusethe“homoglyph
bundling”procedure.21
HomoglyphbundlingiswhenyouregisteranIDNandtheregistrationsystemautomaticallybundles
all the homoglyphs of that name (if there are any). Thismeans that several domain names are
bundledatonetime,andnoneoftheotherdomainnamesinthatbundlecanberegistered.
Homoglyphbundlingisagoodpracticeforregistriestoavoidpossiblephishingpracticesthatintend
totricktheuserwithvisuallyconfusingcharacters.
TolearnmoreaboutUnicodesecuritymechanismsforconfusabledetection,goto:
19TolearnmoreaboutZWNJ/ZWJ,goto:http://www.unicode.org/L2/L2005/05307-zwj-zwnj.pdf
20Source:MarkDavis,AharonLanin,AndrewGlass.2015.Unicode.http://unicode.org/reports/tr9
21https://www.icann.org/resources/pages/idn-guidelines-2011-09-02-en
Examplesofhomoglyphs
Cyrilliccharactera
Latincharactera
=
=
Unicodenumber0430
Unicodenumber0061
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 29
• http://www.unicode.org/reports/tr39/#Confusable_Detection
Toseealistofhomoglyphs,goto:
• http://homoglyphs.net
Tolearnmoreaboutconfusinglysimilarcharactersandgoodpractice,see:
• M3AAWGUnicodeAbuseOverviewandTutorial
https://www.m3aawg.org/sites/default/files/m3aawg-unicode-tutorial-2016-02.pdf
• M3AAWGBestPracticesforUnicodeAbusePrevention
https://www.m3aawg.org/sites/default/files/m3aawg-unicode-best-practices-2016-02.pdf
NormalizationandCaseFoldingNormalization
UnicodeNormalizationhelpstodeterminewhetheranytwoUnicodestringsareequivalenttoeach
other. Some characters can be represented inUnicode by several code sequences. This is called
Unicodeequivalence.Unicodeprovidestwotypesofequivalences:
• Canonical(NFD)
• Compatibility(NFK)
Sequencesrepresentingthesamecharacterarecalledcanonicallyequivalent.Thesesequenceshavethesameappearanceandmeaningwhenprintedordisplayed.Forexample:
Compatibility equivalents are sequences which can have different appearances, but in some
contextsthesamemeaning.Itisaweakertypeofequivalencebetweencharactersorsequencesof
characters.
Examplesofcanonicallyequivalentcharacters
U+006E (Latin lowercase “n”) followed by U+0303 (the
combiningtilde“◌̃”)
= ñ
U+00F1(lowercaseletter“ñ”oftheSpanishalphabet) = ñ
Examplesofcompatibilityequivalentcharacters
U+FB00(thetypographicligature“ff”) = ff
U+0066U+0066(twoLatin“f”letters) = ff
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 30
In the example above, the code point U+FB00 is defined to be compatible, but not canonically
equivalent to the sequence U+0066 U+0066. Sequences that are canonically equivalent are also
compatible,buttheoppositeisnotnecessarilytrue.
To avoid interoperability problems arising from the use of canonically equivalent, yet different,
charactersequences,theW3CrecommendsusingNormalizationFormC22forallcontent.
To see a list of all characters that may change in any of the Normalization Forms, go to:
http://www.unicode.org/charts/normalization
Someotherpointstonote:
• OnlystringsNOTtransformedbyNFKC23arevalid.
• WhentwoapplicationsshareUnicodedata,butnormalizethemdifferently,errorsanddata
losscanoccur.
• NormalizationFormsmustremainstableovertime.Inotherwords,astringmustremain
normalizedunderallfutureversionsofUnicode(backwardcompatibility).
Tipforsoftwaredevelopers
✖
Don’tnormalizebyconvertingtouppercase,orignoringnon-spacingcharacters,becausethismay
alsomakesorting,datacopy,dataimportandexport,dataretrievalbyclientapplicationsrather
difficultandmayresultindatalossorcorruption.
TolearnmoreaboutNormalizationFormsgoto:http://www.unicode.org/reports/tr15
CaseFoldingCasefoldingistheprocessofmakingtwotexts,whichdifferincasebutareotherwise“thesame”,
identical.Mapping[a-z]to[A-Z]worksformostsimpleASCII-onlytextdocuments.However,itbegins
tobreakdownwithlanguagesthatuseadditionalcharacters.
UnicodedefinesthedefaultcasefoldmappingforeachUnicodecodepoint.Therearecommonandfullcasefoldmappings:
• Commonfoldmappingsarethosethathaveasimple,straightforwardmappingtoasingle
matching(mainlylowercase)codepoint
• FullfoldmappingsarethosethatwouldnormallyrequiremorethanoneUnicodecharacter
One importantconsideration,accordingto theW3C,24 iswhether thevaluesarerestrictedto the
ASCIIsubsetofUnicodeorifthevocabularypermitstheuseofcharacters(suchasaccentsonLatin
lettersorabroadrangeofUnicodeincludingnon-Latinscripts)thatpotentiallyhavemorecomplex
casefoldingrequirements.25
22NFC:CanonicalDecomposition,followedbyCanonicalComposition.
23NFKC:CompatibilityDecomposition,followedbyCanonicalComposition.
24W3C:TheWorldWideWebConsortium(W3C)isaninternationalcommunitywhereMemberorganizations,
afull-timestaffandthepublicworktogethertodevelopWebstandards.See:https://www.w3.org
25 Source: A Phillips. 2015. Character Model for the World Wide Web: String Matching and Searching.
https://www.w3.org/TR/charmod-norm
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 31
Tipforsoftwaredevelopers
✔ ConsiderUnicodeNormalizationinadditiontocasefolding.
TolearnmoreaboutUnicodenormalization,see:
• http://www.w3.org/TR/charmod-norm
• http://unicode.org/reports/tr15
Forrecommendationsaboutcasefolding,goto:
• https://www.w3.org/International/wiki/Case_folding
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 32
Part4:GlossaryandOtherResourcesGlossary
A-label The ASCII-compatible encoded (ACE) representation of an internationalized
domainname,e.g.how it is transmitted internallywithin theDNSprotocol.A-
labelsalwayscommencewiththeprefix“xn--”.ContrastwithU-label.
ACEprefix ASCIICompatibleEncodingPrefix.
ASCIICharacters AmericanStandardCodeforInformationInterchange.Thesearecharactersfrom
thebasicLatinalphabettogetherwiththeEuropean-Arabicdigits.Thesearealso
includedinthebroaderrangeof"Unicodecharacters"thatprovidesthebasisfor
IDNs.
API AnApplicationProgramming Interface (API) is a setof routines,protocols, and
tools for building software and applications. An API may be for a web based
system,operatingsystem,ordatabasesystem,anditprovidesfacilitiestodevelop
applicationsforthatsystemusingagivenprogramminglanguage.
Codespace Rangethatdefinethelowerandupperboundsforanencoding.
CodePoints Acodepointorcodepositionisanyofthenumericalvaluesthatmakeupthecode
space. They are used to distinguish both, the number from an encoding as a
sequence of bits, and the abstract character from a particular graphical
representation(glyph).
DNSRootZone TherootzoneisthecentraldirectoryfortheDNS,whichisakeycomponentin
translatingreadablehostnamesintonumericIPaddresses.
EAI EmailAddress Internationalization is anemail address that requires theuseof
Unicodeinallpartsoftheemailaddress.
IANA InternetAssignedNumbersAuthority.Itsfunctionsinclude:
• MaintenanceoftheregistryoftechnicalInternetprotocolparameters
• Administrationofcertain responsibilitiesassociatedwith InternetDNS
rootzone
• AllocationofInternetnumberingresources
ICANN The Internet Corporation for Assigned Names and Numbers (ICANN) is an
internationally organized, non-profit corporation that has responsibility for
Internet Protocol (IP) address space allocation, protocol identifier assignment,
generic (gTLD) and country code (ccTLD) Top-Level Domain name system
management,androotserversystemmanagementfunctions.
IDN InternationalizedDomainNames.IDNsaredomainnamesthatincludecharacters
usedinthelocalrepresentationoflanguagesthatarenotwrittenwiththetwenty-
sixlettersofthebasicLatinalphabet“a-z”,thenumbers0-9,andthehyphen“-“.
IDNA InternationalizedDomainNamesinApplications.
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 33
IDNccTLD Country Code Top-level Domain that includes characters used in the local
representationoflanguagesthatarenorwrittenwiththetwenty-sixlettersofthe
basicLatinalphabet“a-z”.Examples:
• .рф(Russia)
(Egypt).صر •
(ArabiaSaudi).السعودیة •
IETF The Internet Engineering Task Force (IETF) is a large open international
communityofnetworkdesigners,operators,vendors,andresearchersconcerned
withtheevolutionoftheInternetarchitectureandthesmoothoperationofthe
Internet. It is open to any interested individual. The IETF develops Internet
StandardsandinparticularthestandardsrelatedtotheInternetProtocolSuite
(TCP/IP).
Language Themethodofhumancommunication,eitherspokenorwritten,consistingofthe
useofwordsinastructuredandconventionalway.
Punycode ItisanalgorithmtorepresentUnicodewiththelimitedcharactersubsetofASCII
supportedbytheDomainNameSystem.Punycodeisintendedfortheencoding
of labels in the Internationalized Domain Names in Applications (IDNA)
framework.
Registrar Anorganizationwheredomainnamesareregisteredbyusers.Theregistrarkeeps
records of the contact information and submits the technical information to a
centraldirectoryknownasthe“registry”.
Registry Theauthoritative,masterdatabaseofalldomainnamesregistered ineachTop
LevelDomain.
RFC A Request for Comments (RFC) is a formal document from the Internet
Engineering Task Force (IETF) that is the result of committee drafting and
subsequentreviewbyinterestedparties.
Script Thecollectionoflettersorcharactersusedinwriting,representingthesoundsof
alanguage.
Second-leveldomainname
IntheDomainNameSystem(DNS)hierarchy,asecond-leveldomain(SLDor2LD)
is a domain that is directly below a top-level domain (TLD). For example, in
example.com,exampleisthesecond-leveldomainofthe.comTLD.
U-label A"U-label" isan IDNA-valid stringofUnicodecharacters includingat leastone
non-ASCIIcharacter.ConversionsbetweenU-labelsandA-labelsareperformed
accordingtothePunycodespecification[RFC3492].
UA-ready SoftwareorUA-Readiness
Universal Acceptance Ready Software. It is a software that has the ability to
Accept,Store,Process,ValidateandDisplayallTopLevelDomainsequallyandall
IDNs,hyperlinkandemailaddressesequally.
Unicode Auniversalcharacterencodingstandard.Itdefinesthewayindividualcharacters
arerepresentedintextfiles,webpages,andothertypesofdocuments.Unicode
wasdesignedtosupportcharactersfromalllanguagesaroundtheworld.Itcan
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 34
support roughly 1,000,000 characters and supports up to 4 bytes for each
character.See:http://unicode.org
UTF UnicodeTransformationFormat.ItisawayoftransformingUnicodecodepoints
intoastreamofbytes.UTF-8isthepreferredUTFforhandlingIDNandEAI.UTF-
8convertsUnicodeto8-bitbytes.
M3AAWG TheMessaging,Malware andMobile Anti-AbuseWorkingGroup (M3AAWG) is
where the industry comes together to work against botnets, malware, spam,
viruses, DoS attacks and other online exploitation. See:
https://www.m3aawg.org/
W3C TheWorldWideWebConsortium (W3C) is an international communitywhere
Memberorganizations,afull-timestaff,andthepublicworktogethertodevelop
Webstandards.See:https://www.w3.org/
ZWJ Zero-WidthJoinerisnon-printingcharacterusedinthecomputerizedtypesetting
ofsomecomplexscriptssuchastheArabicscriptoranyIndicscript.Whenplaced
betweentwocharactersthatwouldotherwisenotbeconnected,aZWJcauses
themtobeprintedintheirconnectedforms.
ZWNJ Zero-WidthNon-Joinerisanon-printingcharacterusedinthecomputerizationof
writingsystemsthatmakeuseofligatures.Whenplacedbetweentwocharacters
thatwouldotherwisebe connected intoa ligature, aZWNJcauses them tobe
printedintheirfinalandinitialforms,respectively.Thisisalsoaneffectofaspace
character, but a ZWNJ is used when it is desirable to keep the words closer
togetherortoconnectawordwithitsmorpheme.
ForacompleteICANNglossary,goto:https://www.icann.org/resources/pages/glossary-2014-02-03-en
RFCsPUNYCODERFCs
RFC3492 Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names inApplications(IDNA)
RFC3492describesPunycodeas:
"a simple and efficient transfer encoding syntax designed for use withInternationalizedDomainNamesinApplications(IDNA)"
PunycodetransformsuniquelyandreversiblyaUnicodestringintoanASCIIstring.ThisRFC
definesageneralalgorithmcalledBootstring.Thisalgorithmallowsastringofbasiccode
pointstouniquelyrepresentanystringofcodepointsdrawnfromalargerset.
https://tools.ietf.org/html/rfc3492
IDNRFCs
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 35
RFC5890
Internationalized Domain Names for Applications (IDNA): Definitions and DocumentFramework
ThisRFCdescribestheusagecontextandprotocolforarevisionofInternationalizedDomain
NamesforApplications(IDNA).
https://tools.ietf.org/html/rfc5890
RFC5891 InternationalizedDomainNamesinApplications(IDNA)Protocol
This RFC specifies the protocol mechanism, called Internationalized Domain Names in
Applications (IDNA), for registering and looking up IDNs in a way that does not require
changestotheDNSitself.
https://tools.ietf.org/html/rfc5891
RFC5892 TheUnicodePointsandInternationalizedDomainNamesforApplications(IDNA)
TheRFC5892specifiesrulesfordecidingwhetheracodepoint,consideredinisolationorin
context,isacandidateforinclusioninanInternationalizedDomainName(IDN).
https://tools.ietf.org/html/rfc5892
RFC5893 Right-to-leftscriptsforInternationalizedDomainNamesforApplications(IDNA)
This RFC provides a new Bidi rule for Internationalized Domain Names for Applications
(IDNA)labels,fortheuseofright-to-leftscriptsinInternationalizedDomainNames.
https://tools.ietf.org/html/rfc5893
RFC5894 InternationalizedDomainNames forApplications (IDNA): Background, Explanation andRationale
Thisinformationaldocumentprovidesanoverviewofarevisedsystemtodealwithnewer
versionsofUnicodeandprovidesexplanatorymaterialforitscomponents.
https://tools.ietf.org/html/rfc5894
RFC5895 MappingCharactersforInternationalizedDomainNamesinApplications(IDNA)2008
ThisRFCdescribestheactionsthatcanbetakenbyanimplementationbetweenreceiving
userinputandpassingpermittedcodepointstothenewIDNAprotocol(2008).Itdescribes
anoperationthatistobeappliedtouserinputinordertopreparethatuserinputforusein
an “on the network” protocol. It also includes a general implementation procedure for
mapping.
https://tools.ietf.org/html/rfc5895
EAIRFCs
RFC6530 OverviewandFrameworkforInternationalizedEmail
This standard introduces a series of specifications that definemechanisms and protocol
extensions needed to fully support internationalized email addresses. This document
describes how the various elements of email internationalization fit together and the
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 36
relationshipsamongtheprimaryspecificationsassociatedwithmessagetransport,header
formats,andhandling.
https://tools.ietf.org/html/rfc6530
RFC6531 SMTPExtensionforInternationalizedEmail
ThedocumentdefinesaSimpleMailTransferProtocolextensionsoserverscanadvertise
the ability to accept and process internationalized email addresses and internationalized
emailheaders.
https://tools.ietf.org/html/rfc6531
RFC6532 InternationalizedEmailHeaders
ThisdocumentspecifiesanenhancementtotheInternetMessageFormatandtoMIMEthat
allows use of Unicode inmail addresses andmost header field content. This document
specifiesanenhancement to the InternetMessageFormat (RFC5322)and toMIME that
permitsthedirectuseofUTF-8,ratherthanonlyASCII,inheaderfieldvalues,includingmail
addresses. A new media type, message/global, is defined for messages that use this
extended format.This specificationalso lifts theMIME restrictiononhavingnon-identity
content-transfer-encodings on any subtype of the message top-level type so that
message/globalpartscanbesafelytransmittedacrossexistingmailinfrastructure.
https://tools.ietf.org/html/rfc6532
RFC6533 InternationalizedDeliveryStatusandDispositionNotifications
Thisspecificationaddsanewaddresstypeforinternationalemailaddressessoanoriginal
recipient address with non-ASCII characters can be correctly preserved even after
downgrading. This also provides updated content returnmedia types for delivery status
notificationsandmessagedispositionnotificationstosupportuseofthenewaddresstype.
https://tools.ietf.org/html/rfc6533
KeyStandardsISO10646(Unicode)
Toprovideacommontechnicalbasis for theprocessingofelectronic information in
various languages, the International Organization for Standardization (ISO) has
developedaninternationalcodingstandardcalledISO10646.TheISO10646provides
a unified standard for the coding of characters in allmajor languages in theworld
includingtraditionalandsimplifiedChinesecharacters.Thislargecharactersetiscalled
theUniversalCharacterSet(UCS).ThesamesetofcharactersisdefinedbytheUnicode
standard,whichfurtherdefinesadditionalcharacterpropertiesandotherapplication
detailsofgreatinteresttoimplementers.
UnicodeisacharactercodingsystemdesignedbytheUnicodeConsortiumtosupport
theinterchange,processinganddisplayofthewrittentextsofallmajorlanguagesin
theworld. ISO 10646 andUnicode define several encoding forms of their common
repertoire:UTF-8,UCS-2,UTF-16,UCS-4andUTF-32.
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 37
http://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?csnumb
er=63182
GB18030(China)
GB18030-2000isaChinesegovernmentstandardthatspecifiesanextendedcodepage
foruseintheChinesemarketinadditiontoUTF-8.Theinternalprocessingcodeforthe
characterrepertoirecanandshouldbeUnicode;however,thestandardstipulatesthat
softwareprovidersmustguaranteeasuccessfulround-tripbetweenGB18030andthe
internalprocessingcode.AllproductscurrentlysoldortobesoldinChinamustplan
the code page migration to support GB18030 without exception. GB18030 is a
“mandatorystandard”andtheChinesegovernmentregulatesthecertificationprocess
toreinforceGB18030deployment.
http://icu-project.org/docs/papers/unicode-gb18030-faq.html
UnicodeTechnicalStandard#46:UnicodeIDNACompatibilityProcessing
Thisspecificationdefinesamappingconsistentwiththenormativerequirementsofthe
IDNA2008protocol,andwhichisascompatibleaspossiblewithIDNA2003.Forclient
software, this provides behavior that is themost consistentwith user expectations
aboutthehandlingofdomainnameswithexistingdata.
http://unicode.org/reports/tr46/
OnlineResourcesAPIs WindowsAPIs
https://www.msdn.microsoft.com/enus/library/windows/desktop/ff818516%28v=vs.
85%29.aspx
SharePointAPIs
https://msdn.microsoft.com/en-us/library/office/jj860569.aspx
PublicSuffixList
https://publicsuffix.org/list/public_suffix_list.dat
ICANNAuthoritativeTLDlist
http://data.iana.org/TLD/tlds-alpha-by-domain.txt
AndroidAPIs
http://developer.android.com/guide/index.html
MACIOSAPIs
https://developer.apple.com/library/mac/navigation
.NetFramework
https://msdn.microsoft.com/en-us/library/system.text.encoding(v=vs.110).aspx
UnicodeSecurity
UnicodeSecurityconsiderations
http://www.unicode.org/reports/tr36
Unicodesecuritymechanisms
http://www.unicode.org/reports/tr39
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 38
Unicodecharactergroupings
Unicodecodeplanes
http://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes
OverviewofGB18030
http://en.wikipedia.org/wiki/GB_18030
AuthoritativemappingtablebetweenBG18038-2000andUnicode
http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-
2000.xml
Unicodenormalization
https://en.wikipedia.org/wiki/Unicode_equivalence
Unicodeexploits
Section3.1,“UTF-8Exploits”inUnicodeTechnicalReport#36
http://unicode.org/reports/tr36/#UTF-8_Exploit
M3AAWGBestPracticesforUnicodeAbusePrevention
https://www.m3aawg.org/sites/default/files/m3aawg-unicode-best-practices-2016-
02.pdf
M3AAWGUnicodeAbuseOverviewandTutorial
https://www.m3aawg.org/sites/default/files/m3aawg-unicode-tutorial-2016-02.pdf
Seealso:
http://www.unicode.org
Miscellaneous URIs
http://tools.ietf.org/html/rfc3986
TheDomainNameSystem:ANon-TechnicalExplanation–WhyUniversalResolvability
IsImportant
http://www.internic.net/faqs/authoritative-dns.html
ICANNglossary
https://www.icann.org/resources/pages/glossary-2014-02-03-en
IntroductiontoUniversalAcceptance(UASG007)
Version8-5May2016 39
AcknowledgementsTheauthorsgratefully acknowledge the followingpeople for their contributionsand collaborationon this
document:
EleezaAgopian
GwenCarlson
EdmonChung
SamanthaDickinson
DonHollander
ChantalLebrument
AntoniettaMangiacotti
RichardMerdinger
RamMohan
DavidMorrison
CarolynNguyen
MichaelD.Palage
KurtPritz
AndréSchappo
ZhengSong
LarsSteffen
AndrewSullivan
DennisTan
WinnieYu