Unicode: The hero or villain? Input Validation of free-form Unicode text in Web Applications Pawel Krawczyk

Unicode: The hero or villain? - 2018.appsec.eu · Unicode: The hero or villain? Policy decisions Pawel Krawczyk Let’s talk about your user base for a moment… Are they expected

Unicode: The hero or villain? Input Validation of free-form Unicode text in Web Applications

Pawel Krawczyk

About

Pawel Krawczyk

In application security since the 90s - pentesting, security architecture, SSDLC, DevSecOps

Active developer - Python, C, Java - https://github.com/kravietz

OWASP - SAML, PL/SQL, authentication cheat sheets

WebCookies.org - web privacy and security scanner

Immusec.com - competitive pentesting & incident response in the UK

[email protected]

+44 7879 180015

https://www.linkedin.com/in/pawelkrawczyk

Definition of the problem

Free-form text validation

Unicode Primer

The rise and fall of the letter “Ą”

Ą

Official name: A-OGONEK

“a letter in the Polish, Kashubian, Lithuanian, Creek, Navajo, Western Apache, Chiricahua, Osage, Hocąk, Mescalero, Gwich'in, Tutchone, and Elfdalian alphabets” (Wikipedia)

ASCII: just write “Ą” as “A”

(Pol.) KĄT = (Eng.) ANGLE

(Pol.) KAT = (Eng.) HANGMAN

Contextual guessing, confusion, misunderstandings... we had lots of fun on IRC back in the 90s.

Windows-1250: “Ą” is 0xA5

ISO-8859-2: “Ą” is 0xA1
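
The code-page problem above can be reproduced in a few lines. This is a minimal sketch using Python's built-in `cp1250` and `iso-8859-2` codecs; the variable names are illustrative and not taken from the talk.

```python
# The same letter maps to a different byte in each legacy code page.
LETTER = "\u0104"  # Ą, LATIN CAPITAL LETTER A WITH OGONEK

cp1250_bytes = LETTER.encode("cp1250")      # Windows-1250
latin2_bytes = LETTER.encode("iso-8859-2")  # ISO-8859-2 (Latin-2)
print(cp1250_bytes.hex(), latin2_bytes.hex())

# Decoding with the wrong code page does not fail; it silently
# yields a different character (mojibake).
misread = cp1250_bytes.decode("iso-8859-2")
print(repr(misread))

# Plain ASCII cannot represent the letter at all.
try:
    LETTER.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print("ASCII can encode it:", ascii_ok)
```

Because every byte value is assigned in both code pages, the wrong decoder never raises an error here, which is why this class of bug went unnoticed so easily.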

“To możliwe!” (Polish for “It’s possible!”)

Source: Wikipedia

Confused beyond diacritics

Confusing Unicode

Examples: PythonHosted.org, Luciano Ramalho “Fluent Python”

Stay Calm and Unicode

Rule #1

Forget everything you’ve learned about pre-Unicode characters and strings*

*including MBCS and UCS

Ą

Character: LATIN CAPITAL LETTER A WITH OGONEK

Code point: U+0104 (the “character’s catalog number”). This is not an encoding!

Encoding is what turns the code point into bytes:

● UTF-8: 0xC4 0x84
● UTF-16 BE: 0x01 0x04
● UTF-16 LE: 0x04 0x01
● UTF-32 BE: 0x00 0x00 0x01 0x04
● UTF-32 LE: 0x04 0x01 0x00 0x00
● UTF-7: +AQQ-

A script payload encoded as UTF-7:

+ADw-script+AD4-alert(+ACc-xss+ACc-)+ADw-+AC8-script+AD4-
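
The byte sequences listed on this slide can be checked with Python's standard codecs. The sketch below is only an illustration (the `ENCODINGS` list and variable names are not from the talk); it also decodes the UTF-7 payload to show what a consumer that honours UTF-7 would see.

```python
import unicodedata

LETTER = "\u0104"  # Ą

print(unicodedata.name(LETTER))  # the character's name
print(f"U+{ord(LETTER):04X}")    # the code point (not an encoding)

# One code point, many byte representations.
ENCODINGS = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le", "utf-7"]
encoded = {name: LETTER.encode(name) for name in ENCODINGS}
for name, data in encoded.items():
    print(f"{name:10} {data!r}")

# The UTF-7 payload contains no '<' or '>' bytes, yet it decodes to markup.
payload = b"+ADw-script+AD4-alert(+ACc-xss+ACc-)+ADw-+AC8-script+AD4-"
decoded_payload = payload.decode("utf-7")
print(decoded_payload)
```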

Homoglyphs

ՍnᎥⅽоԁе

● Ս ARMENIAN CAPITAL LETTER SEH
● n LATIN SMALL LETTER N
● Ꭵ CHEROKEE LETTER V
● ⅽ SMALL ROMAN NUMERAL ONE HUNDRED
● о CYRILLIC SMALL LETTER O
● ԁ CYRILLIC SMALL LETTER KOMI DE
● е CYRILLIC SMALL LETTER IE
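
To see what is really inside a look-alike string, each character can be looked up in the Unicode database. A minimal sketch with the standard `unicodedata` module follows; the string is written with escapes so that the code points are explicit, and the variable names are illustrative.

```python
import unicodedata

# Renders much like "Unicode", but only the "n" is a Latin letter.
spoofed = "\u054dn\u13a5\u217d\u043e\u0501\u0435"

names = [unicodedata.name(ch) for ch in spoofed]
for ch, name in zip(spoofed, names):
    print(f"U+{ord(ch):04X} {name}")

# A plain string comparison is not fooled by visual similarity.
looks_equal = spoofed == "Unicode"
print("Equal to 'Unicode':", looks_equal)
```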

Surviving in the Unicode world

Rule #2

Inside your application think text composed of characters; forget about bytes

Rule #3

Decode bytes into text on input

Encode text into bytes on output

Input decoding example

[Figure labels: Transport metadata, Transport data, Decoding, Internal representation, Exceptions!]

Text: Jolanta Kozak “Dziaberlak” (“Jabberwocky” by Lewis Carroll)
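
The original slide showed a code screenshot that did not survive extraction. As a stand-in, here is a minimal sketch of strict input decoding, assuming the charset has already been extracted from the transport metadata (for example, a Content-Type header). The `decode_input` helper is hypothetical and not the author's code.

```python
def decode_input(raw: bytes, declared_charset: str) -> str:
    """Decode transport bytes using the charset named in the metadata.

    errors="strict" makes malformed input raise UnicodeDecodeError
    instead of slipping through unnoticed.
    """
    return raw.decode(declared_charset, errors="strict")


body = "\u0104".encode("utf-8")     # transport data (bytes)
text = decode_input(body, "utf-8")  # internal representation (str)

# Bytes that do not match the declared charset raise an exception.
try:
    decode_input("\u0104".encode("iso-8859-2"), "utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
print(text, rejected)
```

An unknown charset name raises `LookupError` rather than `UnicodeDecodeError`, so a caller would probably want to handle both.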

Text processing, persistence* and fun

Example text processing

* do not persist anything before watching this presentation to the end

Output encoding

U+FEFF BYTE ORDER MARK (BOM), the “Unicode signature”
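
A short sketch of output encoding with and without the byte order mark, using only Python's standard codecs. The variable names are illustrative.

```python
import codecs

text = "\u0104"  # Ą

utf8_plain = text.encode("utf-8")
utf8_signed = text.encode("utf-8-sig")  # prepends the UTF-8 form of U+FEFF
utf16_be_signed = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# A generic UTF-16 decoder uses the BOM to pick the byte order, then drops it.
round_trip = utf16_be_signed.decode("utf-16")

print(utf8_plain, utf8_signed, utf16_be_signed, round_trip)
```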

When things go south

Wrong decoder

● Client told us so
● Client told us nothing, so we assumed so
● Client told us ‘utf-8’, but we ignored it and assumed ‘ascii’ because we have been writing this software since 1986

What to do - a policy decision!

● Reject incorrect information (fail closed)
● Partially lose information (fail open)
● Recover information (fail pretending it’s fine)
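
One way to map the three options onto Python's decoder is sketched below. The function names and the `cp1250` fallback are assumptions made for illustration; which behaviour is right remains a policy decision.

```python
raw = "K\u0104T".encode("cp1250")  # b"K\xa5T", which is not valid UTF-8


def fail_closed(data: bytes) -> str:
    # Reject: raises UnicodeDecodeError on malformed input.
    return data.decode("utf-8", errors="strict")


def fail_open(data: bytes) -> str:
    # Partially lose information: each bad byte becomes U+FFFD.
    return data.decode("utf-8", errors="replace")


def recover(data: bytes, fallback: str = "cp1250") -> str:
    # Pretend it's fine: guess a legacy code page (the guess may be wrong).
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(fallback)


try:
    fail_closed(raw)
    closed_rejected = False
except UnicodeDecodeError:
    closed_rejected = True

print(closed_rejected, repr(fail_open(raw)), recover(raw))
```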

Validation techniques

Character category enforcement

Source: Unicode Standard Annex #44

Rule #5

Enforce Unicode categories
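
Category enforcement can be sketched with `unicodedata.category`, which returns the two-letter General Category defined in UAX #44. The allow-list below is a hypothetical policy for a "name" field, not a recommendation from the talk.

```python
import unicodedata

# Hypothetical policy: letters, spaces, dashes and "other punctuation"
# (the class that contains the apostrophe).
ALLOWED_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Zs", "Pd", "Po"}


def disallowed_chars(text: str, allowed=frozenset(ALLOWED_CATEGORIES)) -> list:
    """Return (character, category) pairs that fall outside the policy."""
    return [
        (ch, unicodedata.category(ch))
        for ch in text
        if unicodedata.category(ch) not in allowed
    ]


def is_valid_name(text: str) -> bool:
    return not disallowed_chars(text)


for sample in ["Portia Sutcliffe", "Cynthia O'Keefe", "<script>", "Bob\u202e"]:
    print(repr(sample), is_valid_name(sample), disallowed_chars(sample))
```

Note that `Po` is a broad class (it also covers characters such as `!` and `%`), so a real policy may need to be narrower than this sketch.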

Character script enforcement

Scripts used in the example:
● LATIN
● CYRILLIC
● CJK
● ARABIC

Rule #6

Enforce Unicode scripts
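
Python's standard library does not expose the Unicode Script property, so the sketch below approximates it from the first word of each letter's character name. This is a heuristic chosen for illustration and may differ from what the original slide showed; a production system would more likely use a library with real script data. The helper names are hypothetical.

```python
import unicodedata


def scripts_of(text: str) -> set:
    """Approximate the set of scripts used by the letters in `text`."""
    found = set()
    for ch in text:
        if unicodedata.category(ch).startswith("L"):  # letters only
            # First word of the name, e.g. "CYRILLIC SMALL LETTER A".
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found


def uses_only(text: str, allowed: set) -> bool:
    return scripts_of(text) <= allowed


print(scripts_of("Unicode"))
print(scripts_of("p\u0430ypal"))  # Cyrillic "а" hidden among Latin letters
print(uses_only("p\u0430ypal", {"LATIN"}))
```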

Text direction enforcement

RIGHT-TO-LEFT OVERRIDE U+202E

Visual spoof*

*now prevented by many client programs

Source: Unicode Standard Annex #44

Rule #7

Enforce consistent text direction
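
A minimal sketch of a direction check based on `unicodedata.bidirectional`, which returns the Bidi_Class value from UAX #44. The rule implemented here (no explicit direction controls, and no mixing of strong left-to-right with strong right-to-left characters) is an assumed policy for illustration; legitimate mixed-direction text would be rejected by it.

```python
import unicodedata

# Bidi classes of the explicit embedding, override and isolate controls.
EXPLICIT_CONTROLS = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def has_direction_controls(text: str) -> bool:
    return any(unicodedata.bidirectional(ch) in EXPLICIT_CONTROLS for ch in text)


def strong_directions(text: str) -> set:
    classes = {unicodedata.bidirectional(ch) for ch in text}
    return classes & {"L", "R", "AL"}


def has_consistent_direction(text: str) -> bool:
    if has_direction_controls(text):
        return False
    strong = strong_directions(text)
    return strong <= {"L"} or strong <= {"R", "AL"}


print(has_consistent_direction("invoice.pdf"))
print(has_consistent_direction("invoice\u202efdp.exe"))  # contains U+202E
```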

Normalization

When is “café” not “café”?

● é as a single character: U+00E9
● é as two characters, U+0065 and U+0301, combined on display

When is it a problem?

● Collation
● Sorting
● Comparison
● Persistence of data with different composition

Unicode normalization

● NFC = Normalization Form “C”
● Converts Unicode characters to a single, consistent form
● U+0065 U+0301 and U+00E9 are Unicode “canonical equivalents”

NFC, NFD, NFKC, NFKD?

● NFC will compose, make shorter - é becomes U+00E9
● NFD will decompose, make longer - é becomes U+0065 U+0301
● NFKC and NFKD will also replace “compatibility characters”
  ○ Possible loss of information!
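
The difference between the composed and decomposed forms is easy to observe with `unicodedata.normalize`. A minimal sketch, with the strings written as escapes so the code points are unambiguous:

```python
import unicodedata

composed = "caf\u00e9"     # é as one code point
decomposed = "cafe\u0301"  # e followed by COMBINING ACUTE ACCENT

# They render the same but compare as different strings.
print(composed == decomposed, len(composed), len(decomposed))

nfc = unicodedata.normalize("NFC", decomposed)  # compose
nfd = unicodedata.normalize("NFD", composed)    # decompose
print(nfc == composed, nfd == decomposed)
```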

NFKC, NFKD

This is a significant loss of information!

Source: Luciano Ramalho “Fluent Python”

Compatibility normalization

● Precomposed Roman numerals Ⅻ U+216B
● Ⅹ U+2169 Ⅰ U+2160 Ⅰ U+2160 (Roman numerals)
● X U+0058 I U+0049 I U+0049 (Latin letters)

Another typical example:
● ¼ U+00BC → 1 U+0031 ⁄ U+2044 4 U+0034

NFKC, NFKD

NFKC replaces the single character Ⅻ U+216B ROMAN NUMERAL TWELVE with three Roman digits represented by the Latin letters X and I

Useful for search and comparison:
● is there “X” in “Ⅻ” (U+216B)?
● is there “f” in “ﬃ” (U+FB03)?

Many typesetting programs will replace popular “compatibility sequences” with the appropriate Unicode characters:

--> replaced by → U+2192 RIGHTWARDS ARROW
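
The compatibility mappings mentioned above can be checked directly. In this sketch the `samples` dictionary and its keys are illustrative; NFC is included to show that canonical normalization leaves these characters untouched.

```python
import unicodedata

samples = {
    "roman twelve": "\u216b",  # Ⅻ
    "one quarter": "\u00bc",   # ¼
    "ffi ligature": "\ufb03",  # ﬃ
}

nfc = {k: unicodedata.normalize("NFC", v) for k, v in samples.items()}
nfkc = {k: unicodedata.normalize("NFKC", v) for k, v in samples.items()}

for key in samples:
    print(key, repr(nfc[key]), repr(nfkc[key]))

# Substring search only succeeds after compatibility normalization.
print("X" in samples["roman twelve"], "X" in nfkc["roman twelve"])
```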

Rule #8

Normalize Unicode text

Validation strategies

Policy decisions

Let’s talk about your user base for a moment…

● Are they expected to communicate in Chinese, English, Polish, Arabic…?
● Do we expect text in Linear-B, Ugaritic, Klingon?
  ○ Can you process data in any of these languages?
● What kind of text is expected where?
● E.g. names - are they composed of letters only (“Portia Sutcliffe”)
  ○ Or maybe we also expect digits and punctuation (“Cynthia O'Keefe”, “Howard Upperton-Wildingham III”)
● Define what is valid text
● Define appropriate valid categories, scripts and text directions
● Normalize Unicode input prior to validation
● Do we have free-text fields?
  ○ If yes, how free is the “free text”?
  ○ Letters, digits, punctuation?
  ○ Symbols?
    ■ Because if someone explains “I clicked File > Properties > General” it takes symbols
● Is all this part of localization for a given region?
  ○ Along with database collation, currency, numbers...
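
The policy questions above could be captured in a small per-field object that normalizes first and validates second. This is only a sketch under assumed policies: `TextPolicy`, `NAME_POLICY` and `FREE_TEXT_POLICY` are hypothetical names, and the category prefixes are examples rather than recommendations.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class TextPolicy:
    category_prefixes: tuple  # allowed General Category prefixes
    form: str = "NFC"         # normalization applied before validation

    def validate(self, raw: str) -> str:
        text = unicodedata.normalize(self.form, raw)
        for ch in text:
            category = unicodedata.category(ch)
            if not category.startswith(self.category_prefixes):
                raise ValueError(f"U+{ord(ch):04X} ({category}) is not allowed")
        return text


# Names: letters, spaces, dashes, other punctuation (apostrophe).
NAME_POLICY = TextPolicy(("L", "Zs", "Pd", "Po"))
# Free text: also digits and symbols such as ">".
FREE_TEXT_POLICY = TextPolicy(("L", "N", "P", "S", "Zs"))

print(NAME_POLICY.validate("Cynthia O'Keefe"))
print(FREE_TEXT_POLICY.validate("I clicked File > Properties > General"))
```

Normalizing first matters: a decomposed “ë” (e followed by U+0308) would fail the name policy because the combining mark is not a letter, while its NFC form passes.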

Questions?

Pawel Krawczyk

[email protected]

+44 7879 180015

https://www.linkedin.com/in/pawelkrawczyk/