Patchwork UTF-8: Unicode portability and grapheme clusters


Patchwork\Utf8
Unicode and grapheme clusters for PHP

PHP Tour Nantes 2012
https://github.com/nicolas-grekas/Patchwork-UTF8

Outline
  • Unicode: basics and concepts
  • Unicode: state of the art
  • Patchwork\Utf8

110,182 characters, 100 scripts
  "Peace" across scripts:
  • P   U+0050  LATIN CAPITAL LETTER P
  • س   U+0633  ARABIC LETTER SEEN
  • 和  U+548C  CJK UNIFIED IDEOGRAPH-548C
  • ☮   U+262E  PEACE SYMBOL
  Every character has a number, a name and properties (category, script, etc.).

Unicode: basics and concepts
  • Binary representations
  • Upper case, lower case, folding
  • Compositions, ligatures
  • Comparison: normalization and collation
  • Segmentation: characters, words, sentences and hyphenation
  • Locales: cultural conventions, transliteration
  • Identifiers and security, confusables
  • Display: direction, width

Binary representations
  From code point to byte sequence:
  • UTF-8: 1, 2, 3 or 4 bytes
  • UTF-16: 2 or 4 bytes
  • UTF-32: 4 bytes

  BOM: U+FEFF, the Byte Order Mark

  Encoding   BOM bytes
  UTF-32BE   00 00 FE FF
  UTF-32LE   FF FE 00 00
  UTF-16BE   FE FF
  UTF-16LE   FF FE
  UTF-8      EF BB BF (shown as ï»¿ when misread as Windows-1252)

  Character                                     UTF-16BE   UTF-8
  á   U+00E1  LATIN SMALL LETTER A WITH ACUTE   00 E1      C3 A1
  あ  U+3042  HIRAGANA LETTER A                 30 42      E3 81 82

UTF-8
  Characteristics:
  • Superset of ASCII
  • Self-synchronizing

  Byte 1     Byte 2     Byte 3     Byte 4
  0xxxxxxx
  110xxxxx   10xxxxxx
  1110xxxx   10xxxxxx   10xxxxxx
  11110xxx   10xxxxxx   10xxxxxx   10xxxxxx
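The byte layout in the UTF-8 table can be checked by hand. The following is a minimal sketch, not part of the talk: the function names codePointToUtf8 and hexBytes are made up for illustration, and surrogate code points are not rejected.

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8,
// following the 1/2/3/4-byte layouts of the table above.
function codePointToUtf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);                                  // 0xxxxxxx
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6))                     // 110xxxxx
             . chr(0x80 | ($cp & 0x3F));                  // 10xxxxxx
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))                    // 1110xxxx
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0x10FFFF) {
        return chr(0xF0 | ($cp >> 18))                    // 11110xxx
             . chr(0x80 | (($cp >> 12) & 0x3F))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    throw new InvalidArgumentException('Code point out of range');
}

// Hypothetical helper: format a byte string as upper-case hex pairs.
function hexBytes(string $s): string
{
    return strtoupper(implode(' ', str_split(bin2hex($s), 2)));
}

echo hexBytes(codePointToUtf8(0x00E1)), "\n"; // C3 A1
echo hexBytes(codePointToUtf8(0x3042)), "\n"; // E3 81 82
echo hexBytes(codePointToUtf8(0xFEFF)), "\n"; // EF BB BF (the UTF-8 BOM)
```

The three outputs match the UTF-8 column of the examples and the BOM row above.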

Upper case, lower case and folding
  • Concerns a little over 1,000 characters
  • Folding ⇒ case-insensitive comparison
  • Compare strings in lower case
  • One upper case, two lower cases: Σ / σ ς
  • Turkish exception: dotless I / ı versus dotted İ / i
  • Full folding: ß → ss

Composition, ligatures
  • Å  U+00C5  LATIN CAPITAL LETTER A WITH RING ABOVE
    A + ◌̊  U+0041, U+030A  LATIN CAPITAL LETTER A, COMBINING RING ABOVE
  • f + i  U+0066, U+0069  LATIN SMALL LETTER F, LATIN SMALL LETTER I
    ﬁ  U+FB01  LATIN SMALL LIGATURE FI
  • ᄃ + ᅮ + ᆨ  U+1103, U+116E, U+11A8  HANGUL CHOSEONG TIKEUT, HANGUL JUNGSEONG U, HANGUL JONGSEONG KIYEOK
    둑  U+B451  HANGUL SYLLABLE DUG
  How do we test for equality?

Normalization
  Forms: Composed (C), Decomposed (D), Compatibility (K)
  • ḋ + ◌̣  U+1E0B, U+0323  LATIN SMALL LETTER D WITH DOT ABOVE, COMBINING DOT BELOW
    → NFD: d + ◌̣ + ◌̇  U+0064, U+0323, U+0307  LATIN SMALL LETTER D, COMBINING DOT BELOW, COMBINING DOT ABOVE
    → NFC: ḍ + ◌̇  U+1E0D, U+0307  LATIN SMALL LETTER D WITH DOT BELOW, COMBINING DOT ABOVE
  • ﬁ  U+FB01  LATIN SMALL LIGATURE FI
    → NFKD, NFKC: f + i  U+0066, U+0069  LATIN SMALL LETTER F, LATIN SMALL LETTER I
    → NFD, NFC: ﬁ  U+FB01 (unchanged)

Graphemes
  NFC  Déjà  U+0044 U+00E9 U+006A U+00E0
  NFD  Déjà  U+0044 U+0065 U+0301 U+006A U+0061 U+0300
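At the byte level, the two forms of "Déjà" are simply different strings. The sketch below is an illustration added here, not taken from the slides; it spells both forms out as UTF-8 bytes and uses only core PHP string functions, which work on bytes.

```php
<?php
// "Déjà" written out byte by byte in UTF-8, in both normalization forms.
$nfc = "D\xC3\xA9j\xC3\xA0";     // U+0044 U+00E9 U+006A U+00E0
$nfd = "De\xCC\x81ja\xCC\x80";   // U+0044 U+0065 U+0301 U+006A U+0061 U+0300

var_dump($nfc === $nfd);         // bool(false): same text, different bytes
var_dump(strlen($nfc));          // int(6)
var_dump(strlen($nfd));          // int(8)

// Byte offset 2 is the second half of "é" in NFC,
// and the first half of the combining acute accent in NFD.
var_dump(bin2hex($nfc[2]));      // string(2) "a9"
var_dump(bin2hex($nfd[2]));      // string(2) "cc"
```

Neither byte offsets nor code point offsets answer "which is the 2nd character?" for both forms at once; that is what grapheme clusters are for.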

  Which is the 2nd character? The 3rd?

Unicode in practice
  • ICU: Java and C/C++. X-like licence, backed by IBM, used as the reference implementation for Unicode and more
  • Perl 6 / Parrot: NFG, that is NFC + grapheme clusters
  • JavaScript: Unicode (NFC)
  • Python: typed strings
  • PHP: iconv, mbstring, pcre, intl

Iconv
  iconv_set_encoding('internal_encoding', 'UTF-8')

  iconv($in_charset, $out_charset, $str)
  iconv_strlen($str)
  iconv_substr($str, $start, $length)
  iconv_strpos($haystack, $needle, $offset = 0)
  iconv_strrpos($haystack, $needle)
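A short sketch of how the iconv functions listed above behave on the two forms of "Déjà". It assumes the iconv extension is loaded, and it passes the encoding as an explicit last argument rather than relying on a global setting; that choice is mine, not the slides'.

```php
<?php
$nfc = "D\xC3\xA9j\xC3\xA0";     // "Déjà", composed
$nfd = "De\xCC\x81ja\xCC\x80";   // "Déjà", decomposed

// iconv counts code points, not bytes and not grapheme clusters.
var_dump(iconv_strlen($nfc, 'UTF-8'));                        // int(4)
var_dump(iconv_strlen($nfd, 'UTF-8'));                        // int(6)

// Offsets are code point offsets as well.
var_dump(iconv_substr($nfc, 1, 1, 'UTF-8') === "\xC3\xA9");   // bool(true): "é"
var_dump(iconv_substr($nfd, 1, 1, 'UTF-8') === 'e');          // bool(true): accent left behind
var_dump(iconv_strpos($nfc, 'j', 0, 'UTF-8'));                // int(2)
var_dump(iconv_strpos($nfd, 'j', 0, 'UTF-8'));                // int(3)
```

The results are consistent for NFC input, while NFD input gets split between a base letter and its combining accent.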

  UTF-8 string handling: done!
  NFC: Déjà   NFD: Déjà

Mbstring
  mb_internal_encoding('UTF-8')
  • Equivalents of the iconv functions
  • mb_strtolower() / mb_strtoupper(): no folding
  • mb_stripos(): simple folding
  UTF-8 string handling: done! (again)
  Case handling: done! (apart from folding)
  NFC: Déjà   NFD: Déjà

PCRE: Perl Compatible Regular Expressions
  • With the u modifier: /./u
  • Gives access to Unicode properties
  • \x{00E9}, or simply é if the source file is UTF-8
  • \p{Greek} \p{Mn} \X
  • (?>\PM\pM*) for PCRE < 8.32
  • Check that $str is valid UTF-8: preg_match('//u', $str)
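The two PCRE idioms above, validity checking with an empty /u pattern and grapheme matching with \X, can be wrapped as follows. This is a sketch; the names isValidUtf8 and graphemes are hypothetical and are not Patchwork's API.

```php
<?php
// Hypothetical helper: with the u modifier PCRE validates the subject first,
// so preg_match() returns false (not 0) when $s is not valid UTF-8.
function isValidUtf8(string $s): bool
{
    return 1 === preg_match('//u', $s);
}

// Hypothetical helper: split a valid UTF-8 string into grapheme clusters.
function graphemes(string $s): array
{
    preg_match_all('/\X/u', $s, $m);
    return $m[0];
}

var_dump(isValidUtf8("D\xC3\xA9j\xC3\xA0"));        // bool(true)
var_dump(isValidUtf8("java\xFFscript"));            // bool(false)
var_dump(count(graphemes("D\xC3\xA9j\xC3\xA0")));   // int(4), NFC
var_dump(count(graphemes("De\xCC\x81ja\xCC\x80"))); // int(4), NFD

// \p{Mn} matches the two combining accents of the NFD form.
var_dump(preg_match_all('/\p{Mn}/u', "De\xCC\x81ja\xCC\x80")); // int(2)
```

Unlike byte or code point counts, the \X count is the same for both normalization forms.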

  UTF-8 string handling: done! (once more)
  Unicode property handling: done!
  NFC: Déjà   NFD: Déjà

Intl: ICU for PHP
  • Normalizer
  • Grapheme cluster handling
  • Collator, NumberFormatter, Locale, MessageFormatter, IntlDateFormatter, Spoofchecker, Transliterator, IDN functions, UConverter (PHP 5.5)

Normalizer
  use Normalizer as n;

  n::isNormalized($str, $form = n::NFC)

  n::normalize($str, $form = n::NFC)

  n::NFC, n::NFD, n::NFKC, n::NFKD, n::NONE
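A minimal sketch of string equality testing with the Normalizer class. It assumes the intl extension is available (the demo is skipped otherwise) and uses the FORM_C, FORM_D and FORM_KC constant names; the function name sameText is hypothetical.

```php
<?php
// Hypothetical helper: two strings are "the same text" when their NFC forms match.
function sameText(string $a, string $b): bool
{
    return Normalizer::normalize($a, Normalizer::FORM_C)
       === Normalizer::normalize($b, Normalizer::FORM_C);
}

// The Normalizer class comes from the intl extension; skip the demo without it.
if (class_exists('Normalizer')) {
    $nfc = "D\xC3\xA9j\xC3\xA0";     // "Déjà", composed
    $nfd = "De\xCC\x81ja\xCC\x80";   // "Déjà", decomposed

    var_dump($nfc === $nfd);                                             // bool(false)
    var_dump(sameText($nfc, $nfd));                                      // bool(true)
    var_dump(Normalizer::isNormalized($nfd, Normalizer::FORM_C));        // bool(false)
    var_dump(Normalizer::normalize($nfc, Normalizer::FORM_D) === $nfd);  // bool(true)

    // Compatibility forms also unfold the ligature U+FB01 into "f" + "i".
    var_dump(Normalizer::normalize("\xEF\xAC\x81", Normalizer::FORM_KC)); // string(2) "fi"
}
```

Normalizing both operands before comparing is the simplest option; normalizing once at input time avoids repeating the work.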

  Testing string equality: done!
  NFC: Déjà   NFD: Déjà

Grapheme clusters
  • grapheme_extract: extracts a sequence of grapheme clusters from a UTF-8 string
  • grapheme_stripos
  • grapheme_stristr
  • grapheme_strlen
  • grapheme_strpos
  • grapheme_strripos
  • grapheme_strrpos
  • grapheme_strstr
  • grapheme_substr
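A brief sketch of a few grapheme_* functions applied to the decomposed form of "Déjà". It assumes the intl extension is loaded and is skipped otherwise; the variable name is mine.

```php
<?php
$nfdText = "De\xCC\x81ja\xCC\x80";   // "Déjà", decomposed: 8 bytes, 6 code points

// grapheme_* functions come from the intl extension; skip the demo without it.
if (function_exists('grapheme_strlen')) {
    var_dump(strlen($nfdText));                                // int(8): bytes
    var_dump(grapheme_strlen($nfdText));                       // int(4): grapheme clusters
    var_dump(grapheme_substr($nfdText, 1, 1) === "e\xCC\x81"); // bool(true): "é" kept whole
    var_dump(grapheme_strpos($nfdText, 'j'));                  // int(2): offset in clusters
}
```

Here lengths and offsets match what a reader would count on screen, whichever normalization form the string is in.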

  Handling text by grapheme cluster: done!
  Still a little young, watch out for bugs: https://bugs.php.net/55562, 61860 and 62759
  NFC: Déjà   NFD: Déjà

PHP: summary
  • UTF-8 strings: iconv, mbstring, pcre
  • Case: mbstring
  • Normalization: intl
  • Graphemes: intl and pcre

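  function toAscii($s)
  {
      // Only do the work when the string contains non-ASCII bytes
      if (preg_match("/[\x80-\xFF]/", $s)) {
          // Compatibility decomposition: split letters from their accents
          $s = Normalizer::normalize($s, Normalizer::NFKD);
          // Drop the combining marks (Unicode category Mn)
          $s = preg_replace('/\p{Mn}+/u', '', $s);
          // Let iconv transliterate whatever is still not ASCII
          $s = iconv('UTF-8', 'ASCII//TRANSLIT', $s);
      }

      return $s;
  }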

  What about case folding? Complexity? Availability?

Patchwork\Utf8
Unicode and grapheme clusters for PHP

Patchwork\Utf8
  Portable PHP, optimized for UTF-8:
  • Iconv: 99%
  • Mbstring: the 45% that is needed
  • Intl: Normalizer and grapheme_*()

  • utf8_encode() / utf8_decode(): enhanced for Windows-1252

Windows-1252
      x0   x1   x2   x3   x4   x5   x6   x7   x8   x9   xA   xB   xC   xD   xE   xF
  0x  NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
  1x  DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
  2x  SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
  3x  0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
  4x  @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
  5x  P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
  6x  `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
  7x  p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL
  8x, 9x: printable characters (such as € and the curly quotes) where ISO-8859-1 has control codes
  Ax to Fx: same as ISO-8859-1, starting with NBSP at A0
  HTML5: Windows-1252 takes the place of ISO-8859-1

Dependencies
  [Diagram: Patchwork\Utf8 and mbstring, iconv, grapheme_*, Normalizer, pcre, xml, with a group labelled "PHP implementations"]

Patchwork\Utf8
  • PHP's functions, in a UTF-8 version

  • Limited to the subset that needs it
  • u:: prefix to declare the intent in code

Interface
  • strlen - chr - ord
  • substr - str_split
  • strpos - strrpos - strstr - strrchr - stripos - strripos - stristr - strrichr
  • strtolower - strtoupper - ucfirst - lcfirst - ucwords
  • trim - ltrim - rtrim
  • strtr - str_ireplace - substr_replace - str_replace already compatible
  • strcmp - strnatcmp - strcasecmp - strnatcasecmp - strncasecmp - strncmp
  • strspn - strcspn - strpbrk - substr_compare - substr_count - str_word_count - count_chars
  • number_format - wordwrap - str_pad - strrev - str_shuffle
  • utf8_encode - utf8_decode (enhanced for Windows-1252)
  Missing: the printf family
  Adds isUtf8 - toAscii - strtocasefold - strtonatfold, plus many workarounds
  NFC: Déjà   NFD: Déjà

Patchwork\Utf8
  • A few extras, workarounds and bootstrapping

  • Covered by many unit tests
  • Licences: Apache-2.0 and GPL-2.0

Comparable projects
  • https://github.com/FSX/php-utf8
  • Mediawiki, PhpBB, Drupal, etc.
  Major differences:
  • Grapheme cluster handling
  • API already documented: see the PHP documentation
  • Bootstrapping possible via autoload
  • Testable and comparable at the same time

Bootstrapping
  Use a code editor in UTF-8 mode, without BOM.

  Check that your input is valid UTF-8, but do not strip the invalid bytes: removing \xFF from java\xFFscript:alert("XSS") would leave a working javascript: URL.

  preg_match('//u', $v) or $v = u::utf8_encode($v);
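The validate-then-re-encode idea above can be sketched without the library. The helper names latin1ToUtf8 and sanitizeInput are hypothetical, and the fallback treats each byte as ISO-8859-1, a simplification of the Windows-1252 handling that the slides attribute to u::utf8_encode().

```php
<?php
// Hypothetical helper: read every byte as ISO-8859-1 and re-encode it as UTF-8.
// (Assumption: plain Latin-1; bytes 0x80-0x9F are NOT mapped as Windows-1252.)
function latin1ToUtf8(string $s): string
{
    $out = '';
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $b = ord($s[$i]);
        $out .= ($b < 0x80)
            ? $s[$i]                                              // ASCII stays as is
            : (chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F)));  // two-byte UTF-8
    }
    return $out;
}

// Hypothetical helper: keep valid UTF-8 untouched, re-encode anything else.
function sanitizeInput(string $v): string
{
    return 1 === preg_match('//u', $v) ? $v : latin1ToUtf8($v);
}

$v = sanitizeInput("java\xFFscript:alert(\"XSS\")");
var_dump(1 === preg_match('//u', $v));        // bool(true): now valid UTF-8
var_dump(bin2hex(substr($v, 4, 2)));          // string(4) "c3bf": \xFF became "ÿ"
var_dump(strpos($v, 'javascript:'));          // bool(false): no URL scheme was created
```

The invalid byte is turned into a visible character instead of being dropped, so the surrounding text never fuses into "javascript:".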

  Normalize your UTF-8 input: NFC on demand.

  require 'bootup.utf8.php';
  use Patchwork\Utf8 as u;

Practical value
  • NFC pushes back the need for grapheme clusters
  • For handling data (cf. MySQL), not identifiers
  • Use with discernment

Questions?

  https://github.com/nicolas-grekas/Patchwork-UTF8
  composer: {"require": {"patchwork/utf8": "1.0.*"}}

  Resources:
  • Unicode.org
  • Wikipedia
  • PHP et UTF-8: http://julp.lescigales.org/php/utf8/
  • Handling UTF-8 with PHP: http://www.phpwact.org/php/i18n/utf-8

Thank you!

Normalizer test
  [Chart; surviving labels: 14 MB, 4 MB, 2 MB]