ravi-raj
Character Encoding issue with PHP
$customer = array(
    'id'    => 'á é í ó ú, ñ, Ñ',
    'name'  => 'Iñtërnâtiônàlizætiøn',
    'notes' => 'raviraj from infoEdge india Ltd.'
);
$var = "I ♥ Unicode, You ♥ Unicode.";
The main problem with using Unicode
• It is only partially supported by some parts of any given tool chain.
• Sometimes it works great. Other times, because of a piece of software's missing (or worse, partial) implementation, human error, or outright bugs, the chain's weakest link shatters in a non-spectacular way.
Let's Take a Complex Case
• create file
• edit file
• commit file to svn
• other developers edit file
• others commit to svn
• release is rolled from svn
• visitor browser requests page
• httpd parses request
• httpd delivers request to PHP
• PHP processes request
• PHP (client) calls service to fulfill back-end portions of request (encodes the request in an envelope; we use JSON most of the time)
• PHP (service) receives request
• service retrieves and/or stores data in database
• service returns data to PHP client
• PHP client processes returned data and in turn delivers it to httpd
• httpd returns data to browser
Let's Take a Complex Case (continued)
• Any (one or more!) of the following could fail when handling Unicode: developers' editors, developers' transport (either upload or version control), user's browser, user's HTTP proxy, client-side httpd, client-side PHP, client-side encoder (JSON), service-side httpd (especially HTTP headers), service-side decoder, service-side PHP, service-side database client, database protocol character set imbalance, database table charset, database server, service-side encoder, client-side decoder, client-side PHP (again), client-side httpd (including HTTP headers, again), user's proxy (again), and user's browser (again). I've probably even left some out.
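To make one of these links concrete, here is a minimal sketch of the JSON encoder step. It assumes PHP 7 or later and a source file saved as UTF-8. The `$latin1` value is a made-up example of data that reached the encoder as ISO 8859-1 instead of UTF-8.

```php
<?php
// Valid UTF-8 survives a JSON round trip.
$customer = array('name' => 'Iñtërnâtiônàlizætiøn');
$json     = json_encode($customer);   // non-ASCII becomes \uXXXX escapes by default
$decoded  = json_decode($json, true);
echo $decoded['name'] === $customer['name'] ? "round trip ok\n" : "round trip broken\n";

// Hypothetical bad input: "Iñtër" as ISO 8859-1 bytes, which is not valid UTF-8.
$latin1 = array('name' => "I\xF1t\xEBr");
$failed = json_encode($latin1);       // returns false instead of a JSON string
if ($failed === false) {
    echo "encode failed: ", json_last_error_msg(), "\n";
}
```

If the return value is not checked, the failure passes silently to the next link in the chain, which is exactly the kind of non-spectacular break described above.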
Understand the Basics
A character is the smallest component of written language that has a semantic value. Examples of characters are letters, ideographs (e.g. Chinese characters), punctuation marks, digits etc.
A character set is a group of characters without associated numerical values. An example of a character set is the Latin alphabet or the Cyrillic alphabet.
Coded character sets are character sets in which each character is associated with a scalar value: a code point. For example, in ASCII, the uppercase letter “A” has the value 65. Examples of coded character sets are ASCII and Unicode. A coded character set is meant to be encoded, i.e. converted into a digital representation so that the characters can be serialized in files, databases, or strings. This is done through a character encoding scheme, or encoding. The encoding method maps each character value to a given sequence of bytes.
In many cases, the encoding is just a direct projection of the scalar values, and there is no real distinction between the coded character set and its serialized representation. For example, in ISO 8859-1 (Latin 1), the character “A” (code point 65) is encoded as the byte 0x41 (i.e. 65). In other cases, the encoding method is more complex. For example, in UTF-8, an encoding of Unicode, the character “á” (225) is encoded as two bytes: 0xC3 and 0xA1.
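The difference between a code point and its encoded bytes can be checked directly. This is a small sketch, assuming PHP 7.2 or later with the mbstring extension enabled; the variable names are only for illustration.

```php
<?php
// Assumes PHP 7.2+ with the mbstring extension enabled.
$a      = "A";        // code point 65
$aAcute = "\u{E1}";   // "á", code point 225, stored as a UTF-8 string

echo mb_ord($aAcute, 'UTF-8'), "\n";   // code point: 225
echo bin2hex($a), "\n";                // 41
echo bin2hex($aAcute), "\n";           // UTF-8 bytes: c3a1

// In ISO 8859-1 the same character is one byte equal to its code point.
$latin1 = mb_convert_encoding($aAcute, 'ISO-8859-1', 'UTF-8');
echo bin2hex($latin1), "\n";           // e1
```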
Unicode - Universal Character Set
UTF-8 is a multibyte 8-bit encoding in which each Unicode scalar value is mapped to a sequence of one to four bytes. One of the main advantages of UTF-8 is its compatibility with ASCII. If no extended characters are present, there is no difference between a document encoded in ASCII and one encoded in UTF-8.
One thing to take into consideration when using UTF-8 with PHP is that characters are represented with a varying number of bytes. Some PHP functions do not take this into account and will not work as expected.
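As one illustration of a byte-oriented function going wrong, the sketch below cuts a string with `substr` and with `mb_substr`. It assumes the mbstring extension is enabled and the source file is saved as UTF-8; the sample word is arbitrary.

```php
<?php
// Assumes the mbstring extension and a source file saved as UTF-8.
$word = 'ñandú';

$byteCut = substr($word, 0, 1);               // first byte only: half of "ñ"
$charCut = mb_substr($word, 0, 1, 'UTF-8');   // first character: "ñ"

echo bin2hex($byteCut), "\n";                 // c3
echo mb_check_encoding($byteCut, 'UTF-8') ? "valid UTF-8\n" : "invalid UTF-8\n";
echo $charCut, "\n";                          // ñ
```

The byte-based cut leaves a lone lead byte, which is no longer valid UTF-8 and may show up as a replacement character or break a later encoder.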
PHP's Problem
<?php
echo strlen('Iñtërnâtiônàlizætiøn');
?>
It prints 27. That's because the string, encoded as UTF-8, contains multi-byte characters, and PHP's strlen function counts bytes, not characters.
The correct answer is 20 characters!
So it's a good time to switch over to UTF-8.
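A possible fix, sketched here under the assumption that the mbstring extension is available and the file is saved as UTF-8 with precomposed characters, is to count with a multibyte-aware function:

```php
<?php
// Assumes mbstring is enabled and this file is saved as UTF-8.
$name = 'Iñtërnâtiônàlizætiøn';

echo strlen($name), "\n";                    // 27 (bytes)
echo mb_strlen($name, 'UTF-8'), "\n";        // 20 (characters)

// Without mbstring, PCRE in UTF-8 mode (the /u modifier) can count characters too.
echo preg_match_all('/./us', $name), "\n";   // 20
```

Passing the encoding explicitly avoids relying on the server's mbstring default.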
Why UTF-8?
• First, it's an encoding of Unicode; second, it's backwards compatible with ASCII.
Character codes less than 128 (effectively, the ASCII repertoire) are presented “as such”, using one octet for each code (character). All other codes are presented, according to a relatively complicated method, so that one code (character) is presented as a sequence of two to four octets, each of which is in the range 128 - 255. This means that in a sequence of octets, octets in the range 0 - 127 (“bytes with most significant bit set to 0”) directly represent ASCII characters, whereas octets in the range 128 - 255 (“bytes with most significant bit set to 1”) are to be interpreted as really encoded presentations of characters.
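The "relatively complicated method" is mostly bit shifting. The function below is a hand-written sketch for illustration only; `encodeCodePoint` is a hypothetical helper, not part of PHP, and it does not reject surrogates or values above U+10FFFF.

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 by hand.
// No validation: surrogates and values above 0x10FFFF are not rejected.
function encodeCodePoint(int $cp): string
{
    if ($cp < 0x80) {        // 0xxxxxxx: plain ASCII, one octet
        return chr($cp);
    }
    if ($cp < 0x800) {       // 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {     // 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 11110xxx followed by three continuation octets
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(encodeCodePoint(0x41)), "\n";     // 41     ("A")
echo bin2hex(encodeCodePoint(0xE1)), "\n";     // c3a1   ("á")
echo bin2hex(encodeCodePoint(0x2665)), "\n";   // e299a5 ("♥")
```

Every octet of a multi-octet sequence has its top bit set, so an ASCII octet can never appear inside an encoded character.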
UTF-8 and CodeIgniter
• HTML forms should accept UTF-8
<form accept-charset="utf-8" ...>
• The HTML meta tag should declare UTF-8
<?php echo meta('Content-type', 'text/html; charset='.config_item('charset'), 'equiv');?>
• Send the Content-Type header from index.php
header('Content-Type: text/html; charset=utf-8');
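Put together outside CodeIgniter, the three settings above might look like the following stand-alone page. This is only a sketch: the `$charset` variable, the form field, and the sample value are assumptions made for illustration.

```php
<?php
// Hypothetical stand-alone page, not CodeIgniter code.
$charset = 'utf-8';

// Send the HTTP header before any output is written.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=' . $charset);
}

// Escape user-visible text using the same charset.
$name = htmlspecialchars('Iñtërnâtiônàlizætiøn', ENT_QUOTES, 'UTF-8');

$html = '<meta http-equiv="Content-Type" content="text/html; charset=' . $charset . '">' . "\n"
      . '<form accept-charset="' . $charset . '" method="post">' . "\n"
      . '<input type="text" name="name" value="' . $name . '">' . "\n"
      . '</form>' . "\n";

echo $html;
```

Keeping the charset in one variable means the header, the meta tag, and the form cannot drift apart.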
UTF-8 & CodeIgniter
• Change the config.php file
$config['charset'] = "UTF-8";
• Configure the database settings
$db['default']['char_set'] = "utf8";
$db['default']['dbcollat'] = "utf8_unicode_ci";
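UTF-8 & CI
• ALTER DATABASE mydatabase
DEFAULT CHARACTER SET utf8
DEFAULT COLLATE utf8_general_ci;
• ALTER TABLE mytable
DEFAULT CHARACTER SET utf8
COLLATE utf8_general_ci;
• In MySQL these statements only change the defaults for new tables and columns. To convert existing columns, use ALTER TABLE mytable CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;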
End
Universal Unicode support is a long battle.
I'm sure you are ready for it now :D
Right? :-)
Thanks