Analysis of PHP string encoding issues

WBOY
2016-07-25 08:59:48
First, detect the string's encoding:

  $encoding = mb_detect_encoding($string, array('ASCII', 'UTF-8', 'GB2312', 'GBK', 'BIG5'));

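Then pass that encoding to mb_substr, which cuts the string on character boundaries rather than byte boundaries: mb_substr(string $str, int $start [, int $length [, string $encoding]])

Implementing mb_substr yourself in PHP code is possible, but it is noticeably slower than the built-in function.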
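
As a minimal sketch of the two steps above, the snippet below detects the encoding and then cuts by characters. The helper name head_chars and the UTF-8 fallback are assumptions made for illustration, not part of the original article, and the mbstring extension must be loaded. Which encoding mb_detect_encoding picks for short or ambiguous input can vary, so the example passes the encoding explicitly where the result matters.

```php
<?php
// Hypothetical helper (not from the article): return the first $length characters.
function head_chars(string $text, int $length, ?string $encoding = null): string
{
    if ($encoding === null) {
        // Strict mode (third argument) rejects candidates with invalid byte sequences.
        $detected = mb_detect_encoding($text, ['ASCII', 'UTF-8', 'GB2312', 'GBK', 'BIG5'], true);
        // Assumption: fall back to UTF-8 when no candidate matches.
        $encoding = ($detected !== false) ? $detected : 'UTF-8';
    }
    return mb_substr($text, 0, $length, $encoding);
}

// "中国人" written as explicit UTF-8 bytes, so the encoding of this file does not matter.
$text = "\xE4\xB8\xAD\xE5\x9B\xBD\xE4\xBA\xBA";

echo bin2hex(head_chars($text, 2, 'UTF-8')), "\n"; // e4b8ade59bbd: two whole characters
echo bin2hex(substr($text, 0, 2)), "\n";           // e4b8: two bytes, a broken character
```

The byte-based substr cuts the first character in the middle, which is the usual source of garbled output when truncating Chinese text.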

Usage of encoding-related PHP functions: ord(substr($str, $i, 1)) > 0xa0

ord($string) returns the value (0 to 255) of the first byte of the string. This can be used to check whether a truncated string begins with a Chinese character: in GB2312 a Chinese character takes 2 bytes, each greater than 0xa0, and in UTF-8 it typically takes 3 bytes, each 0x80 or greater. In other words, a byte above the ASCII range (greater than 0x7f) belongs to a multibyte character; ord() never returns a value above 255.
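
A small sketch of that byte check follows. The helper name starts_with_multibyte is hypothetical, and the sample strings are written as explicit bytes so the result does not depend on how the file is saved.

```php
<?php
// Hypothetical helper: true when the string starts with a non-ASCII byte.
function starts_with_multibyte(string $s): bool
{
    return $s !== '' && ord($s[0]) > 0x7f;
}

$gb2312 = "\xD6\xD0\xB9\xFA";         // "中国" in GB2312: 2 bytes per character
$utf8   = "\xE4\xB8\xAD\xE5\x9B\xBD"; // "中国" in UTF-8: 3 bytes per character

echo ord('A'), "\n";     // 65, plain ASCII
echo ord($gb2312), "\n"; // 214 (0xd6), above 0xa0
echo ord($utf8), "\n";   // 228 (0xe4)

var_dump(starts_with_multibyte($utf8)); // bool(true)
var_dump(starts_with_multibyte('abc')); // bool(false)
```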

Regular expressions:

  Match one character at a time in a GB2312/GBK string, keeping each double-byte Chinese character whole: preg_match_all('/[\x80-\xff]?./', $string, $match);
  Match runs of English (ASCII) text: preg_match_all('/[\x01-\x7f]+/', $string, $match);
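
To make the behavior of these two patterns concrete, here is a short sketch run against GB2312 bytes. The last part is an addition that is not in the original article: for UTF-8 input, PCRE's /u modifier with the \p{Han} script class is one common alternative.

```php
<?php
$gb2312 = "\xD6\xD0\xB9\xFAabc"; // "中国abc" as GB2312 bytes

// An optional high byte followed by any byte: each Chinese character stays whole.
preg_match_all('/[\x80-\xff]?./', $gb2312, $match);
echo count($match[0]), "\n"; // 5: two Chinese characters plus a, b, c

// Runs of ASCII bytes only.
preg_match_all('/[\x01-\x7f]+/', $gb2312, $ascii);
echo $ascii[0][0], "\n"; // abc

// Alternative for UTF-8 input (not from the article): match Han characters directly.
$utf8 = "\xE4\xB8\xAD\xE5\x9B\xBDabc";
preg_match_all('/\p{Han}/u', $utf8, $han);
echo count($han[0]), "\n"; // 2
```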

Encoding conversion

  iconv(string $in_charset, string $out_charset, string $str)
  For example, GB2312 to UTF-8: iconv("GB2312", "UTF-8", $text)
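
A minimal sketch of such a conversion, with the input given as explicit GB2312 bytes. It assumes the iconv extension is available and that the underlying iconv library supports GB2312; iconv returns false when conversion fails, so the result is checked.

```php
<?php
$gb2312 = "\xD6\xD0\xB9\xFA"; // "中国" as GB2312 bytes

$utf8 = iconv('GB2312', 'UTF-8', $gb2312);
if ($utf8 === false) {
    // Conversion failed: invalid input or an unsupported charset.
    exit("conversion failed\n");
}

echo bin2hex($gb2312), "\n"; // d6d0b9fa
echo bin2hex($utf8), "\n";   // e4b8ade59bbd
```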

URL encoding: urlencode

urlencode returns a string in which every non-alphanumeric character except -_. is replaced with a percent sign (%) followed by two hexadecimal digits, and spaces are encoded as plus signs (+). This is the same encoding used for WWW form POST data, that is, the application/x-www-form-urlencoded media type.

Note: Encode only the individual parts of a URL, such as query parameter values. If the whole URL is encoded, its colons and slashes are escaped as well.
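
The note above can be sketched as follows. The host www.example.com and the parameter name wd are placeholders chosen for illustration.

```php
<?php
$base    = 'http://www.example.com/s'; // hypothetical base URL
$keyword = 'php & mysql';

// Encode only the parameter value, then append it to the untouched base URL.
$good = $base . '?wd=' . urlencode($keyword);
echo $good, "\n"; // http://www.example.com/s?wd=php+%26+mysql

// Encoding the whole URL escapes the colon and slashes too.
$bad = urlencode($base);
echo $bad, "\n"; // http%3A%2F%2Fwww.example.com%2Fs
```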

urlencode works on bytes, so its output depends on the encoding of the input string. For Chinese text there are generally two variants: the traditional one based on GB2312 and the one based on UTF-8. For example:

  $url = '中国'; // "China"
  echo urlencode($url);
  // UTF-8:  %E4%B8%AD%E5%9B%BD
  // GB2312: %D6%D0%B9%FA
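
Both results can be reproduced in a single script, whatever encoding the file itself is saved in, by starting from explicit UTF-8 bytes and converting with iconv. This is a sketch that assumes the iconv extension supports GB2312.

```php
<?php
$utf8   = "\xE4\xB8\xAD\xE5\x9B\xBD";        // "中国" as UTF-8 bytes
$gb2312 = iconv('UTF-8', 'GB2312', $utf8);   // the same text as GB2312 bytes

$encodedUtf8   = urlencode($utf8);
$encodedGb2312 = urlencode($gb2312);

echo $encodedUtf8, "\n";   // %E4%B8%AD%E5%9B%BD
echo $encodedGb2312, "\n"; // %D6%D0%B9%FA
```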

For example, open Baidu in a browser and search for "中国" (China). The address bar shows: http://www.baidu.com/s?wd=%E4%B8%AD%E5%9B%BD&rsv_bp=0&ch=&tn=baidu&bar=&rsv_spt=3&ie=utf-8&rsv_sug3=16&rsv_sug=0&rsv_sug4=302&rsv_sug1=11&inputT=22928

That is, the browser automatically converts "中国" to %E4%B8%AD%E5%9B%BD, the UTF-8 form.

The difference between urlencode and rawurlencode: urlencode encodes a space as a plus sign ("+"), while rawurlencode encodes a space as "%20".
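
A two-line comparison of the space handling, as a quick sketch:

```php
<?php
$text = 'hello world';

$form = urlencode($text);    // form-style encoding
$raw  = rawurlencode($text); // percent-encoding only

echo $form, "\n"; // hello+world
echo $raw, "\n";  // hello%20world
```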

URL decoding: urldecode and rawurldecode

  1. To decode, use the matching function, urldecode() or rawurldecode(). The difference mirrors encoding: urldecode() turns a plus sign ('+') into a space, while rawurldecode() leaves it unchanged.
  2. Both functions only turn percent escapes back into bytes; they do not convert between character sets. The decoded string is therefore in whatever encoding the URL was built from. If the URL contains Chinese that was not encoded as UTF-8, the decoded string must be converted before it is displayed on a UTF-8 page. To see this, save the PHP file below in GB2312 encoding and view the output as UTF-8: the first part is garbled and the second part is normal.

  $url = '中国'; // "China"; this file is saved as GB2312
  echo $a = urldecode(urlencode($url)), ' ';
  echo iconv('gb2312', 'utf-8', $a);
  // Output: ? й?中国
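
The same behavior can be sketched without depending on how the file is saved, by decoding a GB2312 percent-encoded value directly. The mb_check_encoding call is an addition for illustration and needs the mbstring extension.

```php
<?php
// Plus-sign handling differs between the two decoders.
$a = urldecode('a+b%20c');    // "a b c"
$b = rawurldecode('a+b%20c'); // "a+b c"
echo $a, "\n", $b, "\n";

// Decoding only restores the bytes; here they are GB2312, not UTF-8.
$decoded = urldecode('%D6%D0%B9%FA');
var_dump(mb_check_encoding($decoded, 'UTF-8')); // bool(false)

// Convert before displaying on a UTF-8 page.
$utf8 = iconv('GB2312', 'UTF-8', $decoded);
echo bin2hex($utf8), "\n"; // e4b8ade59bbd
```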
