
What does it mean that PHP does not support Unicode?

藏色散人 | Original | 2021-07-27 09:35:52

Saying that PHP does not support Unicode means that PHP strings carry no information about their character encoding. The native string functions therefore cannot know how the bytes map to text and can only assume that one character is one byte. That assumption holds for English and other ASCII text, but it produces wrong results for multi-byte characters such as Chinese.

Environment used in this article: Windows 7, PHP 7.1, Dell G3 computer.

What does it mean that PHP does not support Unicode, and why do people say so?

I often see claims that PHP does not support Unicode, or that PHP does not support Unicode at the low level. I know that encodings in PHP are painful and that the string functions are inconsistent, yet PHP can still display Chinese, so I never really understood what "does not support Unicode" meant. I spent some time sorting this out.

Let’s start with an example:

A PHP script is as follows. Assume that the encoding of the file is UTF-8:

// File encoding: UTF-8
echo strlen("中文"); // 6
echo substr("中文", 0, 1); // garbled output
echo substr("中文", 0, 3); // 中

This looks strange: one Chinese character appears to be treated as 3 characters. The explanation starts with how PHP stores strings.

I summarized it as follows:

A PHP string is an array of bytes, similar to char a[3] = "abc" in C, where each character occupies one byte.

In addition, no encoding information is stored alongside the text, so PHP does not know which encoding the bytes of a string are supposed to be in.

Going one step further, the bytes of a string literal are simply the bytes that appear in the script file, so the encoding of the file determines the encoding of the string. For example, with $string = "中文"; in a script file saved as UTF-8, the UTF-8 encoding of 中文, E4B8ADE69687, is what gets stored.

Finally, because PHP does not record the encoding of the string, the stored bytes E4B8ADE69687 are just a run of binary data as far as the native string functions are concerned. The native string functions can therefore only work byte by byte: they treat every byte as one character.

With these points in mind, the earlier example is easy to explain:

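// File encoding: UTF-8
echo bin2hex("中文"); // the bytes behind "中文" are e4b8ade69687
echo strlen("中文"); // counted byte by byte, the length is 6
echo substr("中文", 0, 1); // takes 1 byte from offset 0, i.e. e4, which is not a complete character, so the output is garbled
echo substr("中文", 0, 3); // takes 3 bytes from offset 0, which is exactly the encoding of 中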
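To make the byte-level view concrete, here is a minimal sketch that prints the string one byte at a time and slices it at character boundaries. It assumes the file is saved as UTF-8; the variable names are only illustrative.

```php
<?php
// Assumes this file is saved as UTF-8, so "中文" is stored as 6 bytes.
$text = "中文";

// str_split() cuts the string into 1-byte pieces; bin2hex() shows each byte.
$perByte = implode(' ', array_map('bin2hex', str_split($text)));
echo $perByte, PHP_EOL; // e4 b8 ad e6 96 87

// Slicing on 3-byte boundaries happens to line up with whole characters.
$head = substr($text, 0, 3);
$tail = substr($text, 3, 3);
var_dump($head === "中"); // bool(true)
var_dump($tail === "文"); // bool(true)

// Slicing 1 byte leaves an incomplete character behind.
$broken = substr($text, 0, 1);
echo bin2hex($broken), PHP_EOL; // e4
```

The slices only work here because the offsets were chosen by hand to match the 3-byte width of these particular characters; substr itself has no idea where a character starts or ends.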

Similarly, if you save the file as GBK or another encoding and repeat the experiment, you get analogous results. In GBK, for example, one Chinese character occupies 2 bytes.
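The GBK case can be tried without re-saving the file. The sketch below writes the GBK bytes of 中文 with hex escapes; the byte values d6d0 and cec4 are stated here as an assumption about the GBK code table, not something taken from the original experiment.

```php
<?php
// Assumption: "\xd6\xd0\xce\xc4" is the GBK byte sequence for 中文.
// Hex escapes keep the example independent of this file's own encoding.
$gbk = "\xd6\xd0\xce\xc4";

$length = strlen($gbk);                // 4: two bytes per character
$first  = bin2hex(substr($gbk, 0, 2)); // d6d0: the two bytes of 中
$half   = bin2hex(substr($gbk, 0, 1)); // d6: only half of a character

echo $length, PHP_EOL, $first, PHP_EOL, $half, PHP_EOL;
```

The native functions behave exactly as before; only the number of bytes per character has changed.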

At this point it should be clear what "PHP does not support Unicode at the low level" means. In summary:

PHP strings do not store the encoding of their characters, so the native string functions do not know how the binary data maps to text and can only assume that one character is one byte. That is enough for English and other ASCII text, but it goes wrong for Chinese and other multi-byte characters.

By contrast, here is a language that is said to support Unicode natively:

var string = "中文";
console.log(string.length); // 2
string.substr(0, 1); // 中

You can see that in JS these multi-byte characters are recognized and handled correctly. In other words, the language fixes the encoding of its strings. (My original guess was that the Unicode value of each character is stored. More precisely, a JS string is a sequence of UTF-16 code units, and length counts code units, so each of these Chinese characters counts as 1, while characters outside the Basic Multilingual Plane count as 2.)

This raises the question: how can multi-byte characters be handled correctly in PHP? The answer is the mbstring extension (for details, see: http://php.net/manual/zh/book.mbstring.php). The name mbstring stands for multi-byte string.

The extension provides a family of functions that mirror the native string functions and handle multi-byte characters correctly. For example, strlen corresponds to mb_strlen. These counterparts behave much like the native functions, except that they usually take one additional optional parameter: encoding.
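As a rough side-by-side of a few native functions and their mb_ counterparts, consider the sketch below. It assumes a UTF-8 source file and a PHP build with the mbstring extension loaded; the sample string is made up for illustration.

```php
<?php
// Assumes this file is saved as UTF-8 and mbstring is available.
$text = "中文abc";

$byteLen = strlen($text);                      // 9: 3 + 3 + 1 + 1 + 1 bytes
$charLen = mb_strlen($text, 'UTF-8');          // 5 characters
$first   = mb_substr($text, 0, 1, 'UTF-8');    // 中
$bytePos = strpos($text, 'a');                 // 6: offset counted in bytes
$charPos = mb_strpos($text, 'a', 0, 'UTF-8');  // 2: offset counted in characters

echo $byteLen, PHP_EOL, $charLen, PHP_EOL, $first, PHP_EOL;
echo $bytePos, PHP_EOL, $charPos, PHP_EOL;
```

Passing the encoding explicitly on every call is a deliberate choice in this sketch: it keeps the result independent of whatever internal encoding the PHP installation happens to be configured with.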

Examples are as follows:

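// Script file encoding: UTF-8
echo strlen("中文"); // 6
echo mb_strlen("中文", "UTF-8"); // 2: with the encoding given as UTF-8, the bytes e4b8ade69687 are decoded as UTF-8 and counted correctly
echo mb_strlen("中文"); // 2: when the encoding is omitted, the documentation says the internal character encoding is used, so here the string is also treated as UTF-8
echo mb_strlen("中文", "GBK"); // 3: with GBK, e4b8ade69687 is decoded as GBK, where each character takes 2 bytes, so the result is 3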
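The "internal character encoding" used when the encoding argument is omitted can be read and changed with mb_internal_encoding. The following sketch, which assumes a UTF-8 source file and the mbstring extension, shows the same bytes being counted differently depending on that setting.

```php
<?php
// Assumes this file is saved as UTF-8 and mbstring is available.
$text = "中文";

// Remember the current setting so it can be restored afterwards.
$previous = mb_internal_encoding();

mb_internal_encoding('UTF-8');
$asUtf8 = mb_strlen($text); // 2: six bytes read as two 3-byte characters

mb_internal_encoding('GBK');
$asGbk = mb_strlen($text);  // 3: the same six bytes read as three 2-byte characters

mb_internal_encoding($previous);

echo $asUtf8, PHP_EOL, $asGbk, PHP_EOL;
```

The string itself never changes; only the interpretation supplied from outside does, which is the whole point of the article.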
