String str1="a";
String str2="b";
String str3="c";
String str4="abc";
System.out.println(str1.getBytes("UTF-16").length);//4
System.out.println(str2.getBytes("UTF-16").length);//4
System.out.println(str3.getBytes("UTF-16").length);//4
System.out.println(str4.getBytes("UTF-16").length);//8
System.out.println(str1.getBytes("UTF-8").length);//1
System.out.println(str2.getBytes("UTF-8").length);//1
System.out.println(str3.getBytes("UTF-8").length);//1
System.out.println(str4.getBytes("UTF-8").length);//3
System.out.println(str1.getBytes("UTF-32").length);//4
System.out.println(str2.getBytes("UTF-32").length);//4
System.out.println(str3.getBytes("UTF-32").length);//4
System.out.println(str4.getBytes("UTF-32").length);//12
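There is something I don't understand about Unicode encodings. With UTF-8 and UTF-32, the number of bytes in the encoded str4 equals str1 + str2 + str3, but with UTF-16 it does not. How exactly does UTF-16 encode text? Any help is appreciated.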
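UTF-16 is a variable-length encoding built from two-byte code units: a character takes either two or four bytes. Because each code unit is two bytes, byte order (big endian vs. little endian) matters. In your example the byte order is not specified, so a two-byte BOM (byte order mark) is added at the front. Add the two bytes for the character itself (an ASCII letter here) and you get 4 bytes. For "abc" the BOM is written only once, so the total is 2 + 3 × 2 = 8 rather than 12. If you specify the byte order explicitly with UTF-16LE or UTF-16BE (the charset names Java accepts), no BOM is written and each of these characters takes exactly two bytes.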
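To check the byte counts described above, here is a minimal sketch. The class name `Utf16Lengths` and the helper `byteLength` are made up for this illustration; the charset names are the standard ones Java accepts.

```java
import java.nio.charset.Charset;

public class Utf16Lengths {
    // Hypothetical helper: number of bytes produced when s is encoded
    // with the charset of the given name.
    static int byteLength(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        // "UTF-16" with no byte order given: 2-byte BOM + 2 bytes per character.
        System.out.println(byteLength("a", "UTF-16"));     // 4
        System.out.println(byteLength("abc", "UTF-16"));   // 8 (BOM written once)

        // Explicit byte order: no BOM is written.
        System.out.println(byteLength("a", "UTF-16LE"));   // 2
        System.out.println(byteLength("a", "UTF-16BE"));   // 2
        System.out.println(byteLength("abc", "UTF-16LE")); // 6

        // A character outside the Basic Multilingual Plane (U+1F600)
        // needs a surrogate pair, i.e. four bytes.
        System.out.println(byteLength("\uD83D\uDE00", "UTF-16LE")); // 4
    }
}
```

With an explicit byte order, the length of "abc" is again the sum of the lengths of "a", "b" and "c", just as with UTF-8 and UTF-32.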
After UTF-16 encoding, the output starts with fe ff. This is the byte order mark, and it indicates that the bytes that follow are big-endian (the high-order byte comes first). The mark is needed because systems use two byte orders: big-endian and little-endian (the high-order byte comes last). The bytes 0x01 0x02 are read as 0x0102 in big endian but as 0x0201 in little endian, which is a different value, so the byte order has to be marked. In little-endian data the same mark appears as ff fe.
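To see the mark itself, the encoded bytes can be printed in hex. This is only a sketch: `Utf16Bom`, `hex` and `decode` are names invented for the example.

```java
import java.nio.charset.Charset;

public class Utf16Bom {
    // Hypothetical helper: encode s with the named charset and return
    // the bytes as space-separated lowercase hex.
    static String hex(String s, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(Charset.forName(charsetName))) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Hypothetical helper: decode raw bytes with the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        System.out.println(hex("a", "UTF-16"));   // fe ff 00 61 (BOM, then big-endian 'a')
        System.out.println(hex("a", "UTF-16BE")); // 00 61
        System.out.println(hex("a", "UTF-16LE")); // 61 00

        // When decoding as "UTF-16", the BOM tells the decoder which order to use.
        byte[] littleEndianWithBom = {(byte) 0xff, (byte) 0xfe, 0x61, 0x00};
        System.out.println(decode(littleEndianWithBom, "UTF-16")); // a
    }
}
```

The two bytes of 'a' (0x0061) swap places between UTF-16BE and UTF-16LE, which is exactly the ambiguity the mark resolves.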