How to Read UTF-16 Text Files in Go
Understanding the Problem
Many file formats encode textual data using UTF-16, a Unicode encoding that represents each character with one or two 16-bit code units (two or four bytes). When you read a UTF-16 file in Go, you have to decode the bytes explicitly to obtain the actual text content. Go does no such decoding by default: strings and the standard text readers assume UTF-8, so UTF-16 bytes that are used as-is show up with interleaved NUL bytes or invalid characters.
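To make the problem concrete, here is a minimal sketch that uses only the standard library. The helper decodeUTF16LE is a hypothetical name introduced for illustration, and it assumes little-endian input without a BOM; it is not the approach the rest of this article uses.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// decodeUTF16LE is a hypothetical helper: it interprets b as little-endian
// UTF-16 without a BOM and returns the text as a regular (UTF-8) Go string.
// A trailing odd byte is ignored.
func decodeUTF16LE(b []byte) string {
	units := make([]uint16, 0, len(b)/2)
	for i := 0; i+1 < len(b); i += 2 {
		units = append(units, binary.LittleEndian.Uint16(b[i:]))
	}
	// utf16.Decode combines surrogate pairs into single runes.
	return string(utf16.Decode(units))
}

func main() {
	raw := []byte{'H', 0x00, 'i', 0x00} // "Hi" encoded as UTF-16LE

	fmt.Printf("%q\n", string(raw))        // undecoded: "H\x00i\x00"
	fmt.Printf("%q\n", decodeUTF16LE(raw)) // decoded: "Hi"
}
```

The first line of output shows the NUL bytes that end up in the string when the raw bytes are converted directly; the second shows the same bytes after decoding.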
Decoding UTF-16 Files
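To read a UTF-16 file correctly, you need to specify the encoding when reading the file. The golang.org/x/text/encoding/unicode package (not the standard library's unicode package) provides unicode.UTF16 for this purpose. It returns an encoding whose NewDecoder() method yields a decoder, and transform.NewReader from golang.org/x/text/transform can wrap that decoder around any io.Reader.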
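Calling unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM) creates an encoding for UTF-16 with big-endian byte order that does not interpret a Byte Order Mark (BOM). The BOM indicates the byte order of the file. With IgnoreBOM on its own, the input is always decoded as big-endian and a leading BOM is passed through as the character U+FEFF, so a little-endian file comes out garbled. To make the code work regardless of the BOM, wrap the decoder in unicode.BOMOverride: it switches to the byte order announced by the BOM and falls back to the big-endian decoder when no BOM is present. Passing unicode.UseBOM instead of unicode.IgnoreBOM is an alternative.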
The decoded bytes are UTF-8 and can be converted to a string with a string(...) conversion. Finally, any Windows-style line endings (\r\n) are cleaned up using strings.Replace().
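Using a Scanner for UTF-16 Files
If you need to read the file line by line, you can skip ioutil.ReadFile (os.ReadFile since Go 1.16) and write a small helper instead, for example one named NewScannerUTF16. Such a helper is not part of the golang.org/x/text package; it is your own function that opens the file, wraps it in a transform.NewReader from golang.org/x/text/transform, and passes the result to bufio.NewScanner.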
The bufio.NewScanner() function creates a scanner that reads from the transformed reader, which decodes the UTF-16 bytes on the fly. By using a scanner, you can iterate over the lines of the file without having to read the entire file into memory.