How to Correctly Read and Decode UTF-16 Text Files in Go?

Linda Hamilton

Released: 2024-12-20 17:42:10

How to Read UTF-16 Text Files in Go

Understanding the Problem

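Some files store text as UTF-16, a Unicode encoding built from 16-bit code units: most characters take two bytes, and characters outside the Basic Multilingual Plane take four (a surrogate pair). Go treats string contents as UTF-8 by convention, and reading a file returns its raw bytes without any conversion. If you convert UTF-16 bytes directly to a string, ASCII characters show up interleaved with NUL bytes and the rest is garbled, so the bytes must be decoded to UTF-8 first.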
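To make the problem concrete, here is a minimal sketch that skips the decoding step. The sample bytes and the naiveString helper are hypothetical, chosen only for illustration: they represent the two-character text "Hi" as it would appear in a little-endian UTF-16 file with a BOM.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sampleUTF16LE is "Hi" encoded as UTF-16 little-endian with a leading BOM.
// (Hypothetical sample data for illustration.)
var sampleUTF16LE = []byte{0xFF, 0xFE, 'H', 0x00, 'i', 0x00}

// naiveString converts raw bytes to a string with no decoding step,
// which is what happens when file contents are passed to string() directly.
func naiveString(raw []byte) string {
	return string(raw)
}

func main() {
	s := naiveString(sampleUTF16LE)

	fmt.Printf("%q\n", s)                            // prints "\xff\xfeH\x00i\x00"
	fmt.Println("valid UTF-8:", utf8.ValidString(s)) // prints valid UTF-8: false
	fmt.Println("length in bytes:", len(s))          // prints length in bytes: 6
}
```

The two visible letters are still there, but they are surrounded by BOM and NUL bytes, and the result is not even valid UTF-8.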

Decoding UTF-16 Files

To read a UTF-16 file correctly, you need to decode its bytes explicitly. The golang.org/x/text/encoding/unicode package (a separate module, not part of the standard library) provides the unicode.UTF16 encoding for this purpose. Here is an example that reads a whole file and decodes it:

package main
 
import (
    "fmt"
    "os"
    "strings"
 
    "golang.org/x/text/encoding/unicode"
)
 
func main() {
    // Read the whole file into a []byte
    raw, err := os.ReadFile("test.txt")
    if err != nil {
        fmt.Printf("error reading file: %v\n", err)
        os.Exit(1)
    }
 
    // UTF-16 encoding: a leading BOM selects the byte order and is dropped;
    // big-endian is assumed when the file has no BOM
    enc := unicode.UTF16(unicode.BigEndian, unicode.UseBOM)
 
    // Create a decoder that converts UTF-16 to UTF-8
    decoder := enc.NewDecoder()
 
    // Decode the raw bytes
    decoded, err := decoder.Bytes(raw)
    if err != nil {
        fmt.Printf("error decoding file: %v\n", err)
        os.Exit(1)
    }
 
    // Convert the decoded bytes to a string
    text := string(decoded)
 
    // Normalize Windows-style line endings (CR+LF) to LF
    final := strings.ReplaceAll(text, "\r\n", "\n")
 
    // Print the final text
    fmt.Println(final)
}

This code uses unicode.UTF16(unicode.BigEndian, unicode.UseBOM) to create a UTF-16 decoder. The first argument is the default byte order; the second is the BOM policy. With unicode.UseBOM, a Byte Order Mark (BOM) at the start of the file selects the byte order and is removed from the output, and the big-endian default applies only when no BOM is present. Note that unicode.IgnoreBOM is not a safe substitute here: with that policy the BOM does not affect the byte order and is passed through to the output as U+FEFF, so a little-endian file would be decoded incorrectly.

The decoded bytes are then converted to a string using string(). Finally, Windows-style line endings (CR+LF) are normalized to LF using strings.ReplaceAll().
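The role of the BOM can also be shown with the standard library alone. The following is a minimal sketch, not the implementation used by golang.org/x/text: decodeUTF16 is a hypothetical helper that inspects the first two bytes, picks the byte order, and decodes with unicode/utf16. Falling back to big-endian when there is no BOM is an assumption of this sketch, chosen to match the default used in this article's examples.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"os"
	"unicode/utf16"
)

// decodeUTF16 converts UTF-16 bytes to a Go string using only the standard
// library. A leading BOM selects the byte order and is dropped; without one,
// big-endian is assumed (an assumption of this sketch).
func decodeUTF16(raw []byte) (string, error) {
	if len(raw)%2 != 0 {
		return "", fmt.Errorf("odd byte count %d: not valid UTF-16", len(raw))
	}

	var order binary.ByteOrder = binary.BigEndian
	switch {
	case len(raw) >= 2 && raw[0] == 0xFE && raw[1] == 0xFF:
		raw = raw[2:] // big-endian BOM
	case len(raw) >= 2 && raw[0] == 0xFF && raw[1] == 0xFE:
		order = binary.LittleEndian
		raw = raw[2:] // little-endian BOM
	}

	// Assemble 16-bit code units in the chosen byte order.
	units := make([]uint16, len(raw)/2)
	for i := range units {
		units[i] = order.Uint16(raw[2*i:])
	}

	// utf16.Decode combines surrogate pairs into single code points.
	return string(utf16.Decode(units)), nil
}

func main() {
	le := []byte{0xFF, 0xFE, 'H', 0x00, 'i', 0x00} // "Hi", little-endian with BOM
	be := []byte{0xFE, 0xFF, 0x00, 'H', 0x00, 'i'} // "Hi", big-endian with BOM

	for _, raw := range [][]byte{le, be} {
		s, err := decodeUTF16(raw)
		if err != nil {
			fmt.Printf("error decoding: %v\n", err)
			os.Exit(1)
		}
		fmt.Printf("%q\n", s) // prints "Hi" for both inputs
	}
}
```

Both inputs decode to the same text because the BOM, not a hard-coded setting, decides how each pair of bytes is combined.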

Using a Scanner for UTF-16 Files

If you need to read the file line by line, you can wrap the input in a decoding reader and hand it to a bufio.Scanner. NewScannerUTF16 in the example below is a helper defined in the example itself; it is not part of the golang.org/x/text package. Here is an example:

package main
 
import (
    "bufio"
    "fmt"
    "os"
 
    "golang.org/x/text/encoding/unicode"
    "golang.org/x/text/transform"
)
 
// NewScannerUTF16 opens filename and returns a scanner that yields its lines
// as UTF-8, together with the open file, which the caller must close.
func NewScannerUTF16(filename string) (*bufio.Scanner, *os.File, error) {
    file, err := os.Open(filename)
    if err != nil {
        return nil, nil, err
    }
 
    // A leading BOM selects the byte order and is dropped;
    // big-endian is assumed when the file has no BOM
    decoder := unicode.UTF16(unicode.BigEndian, unicode.UseBOM).NewDecoder()
 
    // Decode on the fly as the scanner pulls data from the file
    scanner := bufio.NewScanner(transform.NewReader(file, decoder))
    return scanner, file, nil
}
 
func main() {
    // Create a scanner for the UTF-16 file
    scanner, file, err := NewScannerUTF16("test.txt")
    if err != nil {
        fmt.Printf("error opening file: %v\n", err)
        os.Exit(1)
    }
    defer file.Close()
 
    // Read the file line by line
    for scanner.Scan() {
        fmt.Println(scanner.Text())
    }
 
    // Report any read or decoding error that stopped the scan
    if err := scanner.Err(); err != nil {
        fmt.Printf("error reading file: %v\n", err)
    }
}

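This code uses bufio.NewScanner() to create a scanner that reads from a transform.Reader, which decodes the UTF-16 bytes as they are consumed. Because transform.NewReader() accepts any io.Reader, passing it an open *os.File rather than an in-memory buffer lets you iterate over the lines without reading the entire file into memory. The scanner's default line splitting also strips a trailing carriage return, so CR+LF line endings need no extra handling.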
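To try the examples you need a UTF-16 input file. The following sketch writes one using only the standard library. The encodeUTF16LE helper and the sample text are hypothetical, and the little-endian layout with a BOM is an assumption made here because that is a common form for UTF-16 text files.

```go
package main

import (
	"fmt"
	"os"
	"unicode/utf16"
)

// encodeUTF16LE returns s encoded as UTF-16 little-endian, preceded by a BOM.
func encodeUTF16LE(s string) []byte {
	units := utf16.Encode([]rune(s))

	out := make([]byte, 0, 2+2*len(units))
	out = append(out, 0xFF, 0xFE) // little-endian BOM
	for _, u := range units {
		out = append(out, byte(u), byte(u>>8)) // low byte first
	}
	return out
}

func main() {
	// Two lines with Windows-style endings; the second has a non-ASCII letter.
	data := encodeUTF16LE("first line\r\nzweite Zeile: ü\r\n")

	// Note: this overwrites test.txt in the current directory.
	if err := os.WriteFile("test.txt", data, 0o644); err != nil {
		fmt.Printf("error writing file: %v\n", err)
		os.Exit(1)
	}
	fmt.Printf("wrote %d bytes to test.txt\n", len(data))
}
```

A decoder configured with unicode.UseBOM reads this file correctly even when its default byte order is big-endian, because the BOM takes precedence over the default.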
