How Can I Detect Invalid UTF-8 Byte Sequences in Go?


DDD

Release: 2024-12-14 22:17:11



Detecting Invalid Byte Sequences in Go

In Go, a byte slice ([]byte) that is converted to a string may contain byte sequences that do not decode to any Unicode code point, because not every byte sequence is valid UTF-8.

Two things are worth knowing when dealing with such input: how to check validity explicitly, and how Go treats invalid bytes inside strings.

UTF-8 Validity Check:

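As Tim Cooper mentions, the utf8.Valid function from the unicode/utf8 package reports whether a byte slice consists entirely of valid UTF-8. If it returns false, the slice contains at least one invalid byte sequence. For a value that is already a string, utf8.ValidString performs the same check.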
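utf8.Valid only answers yes or no. When the position of the problem matters, one possible approach is to decode the slice rune by rune with utf8.DecodeRune. The sketch below assumes a helper named firstInvalidOffset, which is hypothetical and not part of the standard library:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalidOffset is a hypothetical helper, not a standard library
// function. It returns the byte offset of the first invalid UTF-8
// sequence in b, or -1 if b is entirely valid.
func firstInvalidOffset(b []byte) int {
	for i := 0; i < len(b); {
		r, size := utf8.DecodeRune(b[i:])
		// DecodeRune reports an encoding error as (RuneError, 1).
		// A correctly encoded U+FFFD is three bytes wide, so the
		// size tells the two cases apart.
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	bad := []byte{'h', 'i', 0xff}
	good := []byte("héllo")

	fmt.Println(utf8.Valid(good))         // true
	fmt.Println(utf8.Valid(bad))          // false
	fmt.Println(firstInvalidOffset(bad))  // 2
	fmt.Println(firstInvalidOffset(good)) // -1
}
```

Checking the decoded width as well as the rune avoids flagging input that legitimately contains the character U+FFFD.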

String Conversion Considerations:

Contrary to a common assumption, Go permits converting a byte slice that is not valid UTF-8 to a string. A string in Go is essentially a read-only byte slice, so it can hold arbitrary bytes. The conversion copies the bytes as they are and performs no validation or replacement.
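A small sketch can make this concrete. The helper preservesBytes is hypothetical and exists only for illustration; it checks that a []byte to string to []byte round trip leaves the bytes untouched:

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// preservesBytes is a hypothetical helper for illustration. It reports
// whether converting b to a string and back yields the same bytes.
func preservesBytes(b []byte) bool {
	s := string(b)
	return bytes.Equal([]byte(s), b)
}

func main() {
	b := []byte{'o', 'k', 0xff}
	s := string(b)

	fmt.Println(len(s))              // 3: length is counted in bytes
	fmt.Printf("%#x\n", s[2])        // 0xff: indexing returns the raw byte
	fmt.Println(utf8.ValidString(s)) // false: the string is not valid UTF-8
	fmt.Println(preservesBytes(b))   // true: nothing was replaced
}
```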

It is only in specific situations that Go automatically performs UTF-8 decoding:

When iterating over a string using the for i, r := range s syntax, the r variable represents a Unicode code point (rune) and is always valid.
When converting from a string to a slice of runes (i.e., []rune(s)), Go decodes the entire string to runes.

In both cases, invalid UTF-8 is replaced with the replacement character U+FFFD (available as the constant utf8.RuneError). The original bytes cannot be recovered from the decoded runes, which may not be acceptable in every application, so validate the input explicitly whenever that matters.
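The following sketch shows both decoding paths on a string with one invalid byte. The helper countInvalid is hypothetical; it counts the bytes that range decoding replaces, and it skips any U+FFFD that was correctly encoded in the input:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countInvalid is a hypothetical helper for illustration. It counts the
// invalid bytes in s that a range loop decodes as U+FFFD.
func countInvalid(s string) int {
	n := 0
	for i, r := range s {
		if r != utf8.RuneError {
			continue
		}
		// A width of 1 means a decoding error; a real U+FFFD
		// occupies three bytes.
		if _, size := utf8.DecodeRuneInString(s[i:]); size == 1 {
			n++
		}
	}
	return n
}

func main() {
	s := string([]byte{'a', 0xff, 'b'})

	// Prints:
	// 0: U+0061
	// 1: U+FFFD
	// 2: U+0062
	for i, r := range s {
		fmt.Printf("%d: %U\n", i, r)
	}

	fmt.Println([]rune(s)[1] == utf8.RuneError) // true
	fmt.Println(countInvalid(s))                // 1
	fmt.Println(countInvalid("a\uFFFDb"))       // 0
}
```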

Example:

Consider the following Go program:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    a := []byte{0xff}
    s := string(a)

    // Check UTF-8 validity
    if utf8.Valid(a) {
        fmt.Println("Valid UTF-8")
    } else {
        fmt.Println("Invalid UTF-8")
    }

    // Print the string; the raw byte 0xff is written to stdout unchanged
    fmt.Println(s)
}


Output:

Invalid UTF-8
�


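In this example, the byte slice a contains the single byte 0xff, which is never valid in UTF-8, so utf8.Valid returns false and the program prints "Invalid UTF-8". The conversion to a string does not alter that byte: s still holds 0xff, and fmt.Println writes it out as is. The "�" in the output is how a terminal typically renders a byte it cannot decode, not something the conversion inserted.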
