How Can I Detect Invalid UTF-8 Byte Sequences in Go?

DDD
Release: 2024-12-14 22:17:11

Detecting Invalid Byte Sequences in Go

In Go, a byte slice ([]byte) that you convert to a string may contain byte sequences that do not decode to any Unicode code point, because not every byte sequence is valid UTF-8.

Two things are worth knowing here: how to check validity explicitly, and how Go treats invalid bytes when converting to a string.

UTF-8 Validity Check:

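As Tim Cooper mentions, the utf8.Valid function from the unicode/utf8 package reports whether a byte slice consists entirely of valid UTF-8. A result of false means the slice contains at least one invalid byte sequence. For a value that is already a string, utf8.ValidString performs the same check.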
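The check can be wrapped in a small guard that refuses to convert invalid input. This is a minimal sketch: toValidatedString and errInvalidUTF8 are hypothetical names chosen for illustration, while utf8.Valid and utf8.ValidString are the standard library functions.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errInvalidUTF8 is a hypothetical sentinel error used for illustration.
var errInvalidUTF8 = errors.New("input is not valid UTF-8")

// toValidatedString converts b to a string only if b is valid UTF-8.
func toValidatedString(b []byte) (string, error) {
	if !utf8.Valid(b) {
		return "", errInvalidUTF8
	}
	return string(b), nil
}

func main() {
	inputs := [][]byte{[]byte("héllo"), {0x68, 0xff}}
	for _, b := range inputs {
		s, err := toValidatedString(b)
		fmt.Printf("%q %v\n", s, err)
	}

	// utf8.ValidString performs the same check on a string value.
	fmt.Println(utf8.ValidString("a\xffb")) // false
}
```

Validating before converting keeps the decision about invalid input in one place, rather than relying on how later code happens to decode the string.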

String Conversion Considerations:

Contrary to a common assumption, Go permits the conversion of non-UTF-8 byte slices to strings. A string in Go is essentially a read-only byte slice, so it can hold bytes that are not valid UTF-8, and the conversion copies them unchanged without validating or replacing anything.
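A short sketch can make this concrete by converting a slice that contains an invalid byte and inspecting the result. The helper roundTrips is a hypothetical name used only for this illustration.

```go
package main

import (
	"bytes"
	"fmt"
)

// roundTrips reports whether converting b to a string and back
// yields the same bytes. (Hypothetical helper for illustration.)
func roundTrips(b []byte) bool {
	s := string(b) // copies the bytes; no validation or replacement
	return bytes.Equal([]byte(s), b)
}

func main() {
	raw := []byte{0x61, 0xff, 0x62} // 'a', an invalid byte, 'b'
	s := string(raw)

	fmt.Println(len(s))          // 3: len counts bytes, not runes
	fmt.Printf("%#x\n", s[1])    // 0xff: indexing returns the raw byte
	fmt.Println(roundTrips(raw)) // true
}
```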

Go performs UTF-8 decoding automatically only in specific situations:

  • When iterating over a string with the for i, r := range s syntax, r is the Unicode code point (rune) decoded at byte offset i.
  • When converting a string to a slice of runes (i.e., []rune(s)), Go decodes the entire string into runes.

In both cases, invalid UTF-8 sequences are replaced with the U+FFFD replacement character (utf8.RuneError). This silent replacement may not be acceptable in every application, so validate explicitly when the distinction matters.
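One subtlety is that a string may legitimately contain a correctly encoded U+FFFD, so seeing that rune in a range loop does not by itself prove the input was invalid. The sketch below, using a hypothetical helper named invalidOffsets, tells the two cases apart by the decoded width: utf8.DecodeRuneInString returns a width of 1 for an invalid encoding, whereas a real U+FFFD is 3 bytes long.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// invalidOffsets returns the byte offsets in s at which UTF-8 decoding
// fails. (Hypothetical helper for illustration.)
func invalidOffsets(s string) []int {
	var offsets []int
	for i, r := range s {
		if r != utf8.RuneError {
			continue
		}
		// range does not expose the width, so decode again at offset i.
		// Width 1 means an invalid encoding; a genuine U+FFFD has width 3.
		if _, size := utf8.DecodeRuneInString(s[i:]); size == 1 {
			offsets = append(offsets, i)
		}
	}
	return offsets
}

func main() {
	s := "a\xffb\uFFFD" // one invalid byte and one genuine U+FFFD

	fmt.Println([]rune(s))         // [97 65533 98 65533]
	fmt.Println(invalidOffsets(s)) // [1]
}
```

Both the invalid byte and the genuine U+FFFD decode to 65533 in the rune slice, which is why the width check is needed to locate the actual problem.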

Example:

Consider the following Go program:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    a := []byte{0xff}
    s := string(a)

    // Check UTF-8 validity
    if utf8.Valid(a) {
        fmt.Println("Valid UTF-8")
    } else {
        fmt.Println("Invalid UTF-8")
    }

    // Print the string; Println writes its raw bytes unchanged
    fmt.Println(s)
}

Output:

Invalid UTF-8
�

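In this example, the byte slice a holds the single byte 0xff, which can never appear in valid UTF-8, so utf8.Valid returns false and the program prints "Invalid UTF-8". The conversion string(a) does not alter the byte: s still contains 0xff. fmt.Println writes that raw byte to standard output, and a terminal that expects UTF-8 typically renders it as the replacement character "�".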
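To see which bytes a string really holds, independent of how a terminal renders them, the fmt verbs %q and % x can help. The following is a small sketch; quoted and hexBytes are hypothetical helper names.

```go
package main

import "fmt"

// quoted returns s as a Go string literal; bytes that are not valid
// UTF-8 appear as \x escapes. (Hypothetical helper for illustration.)
func quoted(s string) string {
	return fmt.Sprintf("%q", s)
}

// hexBytes returns the bytes of s in hexadecimal, separated by spaces.
// (Hypothetical helper for illustration.)
func hexBytes(s string) string {
	return fmt.Sprintf("% x", s)
}

func main() {
	s := string([]byte{0xff})

	fmt.Println(quoted(s))   // "\xff"
	fmt.Println(hexBytes(s)) // ff
}
```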
