How to Reliably Determine a File's Encoding Using its Byte Order Mark (BOM)?

Linda Hamilton

Release: 2025-01-17 01:32:09

Accurately identify file encoding: practical methods

Correctly identifying a file's encoding is crucial for text processing. However, the StreamReader.CurrentEncoding property is often inaccurate: it reflects the detected encoding only after the first read from the stream, and before that it reports the encoding the reader was constructed with. A more reliable approach is to inspect the file's Byte Order Mark (BOM) directly.
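The behavior of StreamReader.CurrentEncoding can be observed with a small experiment. The sketch below is illustrative only: the class and method names (`CurrentEncodingDemo`, `CodePagesBeforeAndAfterRead`) are hypothetical, and it assumes a file that starts with a UTF-16BE BOM.

```csharp
using System;
using System.IO;
using System.Text;

public static class CurrentEncodingDemo
{
    // Hypothetical helper: returns the code page reported by CurrentEncoding
    // before and after the first read from the stream.
    public static int[] CodePagesBeforeAndAfterRead(string path)
    {
        // ASCII is only the initial guess; BOM detection is switched on.
        using (var reader = new StreamReader(path, Encoding.ASCII, true))
        {
            int before = reader.CurrentEncoding.CodePage;
            reader.Read(); // the first read is what triggers BOM detection
            int after = reader.CurrentEncoding.CodePage;
            return new[] { before, after };
        }
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            // File.WriteAllText emits the encoding's BOM (FE FF here) before the text.
            File.WriteAllText(path, "hello", Encoding.BigEndianUnicode);
            int[] pages = CodePagesBeforeAndAfterRead(path);
            Console.WriteLine("before read: " + pages[0] + ", after read: " + pages[1]);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Comparing code pages rather than encoding names keeps the check independent of how a given runtime labels the encoding.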

The role of BOM

The BOM is a short sequence of bytes at the very start of a text file that indicates its encoding and, for UTF-16 and UTF-32, its byte order (endianness). Common BOMs include:

UTF-8: EF BB BF
UTF-16LE: FF FE
UTF-16BE: FE FF
UTF-32LE: FF FE 00 00
UTF-32BE: 00 00 FE FF
UTF-7: 2B 2F 76 (followed by one further byte; rarely used)
ASCII: No BOM
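The same byte sequences can be read back from .NET itself, since each Encoding exposes its BOM through GetPreamble(). The following is a minimal sketch; the class and helper names (`BomTable`, `PreambleHex`) are made up for illustration.

```csharp
using System;
using System.Text;

public static class BomTable
{
    // Hypothetical helper: formats an encoding's preamble (its BOM) as hex,
    // or "(none)" when the encoding defines no BOM.
    public static string PreambleHex(Encoding encoding)
    {
        byte[] preamble = encoding.GetPreamble();
        return preamble.Length == 0 ? "(none)" : BitConverter.ToString(preamble);
    }

    public static void Main()
    {
        Console.WriteLine("UTF-8:    " + PreambleHex(Encoding.UTF8));
        Console.WriteLine("UTF-16LE: " + PreambleHex(Encoding.Unicode));
        Console.WriteLine("UTF-16BE: " + PreambleHex(Encoding.BigEndianUnicode));
        Console.WriteLine("UTF-32LE: " + PreambleHex(Encoding.UTF32));
        // UTF32Encoding(bigEndian: true, byteOrderMark: true)
        Console.WriteLine("UTF-32BE: " + PreambleHex(new UTF32Encoding(true, true)));
        Console.WriteLine("ASCII:    " + PreambleHex(Encoding.ASCII));
    }
}
```

Note that the UTF-32LE preamble begins with the same two bytes as the UTF-16LE one, which is why any detection code has to test the longer sequence first.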

Determine file encoding based on BOM

The following C# code provides a detailed implementation:

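<code class="language-csharp">// Requires: using System.IO; using System.Text;
public static Encoding GetEncoding(string filename)
{
    // Read up to four bytes, enough to hold the longest BOM checked below.
    byte[] bom = new byte[4];
    int count = 0;
    using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
    {
        int n;
        // Read may return fewer bytes than requested, so loop until the buffer is full or the file ends.
        while (bom.Length > count && (n = file.Read(bom, count, bom.Length - count)) > 0)
            count += n;
    }

    // Analyze the BOM. The length checks stop a short file (for example, just FF FE)
    // from matching a longer pattern through the zero-filled tail of the buffer.
    if (count >= 3 && bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7; // UTF-7 (Encoding.UTF7 is obsolete in .NET 5+)
    if (count >= 3 && bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
    // UTF-32LE must be tested before UTF-16LE because both start with FF FE.
    if (count >= 4 && bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32; // UTF-32LE
    if (count >= 2 && bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; // UTF-16LE
    if (count >= 2 && bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; // UTF-16BE
    if (count >= 4 && bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true); // UTF-32BE

    // If no BOM is detected, fall back to ASCII
    return Encoding.ASCII;
}</code>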

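Using this method, you can identify the encoding of any text file that begins with a BOM, ensuring correct data interpretation and text processing. Files without a BOM fall back to ASCII, which can be wrong for content such as UTF-8 saved without a BOM, so treat that result as a default rather than a detection.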
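As a way to check BOM detection end to end, the sketch below writes a file with a known encoding and detects it again. It is a table-driven variant built on Encoding.GetPreamble(), not the implementation shown above; the names `BomRoundTrip` and `DetectBom` are hypothetical, and UTF-7 is left out.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class BomRoundTrip
{
    // Candidates sorted longest preamble first, so UTF-32LE (FF FE 00 00)
    // is tried before UTF-16LE (FF FE).
    private static readonly Encoding[] Candidates =
        new Encoding[]
        {
            Encoding.UTF8,
            Encoding.Unicode,
            Encoding.BigEndianUnicode,
            Encoding.UTF32,
            new UTF32Encoding(true, true)
        }
        .OrderByDescending(e => e.GetPreamble().Length)
        .ToArray();

    // Hypothetical helper: returns the candidate whose preamble prefixes the data,
    // or the supplied fallback when no BOM matches.
    public static Encoding DetectBom(byte[] data, Encoding fallback)
    {
        foreach (Encoding candidate in Candidates)
        {
            byte[] preamble = candidate.GetPreamble();
            if (data.Length >= preamble.Length &&
                data.Take(preamble.Length).SequenceEqual(preamble))
            {
                return candidate;
            }
        }
        return fallback;
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            // File.WriteAllText emits the encoding's BOM before the text.
            File.WriteAllText(path, "hello", Encoding.Unicode);
            // Reading the whole file keeps the demo short; real code only needs the first four bytes.
            Encoding detected = DetectBom(File.ReadAllBytes(path), Encoding.ASCII);
            Console.WriteLine("detected code page: " + detected.CodePage);
            Console.WriteLine(File.ReadAllText(path, detected));
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Passing the fallback in as a parameter makes the no-BOM case an explicit choice by the caller instead of a hard-coded assumption.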
