Accurately identify file encoding: practical methods
Correctly identifying a file's encoding is crucial for text processing. However, the StreamReader.CurrentEncoding
property often does not report the file's real encoding, for example when it is queried before anything has been read. A more reliable approach is to inspect the file's Byte Order Mark (BOM) directly.
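One way CurrentEncoding can mislead: StreamReader only runs its BOM detection on the first read, so until then it reports the encoding passed to the constructor. The following is a minimal sketch of that behavior, not a definitive test. The class name CurrentEncodingDemo, the helper Probe, and the temp file name are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class CurrentEncodingDemo
{
    // Returns the encoding names reported before and after the first read.
    public static (string Before, string After) Probe(string path)
    {
        using (var reader = new StreamReader(path, Encoding.UTF8, detectEncodingFromByteOrderMarks: true))
        {
            string before = reader.CurrentEncoding.WebName; // still the constructor argument
            reader.Read();                                   // the first read triggers BOM detection
            string after = reader.CurrentEncoding.WebName;  // now reflects the BOM, if one was found
            return (before, after);
        }
    }

    public static void Main()
    {
        // Hypothetical scratch file; Encoding.Unicode writes an FF FE BOM (UTF-16LE).
        string path = Path.Combine(Path.GetTempPath(), "bom-demo-utf16.txt");
        File.WriteAllText(path, "hello", Encoding.Unicode);

        var (before, after) = Probe(path);
        Console.WriteLine($"before first read: {before}");
        Console.WriteLine($"after first read:  {after}");

        File.Delete(path);
    }
}
```

Even after the first read, the property can only reflect a BOM that is actually present; for a file without one it keeps reporting the constructor's encoding.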
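The role of the BOM
The BOM is a short byte sequence at the start of a text file that indicates its encoding and, for UTF-16 and UTF-32, its byte order (endianness). Common BOMs include:
UTF-8: EF BB BF
UTF-16 (little-endian): FF FE
UTF-16 (big-endian): FE FF
UTF-32 (little-endian): FF FE 00 00
UTF-32 (big-endian): 00 00 FE FF
UTF-7: 2B 2F 76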
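In .NET, each Encoding object exposes the BOM it would write through GetPreamble(), which makes these byte sequences easy to check. A small sketch, where the class name PreambleDemo and the helper BomHex are illustrative names rather than part of any library:

```csharp
using System;
using System.Text;

public static class PreambleDemo
{
    // Formats an encoding's BOM as dash-separated hex; empty if the encoding has no BOM.
    public static string BomHex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble());
    }

    public static void Main()
    {
        Console.WriteLine($"UTF-8:    {BomHex(Encoding.UTF8)}");
        Console.WriteLine($"UTF-16LE: {BomHex(Encoding.Unicode)}");
        Console.WriteLine($"UTF-16BE: {BomHex(Encoding.BigEndianUnicode)}");
        Console.WriteLine($"UTF-32LE: {BomHex(Encoding.UTF32)}");
        Console.WriteLine($"UTF-32BE: {BomHex(new UTF32Encoding(bigEndian: true, byteOrderMark: true))}");
    }
}
```

Note that the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM, so any detector has to test the longer sequence first.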
Determine file encoding based on BOM
The following C# code provides a detailed implementation:
<code class="language-csharp">using System.IO;
using System.Text;

public static Encoding GetEncoding(string filename)
{
    // Read the BOM: up to the first 4 bytes (unread positions stay zero)
    byte[] bom = new byte[4];
    int read;
    using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
    {
        read = file.Read(bom, 0, 4);
    }

    // Analyze the BOM
    if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7; // UTF-7 (marked obsolete in .NET 5+)
    if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
    // Require 4 bytes actually read, so a 2-byte UTF-16LE file is not taken for UTF-32LE
    if (read >= 4 && bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32; // UTF-32LE
    if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; // UTF-16LE
    if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; // UTF-16BE
    if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true); // UTF-32BE

    // If no BOM is detected, fall back to ASCII
    return Encoding.ASCII;
}</code>
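A BOM check like this can also be written in a table-driven way that returns null when no BOM is found, so the caller decides on the fallback instead of silently getting ASCII. The sketch below is one possible variant, not the article's method: the names BomSniffer, DetectBom, and StartsWith, the temp file name, and the UTF-8 fallback in Main are all assumptions made for illustration, and UTF-7 is left out because Encoding.UTF7 is obsolete in current .NET.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomSniffer
{
    // Ordered so the 4-byte UTF-32LE BOM is tested before the 2-byte
    // UTF-16LE BOM that it starts with.
    private static readonly Encoding[] Candidates =
    {
        new UTF32Encoding(bigEndian: true, byteOrderMark: true), // 00 00 FE FF
        Encoding.UTF32,                                           // FF FE 00 00
        Encoding.UTF8,                                            // EF BB BF
        Encoding.Unicode,                                         // FF FE
        Encoding.BigEndianUnicode,                                // FE FF
    };

    // True if the first 'length' bytes of 'head' begin with 'bom'.
    private static bool StartsWith(byte[] head, int length, byte[] bom)
    {
        if (bom.Length == 0 || length < bom.Length) return false;
        for (int i = 0; i < bom.Length; i++)
        {
            if (head[i] != bom[i]) return false;
        }
        return true;
    }

    // Returns the encoding whose BOM starts the file, or null if there is none.
    public static Encoding DetectBom(string path)
    {
        byte[] head = new byte[4];
        int read;
        using (var file = File.OpenRead(path))
        {
            // For a small read at the start of a local file this normally
            // returns everything available; 'read' records how much that was.
            read = file.Read(head, 0, head.Length);
        }

        foreach (Encoding candidate in Candidates)
        {
            if (StartsWith(head, read, candidate.GetPreamble())) return candidate;
        }
        return null;
    }

    public static void Main()
    {
        // Hypothetical scratch file written as UTF-16BE with a BOM.
        string path = Path.Combine(Path.GetTempPath(), "bom-sniffer-demo.txt");
        File.WriteAllText(path, "hello", Encoding.BigEndianUnicode);

        // Assumption: UTF-8 is an acceptable default for files without a BOM.
        Encoding encoding = DetectBom(path) ?? Encoding.UTF8;
        Console.WriteLine($"{encoding.WebName}: {File.ReadAllText(path, encoding)}");

        File.Delete(path);
    }
}
```

Returning null keeps "no BOM" distinguishable from "BOM says ASCII", which matters because a BOM-less file may be UTF-8 or a legacy code page rather than ASCII.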
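Using this method, you can identify the encoding of any text file that starts with a BOM. Files without a BOM, which is common for UTF-8 and legacy code pages, fall through to the ASCII default, so their actual encoding has to be determined by other means.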