Unicode (UTF-8) File I/O in Python
In Python, handling Unicode text in files involves encoding and decoding operations. However, understanding these concepts can be challenging, as exemplified by a common issue:
Decoding Confusion:
Consider the following code in Python 2.4:
<code class="python">ss = u'Capit\xe1n'
ss8 = ss.encode('utf8')
print(ss, ss8)</code>
This code outputs:
(u'Capit\xe1n', 'Capit\xc3\xa1n')
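The a-acute character (á) is represented differently in the Unicode string (u'Capit\xe1n', a single code point, U+00E1) and in its UTF-8 encoding (ss8 = 'Capit\xc3\xa1n', two bytes). Python 2 shows the repr() of each value here, which escapes non-ASCII characters and bytes, hence the \xe1 and \xc3\xa1 sequences.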
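The difference between the two representations can be checked directly. The following is a minimal sketch in Python 3 syntax (an assumption; the snippet above targets Python 2.4), where str holds code points and bytes holds the encoded form. The variable names are illustrative only.

```python
# Python 3 sketch: str holds code points, bytes holds the encoded form.
text = "Capit\xe1n"               # same characters as u'Capit\xe1n' in Python 2
encoded = text.encode("utf-8")    # the UTF-8 byte sequence for the string

# The a-acute is one code point but occupies two bytes in UTF-8.
char_count = len(text)
byte_count = len(encoded)

# Decoding with the same codec recovers the original string.
decoded = encoded.decode("utf-8")

print(repr(encoded), char_count, byte_count, decoded == text)
```

The string has 7 characters but 8 bytes once encoded, which is why the two values must not be treated as interchangeable when writing to a file.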
Opening the file 'f1' in write mode and writing ss8 followed by a newline stores the bytes 'Capit\xc3\xa1n\n' in the file, which is valid UTF-8. Conversely, writing ss itself to another file 'f2' opened with the plain built-in open() fails in Python 2: the unicode string is implicitly encoded with the ASCII codec, which cannot represent the a-acute character, so a UnicodeEncodeError is raised.
Decoding Solution:
To resolve this confusion, specify the encoding explicitly when opening the file. In Python 2.6 and later, the io.open function can be used:
<code class="python">import io
f = io.open("test", mode="r", encoding="utf-8")</code>
With this approach the file object decodes UTF-8 on read (and encodes to UTF-8 when opened in a write mode), so read() returns Unicode strings and no manual encoding or decoding is needed. In Python 3.x, the io.open function is an alias for the built-in open function, which also supports the encoding argument.
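A round trip makes the behaviour concrete. The sketch below assumes Python 2.6+ or 3.x and writes to a temporary file; the roundtrip helper and the temporary path are hypothetical names introduced only for illustration.

```python
import io
import os
import tempfile


def roundtrip(text):
    """Write text to a temporary UTF-8 file, then read it back.

    Returns the decoded text and the raw bytes stored on disk.
    """
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # Text mode with an explicit encoding: write() accepts Unicode text.
        with io.open(path, mode="w", encoding="utf-8") as f:
            f.write(text)
        # Reading with the same encoding returns Unicode text again.
        with io.open(path, mode="r", encoding="utf-8") as f:
            read_back = f.read()
        # Binary mode shows the bytes that were actually stored.
        with io.open(path, mode="rb") as f:
            raw = f.read()
        return read_back, raw
    finally:
        os.remove(path)


read_back, raw = roundtrip(u"Capit\xe1n")
print(read_back == u"Capit\xe1n", repr(raw))
```

The file object handles the conversion in both directions, so the calling code only ever deals with Unicode strings.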
Alternatively, the codecs module can be used:
<code class="python">import codecs
f = codecs.open("test", "r", "utf-8")</code>
Note that mixing the read() and readline() methods on the same file object may cause issues when using codecs.open.
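One way to stay clear of that problem is to use a single access pattern per file object. The following is a minimal sketch, assuming a temporary file and sample text chosen only for illustration, that writes with codecs.open and then reads back by iterating over lines only.

```python
import codecs
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # codecs.open encodes Unicode text to UTF-8 on write.
    with codecs.open(path, "w", "utf-8") as f:
        f.write(u"Capit\xe1n\nse\xf1or\n")

    # Read with one access pattern only: iterate line by line,
    # without calling read() on the same file object.
    with codecs.open(path, "r", "utf-8") as f:
        lines = [line.rstrip(u"\n") for line in f]
finally:
    os.remove(path)

print(lines == [u"Capit\xe1n", u"se\xf1or"])
```

For new code, io.open (or the built-in open in Python 3) is generally the simpler choice, since it is the same file object type used throughout the standard library.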