Python - Unicode Text Files

Introduction

To read or write a file containing non-ASCII Unicode text, pass an encoding name to open() through its encoding argument.

In this mode, Python text files automatically encode on writes and decode on reads, using the encoding scheme you name.

In Python 3.X:

Demo

S = 'test\xc4m'                                        # Non-ASCII Unicode text
print(S)
print(S[4])                                            # Sequence of characters: S[4] is the non-ASCII 'Ä'
file = open('unidata.txt', 'w', encoding='utf-8')      # Write/encode UTF-8 text
file.write(S)                                          # 6 characters written
file.close()
text = open('unidata.txt', encoding='utf-8').read()    # Read/decode UTF-8 text
print(text)
print(len(text))                                       # 6 chars (code points)
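
The same write-then-read round trip can also be sketched with context managers, which close each file automatically. This is only an illustrative variant: the helper name round_trip and the file name unidata_demo.txt are made up here, and a temporary directory is used so nothing is left behind.

```python
import os
import tempfile

def round_trip(text, encoding='utf-8'):
    """Write text to a temporary file, then read it back with the same encoding."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'unidata_demo.txt')   # hypothetical file name
        with open(path, 'w', encoding=encoding) as f:
            written = f.write(text)                    # number of characters written
        with open(path, encoding=encoding) as f:
            return written, f.read()                   # decoded back to str

written, text_back = round_trip('test\xc4m')
print(written, text_back == 'test\xc4m')
```

Because the same encoding name is used on both sides, the string read back compares equal to the string written.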

You can see what's truly stored in your file by stepping into binary mode:

Demo

raw = open('unidata.txt', 'rb').read()      # Read raw encoded bytes
print(raw)                                  # b'test\xc3\x84m'
print(len(raw))                             # Really 7 bytes in UTF-8
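
To see where the extra byte comes from, a small sketch can count the UTF-8 bytes each character needs. The helper name utf8_sizes is invented for this illustration and is not part of any library.

```python
def utf8_sizes(text):
    """Return (character, UTF-8 byte count) pairs for each character in text."""
    return [(ch, len(ch.encode('utf-8'))) for ch in text]

sizes = utf8_sizes('test\xc4m')
total = sum(count for _, count in sizes)
print(sizes)
print(total)
```

The ASCII characters take one byte each, while the non-ASCII character takes two, which accounts for the difference between the character count and the byte count.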

You can encode and decode manually if you get Unicode data from a source other than a file:

Demo

raw = open('unidata.txt', 'rb').read()      # Read raw encoded bytes
text = 'test\xc4m'                          # Same non-ASCII string as before
print(text.encode('utf-8'))                 # Manual encode to bytes
print(raw.decode('utf-8'))                  # Manual decode to str
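
As a rough sketch of the non-file case, the bytes below come from an in-memory buffer that stands in for a socket or pipe. The buffer contents and the variable names payload and reply are assumptions made for illustration only.

```python
import io

# Stand-in for bytes arriving from a non-file source (assumed to be UTF-8)
payload = io.BytesIO(b'test\xc3\x84m')

raw = payload.read()                  # bytes, not yet text
text = raw.decode('utf-8')            # manual decode to str
reply = text.upper().encode('utf-8')  # process as text, then encode manually

print(len(raw), len(text))
print(reply)
```

Decoding at the boundary and encoding again on the way out keeps all processing in between on str objects.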

To see how text files would automatically encode the same string under different encoding names, encode it manually with each name:

Demo

text = 'test\xc4m'                                               # Non-ASCII Unicode text
print(text.encode('latin-1'))                                    # Bytes differ in others
print(text.encode('utf-16'))                                     # Starts with a byte order mark
print(len(text.encode('latin-1')), len(text.encode('utf-16')))   # 6 and 14 bytes
print(b'\xff\xfet\x00e\x00s\x00t\x00\xc4\x00m\x00'.decode('utf-16'))   # But same string decoded
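
A short sketch can confirm that a text file stores the same bytes as a manual encode under each encoding name. The helper file_bytes and the file name encoded.txt are hypothetical, chosen just for this example.

```python
import os
import tempfile

def file_bytes(text, encoding):
    """Write text to a temporary file in the given encoding; return the raw bytes stored."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'encoded.txt')        # hypothetical file name
        with open(path, 'w', encoding=encoding) as f:  # automatic encode on write
            f.write(text)
        with open(path, 'rb') as f:                    # binary mode: no decoding
            return f.read()

text = 'test\xc4m'
sizes = {enc: len(file_bytes(text, enc)) for enc in ('latin-1', 'utf-8', 'utf-16')}
print(sizes)
print(file_bytes(text, 'utf-16') == text.encode('utf-16'))
```

The stored sizes differ by encoding, yet reading each file back with its own encoding name yields the same six-character string.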