encoding - Why don't I see the Hebrew characters when I print text from a UTF-8 file in Python?
I'm trying to read a Hebrew text file:
def task1():
    f = open('c:\\users\\royi\\desktop\\final project\\corpus-haaretz.txt', 'r', "utf-8")
    print 'success'
    return f

f = task1()
When I read it, it shows me this:
'[\xee\xe0\xee\xf8 \xee\xf2\xf8\xeb\xfa \xf9\xec \xe4\xf0\xe9\xe5-\xe9\xe5\xf8\xf7 \xe8\xe9\xe9\xee\xf1: \xf2\xec \xe1\xe9\xfa \xe4\xee\xf9\xf4\xe8 \xec\xe1\xe8\xec \xe0\xfa \xe7\xe5\xf7 \xe4\xe7\xf8\xed, \xec\xe8\xe5\xe1\xfa \xe9\xf9\xf8\xe0\xec \xee\xe0\xfa \xf0\xe9\xe5
and many more.
How do I read it?
You should print it like this:
print task1().encode('your terminal encoding here')
You must make sure your terminal is able to display Hebrew characters. For example, on a fully UTF-8 Linux distribution with Hebrew locales installed:
print task1().encode('utf-8')
Be careful with open:
- With Python 2.7, open has no encoding parameter. Use the codecs module instead.
- With Python 3+, the encoding parameter is the fourth one, not the third, so you probably mean open(path, 'r', encoding='utf-8'). You can also omit 'r'.
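A minimal sketch of both approaches; the file path and the one-word Hebrew contents here are made up for illustration (io.open works on both Python 2.7 and 3, and accepts encoding as a keyword):

```python
import codecs
import io
import os
import tempfile

# Hypothetical sample file standing in for corpus-haaretz.txt
path = os.path.join(tempfile.mkdtemp(), 'corpus.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'\u05e9\u05dc\u05d5\u05dd')  # the Hebrew word "shalom"

# Python 2.7: the built-in open() has no encoding parameter, so use codecs
text = codecs.open(path, 'r', encoding='utf-8').read()

# Python 3 (and io.open on 2.7): encoding is a keyword argument
text2 = io.open(path, 'r', encoding='utf-8').read()

assert text == text2 == u'\u05e9\u05dc\u05d5\u05dd'
```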
So why use encode?
Well, when you read a file and tell Python its encoding, it returns a unicode object, not a str object. For example, on my system:
>>> import codecs
>>> content = codecs.open('/etc/fstab', encoding='utf-8').read()
>>> type(content)
<type 'unicode'>
>>> type('')
<type 'str'>
>>> type(u'')
<type 'unicode'>
You need to encode the unicode object if you want a printable str that contains non-ASCII characters:
>>> type(content.encode('utf-8')) <type 'str'>
We use encode because here we are dealing with a more or less generic text object (unicode is generic, so you can do text manipulation on it), and we turn it (encode it) into a specific representation (UTF-8).
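On Python 3 the same distinction exists under different names (str for generic text, bytes for the encoded representation); a minimal illustration, again using the made-up Hebrew word "shalom":

```python
text = u'\u05e9\u05dc\u05d5\u05dd'  # generic text object ("shalom")
data = text.encode('utf-8')         # specific byte representation

# Each Hebrew letter becomes two bytes in UTF-8
assert data == b'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
# Decoding with the same encoding round-trips back to the text object
assert data.decode('utf-8') == text
```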
And we need a specific representation because the system doesn't know about Python's internals, and can only print ASCII characters if you don't specify an encoding. When you output, you encode to an encoding the system can understand. For me it's luckily 'utf-8', so it's easy. If you're on Windows, it can be tricky.
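One way to avoid guessing the terminal encoding is to ask Python what it detected via sys.stdout.encoding; the ASCII fallback and the 'replace' error handler below are my own defensive additions, not part of the original answer:

```python
import sys

text = u'\u05e9\u05dc\u05d5\u05dd'  # the Hebrew word "shalom"

# sys.stdout.encoding may be None when output is piped; fall back to ASCII
enc = sys.stdout.encoding or 'ascii'

# 'replace' substitutes '?' for characters the terminal cannot display,
# instead of raising UnicodeEncodeError
print(text.encode(enc, 'replace').decode(enc))
```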