encoding - Why don't I see the Hebrew characters when I print text from a UTF-8 file in Python?


I'm trying to read a Hebrew text file:

def task1():
    f = open('c:\\users\\royi\\desktop\\final project\\corpus-haaretz.txt', 'r',"utf-8")
    print 'success'
    return f

f = task1()

When I read it, it shows me this:

'[\xee\xe0\xee\xf8 \xee\xf2\xf8\xeb\xfa \xf9\xec \xe4\xf0\xe9\xe5-\xe9\xe5\xf8\xf7 \xe8\xe9\xe9\xee\xf1: \xf2\xec \xe1\xe9\xfa \xe4\xee\xf9\xf4\xe8 \xec\xe1\xe8\xec \xe0\xfa \xe7\xe5\xf7 \xe4\xe7\xf8\xed, \xec\xe8\xe5\xe1\xfa \xe9\xf9\xf8\xe0\xec \xee\xe0\xfa \xf0\xe9\xe5  

And many more lines like it.

How can I read it?

You should print it like this:

print task1().read().encode('your terminal encoding here')

You must make sure your terminal is able to display Hebrew characters. For example, on a fully UTF-8 Linux distribution with Hebrew locales installed:

print task1().read().encode('utf-8')

Be careful with open:

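  • With Python 2.7, open has no encoding parameter (its third argument is the buffer size). Use the codecs module instead.
  • With Python 3+, the encoding parameter is the fourth one, not the third as in your code. You probably mean open(path, 'r', encoding='utf-8'). You can omit the 'r'.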
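As a rough sketch of the two points above, here is one way to open a file with an explicit encoding that should behave the same on Python 2.7 and 3. It uses io.open rather than the codecs.open suggested in this answer (my substitution: io.open is the same function as the built-in open on Python 3). The read_text helper and the temporary demo file are made up for illustration; they are not from the question.

```python
import io
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Open `path` with an explicit encoding and return the decoded text."""
    # Pass the encoding by keyword: positionally, the third argument is buffering.
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


# Demo on a throwaway file so the sketch does not depend on any real path.
sample = u"\u05e9\u05dc\u05d5\u05dd"  # the Hebrew word "shalom"
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with io.open(demo_path, "w", encoding="utf-8") as f:
        f.write(sample)
    text = read_text(demo_path)
finally:
    os.remove(demo_path)
```

Here text is a unicode object on Python 2 and a str on Python 3, which is the "generic" text type discussed further down.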

So why use encode?

Well, when you read a file and tell Python its encoding, it returns a unicode object, not a str object. For example, on my system:

>>> import codecs
>>> content = codecs.open('/etc/fstab', encoding='utf-8').read()
>>> type(content)
<type 'unicode'>
>>> type('')
<type 'str'>
>>> type(u'')
<type 'unicode'>

You need to encode it if you want to turn it into a printable string when it contains non-ASCII characters:

>>> type(content.encode('utf-8'))
<type 'str'>

We use encode because here we are dealing with a more or less generic text object (unicode is as generic as text manipulation gets), and we need to turn (encode) it into a specific representation (UTF-8).
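To make the decode/encode distinction concrete, here is a small sketch of the round trip: bytes in one specific representation are decoded into generic text, then encoded into another. The sample bytes are the first word of the output shown in the question. They decode cleanly as Windows-1255 (cp1255), which suggests the file may not actually be UTF-8; that is my own observation, not something the asker confirmed.

```python
# First word of the raw output shown in the question.
raw = b"\xee\xe0\xee\xf8"

# decode: specific byte representation -> generic text (unicode on Python 2).
# Assumption: the bytes are Windows-1255; this is a guess based on their values.
text = raw.decode("cp1255")

# encode: generic text -> another specific byte representation.
utf8_bytes = text.encode("utf-8")
```

The same four letters take four bytes in cp1255 and eight bytes in UTF-8, so decoding with the wrong codec either fails or produces garbage.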

And we need a specific representation because the system doesn't know about Python's internals, and Python can only print ASCII characters if you don't specify an encoding. So when you output text, you encode it into an encoding the system can understand. For me, luckily, it's 'utf-8', so it's easy. If you are on Windows, it can be trickier.
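One possible way to pick the output encoding instead of hard-coding it is to ask sys.stdout, falling back to UTF-8 when it reports nothing. This is only a sketch: encode_for_terminal is a hypothetical helper, and the 'replace' error handler is my choice so that a terminal that cannot show Hebrew gets '?' instead of a UnicodeEncodeError.

```python
import sys


def encode_for_terminal(text, encoding=None):
    """Encode `text` for output, replacing characters the target cannot represent."""
    # sys.stdout.encoding can be None, e.g. when output is piped on Python 2.
    target = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(target, "replace")


hebrew = u"\u05e9\u05dc\u05d5\u05dd"  # the Hebrew word "shalom"
as_utf8 = encode_for_terminal(hebrew, "utf-8")
as_ascii = encode_for_terminal(hebrew, "ascii")  # each Hebrew letter becomes '?'
```

On Python 2 you would then write `print encode_for_terminal(content)`. On Python 3, print already encodes str for the terminal, so a helper like this mostly matters when writing bytes yourself.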

