encoding - Python BeautifulSoup: find span id name without using string/re methods
I'm trying to get the id name of span tags.
<td valign="top" colspan="2"><img height="25" src="images/spacer.gif" width="1"><br> <!--start table details--> <table cellspacing="1" cellpadding="5" width="100%" bgcolor="#a18c42" border="0" id="compdetails"> <tr bgcolor="white"> <td class="rowname" nowrap>מספר תאגיד:</td> <td width="100%" colspan="3"><span id="lblcompanynumber">520000472</span></td> </tr> <tr bgcolor="white"> <td class="rowname" nowrap>שם תאגיד (עברית):</td> <td width="50%"><span id="lblcompanynameheb">חברת החשמל לישראל בעמ</span></td> <td class="rowname" nowrap>שם תאגיד (אנגלית):</td> <td width="50%"><span id="lblcompanynameen"></span></td> </tr> <tr bgcolor="white"> <td class="rowname" nowrap>סטטוס:</td> <td width="50%"><span id="lblstatus">פעילה</span></td> <td class="rowname" nowrap>סוג תאגיד:</td> <td width="50%"><span id="lblcorporationtype">חברה ציבורית</span></td> </tr> <tr bgcolor="white"> <td class="rowname" nowrap>סוג חברה ממשלתית:</td> <td width="50%"><span id="lblgovcompanytype">חברה ממשלתית</span></td> <td class="rowname" nowrap>סוג מגבלות:</td> <td width="50%"><span id="lbllimittype">מוגבלת</span></td>
Let's say htmlspan contains the HTML above:
    soup = BeautifulSoup(htmlspan, fromEncoding="windows-1255")  # I want to use windows-1255, not UTF-8
    spans = soup('span', limit=30)
This is the output:
[<span class="maintitle">╫¿╫⌐╫¥ ╫פ╫ק╫ס╫¿╫ץ╫¬</span>, <span class="subtitle">╫ñ╫¿╫ר╫ש ╫ק╫ס╫¿╫פ/╫⌐╫ץ╫¬╫ñ╫ץ╫¬</span>, <span id="lblcompanynumber">514568245</span>, <span id="lblcompanynameheb">╫£╫ס╫ש╫נ ╫נ╫ש╫á╫ר╫ע╫¿╫ª╫ש╫פ ╫ץ╫á╫ש╫¬╫ץ╫ק ╫₧╫ó╫¿╫¢╫ ץ╫¬ ╫ס╫ó"╫₧</span>, <span id="lblcompanynameen">lavi integration &system; analysis ltd</span>, <span id="lblstatus">╫ñ╫ó╫ש╫£╫פ</span>, <span id="lblcorporationtype">╫ק╫ס╫¿╫פ ╫ñ╫¿╫ר╫ש╫¬</span>, <span id="lblgovcompanytype">╫ק╫ס╫¿╫פ ╫£╫נ ╫₧╫₧╫⌐╫£╫¬╫ש╫¬</span>, <span id="lbllimittype">╫₧╫ץ╫ע╫ס╫£╫¬</span>, <span id="lblstatusmafera"><b><font color="red"></font></b></span>, <span id="lblmaferadate"></span>, <span id="lblstatusmafera1"><b><font color="red"></font></b></span>, <span id="lblcountry">╫ש╫⌐╫¿╫נ╫£</span>, <span id="lblcity">╫ק╫ף╫¿╫פ</span>, <span id="lblstreet">╫פ╫£╫£ ╫ש╫ñ╫פ</span>, <span id="lblstreetnumber">34</span>, <span id="lblzipcode">38424</span>, <span id="lblpob"></span>, <span id="lbllocatedat"></span>, <span id="lblcompanygoal">╫£╫ó╫í╫ץ╫º ╫ס╫¢╫£ ╫ó╫ש╫í╫ץ╫º ╫ק╫ץ╫º╫ש</span>, <span id="lblcompanydesc"></span>, <span id="lbldochshana"></span>]
I know how to get the span content, but I can't get the span id name ('lblstatus', for example).
How can I do this using BeautifulSoup's methods?
I'm also having trouble saving the spans' content without BeautifulSoup converting the charset to UTF-8 (or to gibberish). In the end I need to save the span id name and content to a CSV, and I'm having UTF-8 problems with it.
Thanks.
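I can't get the span id name ('lblstatus', for example).
Using the spans set from your own code:
    for span in spans:
        # .get() returns None for spans that have no id (e.g. the class-only ones)
        print(span.get('id'))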
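For readers on a newer stack, the same lookup can be sketched as follows. This assumes Python 3 and the `bs4` package rather than the BeautifulSoup 3 API used in the question, and the HTML fragment is a hypothetical one modelled on the question's markup.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the question's markup,
# encoded as windows-1255 bytes (Python calls this codec "cp1255").
html_bytes = (
    '<table><tr>'
    '<td><span class="maintitle">title</span></td>'
    '<td><span id="lblcompanynumber">520000472</span></td>'
    '<td><span id="lblstatus">פעילה</span></td>'
    '</tr></table>'
).encode("cp1255")

# from_encoding tells bs4 how to decode the incoming bytes.
soup = BeautifulSoup(html_bytes, "html.parser", from_encoding="windows-1255")

# id=True matches only spans that actually carry an id attribute,
# so the class-only span is skipped instead of raising KeyError.
pairs = [(span["id"], span.get_text()) for span in soup.find_all("span", id=True)]

print([span_id for span_id, _ in pairs])
```

No string or `re` handling is needed: the id is read as a tag attribute, and the text comes back as a Unicode string.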
I'm having trouble saving the spans' content without BeautifulSoup converting it to UTF-8 or gibberish
I cannot replicate this: for me the output of spans is not gibberish, it shows the same characters as in the HTML. Are you sure the page you are trying to parse is encoded in "windows-1255"? Do you have a proper UTF-8 encoding declaration (# -*- coding: utf-8 -*-) in your Python file?
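One possible explanation for the gibberish, offered as a guess rather than a diagnosis: the output in the question has the shape of UTF-8 bytes displayed through the DOS Hebrew code page (cp862), as a Windows console might do. The sketch below uses only the standard library and assumes Python 3. The choice of cp862 is an assumption and is not stated anywhere in the question.

```python
# Assumption: the terminal decodes output bytes as cp862 (DOS Hebrew).
text = "פעילה"  # content of the lblstatus span in the question's HTML

utf8_bytes = text.encode("utf-8")        # what a UTF-8 print would emit
mojibake = utf8_bytes.decode("cp862")    # what a cp862 console would display

# The damage is only in the display: the bytes still round-trip cleanly.
recovered = mojibake.encode("cp862").decode("utf-8")

print(recovered == text)
```

If this is what is happening, the parsed data itself is intact and only the console rendering is wrong, so writing to a file with an explicit encoding should avoid the problem.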
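UTF-8 is pretty standard in Python nowadays, and BeautifulSoup converts documents to Unicode internally and outputs UTF-8 by default. My suggestion is to work in Unicode/UTF-8 in your code and change the encoding (if you need to) only when you output or dump the data.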
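In the end I need to save the span id name and content to a CSV...
This is a rough idea that you should tweak to your needs:
    import csv

    file_ = open('output.csv', 'w')
    writer = csv.writer(file_)
    for span in spans:
        # span.get('id') avoids a KeyError on spans without an id
        writer.writerow([span.get('id'), span.string])
    file_.close()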
...and I'm having UTF-8 problems with it.
Could you specify what the problems are? On my system (GNU/Linux) it works fine.
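If the problem turns out to be the encoding of the CSV file itself, choosing the encoding only at the output boundary can be sketched like this. It assumes Python 3, where `open()` accepts an `encoding` argument. The `rows` list and the `write_rows` helper are hypothetical stand-ins for the `(span['id'], span.string)` pairs, not part of the original code.

```python
import csv
import os
import tempfile

# Hypothetical (id, text) pairs standing in for (span['id'], span.string).
rows = [
    ("lblcompanynumber", "520000472"),
    ("lblstatus", "פעילה"),
    ("lblcompanynameen", None),  # an empty span: .string is None
]


def write_rows(path, rows, encoding="utf-8"):
    """Write (id, text) pairs to a CSV file in the requested encoding."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding=encoding, newline="") as handle:
        writer = csv.writer(handle)
        for span_id, text in rows:
            writer.writerow([span_id, text or ""])


# "cp1255" is Python's codec name for windows-1255.
out_path = os.path.join(tempfile.mkdtemp(), "output.csv")
write_rows(out_path, rows, encoding="cp1255")
```

The strings stay Unicode throughout the program, and the target encoding is named exactly once, when the file is opened. Passing `encoding="utf-8"` instead produces a UTF-8 file with no other change.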