Use Python to crawl a website
So I am looking for a dynamic way to crawl a website and grab the links from each page. I decided to experiment with BeautifulSoup. Two questions: how do I do this more dynamically than using nested while statements to search for links? I want to get all the links from this site, but I don't want to keep adding nested while loops.
    # Level 1: links found on the base page
    toplevellinks = self.getalluniquelinks(baseurl)
    listoflinks = list(toplevellinks)
    length = len(listoflinks)
    count = 0

    while(count < length):
        # Level 2: links found on each level-1 page
        twolevellinks = self.getalluniquelinks(listoflinks[count])
        twolistoflinks = list(twolevellinks)
        twocount = 0
        twolength = len(twolistoflinks)

        for twolinks in twolistoflinks:
            listoflinks.append(twolinks)

        count = count + 1

        while(twocount < twolength):
            # Level 3: links found on each level-2 page
            threelevellinks = self.getalluniquelinks(twolistoflinks[twocount])
            threelistoflinks = list(threelevellinks)

            for threelinks in threelistoflinks:
                listoflinks.append(threelinks)

            twocount = twocount + 1

    print('--------------------------------------------------------------------------------------')
    # remove duplicates
    finallist = list(set(listoflinks))
    print(finallist)
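My second question: is there any way to tell if I got all the links from the site? Please forgive me, I am somewhat new to Python (a year or so), and I know some of my processes and logic might be childish, but I have to learn somehow. Mainly I just want to do this more dynamically than using nested while loops. Thanks in advance for any insight.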
Spidering over a web site and getting all the links is a common problem. If you Google search "spider web site python" you can find libraries that will do this for you. Here's one I found:
http://pypi.python.org/pypi/spider.py/0.5
Even better, Google found this same question already asked and answered here on Stack Overflow.
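Apart from those links, the usual way to avoid one nested loop per level is to keep a queue of pages still to visit and a set of pages already seen, and loop until the queue is empty. Below is a minimal Python 3 sketch of that idea, not a finished crawler. The names crawl, get_links, max_pages and FAKE_SITE are made up for illustration; get_links stands in for the getalluniquelinks method from the question and is assumed to return absolute URLs. The in-memory FAKE_SITE is only there so the loop can be run without touching the network.

```python
from collections import deque
from urllib.parse import urldefrag, urlparse


def crawl(start_url, get_links, max_pages=1000):
    """Breadth-first crawl; returns the set of same-host URLs discovered.

    get_links is a caller-supplied function: URL -> iterable of absolute URLs
    (a stand-in for getalluniquelinks in the question).
    """
    host = urlparse(start_url).netloc
    seen = {start_url}           # every URL ever queued, so nothing is fetched twice
    queue = deque([start_url])   # pages discovered but not yet fetched
    fetched = 0

    while queue and fetched < max_pages:
        page = queue.popleft()
        fetched += 1
        for link in get_links(page):
            link = urldefrag(link).url          # drop "#fragment" parts
            if urlparse(link).netloc != host:   # stay on the starting site
                continue
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


# Hypothetical in-memory "site" used only to demonstrate the loop offline.
FAKE_SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c#top"],
    "http://example.com/b": ["http://example.com/", "http://other.example.org/x"],
    "http://example.com/c": [],
}

found = crawl("http://example.com/", lambda url: FAKE_SITE.get(url, []))
print(sorted(found))
```

On the second question: when the queue runs empty (rather than the max_pages cap being hit), every page reachable by following links from the start page has been visited. Pages that nothing links to cannot be discovered this way, so "all links" here can only mean "all links reachable from the starting URL".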