Use Python to crawl a website


I am looking for a dynamic way to crawl a website and grab the links from each page. I decided to experiment with BeautifulSoup. Two questions: first, how do I do this more dynamically than using nested while statements to search for links? I want to get all the links from the site, but I don't want to keep adding nested while loops.

    toplevellinks = self.getalluniquelinks(baseurl)
    listoflinks = list(toplevellinks)

    length = len(listoflinks)
    count = 0

    while count < length:
        # second level: links found on each top-level page
        twolevellinks = self.getalluniquelinks(listoflinks[count])
        twolistoflinks = list(twolevellinks)
        twocount = 0
        twolength = len(twolistoflinks)

        for twolinks in twolistoflinks:
            listoflinks.append(twolinks)

        count = count + 1

        while twocount < twolength:
            # third level: links found on each second-level page
            threelevellinks = self.getalluniquelinks(twolistoflinks[twocount])
            threelistoflinks = list(threelevellinks)

            for threelinks in threelistoflinks:
                listoflinks.append(threelinks)

            twocount = twocount + 1

    print('--------------------------------------------------------------------------------------')
    #remove duplicates
    finallist = list(set(listoflinks))
    print(finallist)

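My second question: is there any way to tell whether I have got all the links from the site? Please forgive me, I am fairly new to Python (a year or so), and I know some of my processes and logic might be childish, but I have to learn somehow. Mainly I want to do this more dynamically than with nested while loops. Thanks in advance for any insight.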

Spidering over a web site and getting all of its links is a common problem. If you do a Google search for "spider web site python" you can find libraries that will do this for you. Here's one I found:

http://pypi.python.org/pypi/spider.py/0.5

Even better, Google found this question already asked and answered here on StackOverflow:

Anyone know of a Python based web crawler I could use?
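If you would rather keep your own code than switch to a library, one common way to avoid the nested loops is a breadth-first traversal: a single queue of pages still to visit plus a set of pages already seen. The following is only a minimal sketch in Python 3, not a full crawler. The names `crawl`, `get_links` and `max_pages` are made up for illustration; `get_links` is assumed to be any function that returns the hrefs found on one page, for example a wrapper around your `getalluniquelinks` method.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl(base_url, get_links, max_pages=1000):
    """Breadth-first crawl; returns the set of same-site URLs discovered.

    get_links(url) is a hypothetical callable returning an iterable of
    hrefs found on that page. max_pages is an assumed safety limit.
    """
    site = urlparse(base_url).netloc
    seen = {base_url}            # every URL discovered so far
    queue = deque([base_url])    # discovered but not yet fetched
    fetched = 0

    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        for href in get_links(url):
            # Resolve relative links and drop any #fragment part.
            link = urljoin(url, href).split('#')[0]
            # Stay on the same host and skip anything already seen.
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)

    # If the loop ended because the queue ran empty, every discovered
    # same-site page was fetched; if max_pages was hit, it was not.
    return seen
```

Each newly found link is appended to the same queue, so the loop reaches any depth without another level of nesting. This also gives a partial answer to the second question: when the queue is empty, you have every link reachable from the start page by following links. Pages that nothing links to cannot be found this way.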

