function - Python recursive crawling for URLs


I have a method that, when supplied with a list of links, gets the child links, and so on and so forth:

    def crawlsite(self, linkslist):
        finallist = []
        for link in list(linkslist):
            if link not in finallist:
                print(link)
                finallist.append(link)
                childlinks = self.getalluniquelinks(link)
                length = len(childlinks)
                print('total links for this page: ' + str(length))

            self.crawlsite(childlinks)
        return finallist

It seems to repeat the same set of links, and I can't figure out why. When I move self.crawlsite(childlinks) inside of the if statement, the first item in the list is repeated over and over.

Background on the self.getalluniquelinks(link) method: it gets a list of links from a given page and filters it down to the click-able links within a given domain. Basically, I am trying to get all the click-able links from a website. If this isn't the desired approach, please recommend a better method that can do the exact same thing. Please also consider that I am new to Python and might not understand more complex approaches, so please explain your thought processes, if you don't mind :)

You need

    finallist.extend(self.crawlsite(childlinks))

not

    self.crawlsite(childlinks)

You need to merge the list returned by the inner crawlsite() calls into the list that already exists in the outer crawlsite(). Even though they're both called finallist, you have a different list in each scope.
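A minimal sketch of that one-line fix, assuming a hypothetical in-memory link table (FAKE_LINKS) and a Crawler class standing in for the real page fetching, so it runs without network access. The recursive call is also placed inside the if block here. The table is deliberately a tree with no cycles: since every call still builds its own finallist, two pages that link to each other would recurse without end under this version.

```python
# Hypothetical link table standing in for real pages (an assumption for illustration).
FAKE_LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/b": [],
    "/a/1": [],
}


class Crawler(object):
    def getalluniquelinks(self, link):
        # Stand-in for the real method: look the page up in the table.
        return FAKE_LINKS.get(link, [])

    def crawlsite(self, linkslist):
        finallist = []  # local: every call gets its own list
        for link in linkslist:
            if link not in finallist:
                finallist.append(link)
                childlinks = self.getalluniquelinks(link)
                # Merge what the inner call found into this call's list.
                finallist.extend(self.crawlsite(childlinks))
        return finallist


result = Crawler().crawlsite(["/"])
print(result)
```

Without the extend() call, the lists built by the inner calls are simply thrown away when those calls return.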

The alternative (and better) solution is to make finallist an instance variable (or a nonlocal variable of some type) instead of just a local variable, so that it's shared by all scopes of the crawlsite() calls:

    def __init__(self, *args, **kwargs):
        self.finallist = set()  # shared by every crawlsite() call on this instance

    def crawlsite(self, linkslist):
        for link in linkslist:
            if link not in self.finallist:
                print(link)
                self.finallist.add(link)
                childlinks = self.getalluniquelinks(link)
                length = len(childlinks)
                print('total links for this page: ' + str(length))
                self.crawlsite(childlinks)

You just need to make sure to reset self.finallist = set() if you want to start over from scratch with the same instance.

Edit: Fixed the code by putting the recursive call in the if block. Also used a set. Also, linkslist doesn't need to be a list, just an iterable object, so I removed the list() call from the for loop. The set was suggested by @ray-toal.
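A sketch of the shared-set version run against a hypothetical link table that does contain a cycle ("/" and "/a" link to each other). FAKE_LINKS and the Crawler class are assumptions for illustration, not part of the original code; the point is only that marking a link as seen before recursing is what stops the loop.

```python
# Hypothetical link table with a cycle: "/" -> "/a" -> "/".
FAKE_LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/", "/a/1"],
    "/b": ["/a"],
    "/a/1": [],
}


class Crawler(object):
    def __init__(self):
        self.finallist = set()  # shared by every recursive call

    def getalluniquelinks(self, link):
        # Stand-in for the real method: look the page up in the table.
        return FAKE_LINKS.get(link, [])

    def crawlsite(self, linkslist):
        for link in linkslist:
            if link not in self.finallist:
                # Record the link before recursing so a cycle stops here.
                self.finallist.add(link)
                self.crawlsite(self.getalluniquelinks(link))


crawler = Crawler()
crawler.crawlsite(["/"])
visited = sorted(crawler.finallist)
print(visited)

# Starting over with the same instance: reset the set first.
crawler.finallist = set()
crawler.crawlsite(iter(["/b"]))  # any iterable works, not just a list
visited_from_b = sorted(crawler.finallist)
print(visited_from_b)
```

Both runs end up with the same four links, because every page in this table is reachable from either starting point.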

