Python recursive crawling for URLs
I have a method that, when supplied with a list of links, gets the child links, and so on, and so forth:
    def crawlsite(self, linkslist):
        finallist = []
        for link in list(linkslist):
            if link not in finallist:
                print link
                finallist.append(link)
                childlinks = self.getalluniquelinks(link)
                length = len(childlinks)
                print 'total links page: ' + str(length)
            self.crawlsite(childlinks)
        return finallist
It seems to repeat the same set of links, and I can't figure out why. When I move self.crawlsite(childlinks) inside the if statement, the first item in the list is repeated over and over.
Background on self.getalluniquelinks(link):
This method gets a list of links from a given page. It filters for all click-able links within a given domain. Basically, I am trying to get all click-able links from a website. If this isn't the right approach, please recommend a better method that can do the exact same thing. Please also consider that I am new to Python and might not understand more complex approaches, so please explain your thought process, if you don't mind :)
You need

    finallist.extend(self.crawlsite(childlinks))

not

    self.crawlsite(childlinks)
You need to merge the list returned by each inner crawlsite() call with the list that already exists in the outer crawlsite() call. Even though they're all called finallist, you have a different list in each scope.
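To see the effect of the merge in isolation, here is a minimal sketch in Python 3. The FAKE_SITE dictionary and the stubbed getalluniquelinks() are assumptions made up for illustration (the real method fetches and filters a page), and the fake site is deliberately free of cycles.

```python
# Hypothetical in-memory "site": page -> links found on that page (no cycles).
FAKE_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/c"],
    "/c": [],
}


class Crawler:
    def getalluniquelinks(self, link):
        # Stand-in for the real page fetch and link filtering.
        return FAKE_SITE.get(link, [])

    def crawlsite(self, linkslist):
        finallist = []  # local: every call gets its own fresh list
        for link in linkslist:
            if link not in finallist:
                finallist.append(link)
                childlinks = self.getalluniquelinks(link)
                # Merge what the inner call found into this call's list.
                finallist.extend(self.crawlsite(childlinks))
        return finallist


result = Crawler().crawlsite(["/"])
print(result)  # ['/', '/a', '/c', '/b', '/c']
```

Note that "/c" shows up twice: the inner call for "/b" has no idea that "/c" was already visited through "/a", because each call only sees its own list. For the same reason, two pages that link to each other would make this version recurse until Python gives up, since a local list cannot remember visits made in other calls.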
The alternative (and better) solution is to make finallist an instance variable (or a nonlocal variable of some kind) instead of a local variable, so that it's shared by all scopes of the crawlsite() calls:
    def __init__(self, *args, **kwargs):
        self.finallist = set()

    def crawlsite(self, linkslist):
        for link in linkslist:
            if link not in self.finallist:
                print link
                self.finallist.add(link)
                childlinks = self.getalluniquelinks(link)
                length = len(childlinks)
                print 'total links page: ' + str(length)
                self.crawlsite(childlinks)
You just need to make sure you do self.finallist = set() if you want to start over from scratch with the same instance.
Edit: fixed the code by putting the recursive call in the if block. Used a set. Also, linkslist doesn't need to be a list, just an iterable object, so I removed the list() call from the for loop. The set was suggested by @ray-toal.
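If you want to try the shared-set version without hitting a real website, something like the following Python 3 sketch should work. The FAKE_SITE dictionary and the stubbed getalluniquelinks() are made-up stand-ins, not your real method, and this time the fake site contains cycles ("/a" links back to "/", and "/c" links back to "/a").

```python
# Hypothetical in-memory "site" with cycles: page -> links found on that page.
FAKE_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/", "/c"],
    "/b": ["/c"],
    "/c": ["/a"],
}


class Crawler:
    def __init__(self):
        # One set shared by every crawlsite() call on this instance.
        self.finallist = set()

    def getalluniquelinks(self, link):
        # Stand-in for the real page fetch and link filtering.
        return FAKE_SITE.get(link, [])

    def crawlsite(self, linkslist):
        for link in linkslist:
            if link not in self.finallist:
                # Record the link before recursing so cycles stop here.
                self.finallist.add(link)
                self.crawlsite(self.getalluniquelinks(link))


crawler = Crawler()
crawler.crawlsite(["/"])
visited = sorted(crawler.finallist)
print(visited)  # ['/', '/a', '/b', '/c']

# Start over from scratch on the same instance.
crawler.finallist = set()
crawler.crawlsite(["/b"])
revisited = sorted(crawler.finallist)
```

Each page is visited exactly once even though the links loop back on themselves, because the membership check and the add() both happen on the shared set before the recursive call. One caveat: every new page adds a level of recursion, so a very long chain of links could run into Python's recursion limit; a loop with an explicit list of pending links would avoid that.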