In PySpider, I had a function that recursively fetches the next page based on the current page number.
The relevant code:
<code>    # @every(minutes=24 * 60)
    def on_start(self):
        self.crawl(
            'http://xxxa=audition&act_id=3',
            callback=self.index_page)

    # @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # <ul class="list-user list-user-1" id="list-user-1">
        for each in response.doc('ul[id^="list-user"] li a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.showVideoCallback)

        curPageNum = 1
        self.getNextPageShowParentChild(response, curPageNum)
        self.getNextPageShowFriend(response, curPageNum)

    def getNextPageShowParentChild(self, response, curPageNum):
        """for the parent-child group (亲子组): recursively get next page shows until fail"""
        # <ul class="list-user list-user-1" id="list-user-1">
        self.getNextPageShow(response, curPageNum, 1)

    def getNextPageShowFriend(self, response, curPageNum):
        """for the friend group (好友组): recursively get next page shows until fail"""
        # <ul class="list-user list-user-2" id="list-user-2">
        self.getNextPageShow(response, curPageNum, 2)

    def getNextPageShow(self, response, curPageNum, order):
        """recursively get next page shows until fail"""
        print("getNextPageShow: curPageNum=%s, order=%s" % (curPageNum, order))
        getShowsUrl = "http://xxxc=match_new&a=get_shows"
        headerDict = {
            "Content-Type": "application/x-www-form-urlencoded"
        }
        dataDict = {
            "counter": curPageNum,
            "order": order,
            "match_type": 2,
            "match_name": "",
            "act_id": 3
        }
        curPageDict = {
            "curPageNum": curPageNum,
            "order": order
        }
        self.crawl(
            getShowsUrl,
            method="POST",
            headers=headerDict,
            data=dataDict,
            cookies=response.cookies,
            callback=self.parseGetShowsCallback,
            save=curPageDict
        )

    def parseGetShowsCallback(self, response):
        print("parseGetShowsCallback: self=%s, response=%s" % (self, response))
        respJson = response.json
        prevPageDict = response.save
        print("prevPageDict=%s, respJson=%s" % (prevPageDict, respJson))
        if respJson["status"] == 1:
            respData = respJson["data"]

            # recursively try to get next page shows
            curPageNum = prevPageDict["curPageNum"] + 1
            self.getNextPageShow(response, curPageNum, prevPageDict["order"])

            for eachData in respData:
                # print("type(eachData)=%s" % type(eachData))
                showId = eachData["show_id"]
                href = eachData["href"]
                fullUrl = QupeiyinUrlRoot + href
                print("[%s] fullUrl=%s" % (showId, fullUrl))
                curShowInfoDict = eachData
                self.crawl(
                    fullUrl,
                    callback=self.showVideoCallback,
                    save=curShowInfoDict)
        else:
            print("!!! Fail to get shows json from %s" % response.url)
</code>
But later I found that getNextPageShow and parseGetShowsCallback
could still be reached while debugging,
yet during an actual RUN execution never got to them.
In the end only 60 results from 2 pages were saved,
when in fact there should have been tens or hundreds.
I then also commented out:
<code># @every(minutes=24 * 60)
# @config(age=10 * 24 * 60 * 60)
</code>
but the functions still weren't reached.
Only then did it occur to me: this part of the code is a recursive call, and neither the function nor the URL ever changes.
The URL inside getNextPageShow,
http://xxxc=match_new&a=get_shows
is always the same, even though the POST parameters differ on each call.
That is presumably what stopped execution from continuing.
So I needed a way to force the crawl to run anyway.
Searched for:
pyspider 函数 不继续执行
pyspider 重复不继续执行
and found:
Pyspider 函数不执行 – 足兆叉虫的回答 – SegmentFault 思否
pyspider run状态下result没有数据,而且没有继续向下执行,为什么? – 足兆叉虫的回答 – SegmentFault 思否
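A note on the likely root cause (this is from pyspider's FAQ, not something I verified in those answers): tasks are deduplicated by taskid, and the default get_taskid is just the md5 of the URL, so POSTs to the same URL with different data collapse into one task. The FAQ's suggested fix is to override get_taskid on the Handler so the POST data takes part in the id. A sketch along those lines, which I did not end up using here:
<code>import json
from pyspider.libs.base_handler import BaseHandler
from pyspider.libs.utils import md5string

class Handler(BaseHandler):
    def get_taskid(self, task):
        # The default implementation is md5string(task['url']), which is why
        # POSTs to the same URL with different data collapse into one task;
        # mixing in the POST body makes them count as different tasks.
        return md5string(task['url'] + json.dumps(task['fetch'].get('data', '')))
</code>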
So when I get a chance, I'll try appending a random hash.
Also found: 关于重复爬取出现问题 · Issue #598 · binux/pyspider
Next, check the official docs for a setting that forces re-crawling of duplicate URLs:
self.crawl – pyspider中文文档 – pyspider中文网
"age: this parameter specifies the task's validity period; within that period the same task will not be re-crawled. The default is -1 (never expires, meaning crawl only once)."
Perhaps setting an extremely low age would allow repeated crawling?
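A quick sketch of that idea, my own guess at usage since I did not end up going this route; age is a per-request parameter of self.crawl, and the variable names are the ones from getNextPageShow above:
<code># Hypothetical: inside getNextPageShow, with age=1 second the task expires
# almost immediately, so re-submitting the same URL should be allowed to
# fetch again once the previous task is stale.
self.crawl(
    getShowsUrl,
    age=1,  # default -1 = never expires, i.e. crawl only once
    method="POST",
    headers=headerDict,
    data=dataDict,
    callback=self.parseGetShowsCallback,
    save=curPageDict
)
</code>
But the next parameter in the docs looked more direct: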
“itag
任务标记值,此标记会在抓取时对比,如果这个值发生改变,不管有效期有没有到都会重新抓取新内容.多数用来动态判断内容是否修改或强制重爬.默认值是:None.”
貌似用itag实现强制重复抓取。
<code>    def getNextPageShow(self, response, curPageNum, order):
        """recursively get next page shows until fail"""
        print("getNextPageShow: curPageNum=%s, order=%s" % (curPageNum, order))
        getShowsUrl = "http://xxxmatch_new&a=get_shows"
        headerDict = {
            "Content-Type": "application/x-www-form-urlencoded"
        }
        dataDict = {
            "counter": curPageNum,
            "order": order,
            "match_type": 2,
            "match_name": "",
            "act_id": 3
        }
        curPageDict = {
            "curPageNum": curPageNum,
            "order": order
        }
        fakeItagForceRecrawl = "%s_%s" % (curPageNum, order)
        self.crawl(
            getShowsUrl,
            itag=fakeItagForceRecrawl,  # To force re-crawl for next page
            method="POST",
            headers=headerDict,
            data=dataDict,
            cookies=response.cookies,
            callback=self.parseGetShowsCallback,
            save=curPageDict
        )
</code>
Let's try it.
While debugging, I could see the itag value now differed each time:
Still 60 results; no effect.
Fine, clear the previous data and re-run.
That seems to work; at least it's past 60 now, it's just a matter of waiting for the crawl to continue:
[Summary]
In PySpider, if a URL you pass to crawl duplicates one that was already crawled, by default it will not be crawled again.
To force a re-crawl, the options are:
1. Append a meaningless hash fragment to the URL, for example:
http://xxx?m=home&c=match_new&a=get_shows#123
http://xxx?m=home&c=match_new&a=get_shows#456
The hash can also be a timestamp (with milliseconds, so it is unlikely to repeat). See the sketch after the reference code below.
2. Set the itag value.
It defaults to None, so by default no re-crawl is forced.
Set a different itag on each call. For example, here every call to getNextPageShow crawls the same URL:
http://xxx?m=home&c=match_new&a=get_shows
so the itag can be a random value, or something tied to the actual logic, such as the current curPageNum and order values: 2_1, 3_1, 4_1, and so on.
Reference code for the itag approach:
<code>    def getNextPageShow(self, response, curPageNum, order):
        """recursively get next page shows until fail"""
        print("getNextPageShow: curPageNum=%s, order=%s" % (curPageNum, order))
        getShowsUrl = "http://xxxmatch_new&a=get_shows"
        ...
        fakeItagForceRecrawl = "%s_%s" % (curPageNum, order)
        self.crawl(
            getShowsUrl,
            itag=fakeItagForceRecrawl,  # To force re-crawl for next page
            method="POST",
            headers=headerDict,
            data=dataDict,
            cookies=response.cookies,
            callback=self.parseGetShowsCallback,
            save=curPageDict
        )
</code>
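And for the first option, a minimal sketch using a microsecond timestamp as the fragment, reusing the same getShowsUrl and arguments as above (the full combined version appears in the postscript):
<code># Inside getNextPageShow: a microsecond timestamp virtually never repeats,
# so every call yields a brand-new URL and therefore a brand-new taskid.
from datetime import datetime

timestampStr = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
getShowsUrlWithHash = getShowsUrl + "#" + timestampStr
self.crawl(
    getShowsUrlWithHash,
    method="POST",
    headers=headerDict,
    data=dataDict,
    callback=self.parseGetShowsCallback,
    save=curPageDict
)
</code>
Note that the fragment part of a URL is never sent to the server, so the request itself is unchanged; only pyspider's view of the URL (and hence the taskid) differs.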
[Postscript]
But the count still wasn't right; there were only 137 results:
It should have been several hundred, even a thousand or more.
Searched for:
pyspider 重复 不爬取
Fine, let me try adding the #hash value as well.
<code>from datetime import datetime, timedelta
import time

# ... inside getNextPageShow:
timestampStr = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
fakeItagForceRecrawl = "%s_%s_%s" % (timestampStr, curPageNum, order)
getShowsUrlWithHash = getShowsUrl + "#" + timestampStr
self.crawl(
    getShowsUrlWithHash,
    itag=fakeItagForceRecrawl,  # To force re-crawl for next page
    method="POST",
    headers=headerDict,
    data=dataDict,
    cookies=response.cookies,
    callback=self.parseGetShowsCallback,
    save=curPageDict
)
</code>
This both makes the itag more elaborate, so it never repeats, and adds the hash.
Let's look at the result.
While debugging, the itag is visibly different each time:
<code> "project": "xxx", "schedule": { "itag": "20180712_153856_329293_6_1" }, "taskid": "36cb9d54f6a82215e66d268aaac65848", "url": "http://xxxa=get_shows#20180712_153856_329293" } </code>
Then clear the project data again, re-run, and see whether all the code is now forced to execute.
On the very first run it was already right, generating 2 URLs:
<code> "project": "xx", "schedule": { "itag": "20180712_154134_660231_1_1" }, "taskid": "da109ba37f77ca5983d376c0f791cf72", "url": "http://xxxa=get_shows#20180712_154134_660231" } "project": "xx", "schedule": { "itag": "20180712_154134_660436_1_2" }, "taskid": "fc8d90bf8dff1ac7f9b7384cc779c4fd", "url": "http://xxxa=get_shows#20180712_154134_660436" } </code>
Then I went on to debug further.
Everything looks correct in debugging.
Now DEBUG or RUN and check the results:
There are over 1000 now, which is what it should be:
The results also span 24 pages now:
All normal.
[Summary 2]
Here, you must set the itag value and also append the #hash fragment:
<code>from pyspider.libs.base_handler import *
import re
import os
import codecs
import json
from datetime import datetime, timedelta
import time

QupeiyinUrlRoot = "http://xxx"
OutputFullPath = "/Users/crifan/dev/dev_root/xxx/output"


class Handler(BaseHandler):
    crawl_config = {
    }

    # @every(minutes=24 * 60)
    def on_start(self):
        self.crawl(
            'http://xxxaudition&act_id=3',
            callback=self.index_page)

    # @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # <ul class="list-user list-user-1" id="list-user-1">
        for each in response.doc('ul[id^="list-user"] li a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.showVideoCallback)

        curPageNum = 1
        self.getNextPageShowParentChild(response, curPageNum)
        self.getNextPageShowFriend(response, curPageNum)

    def getNextPageShowParentChild(self, response, curPageNum):
        """for the parent-child group (亲子组): recursively get next page shows until fail"""
        # <ul class="list-user list-user-1" id="list-user-1">
        self.getNextPageShow(response, curPageNum, 1)

    def getNextPageShowFriend(self, response, curPageNum):
        """for the friend group (好友组): recursively get next page shows until fail"""
        # <ul class="list-user list-user-2" id="list-user-2">
        self.getNextPageShow(response, curPageNum, 2)

    def getNextPageShow(self, response, curPageNum, order):
        """recursively get next page shows until fail"""
        print("getNextPageShow: curPageNum=%s, order=%s" % (curPageNum, order))
        getShowsUrl = "http://xxxa=get_shows"
        headerDict = {
            "Content-Type": "application/x-www-form-urlencoded"
        }
        dataDict = {
            "counter": curPageNum,
            "order": order,
            "match_type": 2,
            "match_name": "",
            "act_id": 3
        }
        curPageDict = {
            "curPageNum": curPageNum,
            "order": order
        }
        timestampStr = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
        fakeItagForceRecrawl = "%s_%s_%s" % (timestampStr, curPageNum, order)
        getShowsUrlWithHash = getShowsUrl + "#" + timestampStr
        self.crawl(
            getShowsUrlWithHash,
            itag=fakeItagForceRecrawl,  # To force re-crawl for next page
            method="POST",
            headers=headerDict,
            data=dataDict,
            cookies=response.cookies,
            callback=self.parseGetShowsCallback,
            save=curPageDict
        )
</code>
Only then are the URLs non-duplicate, achieving the forced re-crawl.
The itag value and hash-bearing URL look like this:
<code> "project": "xxx", "schedule": { "itag": "20180712_154134_660436_1_2" }, "taskid": "fc8d90bf8dff1ac7f9b7384cc779c4fd", "url": "http://xxxget_shows#20180712_154134_660436" } </code>