Working on:
[Notes] Crawling Scholastic picture-book data with PySpider
Along the way, the page load occasionally fails and returns no data:
[I 181010 15:45:25 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
[I 181010 15:46:22 tornado_fetcher:188] [200] ScholasticStorybook:data:,on_start data:,on_start 0s
[I 181010 15:46:25 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
console: Error, missing Report Suite ID in AppMeasurement initialization
Error: Unexpected token '}'
Function@[native code]
compile@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:213:122
parse@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:238:288
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:117:343
$watch@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:127:350
link@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:168:435
ea@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:73:294
D@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:62:192
g@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:55:106
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:54:250
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:56:80
k@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:60:378
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:254:336
$digest@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:131:151
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core/scripts/adrum.js:14:478
$apply@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:134:85
g@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:87:450
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core/scripts/adrum.js:14:478
T@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:92:51
onload@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:93:79
<a href="#" class="s-page back-to-top-link pagination-arrow-right ng-scope" ng-if="!SearchResults.finalPage" ng-click="!SearchResults.loading && SearchResults.loadNextPage();pageChange();" ng-class="{'disabled':SearchResults.finalPage, 'disabled-link':SearchResults.finalPage,'loading':SearchResults.loading,'disabled-link':SearchResults.actualPage==SearchResults.lastPage}" target="_self">
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:107
console: [object Object]
Request error: #65 [5=Operation canceled] https://scholasticinc.tt.omtrdc.net/m2/scholasticinc/mbox/json?mbox=target-global-mbox&mboxSession=b119c34814ce4b5f8b3a2795c8c09526&mboxPC=&mboxPage=5b873fda78cd40208a278c1e41c26ac9&mboxVersion=1.1.0&mboxCount=1&mboxTime=1539186388962&mboxHost=www.scholastic.com&mboxURL=https%3A%2F%2Fwww.scholastic.com%2Fteachers%2Fbookwizard%2F&mboxReferrer=&browserHeight=2304&browserWidth=1024&browserTimeOffset=480&screenHeight=768&screenWidth=1024&colorDepth=32&vst.trk=stats.scholastic.com&vst.trks=sstats.scholastic.com&mboxMCSDID=6EE7248D918315FD-6DFE09D5B6547ECF&teachersbetaCutover=false&SPS_ID=not+logged+in
console: AT: [getOffer()] request failed [object Object]
console: AT: Rendering mbox failed target-global-mbox error timeout
[304] https://www.scholastic.com/teachers/bookwizard/ 20.146
[I 181010 15:46:44 tornado_fetcher:520] [304] ScholasticStorybook:34b1c45f09fa84805dd1697c1809e8c9 https://www.scholastic.com/teachers/bookwizard/ 20.15s
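At other times it loads fine.
And opening the page in a browser that has a proxy enabled (to get around the firewall) works without problems.
So I want to add a proxy and see whether that makes the page load reliably every time.
pyspider add proxy
It looks like proxy can be set directly on crawl?
Or configured in the global crawl_config?
PySpider proxy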
“proxy
proxy server of username:password@hostname:port to use, only http proxy is supported currently.
class Handler(BaseHandler):
    crawl_config = {
        'proxy': 'localhost:8080'
    }
Handler.crawl_config can be used with proxy to set a proxy for whole project.”
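The proxy string format quoted above (username:password@hostname:port, with the credentials optional) can be assembled with a small helper. This is only a sketch: build_proxy is a hypothetical helper name and not part of PySpider, and port 1087 is simply the local ss HTTP port tried later in this post.

```python
def build_proxy(host, port, username=None, password=None):
    """Return a proxy string in the 'username:password@hostname:port' form."""
    port = int(port)
    if not 1 <= port <= 65535:
        raise ValueError("port out of range: %r" % (port,))
    address = "%s:%d" % (host, port)
    if username is None:
        # No credentials: plain 'hostname:port'.
        return address
    credentials = username if password is None else "%s:%s" % (username, password)
    return "%s@%s" % (credentials, address)


# In a PySpider project this dict would be assigned to Handler.crawl_config.
crawl_config = {
    "proxy": build_proxy("127.0.0.1", 1087),
}
```

Building the string in one place makes it harder to mistype the port when switching between candidate values.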
Let me try the ss (Shadowsocks) one:
crawl_config = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
    "proxy": "127.0.0.1:1087",
}
Result:
The page still fails to load:
[304] https://www.scholastic.com/teachers/bookwizard/ 13.279
[I 181010 15:53:32 tornado_fetcher:520] [304] ScholasticStorybook:34b1c45f09fa84805dd1697c1809e8c9 https://www.scholastic.com/teachers/bookwizard/ 13.28s
[I 181010 15:53:47 tornado_fetcher:188] [200] ScholasticStorybook:data:,on_start data:,on_start 0s
console: Error, missing Report Suite ID in AppMeasurement initialization
Error: Unexpected token '}'
Function@[native code]
compile@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:213:122
parse@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:238:288
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:117:343
$watch@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:127:350
link@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:168:435
ea@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:73:294
D@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:62:192
g@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:55:106
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:54:250
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:56:80
k@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:60:378
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:254:336
$digest@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:131:151
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core/scripts/adrum.js:14:478
$apply@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:134:85
g@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:87:450
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core/scripts/adrum.js:14:478
T@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:92:51
onload@ https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:93:79
<a href="#" class="s-page back-to-top-link pagination-arrow-right ng-scope" ng-if="!SearchResults.finalPage" ng-click="!SearchResults.loading && SearchResults.loadNextPage();pageChange();" ng-class="{'disabled':SearchResults.finalPage, 'disabled-link':SearchResults.finalPage,'loading':SearchResults.loading,'disabled-link':SearchResults.actualPage==SearchResults.lastPage}" target="_self">
https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:107
console: [object Object]
Request error: #80 [202=Error downloading https://shop.pe/widget/main/init/params?siteid=59d3b490d559308d854e75a8&product=Book Wizard%3A Teachers%2C Find and Level Books for Your Classroom %7C Scholastic&product_url=http%3A%2F%2Fwww.scholastic.com%2Fteachers%2Fbookwizard%2F&image=http%3A%2F%2Fwww.scholastic.com%2F%2F&price=&currency=undefined&rating=0&rating_count=0&review_count=0&stock_status=&description=Level your classroom library or find books at just the right level for students with Book Wizard%2C the book finder from Scholastic with Guided Reading%2C Lexile® Measure%2C an&update_product=true&subcategory=&url=https%3A%2F%2Fwww.scholastic.com%2Fteachers%2Fbookwizard%2F&callback=AddShoppersWidget.load_widget&no_cookie_callback=AddShoppersWidget.load_no_cookie&rand=85958&cookie=&referer= - server replied: Forbidden]
https://shop.pe/widget/main/init/params?siteid=59d3b490d559308d854e75a8&product=Book%20Wizard%3A%20Teachers%2C%20Find%20and%20Level%20Books%20for%20Your%20Classroom%20%7C%20Scholastic&product_url=http%3A%2F%2Fwww.scholastic.com%2Fteachers%2Fbookwizard%2F&image=http%3A%2F%2Fwww.scholastic.com%2F%2F&price=&currency=undefined&rating=0&rating_count=0&review_count=0&stock_status=&description=Level%20your%20classroom%20library%20or%20find%20books%20at%20just%20the%20right%20level%20for%20students%20with%20Book%20Wizard%2C%20the%20book%20finder%20from%20Scholastic%20with%20Guided%20Reading%2C%20Lexile%C2%AE%20Measure%2C%20an&update_product=true&subcategory=&url=https%3A%2F%2Fwww.scholastic.com%2Fteachers%2Fbookwizard%2F&callback=AddShoppersWidget.load_widget&no_cookie_callback=AddShoppersWidget.load_no_cookie&rand=85958&cookie=&referer=
[304] https://www.scholastic.com/teachers/bookwizard/ 15.054
[I 181010 15:54:04 tornado_fetcher:520] [304] ScholasticStorybook:34b1c45f09fa84805dd1697c1809e8c9 https://www.scholastic.com/teachers/bookwizard/ 15.05s
Next I tried switching the proxy to global mode.
That failed outright with an error.
So I gave up on global mode.
Then I noticed:
“validate_cert
For HTTPS requests, validate the server’s certificate? default: True”
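Could this be related to HTTPS certificate validation?
I also searched for:
PySpider Error, missing Report Suite ID in AppMeasurement initialization
and found nothing relevant.
Before digging further, first see:
[Mostly solved] PySpider gets a 304 when opening a page
To confirm whether the proxy above actually takes effect, I deliberately changed the port to an arbitrary value, and found:
the page still opens (though the original problem remains)
-> which shows that the earlier
proxy setting had no effect.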
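When a changed port makes no visible difference, it can help to check which local ports actually have something listening before drawing conclusions. A minimal sketch using only the standard library; is_port_open is a hypothetical helper name, and 1087 and 10870 are the ports tried in this post.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    for port in (1087, 10870):
        state = "listening" if is_port_open("127.0.0.1", port) else "closed"
        print("127.0.0.1:%d is %s" % (port, state))
```

If the page still loads while the configured port is closed, the request is presumably not going through that proxy setting at all.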
pyspider proxy not work
Changed it to:
# "proxy": "127.0.0.1:10870",
"proxy": "localhost:1087",
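After that the success rate seemed much higher.
Later testing showed:
[Summary]
In PySpider, network requests appear to go through the current (local Mac) system network:
- When the Mac itself uses the ss proxy, PySpider can open YouTube and other sites that are otherwise blocked
- This holds even when PySpider itself has no proxy configured: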
crawl_config = {
    # "proxy": "127.0.0.1:10870",
    # "proxy": "127.0.0.1:1087",
    # "proxy": "localhost:1087",
}
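So my impression is:
With the ss proxy enabled on the local Mac, a proxy setting in PySpider is not needed for now, and turning one on makes no difference anyway.
So as far as letting PySpider reach blocked sites goes,
with ss already enabled on the local Mac, this counts as solved for now.
If other problems come up, I will deal with them then.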
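One way to see what "going through the system network" might involve is to ask Python which proxies it detects from the environment or the OS settings. The sketch below uses only the standard library's urllib.request.getproxies(); whether PySpider's fetcher or PhantomJS consults the same settings is an assumption here, not something verified in this post. describe_system_proxies is a hypothetical helper name.

```python
import urllib.request


def describe_system_proxies():
    """Return a sorted list of 'scheme -> proxy' strings that Python detects."""
    # getproxies() reads *_proxy environment variables first and, when none
    # are set, falls back to the OS proxy settings on macOS and Windows.
    proxies = urllib.request.getproxies()
    return ["%s -> %s" % (scheme, url) for scheme, url in sorted(proxies.items())]


if __name__ == "__main__":
    detected = describe_system_proxies()
    if detected:
        print("\n".join(detected))
    else:
        print("no system-level proxy detected")
```

Running this once with ss enabled and once with it disabled would show whether a system-level proxy is being advertised at all.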
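When reposting, please credit: 在路上 » [Temporarily solved] Using a firewall-circumventing proxy in PySpider to open pages that are otherwise blocked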