While working on:
[Record] Using PySpider to crawl picture-book data from Scholastic
the page load occasionally failed and returned no data:
    [I 181010 15:45:25 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
    [I 181010 15:46:22 tornado_fetcher:188] [200] ScholasticStorybook:data:,on_start data:,on_start 0s
    [I 181010 15:46:25 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
    console: Error, missing Report Suite ID in AppMeasurement initialization
    Error: Unexpected token '}'
    Function@[native code]
    compile@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:213:122
    parse@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:238:288
    https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:117:343
    $watch@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:127:350
    link@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:168:435
    ea@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:73:294
    D@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:62:192
    g@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:55:106
    https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:54:250
    https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:56:80
    k@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:60:378
    https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:254:336
    $digest@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:131:151
    https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core/scripts/adrum.js:14:478
    $apply@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:134:85
    g@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:87:450
    https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core/scripts/adrum.js:14:478
    T@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:92:51
    onload@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:93:79
    <a href="#" class="s-page back-to-top-link pagination-arrow-right ng-scope" ng-if="!SearchResults.finalPage" ng-click="!SearchResults.loading && SearchResults.loadNextPage();pageChange();" ng-class="{'disabled':SearchResults.finalPage, 'disabled-link':SearchResults.finalPage,'loading':SearchResults.loading,'disabled-link':SearchResults.actualPage==SearchResults.lastPage}" target="_self">
    https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:107
    console: [object Object]
    Request error: #65 [5=Operation canceled] https://scholasticinc.tt.omtrdc.net/m2/scholasticinc/mbox/json?mbox=target-global-mbox&mboxSession=b119c34814ce4b5f8b3a2795c8c09526&mboxPC=&mboxPage=5b873fda78cd40208a278c1e41c26ac9&mboxVersion=1.1.0&mboxCount=1&mboxTime=1539186388962&mboxHost=www.scholastic.com&mboxURL=https%3A%2F%2Fwww.scholastic.com%2Fteachers%2Fbookwizard%2F&mboxReferrer=&browserHeight=2304&browserWidth=1024&browserTimeOffset=480&screenHeight=768&screenWidth=1024&colorDepth=32&vst.trk=stats.scholastic.com&vst.trks=sstats.scholastic.com&mboxMCSDID=6EE7248D918315FD-6DFE09D5B6547ECF&teachersbetaCutover=false&SPS_ID=not+logged+in
    console: AT: [getOffer()] request failed [object Object]
    console: AT: Rendering mbox failed target-global-mbox error timeout
    [304] https://www.scholastic.com/teachers/bookwizard/ 20.146
    [I 181010 15:46:44 tornado_fetcher:520] [304] ScholasticStorybook:34b1c45f09fa84805dd1697c1809e8c9 https://www.scholastic.com/teachers/bookwizard/ 20.15s

Yet occasionally it loads fine:

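Opening the same page in a browser that goes through a circumvention proxy works without problems.
So I want to add a proxy and see whether that makes the page load correctly every time.
Searched: pyspider add proxy
It looks like a proxy can be set directly on crawl?
Or configured in the global crawl_config?
Searched: PySpider proxy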
“proxy
proxy server of username:password@hostname:port to use, only http proxy is supported currently.
    class Handler(BaseHandler):
        crawl_config = {
            'proxy': 'localhost:8080'
        }
Handler.crawl_config can be used with proxy to set a proxy for whole project.”
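The quoted format (username:password@hostname:port, with the credentials optional in the example) can be sanity-checked before it goes into crawl_config. Below is a minimal sketch using only the Python standard library; parse_proxy is a hypothetical helper written for this post, not something PySpider provides.

```python
from urllib.parse import urlsplit


def parse_proxy(proxy):
    """Split '[user:pass@]host:port' into its parts.

    Hypothetical helper for illustration only; PySpider does not ship it.
    """
    # Prefixing '//' makes urlsplit treat the string as a network location.
    parts = urlsplit("//" + proxy)
    if not parts.hostname or parts.port is None:
        raise ValueError("expected [user:pass@]host:port, got %r" % proxy)
    return {
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
    }


if __name__ == "__main__":
    # The two values that appear in this post.
    print(parse_proxy("localhost:8080"))
    print(parse_proxy("127.0.0.1:1087"))
```

This only validates the shape of the string; it says nothing about whether a proxy is actually reachable at that address.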
Let me try it with the ss (Shadowsocks) proxy:

    crawl_config = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
        "proxy": "127.0.0.1:1087",
    }
Result:
the page still fails to load:
    [304] https://www.scholastic.com/teachers/bookwizard/ 13.279
    [I 181010 15:53:32 tornado_fetcher:520] [304] ScholasticStorybook:34b1c45f09fa84805dd1697c1809e8c9 https://www.scholastic.com/teachers/bookwizard/ 13.28s
    [I 181010 15:53:47 tornado_fetcher:188] [200] ScholasticStorybook:data:,on_start data:,on_start 0s
    console: Error, missing Report Suite ID in AppMeasurement initialization
    Error: Unexpected token '}'
    Function@[native code]
    compile@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:213:122
    parse@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:238:288
    https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:117:343
    $watch@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:127:350
    link@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:168:435
    ea@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:73:294
    D@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:62:192
    g@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:55:106
    https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:54:250
    https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:56:80
    k@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:60:378
    https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:254:336
    $digest@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:131:151
    https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core/scripts/adrum.js:14:478
    $apply@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:134:85
    g@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:87:450
    https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core/scripts/adrum.js:14:478
    T@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:92:51
    onload@https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:93:79
    <a href="#" class="s-page back-to-top-link pagination-arrow-right ng-scope" ng-if="!SearchResults.finalPage" ng-click="!SearchResults.loading && SearchResults.loadNextPage();pageChange();" ng-class="{'disabled':SearchResults.finalPage, 'disabled-link':SearchResults.finalPage,'loading':SearchResults.loading,'disabled-link':SearchResults.actualPage==SearchResults.lastPage}" target="_self">
    https://www.scholastic.com/etc/designs/scholastic/teachers/clientlibs/core.min.js:107
    console: [object Object]
    Request error: #80 [202=Error downloading https://shop.pe/widget/main/init/params?siteid=59d3b490d559308d854e75a8&product=Book Wizard%3A Teachers%2C Find and Level Books for Your Classroom %7C Scholastic&product_url=http%3A%2F%2Fwww.scholastic.com%2Fteachers%2Fbookwizard%2F&image=http%3A%2F%2Fwww.scholastic.com%2F%2F&price=&currency=undefined&rating=0&rating_count=0&review_count=0&stock_status=&description=Level your classroom library or find books at just the right level for students with Book Wizard%2C the book finder from Scholastic with Guided Reading%2C Lexile® Measure%2C an&update_product=true&subcategory=&url=https%3A%2F%2Fwww.scholastic.com%2Fteachers%2Fbookwizard%2F&callback=AddShoppersWidget.load_widget&no_cookie_callback=AddShoppersWidget.load_no_cookie&rand=85958&cookie=&referer= - server replied: Forbidden] https://shop.pe/widget/main/init/params?siteid=59d3b490d559308d854e75a8&product=Book%20Wizard%3A%20Teachers%2C%20Find%20and%20Level%20Books%20for%20Your%20Classroom%20%7C%20Scholastic&product_url=http%3A%2F%2Fwww.scholastic.com%2Fteachers%2Fbookwizard%2F&image=http%3A%2F%2Fwww.scholastic.com%2F%2F&price=&currency=undefined&rating=0&rating_count=0&review_count=0&stock_status=&description=Level%20your%20classroom%20library%20or%20find%20books%20at%20just%20the%20right%20level%20for%20students%20with%20Book%20Wizard%2C%20the%20book%20finder%20from%20Scholastic%20with%20Guided%20Reading%2C%20Lexile%C2%AE%20Measure%2C%20an&update_product=true&subcategory=&url=https%3A%2F%2Fwww.scholastic.com%2Fteachers%2Fbookwizard%2F&callback=AddShoppersWidget.load_widget&no_cookie_callback=AddShoppersWidget.load_no_cookie&rand=85958&cookie=&referer=
    [304] https://www.scholastic.com/teachers/bookwizard/ 15.054
    [I 181010 15:54:04 tornado_fetcher:520] [304] ScholasticStorybook:34b1c45f09fa84805dd1697c1809e8c9 https://www.scholastic.com/teachers/bookwizard/ 15.05s
Tried switching the proxy client to global mode:

That fails immediately with an error:

So I gave up on global mode.
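Then I noticed this in the docs:
“validate_cert
For HTTPS requests, validate the server's certificate? default: True”
Could the problem here be related to HTTPS certificate validation?
I also searched for:
PySpider Error, missing Report Suite ID in AppMeasurement initialization
and found nothing relevant.
Before going further, see:
[Mostly solved] PySpider gets 304 when opening a page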
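To confirm whether the proxy setting above actually takes effect, I deliberately changed the port to an arbitrary wrong value. The result:
the page can still be opened (although the original problem remains)
-> which proves that
the proxy setting above has no effect.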
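The wrong-port experiment shows the setting was ignored, but it does not show whether anything is listening on either port. A minimal standard-library sketch for checking that separately; is_listening is a hypothetical helper name, and the ports are simply the ones used in this post:

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # 1087 is the proxy port from this post; 10870 is the deliberately wrong one.
    for port in (1087, 10870):
        print(port, is_listening("127.0.0.1", port))
```

A successful connection only means some process accepts TCP connections on that port, not that it is a working HTTP proxy.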
Searched: pyspider proxy not work
Changed the setting to:
    # "proxy": "127.0.0.1:10870",
    "proxy": "localhost:1087",
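After that, the success rate seemed much higher.
Further testing later showed:
[Summary]
In PySpider, network requests appear to go through the network settings of the current system (the local Mac):
- When the Mac itself has the ss proxy enabled, PySpider can open YouTube and other sites that need circumvention
- This holds even when PySpider itself has no proxy configured:
    crawl_config = {
        # "proxy": "127.0.0.1:10870",
        # "proxy": "127.0.0.1:1087",
        # "proxy": "localhost:1087",
    }
So my impression is:
with ss already enabled on the local Mac, a proxy in PySpider is not needed for now, and setting one has no effect anyway.
As for getting PySpider to reach sites that require circumvention:
with ss already enabled on the local Mac, this counts as solved for the time being.
If other problems come up, I will deal with them then.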
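To see which proxy settings Python itself picks up from the system, the standard library offers urllib.request.getproxies(). The sketch below just prints them; format_proxies is a hypothetical helper name. Whether PySpider's fetcher reads the same settings is an assumption this post has not verified, so treat it as a diagnostic aid only.

```python
import urllib.request


def format_proxies(proxies):
    """Turn a {scheme: proxy_url} mapping into sorted, printable lines."""
    if not proxies:
        return ["(no proxy settings found)"]
    return ["%s -> %s" % (scheme, url) for scheme, url in sorted(proxies.items())]


if __name__ == "__main__":
    # getproxies() reads the *_proxy environment variables and, on macOS,
    # falls back to the system proxy configuration when none are set.
    for line in format_proxies(urllib.request.getproxies()):
        print(line)
```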
When reposting, please credit: 在路上 » [Temporarily solved] Using a circumvention proxy in PySpider to open pages that require one