While working on:
[Unresolved] Using Python to crawl detailed car model and series data from Autohome (汽车之家)
along the way, I wrote and debugged this code:
class Handler(BaseHandler):
    UserAgent_Mac_Chrome = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"
    crawl_config = {
        "User-Agent": UserAgent_Mac_Chrome,
    }

    # @every(minutes=24 * 60)
    def on_start(self):
        autohomeEntryUrl = "https://www.autohome.com.cn/car/"
        self.crawl(autohomeEntryUrl, callback=self.carBrandListCallback)
but the request still failed with:
HTTP 403: Forbidden
[E 200814 20:38:23 base_handler:203] HTTP 403: Forbidden
    Traceback (most recent call last):
      File "/Users/xxx/.pyenv/versions/3.6.5/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 196, in run_task
        result = self._run_task(task, response)
      File "/Users/xxx/.pyenv/versions/3.6.5/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 175, in _run_task
        response.raise_for_status()
      File "/Users/xxx/.pyenv/versions/3.6.5/lib/python3.6/site-packages/pyspider/libs/response.py", line 184, in raise_for_status
        raise http_error
    requests.exceptions.HTTPError: HTTP 403: Forbidden
The HTML that came back was:
<html>
<head>
<title> ERROR: ACCESS DENIED </title>
</head>
<body>
<center>
<h1> ERROR: ACCESS DENIED </h1>
</center>
<hr>
<center> Fri, 14 Aug 2020 12:38:23 GMT (taikoo/BC232_dx-jiangsu-xuzhou-4-cache-1) </center>
</BODY>
</HTML>
<!-- web cache -->
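The block page above is easy to recognize in code, which can help when deciding whether a failed request was rejected by the site's cache layer rather than failing for some other reason. A minimal sketch, assuming that a 403 status together with the "ACCESS DENIED" text is a good enough signal; `is_access_denied` is a hypothetical helper, not something PySpider provides:

```python
def is_access_denied(status_code, html):
    """Return True when a response looks like the block page shown above.

    Assumption: the block page always comes with HTTP 403 and contains
    the text "ACCESS DENIED" (matched case-insensitively).
    """
    return status_code == 403 and "ACCESS DENIED" in html.upper()


# Trimmed copy of the block page from the failed request
blocked_html = "<html> <head> <title> ERROR: ACCESS DENIED </title> </head> </html>"

print(is_access_denied(403, blocked_html))        # True
print(is_access_denied(200, "<html>ok</html>"))   # False
```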
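I then checked the request in the browser:
it sends quite a few request headers,
especially the cookie, which looked hard to reproduce.
My fallback plan, if this turned out to be hard to solve, was to switch to puppeteer.
First, though, I tried adding some of the other headers.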
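While doing that, I noticed the config was written wrong: the User-Agent has to go inside a "headers" dict, not directly at the top level of crawl_config. Changed it to: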
crawl_config = {
    "headers": {
        "User-Agent": UserAgent_Mac_Chrome,
    }
}
Result:
it actually worked...
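Since the mistake was only a header name placed one level too high, a small sanity check on the config could catch it before any request is sent. The following is a rough sketch under assumptions: `find_misplaced_headers` and `COMMON_HEADER_NAMES` are hypothetical names made up for illustration, not part of PySpider, and the list of header names is deliberately short.

```python
# Hypothetical helper, not part of PySpider: flags HTTP header names that sit
# at the top level of crawl_config instead of under the "headers" key.
COMMON_HEADER_NAMES = {
    "user-agent",
    "cookie",
    "referer",
    "accept",
    "accept-language",
    "accept-encoding",
}


def find_misplaced_headers(crawl_config):
    """Return the top-level keys that look like HTTP header names."""
    return sorted(key for key in crawl_config if key.lower() in COMMON_HEADER_NAMES)


wrong_config = {"User-Agent": "Mozilla/5.0"}
right_config = {"headers": {"User-Agent": "Mozilla/5.0"}}

print(find_misplaced_headers(wrong_config))  # ['User-Agent']
print(find_misplaced_headers(right_config))  # []
```

Calling such a check once while developing the handler would have pointed at the first version of the config right away.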
[Summary]
Adding the User-Agent under the "headers" key of crawl_config:
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    UserAgent_Mac_Chrome = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"
    crawl_config = {
        # Request headers must be nested under "headers"
        "headers": {
            "User-Agent": UserAgent_Mac_Chrome,
        }
    }

    # @every(minutes=24 * 60)
    def on_start(self):
        autohomeEntryUrl = "https://www.autohome.com.cn/car/"
        self.crawl(autohomeEntryUrl, callback=self.carBrandListCallback)
was enough to get rid of the 403.
Please credit when reposting: 在路上 » [Solved] PySpider error when accessing Autohome: requests.exceptions.HTTPError HTTP 403 Forbidden