
[Solved] How to add a local socks proxy to Scrapy so it can open YouTube pages


Working on:

[Record] Using Python's Scrapy to crawl the subtitles of Humf on YouTube

During this, the YouTube pages themselves can only be opened by going over the GFW.

And this Mac already has the ss proxy from Shadowsocks-NG.

So now I need to add that proxy to Scrapy.

scrapy add proxy

scrapy add GFW-bypassing proxy

scrapy: Use an HTTP proxy to bypass a site's anti-crawler mechanism | 藏经阁

Python Crawlers from Beginner to Giving Up (17): Usage of Download Middleware in the Scrapy framework – python修行路

Setting a proxy in scrapy – CSDN博客

Proxy configuration in scrapy – 简书

Bypassing anti-crawler measures with scrapy – 简书

Scrapy framework: how to add a proxy to your requests – 简书

python – Scrapy and proxies – Stack Overflow

Using Scrapy with Proxies | 草原上的狼

aivarsk/scrapy-proxies: Random proxy middleware for Scrapy

How to keep your scrapy crawler from getting banned – 秋楓 – 博客园

scrapy-rotating-proxies 0.5 : Python Package Index

Make Scrapy work with socket proxy | Michael Yin’s Blog

Scrapy – Web Crawling with a Proxy Network | The Elancer

adding http proxy in Scrapy program – Google Groups

Integrate Scrapoxy to Scrapy — Scrapoxy 3.0.0 documentation

Downloader Middleware — Scrapy 1.5.0 documentation

HttpProxyMiddleware

New in version 0.8.

class scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware

This middleware sets the HTTP proxy to use for requests, by setting the proxy meta value for Request objects.

Like the Python standard library modules urllib and urllib2, it obeys the following environment variables:

  • http_proxy

  • https_proxy

  • no_proxy

You can also set the meta key proxy per-request, to a value like http://some_proxy_server:port or http://username:password@some_proxy_server:port. Keep in mind this value will take precedence over http_proxy/https_proxy environment variables, and it will also ignore no_proxy environment variable.

So the HttpProxyMiddleware middleware can be used to set the proxy.

It follows the same rules as the Python standard library modules urllib and urllib2: it supports the environment variables

  • http_proxy

  • https_proxy

  • no_proxy

-> If these variables are set, they are used.

-> Alternatively, you can set the proxy key in a Request's meta to the value you want:

If request.meta['proxy'] is set, it takes precedence over the http_proxy and https_proxy variables above, and no_proxy is ignored as well.
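Before writing any middleware, the lowest-effort option is therefore the environment variables. A minimal sketch, assuming the local HTTP proxy port 1087 that is worked out below: export them before the crawl starts, e.g. at the top of settings.py:

import os

# Export the proxy before the crawl starts, so the built-in
# HttpProxyMiddleware picks it up from the environment.
# 127.0.0.1:1087 is the local HTTP proxy port assumed in this post.
os.environ["http_proxy"] = "http://127.0.0.1:1087"
os.environ["https_proxy"] = "http://127.0.0.1:1087"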

After consulting a pile of references, especially:

Scrapy framework: how to add a proxy to your requests – 简书

and the source code, the plan is:

  • First, directly override the Spider's start_requests and add the proxy there

    • Simple, direct, convenient

  • Later, try adding a ProxyMiddleware in middlewares.py and setting the proxy in its process_request

    • Slightly more complex, but more flexible

Go check the address of the local ss proxy here.

It seems to be:

http://127.0.0.1:1086

?

Though it should probably be:

http://127.0.0.1:1087
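To tell the two candidate ports apart (for the Mac ss client, 1086 is typically the local SOCKS5 port and 1087 the local HTTP port), a quick standalone check, sketched here with Python 3's urllib, is to try fetching a blocked page through each port as an HTTP proxy:

import urllib.request

# Try each candidate local port as an HTTP proxy: the real HTTP proxy
# port returns a response, the SOCKS port will just error out.
for port in (1086, 1087):
    proxy = "http://127.0.0.1:%d" % port
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    try:
        resp = opener.open("https://www.youtube.com", timeout=10)
        print(port, "-> works as HTTP proxy, status:", resp.status)
    except Exception as e:
        print(port, "-> failed:", e)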

So let's try it.

The result: it seems not to work:

No effect.

scrapy start_requests proxy

How to setting proxy in Python Scrapy – Stack Overflow

Spiders — Scrapy 1.5.0 documentation

connection pooling do not work when using proxy · Issue #2743 · scrapy/scrapy

Does scrapy have a bug here?

ansenhuang/scrapy-zhihu-users: a scrapy crawler for Zhihu user data

How to configure a proxy in Scrapy | GuiJu blog

    def start_requests(self):
        yield scrapy.Request(
            url = self.domain,
            headers = self.headers,
            meta = {
                'proxy': UsersConfig['proxy'],
                'cookiejar': 1
            },
            callback = self.request_captcha
        )

My code is written much the same way, so it should work.

Switch to the socks5 proxy:

def start_requests(self):
    """This is our first request to grab all the urls of the profiles.
    """
    for url in self.start_urls:
        yield scrapy.Request(
            url=url,
            meta={
                "proxy": "http://127.0.0.1:1086"
            },
            callback=self.parse,
        )

Try it.

Still doesn't work.

Fine, switch to the ProxyMiddleware approach instead.

Python crawler scrapy: setting a proxy in downloader_middleware – Luckyboy_LHD – 博客园

How to configure a proxy in Scrapy | GuiJu blog

For the record, in

class YoutubesubtitleSpider

I had used

def start_requests(self):
    """This is our first request to grab all the urls of the profiles.
    """
    for url in self.start_urls:
        self.logger.info("url=%s", url)
        yield scrapy.Request(
            url=url,
            meta={
                "proxy": "http://127.0.0.1:1087"
            },
            callback=self.parse,
        )

and it had no effect.

Feels like a scrapy bug?

Because apparently the same setup works for other people.

[Summary]

What finally worked:

In

/Users/crifan/dev/dev_root/company/naturling/projects/scrapy/youtubeSubtitle/youtubeSubtitle/settings.py

set:

DOWNLOADER_MIDDLEWARES = {
    # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1,
    # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    # register our own proxy middleware, with an early (low) priority
    "youtubeSubtitle.middlewares.ProxyMiddleware": 1
}

and in

/Users/crifan/dev/dev_root/company/naturling/projects/scrapy/youtubeSubtitle/youtubeSubtitle/middlewares.py

add:

# Start your middleware class
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        spider.logger.info("YoutubesubtitleSpiderMiddleware process_request: request=%s, spider=%s", request, spider)
        # force every outgoing request through the local HTTP proxy
        request.meta['proxy'] = "http://127.0.0.1:1087"
        spider.logger.info("request.meta%s", request.meta)

With that, the HTTP proxy takes effect and the YouTube content can be fetched. (Note: although the goal was a local "socks" proxy, what actually works here is the ss client's local HTTP proxy port 1087; Scrapy's downloader has no native SOCKS5 support, which would explain why the socks port 1086 never worked.)

2018-01-13 12:28:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-13 12:28:35 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-01-13 12:28:35 [YoutubeSubtitle] INFO: YoutubesubtitleSpiderMiddleware process_request: request=<GET https://www.youtube.com/user/theofficialhumf/playlists>, spider=<YoutubesubtitleSpider 'YoutubeSubtitle' at 0x1117645d0>
2018-01-13 12:28:35 [YoutubeSubtitle] INFO: request.meta{'proxy': 'http://127.0.0.1:1087'}
2018-01-13 12:28:35 [YoutubeSubtitle] INFO: YoutubesubtitleSpiderMiddleware process_request: request=<GET https://www.youtube.com/robots.txt>, spider=<YoutubesubtitleSpider 'YoutubeSubtitle' at 0x1117645d0>
2018-01-13 12:28:35 [YoutubeSubtitle] INFO: request.meta{'dont_obey_robotstxt': True, 'proxy': 'http://127.0.0.1:1087'}
2018-01-13 12:28:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.youtube.com/robots.txt> (referer: None)
2018-01-13 12:28:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.youtube.com/user/theofficialhumf/playlists> (referer: None)
respUrl=https://www.youtube.com/user/theofficialhumf/playlists

Opening the saved html in Chrome confirms the page content came through.
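One design note on the final middleware: the proxy URL is hardcoded there. A slightly more flexible variant (a sketch; the PROXY_URL setting name is made up here, not from the actual project) reads it from settings.py via the standard from_crawler hook:

# Sketch: same ProxyMiddleware, but the proxy comes from a custom
# (hypothetical) PROXY_URL entry in settings.py instead of being hardcoded.
class ProxyMiddleware(object):
    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # fall back to the local HTTP proxy port used in this post
        return cls(crawler.settings.get("PROXY_URL", "http://127.0.0.1:1087"))

    def process_request(self, request, spider):
        request.meta['proxy'] = self.proxy_url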
