
[Record] Using Python's Scrapy to Crawl the Subtitles of Humf Videos on YouTube


Following up on the earlier manual process of finding subtitles for the little furry monster Humf:

I found a good tool website that can extract the subtitles of YouTube videos. The concrete steps:

For each playlist on Humf's official YouTube channel:

Humf – Official Channel – YouTube

After clicking in:

https://www.youtube.com/watch?v=eDaHu2vnvlM&list=PLHOR8x-IicVIHXV3Dkro99d4KyBAuVx9_

That is, for each video in each playlist,

paste its page URL into:

http://www.yousubtitles.com

then click Download, and subtitles in many languages appear; click the English one to download the srt file.

As for the subtitles of the remaining 100+ videos:

I would have to find time to write a crawler script to download them automatically.

Now I want to implement that with Scrapy.

I had already tinkered with Scrapy a bit before:

[Record] Using Python's Scrapy to Crawl cbeebies.com

Now let's try the crawl described above.

So the entry point is:

Humf – Official Channel – YouTube

Create the project:

➜  scrapy scrapy startproject youtubeSubtitle

New Scrapy project 'youtubeSubtitle', using template directory '/usr/local/lib/python2.7/site-packages/scrapy/templates/project', created in:

    /Users/crifan/dev/dev_root/company/naturling/projects/scrapy/youtubeSubtitle

You can start your first spider with:

    cd youtubeSubtitle

    scrapy genspider example example.com

➜  scrapy cd youtubeSubtitle
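For reference, startproject lays out roughly this structure (based on my understanding of the Scrapy 1.4 project template; middlewares.py is where the proxy middleware used later in this post will live):

youtubeSubtitle/
    scrapy.cfg
    youtubeSubtitle/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py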

From the hint in the startproject output, let's look at the available commands:

➜  youtubeSubtitle scrapy --help

Scrapy 1.4.0 - project: youtubeSubtitle

Usage:

  scrapy <command> [options] [args]

Available commands:

  bench         Run quick benchmark test

  check         Check spider contracts

  crawl         Run a spider

  edit          Edit spider

  fetch         Fetch a URL using the Scrapy downloader

  genspider     Generate new spider using pre-defined templates

  list          List available spiders

  parse         Parse URL (using its spider) and print the results

  runspider     Run a self-contained spider (without creating a project)

  settings      Get settings values

  shell         Interactive scraping console

  startproject  Create new project

  version       Print Scrapy version

  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command

Sure enough, genspider generates a default spider, so let's create one:

➜  youtubeSubtitle scrapy genspider YoutubeSubtitle youtube.com

Created spider 'YoutubeSubtitle' using template 'basic' in module:

  youtubeSubtitle.spiders.YoutubeSubtitle

The auto-generated content is:

# -*- coding: utf-8 -*-
import scrapy


class YoutubesubtitleSpider(scrapy.Spider):
    name = 'YoutubeSubtitle'
    allowed_domains = ['youtube.com']
    start_urls = ['http://youtube.com/']

    def parse(self, response):
        pass

Then I added some content:

# -*- coding: utf-8 -*-
import scrapy


class YoutubesubtitleSpider(scrapy.Spider):
    name = 'YoutubeSubtitle'
    allowed_domains = ['youtube.com', "yousubtitles.com"]
    start_urls = [
        "https://www.youtube.com/user/theofficialhumf/playlists"
    ]

    def parse(self, response):
        respUrl = response.url
        print "respUrl=%s"%(respUrl)
        # for the start URL this yields "theofficialhumf.html"
        filename = respUrl.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

Then run it:

➜  youtubeSubtitle scrapy crawl YoutubeSubtitle

2018-01-13 10:49:45 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: youtubeSubtitle)
2018-01-13 10:49:45 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'youtubeSubtitle.spiders', 'SPIDER_MODULES': ['youtubeSubtitle.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'youtubeSubtitle'}
2018-01-13 10:49:45 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-01-13 10:49:45 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-01-13 10:49:45 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-01-13 10:49:45 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-01-13 10:49:45 [scrapy.core.engine] INFO: Spider opened
2018-01-13 10:49:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-13 10:49:45 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023

It seemed to hang there.

Could it be because I had not yet set up a proxy to get over the GFW?

Sure enough, it timed out:

2018-01-13 10:49:45 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-01-13 10:50:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-13 10:51:00 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.youtube.com/robots.txt> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2018-01-13 10:51:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-13 10:52:16 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.youtube.com/robots.txt> (failed 2 times): TCP connection timed out: 60: Operation timed out.
2018-01-13 10:52:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-13 10:53:31 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.youtube.com/robots.txt> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2018-01-13 10:53:31 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET https://www.youtube.com/robots.txt>: TCP connection timed out: 60: Operation timed out.
TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2018-01-13 10:53:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-13 10:54:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-13 10:54:48 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.youtube.com/user/theofficialhumf/playlists> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2018-01-13 10:55:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-13 10:56:04 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.youtube.com/user/theofficialhumf/playlists> (failed 2 times): TCP connection timed out: 60: Operation timed out.
2018-01-13 10:56:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-13 10:57:20 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.youtube.com/user/theofficialhumf/playlists> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2018-01-13 10:57:20 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.youtube.com/user/theofficialhumf/playlists>: TCP connection timed out: 60: Operation timed out.
2018-01-13 10:57:20 [scrapy.core.engine] INFO: Closing spider (finished)
2018-01-13 10:57:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 6,
 'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 6,
 'downloader/request_bytes': 1398,
 'downloader/request_count': 6,
 'downloader/request_method_count/GET': 6,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 1, 13, 2, 57, 20, 551337),
 'log_count/DEBUG': 7,
 'log_count/ERROR': 2,
 'log_count/INFO': 14,
 'memusage/max': 50995200,
 'memusage/startup': 50167808,
 'retry/count': 4,
 'retry/max_reached': 2,
 'retry/reason_count/twisted.internet.error.TCPTimedOutError': 4,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2018, 1, 13, 2, 49, 45, 656662)}
2018-01-13 10:57:20 [scrapy.core.engine] INFO: Spider closed (finished)

So I had to go work through:

[Solved] How to Add a Local SOCKS Proxy to Scrapy so That It Can Open YouTube Pages
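Judging from the logs later in this post, the fix boils down to a custom downloader middleware that routes every request through a local proxy: youtubeSubtitle.middlewares.ProxyMiddleware shows up in the enabled-middlewares list, and request.meta carries 'proxy': 'http://127.0.0.1:1087' (a local HTTP endpoint such clients typically expose even when the upstream is SOCKS). A minimal sketch consistent with that; the proxy address and the priority value 100 are assumptions to adjust:

# middlewares.py -- route every request through the local HTTP proxy
# exposed by the over-the-wall client (adjust the address to your setup)
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://127.0.0.1:1087'

# settings.py -- register the middleware; 100 is an arbitrary priority
DOWNLOADER_MIDDLEWARES = {
    'youtubeSubtitle.middlewares.ProxyMiddleware': 100,
}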

Then, extract the URLs of the multiple playlists from the page.

From the saved page, what I want to extract is:

'//div[@class="yt-lockup-thumbnail"]/a[starts-with(@href, "/watch")]'

Referring to the scrapy shell usage in:

[Record] Trying Scrapy shell to Extract Sub-URLs from cbeebies.com Pages

let's test whether this XPath can find the content we want.

First:

➜  youtubeSubtitle scrapy shell

2018-01-13 21:05:57 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: youtubeSubtitle)
2018-01-13 21:05:57 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'youtubeSubtitle.spiders', 'ROBOTSTXT_OBEY': True, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'SPIDER_MODULES': ['youtubeSubtitle.spiders'], 'BOT_NAME': 'youtubeSubtitle', 'LOGSTATS_INTERVAL': 0}
2018-01-13 21:05:57 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-01-13 21:05:57 [scrapy.middleware] INFO: Enabled downloader middlewares:
['youtubeSubtitle.middlewares.ProxyMiddleware',
 'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-01-13 21:05:57 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-01-13 21:05:57 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-01-13 21:05:57 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x111cd2b90>
[s]   item       {}
[s]   settings   <scrapy.settings.Settings object at 0x111cd2b10>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
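As an aside, scrapy shell also accepts the URL directly as a command-line argument and fetches it at startup, which saves one step:

➜  youtubeSubtitle scrapy shell "https://www.youtube.com/user/theofficialhumf/playlists"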

After some poking around, what finally worked was:

>>> fetch("https://www.youtube.com/user/theofficialhumf/playlists")
2018-01-13 21:07:09 [scrapy.core.engine] INFO: Spider opened
2018-01-13 21:07:09 [default] INFO: YoutubesubtitleSpiderMiddleware process_request: request=<GET https://www.youtube.com/user/theofficialhumf/playlists>, spider=<DefaultSpider 'default' at 0x111f751d0>
2018-01-13 21:07:09 [default] INFO: request.meta{'handle_httpstatus_list': <scrapy.utils.datatypes.SequenceExclude object at 0x111f75090>, 'proxy': 'http://127.0.0.1:1087'}
2018-01-13 21:07:09 [default] INFO: YoutubesubtitleSpiderMiddleware process_request: request=<GET https://www.youtube.com/robots.txt>, spider=<DefaultSpider 'default' at 0x111f751d0>
2018-01-13 21:07:09 [default] INFO: request.meta{'dont_obey_robotstxt': True, 'proxy': 'http://127.0.0.1:1087'}
2018-01-13 21:07:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.youtube.com/robots.txt> (referer: None)
2018-01-13 21:07:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.youtube.com/user/theofficialhumf/playlists> (referer: None)

So fetch("your_url") got the result.

By chance I ran:

>>> view(response)

True

which opened the result in a browser:

Then response.body gives the page content,

so now I can go on and try the XPath.

In the page opened by view() above, the URL I want is:

https://www.youtube.com/watch?v=eDaHu2vnvlM&list=PLHOR8x-IicVIHXV3Dkro99d4KyBAuVx9_

whereas here:

>>> response.xpath('//div[@class="yt-lockup-thumbnail"]')
[<Selector xpath='//div[@class="yt-lockup-thumbnail"]' data=u'<div class="yt-lockup-thumbnail">\n      '>, <Selector xpath='//div[@class="yt-lockup-thumbnail"]' data=u'<div class="yt-lockup-thumbnail">\n      '>, <Selector xpath='//div[@class="yt-lockup-thumbnail"]' data=u'<div class="yt-lockup-thumbnail">\n      '>, <Selector xpath='//div[@class="yt-lockup-thumbnail"]' data=u'<div class="yt-lockup-thumbnail">\n      '>, <Selector xpath='//div[@class="yt-lockup-thumbnail"]' data=u'<div class="yt-lockup-thumbnail">\n      '>, <Selector xpath='//div[@class="yt-lockup-thumbnail"]' data=u'<div class="yt-lockup-thumbnail">\n      '>]

>>> response.xpath('//div[@class="yt-lockup-thumbnail"]/a')
[<Selector xpath='//div[@class="yt-lockup-thumbnail"]/a' data=u'<a href="/watch?v=23t1f8d2ISs&amp;list=P'>, <Selector xpath='//div[@class="yt-lockup-thumbnail"]/a' data=u'<a href="/watch?v=eDaHu2vnvlM&amp;list=P'>, <Selector xpath='//div[@class="yt-lockup-thumbnail"]/a' data=u'<a href="/watch?v=Ensz3fh8608&amp;list=P'>, <Selector xpath='//div[@class="yt-lockup-thumbnail"]/a' data=u'<a href="/watch?v=I7xxF6iTCTc&amp;list=P'>, <Selector xpath='//div[@class="yt-lockup-thumbnail"]/a' data=u'<a href="/watch?v=gpwWkfIyOao&amp;list=P'>, <Selector xpath='//div[@class="yt-lockup-thumbnail"]/a' data=u'<a href="/watch?v=gpwWkfIyOao&amp;list=P'>]

>>> response.xpath('//div[@class="yt-lockup-thumbnail"]/a[starts-with(@href, "/watch")]')
[<Selector xpath='//div[@class="yt-lockup-thumbnail"]/a[starts-with(@href, "/watch")]' data=u'<a href="/watch?v=23t1f8d2ISs&amp;list=P'>, <Selector xpath='//div[@class="yt-lockup-thumbnail"]/a[starts-with(@href, "/watch")]' data=u'<a href="/watch?v=eDaHu2vnvlM&amp;list=P'>, <Selector xpath='//div[@class="yt-lockup-thumbnail"]/a[starts-with(@href, "/watch")]' data=u'<a href="/watch?v=Ensz3fh8608&amp;list=P'>, <Selector xpath='//div[@class="yt-lockup-thumbnail"]/a[starts-with(@href, "/watch")]' data=u'<a href="/watch?v=I7xxF6iTCTc&amp;list=P'>, <Selector xpath='//div[@class="yt-lockup-thumbnail"]/a[starts-with(@href, "/watch")]' data=u'<a href="/watch?v=gpwWkfIyOao&amp;list=P'>, <Selector xpath='//div[@class="yt-lockup-thumbnail"]/a[starts-with(@href, "/watch")]' data=u'<a href="/watch?v=gpwWkfIyOao&amp;list=P'>]

the href seems to be only:

/watch?v=23t1f8d2ISs&amp;list=P

as if the rest of the URL were missing.

Or perhaps the trailing data simply isn't fully displayed here during debugging? (The Selector repr truncates the data it shows.)

>>> response.xpath('//div[@class="yt-lockup-thumbnail"]/a[starts-with(@href, "/watch")]')[0]
<Selector xpath='//div[@class="yt-lockup-thumbnail"]/a[starts-with(@href, "/watch")]' data=u'<a href="/watch?v=23t1f8d2ISs&amp;list=P'>

Kept debugging.

Tried an .a attribute; there is no such thing:

>>> response.xpath('//div[@class="yt-lockup-thumbnail"]/a[starts-with(@href, "/watch")]')[0].a
Traceback (most recent call last):
  File "<console>", line 1, in <module>
AttributeError: 'Selector' object has no attribute 'a'

Then I happened to try extract(), and the full content I wanted showed up:

>>> response.xpath('//div[@class="yt-lockup-thumbnail"]/a[starts-with(@href, "/watch")]')[0].extract()
u'<a href="/watch?v=23t1f8d2ISs&amp;list=PLHOR8x-IicVJDAmJWZmJ-IMu1x3lTAld5" class="yt-pl-thumb-link yt-uix-sessionlink  spf-link " data-sessionlink="ei=eQRaWs-OIsK9qQHosZ3IBA&amp;feature=c4-videos&amp;ved=CDcQlx4iEwiPr5_2gNXYAhXCXioKHehYB0komxw">\n      <span class=" yt-pl-thumb">\n      \n    <span class="video-thumb  yt-thumb yt-thumb-196">\n    <span class="yt-thumb-default">\n      <span class="yt-thumb-clip">\n        \n  <img width="196" data-ytimg="1" src="https://i.ytimg.com/vi/23t1f8d2ISs/hqdefault.jpg?sqp=-oaymwEWCMQBEG5IWvKriqkDCQgBFQAAiEIYAQ==&;rs=AOn4CLAlYOJF6LvCxIQsdzkOtRxyA3l1pw" onload=";window.__ytRIL &amp;&amp; __ytRIL(this)" alt="" aria-hidden="true">\n\n        <span class="vertical-align"></span>\n      </span>\n    </span>\n  </span>\n\n\n    <span class="sidebar">\n      <span class="yt-pl-sidebar-content yt-valign">\n        <span class="yt-valign-container">\n            <span class="formatted-video-count-label">\n      <b>66</b> videos\n  </span>\n\n            <span class="yt-pl-icon yt-pl-icon-reg yt-sprite"></span>\n        </span>\n      </span>\n    </span>\n\n      <span class="yt-pl-thumb-overlay">\n        <span class="yt-pl-thumb-overlay-content">\n          <span class="play-icon yt-sprite"></span>\n          <span class="yt-pl-thumb-overlay-text">\nPlay all\n          </span>\n        </span>\n      </span>\n\n        <span class="yt-pl-watch-queue-overlay">\n      <span class="thumb-menu dark-overflow-action-menu video-actions">\n    <button aria-expanded="false" aria-haspopup="true" onclick=";return false;" class="yt-uix-button-reverse flip addto-watch-queue-menu spf-nolink hide-until-delayloaded yt-uix-button yt-uix-button-dark-overflow-action-menu yt-uix-button-size-default yt-uix-button-has-icon no-icon-markup yt-uix-button-empty" type="button"><span class="yt-uix-button-arrow yt-sprite"></span><ul class="watch-queue-thumb-menu yt-uix-button-menu yt-uix-button-menu-dark-overflow-action-menu hid"><li role="menuitem" class="overflow-menu-choice addto-watch-queue-menu-choice addto-watch-queue-play-now yt-uix-button-menu-item" data-action="play-now" onclick=";return false;" data-list-id="PLHOR8x-IicVJDAmJWZmJ-IMu1x3lTAld5"><span class="addto-watch-queue-menu-text">Play now</span></li></ul></button>\n  </span>\n\n      <button class="yt-uix-button yt-uix-button-size-small yt-uix-button-default yt-uix-button-empty yt-uix-button-has-icon no-icon-markup addto-button addto-queue-button video-actions spf-nolink hide-until-delayloaded addto-tv-queue-button yt-uix-tooltip" type="button" onclick=";return false;" title="Queue" data-style="tv-queue" data-list-id="PLHOR8x-IicVJDAmJWZmJ-IMu1x3lTAld5"></button>\n\n  </span>\n\n  </span>\n\n  </a>'

The content is exactly the HTML I had copied earlier from the normal web page:

So the a nodes under the target divs can be retrieved correctly.

With this code:

def parse(self, response):
    respUrl = response.url
    print "respUrl=%s"%(respUrl)
    filename = respUrl.split("/")[-2] + '.html'
    with open(filename, 'wb') as f:
        f.write(response.body)

    # extract group/collection url
    lockupElemList = response.xpath('//div[@class="yt-lockup-thumbnail"]/a[starts-with(@href, "/watch")]')
    self.logger.info("lockupElemList=%s", lockupElemList)
    for eachLockupElem in lockupElemList:
        self.logger.info("eachLockupElem=%s", eachLockupElem)

the corresponding div/a elements can indeed be obtained:

Next, I needed to figure out:

[Solved] How to Get the Value of the href Attribute of an HTML a Element from a Scrapy Selector
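Roughly, it works out to something like the sketch below, using the standard Scrapy 1.4 APIs extract_first() and response.urljoin(); parseEachVideo is a made-up callback name, not the final code:

# inside the spider
def parse(self, response):
    lockupElemList = response.xpath('//div[@class="yt-lockup-thumbnail"]/a[starts-with(@href, "/watch")]')
    for eachLockupElem in lockupElemList:
        # href is relative, e.g. /watch?v=23t1f8d2ISs&list=PLHOR8x-...
        relativeHref = eachLockupElem.xpath("@href").extract_first()
        if relativeHref:
            fullUrl = response.urljoin(relativeHref)
            yield scrapy.Request(fullUrl, callback=self.parseEachVideo)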

And then:

[Solved] Scrapy warning: DEBUG: Forbidden by robots.txt
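The usual fix is to turn off the ROBOTSTXT_OBEY setting that startproject generated as True (which is also why the very first request in the logs above was for robots.txt):

# settings.py
ROBOTSTXT_OBEY = False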

Then I ran into:

[Solved] How to Make Scrapy Load the Full Page Content

And after that:

[Solved] How to Parse a Partial HTML String in Scrapy's Python and Format It as HTML Page Source
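The linked article's exact solution isn't reproduced here, but one common way to parse a partial HTML string and re-emit it as formatted source is BeautifulSoup (assuming beautifulsoup4 is installed; the fragment below is made up):

from bs4 import BeautifulSoup

htmlFragment = '<div class="yt-lockup-thumbnail"><a href="/watch?v=xxx">thumb</a></div>'
soup = BeautifulSoup(htmlFragment, "html.parser")
print(soup.prettify())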

Then figure out whether parameters can be passed down the chain:

[Solved] How to Pass Parameters into the Callback of a Subsequent Request in Scrapy
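Scrapy 1.4 predates cb_kwargs, so the standard mechanism is request.meta; a sketch, with the key and callback names made up here:

# in the spider: stash a value on the request ...
yield scrapy.Request(
    videoUrl,
    meta={"playlistTitle": playlistTitle},
    callback=self.parseVideoPage,
)

# ... and read it back in the callback via response.meta
def parseVideoPage(self, response):
    playlistTitle = response.meta["playlistTitle"]
    self.logger.info("playlistTitle=%s", playlistTitle)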

As well as:

[Solved] Python json.dumps error: raise TypeError(repr(o) + " is not JSON serializable")
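json.dumps raises that TypeError for values it cannot serialize natively, such as datetime objects; a common workaround (a sketch, not necessarily the linked article's exact fix) is a default= fallback that stringifies anything unknown:

import datetime
import json

def jsonFallback(obj):
    # last-resort conversion for objects json cannot handle natively
    return str(obj)

print(json.dumps({"crawledAt": datetime.datetime.now()}, default=jsonFallback))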

And in the end discovered:

[Solved] Some URL Links Were Lost and Not Crawled in Scrapy
