最新消息:20210816 当前crifan.com域名已被污染,为防止失联,请关注(页面右下角的)公众号

【已解决】PySpider中如何下载mp4视频文件到本地

pyspider crifan 3291浏览 0评论

折腾:

【已解决】使用PySpider去爬取某网站中的视频

期间,此处已经抓取到mp4视频地址了:

showVideoCallback: response.url=http://xx.xx?m=home&c=match_new&a=video&show_id=1xxx1

curShowInfoDict={‘act_id’: ‘3’, ‘child_type’: ‘1’, ‘course_id’: ‘xx’, ‘cover_img’: ‘https://xx.xx/2018-06-04/5b14e22b8850a.jpg’, ‘create_time’: ‘1512737780’, ‘head_img’: ‘https://xx.xx/2018-06-23/5b2de55693ad9.jpg’, ‘href’: ‘/index.php?m=home&c=match_new&a=video&show_id=1xxx1’, ‘id’: ‘489’, ‘match_type’: ‘2’, ‘name’: ‘唐xx’, ‘rewards’: ‘0’, ‘scores’: ‘63.20’, ‘shares’: ‘2’, ‘show_id’: ‘1xx1’, ‘show_score’: ‘0’, ‘status’: ‘1’, ‘supports’: ‘104’, ‘uid’: ‘5xx’}

title=学学xx

{‘name’: ‘唐xx’,

‘scores’: ‘63.20’,

‘shares’: ‘2’,

‘show_id’: ‘1xxx1’,

‘supports’: ‘104’,

‘title’: ‘学学xx’,

‘url’: ‘http://xx.xx?m=home&c=match_new&a=video&show_id=1xxx1’,

‘videoUrl’: ‘https://xx.xx/2017-12-08/id15xxxuxx.mp4’}

然后就需要去搞清楚:

PySpider中如何下载文件

目前能想到的是:

难道要用requests第三方库去直接下载和保存文件?

pyspider 下载文件

pyspider示例代码七:自动登陆并获得PDF文件下载地址 – microman – 博客园

pyspider – pysipder下载文件超时 – SegmentFault 思否

用的是:urllib2.urlopen去下载文件的

[python]使用pyspider下载meizitu的图片 – 简书

结果 – pyspider中文文档 – pyspider中文网

[python]使用pyspider下载meizitu的图片 – 简书

倒是可以考虑借用PySpider的自带response,把content保存到本地文件中去的。

常见问题 · pyspider文档 · 看云

pyspider—爬取下载图片 – silianpan – 博客园

【总结】

PySpider中用:

import re
import os
import codecs
import json
OutputFullPath = "/Users/crifan/dev/dev_root/xxx/output"
    def downloadVideoCallback(self, response):
        itemUrl = response.url
        print("downloadVideoCallback: itemUrl=%s,response=%s" % (itemUrl, response))
        if not os.path.exists(OutputFullPath):
            os.makedirs(OutputFullPath)
            print("Ok to create folder %s" % OutputFullPath)
        itemInfoDict = response.save
        filename = "%s-%s-%s" % (
            itemInfoDict["show_id"],
            itemInfoDict["name"],
            itemInfoDict["title"])
        print("filename=%s" % filename)
        jsonFilename = filename + ".json"
        videoSuffix = itemUrl.split(".")[-1]
        videoFileName = filename + "." + videoSuffix
        print("jsonFilename=%s,videoSuffix=%s,videoFileName=%s" % (jsonFilename, videoSuffix, videoFileName))
        # http://xx.xx/index.php?m=home&c=match_new&a=video&show_id=xxx4
        # "梁x"
        # "57.20"
        # "2"
        # "1xx34"
        # "94"
        # "跟xx心"
        # "http://xx.x.x/index.php?m=home&c=match_new&a=video&show_id=xx4"
        # "https://xx.x.x./2017-12-08/15126857208283977349.mp4"
        jsonFilePath = os.path.join(OutputFullPath, jsonFilename)
        print("jsonFilePath=%s" % jsonFilePath)
        self.saveJsonToFile(jsonFilePath, itemInfoDict)
        videoBinData = response.content
        videoFilePath = os.path.join(OutputFullPath, videoFileName)
        self.saveDataToFile(videoFilePath, videoBinData)
    def saveDataToFile(self, fullFilename, binaryData):
        with open(fullFilename, ‘wb’) as fp:
            fp.write(binaryData)
            fp.close()
            print("Complete save file %s" % fullFilename)
    def saveJsonToFile(self, fullFilename, jsonValue):
        with codecs.open(fullFilename, ‘w’, encoding="utf-8") as jsonFp:
            json.dump(jsonValue, jsonFp, indent=2, ensure_ascii=False)
            print("Complete save json %s" % fullFilename)

实现了保存文件到对应目录:

【后记20190411】

后来整理出下载二进制文件的函数:

def saveDataToFile(fullFilename, binaryData):
    """save binary data info file"""
    with open(fullFilename, 'wb') as fp:
        fp.write(binaryData)
        fp.close()
        print("Complete save file %s" % fullFilename)

class Handler(BaseHandler):

    def downloadFileCallback(self, response):
        fileInfo = response.save
        print("fileInfo=%s" % fileInfo)

        binData = response.content
        fileFullPath = os.path.join(fileInfo["saveFolder"], fileInfo["filename"])
        print("fileFullPath=%s" % fileFullPath)
        saveDataToFile(fileFullPath, binData)

    def downloadFile(self, fileInfo):
        urlToDownload = fileInfo["fileUrl"]
        print("urlToDownload=%s" % urlToDownload)
        self.crawl(urlToDownload,
            callback=self.downloadFileCallback,
            save=fileInfo)

调用举例:

      # download audio file
      #   "path": "Audio/1808/20180911222516379.mp3",
      audioFileUrlTail = singleAudioDict["path"]
      print("audioFileUrlTail=%s" % audioFileUrlTail)
      if audioFileUrlTail:
          audioFileInfo = {
              "fileUrl": gResourcesRoot + "/" + audioFileUrlTail,
              "filename": ("Aduios_%s_" % audioId) + audioFileUrlTail.replace("/", "_"),
              "saveFolder": curSingleAudioFolder,
          }
          self.downloadFile(audioFileInfo)

转载请注明:在路上 » 【已解决】PySpider中如何下载mp4视频文件到本地

发表我的评论
取消评论

表情

Hi,您需要填写昵称和邮箱!

  • 昵称 (必填)
  • 邮箱 (必填)
  • 网址
83 queries in 0.167 seconds, using 22.14MB memory