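While working on:
【Log】Crawler: scraping data from compulsory-education textbooks (义务教育教科书 / 义教教科书)
I first set out to crawl:
义教教科书英语八年级下册 (compulsory-education English textbook, Grade 8, Volume 2)
First, use Chrome to analyze the site and check:
whether the API and requests are easy to find
After a quick analysis, it appears that:
Thumbnail (thumb) of the third image: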
https://bp.pep.com.cn/ebook/yybanjxc/files/thumb/3.jpg?200209175611
Large mobile version of the third image:
https://bp.pep.com.cn/ebook/yybanjxc/files/mobile/3.jpg?200209175611
The addresses of the other images follow the same pattern:
https://bp.pep.com.cn/ebook/yybanjxc/files/thumb/1.jpg?200209175611
https://bp.pep.com.cn/ebook/yybanjxc/files/mobile/1.jpg?200209175611
https://bp.pep.com.cn/ebook/yybanjxc/files/thumb/2.jpg?200209175611
https://bp.pep.com.cn/ebook/yybanjxc/files/mobile/2.jpg?200209175611
https://bp.pep.com.cn/ebook/yybanjxc/files/mobile/4.jpg?200209175611
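The pattern in these URLs can be written down as a small helper. This is only a sketch: the name `build_page_url` and its parameters are hypothetical, and the layout (`files/<type>/<page>.jpg?<suffix>`) is assumed from the sample URLs above rather than from any documented API.

```python
# Base path assumed from the sample URLs observed in Chrome.
BASE_URL = "https://bp.pep.com.cn/ebook/yybanjxc/files"


def build_page_url(page_num, image_type="mobile", suffix="200209175611"):
    """Build the image URL for one page (hypothetical helper).

    image_type is "thumb" or "mobile", the only two kinds seen so far;
    suffix is the query string that trails every observed URL.
    """
    if image_type not in ("thumb", "mobile"):
        raise ValueError("unknown image type: %s" % image_type)
    return "%s/%s/%d.jpg?%s" % (BASE_URL, image_type, page_num, suffix)


print(build_page_url(3, "thumb"))
print(build_page_url(3))
```

Only the page number and the image type vary between the URLs; the trailing query string is the same in every sample.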
Next, find out where 200209175611 comes from.
Copy the relevant content out, paste it into VSCode, and search for it:
The relevant part is:
bookConfig.totalPageCount=147;
bookConfig.largePageWidth=1024;
bookConfig.largePageHeight=1432;;
bookConfig.securityType="1";
bookConfig.CreatedTime ="200209175611";
bookConfig.bookTitle="义教教科书英语八年级下册";
bookConfig.bookmarkCR="fedde07e3aa4fb28b08228ec8a994da9f421c6dd";
bookConfig.productName="名编辑企业版";
bookConfig.homePage="http://www.mingbianji.com";
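Instead of copying these values by hand, they could be pulled out of the config text with a regular expression. The following is a rough sketch, not what the original script does: `parse_book_config` is a hypothetical name, and it assumes every setting has the simple form `bookConfig.name=value;` with either an integer or a double-quoted string as the value.

```python
import re

# A shortened copy of the config text shown above.
config_js = (
    'bookConfig.totalPageCount=147; bookConfig.largePageWidth=1024; '
    'bookConfig.CreatedTime ="200209175611";'
    'bookConfig.bookTitle="义教教科书英语八年级下册";'
)

# name, optional spaces around "=", then a quoted string or an integer
_SETTING = re.compile(r'bookConfig\.(\w+)\s*=\s*("([^"]*)"|\d+)\s*;')


def parse_book_config(js_text):
    """Return a dict of bookConfig settings found in js_text (hypothetical helper)."""
    config = {}
    for match in _SETTING.finditer(js_text):
        key, raw, quoted = match.group(1), match.group(2), match.group(3)
        # Quoted values stay strings; bare digits become int.
        config[key] = quoted if quoted is not None else int(raw)
    return config


book_config = parse_book_config(config_js)
print(book_config["totalPageCount"], book_config["CreatedTime"])
```

Here `CreatedTime` supplies the 200209175611 seen at the end of every image URL, and `totalPageCount` gives the number of pages to fetch.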
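Taking a look at that home page:
-> Clearly, this e-book was produced by this company (名编辑), or with its technology.
So the 200209175611 in the URLs is simply bookConfig.CreatedTime, and crawling is worth a try.
Then first sort out:
【Solved】How to download binary data with Python requests and save it as an image file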
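The saving half of that step can be sketched on its own, assuming the bytes have already been fetched (for example from `resp.content` in requests). The helper name `save_jpeg_bytes` is hypothetical, and the JPEG check is an extra safeguard that the original script does not perform.

```python
import os

# Every JPEG file starts with the two-byte SOI marker FF D8.
JPEG_MAGIC = b"\xff\xd8"


def save_jpeg_bytes(data, save_path):
    """Write data to save_path in binary mode (hypothetical helper).

    Returns False without writing anything if data does not look like a
    JPEG, e.g. when the server sent back an HTML error page instead.
    """
    if not data.startswith(JPEG_MAGIC):
        return False
    folder = os.path.dirname(save_path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    # "wb" matters: text mode would try to encode or translate the bytes.
    with open(save_path, "wb") as fp:
        fp.write(data)
    return True
```

A call would look like `save_jpeg_bytes(resp.content, "output/mobile_1.jpg")`; the key point is opening the file in `"wb"` mode and writing the raw bytes unchanged.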
Running it in batch then also works:
【Summary】
The final, complete code is:
# Download the page images of the online e-book:
# 义教教科书英语八年级下册
# https://bp.pep.com.cn/ebook/yybanjxc/mobile/index.html
# Author: Crifan Li
# Update: 20200302

import os
import requests

# bookConfig.bookTitle="义教教科书英语八年级下册";
gBookTitle = "义教教科书英语八年级下册"
# bookConfig.CreatedTime ="200209175611";
gCreateTimeStr = "200209175611"
# bookConfig.totalPageCount=147;
gTotalPageCount = 147

UserAgent_Mac_Chrome = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36"
gHeaders = {
    "User-Agent": UserAgent_Mac_Chrome,
}

gSaveFolder = os.path.join("output", gBookTitle)

def createFolder(folderFullPath):
    """
    create folder, even if already existed
    Note: for Python 3.2+
    """
    os.makedirs(folderFullPath, exist_ok=True)

createFolder(gSaveFolder)

# Pages are numbered from 1 to gTotalPageCount
for curPageIdx in range(gTotalPageCount):
    curPageNum = curPageIdx + 1
    # https://bp.pep.com.cn/ebook/yybanjxc/files/thumb/1.jpg?200209175611
    # https://bp.pep.com.cn/ebook/yybanjxc/files/mobile/1.jpg?200209175611
    # curImageType = "thumb"
    curImageType = "mobile"
    curPictureUrl = "https://bp.pep.com.cn/ebook/yybanjxc/files/%s/%d.jpg?%s" % (curImageType, curPageNum, gCreateTimeStr)
    print("[%d] url=%s" % (curPageNum, curPictureUrl))
    saveFilename = "%s_%d.jpg" % (curImageType, curPageNum)
    saveFullPath = os.path.join(gSaveFolder, saveFilename)
    resp = requests.get(curPictureUrl, headers=gHeaders)
    if resp.ok:
        # Write the raw response bytes in binary mode
        with open(saveFullPath, 'wb') as saveFp:
            saveFp.write(resp.content)
        print("Saved to %s" % saveFullPath)
    else:
        print("!!! fail to open url: %s, reason: %s, status_code: %d" % (curPictureUrl, resp.reason, resp.status_code))
Letting it run to the end downloads all 147 images:
The result looks good.
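To confirm that all 147 pages really arrived, the output folder can be checked afterwards. This is a minimal sketch, assuming the `<type>_<page>.jpg` file naming used by the script above; `find_missing_pages` is a hypothetical helper and not part of the original code.

```python
import os


def find_missing_pages(folder, total_pages, image_type="mobile"):
    """Return page numbers whose image file is absent or empty (hypothetical helper)."""
    missing = []
    for page_num in range(1, total_pages + 1):
        path = os.path.join(folder, "%s_%d.jpg" % (image_type, page_num))
        # A zero-byte file usually means the download was interrupted.
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(page_num)
    return missing
```

For this book the call would be `find_missing_pages(os.path.join("output", "义教教科书英语八年级下册"), 147)`; an empty list means every page was saved.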
When reposting, please credit: 在路上 » 【Solved】Crawling compulsory-education textbook resources from bp.pep.com.cn