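Working on:
[Notes] Crawler, scraping data: compulsory-education textbook e-books
Along the way, I opened:
after which an account name and password had to be entered: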

After logging in, I opened:

Then I looked at how to scrape it.
What needs to be crawled:

Go into:

First, see:
[Solved] Analyzing how the tch.ityxb.com page fetches e-book images internally
Then write the code.
The code I wrote:
```python
# Function:
#   Download the page images of an online e-book (login required):
#   Java 入门, 2nd edition
# Author: Crifan Li
# Update: 20200426

import os
import requests

gBookTitle = "Java基础入门第2版"
gDomain = "tch.ityxb.com"
gTotalPageCount = 427

UserAgent_Mac_Chrome = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36"
gHeaders = {
    "User-Agent": UserAgent_Mac_Chrome,
}

# gSaveFolder = os.path.join("output", gBookTitle)
gSaveFolder = os.path.join("output", gDomain, gBookTitle)

def createFolder(folderFullPath):
    """
    create folder, even if already existed
    Note: for Python 3.2+
    """
    os.makedirs(folderFullPath, exist_ok=True)

createFolder(gSaveFolder)

# Host was not defined in the original listing; this value is inferred
# from the image URL printed while debugging the first page
Host = "https://vip.ow365.cn"
GetPageUrl = "%s/PW/GetPage" % Host
GetImgUrl = "%s/img" % Host

curPageToken = ""

def downloadImgage(imgUrl, saveFullPath):
    resp = requests.get(imgUrl)
    if resp.ok:
        with open(saveFullPath, 'wb') as saveFp:
            saveFp.write(resp.content)
            print("Saved imgUrl=%s to saveFullPath=%s" % (imgUrl, saveFullPath))
    else:
        print("Fail to download image from %s" % imgUrl)

for curPageIdx in range(gTotalPageCount):
    curPageNum = curPageIdx + 1
    print("[%d] " % (curPageNum))
    queryDict = {
        'f': "YXR0YWNobWVudC1jZW50ZXIuYm94dWVndS5jb20uODBcMThmNWJiOTZhM2I4NGM3NzllZDJhNTY4MzM3ZWFkNjAucGRm",
        "img": curPageToken,
        "isMobile": "false",
        "vid": "@ouvAGlwulktavhIGppyKg==",
        "dk": "0",
        "ver": "2",
        "sn": "0",
    }
    resp = requests.get(GetPageUrl, headers=gHeaders, params=queryDict)
    if resp.ok:
        respText = resp.text
        # print("respText=%s" % respText)
        respJson = resp.json()
        # print("respJson=%s" % respJson)
        """
        {
            "NextPage": "IDcMbrrMGOWvOQVTWydwR6WWz0UVpg2zB9VFJh7jsnp5byBCqeJ6jribHO0GQGIZ1exJW4aembE=",
            "PageCount": 427,
            "ErrorMsg": "",
            "PageIndex": 1,
            "PageWidth": 880,
            "Width": 880,
            "Height": 1237
        }
        """
        curImgToken = respJson["NextPage"]
        curPageIndex = respJson["PageIndex"]

        saveFilename = "%d.png" % curPageNum
        saveFullPath = os.path.join(gSaveFolder, saveFilename)
        curImgUrl = "%s?img=%s&tp=" % (GetImgUrl, curImgToken)
        downloadImgage(curImgUrl, saveFullPath)

        # Assumption (not in the original listing, which never updated
        # curPageToken): the returned token is the "img" value for the next request
        curPageToken = curImgToken
    else:
        # the original format string had two placeholders for three values
        print("!!! fail to open url: %s, reason: %s, status_code: %s" % (GetPageUrl, resp.reason, resp.status_code))
        print("resp.text=%s" % resp.text)
```
While debugging the first image,
the resulting URL was:
```
https://vip.ow365.cn/img?img=IDcMbrrMGOWvOQVTWydwR6WWz0UVpg2zB9VFJh7jsnp5byBCqeJ6jribHO0GQGIZ1exJW4aembE=&tp=
```
but the image saved from it was wrong.
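A cheap way to notice this kind of failure before writing the file is to look at the first bytes of the response. The following is a minimal sketch, assuming a valid response is PNG or JPEG data (the source does not say which format the endpoint serves, nor what the wrong image actually contained); `looksLikeImage` is a hypothetical helper, and it only catches non-image responses such as an HTML error page, not a valid but wrong picture.

```python
# File signatures ("magic bytes") at the start of PNG and JPEG data
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"

def looksLikeImage(data):
    """Return True if the bytes start with a PNG or JPEG signature."""
    return data.startswith(PNG_SIGNATURE) or data.startswith(JPEG_SIGNATURE)

# An HTML error body is rejected, PNG-like bytes are accepted
print(looksLikeImage(b"<html>Forbidden</html>"))
print(looksLikeImage(PNG_SIGNATURE + b"rest of file"))
```

In `downloadImgage`, such a check could be applied to `resp.content` before opening the output file.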

So I looked for which parameters were missing.
I went back to the request captured earlier:
```
-H 'authority: vip.ow365.cn' \
-H 'pragma: no-cache' \
-H 'cache-control: no-cache' \
-H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36' \
-H 'accept: image/webp,image/apng,image/*,*/*;q=0.8' \
-H 'sec-fetch-site: same-origin' \
-H 'sec-fetch-mode: no-cors' \
-H 'sec-fetch-dest: image' \
-H 'accept-language: zh-CN,zh;q=0.9,en;q=0.8,la;q=0.7' \
--compressed
```
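To replay such captured headers with `requests`, the `-H` options can be turned into a dict. This is a minimal sketch, not the author's method; `curlHeadersToDict` is a hypothetical helper, and it assumes the text uses the `-H 'name: value'` form with backslash line continuations as shown above.

```python
import shlex

def curlHeadersToDict(curlText):
    """Collect the -H 'name: value' options of a curl command into a dict."""
    # join continued lines first, then split the text like a shell would
    tokens = shlex.split(curlText.replace("\\\n", " "))
    headers = {}
    for idx, token in enumerate(tokens):
        if token == "-H" and idx + 1 < len(tokens):
            name, _, value = tokens[idx + 1].partition(":")
            headers[name.strip()] = value.strip()
    return headers

sampleCurlText = """-H 'authority: vip.ow365.cn' \\
-H 'pragma: no-cache' \\
-H 'sec-fetch-dest: image' \\
--compressed"""

print(curlHeadersToDict(sampleCurlText))
```

The resulting dict can be passed as `headers=` to `requests.get`, which makes it easy to add or remove one header at a time while testing.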
I tried adding:
authority: vip.ow365.cn
It had no effect, so my guess was the referer.
And indeed it was:

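So adding the referer is enough.
However, the value here is not the URL of the current GetPage request.
Instead, it turned out to be
the URL of the page in which the e-book is opened.
So it can simply be hard-coded here: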
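The trial-and-error above (try one header, check the result, try the next) can be written as a small loop. This is only a sketch under assumptions: `findHeadersThatFix` and the `isGoodWithHeaders` callback are hypothetical names, the referer value is a placeholder, and the check is faked offline here; in practice the callback would send the image request with the given headers and decide whether the response is a correct image.

```python
def findHeadersThatFix(candidateHeaders, isGoodWithHeaders):
    """Try each candidate header on its own and return the names that help."""
    fixing = []
    for name, value in candidateHeaders.items():
        if isGoodWithHeaders({name: value}):
            fixing.append(name)
    return fixing

# Offline stand-in for "send the request and check the image"
def fakeCheck(headers):
    return "referer" in headers

candidateHeaders = {
    "authority": "vip.ow365.cn",
    "pragma": "no-cache",
    "referer": "https://example.invalid/viewer",  # placeholder value
}
print(findHeadersThatFix(candidateHeaders, fakeCheck))
```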
```
https://vip.ow365.cn/?i=11311&ssl=1&furl=0As6WW@zSHIfqZy_0miBI1NfVmqplNkx4osgxUapgos7zntvq_BluwUV5DjSGRhsHRFJwyGpvHi9cjUTIGzm3WHgnjJ2lFd1wVPaQXBaorIzE0K0J_OXwbwK6qlOrtb@@GhMGaxrje5AeipdhF4tvw==
```
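Let's try it.
Note in particular the `i` value (11311) and the other parameters (`furl` possibly meaning "from url"?): together they are simply the URL with which this e-book's web page is opened.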
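Rather than copying the three values by hand, they can be read out of the page URL with the standard library. A minimal sketch; `splitViewerUrl` is a hypothetical helper, and the meaning of each parameter remains the guess stated above.

```python
from urllib.parse import urlsplit, parse_qs

def splitViewerUrl(viewerUrl):
    """Return the query parameters of the e-book page URL as a flat dict."""
    query = urlsplit(viewerUrl).query
    params = parse_qs(query, keep_blank_values=True)
    # parse_qs yields a list per key; each key occurs once in this URL
    return {key: values[0] for key, values in params.items()}

viewerUrl = "https://vip.ow365.cn/?i=11311&ssl=1&furl=0As6WW@zSHIfqZy_0miBI1NfVmqplNkx4osgxUapgos7zntvq_BluwUV5DjSGRhsHRFJwyGpvHi9cjUTIGzm3WHgnjJ2lFd1wVPaQXBaorIzE0K0J_OXwbwK6qlOrtb@@GhMGaxrje5AeipdhF4tvw=="
viewerParts = splitViewerUrl(viewerUrl)
print(viewerParts["i"], viewerParts["ssl"])
```

One caveat: `parse_qs` decodes `+` and percent-escapes, so for a `furl` value containing those characters the parsed string would differ from the raw one; the value shown here contains neither.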
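```python
RefererI = "11311"
RefererSsl = "1"
RefererFurl = "0As6WW@zSHIfqZy_0miBI1NfVmqplNkx4osgxUapgos7zntvq_BluwUV5DjSGRhsHRFJwyGpvHi9cjUTIGzm3WHgnjJ2lFd1wVPaQXBaorIzE0K0J_OXwbwK6qlOrtb@@GhMGaxrje5AeipdhF4tvw=="
# ImgReferer was used but not defined in the original listing;
# it is rebuilt here to match the hard-coded URL above
ImgReferer = "https://vip.ow365.cn/?i=%s&ssl=%s&furl=%s" % (RefererI, RefererSsl, RefererFurl)

ImgHeaderDict = {
    "referer": ImgReferer
}

curPageToken = ""

def downloadImgage(imgUrl, saveFullPath):
    resp = requests.get(imgUrl, headers=ImgHeaderDict)
    # (the rest of the function is unchanged)
```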
Result:

With that, the image could be fetched.
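With single pages downloading correctly, the remaining piece is walking through all pages. The following is a minimal sketch that separates the token chaining from the network calls, assuming (as the sample GetPage response suggests, but the source does not confirm) that each response's `NextPage` token both identifies the image to download and is the `img` value for the following request. `walkPageTokens`, `getPage` and `fakeGetPage` are hypothetical names used only for illustration.

```python
def walkPageTokens(getPage, totalPageCount):
    """Yield (pageNum, imgToken) pairs by chaining NextPage tokens.

    getPage: callable taking the current token (str) and returning
             the decoded GetPage JSON as a dict.
    """
    curPageToken = ""  # the first request is sent with an empty token
    for pageNum in range(1, totalPageCount + 1):
        respJson = getPage(curPageToken)
        if respJson.get("ErrorMsg"):
            # stop early instead of saving broken pages
            raise RuntimeError("GetPage failed on page %d: %s" % (pageNum, respJson["ErrorMsg"]))
        curPageToken = respJson["NextPage"]
        yield pageNum, curPageToken

# Offline stand-in for the real endpoint, only to show the call pattern
def fakeGetPage(token):
    nextIndex = 1 if token == "" else int(token[1:]) + 1
    return {"NextPage": "t%d" % nextIndex, "ErrorMsg": "", "PageIndex": nextIndex}

for pageNum, imgToken in walkPageTokens(fakeGetPage, 3):
    print(pageNum, imgToken)
```

In the real script, `getPage` would wrap the `requests.get(GetPageUrl, ...)` call, and each yielded token would be turned into an image URL and passed to `downloadImgage`.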
Please credit when reposting: 在路上 » [Unsolved] Crawling the e-book "Java 入门" from tch.ityxb.com