最新消息:20210816 当前crifan.com域名已被污染,为防止失联,请关注(页面右下角的)公众号

【未解决】爬取tch.ityxb.com中电子书《java 入门》

Java crifan 1200浏览 0评论
折腾:
【记录】爬虫 爬数据 义务教育教科书 义教教科书 电子书
期间,去打开:
http://tch.ityxb.com/personal/course
后,需要输入账号和密码:
登录后,去打开:
java 入门,第二版
然后去看看如何抓取。
需要爬取:
进入:
传智 | 高校教辅平台-电子教材
先去:
【已解决】分析tch.ityxb.com页面内部获取电子书图片的逻辑
然后再去写代码
写了代码:
# Function:
#   下载在线电子书:(需要登录)
#     java 入门,第二版
#     http://tch.ityxb.com/textbook/detail/8a9aec126449fa350164590f5e4d0063
#   的图片
# Author: Crifan Li
# Update: 20200426


import os
import requests


gBookTitle = "Java基础入门第2版"
gDomain = "tch.ityxb.com"
gTotalPageCount=427


UserAgent_Mac_Chrome = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36"


gHeaders = {
  "User-Agent": UserAgent_Mac_Chrome,
}


# gSaveFolder = os.path.join("output", gBookTitle)
gSaveFolder = os.path.join("output", gDomain, gBookTitle)


def createFolder(folderFullPath):
    """
        create folder, even if already existed
        Note: for Python 3.2+
    """
    os.makedirs(folderFullPath, exist_ok=True)


createFolder(gSaveFolder)


Host = "https://vip.ow365.cn"
GetPageUrl = "%s/PW/GetPage" % Host
GetImgUrl = "%s/img" % Host


curPageToken = ""


def downloadImgage(imgUrl, saveFullPath):
  resp = requests.get(imgUrl)
  if resp.ok:
    with open(saveFullPath, 'wb') as saveFp:
      saveFp.write(resp.content)
      print("Saved imgUrl=%s to saveFullPath=%s" % (imgUrl, saveFullPath))
  else:
    print("Fail to download image from %s" % imgUrl)


for curPageIdx in range(gTotalPageCount):
  curPageNum = curPageIdx + 1
  print("[%d] " % (curPageNum))
  queryDict = {
    'f': "YXR0YWNobWVudC1jZW50ZXIuYm94dWVndS5jb20uODBcMThmNWJiOTZhM2I4NGM3NzllZDJhNTY4MzM3ZWFkNjAucGRm",
    "img": curPageToken,
    "isMobile": "false",
    "vid": "@ouvAGlwulktavhIGppyKg==",
    "dk": "0",
    "ver": "2",
    "sn": "0",
  }
  resp = requests.get(GetPageUrl, headers=gHeaders, params=queryDict)
  if resp.ok:
    respText = resp.text
    # print("respText=%s" % respText)
    respJson = resp.json()
    # print("respJson=%s" % respJson)
    """
      {
        "NextPage": "IDcMbrrMGOWvOQVTWydwR6WWz0UVpg2zB9VFJh7jsnp5byBCqeJ6jribHO0GQGIZ1exJW4aembE=",
        "PageCount": 427,
        "ErrorMsg": "",
        "PageIndex": 1,
        "PageWidth": 880,
        "Width": 880,
        "Height": 1237
      }
    """
    curImgToken = respJson["NextPage"]
    curPageIndex = respJson["PageIndex"]


    saveFilename = "%d.png" % curPageNum
    saveFullPath = os.path.join(gSaveFolder, saveFilename)
    curImgUrl = "%s?img=%s&tp=" % (GetImgUrl, curImgToken)
    # https://vip.ow365.cn/img?img=IDcMbrrMGOWvOQVTWydwR6WWz0UVpg2zB9VFJh7jsnp5byBCqeJ6jmsfp2y28J9E9JreoGwZvNk=&tp=
    # https://vip.ow365.cn/img?img=IDcMbrrMGOWvOQVTWydwR6WWz0UVpg2zB9VFJh7jsnp5byBCqeJ6jribHO0GQGIZ1exJW4aembE=&tp=
    downloadImgage(curImgUrl, saveFullPath)
  else:
    print("!!! fail to open url: %s, reason: %s, status_code" % (GetPageUrl, resp.reason, resp.status_code))
    print("resp.text=%s" % resp.text)
其中调试第一张图片
结果
https://vip.ow365.cn/img?img=IDcMbrrMGOWvOQVTWydwR6WWz0UVpg2zB9VFJh7jsnp5byBCqeJ6jribHO0GQGIZ1exJW4aembE=&tp= 
却保存图片是错误的
去找找缺哪些参数
去看了之前的
curl 'https://vip.ow365.cn/img?img=IDcMbrrMGOWvOQVTWydwR6WWz0UVpg2zB9VFJh7jsnp5byBCqeJ6jribHO0GQGIZ1exJW4aembE=&tp=' \
  -H 'authority: vip.ow365.cn' \
  -H 'pragma: no-cache' \
  -H 'cache-control: no-cache' \
  -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36' \
  -H 'accept: image/webp,image/apng,image/*,*/*;q=0.8' \
  -H 'sec-fetch-site: same-origin' \
  -H 'sec-fetch-mode: no-cors' \
  -H 'sec-fetch-dest: image' \
  -H 'referer: https://vip.ow365.cn/?i=11311&ssl=1&furl=0As6WW@zSHIfqZy_0miBI1NfVmqplNkx4osgxUapgos7zntvq_BluwUV5DjSGRhsHRFJwyGpvHi9cjUTIGzm3WHgnjJ2lFd1wVPaQXBaorIzE0K0J_OXwbwK6qlOrtb@@GhMGaxrje5AeipdhF4tvw==' \
  -H 'accept-language: zh-CN,zh;q=0.9,en;q=0.8,la;q=0.7' \
  --compressed
试了试:
authority: vip.ow365.cn
无效,猜测是referer
果然是:
所以去加上referer即可。
但是发现此处值不是当前getpage的url
不过发现是:
传智 | 高校教辅平台-电子教材
中的url
所以:此处可以写死:
https://vip.ow365.cn/?i=11311&ssl=1&furl=0As6WW@zSHIfqZy_0miBI1NfVmqplNkx4osgxUapgos7zntvq_BluwUV5DjSGRhsHRFJwyGpvHi9cjUTIGzm3WHgnjJ2lFd1wVPaQXBaorIzE0K0J_OXwbwK6qlOrtb@@GhMGaxrje5AeipdhF4tvw==
试试。
尤其是其中的i值11311之类的(和furl=from url?),就是:此处对应的当前这个电子书的网页打开时的url
RefererI = "11311"
RefererSsl = "1"
RefererFurl = "0As6WW@zSHIfqZy_0miBI1NfVmqplNkx4osgxUapgos7zntvq_BluwUV5DjSGRhsHRFJwyGpvHi9cjUTIGzm3WHgnjJ2lFd1wVPaQXBaorIzE0K0J_OXwbwK6qlOrtb@@GhMGaxrje5AeipdhF4tvw=="
ImgReferer = "https://vip.ow365.cn/?i=%s&ssl=%s&furl=%s" % (RefererI, RefererSsl, RefererFurl)


ImgHeaderDict = {
  "referer": ImgReferer
}
curPageToken = ""


def downloadImgage(imgUrl, saveFullPath):
  resp = requests.get(imgUrl, headers=ImgHeaderDict)
结果:
即可获取到图片了。

转载请注明:在路上 » 【未解决】爬取tch.ityxb.com中电子书《java 入门》

发表我的评论
取消评论

表情

Hi,您需要填写昵称和邮箱!

  • 昵称 (必填)
  • 邮箱 (必填)
  • 网址
82 queries in 0.240 seconds, using 22.59MB memory