
【Record】Adding support to BlogsToWordPress for exporting NetEase mood notes (心情随笔)

Crawl_EmulateLogin crifan

【Background】

Until now, BlogsToWordpress has not supported NetEase (163) blog mood notes (心情随笔).

This post records the work of adding that feature.

【Solution process】

1. Running:

BlogsToWordpress.py -s http://blog.163.com/ni_chen

unexpectedly failed to find even the URL of the first post.

2. So I debugged the NetEase blog with Firebug and found that the request used to fetch the post information had changed from GET to POST.

So the old GET code,

getBlogUrl = genGetBlogsUrl(userId, startBlogIdx, onceGetNum);
logging.info("getBlogUrl=%s", getBlogUrl);
# get blogs
blogsResp = crifanLib.getUrlRespHtml(getBlogUrl);
 
 
#------------------------------------------------------------------------------
# generate get blogs URL
def genGetBlogsUrl(userId, startBlogIdx, onceGetNum):
    getBlogsUrl = '';
 
    try :
        # http://api.blog.163.com/againinput4/dwr/call/plaincall/BlogBeanNew.getBlogs.dwr
        # callCount=1
        # scriptSessionId=${scriptSessionId}187
        # c0-scriptName=BlogBeanNew
        # c0-methodName=getBlogs
        # c0-id=0
        # c0-param0=number:172799491
        # c0-param1=number:0
        # c0-param2=number:20
        # batchId=955290
 
        paraDict = {
            'callCount'     :   '1',
            'scriptSessionId':  '${scriptSessionId}187',
            'c0-scriptName' :   'BlogBeanNew',
            'c0-methodName' :   'getBlogs',
            'c0-id'         :   '0',
            'c0-param0'     :   '',
            'c0-param1'     :   '',
            'c0-param2'     :   '',
            'batchId'       :   '1',
        };
        paraDict['c0-param0'] = "number:" + str(userId);
        paraDict['c0-param1'] = "number:" + str(startBlogIdx);
        paraDict['c0-param2'] = "number:" + str(onceGetNum);
         
        mainUrl = gConst['blogApi163'] + '/' + gVal['blogUser'] + '/' + 'dwr/call/plaincall/BlogBeanNew.getBlogs.dwr';
        getBlogsUrl = crifanLib.genFullUrl(mainUrl, paraDict);
 
        logging.debug("Generated get blogs url %s", getBlogsUrl);
    except :
        logging.debug("Can not generate get blog url.");
 
    return getBlogsUrl;

needs to be replaced with new, POST-based code.

3. While debugging I found the following:

The POST data is not the usual &-separated form data. The request body is special: its key=value pairs are separated by line breaks (CRLF):

callCount=1
scriptSessionId=${scriptSessionId}187
c0-scriptName=BlogBeanNew
c0-methodName=getBlogs
c0-id=0
c0-param0=number:186541395
c0-param1=number:0
c0-param2=number:1
batchId=123756

So this request is rather unusual.
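The difference from an ordinary form post can be made concrete with a minimal sketch in Python 3 (the original script targets Python 2). `encode_dwr_body` is a hypothetical helper name, not part of crifanLib. It only illustrates joining `key=value` pairs with CRLF instead of `&`, and it assumes the values are sent as-is, without percent-encoding, as in the captured request body above.

```python
from urllib.parse import urlencode


def encode_dwr_body(params, delimiter="\r\n"):
    """Join key=value pairs with a line delimiter instead of '&'.

    Hypothetical helper for illustration: values are not percent-encoded,
    matching the raw request body captured in Firebug.
    """
    return delimiter.join("%s=%s" % (key, value) for key, value in params.items())


params = {
    "callCount": "1",
    "scriptSessionId": "${scriptSessionId}187",
    "c0-scriptName": "BlogBeanNew",
    "c0-methodName": "getBlogs",
    "c0-id": "0",
    "c0-param0": "number:186541395",
    "c0-param1": "number:0",
    "c0-param2": "number:1",
    "batchId": "1",
}

# Usual form style: pairs joined by '&', values percent-encoded.
form_body = urlencode(params)
# DWR plaincall style as captured: one raw pair per line.
dwr_body = encode_dwr_body(params)
```

Printing `form_body` and `dwr_body` side by side shows why a generic form encoder does not reproduce what the browser sent.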

While working on this, I also ran into an error:

LINE 959  : INFO     getBlogsDwrMainUrl=http://api.blog.163.com/ni_chen/dwr/call/plaincall/BlogBeanNew.getBlogs.dwr          
LINE 971  : INFO     postData=c0-id=0          
batchId=1          
c0-param1=number:0          
scriptSessionId=${scriptSessionId}187          
c0-param2=number:400          
c0-methodName=getBlogs          
c0-param0=number:186541395          
c0-scriptName=BlogBeanNew          
callCount=1          
LINE 973  : INFO     req=<urllib2.Request instance at 0x0000000002F45C88>          
LINE 1011 : INFO     resp=<addinfourl at 49572616L whose fp = <socket._fileobject object at 0x0000000002F2B408>>          
LINE 1016 : INFO     gVal['cj']=<_LWPCookieJar.LWPCookieJar[<Cookie USERTRACK=221.224.111.74.1366017130803085 for .163.com/>, <Cookie NTESBLOGSI=C625520A2DCC43BB15C722EA6984CD64.app-64-8010 for .blog.163.com/>]>
LINE 1013 : INFO     blogsDwrRespHtml=//#DWR-REPLY
if (window.dwr) dwr.engine._remoteHandleBatchException({ name:'org.directwebremoting.extend.ServerException', message:'The specified call count is not a number' });
else if (window.parent.dwr) window.parent.dwr.engine._remoteHandleBatchException({ name:'org.directwebremoting.extend.ServerException', message:'The specified call count is not a number' });

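Following this reference:

"The specified call count is not a number" error when calling a DWR application with httpclient

I changed the default:

req.add_header('Content-Type', "application/x-www-form-urlencoded");

to:

req.add_header('Content-Type', "text/plain");

Only then did I finally get the content returned by DWR: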

//#DWR-INSERT

//#DWR-REPLY

var s0={};var s1={};var s2={};var s3=[];s0.abstractSysGen=1;s0.accessCount=20;s0.allowComment=-100;s0.allowView=-100;s0.blogAbstract="<div><br></div><div><div style=\"line-height: 22px;\"   ><font size=\"4\"   style=\"line-height: 28px;\"   ><br style=\"line-height: 28px;\"   ></font></div><div style=\"line-height: 22px;\"   ><font size=\"4\"   style=\"line-height: 28px;\"   ><br style=\"line-height: 28px;\"   ></font></div></div><font size=\"4\"   >\u535A\u5BA2\u5DF2\u7ECF\u5168\u90E8\u642C\u8D70\u4E86\uFF0C\u8BF7\u79FB\u9A7E\u81F3\uFF1A</font><a target=\"_blank\" rel=\"nofollow\" href=\"http://nichen.info/blogs\"   >http://nichen.info</a><br><div><font size=\"4\"   ><br></font></div><div><font size=\"4\"   ><br></font></div>";s0.blogAttachments=null;s0.blogCount=s1;s0.blogExt=s2;s0.circleCount=0;s0.circleIdList=s3;s0.circleIds=null;s0.classId="fks_084064092094082071087087087095085094087070080087085074081";s0.className="\u968F\u7B14";s0.commentCount=0;s0.comments=null;s0.content="<div><br></div><div><div style=\"line-height: 22px;\"   ><font size=\"4\"   style=\"line-height: 28px;\"   ><br style=\"line-height: 28px;\"   ></font></div><div style=\"line-height: 22px;\"   ><font size=\"4\"   style=\"line-height: 28px;\"   ><br style=\"line-height: 28px;\"   ></font></div></div><font size=\"4\"   >\u535A\u5BA2\u5DF2\u7ECF\u5168\u90E8\u642C\u8D70\u4E86\uFF0C\u8BF7\u79FB\u9A7E\u81F3\uFF1A</font><a target=\"_blank\" rel=\"nofollow\" href=\"http://nichen.info/blogs\"   >http://nichen.info</a><br><div><font size=\"4\"   ><br></font></div><div><font size=\"4\"   
><br></font></div>";s0.contentPlainText=null;s0.id="fks_087065081087083065081085085071072087089069081082087064093083";s0.ip="147.46.115.126";s0.isBlogAbstractComplete=false;s0.isPublished=1;s0.keyName="ID";s0.keyWordCheckedState=0;s0.lastAccessCountUpdateTime=1365142180844;s0.matchedKeyWord=false;s0.modifyTime=1365142180643;s0.moveFrom="NONE";s0.permaSerial="18654139520132782258253";s0.permalink="blog/static/18654139520132782258253";s0.photoIds=null;s0.photoStoreTypes=null;s0.publishTime=1362615778253;s0.publishTimeStr="8:22:58";s0.publisherId=0;s0.publisherNickname=null;s0.publisherUsername=null;s0.rank=5;s0.recomBlogHome=false;s0.ref=false;s0.shortPublishDateStr="2013-3-7";s0.synchMiniBlog=-1;s0.tag="";s0.title="\u642C\u8D70\u5566";s0.trackbackCount=0;s0.trackbackUrl="blog/18654139520132782258253.track";s0.userId=186541395;s0.userName="ni_chen";s0.userNickname="Neysa";s0.valid=0;s0.zipContent=null;

s1.accessCount=20;s1.blogId=1251225334;s1.commentCount=0;s1.mainCommentCount=0;s1.permaSerial="18654139520132782258253";s1.recommendCount=0;s1.trackbackCount=0;s1.userId=186541395;

s2.blogId=1251225334;s2.doubanResourceInfo=null;s2.miniBlogCard=0;s2.userId=186541395;s2.voteId=0;

dwr.engine._remoteHandleCallback('1','0',[s0]);

From here on, this content can be parsed as usual.
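As a rough illustration of that parsing step, here is a minimal Python 3 sketch that pulls single string fields out of a DWR reply with a regular expression. It is not the parsing code used in BlogsToWordpress. `get_dwr_string_field` is a hypothetical helper, and `sample_reply` is a shortened stand-in for the response above. It assumes the field value is a double-quoted JavaScript string whose escapes (`\"`, `\uXXXX`) are also valid JSON escapes.

```python
import json
import re


def get_dwr_string_field(reply, var_name, field_name):
    """Return the decoded value of a `var.field="...";` assignment, or None."""
    pattern = r'\b%s\.%s="((?:[^"\\]|\\.)*)";' % (re.escape(var_name), re.escape(field_name))
    match = re.search(pattern, reply)
    if match is None:
        return None
    # The captured text is a JavaScript string body. For the escapes seen here
    # it is also a valid JSON string body, so json.loads can decode it.
    return json.loads('"' + match.group(1) + '"')


# Shortened stand-in for the real reply, keeping only three fields of s0.
sample_reply = (
    '//#DWR-REPLY\n'
    'var s0={};s0.permalink="blog/static/18654139520132782258253";'
    's0.title="\\u642C\\u8D70\\u5566";s0.userName="ni_chen";\n'
    "dwr.engine._remoteHandleCallback('1','0',[s0]);"
)

title = get_dwr_string_field(sample_reply, "s0", "title")
permalink = get_dwr_string_field(sample_reply, "s0", "permalink")
```

A regex per field is enough for a handful of values such as `title` and `permalink`. Fields that reference other variables (for example `s0.blogCount=s1`) would need a more complete parser.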

4. Finally, the POST-based code below retrieves the corresponding content:

blogsDwrRespHtml = getBlogsDwrRespHtml(userId, startBlogIdx, onceGetNum);
logging.debug("blogsDwrRespHtml=%s", blogsDwrRespHtml);
 
def getBlogsDwrRespHtml(userId, startBlogIdx, onceGetNum):
    # getBlogUrl = genGetBlogsUrl(userId, startBlogIdx, onceGetNum);
    # logging.info("getBlogUrl=%s", getBlogUrl);
    # # get blogs
    # blogsRespHtml = crifanLib.getUrlRespHtml(getBlogUrl);
     
    #change GET to POST
 
    #http://api.blog.163.com/ni_chen/dwr/call/plaincall/BlogBeanNew.getBlogs.dwr
    #callCount=1
    #scriptSessionId=${scriptSessionId}187
    #c0-scriptName=BlogBeanNew
    #c0-methodName=getBlogs
    #c0-id=0
    #c0-param0=number:186541395
    #c0-param1=number:0
    #c0-param2=number:1
    #batchId=494302
     
    # http://api.blog.163.com/againinput4/dwr/call/plaincall/BlogBeanNew.getBlogs.dwr
    # callCount=1
    # scriptSessionId=${scriptSessionId}187
    # c0-scriptName=BlogBeanNew
    # c0-methodName=getBlogs
    # c0-id=0
    # c0-param0=number:172799491
    # c0-param1=number:0
    # c0-param2=number:20
    # batchId=955290
 
    postDict = {
        'callCount'     :   '1',
        'scriptSessionId':  '${scriptSessionId}187',
        'c0-scriptName' :   'BlogBeanNew',
        'c0-methodName' :   'getBlogs',
        'c0-id'         :   '0',
        'c0-param0'     :   '',
        'c0-param1'     :   '',
        'c0-param2'     :   '',
        'batchId'       :   '1',
    };
    postDict['c0-param0'] = "number:" + str(userId);
    postDict['c0-param1'] = "number:" + str(startBlogIdx);
    postDict['c0-param2'] = "number:" + str(onceGetNum);
    #http://api.blog.163.com/ni_chen/dwr/call/plaincall/BlogBeanNew.getBlogs.dwr
    getBlogsDwrMainUrl = gConst['blogApi163'] + '/' + gVal['blogUser'] + '/' + 'dwr/call/plaincall/BlogBeanNew.getBlogs.dwr';
    logging.debug("getBlogsDwrMainUrl=%s", getBlogsDwrMainUrl);
         
    #Referer    http://api.blog.163.com/crossdomain.html?t=20100205
    headerDict = {
        'Referer'       :   'http://api.blog.163.com/crossdomain.html?t=20100205',
        'Content-Type'  :   "text/plain",
    };
    blogsRespHtml = crifanLib.getUrlRespHtml(getBlogsDwrMainUrl, postDict=postDict, headerDict=headerDict, postDataDelimiter='\r\n');
 
    logging.debug("blogsRespHtml=%s", blogsRespHtml);
         
    return blogsRespHtml;
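For readers without crifanLib, the same request can be approximated with the Python 3 standard library. This is a sketch under assumptions, not the implementation of `crifanLib.getUrlRespHtml`: `build_dwr_request` is a hypothetical name, the request is only built and never sent, and the endpoint may no longer exist.

```python
import urllib.request


def build_dwr_request(url, params, referer):
    """Build (but do not send) a POST request with a CRLF-separated text/plain body."""
    body = "\r\n".join("%s=%s" % (key, value) for key, value in params.items())
    headers = {
        "Referer": referer,
        # With the default form content type, DWR answered
        # "The specified call count is not a number".
        "Content-Type": "text/plain",
    }
    # Supplying data makes urllib issue a POST instead of a GET.
    return urllib.request.Request(url, data=body.encode("utf-8"), headers=headers)


req = build_dwr_request(
    "http://api.blog.163.com/ni_chen/dwr/call/plaincall/BlogBeanNew.getBlogs.dwr",
    {
        "callCount": "1",
        "scriptSessionId": "${scriptSessionId}187",
        "c0-scriptName": "BlogBeanNew",
        "c0-methodName": "getBlogs",
        "c0-id": "0",
        "c0-param0": "number:186541395",
        "c0-param1": "number:0",
        "c0-param2": "number:1",
        "batchId": "1",
    },
    "http://api.blog.163.com/crossdomain.html?t=20100205",
)
# urllib.request.urlopen(req) would send it; that step is left out here.
```

Keeping the request construction separate from sending makes it easy to inspect the method, headers and body before touching the network.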

5. Support for the NetEase mood notes themselves is left to be added later.


【Postscript 2013-09-22】

1. Later, fetching of the mood notes was implemented; see:

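【Tutorial】A step-by-step guide to scraping content from dynamic web pages, using the recent-reader information in NetEase blog posts as the example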

and:

【Record】Using Python to parse the DWR-REPLY data returned by FeelingCard for NetEase 163 blog mood notes


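Latest reader comments (3)

  1. Hello, I read your post on scraping NetEase blog comments. I have recently been learning to collect NetEase blog content with 火车头 (a web scraping tool), but I have run into many difficulties that nobody I asked could solve. I hope you can help; payment is negotiable. Thanks.
    黄良 (2014-06-11)
    • Join my tech group: 104028266, and then we can discuss the details.
      crifan (2014-06-12)
  2. It finally exported, thank you. One small suggestion: the guestbook does not seem to be exportable.
    Neysa (2013-12-26)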