【Background】
The previous version of BlogsToWordpress did not support NetEase's mood notes (心情随笔).
This post is about adding that feature.
【Solution Process】
1. Running:
BlogsToWordpress.py -s http://blog.163.com/ni_chen
unexpectedly failed to find even the URL of the first post.
2. So I debugged the NetEase blog with Firebug and found that the request that fetches the post information had changed from a GET to a POST.
So the old GET code:
getBlogUrl = genGetBlogsUrl(userId, startBlogIdx, onceGetNum);
logging.info("getBlogUrl=%s", getBlogUrl);
# get blogs
blogsResp = crifanLib.getUrlRespHtml(getBlogUrl);

#------------------------------------------------------------------------------
# generate get blogs URL
def genGetBlogsUrl(userId, startBlogIdx, onceGetNum):
    getBlogsUrl = '';

    try :
        # http://api.blog.163.com/againinput4/dwr/call/plaincall/BlogBeanNew.getBlogs.dwr
        # callCount=1
        # scriptSessionId=${scriptSessionId}187
        # c0-scriptName=BlogBeanNew
        # c0-methodName=getBlogs
        # c0-id=0
        # c0-param0=number:172799491
        # c0-param1=number:0
        # c0-param2=number:20
        # batchId=955290

        paraDict = {
            'callCount'         : '1',
            'scriptSessionId'   : '${scriptSessionId}187',
            'c0-scriptName'     : 'BlogBeanNew',
            'c0-methodName'     : 'getBlogs',
            'c0-id'             : '0',
            'c0-param0'         : '',
            'c0-param1'         : '',
            'c0-param2'         : '',
            'batchId'           : '1',
        };
        paraDict['c0-param0'] = "number:" + str(userId);
        paraDict['c0-param1'] = "number:" + str(startBlogIdx);
        paraDict['c0-param2'] = "number:" + str(onceGetNum);

        mainUrl = gConst['blogApi163'] + '/' + gVal['blogUser'] + '/' + 'dwr/call/plaincall/BlogBeanNew.getBlogs.dwr';
        getBlogsUrl = crifanLib.genFullUrl(mainUrl, paraDict);

        logging.debug("Generated get blogs url %s", getBlogsUrl);
    except :
        logging.debug("Can not generate get blog url.");

    return getBlogsUrl;
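needs to be replaced with new POST code.
3. While debugging, I found that the POST data is not the usual &-separated form. The fields are separated by line breaks instead: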
callCount=1
scriptSessionId=${scriptSessionId}187
c0-scriptName=BlogBeanNew
c0-methodName=getBlogs
c0-id=0
c0-param0=number:186541395
c0-param1=number:0
c0-param2=number:1
batchId=123756
So this request format is rather unusual.
While working on it, I also ran into an error:
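The line-separated body described above can be sketched in a few lines of Python 3. This is only an illustration, not the code from BlogsToWordpress: the helper name `build_dwr_body` is hypothetical, and the "\r\n" default is an assumption chosen to match the delimiter the final POST code in this post passes to crifanLib. Values are written as-is, without URL encoding.

```python
def build_dwr_body(params, delimiter="\r\n"):
    """Join key=value pairs with a line delimiter instead of '&'.

    Hypothetical helper: values are emitted verbatim (no URL encoding).
    """
    return delimiter.join(
        "{}={}".format(key, value) for key, value in params.items()
    )


# Field values copied from the captured request shown above.
# Dicts keep insertion order in Python 3.7+, so the field order is preserved.
params = {
    "callCount": "1",
    "scriptSessionId": "${scriptSessionId}187",
    "c0-scriptName": "BlogBeanNew",
    "c0-methodName": "getBlogs",
    "c0-id": "0",
    "c0-param0": "number:186541395",
    "c0-param1": "number:0",
    "c0-param2": "number:1",
    "batchId": "123756",
}

body = build_dwr_body(params)
print(body)
```

Whether the server requires "\r\n" or also accepts a bare "\n" is not verified here.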
LINE 959 : INFO getBlogsDwrMainUrl=http://api.blog.163.com/ni_chen/dwr/call/plaincall/BlogBeanNew.getBlogs.dwr
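Following this reference:
"The specified call count is not a number" error when calling a DWR application with httpclient (httpclient 调用DWR应用时发生The specified call count is not a number 错误)
I changed the default: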
req.add_header('Content-Type', "application/x-www-form-urlencoded");
to:
req.add_header('Content-Type', "text/plain");
and only then did the request finally return the corresponding DWR response:
//#DWR-INSERT
//#DWR-REPLY
var s0={};var s1={};var s2={};var s3=[];s0.abstractSysGen=1;s0.accessCount=20;s0.allowComment=-100;s0.allowView=-100;s0.blogAbstract="<div><br></div><div><div style=\"line-height: 22px;\" ><font size=\"4\" style=\"line-height: 28px;\" ><br style=\"line-height: 28px;\" ></font></div><div style=\"line-height: 22px;\" ><font size=\"4\" style=\"line-height: 28px;\" ><br style=\"line-height: 28px;\" ></font></div></div><font size=\"4\" >\u535A\u5BA2\u5DF2\u7ECF\u5168\u90E8\u642C\u8D70\u4E86\uFF0C\u8BF7\u79FB\u9A7E\u81F3\uFF1A</font><a target=\"_blank\" rel=\"nofollow\" href=\"http://nichen.info/blogs\" >http://nichen.info</a><br><div><font size=\"4\" ><br></font></div><div><font size=\"4\" ><br></font></div>";s0.blogAttachments=null;s0.blogCount=s1;s0.blogExt=s2;s0.circleCount=0;s0.circleIdList=s3;s0.circleIds=null;s0.classId="fks_084064092094082071087087087095085094087070080087085074081";s0.className="\u968F\u7B14";s0.commentCount=0;s0.comments=null;s0.content="<div><br></div><div><div style=\"line-height: 22px;\" ><font size=\"4\" style=\"line-height: 28px;\" ><br style=\"line-height: 28px;\" ></font></div><div style=\"line-height: 22px;\" ><font size=\"4\" style=\"line-height: 28px;\" ><br style=\"line-height: 28px;\" ></font></div></div><font size=\"4\" >\u535A\u5BA2\u5DF2\u7ECF\u5168\u90E8\u642C\u8D70\u4E86\uFF0C\u8BF7\u79FB\u9A7E\u81F3\uFF1A</font><a target=\"_blank\" rel=\"nofollow\" href=\"http://nichen.info/blogs\" >http://nichen.info</a><br><div><font size=\"4\" ><br></font></div><div><font size=\"4\" ><br></font></div>";s0.contentPlainText=null;s0.id="fks_087065081087083065081085085071072087089069081082087064093083";s0.ip="147.46.115.126";s0.isBlogAbstractComplete=false;s0.isPublished=1;s0.keyName="ID";s0.keyWordCheckedState=0;s0.lastAccessCountUpdateTime=1365142180844;s0.matchedKeyWord=false;s0.modifyTime=1365142180643;s0.moveFrom="NONE";s0.permaSerial="18654139520132782258253";s0.permalink="blog/static/18654139520132782258253";s0.photoIds=null;s0.photoStoreTypes=null;s0.publishTime=1362615778253;s0.publishTimeStr="8:22:58";s0.publisherId=0;s0.publisherNickname=null;s0.publisherUsername=null;s0.rank=5;s0.recomBlogHome=false;s0.ref=false;s0.shortPublishDateStr="2013-3-7";s0.synchMiniBlog=-1;s0.tag="";s0.title="\u642C\u8D70\u5566";s0.trackbackCount=0;s0.trackbackUrl="blog/18654139520132782258253.track";s0.userId=186541395;s0.userName="ni_chen";s0.userNickname="Neysa";s0.valid=0;s0.zipContent=null;
s1.accessCount=20;s1.blogId=1251225334;s1.commentCount=0;s1.mainCommentCount=0;s1.permaSerial="18654139520132782258253";s1.recommendCount=0;s1.trackbackCount=0;s1.userId=186541395;
s2.blogId=1251225334;s2.doubanResourceInfo=null;s2.miniBlogCard=0;s2.userId=186541395;s2.voteId=0;
dwr.engine._remoteHandleCallback('1','0',[s0]);
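For readers without crifanLib, the Content-Type change above can be approximated with Python 3's standard library. This is a minimal sketch under stated assumptions, not the code used in BlogsToWordpress: `make_dwr_request` is a hypothetical helper, and the two-field body is a shortened placeholder rather than the full parameter list. The request is only constructed, not sent, since it is not verified here that the endpoint still responds.

```python
import urllib.request


def make_dwr_request(url, body_text, referer=None):
    """Build (but do not send) a POST request with a text/plain body.

    Hypothetical helper illustrating the header change described above.
    """
    headers = {"Content-Type": "text/plain"}
    if referer:
        headers["Referer"] = referer
    return urllib.request.Request(
        url,
        data=body_text.encode("utf-8"),  # request bodies must be bytes
        headers=headers,
        method="POST",
    )


req = make_dwr_request(
    "http://api.blog.163.com/ni_chen/dwr/call/plaincall/BlogBeanNew.getBlogs.dwr",
    "callCount=1\r\nbatchId=1",  # shortened placeholder body, not the real one
    referer="http://api.blog.163.com/crossdomain.html?t=20100205",
)

# urllib normalizes header names with str.capitalize(), hence "Content-type".
print(req.get_method(), req.get_header("Content-type"))
# To actually send it: urllib.request.urlopen(req).read()
```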
After that, the content can be parsed as usual.
4. Finally, the POST version of the code successfully fetches the corresponding content:
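As a rough illustration of that parsing step, string fields such as `s0.title` and `s0.permalink` could be pulled out of the DWR reply with a regular expression. This is a sketch in Python 3, not the parser BlogsToWordpress uses: `extract_dwr_string_fields` is a hypothetical name, and it assumes the values only use escapes that JSON also understands (`\"`, `\\`, `\uXXXX`), which holds for the reply shown above but not for every JavaScript string.

```python
import json
import re


def extract_dwr_string_fields(dwr_text, var_name="s0"):
    """Collect `s0.name="value"` assignments from a DWR reply into a dict.

    Only double-quoted string values are captured; numbers, null and
    references to other variables (s1, s2, ...) are skipped.
    """
    pattern = re.compile(
        re.escape(var_name) + r'\.(\w+)="((?:[^"\\]|\\.)*)"'
    )
    fields = {}
    for name, raw in pattern.findall(dwr_text):
        # Decode the escapes by treating the value as a JSON string literal.
        fields[name] = json.loads('"' + raw + '"')
    return fields


# A short excerpt of the reply shown above.
sample = (
    's0.permalink="blog/static/18654139520132782258253";'
    's0.title="\\u642C\\u8D70\\u5566";'
    's0.userId=186541395;s0.userName="ni_chen";'
)

fields = extract_dwr_string_fields(sample)
print(fields["permalink"])
print(fields["title"])
```

A value that itself contains text shaped like `s0.name="..."` could confuse this pattern, so a stricter parser would be safer for real use.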
blogsDwrRespHtml = getBlogsDwrRespHtml(userId, startBlogIdx, onceGetNum);
logging.debug("blogsDwrRespHtml=%s", blogsDwrRespHtml);

def getBlogsDwrRespHtml(userId, startBlogIdx, onceGetNum):
    # getBlogUrl = genGetBlogsUrl(userId, startBlogIdx, onceGetNum);
    # logging.info("getBlogUrl=%s", getBlogUrl);
    # # get blogs
    # blogsRespHtml = crifanLib.getUrlRespHtml(getBlogUrl);

    #change GET to POST

    #http://api.blog.163.com/ni_chen/dwr/call/plaincall/BlogBeanNew.getBlogs.dwr
    #callCount=1
    #scriptSessionId=${scriptSessionId}187
    #c0-scriptName=BlogBeanNew
    #c0-methodName=getBlogs
    #c0-id=0
    #c0-param0=number:186541395
    #c0-param1=number:0
    #c0-param2=number:1
    #batchId=494302

    # http://api.blog.163.com/againinput4/dwr/call/plaincall/BlogBeanNew.getBlogs.dwr
    # callCount=1
    # scriptSessionId=${scriptSessionId}187
    # c0-scriptName=BlogBeanNew
    # c0-methodName=getBlogs
    # c0-id=0
    # c0-param0=number:172799491
    # c0-param1=number:0
    # c0-param2=number:20
    # batchId=955290

    postDict = {
        'callCount'         : '1',
        'scriptSessionId'   : '${scriptSessionId}187',
        'c0-scriptName'     : 'BlogBeanNew',
        'c0-methodName'     : 'getBlogs',
        'c0-id'             : '0',
        'c0-param0'         : '',
        'c0-param1'         : '',
        'c0-param2'         : '',
        'batchId'           : '1',
    };
    postDict['c0-param0'] = "number:" + str(userId);
    postDict['c0-param1'] = "number:" + str(startBlogIdx);
    postDict['c0-param2'] = "number:" + str(onceGetNum);

    #http://api.blog.163.com/ni_chen/dwr/call/plaincall/BlogBeanNew.getBlogs.dwr
    getBlogsDwrMainUrl = gConst['blogApi163'] + '/' + gVal['blogUser'] + '/' + 'dwr/call/plaincall/BlogBeanNew.getBlogs.dwr';
    logging.debug("getBlogsDwrMainUrl=%s", getBlogsDwrMainUrl);

    #Referer http://api.blog.163.com/crossdomain.html?t=20100205
    headerDict = {
        'Referer'       : 'http://api.blog.163.com/crossdomain.html?t=20100205',
        'Content-Type'  : "text/plain",
    };

    blogsRespHtml = crifanLib.getUrlRespHtml(getBlogsDwrMainUrl, postDict=postDict, headerDict=headerDict, postDataDelimiter='\r\n');
    logging.debug("blogsRespHtml=%s", blogsRespHtml);

    return blogsRespHtml;
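5. Support for NetEase's mood notes (心情随笔) will be added later.
【Postscript 2013-09-22】
1. Later on, fetching the mood notes was implemented:
【Tutorial】A step-by-step guide to scraping content from dynamic web pages, using the recent-readers info of a NetEase blog post as the example (【教程】以抓取网易博客帖子中的最近读者信息为例,手把手教你如何抓取动态网页中的内容)
As well as: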