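【Background】
Until now, BlogsToWordpress has not supported the 心情随笔 (mood essays) of NetEase (163) blogs.
This post is about adding that feature.
【Solution process】
1. Running:
```
BlogsToWordpress.py -s http://blog.163.com/ni_chen
```
unexpectedly failed to find even the address of the first post.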
2. So I debugged the NetEase blog with Firebug and found that the request which fetches the post information has changed from a GET to a POST.
So the old GET code:
```python
getBlogUrl = genGetBlogsUrl(userId, startBlogIdx, onceGetNum)
logging.info("getBlogUrl=%s", getBlogUrl)

# get blogs
blogsResp = crifanLib.getUrlRespHtml(getBlogUrl)

#------------------------------------------------------------------------------
# generate get blogs URL
def genGetBlogsUrl(userId, startBlogIdx, onceGetNum):
    getBlogsUrl = ''

    try:
        # callCount=1
        # scriptSessionId=${scriptSessionId}187
        # c0-scriptName=BlogBeanNew
        # c0-methodName=getBlogs
        # c0-id=0
        # c0-param0=number:172799491
        # c0-param1=number:0
        # c0-param2=number:20
        # batchId=955290

        paraDict = {
            'callCount'       : '1',
            'scriptSessionId' : '${scriptSessionId}187',
            'c0-scriptName'   : 'BlogBeanNew',
            'c0-methodName'   : 'getBlogs',
            'c0-id'           : '0',
            'c0-param0'       : '',
            'c0-param1'       : '',
            'c0-param2'       : '',
            'batchId'         : '1',
        }
        paraDict['c0-param0'] = "number:" + str(userId)
        paraDict['c0-param1'] = "number:" + str(startBlogIdx)
        paraDict['c0-param2'] = "number:" + str(onceGetNum)

        mainUrl = gConst['blogApi163'] + '/' + gVal['blogUser'] + '/' + 'dwr/call/plaincall/BlogBeanNew.getBlogs.dwr'
        getBlogsUrl = crifanLib.genFullUrl(mainUrl, paraDict)

        logging.debug("Generated get blogs url %s", getBlogsUrl)
    except:
        logging.debug("Can not generate get blog url.")

    return getBlogsUrl
```
needs to be replaced with new, POST-based code.
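For contrast, the old GET style amounts to carrying these parameters in the URL's query string. Here is a minimal Python 3 sketch of that idea, assuming a helper like `genFullUrl` simply appends a URL-encoded query string (an assumption; crifanLib's actual implementation is not shown in this post). The function name `gen_full_url` is hypothetical and only a few of the parameters are included:

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def gen_full_url(main_url, params):
    """Append params to main_url as a URL-encoded query string (assumed behaviour)."""
    return main_url + "?" + urlencode(params)


main_url = "http://api.blog.163.com/ni_chen/dwr/call/plaincall/BlogBeanNew.getBlogs.dwr"
params = {
    "callCount": "1",
    "c0-scriptName": "BlogBeanNew",
    "c0-methodName": "getBlogs",
    "c0-param0": "number:186541395",
}

full_url = gen_full_url(main_url, params)
print(full_url)

# Round-trip check: the query string decodes back to the same parameters.
decoded = parse_qs(urlsplit(full_url).query)
print(decoded["c0-param0"])
```

With a GET, everything the server needs travels in the URL itself, which is why the switch to POST required more than just changing the URL.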
3. While debugging, I found that the POST data is not in the usual &-separated form, but is separated by line breaks:
```
callCount=1
scriptSessionId=${scriptSessionId}187
c0-scriptName=BlogBeanNew
c0-methodName=getBlogs
c0-id=0
c0-param0=number:186541395
c0-param1=number:0
c0-param2=number:1
batchId=123756
```
So it is a rather unusual format.
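To make the format concrete, here is a minimal Python 3 sketch (the project itself is Python 2) of building such a line-separated body from a set of fields. The helper name `build_dwr_body` is hypothetical, and the `\r\n` delimiter is an assumption taken from the `postDataDelimiter` value used in the final code of this post; the field values are the ones captured above.

```python
from collections import OrderedDict


def build_dwr_body(fields, delimiter="\r\n"):
    """Join key=value pairs with a line delimiter instead of '&'.

    Hypothetical helper for illustration; values are written as-is,
    without URL-encoding, as in the captured request body.
    """
    return delimiter.join("%s=%s" % (key, value) for key, value in fields.items())


fields = OrderedDict([
    ("callCount", "1"),
    ("scriptSessionId", "${scriptSessionId}187"),
    ("c0-scriptName", "BlogBeanNew"),
    ("c0-methodName", "getBlogs"),
    ("c0-id", "0"),
    ("c0-param0", "number:186541395"),  # user id
    ("c0-param1", "number:0"),          # index of the first post to fetch
    ("c0-param2", "number:1"),          # number of posts to fetch
    ("batchId", "123756"),
])

body = build_dwr_body(fields)
print(body)
```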
While working on this, I also ran into an error:
```
LINE 959 : INFO getBlogsDwrMainUrl=http://api.blog.163.com/ni_chen/dwr/call/plaincall/BlogBeanNew.getBlogs.dwr
```
Following this reference:
"The specified call count is not a number" error when calling a DWR application with httpclient
I changed the default:
```python
req.add_header('Content-Type', "application/x-www-form-urlencoded")
```
to:
```python
req.add_header('Content-Type', "text/plain")
```
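For readers not using crifanLib, the same header change can be sketched with Python 3's standard `urllib.request` (the code in this post is Python 2). The request is only constructed here, not sent; the URL is the one from the log line above, and the shortened body is purely illustrative.

```python
import urllib.request

url = "http://api.blog.163.com/ni_chen/dwr/call/plaincall/BlogBeanNew.getBlogs.dwr"
# Shortened body for illustration; a real call needs all of the DWR fields.
body = "callCount=1\r\nbatchId=1".encode("utf-8")

# Supplying data makes urllib issue a POST instead of a GET.
req = urllib.request.Request(url, data=body)
# Without this header, urllib would fall back to
# application/x-www-form-urlencoded when sending the request.
req.add_header("Content-Type", "text/plain")

print(req.get_method())                # POST
print(req.get_header("Content-type"))  # text/plain
# urllib.request.urlopen(req) would actually send it; omitted here.
```

Note that `Request` stores header names in capitalized form, which is why the lookup uses `Content-type`.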
Only then did I finally get the content returned by DWR:
```
//#DWR-INSERT
//#DWR-REPLY
var s0={};var s1={};var s2={};var s3=[];s0.abstractSysGen=1;s0.accessCount=20;s0.allowComment=-100;s0.allowView=-100;s0.blogAbstract="<div><br></div><div><div style=\"line-height: 22px;\" ><font size=\"4\" style=\"line-height: 28px;\" ><br style=\"line-height: 28px;\" ></font></div><div style=\"line-height: 22px;\" ><font size=\"4\" style=\"line-height: 28px;\" ><br style=\"line-height: 28px;\" ></font></div></div><font size=\"4\" >\u535A\u5BA2\u5DF2\u7ECF\u5168\u90E8\u642C\u8D70\u4E86\uFF0C\u8BF7\u79FB\u9A7E\u81F3\uFF1A</font><a target=\"_blank\" rel=\"nofollow\" href=\"http://nichen.info/blogs\" >http://nichen.info</a><br><div><font size=\"4\" ><br></font></div><div><font size=\"4\" ><br></font></div>";s0.blogAttachments=null;s0.blogCount=s1;s0.blogExt=s2;s0.circleCount=0;s0.circleIdList=s3;s0.circleIds=null;s0.classId="fks_084064092094082071087087087095085094087070080087085074081";s0.className="\u968F\u7B14";s0.commentCount=0;s0.comments=null;s0.content="<div><br></div><div><div style=\"line-height: 22px;\" ><font size=\"4\" style=\"line-height: 28px;\" ><br style=\"line-height: 28px;\" ></font></div><div style=\"line-height: 22px;\" ><font size=\"4\" style=\"line-height: 28px;\" ><br style=\"line-height: 28px;\" ></font></div></div><font size=\"4\" >\u535A\u5BA2\u5DF2\u7ECF\u5168\u90E8\u642C\u8D70\u4E86\uFF0C\u8BF7\u79FB\u9A7E\u81F3\uFF1A</font><a target=\"_blank\" rel=\"nofollow\" href=\"http://nichen.info/blogs\" >http://nichen.info</a><br><div><font size=\"4\" ><br></font></div><div><font size=\"4\" ><br></font></div>";s0.contentPlainText=null;s0.id="fks_087065081087083065081085085071072087089069081082087064093083";s0.ip="147.46.115.126";s0.isBlogAbstractComplete=false;s0.isPublished=1;s0.keyName="ID";s0.keyWordCheckedState=0;s0.lastAccessCountUpdateTime=1365142180844;s0.matchedKeyWord=false;s0.modifyTime=1365142180643;s0.moveFrom="NONE";s0.permaSerial="18654139520132782258253";s0.permalink="blog/static/18654139520132782258253";s0.photoIds=null;s0.photoStoreTypes=null;s0.publishTime=1362615778253;s0.publishTimeStr="8:22:58";s0.publisherId=0;s0.publisherNickname=null;s0.publisherUsername=null;s0.rank=5;s0.recomBlogHome=false;s0.ref=false;s0.shortPublishDateStr="2013-3-7";s0.synchMiniBlog=-1;s0.tag="";s0.title="\u642C\u8D70\u5566";s0.trackbackCount=0;s0.trackbackUrl="blog/18654139520132782258253.track";s0.userId=186541395;s0.userName="ni_chen";s0.userNickname="Neysa";s0.valid=0;s0.zipContent=null;
s1.accessCount=20;s1.blogId=1251225334;s1.commentCount=0;s1.mainCommentCount=0;s1.permaSerial="18654139520132782258253";s1.recommendCount=0;s1.trackbackCount=0;s1.userId=186541395;
s2.blogId=1251225334;s2.doubanResourceInfo=null;s2.miniBlogCard=0;s2.userId=186541395;s2.voteId=0;
dwr.engine._remoteHandleCallback('1','0',[s0]);
```
From here, this content can be parsed as usual.
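As one possible way to do that parsing (a sketch only, not necessarily what BlogsToWordpress itself does), string fields such as `s0.title` and `s0.permalink` can be pulled out with a regular expression and their `\uXXXX` escapes decoded. The function name `extract_dwr_string` is hypothetical, the code is Python 3, and the sample is a shortened excerpt of the response above.

```python
import json
import re


def extract_dwr_string(dwr_text, var_name, field):
    """Return the decoded value of var_name.field="..." or None if absent."""
    # Match a double-quoted JavaScript string, allowing backslash escapes.
    pattern = re.escape("%s.%s=" % (var_name, field)) + r'"((?:[^"\\]|\\.)*)"'
    match = re.search(pattern, dwr_text)
    if match is None:
        return None
    # The escapes seen in this response (\uXXXX and \") are also valid JSON
    # string escapes, so the JSON decoder can be reused to unescape them.
    return json.loads('"' + match.group(1) + '"')


sample = (
    's0.permalink="blog/static/18654139520132782258253";'
    's0.title="\\u642C\\u8D70\\u5566";'
    's0.userName="ni_chen";'
)

print(extract_dwr_string(sample, "s0", "title"))
print(extract_dwr_string(sample, "s0", "permalink"))
```

Reusing the JSON decoder keeps the sketch short, but it would reject JavaScript-only escapes such as `\'`, so a more complete parser might be needed for other responses.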
4. Finally, the POST-based code works and retrieves the content:
```python
blogsDwrRespHtml = getBlogsDwrRespHtml(userId, startBlogIdx, onceGetNum)
logging.debug("blogsDwrRespHtml=%s", blogsDwrRespHtml)

def getBlogsDwrRespHtml(userId, startBlogIdx, onceGetNum):
    # getBlogUrl = genGetBlogsUrl(userId, startBlogIdx, onceGetNum);
    # logging.info("getBlogUrl=%s", getBlogUrl);

    # # get blogs
    # blogsRespHtml = crifanLib.getUrlRespHtml(getBlogUrl);

    #change GET to POST

    #callCount=1
    #scriptSessionId=${scriptSessionId}187
    #c0-scriptName=BlogBeanNew
    #c0-methodName=getBlogs
    #c0-id=0
    #c0-param0=number:186541395
    #c0-param1=number:0
    #c0-param2=number:1
    #batchId=494302

    # callCount=1
    # scriptSessionId=${scriptSessionId}187
    # c0-scriptName=BlogBeanNew
    # c0-methodName=getBlogs
    # c0-id=0
    # c0-param0=number:172799491
    # c0-param1=number:0
    # c0-param2=number:20
    # batchId=955290

    postDict = {
        'callCount'       : '1',
        'scriptSessionId' : '${scriptSessionId}187',
        'c0-scriptName'   : 'BlogBeanNew',
        'c0-methodName'   : 'getBlogs',
        'c0-id'           : '0',
        'c0-param0'       : '',
        'c0-param1'       : '',
        'c0-param2'       : '',
        'batchId'         : '1',
    }
    postDict['c0-param0'] = "number:" + str(userId)
    postDict['c0-param1'] = "number:" + str(startBlogIdx)
    postDict['c0-param2'] = "number:" + str(onceGetNum)

    getBlogsDwrMainUrl = gConst['blogApi163'] + '/' + gVal['blogUser'] + '/' + 'dwr/call/plaincall/BlogBeanNew.getBlogs.dwr'
    logging.debug("getBlogsDwrMainUrl=%s", getBlogsDwrMainUrl)

    # DWR rejects the default form-urlencoded content type, so send text/plain
    headerDict = {
        'Content-Type' : "text/plain",
    }

    # the POST fields are joined with line breaks rather than '&'
    blogsRespHtml = crifanLib.getUrlRespHtml(getBlogsDwrMainUrl, postDict=postDict, headerDict=headerDict, postDataDelimiter='\r\n')
    logging.debug("blogsRespHtml=%s", blogsRespHtml)

    return blogsRespHtml
```
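5. Support for NetEase 心情随笔 (mood essays) will be added later.
【Postscript 2013-09-22】
1. Later, fetching of the mood essays was implemented; see:
[Tutorial] Taking the recent-reader information in NetEase blog posts as an example, a step-by-step guide to scraping content from dynamic web pages
as well as: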