【Background】
The following JSON string was returned from a URL:
{"code":"A00006",data:"\t<li id=\"cmt_1932099\">\r\n\t\t<table class=\"SG_revert_Left\">\r\n\t\t\t<tr>\r\n\t\t\t\t<td><\/td>\r\n\t\t\t<\/tr>\r\n\t\t<\/table>\r\n\t\t<div class=\"SG_revert_Cont\">\r\n\t\t\t<p><span class=\"SG_revert_Tit\" id=\"nick_cmt_1932099\"><a href=\"http:\/\/blog.sina.com.cn\/u\/1612702675\" target=\"_blank\">\u5343\u5bfb<\/a><\/span><span class=\"SG_revert_Time\"><em class=\"SG_txtc\">2012-03-29 09:52:17<\/em> <a id=\"56c89b680102dynu_1932099\" onclick=\"comment_report(’56c89b680102dynu_1932099′)\" href=\"javascript:;\">[\u4e3e\u62a5]<\/a><\/span><\/p>\r\n\t\t\t<div class=\"SG_revert_Inner SG_txtb\" id=\"body_cmt_1932099\"><img src=\"http:\/\/www.sinaimg.cn\/uc\/myshow\/blog\/misc\/gif\/E___0085EN00SIGT.gif\" style=\"margin:1px;cursor:pointer;\" onclick=\"window.open(‘http:\/\/blog.sina.com.cn\/myshow2010’)\" border=\"0\" title=\"\u65e0\u8bed\" \/><img src=\"http:\/\/www.sinaimg.cn\/uc\/myshow\/blog\/misc\/gif\/E___0085EN00SIGT.gif\" style=\"margin:1px;cursor:pointer;\" onclick=\"window.open(‘http:\/\/blog.sina.com.cn\/myshow2010′)\" border=\"0\" title=\"\u65e0\u8bed\" \/><\/div>\r\n\t\t\t\r\n\t\t\t\r\n\t\t\t\t\t<\/div>\r\n\t<\/li>\t<li class=\"SG_j_linedot1\" id=\"cmt_1932970\">\r\n\t\t<table class=\"SG_revert_Left\">\r\n\t\t\t<tr>\r\n\t\t\t\t<td><\/td>\r\n\t\t\t<\/tr>\r\n\t\t<\/table>\r\n\t\t<div class=\"SG_revert_Cont\">\r\n\t\t\t<p><span class=\"SG_revert_Tit\" id=\"nick_cmt_1932970\">\u65b0\u6d6a\u7f51\u53cb<\/span><span class=\"SG_revert_Time\"><em class=\"SG_txtc\">2012-03-30 13:50:50<\/em> <a id=\"56c89b680102dynu_1932970\" onclick=\"comment_report(’56c89b680102dynu_1932970’)\" href=\"javascript:;\">[\u4e3e\u62a5]<\/a><\/span><\/p>\r\n\t\t\t<div class=\"SG_revert_Inner SG_txtb\" 
id=\"body_cmt_1932970\">\u51e4\u59d0\u603b\u8ba9\u6211\u60f3\u5230\u90a3\u4e2a\u6e2f\u5267\u91cc\u7684\u5468\u661f\u661f\u60f3\u51fa\u540d\u3001\u6709\u4f5c\u4e3a\u7684\u5c0f\u4eba\u7269\uff0c\u54c4\u7b11\u4e2d\u603b\u6709\u6cea\u6c34\u3002\u4e00\u4e2a\u4eba\u6c11\u6559\u5e08\u90fd\u8fd9\u6837\u4e86\u3002<img src=\"http:\/\/www.sinaimg.cn\/uc\/myshow\/blog\/misc\/gif\/E___0085EN00SIGT.gif\" style=\"margin:1px;cursor:pointer;\" onclick=\"window.open(‘http:\/\/blog.sina.com.cn\/myshow2010’)\" border=\"0\" title=\"\u65e0\u8bed\" \/><br><\/div>\r\n\t\t\t\r\n\t\t\t\r\n\t\t\t\t\t<\/div>\r\n\t<\/li><div><input id=\"v1x\" type=\"hidden\" value=\"d64ae94d73690823f92c64e8868d3660\"\/><input id=\"v2x\" type=\"hidden\" value=\"\"\/><\/div>"} |
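As you can see, it contains just two keys, code and data, and the value of the data key is HTML source in backslash-escaped form.
The goal is to extract elements from that HTML by id or class, for example SG_revert_Cont, so I wanted to process the HTML with BeautifulSoup.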
【Solution Process】
1. First, extract the backslash-escaped HTML source:
\t<li id=\"cmt_1932099\">\r\n\t\t<table class=\"SG_revert_Left\">\r\n\t\t\t<tr>\r\n\t\t\t\t<td><\/td>\r\n\t\t\t<\/tr>\r\n\t\t<\/table>\r\n\t\t<div class=\"SG_revert_Cont\">\r\n\t\t\t<p><span class=\"SG_revert_Tit\" id=\"nick_cmt_1932099\"><a href=\"http:\/\/blog.sina.com.cn\/u\/1612702675\" target=\"_blank\">\u5343\u5bfb<\/a><\/span><span class=\"SG_revert_Time\"><em class=\"SG_txtc\">2012-03-29 09:52:17<\/em> <a id=\"56c89b680102dynu_1932099\" onclick=\"comment_report(’56c89b680102dynu_1932099′)\" href=\"javascript:;\">[\u4e3e\u62a5]<\/a><\/span><\/p>\r\n\t\t\t<div class=\"SG_revert_Inner SG_txtb\" id=\"body_cmt_1932099\"><img src=\"http:\/\/www.sinaimg.cn\/uc\/myshow\/blog\/misc\/gif\/E___0085EN00SIGT.gif\" style=\"margin:1px;cursor:pointer;\" onclick=\"window.open(‘http:\/\/blog.sina.com.cn\/myshow2010’)\" border=\"0\" title=\"\u65e0\u8bed\" \/><img src=\"http:\/\/www.sinaimg.cn\/uc\/myshow\/blog\/misc\/gif\/E___0085EN00SIGT.gif\" style=\"margin:1px;cursor:pointer;\" onclick=\"window.open(‘http:\/\/blog.sina.com.cn\/myshow2010′)\" border=\"0\" title=\"\u65e0\u8bed\" \/><\/div>\r\n\t\t\t\r\n\t\t\t\r\n\t\t\t\t\t<\/div>\r\n\t<\/li>\t<li class=\"SG_j_linedot1\" id=\"cmt_1932970\">\r\n\t\t<table class=\"SG_revert_Left\">\r\n\t\t\t<tr>\r\n\t\t\t\t<td><\/td>\r\n\t\t\t<\/tr>\r\n\t\t<\/table>\r\n\t\t<div class=\"SG_revert_Cont\">\r\n\t\t\t<p><span class=\"SG_revert_Tit\" id=\"nick_cmt_1932970\">\u65b0\u6d6a\u7f51\u53cb<\/span><span class=\"SG_revert_Time\"><em class=\"SG_txtc\">2012-03-30 13:50:50<\/em> <a id=\"56c89b680102dynu_1932970\" onclick=\"comment_report(’56c89b680102dynu_1932970’)\" href=\"javascript:;\">[\u4e3e\u62a5]<\/a><\/span><\/p>\r\n\t\t\t<div class=\"SG_revert_Inner SG_txtb\" 
id=\"body_cmt_1932970\">\u51e4\u59d0\u603b\u8ba9\u6211\u60f3\u5230\u90a3\u4e2a\u6e2f\u5267\u91cc\u7684\u5468\u661f\u661f\u60f3\u51fa\u540d\u3001\u6709\u4f5c\u4e3a\u7684\u5c0f\u4eba\u7269\uff0c\u54c4\u7b11\u4e2d\u603b\u6709\u6cea\u6c34\u3002\u4e00\u4e2a\u4eba\u6c11\u6559\u5e08\u90fd\u8fd9\u6837\u4e86\u3002<img src=\"http:\/\/www.sinaimg.cn\/uc\/myshow\/blog\/misc\/gif\/E___0085EN00SIGT.gif\" style=\"margin:1px;cursor:pointer;\" onclick=\"window.open(‘http:\/\/blog.sina.com.cn\/myshow2010’)\" border=\"0\" title=\"\u65e0\u8bed\" \/><br><\/div>\r\n\t\t\t\r\n\t\t\t\r\n\t\t\t\t\t<\/div>\r\n\t<\/li><div><input id=\"v1x\" type=\"hidden\" value=\"d64ae94d73690823f92c64e8868d3660\"\/><input id=\"v2x\" type=\"hidden\" value=\"\"\/><\/div> |
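Then I parsed it with BeautifulSoup, but calling
soup.findAll(attrs={"class":"SG_revert_Cont"})
did not return the content I needed, presumably because the attribute values are still wrapped in literal \" sequences, so no class attribute is exactly SG_revert_Cont.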
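To illustrate why the lookup probably fails, here is a minimal sketch that compares the class attributes a parser sees before and after unescaping. It uses the standard library's html.parser rather than BeautifulSoup, so it only approximates what BeautifulSoup does; ClassCollector, collect_classes, and the shortened sample fragments are hypothetical names and data made up for this illustration.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Record the raw class attribute of every start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.classes = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class":
                self.classes.append(value)


def collect_classes(markup):
    """Feed markup to the parser and return the class values it saw."""
    parser = ClassCollector()
    parser.feed(markup)
    parser.close()
    return parser.classes


# Still escaped: the text literally contains backslash + quote.
escaped = r'<div class=\"SG_revert_Cont\"><p>hi<\/p><\/div>'
# The same fragment after unescaping.
clean = '<div class="SG_revert_Cont"><p>hi</p></div>'

escaped_classes = collect_classes(escaped)
clean_classes = collect_classes(clean)
print(escaped_classes)
print(clean_classes)
```

Only the unescaped fragment yields a class value that is exactly SG_revert_Cont, which is what a search by class needs.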
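2. After some trial and error, I finally found a way: first convert the backslash-escaped HTML string into a normal string,
and then wrap it in an ordinary HTML head and tail, i.e.:
# Assumed imports (not shown in the original post)
import logging
from bs4 import BeautifulSoup  # BeautifulSoup 3: from BeautifulSoup import BeautifulSoup

# Turn the literal escape sequences back into the characters they stand for
dataStr = dataStr.replace("\\t", "\t")
dataStr = dataStr.replace("\\r\\n", "\r\n")
dataStr = dataStr.replace("\\/", "/")
dataStr = dataStr.replace('\\"', '"')
logging.debug("after html parse: \n%s", dataStr)

# Wrap the fragment in a fake but complete HTML document
fakeHead = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Fake Title</title>
</head>
<body>
"""
fakeTail = """
</body>
</html>
"""
dataStr = fakeHead + dataStr + fakeTail
soup = BeautifulSoup(dataStr)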
After that, calling
soup.findAll(attrs={"class":"SG_revert_Cont"})
returns the commentList we need. It can then be handled like any ordinary soup object to extract the required information.
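As a possible alternative to the chain of replace calls, the escaped text can be decoded as a JSON string literal, which also converts the \uXXXX sequences that the replace calls leave untouched. This is only a sketch under assumptions: decode_json_escapes is a hypothetical helper, the sample is a shortened fragment made up for illustration, and the input is assumed to be exactly the text between the quotes of the data value.

```python
import json


def decode_json_escapes(escaped):
    """Decode text that still carries JSON string escapes.

    Assumes `escaped` is exactly the content between the quotes of a JSON
    string value. strict=False tolerates raw control characters (such as a
    literal newline) inside it.
    """
    return json.loads('"' + escaped + '"', strict=False)


# Shortened, made-up sample in the same escaped style as the data value.
sample = r'\t<li id=\"cmt_1\">\r\n<a href=\"http:\/\/blog.sina.com.cn\/u\/1\">\u5343\u5bfb<\/a><\/li>'
decoded = decode_json_escapes(sample)
print(decoded)
```

The decoded string could then be wrapped in the fake head and tail and passed to BeautifulSoup exactly as above.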
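【Summary】
If you have backslash-escaped HTML source that is only a fragment of a page, and you want to parse its content conveniently,
consider (1) converting it into ordinary HTML without the backslash escapes, then (2) adding a fake HTML head and tail so that it looks like a complete HTML document. BeautifulSoup can then process it and give you the expected soup, from which you can extract the information.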