While working on this issue:
https://github.com/crifan/BlogsToWordpress/issues/1
I wanted to use Python's BeautifulSoup to extract, from this HTML:
<div class="ui-1582983425 noselect js-zfrg-6541" style="display: block;">
  <span class="pgi pgb iblock fc03 bgc9 bdc0 js-znpg-097">上一页</span>
  <span class="pgi zpg1 iblock fc03 bgc9 bdc0 js-zslt-987 fc05">1</span>
  <span class="frg fgp fc06">…</span>
  <span class="pgi zpg2 iblock fc03 bgc9 bdc0">2</span>
  <span class="pgi zpg3 iblock fc03 bgc9 bdc0">3</span>
  <span class="pgi zpg4 iblock fc03 bgc9 bdc0">4</span>
  <span class="pgi zpg5 iblock fc03 bgc9 bdc0">5</span>
  <span class="pgi zpg6 iblock fc03 bgc9 bdc0">6</span>
  <span class="pgi zpg7 iblock fc03 bgc9 bdc0">7</span>
  <span class="pgi zpg8 iblock fc03 bgc9 bdc0">8</span>
  <span class="frg fgn fc06">…</span>
  <span class="pgi zpg9 iblock fc03 bgc9 bdc0">58</span>
  <span class="pgi pgb iblock fc03 bgc9 bdc0">下一页</span>
</div>
the following element:
<span class="pgi zpg9 iblock fc03 bgc9 bdc0">58</span>
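So the goal is to find the elements:
whose class starts with pgi zpg
(or, to be more precise, ideally:
whose class is pgi zpgN iblock fc03 bgc9 bdc0
where N is a number with any count of digits
)
and then, from the resulting list, take the last one.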
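Before involving BeautifulSoup at all, the class pattern itself can be checked with plain re. The following is only a minimal sketch: the names PAGE_CLASS_PATTERN and class_values are made up for illustration, and the class strings are copied from the HTML above. Note that search() also accepts the current-page span, because its class merely contains the pattern and then carries extra classes.

```python
import re

# Pattern from the requirement: "pgi zpgN iblock fc03 bgc9 bdc0", N = one or more digits.
# PAGE_CLASS_PATTERN is a hypothetical name used only in this sketch.
PAGE_CLASS_PATTERN = re.compile(r"pgi zpg\d+ iblock fc03 bgc9 bdc0")

# Class attribute values copied from the HTML snippet above.
class_values = [
    "pgi pgb iblock fc03 bgc9 bdc0 js-znpg-097",        # "previous page" button
    "pgi zpg1 iblock fc03 bgc9 bdc0 js-zslt-987 fc05",  # current page, with extra classes
    "frg fgp fc06",                                     # ellipsis placeholder
    "pgi zpg9 iblock fc03 bgc9 bdc0",                   # last page number
    "pgi pgb iblock fc03 bgc9 bdc0",                    # "next page" button
]

# search() looks for the pattern anywhere in the string, so trailing classes are tolerated.
matching = [value for value in class_values if PAGE_CLASS_PATTERN.search(value)]

# The last matching entry is the one that belongs to the final page number.
last_match = matching[-1]
print(matching)
print(last_match)
```

The two buttons and the ellipsis are filtered out because pgb and frg do not fit zpg followed by digits, so taking the last match is enough.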
beautifulsoup find class contains
python – Beautiful Soup if Class "Contains" or Regex? – Stack Overflow
soup.select does not seem convenient enough for this?
python 2.7 – Beautiful Soup – Class contains ‘a’ and not contains ‘b’ – Stack Overflow
beautifulsoup
The BeautifulSoup version used here is 3.0.6
"soup.findAll(attrs={'id' : re.compile("para$")})"
Beautiful Soup Documentation — Beautiful Soup 4.4.0 documentation
Beautiful Soup 4.4.0 documentation (Chinese translation)
So, try matching with an re regular expression.
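[Summary]
In the end, the following code:
import logging
import re

# htmlToSoup and respHtml are defined elsewhere in the script
soup = htmlToSoup(respHtml)
# raw string, so that \d reaches the regex engine unchanged
pageClassPattern = re.compile(r"pgi zpg\d+ iblock fc03 bgc9 bdc0")
logging.debug("pageClassPattern=%s", pageClassPattern)
# BeautifulSoup 3: the class attribute is one string, matched against the regex
allPageNodeList = soup.findAll(attrs={"class" : pageClassPattern})
logging.debug("allPageNodeList=%s", allPageNodeList)
if allPageNodeList :
    # the last matching node is the one with the last page number
    lastPageNumNode = allPageNodeList[-1]
    logging.debug("lastPageNumNode=%s", lastPageNumNode)
    lastPageNumStr = lastPageNumNode.string.strip()
    logging.debug("lastPageNumStr=%s", lastPageNumStr)
    lastPageNum = int(lastPageNumStr)
    logging.debug("lastPageNum=%s", lastPageNum)
is enough to find the node and extract the required number.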
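For anyone who wants to try the same idea without BeautifulSoup 3 installed, here is a rough sketch using only the Python 3 standard library (html.parser plus re). This is not the code used in this post, only an alternative under stated assumptions: PageNumberCollector, extract_last_page_num and sample_html are hypothetical names, and the sample is an abridged copy of the HTML above.

```python
import re
from html.parser import HTMLParser

# Same class pattern as in the post; N in zpgN may have any number of digits.
PAGE_CLASS_PATTERN = re.compile(r"pgi zpg\d+ iblock fc03 bgc9 bdc0")


class PageNumberCollector(HTMLParser):
    """Hypothetical helper: collect the text of every <span> whose class matches."""

    def __init__(self):
        super().__init__()
        self.page_texts = []
        self._in_page_span = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a missing class becomes "".
        class_value = dict(attrs).get("class") or ""
        self._in_page_span = (
            tag == "span" and bool(PAGE_CLASS_PATTERN.search(class_value))
        )

    def handle_endtag(self, tag):
        # Any closing tag ends the current page-number span (spans are not nested here).
        self._in_page_span = False

    def handle_data(self, data):
        if self._in_page_span and data.strip():
            self.page_texts.append(data.strip())


def extract_last_page_num(html):
    """Return the last page number as int, or None if no page span was found."""
    collector = PageNumberCollector()
    collector.feed(html)
    collector.close()
    if not collector.page_texts:
        return None
    return int(collector.page_texts[-1])


# Abridged copy of the pager HTML shown earlier in the post.
sample_html = (
    '<div class="ui-1582983425 noselect js-zfrg-6541" style="display: block;">'
    '<span class="pgi pgb iblock fc03 bgc9 bdc0 js-znpg-097">上一页</span> '
    '<span class="pgi zpg1 iblock fc03 bgc9 bdc0 js-zslt-987 fc05">1</span> '
    '<span class="frg fgp fc06">…</span> '
    '<span class="pgi zpg2 iblock fc03 bgc9 bdc0">2</span> '
    '<span class="pgi zpg9 iblock fc03 bgc9 bdc0">58</span> '
    '<span class="pgi pgb iblock fc03 bgc9 bdc0">下一页</span>'
    '</div>'
)

last_page_num = extract_last_page_num(sample_html)
print(last_page_num)
```

The structure mirrors the BeautifulSoup version: collect every node whose class matches, take the last one, strip its text and convert it to int. Returning None when nothing matches plays the same role as the if allPageNodeList check.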