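【Question】
After fetching this data with BeautifulSoup (along with other parts of the page), I don't know how to use BeautifulSoup's parsing methods (doing it with re seems complicated) to extract the two values I want, 24,804,000,000.00 and 1,511,750,000.00. Any help would be appreciated!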
【Answer】
1. To extract the data, you first need to see the structure of the HTML clearly, so format it by hand:
<tr>
<td width='150px'><strong>报表日期</strong></td>
<td style='text-align:right;'>2013-03-31</td>
</tr>
<tr>
</tr>
<tr>
<td colspan='5'><strong>流动资产</strong></td>
</tr>
<tr>
<td style='padding-left:30px' width='150px'>
<a target='_blank' href='/corp/view/vFD_FinanceSummaryHistory.php?stockid=002024&type=cbsheet1'>货币资金</a>
</td>
<td style='text-align:right;'>24,804,000,000.00</td>
</tr>
<tr>
<td style='padding-left:30px' width='150px'><a target='_blank' href='/corp/view/vFD_FinanceSummaryHistory.php?stockid=002024&type=cbsheet110'>交易性金融资产</a></td>
<td style='text-align:right;'>1,511,750,000.00</td>
</tr>
</tbody>
Now the structure is easy to see.
2. As you can see, provided you are sure that the HTML structure above will not change, you can use:
findAll(name="td", attrs={"style": "text-align:right;"})
to find these three td elements:
<td style='text-align:right;'>2013-03-31</td>
<td style='text-align:right;'>24,804,000,000.00</td>
<td style='text-align:right;'>1,511,750,000.00</td>
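3. (I only just learned this myself.)
You can additionally pass the text parameter, which is matched against the string of each soup node.
Because BeautifulSoup's findAll accepts a regular expression (re) for text, you can use: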
findAll(name="td", attrs={"style": "text-align:right;"}, text=re.compile(r"\d+(,\d+)*\.\d+"))
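to match only the two money values you need:
24,804,000,000.00
1,511,750,000.00
(Note: the result is these two strings, not the two complete td tags such as:
<td style='text-align:right;'>24,804,000,000.00</td>
)
(Also note: according to the BeautifulSoup 3 documentation, once text is given, the values passed for name and attrs are ignored. The date 2013-03-31 is excluded here because it contains no decimal point and so does not match the regex.)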
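To see why the regex keeps the two money values and drops the date, the pattern can be tried on its own with the standard re module. This is only a minimal sketch in Python 3: the sample list and the names MONEY_RE, samples and matched are made up for illustration, and re.search is used on the assumption that BeautifulSoup applies the compiled pattern with search rather than a full match.

```python
import re

# The same pattern as in the findAll call above, written as a raw string.
MONEY_RE = re.compile(r"\d+(,\d+)*\.\d+")

# Hypothetical sample strings: the three td texts plus two extra cases.
samples = [
    "2013-03-31",          # date: no decimal point, so no match
    "24,804,000,000.00",   # comma-grouped amount
    "1,511,750,000.00",    # comma-grouped amount
    "23400.456",           # no commas at all is also accepted
    "报表日期",              # label text without digits
]

# Keep only the strings in which the pattern is found.
matched = [s for s in samples if MONEY_RE.search(s)]
print(matched)
```

The pattern requires digits, then optional comma-separated digit groups, then a literal dot followed by more digits, which is why a date made of hyphen-separated numbers never qualifies.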
4. The complete code is as follows:
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Function:
【解答】关于BeautifulSoup抓取目标数据的问题

Author:     Crifan Li
Version:    2013-06-06
Contact:    https://www.crifan.com/contact_me
"""

# Note: this script targets Python 2 with BeautifulSoup 3 (the `BeautifulSoup` module).
import re
from BeautifulSoup import BeautifulSoup

def beautifulsoup_capture_money():
    """
    1. answer other's question
    2. demo BeautifulSoup usage: findAll(text=xxx)
    """
    html = """<tr>
<td width='150px'><strong>报表日期</strong></td>
<td style='text-align:right;'>2013-03-31</td>
</tr>
<tr>
</tr>
<tr>
<td colspan='5'><strong>流动资产</strong></td>
</tr>
<tr>
<td style='padding-left:30px' width='150px'>
<a target='_blank' href='/corp/view/vFD_FinanceSummaryHistory.php?stockid=002024&type=cbsheet1'>货币资金</a>
</td>
<td style='text-align:right;'>24,804,000,000.00</td>
</tr>
<tr>
<td style='padding-left:30px' width='150px'><a target='_blank' href='/corp/view/vFD_FinanceSummaryHistory.php?stockid=002024&type=cbsheet110'>交易性金融资产</a></td>
<td style='text-align:right;'>1,511,750,000.00</td>
</tr>
</tbody>"""
    soup = BeautifulSoup(html)

    # \d+(,\d+)*\.\d+
    # can match:
    # 24,804,000,000.00
    # 1,511,750,000.00
    # 123,750,000.00
    # 123,000.456
    # 23400.456
    # ...
    foundTds = soup.findAll(name="td", attrs={"style": "text-align:right;"}, text=re.compile(r"\d+(,\d+)*\.\d+"))
    # !!! this returns only the strings matching the regex, not the whole td tags
    print "foundTds=", foundTds
    # foundTds= [u'24,804,000,000.00', u'1,511,750,000.00']
    if foundTds:
        for eachMoney in foundTds:
            print "eachMoney=", eachMoney
            # eachMoney= 24,804,000,000.00
            # eachMoney= 1,511,750,000.00

if __name__ == "__main__":
    beautifulsoup_capture_money()
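The values captured above are still text with thousands separators. If they are needed as numbers, one possible follow-up step (not part of the original answer) is to strip the commas and build Decimal values. This is a small Python 3 sketch; parse_money, amounts and total are hypothetical names introduced only for this illustration.

```python
from decimal import Decimal

def parse_money(text):
    """Convert a comma-grouped amount such as '1,511,750,000.00' to a Decimal."""
    # Decimal does not accept thousands separators, so remove them first.
    return Decimal(text.replace(",", ""))

# The two strings that the findAll call in the script above returns.
captured = ["24,804,000,000.00", "1,511,750,000.00"]

amounts = [parse_money(s) for s in captured]
total = sum(amounts)
print(amounts, total)
```

Decimal is chosen here instead of float so that the two fractional digits of a currency amount are kept exactly.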
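【Summary】
BeautifulSoup's findAll also accepts a text argument, which is matched against the string value of each soup node.
Note that the values it returns are not whole HTML tags (here, not the complete td nodes),
but the string values that match text (here, the strings matching re.compile(r"\d+(,\d+)*\.\d+")).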
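For readers on Python 3, a rough equivalent can be sketched with the newer bs4 package. This goes beyond the original post, which uses BeautifulSoup 3: the bs4 import, find_all, the string argument and the choice of html.parser are assumptions, and HTML, MONEY_RE, extract_money and money_values are hypothetical names. One difference to be aware of is that in bs4, combining a tag name with string returns the matching Tag objects rather than bare strings, so the text has to be read from each tag.

```python
import re
from bs4 import BeautifulSoup  # assumption: installed via `pip install beautifulsoup4`

# Shortened copy of the HTML fragment from the question.
HTML = """<tr>
<td width='150px'><strong>报表日期</strong></td>
<td style='text-align:right;'>2013-03-31</td>
</tr>
<tr>
<td style='padding-left:30px' width='150px'>
<a target='_blank' href='/corp/view/vFD_FinanceSummaryHistory.php?stockid=002024&type=cbsheet1'>货币资金</a>
</td>
<td style='text-align:right;'>24,804,000,000.00</td>
</tr>
<tr>
<td style='padding-left:30px' width='150px'><a target='_blank' href='/corp/view/vFD_FinanceSummaryHistory.php?stockid=002024&type=cbsheet110'>交易性金融资产</a></td>
<td style='text-align:right;'>1,511,750,000.00</td>
</tr>"""

MONEY_RE = re.compile(r"\d+(,\d+)*\.\d+")

def extract_money(html):
    """Return the text of right-aligned td cells whose string matches MONEY_RE."""
    soup = BeautifulSoup(html, "html.parser")
    # In bs4 this returns Tag objects: td elements with the given style
    # whose .string matches the regex.
    cells = soup.find_all("td", attrs={"style": "text-align:right;"}, string=MONEY_RE)
    return [cell.get_text(strip=True) for cell in cells]

money_values = extract_money(HTML)
print(money_values)
```

Here the tag and attribute filters do take part in the match, so the date cell is rejected by the regex while cells with a different style are rejected by attrs.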
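Note:
If you are not familiar with BeautifulSoup's findAll function, see my tutorial:
【教程】Python中第三方的用于解析HTML的库:BeautifulSoup
which points to the tutorial on the official BeautifulSoup site, where the signature is given as:
findAll(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)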
Please credit the source when reposting: 在路上 » 【解答】关于BeautifulSoup抓取目标数据的问题