[Tutorial] Python regular expressions explained: the (?<=…) positive lookbehind assertion

The official explanation in the Python 2.7 manual is:

(?<=...)
Matches if the current position in the string is preceded by a match for ... that ends at the current position. This is called a positive lookbehind assertion. (?<=abc)def will find a match in abcdef, since the lookbehind will back up 3 characters and check if the contained pattern matches. The contained pattern must only match strings of some fixed length, meaning that abc or a|b are allowed, but a* and a{3,4} are not. Note that patterns which start with positive lookbehind assertions will never match at the beginning of the string being searched; you will most likely want to use the search() function rather than the match() function:
>>> import re
>>> m = re.search('(?<=abc)def', 'abcdef')
>>> m.group(0)
'def'
This example looks for a word following a hyphen:
>>> m = re.search('(?<=-)\w+', 'spam-egg')
>>> m.group(0)
'egg'
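
The two restrictions mentioned in the manual text above (fixed-length content, and search() rather than match()) can be checked directly. The following is a minimal sketch, assuming Python 3 syntax; the helper name compiles is hypothetical and exists only for this illustration:

```python
import re

# match() only tries position 0, where nothing precedes the pattern,
# so a pattern that starts with a lookbehind cannot succeed there.
at_start = re.match(r'(?<=abc)def', 'abcdef')

# search() scans forward and succeeds at index 3, right after 'abc'.
found = re.search(r'(?<=abc)def', 'abcdef')


def compiles(pattern):
    """Return True if the re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Fixed-length lookbehind contents are accepted ...
fixed_ok = [compiles(r'(?<=abc)def'), compiles(r'(?<=a|b)c')]
# ... variable-length ones are rejected when the pattern is compiled.
variable_ok = [compiles(r'(?<=a*)b'), compiles(r'(?<=a{3,4})b')]

print(at_start)        # None
print(found.group(0))  # def
print(fixed_ok)        # [True, True]
print(variable_ok)     # [False, False]
```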

The rest of this post explains in detail what (?<=…) means:

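1. Ordinary string matching tests what the string is from the current position onward.

(?<=…), by contrast, tests what comes before the current position.

The syntax itself offers an easy way to remember this:

It is written with <=, which reads as "less than or equal to" and can be split, as a mnemonic, into a less-than sign and an equals sign:

Less-than sign: look backward from the current position;
Equals sign: check whether the preceding content is …

That is why it is called a positive lookbehind assertion (后向匹配 or 后向断言 in Chinese).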
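
The idea of looking back from the current position shows up in what a match reports. In the sketch below (Python 3 syntax assumed; the price string is an invented example) the lookbehind checks the preceding text but contributes nothing to the matched span:

```python
import re

text = 'abcdef'

# Plain pattern: 'abc' is consumed and is part of the match.
plain = re.search(r'abcdef', text)
# Lookbehind: 'abc' is only checked; the match itself starts after it.
behind = re.search(r'(?<=abc)def', text)

print(plain.span(), plain.group(0))    # (0, 6) abcdef
print(behind.span(), behind.group(0))  # (3, 6) def

# Only the numbers directly preceded by '$' are returned,
# and the '$' itself is not included in the results.
prices = re.findall(r'(?<=\$)\d+', 'cost $30 and 40 units, tax $5')
print(prices)  # ['30', '5']
```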

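2. Why have a positive lookbehind assertion?

When searching and matching strings, some of the more complex cases require checking not only what the current content is, but also whether the content before the current position meets a certain condition. Only then can the final decision be made.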

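Anyone familiar with HTML source will find this easy to follow. A typical example is the src attribute inside an img tag, which holds the source address of the image, for example:

<img style="text-align:center;margin:0px auto 10px;zoom:1;display:block" border="0" src="http://1821.img.pp.sohu.com.cn/images/blog/2012/4/12/16/19/u173669005_13766a7cbebg214.jpg">

To search the code of an HTML page in the usual way, you might use

"http://[\w\./]+\.jpg"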

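to match images with the .jpg suffix. That does match them, but there is a problem.

The HTML may also contain, outside of any src value, an address that starts with http and ends with .jpg but was written only to mention an ordinary URL, not to display an image.

For example: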

fake html begin
 
some sohu blog pic url is something like this:
"http://1802.img.pp.sohu.com.cn/images/blog/2012/4/12/16/20/u173669005_13766a912eag214.jpg"
which use img.pp.sohu.com.cn as its image server.
 
<img style="text-align:center;margin:0px auto 10px;zoom:1;display:block" border="0" src="http://1821.img.pp.sohu.com.cn/images/blog/2012/4/12/16/19/u173669005_13766a7cbebg214.jpg">
 
fake html end

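If you still use the pattern above in this situation, you get false matches: .jpg addresses that are not images are matched as well.

To avoid such false matches, you can use a positive lookbehind assertion.

The complete demo code is as follows: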

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
[Tutorial] Python regular expressions explained: the (?<=…) positive lookbehind assertion
https://www.crifan.com/detailed_explanation_about_python_regular_express_positive_lookbehind_assertion/
 
Version:    2012-11-14
Author:     Crifan
"""
 
import re

# Note:
# Related tutorial:
# [Tutorial] Python regular expressions explained: the (?=…) lookahead assertion
# https://www.crifan.com/detailed_explanation_about_python_regular_express_lookahead_assertion
 
reLookbehindTestStr = """
fake html begin
 
some sohu blog pic url is something like this:
"http://1802.img.pp.sohu.com.cn/images/blog/2012/4/12/16/20/u173669005_13766a912eag214.jpg"
which use img.pp.sohu.com.cn as its image server.
 
<img style="text-align:center;margin:0px auto 10px;zoom:1;display:block" border="0" src="http://1821.img.pp.sohu.com.cn/images/blog/2012/4/12/16/19/u173669005_13766a7cbebg214.jpg">
 
fake html end
"""
 
# 1. (?<=...) - positive lookbehind assertion

# An ordinary match like this one also picks up the first .jpg address, which is not an image
foundAllJpgUrl = re.findall(r'"(http://[\w\./]+\.jpg)"', reLookbehindTestStr)
print("foundAllJpgUrl= %s" % foundAllJpgUrl) #foundAllJpgUrl= ['http://1802.img.pp.sohu.com.cn/images/blog/2012/4/12/16/20/u173669005_13766a912eag214.jpg', 'http://1821.img.pp.sohu.com.cn/images/blog/2012/4/12/16/19/u173669005_13766a7cbebg214.jpg']

# With the lookbehind assertion added, only .jpg addresses directly preceded by src= are matched
foundAllJpgUrl_lookbehind = re.findall(r'(?<=src=)"(http://[\w\./]+\.jpg)"', reLookbehindTestStr)
print("foundAllJpgUrl_lookbehind= %s" % foundAllJpgUrl_lookbehind) #foundAllJpgUrl_lookbehind= ['http://1821.img.pp.sohu.com.cn/images/blog/2012/4/12/16/19/u173669005_13766a7cbebg214.jpg']
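
Because the lookbehind is not part of the match, it is also handy for replacement: only the matched URL is rewritten and the src=" prefix is left untouched. The following is a minimal sketch in Python 3 syntax; the sample string, the example.com addresses and the replacement name local.jpg are hypothetical and not part of the demo above:

```python
import re

# Hypothetical sample: one URL quoted in prose, one inside an img src.
html = (
    'see "http://example.com/a/pic_1.jpg" for the address format\n'
    '<img border="0" src="http://example.com/b/pic_2.jpg">\n'
)

# The lookbehind requires src=" right before the URL but does not consume it,
# so re.sub replaces only the URL text itself.
rewritten = re.sub(r'(?<=src=")http://[\w./]+\.jpg', 'local.jpg', html)

print(rewritten)
```

The URL quoted in the prose line stays as it is, since it is not preceded by src=".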

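[Summary]

The positive lookbehind assertion is a somewhat complex regex rule and, generally speaking, is seldom needed.

But when you do need it, it turns out to be very useful: it enables precise matching and avoids picking up extra, falsely matched content.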

When reposting, please credit: 在路上 » [Tutorial] Python regular expressions explained: the (?<=…) positive lookbehind assertion
