Chapter 6. Notes on Using Regular Expressions in Python

Abstract

This chapter collects lessons learned from using regular expressions in Python.

6.1. Watch out for the vertical bar "|" when searching with the re module

On one occasion, given the string

footerUni = u"分类： | 标签："

(where 分类 means "category" and 标签 means "tags"), I ran:


import re

# Python 2 code. Note the unescaped trailing "|" in the pattern.
foundCatZhcn = re.search(u"分类：(?P<catName>.+)|", footerUni)
print "foundCatZhcn=", foundCatZhcn
if foundCatZhcn:
    print "foundCatZhcn.group(0)=", foundCatZhcn.group(0)
    print "foundCatZhcn.group(1)=", foundCatZhcn.group(1)
    catName = foundCatZhcn.group("catName")
    print "catName=", catName

but the output was:


foundCatZhcn= <_sre.SRE_Match object at 0x027E3C20>
foundCatZhcn.group(0)=
foundCatZhcn.group(1)= None
catName= None

Here group(0) is not the full matched text I expected, and group(1) should have been a single space character rather than None.

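After a long debugging session I finally found the cause: in a regular expression, the vertical bar "|" means alternation ("or"). The Python documentation says: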

“

'|'

A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use \|, or enclose it inside a character class, as in [|].

”
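The left-to-right rule in the quoted passage can be checked with a small sketch. It is written for Python 3 (the snippets in this section are Python 2), and the sample strings are made up purely for illustration.

```python
import re

# Branches are tried left to right; the first one that matches wins,
# even when a later branch would give a longer match.
first_wins = re.search(r"a|ab", "ab")
print(first_wins.group(0))    # a

# Putting the longer branch first changes the result.
longer_first = re.search(r"ab|a", "ab")
print(longer_first.group(0))  # ab

# An empty branch (nothing after the '|') matches the empty string.
empty_branch = re.search(r"x|", "abc")
print(repr(empty_branch.group(0)), empty_branch.span())  # '' (0, 0)
```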

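The pattern above therefore means "either 分类：(?P<catName>.+) or the empty string". The empty alternative matches at any position, so as soon as the first branch fails at the position where the scan starts, re.search returns an empty match with the catName group unset (None). That is the empty result seen here.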

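This also explains why, during testing, foundCatZhcn was never None no matter how I changed the rest of the expression: the empty alternative always matches.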
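A minimal Python 3 sketch of why the match object is never None with this pattern. The input strings are assumptions for illustration. In particular, the second one has a leading newline that is not in the original footerUni, to show the case where the first branch fails at the start of the scan.

```python
import re

# Same pattern as above: the trailing '|' adds an empty alternative.
pattern = re.compile(r"分类：(?P<catName>.+)|")

# Unrelated text: the first branch fails, the empty branch matches at 0.
m = pattern.search("no category here")
print(m is not None, repr(m.group(0)), m.group("catName"))
# True '' None

# Hypothetical input with a leading newline (an assumption, not the
# original string): the scan starts before "分类：", so the empty branch
# is accepted there and the named group stays unset.
m = pattern.search("\n分类： | 标签：")
print(repr(m.group(0)), m.group("catName"))
# '' None
```

If the input begins directly with 分类：, the first branch matches at position 0 instead and the greedy .+ captures the rest of the line, "|" included. Either way, the unescaped bar does not act as the literal delimiter that was intended.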

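The fix is to put a backslash in front of the vertical bar so that it stands for the literal "|" character. The corrected pattern also uses the non-greedy .*? so that the group stops at the first "|":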

foundCatZhcn = re.search(u"分类：(?P<catName>.*?)\|", footerUni)

Only then does the search give the intended result.
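A hedged Python 3 sketch of the fix, showing three equivalent ways to match a literal "|": a backslash, a character class, and re.escape. The helper name extract_category is made up for this example and is not part of the original code.

```python
import re

footer = "分类： | 标签："

# Three ways to say "a literal vertical bar" in a pattern.
patterns = [
    r"分类：(?P<catName>.*?)\|",                 # backslash escape
    r"分类：(?P<catName>.*?)[|]",                # character class
    r"分类：(?P<catName>.*?)" + re.escape("|"),  # re.escape
]


def extract_category(pattern, text):
    """Return the catName group, or None when there is no match."""
    match = re.search(pattern, text)
    return match.group("catName") if match else None


for p in patterns:
    print(repr(extract_category(p, footer)))  # ' ' each time

# With the bar escaped, unrelated text no longer produces a match.
print(extract_category(patterns[0], "no category here"))  # None
```

The patterns are written as raw strings (r"...") so that the backslash reaches the re module unchanged instead of being interpreted as a string escape.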

