The Python manual explains it as follows:
'|' A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use \|, or enclose it inside a character class, as in [|].
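The "never greedy" behaviour described in the manual can be checked with a minimal sketch like the one below. The patterns and test strings are illustrative assumptions, not taken from the manual.

```python
import re

# Alternatives are tried from left to right; the first branch that matches wins,
# even if a later branch would give a longer match.
first_branch = re.match(r"a|ab", "ab").group()

# Putting the longer alternative first changes the result.
longer_first = re.match(r"ab|a", "ab").group()

# A literal vertical bar: either escaped, or inside a character class.
escaped = re.search(r"\|", "x|y").group()
in_class = re.search(r"[|]", "x|y").group()

print(first_branch, longer_first, escaped, in_class)  # a ab | |
```

So when one alternative is a prefix of another, the order in which they are written decides which one is matched.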
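The vertical bar expresses an "or" relationship. Its simple usage is not covered here.
The rest of this post looks only at how the vertical bar '|', i.e. "or", is used in somewhat more complex patterns.
1. Using the vertical bar to match one of several possible strings
The classic example is matching a file name suffix, such as a picture suffix.
A picture suffix is used as the example here.
【Requirement】
Match any one of several picture suffixes, such as jpg, jpeg, gif, png, bmp and so on.
【Code】
After some trial and error, the following code matches correctly: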
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
【Tutorial】Python regular expressions in detail: '|' vertical bar
https://www.crifan.com/detailed_explanation_about_python_regular_express_about_vertical_bar

Version:    2012-11-05
Author:     Crifan
"""

import re

testStrList = [
    "http://www.example.com/picture_name.jpg",
    "http://www.example.com/picture_name.jpeg",
    "http://www.example.com/picture_name.gif",
    "http://www.example.com/picture_name.bmp",
    "http://www.example.com/picture_name.png",
    "http://www.example.com/picture_name.ige",
]

# Part 1: named group
for eachTestStr in testStrList:
    # [...] is a character class, so it matches a single character, not a whole word
    #foundPictureSuffix = re.search(r"http://[^:]+?\.(?P<pictureSuffix>[(jpg)|(jpeg)|(gif)|(png)|(bmp)])", eachTestStr)  # all will match e
    #foundPictureSuffix = re.search(r"http://[^:]+?\.(?P<pictureSuffix>[(jpg)|(jpeg)|(gif)|(png)|(bmp)]{3,4})", eachTestStr)  # also match .ige
    foundPictureSuffix = re.search(r"http://[^:]+?\.(?P<pictureSuffix>(jpg)|(jpeg)|(gif)|(png)|(bmp))", eachTestStr)  # work ok, not match ige
    if foundPictureSuffix:
        print("eachTestStr=%s, pictureSuffix=%s" % (eachTestStr, foundPictureSuffix.group("pictureSuffix")))
    else:
        print("eachTestStr=%s, pictureSuffix=NOT MATCH" % (eachTestStr))

print("-----------------------------------------------------------------------")

# Part 2: unnamed group, read back with group(1)
for eachTestStr in testStrList:
    foundPictureSuffixNoGroupName = re.search(r"http://[^:]+?\.((jpg)|(jpeg)|(gif)|(png)|(bmp))", eachTestStr)  # work ok, not match ige
    # Without the outer group, '|' splits the whole pattern, so group(1) is set only for .jpg
    #foundPictureSuffixNoGroupName = re.search(r"http://[^:]+?\.(jpg)|(jpeg)|(gif)|(png)|(bmp)", eachTestStr)  # only match .jpg
    #foundPictureSuffixNoGroupName = re.search(r"http://[^:]+?(\.jpg)|(\.jpeg)|(\.gif)|(\.png)|(\.bmp)", eachTestStr)  # still only match .jpg
    #foundPictureSuffixNoGroupName = re.search(r"http://[^:]+?[(\.jpg)|(\.jpeg)|(\.gif)|(\.png)|(\.bmp)]", eachTestStr)  # will error: IndexError: no such group
    #foundPictureSuffixNoGroupName = re.search(r"http://[^:]+?([(\.jpg)|(\.jpeg)|(\.gif)|(\.png)|(\.bmp)])", eachTestStr)  # only match .
    if foundPictureSuffixNoGroupName:
        print("eachTestStr=%s, pictureSuffix=%s" % (eachTestStr, foundPictureSuffixNoGroupName.group(1)))
    else:
        print("eachTestStr=%s, pictureSuffix=NOT MATCH" % (eachTestStr))
The commented-out lines are the various attempts I made, each annotated with the output it produces. If you are interested, try them yourself.
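Why the commented-out attempts behave as they do can be seen in a smaller sketch. The shortened patterns and test strings below are simplified assumptions made for illustration. They are not the exact patterns from the script.

```python
import re

# Inside [...] the parentheses and '|' are ordinary characters, so this class
# matches exactly ONE character out of: ( ) | j p g n
char_class_hits = re.findall(r"[(jpg)|(png)]", "jpg|x")
print(char_class_hits)  # ['j', 'p', 'g', '|']

# Without an enclosing group, '|' splits the WHOLE pattern into
#   alternative 1: \.(jpg)      alternative 2: (gif)
m = re.search(r"\.(jpg)|(gif)", "a.gif")
top_level_groups = (m.group(0), m.group(1), m.group(2))
print(top_level_groups)  # ('gif', None, 'gif')
```

In the second case the search does succeed, but through the second alternative, so group(1) stays None. That is why the attempts without an outer group appear to match only .jpg.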
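【Summary】
In short, you need an outermost pair of parentheses to form a group, and inside that group the vertical bars separate the candidate strings. Each candidate can additionally be wrapped in its own parentheses to mark it as a complete unit that must match exactly, although for plain literal strings these inner parentheses are optional.
That is, a form like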
((aaa)|(bbb)|(ccc))
can match any one of several strings.
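As a small sketch of what this form captures, the snippet below compares it with a variant that drops the inner parentheses. The test string "xxbbbxx" is an assumption chosen for illustration.

```python
import re

# Outer group plus one inner group per alternative: four capture groups in total.
nested = re.search(r"((aaa)|(bbb)|(ccc))", "xxbbbxx")
nested_groups = nested.groups()
print(nested_groups)  # ('bbb', None, 'bbb', None)

# Only the outer group: one capture group, same overall match.
flat = re.search(r"(aaa|bbb|ccc)", "xxbbbxx")
flat_groups = flat.groups()
print(flat_groups)  # ('bbb',)
```

In both variants group(1) holds the matched string. The inner parentheses only add extra capture groups, which are None for the alternatives that did not take part in the match.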
Correspondingly, if you want to give the group a name, the form becomes:
(?P<groupName>(aaa)|(bbb)|(ccc))
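One possible way to package the named-group form is a small helper like the sketch below. The names `PICTURE_RE` and `find_picture_suffix` are hypothetical and introduced only for this illustration. The pattern and URLs are the ones used in the script above.

```python
import re

# Hypothetical names for illustration; the pattern is the one from the script above.
PICTURE_RE = re.compile(
    r"http://[^:]+?\.(?P<pictureSuffix>(jpg)|(jpeg)|(gif)|(png)|(bmp))"
)


def find_picture_suffix(url):
    """Return the matched picture suffix, or None if the URL does not match."""
    found = PICTURE_RE.search(url)
    return found.group("pictureSuffix") if found else None


jpeg_suffix = find_picture_suffix("http://www.example.com/picture_name.jpeg")
ige_suffix = find_picture_suffix("http://www.example.com/picture_name.ige")
print(jpeg_suffix, ige_suffix)  # jpeg None
```

The named group is also still reachable by number, so `found.group(1)` would return the same string as `found.group("pictureSuffix")`.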