[Tutorial] Python regular expressions in detail: '|' vertical bar

Python re crifan

The Python manual explains it as follows:

'|'

A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use \|, or enclose it inside a character class, as in [|].
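
The two points in the manual text that are easiest to overlook are the left-to-right order of the branches and the ways to match a literal '|'. A minimal sketch of both, using only the standard re module (the sample strings and variable names are made up for illustration):

```python
import re

# Branches are tried left to right; the first branch that matches is accepted,
# even when a later branch would give a longer match.
first_wins = re.match(r"a|ab", "ab").group()
reordered = re.match(r"ab|a", "ab").group()

# Two ways to match a literal vertical bar.
escaped = re.findall(r"\|", "x|y|z")
in_class = re.findall(r"[|]", "x|y|z")

print(first_wins)  # a
print(reordered)   # ab
print(escaped)     # ['|', '|']
print(in_class)    # ['|', '|']
```

When one alternative is a prefix of another, putting the longer one first is the usual way to get the longer match.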

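The vertical bar expresses an "or" relationship. Its simple uses are not explained further here.

What follows looks only at how the vertical bar '|', i.e. "or", is used in somewhat more complex patterns.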


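1. Using the vertical bar to match one of several possible strings

The classic examples are matching file name suffixes and image suffixes.

Image suffixes are used as the example here.

[Requirement]

Match any one of several image suffixes, such as jpg, jpeg, gif, png, bmp and so on.

[Code]

After some experimenting, the following code matches as expected: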

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
[Tutorial] Python regular expressions in detail: '|' vertical bar
https://www.crifan.com/detailed_explanation_about_python_regular_express_about_vertical_bar

Version:    2012-11-05
Author:     Crifan
"""

import re

testStrList = [
    "http://www.example.com/picture_name.jpg",
    "http://www.example.com/picture_name.jpeg",
    "http://www.example.com/picture_name.gif",
    "http://www.example.com/picture_name.bmp",
    "http://www.example.com/picture_name.png",
    "http://www.example.com/picture_name.ige",
]

# Patterns are raw strings so that "\." reaches the re module unchanged.
for eachTestStr in testStrList:
    # [...] is a character class, so it matches ONE character: every URL matches "e" (right after "www.")
    #foundPictureSuffix = re.search(r"http://[^:]+?\.(?P<pictureSuffix>[(jpg)|(jpeg)|(gif)|(png)|(bmp)])", eachTestStr)
    # 3 or 4 characters from the class: also matches .ige
    #foundPictureSuffix = re.search(r"http://[^:]+?\.(?P<pictureSuffix>[(jpg)|(jpeg)|(gif)|(png)|(bmp)]{3,4})", eachTestStr)
    # works: one outer group containing the alternatives, does not match .ige
    foundPictureSuffix = re.search(r"http://[^:]+?\.(?P<pictureSuffix>(jpg)|(jpeg)|(gif)|(png)|(bmp))", eachTestStr)

    if foundPictureSuffix:
        print("eachTestStr=%s, pictureSuffix=%s" % (eachTestStr, foundPictureSuffix.group("pictureSuffix")))
    else:
        print("eachTestStr=%s, pictureSuffix=NOT MATCH" % (eachTestStr))

print("-----------------------------------------------------------------------")

for eachTestStr in testStrList:
    # works: same pattern with an unnamed outer group, does not match .ige
    foundPictureSuffixNoGroupName = re.search(r"http://[^:]+?\.((jpg)|(jpeg)|(gif)|(png)|(bmp))", eachTestStr)
    # no outer group: group(1) is set only for .jpg; other suffixes match a later branch, so group(1) is None
    #foundPictureSuffixNoGroupName = re.search(r"http://[^:]+?\.(jpg)|(jpeg)|(gif)|(png)|(bmp)", eachTestStr)
    # same problem: group(1) is ".jpg" only for the .jpg URL
    #foundPictureSuffixNoGroupName = re.search(r"http://[^:]+?(\.jpg)|(\.jpeg)|(\.gif)|(\.png)|(\.bmp)", eachTestStr)
    # parentheses inside [...] are literal characters, so there is no group 1: IndexError: no such group
    #foundPictureSuffixNoGroupName = re.search(r"http://[^:]+?[(\.jpg)|(\.jpeg)|(\.gif)|(\.png)|(\.bmp)]", eachTestStr)
    # group 1 is a single character from the class: only matches "."
    #foundPictureSuffixNoGroupName = re.search(r"http://[^:]+?([(\.jpg)|(\.jpeg)|(\.gif)|(\.png)|(\.bmp)])", eachTestStr)

    if foundPictureSuffixNoGroupName:
        print("eachTestStr=%s, pictureSuffix=%s" % (eachTestStr, foundPictureSuffixNoGroupName.group(1)))
    else:
        print("eachTestStr=%s, pictureSuffix=NOT MATCH" % (eachTestStr))

The commented-out lines are the various attempts; the output each one produces is noted in its comment. If you are interested, try them yourself.
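
As an illustration of one of those attempts, the pattern without an outer group can be run on its own to see why it only yields "jpg". This is a minimal sketch that reuses two of the test URLs; the variable names are illustrative and not part of the original script:

```python
import re

url_jpg = "http://www.example.com/picture_name.jpg"
url_png = "http://www.example.com/picture_name.png"

# Without an outer group, '|' splits the WHOLE pattern into five branches:
#   http://[^:]+?\.(jpg)   |   (jpeg)   |   (gif)   |   (png)   |   (bmp)
ungrouped = re.compile(r"http://[^:]+?\.(jpg)|(jpeg)|(gif)|(png)|(bmp)")

m_jpg = ungrouped.search(url_jpg)
m_png = ungrouped.search(url_png)

# First branch matches the whole URL and fills group 1.
print(m_jpg.group(0), m_jpg.group(1))

# Only the bare "(png)" branch matches, so the match is just "png",
# group 1 is None, and the suffix sits in group 4.
print(m_png.group(0), m_png.group(1), m_png.group(4))
```

So the .png URL does produce a match object, but the "http://..." prefix is not part of that match and group(1) is empty, which is why this form is not usable for the requirement.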

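[Summary]

In short, the alternatives need one pair of parentheses around all of them, forming a group. Inside that group, vertical bars separate the possible strings, and each string can also be wrapped in its own parentheses so that it is treated as a complete unit that must match exactly. The inner parentheses are optional: they only add extra capture groups, and (aaa|bbb|ccc) matches the same strings.

That is, a form like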

((aaa)|(bbb)|(ccc))

matches any one of several strings.
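
To see which capture groups this form creates, here is a small sketch comparing it with a variant that drops the inner parentheses (the sample text "xx bbb yy" is made up for illustration):

```python
import re

nested = re.compile(r"((aaa)|(bbb)|(ccc))")  # outer group plus one group per alternative
flat = re.compile(r"(aaa|bbb|ccc)")          # outer group only

nested_groups = nested.search("xx bbb yy").groups()
flat_groups = flat.search("xx bbb yy").groups()

# Group 1 is the outer group; groups 2 to 4 are the alternatives,
# and the ones that did not take part in the match are None.
print(nested_groups)  # ('bbb', None, 'bbb', None)
print(flat_groups)    # ('bbb',)
```

In both cases group(1) holds the alternative that matched, which is what the script above reads.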

Correspondingly, to give the group a name, the form becomes:

(?P<groupName>(aaa)|(bbb)|(ccc))
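
A minimal sketch of the named form wrapped in a helper. The function name find_picture_suffix and the constant SUFFIX_RE are hypothetical, and the inner parentheses are left out here on the assumption that only the named group is needed:

```python
import re

# Hypothetical names; the pattern follows the named-group form shown above,
# applied to the image suffixes from the requirement.
SUFFIX_RE = re.compile(r"http://[^:]+?\.(?P<pictureSuffix>jpg|jpeg|gif|png|bmp)")

def find_picture_suffix(url):
    """Return the image suffix found in url, or None if no known suffix matches."""
    match = SUFFIX_RE.search(url)
    return match.group("pictureSuffix") if match else None

print(find_picture_suffix("http://www.example.com/picture_name.jpeg"))  # jpeg
print(find_picture_suffix("http://www.example.com/picture_name.ige"))   # None
```

Reading the result by name keeps the calling code unchanged if more groups are later added to the pattern.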
