【Solved】Python: (1) sub after re.compile works but re.sub does not, or (2) replace after re.search works, but neither re.sub directly nor sub after re.compile works

【Problem】

In Python, a string variable dataJsonStr holds the following value:

{"data":{
"blogid":1252395085,
"voteids":0,
"pubtime":1252395085,
"replynum":40,
"category":"xxxxx",
"tag":"xxxx",
"title":"xxxxx",
"effect":136315393,
"effect2":6,
"exblogtype":0,
"sus_flag":false,
"friendrelation":[],
"lp_type":0,
"lp_id":0,
"lp_style":0,
"lp_flag":0,
"orguin":622000169,
"orgblogid":1252395085,
"ip":3415476546,
"mention_uins":[ ],
"attach":[],
"replylist":[{xxxx
},
{
xxx
},
{
xxx
}]
}}

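I found that compiling the pattern with re.compile and then calling sub did replace the target string, but calling re.sub directly with the same pattern did not.

After some experimenting, I ended up with the following code: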

import re

samePattern = r'"replylist":\[.+\]\s*\}\}$';
replacedString = '"replylist":[ ]}}';

# 0. -> match found: foundReplylist= <_sre.SRE_Match object at 0x02E384F0>
#foundReplylist = re.search(samePattern, dataJsonStr, re.S);
#print "foundReplylist=",foundReplylist;

# 1. -> does not work
#dataJsonStr = re.sub(samePattern, replacedString, dataJsonStr, re.S);

# 2. -> does not work
#dataJsonStr = re.compile(samePattern).sub(replacedString, dataJsonStr, re.S);

# 3. -> works
subP = re.compile(samePattern, re.S);
dataJsonStr = subP.sub(replacedString, dataJsonStr);

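【Solution Process】

1. I searched online but only found English-language discussions about how much re.compile affects performance. I could not find anything about the problem I had here: calling sub on a pattern compiled with re.compile works, but calling re.sub directly with the same pattern does not.

2. The Python manual describes re.compile as follows: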

re.compile(pattern[, flags])
Compile a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods, described below.
The expression’s behaviour can be modified by specifying a flags value. Values can be any of the following variables, combined using bitwise OR (the | operator).
The sequence
prog = re.compile(pattern)
result = prog.match(string)
is equivalent to
result = re.match(pattern, string)
but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.
Note
The compiled versions of the most recent patterns passed to re.match(), re.search() or re.compile() are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.

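(1) This seems to say that re.compile only applies to the search and match functions; if that is what it means, then it would not apply to sub.

That is, compiling with re.compile to get subPattern and then calling subPattern.sub should not work, which is exactly the opposite of what I see here.

In my case, sub after re.compile works, while calling re.sub directly does not.

(2) It looks like the two forms are equivalent, at least for match and search. For sub, I still did not know what the effect was.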
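The manual excerpt only names match() and search(), but a compiled pattern object also has a sub() method. The following is a minimal sketch in Python 3 syntax (the post's own snippets are Python 2); the pattern and sample text are made up for illustration and are not from the original problem.

```python
import re

pattern = r"\d+"          # hypothetical pattern: one or more digits
text = "a1b22c333"        # hypothetical sample text

compiled = re.compile(pattern)

# The pattern object offers sub() alongside match() and search().
has_sub = hasattr(compiled, "sub")

# With no flags and no count involved, both spellings give the same result.
via_module = re.sub(pattern, "#", text)
via_object = compiled.sub("#", text)

print(has_sub)
print(via_module)
print(via_object)
```

So the difference seen in this post is unlikely to come from re.compile itself; when the two forms disagree, it is worth checking which arguments each call actually receives.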

3. One thing to note about the input string dataJsonStr from the beginning of this post: it is encoded in GB18030, and chardet.detect reports the following for it:

{'confidence': 0.3094988723644705, 'encoding': 'ISO-8859-2'}

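That is, chardet guesses ISO-8859-2, with a confidence of only about 0.31, so the string seems to contain some ISO-8859-2 encoded parts as well: a rather nasty, mixed-encoding string.

So I wondered whether the complexity of this string's encoding was what made sub after re.compile work while re.sub did not.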

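【Postscript 2012-04-17】

Now for the second case, which I just ran into:

(2) replace after re.search works, but neither re.sub directly nor sub after re.compile works

【Background】

I was parsing this page: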

http://tyjzlcl.blog.sohu.com/197745682.html

I had already obtained soup via BeautifulSoup, and then called

foundContent = soup.find(id="main-content");

and then:

divs = foundContent.findAll("div");

foundContent = foundContent.contents[1];

which gives the content of the post. From that content, the following part needs to be removed:

<div style="FONT-WEIGHT: bold">我的相关日志：</div>
<p>………..</p></div>

so that only the actual body of the post is left.

The problem thus becomes: use re.sub to remove strings of this kind.

So I wrote the following code:

myBlogP = ur'<div style="FONT-WEIGHT: bold">我的相关日志：</div>.+?(?=</div>)';
contentUni = re.sub(myBlogP, "", contentUni, re.I | re.S); # not work here !!!

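But it did not work: the content above was not matched, so nothing was replaced.

【Solution Process】

1. I remembered hitting a similar problem before, so I came back to this post for reference. However, the approach above, re.compile first and then sub, did not work either. The code was: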

myBlogP = ur'<div style="FONT-WEIGHT: bold">我的相关日志：</div>'; 
subP = re.compile(myBlogP); 
print "subP=",subP; 
contentUni = subP.sub("", contentUni, re.I | re.S); # NOT work

2. After a lot of fiddling, I found that if I only remove part of the content, using a pattern with no wildcards in it, for example:

myBlogP = ur'<div style="FONT-WEIGHT: bold">我的相关日志：</div>';
contentUni = re.sub(myBlogP, "", contentUni, re.I | re.S); # can work

then it works.

3. I then tried some other code and found that calling re.search first and then replace works fine:

myBlogP = ur'<div style="FONT-WEIGHT: bold">我的相关日志：</div>'; 
foundMyBlog = re.search(myBlogP, contentUni, re.I | re.S); 
print "foundMyBlog=",foundMyBlog; 
myBlogStr = foundMyBlog.group(0); 
contentUni = contentUni.replace(myBlogStr, "");

4. Finally, after more fiddling, I found the description of re.sub in the Python manual:

re.sub(pattern, repl, string[, count, flags])
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a linefeed, and so forth. Unknown escapes such as \j are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern.
......
The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. If omitted or zero, all occurrences will be replaced. Empty matches for the pattern are replaced only when not adjacent to a previous match, so sub('x*', '-', 'abc') returns '-a-b-c-'.
In addition to character escapes and backreferences as described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn’t ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE.
Changed in version 2.7: Added the optional flags argument.

Then I noticed its explanation of the count parameter:

The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. If omitted or zero, all occurrences will be replaced. Empty matches for the pattern are replaced only when not adjacent to a previous match, so sub('x*', '-', 'abc') returns '-a-b-c-'.

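I took this to mean that count can be omitted, so my use of re.sub was correct and by default every match would be replaced.

But then I noticed that the signature above is written as: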

re.sub(pattern, repl, string[, count, flags])

rather than:

re.sub(pattern, repl, string[[, count], flags])

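This means that, when arguments are passed positionally, flags can only be given together with count: the fourth positional argument is always count, so count cannot simply be skipped. In my call, re.I | re.S was therefore taken as count, and no flags were applied at all.

Later I understood that the sentence "If omitted or zero, all occurrences will be replaced." also covers the case where flags is passed by keyword, flags=xxx, which is what allows count to be omitted, like this: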

contentUni = re.sub(myBlogP, "", contentUni, flags=(re.I | re.S)); # can work

In the end, after more experimenting, I got the following results:

contentUni = re.sub(myBlogP, "", contentUni, re.I | re.S); # not work here !!! 
contentUni = re.sub(myBlogP, "", contentUni, 2, re.I | re.S); # can work 
contentUni = re.sub(myBlogP, "", contentUni, 1, re.I | re.S); # can work 
contentUni = re.sub(myBlogP, "", contentUni, flags=(re.I | re.S)); # can work
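The four results above can be reproduced on a small scale. The sketch below uses Python 3 syntax and a made-up pattern and text (not the blog page from this post); it shows that the fourth positional argument of re.sub is bound to count, so flags passed there never reach the regex engine.

```python
import re
import warnings

text = "<DIV>a</DIV>\n<DIV>b</DIV>"   # hypothetical upper-case markup
pattern = r"<div>.*?</div>"           # lower-case pattern, needs re.I to match

# re.I | re.S is just an integer: 2 | 16 == 18.
flags_value = int(re.I | re.S)

with warnings.catch_warnings():
    # Newer Python 3 releases may warn about positional count/flags.
    warnings.simplefilter("ignore", DeprecationWarning)
    # The fourth positional argument is count: this means count=18, flags=0.
    positional = re.sub(pattern, "", text, re.I | re.S)

# Spelling out what the positional call really meant gives the same result.
explicit_count = re.sub(pattern, "", text, count=flags_value)

# Passing flags by keyword leaves count at its default of 0 (replace all).
keyword = re.sub(pattern, "", text, flags=re.I | re.S)

print(repr(positional))
print(repr(explicit_count))
print(repr(keyword))
```

With the flags swallowed as count, the case-sensitive pattern matches nothing and the text comes back unchanged; only the keyword form removes both elements.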

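Only then did I realize:

A. The earlier problem above was presumably the same thing: the flags were passed in the position of count, so re.sub did not work as intended. I went back to check, and the original code: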

subP = re.compile(replylistP, re.S);
dataJsonStr = subP.sub(replacedReplylist, dataJsonStr);

can in fact be rewritten with re.sub directly, with these results:

dataJsonStr = re.sub(replylistP, replacedReplylist, dataJsonStr, re.S); # NOT work 
dataJsonStr = re.sub(replylistP, replacedReplylist, dataJsonStr, 1, re.S); # work 
dataJsonStr = re.sub(replylistP, replacedReplylist, dataJsonStr, flags=re.S); # work

So the last two forms work as well.
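Why the flag matters for this particular pattern can be seen with a shortened stand-in for dataJsonStr. This is only a sketch in Python 3 syntax: the sample string is invented, and count and flags are passed by keyword, which is the keyword spelling of the `1, re.S` call above.

```python
import re

# Hypothetical, shortened stand-in for dataJsonStr; the list spans several lines.
data_json_str = '{"data":{\n"attach":[],\n"replylist":[{"id":1\n},\n{"id":2\n}]\n}}'
pattern = r'"replylist":\[.+\]\s*\}\}$'
replacement = '"replylist":[ ]}}'

# Without re.S, "." stops at every newline, so the multi-line list never matches.
no_flags = re.sub(pattern, replacement, data_json_str)

# Explicit count plus flags.
with_count = re.sub(pattern, replacement, data_json_str, count=1, flags=re.S)

# Flags by keyword, count left at its default.
with_keyword = re.sub(pattern, replacement, data_json_str, flags=re.S)

print(no_flags == data_json_str)
print(with_keyword)
```

Since the pattern is anchored with `$` and can match only once in this string, count=1 and the default count give the same output.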

B. As for sub after re.compile above:

myBlogP = ur'<div style="FONT-WEIGHT: bold">我的相关日志：</div>'; 
subP = re.compile(myBlogP); 
print "subP=",subP; 
contentUni = subP.sub("", contentUni, re.I | re.S); # NOT work

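It does not work because the flags re.I|re.S were not passed to re.compile but to sub instead, where they are taken as the count argument, so no flags are applied to the match. Written like this, it works: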

myBlogP = ur'<div style="FONT-WEIGHT: bold">我的相关日志：</div>'; 
subP = re.compile(myBlogP, re.I | re.S); 
print "subP=",subP; 
contentUni = subP.sub("", contentUni); # work
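For a compiled pattern the same trap exists in a slightly different form: its sub() method takes (repl, string, count), so a flag passed as the third argument is again read as count. The sketch below uses Python 3 syntax with an invented pattern and content string.

```python
import re

content = "<DIV>x</DIV><DIV>y</DIV><DIV>z</DIV>"   # hypothetical content
pattern = r"<div>\w</div>"

# Compiled without flags: re.I passed to sub() is only a count of 2.
case_sensitive = re.compile(pattern)
unchanged = case_sensitive.sub("", content, re.I)

# Flags belong in re.compile(); sub() then needs no extra argument.
case_insensitive = re.compile(pattern, re.I)
all_removed = case_insensitive.sub("", content)

# Even on a matching pattern, the stray flag caps the replacements at 2.
capped = case_insensitive.sub("", content, re.I)

print(repr(unchanged))
print(repr(all_removed))
print(repr(capped))
```

The last call shows a second, quieter effect: when the pattern does match, a flag used as count limits how many occurrences get replaced.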

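【Summary】

All of the cases above share one root cause: when calling re.sub with flags, such as re.I and re.S here, remember to do one of the following:

(1) Omit count the correct way, by passing flags as a keyword argument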

replacedStr = re.sub(replacePattern, replacedPartStr, originalStr, flags=re.I); # can omit count parameter

(2) Specify a suitable count value, with flags after it

replacedStr = re.sub(replacePattern, replacedPartStr, originalStr, 1, re.I); # must designate count parameter

This avoids the situation where a flags value is given without count, gets interpreted as count, and re.sub silently fails to do what was intended.

【Postscript 2012-12-15】

Later, I ran into a similar problem once more.

With:

    toAddHead = u"""
jlka人生的道路虽然漫长")
126但紧要处常常只有几步")
_（走错一步或走对一步）")
足以影响人生的一个时期甚至是一生")
""";
    addedHead = re.sub(u"^(?P<wholeLine>.+?)$", u'say("\g<wholeLine>', toAddHead, re.M);
    print "addedHead=",addedHead.encode("GBK", "ignore");

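it failed to prepend say(" to each line.

After fiddling for a long time, I finally remembered this very post that I had written earlier, and changed the code to: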

    toAddHead = u"""
jlka人生的道路虽然漫长")
126但紧要处常常只有几步")
_（走错一步或走对一步）")
足以影响人生的一个时期甚至是一生")
""";
    addedHead = re.sub(u"^(?P<wholeLine>.+?)$", u'say("\g<wholeLine>', toAddHead, flags=re.M);
    print "addedHead=",addedHead.encode("GBK", "ignore");

and then it worked; the prefix was added to every line:

addedHead=
say("jlka人生的道路虽然漫长")
say("126但紧要处常常只有几步")
say("_（走错一步或走对一步）")
say("足以影响人生的一个时期甚至是一生")

So this is a point that still needs attention.
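The same behaviour can be checked with a short Python 3 sketch. The two sample lines are placeholders rather than the text above, and the failing call is written as count=int(re.M), which is what passing re.M as the fourth positional argument amounts to.

```python
import re

lines = '\nfirst line")\nsecond line")\n'   # hypothetical input, leading newline as in the post
pattern = r"^(?P<wholeLine>.+?)$"
template = r'say("\g<wholeLine>'

# re.M (8) used as count, flags stay 0: "^" only matches at the very start,
# where the next character is a newline, so nothing matches.
as_count = re.sub(pattern, template, lines, count=int(re.M))

# Keyword form: MULTILINE lets "^" and "$" match at every line boundary.
as_flags = re.sub(pattern, template, lines, flags=re.M)

print(repr(as_count))
print(as_flags)
```

Because `.+?` needs at least one character, the empty first line is left alone and only the non-empty lines get the prefix.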

Please credit the source when reposting: 在路上 » 【Solved】Python: (1) sub after re.compile works but re.sub does not, or (2) replace after re.search works, but neither re.sub directly nor sub after re.compile works
