
【Research】Greedy matching in Java and Python regular expressions

RegularExpression · crifan

【Background】

Earlier, in:

【Record】Using Java regex on Android to find and replace the parameters in a macro definition

I used the following Java code:

/**
 * @author Crifan Li
 * 
 * @function test java regex look ahead
 * 【Research】Greedy matching in Java and Python regular expressions
 * https://www.crifan.com/research_java_python_regex_greedy_match
 * @version 2013-07-28
 *
 */

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexLookAheadTest {

	/**
	 * @param args
	 */
	public static void main(String[] args) {
		// TODO Auto-generated method stub

		//String testString = "#define defineMacro(a,b)  a+defineValue(a,b, \"string valur\")";
		String testString = "a+defineValue(a,b, \"string value\")";
		processDefineContent(testString);
	}
	public static String processDefineContent(String defineContent)
	{
	       //use regex to process it
	       String processedDefineContent = defineContent; //_get_dev_var_value((d),(e),METHODID(f), "inside_id_should_not_match")

	       Pattern idP = Pattern.compile("((?<id>[_a-zA-Z]\\w*)(?!\\())|(?<str>\"[^\"]+?\")"); //auto omit "someFunc(xxx" type id
	        
	       Matcher foundId = idP.matcher(processedDefineContent);
	        
	       // Find all matches
	       while (foundId.find()) {
	         // Get the matching string
	         int matchedIdStartPos = foundId.start(0);
	         int matchedIdEndPos = foundId.end(0);
	 
	         String strMatchedId = foundId.group("id");

	         if(null != strMatchedId)
	         {
	             System.out.println("Is ID: [" + matchedIdStartPos + "-" + matchedIdEndPos + "]=" + strMatchedId);
	         }
	         else
	         {
	             // start(0)/end(0) were already read above and apply to either alternative
	             String strMatchedStr = foundId.group("str");
	             System.out.println("Is String: [" + matchedIdStartPos + "-" + matchedIdEndPos + "]=" + strMatchedStr);
	         }
	          
	         /*
				Is ID: [0-1]=a
				Is ID: [2-12]=defineValu
				Is ID: [14-15]=a
				Is ID: [16-17]=b
				Is String: [19-33]="string value"
	          */
	       }
	        
	       return processedDefineContent;
	}
}

The output was:

Is ID: [0-1]=a

Is ID: [2-12]=defineValu

Is ID: [14-15]=a

Is ID: [16-17]=b

Is String: [19-33]="string value"

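Clearly, this line is wrong:

Is ID: [2-12]=defineValu

The intent was to skip defineValue entirely, because it is followed by "(". Instead the regex matched

defineValu

and left out

e(

The cause is backtracking: the greedy \w* first consumes all of defineValue, the negative lookahead (?!\() then fails because the next character is "(", so \w* gives back one character. Now the next character is "e", the lookahead succeeds, and defineValu is reported as a match.

So I set out to find how to solve this kind of problem.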
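To make the backtracking concrete, here is a minimal Python 3 sketch that replays the steps by hand. It is not how the regex engine is implemented, only an illustration; the helper name simulate_backtracking is made up for this example, and it uses str.isalnum() as a rough stand-in for \w.

```python
def simulate_backtracking(text):
    """Replay [_a-zA-Z]\\w*(?!\\() by hand, anchored at the start of text.

    Returns the text that would be matched, or None if no length works.
    """
    # Greedy step: \w* first consumes the whole leading word.
    end = 0
    while end < len(text) and (text[end].isalnum() or text[end] == "_"):
        end += 1

    # Backtracking step: give back one character at a time until the
    # negative lookahead (?!\() is satisfied. At least one character
    # must remain, because [_a-zA-Z] has to match something.
    while end > 0:
        next_char = text[end] if end < len(text) else ""
        if next_char != "(":
            return text[:end]
        end -= 1
    return None


print(simulate_backtracking("defineValue(a,b)"))  # defineValu
print(simulate_backtracking("abc,"))              # abc
```

The first call shows the unwanted result from the Java output: after giving back a single character, the next character is "e", so the lookahead no longer objects.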

【Investigation】

1. Then I thought of \b, the word boundary, and changed the Java code to:

/**
 * @author Crifan Li
 * 
 * @function test java regex look ahead
 * 【Research】Greedy matching in Java and Python regular expressions
 * https://www.crifan.com/research_java_python_regex_greedy_match
 * @version 2013-07-28
 *
 */

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexLookAheadTest {

	/**
	 * @param args
	 */
	public static void main(String[] args) {
		// TODO Auto-generated method stub

		//String testString = "#define defineMacro(a,b)  a+defineValue(a,b, \"string valur\")";
		String testString = "a+defineValue(a,b, \"string value\")";
		processDefineContent(testString);
	}
	public static String processDefineContent(String defineContent)
	{
	       //use regex to process it
	       String processedDefineContent = defineContent; //_get_dev_var_value((d),(e),METHODID(f), "inside_id_should_not_match")

	       Pattern idP = Pattern.compile("((?<id>[_a-zA-Z]\\w*)\\b(?!\\())|(?<str>\"[^\"]+?\")"); //auto omit "someFunc(xxx" type id
	        
	       Matcher foundId = idP.matcher(processedDefineContent);
	        
	       // Find all matches
	       while (foundId.find()) {
	         // Get the matching string
	         int matchedIdStartPos = foundId.start(0);
	         int matchedIdEndPos = foundId.end(0);
	 
	         String strMatchedId = foundId.group("id");

	         if(null != strMatchedId)
	         {
	             System.out.println("Is ID: [" + matchedIdStartPos + "-" + matchedIdEndPos + "]=" + strMatchedId);
	         }
	         else
	         {
	             // start(0)/end(0) were already read above and apply to either alternative
	             String strMatchedStr = foundId.group("str");
	             System.out.println("Is String: [" + matchedIdStartPos + "-" + matchedIdEndPos + "]=" + strMatchedStr);
	         }
	       }
	        
	       return processedDefineContent;
	}
}

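This gives the desired effect:

defineValue is skipped automatically, because after backtracking to defineValu the position before "e" is not a word boundary, so \b rejects it.

The output is: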

Is ID: [0-1]=a

Is ID: [14-15]=a

Is ID: [16-17]=b

Is String: [19-33]="string value"
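As a rough check of why \b helps, the following Python 3 sketch lists every position in a shortened test string where \b matches. The variable names are only illustrative, and the pattern is written as a raw string so that the regex engine really receives \b.

```python
import re

text = "a+defineValue(a,"

# Collect every offset at which the empty-width \b assertion succeeds.
boundaries = [m.start() for m in re.finditer(r"\b", text)]

print(boundaries)        # [0, 1, 2, 13, 14, 15]
print(12 in boundaries)  # False: offset 12 lies between "u" and "e"
print(13 in boundaries)  # True: offset 13 lies between "e" and "("
```

Offset 12 is where the backtracked match defineValu ends, and it is not a boundary. The only boundary at the end of the word is offset 13, where the lookahead (?!\() fails, so the whole identifier is rejected.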

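2. Then I wondered whether the earlier Java regex (the one without \b):

((?<id>[_a-zA-Z]\\w*)(?!\\())|(?<str>\"[^\"]+?\")

when moved to Python, i.e.:

((?P<id>[_a-zA-Z]\w*)(?!\())|(?P<str>"[^"]+?")

would give the intended behavior on its own and skip defineValue automatically.

So I tried it. The result: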

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Function:
【Research】Greedy matching in Java and Python regular expressions
https://www.crifan.com/research_java_python_regex_greedy_match

Version:    2013-07-28
Author:     Crifan Li
Contact:    https://www.crifan.com/contact_me/
"""

import re;

def re_greey_id():
    testString = 'a+defineValue(a,b, "string value")';
    findAllId = re.findall('((?P<id>[_a-zA-Z]\w*)(?!\())|(?P<str>"[^"]+?")', testString);
    print "findAllId=",findAllId; #findAllId= [('a', 'a', ''), ('defineValu', 'defineValu', ''), ('a', 'a', ''), ('b', 'b', ''), ('', '', '"string value"')]    
if __name__ == '__main__':
    re_greey_id()

The result is the same as in Java: defineValue is still not skipped.

3. Then I added \b, and the result was:

    testString = 'a+defineValue(a,b, "string value")';
    findAllId = re.findall('((?P<id>[_a-zA-Z]\w*)\b(?!\())|(?P<str>"[^"]+?")', testString);
    print "findAllId=",findAllId; #findAllId= [('', '', '"string value"')]

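That is, defineValue was now skipped, but the legitimate ids a, a, and b were skipped too, so the result was even further from what I wanted.

4. I then suspected that findall was the problem.

So I tried re.search. The result: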

import re;

def re_greey_id():
    testString = 'defineValue(a,';
    findSingleId = re.search('((?P<id>[_a-zA-Z]\w*)\b(?!\())|(?P<str>"[^"]+?")', testString);
    print "findSingleId=",findSingleId; #findSingleId= None

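This is clearly wrong too: it should have found a, not returned None.

It seemed that \b in Python was not behaving as expected.

5. I tried re.search again, this time without \b, and got a result similar to the earlier re.findall one: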

def re_greey_id():
    testString = 'defineValue(a,';
    findSingleId = re.search('(?P<id>[_a-zA-Z]\w*)(?!\()', testString);
    print "findSingleId.group('id')=",findSingleId.group('id'); #findSingleId.group('id')= defineValu

6. Thinking the id might be too short, I changed a to abc and put \b back, but it still failed:

    testString = 'defineValue(abc,';
    findSingleId = re.search('(?P<id>[_a-zA-Z]\w*)\b(?!\()', testString);
    print "findSingleId=",findSingleId; #findSingleId= None

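7. At first I thought that \b perhaps could not handle the left parenthesis '('.

But then I reread the description of \b in the Python 2.7.5 documentation: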

\b

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string, so the precise set of characters deemed to be alphanumeric depends on the values of the UNICODE and LOCALE flags. For example, r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'. Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.

So parentheses are clearly handled correctly.

8. Then I noticed:

Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.

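That sentence is about character classes, but it pointed me to the real cause: in an ordinary (non-raw) Python string literal, \b is the backspace escape, so the regex engine never receives a word-boundary \b. I tried again with a raw string (r prefix), and the result was correct: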

    testString = 'defineValue(a,';
    findSingleId = re.search(r'(?P<id>[_a-zA-Z]\w*)\b(?!\()', testString);
    print "findSingleId.group('id')=",findSingleId.group('id'); #findSingleId.group('id')= a
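The difference between the two spellings can be checked directly. This is a small Python 3 sketch (the earlier snippets are Python 2, but string escapes behave the same way here); the variable names are only for illustration.

```python
import re

plain = '\b'     # Python processes the escape: one backspace character (0x08)
raw = r'\b'      # raw string: backslash + "b", which re reads as a word boundary
doubled = '\\b'  # escaped backslash: the same two characters as the raw string

print(len(plain), plain == '\x08')  # 1 True
print(len(raw), raw == doubled)     # 2 True

# With the non-raw pattern, re looks for a literal backspace after "abc".
no_match = re.search('abc\b', 'abc,')
backspace_match = re.search('abc\b', 'abc\x08')
boundary_match = re.search(r'abc\b', 'abc,')

print(no_match)                     # None
print(backspace_match is not None)  # True
print(boundary_match.group(0))      # abc
```

This also accounts for steps 3, 4, and 6: the test strings contain no backspace character, so every id alternative failed and only the quoted string could match.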

9. Going back to re.findall, now with a raw string, the result was correct as well:

    testString = 'a+defineValue(a,b, "string value")';
    findAllId = re.findall(r'(\b([_a-zA-Z]\w*)\b(?!\())|("[^"]+?")', testString);
    print "findAllId=",findAllId; #findAllId= [('a', 'a', ''), ('a', 'a', ''), ('b', 'b', ''), ('', '', '"string value"')]

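That is, defineValue is skipped automatically, and the ids and the string are all captured correctly.

10. I also tried a doubled backslash (\\b) in a non-raw Python string to see what happens: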

    testString = 'a+defineValue(a,b, "string value")';
    findAllId = re.findall('(([_a-zA-Z]\w*)\\b(?!\())|("[^"]+?")', testString);
    print "findAllId=",findAllId; #findAllId= [('a', 'a', ''), ('a', 'a', ''), ('b', 'b', ''), ('', '', '"string value"')]

Clearly, this works too.

 

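【Summary】

Until now I had rarely used \b (word boundary) in regular expressions, in either Java or Python.

That is why \b did not occur to me when I first ran into the problem above.

Once I used \b, Java worked right away with \\b.

In Python, however, I had overlooked that in a regex written as an ordinary string literal, \b means backspace.

So to use \b as a word boundary in a Python regex, either add the r prefix to the string (a raw string) or write \b itself as \\b.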
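Putting the pieces together, a Python 3 counterpart of the Java test program might look like the sketch below. The names TOKEN_RE and find_ids_and_strings are invented for this example, and the pattern is one possible form, assuming a raw string, named groups, and \b on both sides of the identifier as in step 9.

```python
import re

# Raw string, so \b reaches the regex engine as a word boundary.
# "id" skips names followed by "("; "str" matches a double-quoted string.
TOKEN_RE = re.compile(r'(?P<id>\b[_a-zA-Z]\w*\b(?!\())|(?P<str>"[^"]+?")')


def find_ids_and_strings(text):
    """Return (kind, start, end, value) for every id or string found in text."""
    results = []
    for m in TOKEN_RE.finditer(text):
        kind = m.lastgroup  # "id" or "str", whichever alternative matched
        results.append((kind, m.start(), m.end(), m.group(kind)))
    return results


if __name__ == "__main__":
    test_string = 'a+defineValue(a,b, "string value")'
    for kind, start, end, value in find_ids_and_strings(test_string):
        label = "Is ID" if kind == "id" else "Is String"
        print("%s: [%d-%d]=%s" % (label, start, end, value))
    # Is ID: [0-1]=a
    # Is ID: [14-15]=a
    # Is ID: [16-17]=b
    # Is String: [19-33]="string value"
```

The offsets agree with the output of the corrected Java version. The words inside the quoted string are not reported as ids, because the scan reaches the opening quote first and the str alternative consumes the whole string.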
