【Question】
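Email from rgbsky ([email protected]):
"Hello, Notepad++ is an excellent text editor. While using it, I found that matching Chinese with the regular expression [\u4e00-\u9fa5] seems to have a problem.
For example: I have an ANSI-encoded txt file containing letters, digits, and some Chinese text. [\u4e00-\u9fa5] also matches the ANSI-encoded letters and digits (I am sure these letters and digits each occupy a single byte, and that the double-byte values they form with neighbouring bytes are not in the [\u4e00-\u9fa5] range either). Could you tell me what the reason is? Thanks for your help!"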
【Answer】
1. With reference to:
which mentions:
[11] How to use regular expressions in Notepad++ (tutorial)
[12] Regular Expressions in SciTE
we learn the following:
http://www.scintilla.org/SciTERegEx.html [12]
\xHH: a backslash followed by x and two hexa digits, becomes the character whose Ascii code is equal to these digits. If not followed by two digits, it is ‘x’ char itself.
http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Regular_Expressions#Example_1
Non ASCII characters
\xnn: Specify a single character with code nn. What this stands for depends on the text encoding. For instance, \xE9 may match an é or a θ depending on the code page in an ANSI encoded document.
\x{nnnn}: Like above, but matches a full 16-bit Unicode character. If the document is ANSI encoded, this construct is invalid.
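One possible explanation for the behaviour reported in the question, offered as an assumption and not something the quoted documentation states: if the engine does not recognise \u and treats it as a literal u (much as a malformed \x becomes a plain x), the class [\u4e00-\u9fa5] degrades to the characters u, 4, e, 0, the range 0-u, and 9, f, a, 5. The range 0-u spans 0x30 to 0x75, which covers all digits, all uppercase letters, and lowercase a through u. The Python sketch below emulates that degraded class by writing it without the backslashes; the name matched_chars is illustrative only.

```python
import re

# Assumption: an engine that does not understand \u sees this class instead.
# The "0-u" in the middle is then a range from '0' (0x30) to 'u' (0x75).
degraded = re.compile(r"[u4e00-u9fa5]")


def matched_chars(text):
    """Return every character of text that the degraded class matches."""
    return degraded.findall(text)


sample = "abc XYZ 123 xyz 中文"
# Digits and most ASCII letters match; x, y, z and the Chinese text do not.
print(matched_chars(sample))
```

Under this assumption the class matches ordinary letters and digits but none of the Chinese characters, which would be consistent with the symptom described in the email.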
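So, change your
[\u4e00-\u9fa5]
to:
[\x{4e00}-\x{9fa5}]
and it will match Chinese characters. Note that, according to the documentation quoted above, \x{nnnn} is invalid in an ANSI-encoded document, so the file may first need to be converted to a Unicode encoding such as UTF-8.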
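For comparison outside Notepad++, here is a minimal Python sketch of the same range match. Python's re module does recognise \uXXXX escapes in string patterns, so the original class works there once the text has been decoded. The function name find_chinese is illustrative, and the sketch assumes the ANSI code page is GBK, which is an assumption about the asker's system, not something stated in the email.

```python
import re

# Same range as in the post: U+4E00 through U+9FA5.
CJK = re.compile(r"[\u4e00-\u9fa5]")


def find_chinese(text):
    """Return the characters of text that fall in U+4E00..U+9FA5."""
    return CJK.findall(text)


# Stand-in for the bytes of an ANSI (assumed GBK) text file.
ansi_bytes = "abc123中文test".encode("gbk")

# Decode to Unicode first; the range is defined on code points, not bytes.
text = ansi_bytes.decode("gbk")
print(find_chinese(text))  # ['中', '文']
```

Decoding before matching mirrors the advice above: the range is expressed in Unicode code points, so it only makes sense against text the engine treats as Unicode.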