
A 30-minute tutorial for beginners on regular expressions

Author: 云罗郡主 (original)
Published 2019-01-25 13:53:19, 2516 views

What exactly is a regular expression? This article from the PHP Chinese website introduces regular expressions.

When writing programs or web pages that process strings, you often need to find strings that match some complex rule. Regular expressions are a tool for describing such rules. In other words, a regular expression is a piece of code that records a text rule.

You have very likely used the wildcards for file search under Windows/DOS, namely * and ?. To find all Word documents in a directory, you would search for *.doc, where * is interpreted as an arbitrary string. Like wildcards, regular expressions are a tool for matching text, but they can describe what you need more precisely - at the cost of being more complicated, of course. For example, you can write a regular expression that finds all strings starting with 0, followed by 2-3 digits, then a hyphen "-", and finally 7 or 8 digits.

Getting Started

The best way to learn regular expressions is to start with examples. After understanding the examples, you can modify and experiment with them yourself. A number of simple examples are given below, and they are explained in detail.

Suppose you are searching for hi in an English novel. You can use the regular expression hi.

This is almost the simplest regular expression there is. It matches exactly this string: two characters, the first being h and the second being i. Tools that process regular expressions usually provide an option to ignore case. If this option is selected, the expression matches any of the four variants hi, HI, Hi, and hI.

Unfortunately, many words contain the two consecutive characters hi, such as him, history, high, etc. If you use hi to search, the hi here will also be found. If we want to find the word hi accurately, we should use \bhi\b.

\b is a special code defined by regular expressions (some people call it a metacharacter) that represents the beginning or end of a word, that is, a word boundary. Although English words are usually separated by spaces, punctuation marks, or newlines, \b does not match any of these separator characters; it only matches a position.

If what you are looking for is hi followed by a Lucy not far away, you should use \bhi\b.*\bLucy\b.

Here, . is another metacharacter, matching any character except newline characters. * is also a metacharacter, but it represents not a character, nor a position, but a quantity - it specifies that the content before * can be repeatedly used any number of times to make the entire expression match. Therefore, .* together means any number of characters that do not include a newline. Now the meaning of \bhi\b.*\bLucy\b is obvious: first the word hi, then any number of characters (but not newlines), and finally the word Lucy.
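The difference between hi, \bhi\b, and \bhi\b.*\bLucy\b can be tried out in a few lines. This is a minimal sketch using Python's re module, which is an assumption of this example (the tutorial itself targets .Net); the sample sentence is made up for illustration.

```python
import re

# Made-up sample text for illustration.
text = "hi, this is him. Say hi to my friend Lucy for me."

# A plain "hi" also hits the "hi" inside "this" and "him".
substrings = re.findall(r"hi", text)

# \bhi\b matches "hi" only as a whole word.
whole_words = re.findall(r"\bhi\b", text)

# The word hi, then any characters except newline, then the word Lucy.
hi_then_lucy = re.search(r"\bhi\b.*\bLucy\b", text)

print(len(substrings), len(whole_words))
print(hi_then_lucy.group(0))
```

Because .* is greedy and stops only where \bLucy\b can still match, the last search returns everything from the first hi up to and including Lucy.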

If other metacharacters are used at the same time, we can construct more powerful regular expressions. For example, the following example:

0\d\d-\d\d\d\d\d\d\d\d matches a string that starts with 0, followed by two digits, then a hyphen "-", and finally 8 digits (that is, a Chinese phone number; of course, this example only matches numbers with a 3-digit area code).

\d here is a new metacharacter, matching one digit (0, or 1, or 2, or...). - is not a metacharacter, it only matches itself - the hyphen (or minus sign, or dash, or whatever you want to call it).

In order to avoid so many annoying repetitions, we can also write this expression like this: 0\d{2}-\d{8}. The {2}({8}) after \d here means that the previous \d must be repeated and matched 2 times (8 times) in a row.
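As a quick check that the long and the short form describe the same strings, here is a small sketch in Python (an assumption of this example; the sample numbers are invented). It uses fullmatch so that the whole sample has to fit the pattern.

```python
import re

long_form = re.compile(r"0\d\d-\d\d\d\d\d\d\d\d")
short_form = re.compile(r"0\d{2}-\d{8}")

# Invented samples: one valid, one with 7 local digits, one not starting with 0.
samples = ["010-12345678", "010-1234567", "110-12345678"]

results = [
    (s, bool(long_form.fullmatch(s)), bool(short_form.fullmatch(s)))
    for s in samples
]

for row in results:
    print(row)
```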

Testing regular expressions

If you don’t find regular expressions difficult to read and write, either you are a genius, or you are not from Earth. The syntax of regular expressions can be confusing, even for people who use it regularly. Because it is difficult to read and write and prone to errors, it is necessary to find a tool to test regular expressions.

Some details of regular expressions differ between environments. This tutorial describes the behavior of regular expressions under Microsoft .Net Framework 4.5, so I recommend Regester, a .Net tool I wrote. Please refer to the instructions on its page to install and run the software.

[Screenshot: Regester running]

Metacharacters

You already know a few useful metacharacters, such as \b, ., *, and \d. Regular expressions have more. For example, \s matches any whitespace character, including space, tab, newline, Chinese full-width space, and so on. \w matches a letter, digit, underscore, or Chinese character.

Let’s look at more examples:

\ba\w*\b matches words starting with the letter a: first the beginning of a word (\b), then the letter a, then any number of letters or digits (\w*), and finally the end of the word (\b).

\d+ matches 1 or more consecutive digits. Here + is a metacharacter similar to *. The difference is that * matches any number of repetitions (possibly 0), while + matches 1 or more repetitions.

\b\w{6}\b Matches words of exactly 6 characters.
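The three examples above can be run against some sample text. This is a sketch in Python (an assumption of this example); the sentences are invented.

```python
import re

# Words starting with the letter a.
a_words = re.findall(r"\ba\w*\b", "an apple a day keeps anger away")

# Runs of one or more digits.
numbers = re.findall(r"\d+", "room 101, floor 7")

# Words of exactly 6 characters.
six_letter = re.findall(r"\b\w{6}\b", "simple words remain hidden")

print(a_words)
print(numbers)
print(six_letter)
```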


The metacharacters ^ (the symbol on the same key position as the number 6) and $ both match a position, which is somewhat similar to \b. ^ matches the beginning of the string you are looking for, and $ matches the end. These two codes are very useful when verifying the input content. For example, if a website requires that the QQ number you fill in must be 5 to 12 digits, you can use: ^\d{5,12}$.

{5,12} here is similar to the {2} introduced before, except that {2} means exactly two repetitions, no more and no less, while {5,12} means at least 5 and at most 12 repetitions; otherwise there is no match.

Because ^ and $ are used, the entire input string must match \d{5,12}, which means the whole input must consist of 5 to 12 digits. So if the QQ number you enter matches this regular expression, it meets the requirement.
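A minimal validation sketch in Python (an assumption of this example; the helper name is_valid_qq is hypothetical) shows what the anchors add. Without ^ and $, the same \d{5,12} would happily match a run of digits in the middle of other text.

```python
import re

QQ_PATTERN = re.compile(r"^\d{5,12}$")


def is_valid_qq(value: str) -> bool:
    """Return True if value consists of 5 to 12 digits and nothing else."""
    return QQ_PATTERN.search(value) is not None


# Without anchors the pattern matches digits embedded in other text.
unanchored = re.search(r"\d{5,12}", "abc1234567xyz")

print(is_valid_qq("12345"), is_valid_qq("1234"), is_valid_qq("12345abc"))
print(unanchored.group(0))
```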

Similar to the option to ignore case, some regular expression tools also have an option for processing multiple lines. If this option is selected, ^ and $ match the start and end of each line instead.

Character escape

If you want to find a metacharacter itself, for example . or *, there is a problem: you cannot write them directly, because they will be interpreted with their special meaning. In that case you have to use \ to cancel the special meaning of these characters, so you should write \. and \*. Of course, to find \ itself, you also have to write \\.

For example: deerchao\.net matches deerchao.net, and C:\\Windows matches C:\Windows.
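The effect of escaping can be seen by comparing an escaped and an unescaped pattern. This is a sketch in Python (an assumption of this example); re.escape is a standard-library helper that escapes literal text for you, and the sample strings are invented.

```python
import re

# Unescaped, "." matches any character, so this also accepts "deerchaoXnet".
loose = re.compile(r"deerchao.net")

# Escaped, "\." matches only a literal period.
strict = re.compile(r"deerchao\.net")

# In the pattern, "\\" stands for one literal backslash.
windows_dir = re.compile(r"C:\\Windows")

# re.escape builds a pattern that matches the given text literally.
escaped = re.escape("1+1=2")

print(bool(loose.fullmatch("deerchaoXnet")), bool(strict.fullmatch("deerchaoXnet")))
print(bool(windows_dir.fullmatch("C:\\Windows")))
print(bool(re.fullmatch(escaped, "1+1=2")))
```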

Repeat

You have already seen these ways of matching repetition: *, +, {2}, {5,12}. The following are all the qualifiers in regular expressions (codes that specify a quantity, such as * and {5,12}):

*      repeats zero or more times
+      repeats one or more times
?      repeats zero or one time
{n}    repeats exactly n times
{n,}   repeats n or more times
{n,m}  repeats n to m times

Here are some examples of using repetition:

Windows\d+ matches Windows followed by 1 or more digits.

^\w+ matches the first word of a line (or the first word of the entire string; which one depends on the option settings).
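Both repetition examples can be checked with a short sketch in Python (an assumption of this example). Here re.MULTILINE stands in for the option setting that makes ^ match at the start of every line; the sample strings are invented.

```python
import re

# "Windows" followed by one or more digits; a bare "Windows" does not match.
versions = re.findall(r"Windows\d+", "Windows98 Windows Windows2000")

text = "first line\nsecond line"

# Default: ^ matches only at the start of the whole string.
first_of_string = re.findall(r"^\w+", text)

# With the multiline option: ^ matches at the start of every line.
first_of_each_line = re.findall(r"^\w+", text, re.MULTILINE)

print(versions)
print(first_of_string, first_of_each_line)
```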

Character class

Finding digits, letters or digits, or whitespace is simple, because metacharacters already exist for these character sets. But what if you want to match a set of characters that has no predefined metacharacter (such as the vowels a, e, i, o, u)?

It is very simple: just list them in square brackets. [aeiou] matches any English vowel, and [.?!] matches a punctuation mark (. or ? or !).

We can also easily specify a character range. [0-9] means exactly the same as \d: one digit. Similarly, [a-z0-9A-Z_] is completely equivalent to \w (if only English is considered).

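Here is a more complex expression: \(?0\d{2}[) -]?\d{8}. It matches phone numbers in several formats, such as (010)88886666, 022-22334455, or 02912345678. Let's analyze it: first there is an escaped character \(, which can appear 0 or 1 times (?), then a 0, followed by 2 digits (\d{2}), then one of ), - or a space, which appears once or not at all (?), and finally 8 digits (\d{8}).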
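Put together, the pieces analyzed above form the pattern \(?0\d{2}[) -]?\d{8}. The following sketch in Python (an assumption of this example; the sample numbers are for illustration) shows which formats it accepts, including two malformed ones that slip through because the opening and closing parentheses are optional independently of each other.

```python
import re

phone = re.compile(r"\(?0\d{2}[) -]?\d{8}")

well_formed = ["(010)88886666", "022-22334455", "02912345678"]
malformed = ["010)12345678", "(022-87654321"]

accepted_good = [s for s in well_formed if phone.fullmatch(s)]
# These have unbalanced parentheses but are accepted as well.
accepted_bad = [s for s in malformed if phone.fullmatch(s)]

print(accepted_good)
print(accepted_bad)
```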

Branch condition

Unfortunately, the expression just now also matches "incorrect" formats such as 010)12345678 or (022-87654321. To solve this problem, we need branch conditions. A branch condition in a regular expression is a set of rules where satisfying any one of them counts as a match. The way to write it is to separate the rules with |. Don't understand? It doesn't matter, look at the examples:

0\d{2}-\d{8}|0\d{3}-\d{7} matches two kinds of hyphen-separated phone numbers: one with a 3-digit area code and an 8-digit local number (such as 010-12345678), the other with a 4-digit area code and a 7-digit local number (such as 0376-2233445).

\(0\d{2}\)[- ]?\d{8}|0\d{2}[- ]?\d{8} matches a phone number with a 3-digit area code, where the area code may or may not be enclosed in parentheses, and the area code and local number may be separated by a hyphen or a space, or not separated at all. You can try using branch conditions to extend this expression to also support 4-digit area codes.

\d{5}-\d{4}|\d{5} matches United States zip codes. The rule for US zip codes is 5 digits, or 9 digits with a hyphen after the fifth. This example is given because it illustrates a problem: when using branch conditions, pay attention to the order of the conditions. If you change it to \d{5}|\d{5}-\d{4}, it will only match 5-digit zip codes (and the first 5 digits of 9-digit zip codes). The reason is that the conditions are tested from left to right, and once one branch is satisfied, the remaining conditions are not tried.
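The effect of branch order is easy to observe. This sketch in Python (an assumption of this example; the zip code is invented) runs both orderings against the same 9-digit zip code.

```python
import re

zip_code = "12345-6789"

# Longer alternative first: the full 9-digit form is matched.
long_first = re.search(r"\d{5}-\d{4}|\d{5}", zip_code).group(0)

# Shorter alternative first: the engine is satisfied after 5 digits.
short_first = re.search(r"\d{5}|\d{5}-\d{4}", zip_code).group(0)

print(long_first)
print(short_first)
```

Putting the more specific alternative first is the usual way to avoid this.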

Grouping

We have already seen how to repeat a single character (just put the qualifier directly after the character). But what if you want to repeat several characters? You can use parentheses to specify a subexpression (also called a group), and then specify how many times this subexpression repeats. You can also perform other operations on the subexpression (introduced later).

(\d{1,3}\.){3}\d{1,3} is a simple expression for matching IP addresses. To understand it, analyze it in this order: \d{1,3} matches a number of 1 to 3 digits, (\d{1,3}\.){3} matches a number of 1 to 3 digits followed by a period (together they form the group), repeated three times, and finally comes one more number of 1 to 3 digits (\d{1,3}).

Unfortunately, it also matches the impossible IP address 256.300.888.999. If arithmetic comparison were available, the problem could be solved simply, but regular expressions provide no mathematical functions, so a correct IP address can only be described with lengthy groups, alternatives, and character classes: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?).

The key to understanding this expression is understanding 2[0-4]\d|25[0-5]|[01]?\d\d?. I won't go into details here; you should be able to work out its meaning yourself.
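To see the difference between the simple and the strict IP expression, here is a sketch in Python (an assumption of this example; the addresses other than 256.300.888.999 are invented samples).

```python
import re

simple_ip = re.compile(r"(\d{1,3}\.){3}\d{1,3}")
strict_ip = re.compile(
    r"((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)"
)

samples = ["192.168.0.1", "255.255.255.255", "256.300.888.999"]

# For each sample: (accepted by simple pattern, accepted by strict pattern).
verdicts = {
    s: (bool(simple_ip.fullmatch(s)), bool(strict_ip.fullmatch(s)))
    for s in samples
}

for address, verdict in verdicts.items():
    print(address, verdict)
```

The three branches of the strict pattern cover 200-249, 250-255, and 0-199 respectively.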

Antonym

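Sometimes you need to find characters that do not belong to an easily defined character class. For example, to find any character other than a digit, you need to use an antonym:

\W        matches any character that is not a letter, digit, underscore, or Chinese character
\S        matches any character that is not whitespace
\D        matches any character that is not a digit
\B        matches a position that is not a word boundary
[^x]      matches any character except x
[^aeiou]  matches any character except the letters a, e, i, o, u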

Example: \S+ matches a string that contains no whitespace characters.

<a[^>]+> matches a string enclosed in angle brackets that starts with a.
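Negated classes can be tried out as follows. This is a sketch in Python (an assumption of this example) with invented sample strings; it uses \S+, the negated class [^aeiou], and the pattern <a[^>]+> for an angle-bracketed string starting with a.

```python
import re

# Runs of non-whitespace characters.
tokens = re.findall(r"\S+", "a  b\tc\n")

# Every character that is not one of the listed vowels.
non_vowels = re.findall(r"[^aeiou]", "audio")

# "<a", then one or more characters that are not ">", then ">".
a_tags = re.findall(r"<a[^>]+>", '<a href="x">link</a> <b>bold</b>')

print(tokens)
print(non_vowels)
print(a_tags)
```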

Back reference

After using parentheses to specify a subexpression, the text matched by this subexpression (that is, the content captured by this group) can be processed further in the expression or in other programs. By default, each group automatically gets a group number. The rule is: counting from left to right by the opening parenthesis of each group, the first group is number 1, the second is number 2, and so on.

Backward reference is used to repeatedly search for text matching a previous group. For example, \1 represents the text matched by group 1. Hard to understand? Please see the example:

\b(\w+)\b\s+\1\b can be used to match repeated words, like go go or kitty kitty. The expression starts with a word, that is, one or more letters or digits between the beginning and end of a word (\b(\w+)\b). This word is captured in group number 1. Then come one or more whitespace characters (\s+), and finally the content captured in group 1, that is, the previously matched word (\1).
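A repeated-word finder built on the back reference \1 might look like this. It is a sketch in Python (an assumption of this example) using the pattern \b(\w+)\b\s+\1\b on an invented sentence.

```python
import re

repeated_word = re.compile(r"\b(\w+)\b\s+\1\b")

text = "go go kitty kitty now"

# Full text of each match, and the word captured by group 1.
matches = [m.group(0) for m in repeated_word.finditer(text)]
words = [m.group(1) for m in repeated_word.finditer(text)]

print(matches)
print(words)
```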

You can also specify the group name of a subexpression yourself, using the following syntax: (?<Word>\w+) (or with the angle brackets replaced by single quotes: (?'Word'\w+)). This gives the group \w+ the name Word. To back-reference the content captured by this group, use \k<Word>, so the previous example can also be written as \b(?<Word>\w+)\b\s+\k<Word>\b.
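Named groups are spelled differently from environment to environment. The sketch below uses Python (an assumption of this example), whose re module writes a named group as (?P<Word>...) and a named back reference as (?P=Word) rather than the .Net forms described above; the idea is the same.

```python
import re

# Python spelling of a named group and a named back reference.
repeated_word = re.compile(r"\b(?P<Word>\w+)\b\s+(?P=Word)\b")

match = repeated_word.search("here kitty kitty")

# The captured text is available by name as well as by number.
by_name = match.group("Word")
by_number = match.group(1)

print(match.group(0), by_name, by_number)
```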

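There are many special-purpose syntaxes that use parentheses. Some of the most commonly used are listed below:

Capture
(exp)         matches exp and captures the text into an automatically numbered group
(?<name>exp)  matches exp and captures the text into a group called name; can also be written (?'name'exp)
(?:exp)       matches exp without capturing the text or assigning a group number

Zero-width assertions
(?=exp)       matches the position before exp
(?<=exp)      matches the position after exp
(?!exp)       matches a position that is not followed by exp
(?<!exp)      matches a position that is not preceded by exp

Comment
(?#comment)   has no effect on matching; it provides a comment for the reader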

Zero-width assertion

The next four constructs are used to find the position before or after certain content (without including that content). In other words, like \b, ^, and $, they specify a position that must satisfy certain conditions (assertions), so they are called zero-width assertions. This is best explained with examples:

(?=exp) is also called a zero-width positive lookahead assertion. It asserts that the expression exp can be matched after the position where it appears. For example, \b\w+(?=ing\b) matches the front part of a word ending in ing (everything except the ing). When searching I'm singing while you're dancing., it matches sing and danc.

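(?<=exp) is also called a zero-width positive lookbehind assertion. It asserts that the expression exp can be matched before the position where it appears. For example, (?<=\bre)\w+\b matches the latter part of a word starting with re (everything except the re). When searching reading a book, it matches ading.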

If you want to add a comma between every three digits of a very long number (counting from the right, of course), you can find the parts that need a comma in front of them like this: ((?<=\d)\d{3})+\b. When used to search 1234567890, the result is 234567890.

The following example uses both assertions: (?<=\s)\d+(?=\s) matches digits delimited by whitespace (again, not including the whitespace).
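Lookahead and lookbehind can be tried in Python as well (an assumption of this example; Python's re module supports lookbehind only for fixed-width expressions, which is enough here). The second pattern, (?<=\s)\d+(?=\s), combines a positive lookbehind and a positive lookahead; its sample text is invented.

```python
import re

# Lookahead: the part of each word that comes before a final "ing".
stems = re.findall(r"\b\w+(?=ing\b)", "I'm singing while you're dancing.")

# Lookbehind plus lookahead: digits with whitespace on both sides.
# The whitespace itself is asserted, not consumed.
spaced_numbers = re.findall(r"(?<=\s)\d+(?=\s)", "a 123 b45 678 c")

print(stems)
print(spaced_numbers)
```

Note that the stem of dancing comes out as danc, because only the trailing ing is excluded.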

Negative zero-width assertion

We mentioned earlier how to find characters that are not a certain character or are not in a certain character class (antonym). But what if we just want to make sure a certain character doesn't appear, but don't want to match it? For example, if we want to find a word in which the letter q appears, but the q is not followed by the letter u, we can try this:

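\b\w*q[^u]\w*\b matches words containing the letter q followed by something other than the letter u. But if you test more (or if you are sharp enough to notice it directly), you will find that this expression goes wrong when q is at the end of a word, as in Iraq or Benq. This is because [^u] always has to match one character, so if q is the last character of the word, the [^u] after it matches the word separator following q (a space, a period, or something else), and the \w*\b after that matches the next word. As a result, \b\w*q[^u]\w*\b can match the whole of Iraq fighting. A negative zero-width assertion solves this kind of problem because it only matches a position and does not consume any characters. Now we can solve the problem like this: \b\w*q(?!u)\w*\b.

The zero-width negative lookahead assertion (?!exp) asserts that the text after this position cannot match the expression exp. For example, \d{3}(?!\d) matches three digits that are not followed by another digit, and \b((?!abc)\w)+\b matches words that do not contain the consecutive string abc.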
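The difference between [^u] and (?!u) shows up clearly on the Iraq fighting example. This is a sketch in Python (an assumption of this example); the digit string in the last test is invented.

```python
import re

text = "Iraq fighting"

# [^u] must consume a character, so it swallows the space and the next word.
with_class = re.search(r"\b\w*q[^u]\w*\b", text).group(0)

# (?!u) only checks the position after q and consumes nothing.
with_lookahead = re.search(r"\b\w*q(?!u)\w*\b", text).group(0)

# A q followed by u is rejected by the lookahead version.
quit_match = re.search(r"\b\w*q(?!u)\w*\b", "quit")

# Three digits that are not followed by another digit.
triples = re.findall(r"\d{3}(?!\d)", "1234 567")

print(with_class)
print(with_lookahead)
print(quit_match, triples)
```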

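Similarly, we can use (?<!exp), the zero-width negative lookbehind assertion, to assert that the text before this position cannot match the expression exp: (?<![a-z])\d{7} matches seven digits that are not preceded by a lowercase letter.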

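A more complex example: (?<=<(\w+)>).*(?=<\/\1>) matches the content inside a simple HTML tag that has no attributes. (?<=<(\w+)>) specifies the prefix: a word enclosed in angle brackets (for example, it might be <b>), then comes .* (any string), and finally the suffix (?=<\/\1>). Note the \/ in the suffix, which uses the character escaping mentioned earlier. \1 is a back reference to the first captured group, the content matched by the earlier (\w+), so if the prefix is actually <b>, the suffix is </b>. The whole expression matches the content between <b> and </b> (again, not including the prefix and suffix themselves).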

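Comments

Another use of parentheses is to include comments with the syntax (?#comment). For example: 2[0-4]\d(?#200-249)|25[0-5](?#250-255)|[01]?\d\d?(?#0-199).

If you want to include comments, it is best to enable the "ignore whitespace in pattern" option, so that you can freely add spaces, tabs, and newlines when writing the expression, all of which are ignored in actual use. With this option enabled, all text from # to the end of the line is treated as a comment and ignored. For example, we can write one of the earlier expressions like this:

(?<=    # assert the prefix of the text to match
      <(\w+)> # find letters or digits enclosed in angle brackets (i.e. an HTML/XML tag)
      )       # end of prefix
      .*      # match any text
      (?=     # assert the suffix of the text to match
      <\/\1>  # find the content in angle brackets: a "/" first, then the previously captured tag
      )       # end of suffix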

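Greedy and lazy

When a regular expression contains a qualifier that accepts repetition, the usual behavior is to match as many characters as possible (while still allowing the whole expression to match). Take the expression a.*b: it matches the longest string that starts with a and ends with b. If you use it to search aabab, it matches the whole string aabab. This is called greedy matching.

Sometimes we need lazy matching instead, that is, matching as few characters as possible. Any of the qualifiers given earlier can be turned into lazy mode by putting a question mark ? after it. So .*? means matching any number of repetitions, but using the fewest repetitions that still let the whole match succeed. Now let's look at the lazy version of the example:

a.*?b matches the shortest string that starts with a and ends with b. Applied to aabab, it matches aab (the first to third characters) and ab (the fourth to fifth characters).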
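Greedy and lazy matching on aabab can be compared directly. This is a sketch in Python (an assumption of this example), whose re module uses the same ? suffix for lazy qualifiers.

```python
import re

text = "aabab"

# Greedy: .* grabs as much as possible, so there is one long match.
greedy = re.findall(r"a.*b", text)

# Lazy: .*? grabs as little as possible, giving two shorter matches.
lazy = re.findall(r"a.*?b", text)

print(greedy)
print(lazy)
```

The first lazy match is aab rather than ab because the engine starts at the earliest possible position, and the match starting at the first a has to extend until it meets a b.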


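Processing options

Several options, such as ignoring case and processing multiple lines, were introduced above. These options change the way a regular expression is processed. The following are commonly used regular expression options in .Net:

IgnoreCase               ignores case when matching
Multiline                changes the meaning of ^ and $ so that they match at the start and end of every line, not only the start and end of the whole string
Singleline               changes the meaning of . so that it matches every character, including the newline \n
IgnorePatternWhitespace  ignores unescaped whitespace in the expression and enables comments marked with #
ExplicitCapture          captures only groups that are explicitly named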

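A frequently asked question is: can only one of multiline mode and single-line mode be used at a time? The answer is no. The two options have nothing to do with each other, apart from having similar names (which is what causes the confusion).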
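That the two modes are independent can be demonstrated with a short sketch. It uses Python (an assumption of this example), where re.MULTILINE and re.DOTALL play roughly the roles of the multiline and single-line options: the first changes ^ and $, the second changes the dot.

```python
import re

text = "one\ntwo"

# Multiline only: ^ and $ match at every line, "." still stops at newlines.
lines = re.findall(r"^\w+$", text, re.MULTILINE)

# Single-line (DOTALL) only: "." also matches the newline.
across = re.search(r"one.two", text, re.DOTALL)
not_across = re.search(r"one.two", text)

# Both together: each option changes a different metacharacter.
both = re.findall(r"^.+$", text, re.MULTILINE | re.DOTALL)

print(lines)
print(across is not None, not_across is None)
print(both)
```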

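Balancing groups / recursive matching

Sometimes we need to match nestable hierarchical structures like ( 100 * ( 50 + 15 ) ). Simply using \(.+\) only matches the content between the leftmost opening parenthesis and the rightmost closing parenthesis (we are discussing greedy mode here; lazy mode has the problem described below as well). If the numbers of opening and closing parentheses in the original string are not equal, as in ( 5 / ( 3 + 2 ) ) ), then they will not be equal in our match result either. Is there a way to match the longest content between properly paired parentheses in such a string?

To keep ( and \( from thoroughly confusing you, let's use angle brackets instead of parentheses. Now the problem becomes: how do we capture the content inside the longest pair of matching angle brackets in a string like xx <aa <bbb> <bbb> aa> yy?

The following syntax constructs are needed here:

(?'group') names the captured content group and pushes it onto the stack

(?'-group') pops the most recently pushed capture named group off the stack; if the stack is already empty, this group fails to match

(?(group)yes|no) if a capture named group exists on the stack, continues matching with the yes part of the expression, otherwise continues with the no part

(?!) is a zero-width negative lookahead assertion; since it has no subexpression, the attempt to match always fails

What we need to do is push an "Open" every time we meet an opening bracket and pop one every time we meet a closing bracket. At the end we check whether the stack is empty. If it is not empty, there are more opening brackets than closing brackets, and the match should fail. The regular expression engine will backtrack (giving up some characters at the front or the end) to try to make the whole expression match.

<                         # outermost opening bracket
    [^<>]*                # non-bracket content after the outermost opening bracket
    (
        (
            (?'Open'<)    # met an opening bracket: write an "Open" on the blackboard
            [^<>]*        # match non-bracket content after the opening bracket
        )+
        (
            (?'-Open'>)   # met a closing bracket: erase one "Open"
            [^<>]*        # match non-bracket content after the closing bracket
        )+
    )*
    (?(Open)(?!))         # before the outermost closing bracket, check whether any "Open" is left on the blackboard; if so, the match fails
>                         # outermost closing bracket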

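One of the most common applications of balancing groups is matching HTML. The following example matches nested <div> tags: <div[^>]*>[^<>]*(((?'Open'<div[^>]*>)[^<>]*)+((?'-Open'</div>)[^<>]*)+)*(?(Open)(?!))</div>.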
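Balancing groups are a .Net feature; Python's re module, for instance, does not have them. The push and pop bookkeeping itself can still be illustrated outside a regular expression. The following is a rough sketch in plain Python, not an equivalent of the expression above: the function name find_balanced is hypothetical, it keeps a depth counter instead of a stack of captures, and unlike a regex engine it never backtracks, so an unmatched opening bracket swallows everything after it.

```python
def find_balanced(text: str, open_ch: str = "<", close_ch: str = ">") -> list[str]:
    """Return the outermost balanced bracketed spans found in text."""
    spans = []
    depth = 0      # number of "Open" marks currently on the blackboard
    start = None   # index of the outermost opening bracket
    for i, ch in enumerate(text):
        if ch == open_ch:
            if depth == 0:
                start = i
            depth += 1          # write one "Open"
        elif ch == close_ch and depth > 0:
            depth -= 1          # erase one "Open"
            if depth == 0:
                spans.append(text[start:i + 1])
    return spans


nested = find_balanced("xx <aa <bbb> <bbb> aa> yy")
stray_close = find_balanced("a > b")
print(nested)
print(stray_close)
```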

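Many elements for constructing regular expressions have been described above, but there are still many that have not been mentioned. Below is a list of some of these elements, with their syntax and a brief description. You can find more detailed reference material online to learn about them when you need them. If you have MSDN Library installed, you can also find detailed documentation on regular expressions under .net there.

[Table: other regular expression elements, with syntax and a short description]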
