This article covers repeated matching in regular expressions. The details are shared below for your reference.
Note: In all examples, the text matched by the regular expression is enclosed between [ and ] in the result. Some examples are implemented in Java; where an example depends on Java-specific regular expression usage, this is explained at that point. All Java examples were tested under JDK 1.6.0_13.
1. How many matches are there
The previous articles covered matching a single character. But what if a character or character set needs to be matched several times? For example, to match an email address using only the methods covered so far, someone might write a regular expression like \w@\w\.\w, but this only matches addresses such as a@b.c. That is clearly not enough, so let's look at how to match email addresses properly.
First, you need to know how an email address is composed: a group of characters made up of letters, digits, or underscores, followed by the @ symbol, and then the domain name, that is, username@domain. The details depend on the email service provider; some also allow the . character in user names.
1. Match one or more characters
To match one or more repetitions of the same character (or character set), simply add the + character as a suffix. + matches one or more occurrences (at least one). For example, a matches a itself, while a+ matches one or more consecutive a's; [0-9]+ matches one or more consecutive digits.
Note: When adding a + suffix to a character set, the + must be placed outside the character set, otherwise it is not a repetition. For example, [0-9+] matches a single digit or a + sign. Although it is syntactically valid, it is not what we want.
Text: Hello, mhmyqn@qq.com or mhmyqn@126.com is my email.
Regular expression: \w+@(\w+\.)+\w+
Result: Hello, [mhmyqn@qq.com] or [mhmyqn@126.com] is my email.
Analysis: \w+ matches one or more characters, and the subexpression (\w+\.)+ matches a string such as xxxx.edu. (including the trailing dot). An address never ends with a . character, so the pattern ends with one more \w+. Addresses such as mhmyqn@xxxx.edu.cn are matched as well.
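The email example can be tried out in Java with a minimal sketch like the one below. The class and method names (EmailFinder, findEmails) are made up for illustration; only the pattern and the sample text come from the example above. Note that each backslash of the regular expression is doubled inside a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailFinder {
    // The regular expression \w+@(\w+\.)+\w+ written as a Java string literal.
    static final Pattern EMAIL = Pattern.compile("\\w+@(\\w+\\.)+\\w+");

    /** Returns every substring of text that matches the email pattern. */
    static List<String> findEmails(String text) {
        List<String> found = new ArrayList<String>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {          // find() moves on to the next match each time
            found.add(m.group());   // group() is the text of the current match
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Hello, mhmyqn@qq.com or mhmyqn@126.com is my email.";
        System.out.println(findEmails(text));
    }
}
```

Running main prints the two addresses found in the sample text. The pattern is deliberately simple and is not a full validator for real-world email addresses.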
2. Match zero or more characters
Use the metacharacter * to match zero or more characters. It is used exactly like +: place it right after a character or character set, and it matches zero or more consecutive occurrences of that character (or character set). For example, the regular expression ab*c matches ac, abc, abbbbc, and so on.
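The difference between * and + can be checked with a short Java sketch. The class and method names are hypothetical; Pattern.matches tests whether the whole string matches the pattern.

```java
import java.util.regex.Pattern;

class StarVersusPlus {
    /** True when the whole string is an a, zero or more b's, then a c. */
    static boolean zeroOrMoreB(String s) {
        return Pattern.matches("ab*c", s);
    }

    /** True when the whole string is an a, one or more b's, then a c. */
    static boolean oneOrMoreB(String s) {
        return Pattern.matches("ab+c", s);
    }

    public static void main(String[] args) {
        String[] samples = {"ac", "abc", "abbbbc", "abd"};
        for (String s : samples) {
            System.out.println(s + "  ab*c=" + zeroOrMoreB(s) + "  ab+c=" + oneOrMoreB(s));
        }
    }
}
```

The only sample on which the two patterns disagree is ac: ab*c accepts it because zero b's are allowed, while ab+c requires at least one b.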
3. Match zero or one character
Use the metacharacter ? to match zero or one character. As mentioned in the previous article, the regular expression \r\n\r\n matches a blank line, but on Unix and Linux the \r is not present. With the ? metacharacter, \r?\n\r?\n matches blank lines on Windows as well as on Unix and Linux. Now let's look at an example that matches URLs using either the http or the https protocol:
Text: The URL is http://www.mikan.com, to connect securely use https://www.mikan.com instead.
Regular expression: https?://(\w+\.)+\w+
Result: The URL is [http://www.mikan.com], to connect securely use [https://www.mikan.com] instead.
Analysis: This pattern starts with https?. The ? means that the character before it (the s) may or may not be present, so the pattern matches both http and https. The rest is the same as in the previous example.
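Both uses of ? from this section can be sketched in Java as follows. The names OptionalCharDemo, findUrls and hasBlankLine are assumptions made for this illustration; the two patterns are the ones discussed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalCharDemo {
    // s? makes the s optional, so both http and https are accepted.
    static final Pattern URL = Pattern.compile("https?://(\\w+\\.)+\\w+");
    // \r? makes the carriage return optional, covering Windows and Unix line endings.
    static final Pattern BLANK_LINE = Pattern.compile("\\r?\\n\\r?\\n");

    /** Returns every URL found in text. */
    static List<String> findUrls(String text) {
        List<String> found = new ArrayList<String>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    /** True when text contains two consecutive line breaks. */
    static boolean hasBlankLine(String text) {
        return BLANK_LINE.matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "The URL is http://www.mikan.com, to connect securely use "
                + "https://www.mikan.com instead.";
        System.out.println(findUrls(text));
        System.out.println(hasBlankLine("one\r\n\r\ntwo")); // Windows line endings
        System.out.println(hasBlankLine("one\n\ntwo"));     // Unix line endings
    }
}
```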
2. Number of matching repetitions
+, * and ? in regular expressions solve many problems, but:
1) There is no upper limit on the number of characters that + and * match, and no way to set one.
2) +, * and ? have a fixed minimum of one or zero occurrences. We cannot set a different minimum.
3) With only * and +, we cannot require an exact number of occurrences.
Regular expressions provide a syntax for setting the number of repetitions: the number is written between the { and } characters.
1. Set an exact value for the number of repeated matches
If you want to set an exact value for the number of repeated matches, just write the number between { and }. For example, {4} means that the character (or set of characters) before it must be repeated 4 times in the original text to be considered a match. If it only appears 3 times, it is not considered a match.
As in the earlier article's example of matching colors in a web page, a repetition count can be used here: #[[:xdigit:]]{6} or #[0-9a-fA-F]{6}. In Java, the POSIX character class is written \p{XDigit}, so the pattern is #\p{XDigit}{6} (written as "#\\p{XDigit}{6}" in a Java string literal).
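As a rough illustration of the exact count {6}, the Java sketch below searches a made-up CSS fragment for six-digit hex colors using both spellings of the pattern. The class name, the findAll helper and the sample CSS are assumptions for this example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HexColorFinder {
    /** Returns every substring of text matched by regex. */
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical sample: one six-digit color and one three-digit color.
        String css = "body { color: #336633; background: #FFF; }";
        // Java spelling of the POSIX hex-digit class.
        System.out.println(findAll("#\\p{XDigit}{6}", css));
        // The same set of characters written out explicitly.
        System.out.println(findAll("#[0-9a-fA-F]{6}", css));
    }
}
```

Both calls report only #336633, because #FFF has three hex digits rather than the six that {6} requires.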
2. Set an interval for the number of repeated matches
The {} syntax can also set an interval for the number of repetitions, that is, a minimum and a maximum. Such intervals are written in the form {n,m}, where m>=n>=0. For example, a regular expression that checks whether a date is correctly formatted (without checking whether the date itself is valid), such as 2012-08-12 or 2012-8-12: \d{4}-\d{1,2}-\d{1,2}.
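The date pattern can be exercised with a small Java sketch. The class and method names are hypothetical, and, as the text says, only the format is checked, not whether the date exists.

```java
import java.util.regex.Pattern;

class DateFormatCheck {
    // Four digits, then one or two digits, then one or two digits, separated by hyphens.
    static final Pattern DATE = Pattern.compile("\\d{4}-\\d{1,2}-\\d{1,2}");

    /** True when the whole string has the shape yyyy-m-d or yyyy-mm-dd. */
    static boolean looksLikeDate(String s) {
        return DATE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"2012-08-12", "2012-8-12", "12-08-12", "2012-123-1", "2012-99-99"};
        for (String s : samples) {
            System.out.println(s + " -> " + looksLikeDate(s));
        }
    }
}
```

Note that 2012-99-99 passes the check: the interval {1,2} limits how many digits appear, not which values they represent.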
3. At least how many times must the match be repeated
The last use of the {} syntax is to give a minimum number of repetitions with no maximum, such as {3,}, which means at least 3 repetitions. Note: the comma in {3,} is required, and there must be no space after it; otherwise the pattern will not work as intended.
Let's look at an example that uses a regular expression to find all amounts of $100 or more:
Text:
$25.36
$125.36
$205.00
$2500.44
$44.30
Regular expression: \$\d{3,}\.\d{2}
Result:
$25.36
[$125.36]
[$205.00]
[$2500.44]
$44.30
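A minimal Java sketch of this search is shown below, using a list of amounts like the one above. The names AmountFinder and findLargeAmounts are made up for illustration. Two details are worth noting: the $ sign is escaped as \$ so that it is treated as a literal character rather than as the end-of-line metacharacter, and \.\d{2} requires exactly two digits after the decimal point, so an amount written with a single decimal digit would not match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AmountFinder {
    // Literal $, at least three digits, a literal dot, then exactly two digits.
    static final Pattern LARGE_AMOUNT = Pattern.compile("\\$\\d{3,}\\.\\d{2}");

    /** Returns every amount in text with at least three digits before the decimal point. */
    static List<String> findLargeAmounts(String text) {
        List<String> found = new ArrayList<String>();
        Matcher m = LARGE_AMOUNT.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "$25.36\n$125.36\n$205.00\n$2500.44\n$44.30";
        System.out.println(findLargeAmounts(text));
    }
}
```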
+, * and ? can each be expressed as a repetition count:
+ is equivalent to {1,}
* is equivalent to {0,}
? is equivalent to {0,1}
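These equivalences can be spot-checked in Java by comparing each shorthand with its counted form on a few sample strings. This is only a sketch over a handful of hypothetical inputs, not a proof, and the class and method names are assumptions.

```java
import java.util.regex.Pattern;

class ShorthandEquivalence {
    /** True when both patterns give the same full-match answer for every sample. */
    static boolean agree(String shorthand, String counted, String[] samples) {
        for (String s : samples) {
            if (Pattern.matches(shorthand, s) != Pattern.matches(counted, s)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = {"", "a", "aa", "aaa", "b"};
        System.out.println("a+ vs a{1,}:  " + agree("a+", "a{1,}", samples));
        System.out.println("a* vs a{0,}:  " + agree("a*", "a{0,}", samples));
        System.out.println("a? vs a{0,1}: " + agree("a?", "a{0,1}", samples));
    }
}
```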
3. Prevent over-matching
? matches at most one character, and {n} and {n,m} also have an upper limit on the number of repetitions. But *, + and {n,} have no upper limit, which sometimes leads to over-matching.
Let's look at an example that matches an HTML tag:
Text:
Yesterday is <b>history</b>,tomorrow is a <B>mystery</B>, but today is a <b>gift</b>.
Regular expression: <[Bb]>.*</[Bb]>
Result:
Yesterday is [<b>history</b>,tomorrow is a <B>mystery</B>, but today is a <b>gift</b>].
Analysis: <[Bb]> matches the opening <b> tag (in either case), and </[Bb]> matches the closing </b> tag (in either case). But the result is not what we expected. The text contains three <b> elements, yet everything from the first opening tag to the last closing tag is matched as a single result.
Why is this? Because * and + are greedy metacharacters: when matching, they take as much as they can. They match as far as possible, up to the last place in the text where the rest of the pattern can still match, rather than stopping at the first such place.
When this greedy behavior is not wanted, use the lazy versions of these metacharacters. Lazy means matching as few characters as possible, the opposite of greedy. A lazy metacharacter is formed by adding a ? suffix to the greedy one. Here are the greedy metacharacters and their lazy versions:
Greedy    Lazy
*         *?
+         +?
{n,}      {n,}?
So in the above example, the regular expression only needs to be changed to <[Bb]>.*?</[Bb]>. The result is as follows:
<b>history</b>
<B>mystery</B>
<b>gift</b>
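The contrast between greedy and lazy matching can be reproduced with the Java sketch below. It assumes the sample text contains three bold elements with mixed-case b tags; the class name and the findAll helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVersusLazy {
    /** Returns every substring of text matched by regex. */
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Yesterday is <b>history</b>,tomorrow is a <B>mystery</B>, "
                + "but today is a <b>gift</b>.";
        // Greedy: .* runs to the last closing tag, giving one long match.
        System.out.println(findAll("<[Bb]>.*</[Bb]>", text));
        // Lazy: .*? stops at the first closing tag, giving one match per element.
        System.out.println(findAll("<[Bb]>.*?</[Bb]>", text));
    }
}
```

With this input the greedy pattern yields a single match spanning all three elements, while the lazy pattern yields three separate matches.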
4. Summary
The real power of regular expressions shows in repeated matching. This article introduced the metacharacters +, * and ?. To specify the number of repetitions precisely, use {}. Repetition metacharacters come in two kinds, greedy and lazy; when you need to prevent over-matching, build the regular expression with lazy metacharacters. Position matching will be covered in the next article.
I hope this article will be helpful for everyone to learn regular expressions.