Regular expression metacharacters and matching rules

Metacharacter    Explanation

*    Matches the preceding atom zero or more times.

+    Matches the preceding atom one or more times.

?    Matches the preceding atom zero or one time; the atom is optional.

.    Matches any character except \n. Strictly speaking, the dot is itself an atom.

|    Or. Note: it has the lowest priority.

^    The string must start with the pattern that follows ^.

$    The string must end with the pattern that precedes $.

\b    Word boundary.

\B    Non-word boundary.

{m}    The preceding atom must appear exactly m times.

{n,m}    The preceding atom may appear n to m times.

{m,}    The preceding atom must appear at least m times, with no upper limit.

()    Changes the priority or treats a sequence as a whole. It can also be used to capture the matched data.

+ Matches the preceding character at least once

The match succeeds, which demonstrates the + in \d+: \d matches a digit, and + requires the preceding character to appear at least once.
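The original example is not shown here, so the following is only a minimal sketch of how \d+ behaves with preg_match. The sample strings and variable names are made up for illustration.

```php
<?php
// Hypothetical sample strings; they are not taken from the original example.
$pattern = '/\d+/';
$withDigits = 'abc12345def';
$withoutDigits = 'abcdef';

// + needs at least one digit, so only the first string matches.
$foundWith = preg_match($pattern, $withDigits, $matches);
$foundWithout = preg_match($pattern, $withoutDigits);

var_dump($foundWith, $matches[0], $foundWithout);
```

Because + is greedy, $matches[0] holds the whole run of digits rather than just the first one.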

* Matches the preceding character zero or more times

Both $string1 and the commented-out $string match successfully. \w matches 0-9, A-Z, a-z and _, and * means that the preceding \w does not have to be present; if it is present, it may appear once or many times.
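A small sketch of \w* in practice, assuming a made-up pattern /go\w*d/ and made-up sample strings (the original ones are not shown):

```php
<?php
// Hypothetical pattern: "go", then zero or more word characters, then "d".
$pattern = '/go\w*d/';

$zeroWordChars = preg_match($pattern, 'god');    // \w* matches nothing
$oneWordChar   = preg_match($pattern, 'good');   // \w* matches "o"
$manyWordChars = preg_match($pattern, 'go_2d');  // \w* matches "_2"
$spaceInside   = preg_match($pattern, 'go d');   // a space is not a \w character

var_dump($zeroWordChars, $oneWordChar, $manyWordChars, $spaceInside);
```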

? The preceding character appears 0 or 1 times (it is optional)

$string and $string2 match successfully, but $string1 does not. The pattern has ABC before and after, with a digit 0-9 in between. The digit is optional, but there cannot be more than one.
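The description above suggests a pattern along the lines of /ABC\d?ABC/. That pattern and the three sample values below are a reconstruction, not the original code:

```php
<?php
// Assumed pattern: ABC, an optional single digit, then ABC again.
$pattern = '/ABC\d?ABC/';

$string  = 'ABCABC';    // no digit in the middle
$string1 = 'ABC12ABC';  // two digits in the middle
$string2 = 'ABC5ABC';   // exactly one digit in the middle

$resultString  = preg_match($pattern, $string);
$resultString1 = preg_match($pattern, $string1);
$resultString2 = preg_match($pattern, $string2);

var_dump($resultString, $resultString1, $resultString2);
```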

. (dot) Matches all characters except \n
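As an illustration of the dot, here is a short sketch with a made-up pattern /a.c/. The s modifier shown at the end is a standard PCRE option that lets the dot match \n as well.

```php
<?php
// Hypothetical pattern: "a", any single character except a newline, then "c".
$letter  = preg_match('/a.c/', 'abc');
$hyphen  = preg_match('/a.c/', 'a-c');
$newline = preg_match('/a.c/', "a\nc");   // double quotes turn \n into a real newline

// With the s modifier the dot also matches a newline.
$newlineDotAll = preg_match('/a.c/s', "a\nc");

var_dump($letter, $hyphen, $newline, $newlineDotAll);
```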


| (vertical bar) Or; it has the lowest priority

Let's see through an experiment how | interacts with priority:

1. At first, my intention was to match abccd or abbcd. However, when $string1 and $string2 are matched, the results are abc and bcd.

2. The "or" does work, but it matches abc or bcd: | has a lower priority than a sequence of adjacent characters.
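The two observations above can be reproduced with a sketch like the following. The pattern /abc|bcd/ and the sample strings are assumptions inferred from the results described, since the original code is not shown.

```php
<?php
// Without parentheses, | splits the whole pattern into "abc" or "bcd".
$pattern = '/abc|bcd/';

preg_match($pattern, 'abccd', $first);   // finds "abc" at the start
preg_match($pattern, 'abbcd', $second);  // finds "bcd" at the end

var_dump($first[0], $second[0]);
```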

Then the question is, what should I do if I want to match abccd or abbcd in the above example?

You need to use () to change the priority.

When the match succeeds, the result is:

array (size=2)
  0 => string 'abccd' (length=5)
  1 => string 'c' (length=1)

Conclusion:

1) It does match abccd or abbcd ($string1 or $string3).

2) But the match array contains one more element, with index 1.

3) Whenever the content inside () matches, the matched data is placed in the array element with index 1.
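A sketch of the grouped version, assuming the pattern /ab(c|b)cd/ (inferred from the output shown above, not confirmed by the source). The last call uses (?:...), a non-capturing group, in case the extra array element is not wanted.

```php
<?php
// Parentheses limit the alternation to the third character only.
preg_match('/ab(c|b)cd/', 'abccd', $matchC);
preg_match('/ab(c|b)cd/', 'abbcd', $matchB);

// (?: ) groups without capturing, so only index 0 is filled.
preg_match('/ab(?:c|b)cd/', 'abccd', $matchNoCapture);

var_dump($matchC, $matchB, $matchNoCapture);
```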

^ (circumflex) The string must start with the pattern after ^

The experiment leads to the following conclusions:

1) $string1 matches successfully, but $string2 does not.

2) $string1 starts with the specified characters,

3) while $string2 does not start with the characters after ^.

In plain words, this regular expression means: start with "be so handsome", followed by at least one character from a-zA-Z0-9_.
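A minimal sketch of that rule, using the translated phrase literally. The two sample strings are invented for illustration.

```php
<?php
// ^ anchors the phrase to the start; \w+ needs at least one word character after it.
$pattern = '/^be so handsome\w+/';

$string1 = 'be so handsome123abc';      // starts with the phrase
$string2 = 'you be so handsome123abc';  // the phrase is not at the start

$startsWithPhrase = preg_match($pattern, $string1);
$phraseInMiddle   = preg_match($pattern, $string2);

var_dump($startsWithPhrase, $phraseInMiddle);
```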

$ (dollar sign) The string must end with the pattern before $

Let's run it and draw a conclusion from the results:

$string1 matches successfully, but $string2 does not. The pattern before $ is \d+ followed by the Chinese word for "effort", so the string has to end with that whole sequence. \d matches a digit 0-9, and + requires at least one digit.
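The sketch below substitutes the ASCII word "effort" for the Chinese word so that it runs without any encoding setup. The pattern and both strings are stand-ins, not the original example.

```php
<?php
// Stand-in pattern: one or more digits followed by "effort" at the very end.
$pattern = '/\d+effort$/';

$string1 = 'day 100effort';     // ends with digits + "effort"
$string2 = '100effort so far';  // the sequence is not at the end

$endsWithSequence = preg_match($pattern, $string1);
$sequenceInMiddle = preg_match($pattern, $string2);

var_dump($endsWithSequence, $sequenceInMiddle);
```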

\b and \B Word boundary and non-word boundary

Let's explain what a boundary is:

1. The text being matched has boundaries: its very beginning and its very end.

2. In English text, a word followed by a space has ended, so the boundary of the word has been reached at that point.

\b Word boundary: the position must be at the start or end of a word, that is, between a \w character and a non-\w character or an end of the string.
\B Non-word boundary: the position must not be at the start or end of a word.

Conclusion:

$string1, $string2 and $string3 all match successfully.

When $string1 is matched, the space after this is the boundary.

When $string2 is matched, the boundary comes after thisis.

When $string3 is matched, thisisaapple runs to the end of the entire string, which is also a boundary. So the match succeeds.
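The three results are consistent with a pattern such as /\w+\b/. That pattern and the sample strings below are a guess based on the explanation, so treat this as a sketch only.

```php
<?php
// Assumed pattern: a run of word characters that stops at a word boundary.
$pattern = '/\w+\b/';

preg_match($pattern, 'this is a apple', $m1);  // boundary is the first space
preg_match($pattern, 'thisis a apple', $m2);   // boundary comes after "thisis"
preg_match($pattern, 'thisisaapple', $m3);     // boundary is the end of the string

var_dump($m1[0], $m2[0], $m3[0]);
```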

Let's experiment with non-word boundaries:

Summary:

$string1 matches successfully, but $string2 does not.

Because \B comes right before this, this must not start at a word boundary (after a space or at the beginning of the string).
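A possible illustration, assuming the pattern /\Bthis/ and invented sample strings:

```php
<?php
// \B requires that "this" does NOT begin at a word boundary.
$pattern = '/\Bthis/';

$insideWord    = preg_match($pattern, 'hellothis');   // "this" follows a letter
$afterSpace    = preg_match($pattern, 'hello this');  // "this" follows a space
$atStringStart = preg_match($pattern, 'this');        // "this" starts the string

var_dump($insideWord, $afterSpace, $atStringStart);
```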

{m} The preceding character must appear exactly m times

Conclusion:
In the example above, \d{3} requires a digit 0-9 to appear exactly 3 times: not one more, not one fewer.
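A short sketch of {m}. The anchors ^ and $ are added here as an assumption, because without them \d{3} would also be found inside a longer run of digits.

```php
<?php
// Anchored: the whole string must be exactly three digits.
$exactlyThree = preg_match('/^\d{3}$/', '123');
$onlyTwo      = preg_match('/^\d{3}$/', '12');
$fourDigits   = preg_match('/^\d{3}$/', '1234');

// Unanchored: the first three digits of "1234" already satisfy \d{3}.
$unanchored = preg_match('/\d{3}/', '1234');

var_dump($exactlyThree, $onlyTwo, $fourDigits, $unanchored);
```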

{n,m} The preceding character may appear n to m times

Conclusion:
In the example above, \d{1,3} requires a digit 0-9 to appear once, twice or three times. Any other count fails.
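The same idea for {n,m}, again with anchors and sample values that are assumptions for illustration:

```php
<?php
// The whole string must consist of one to three digits.
$pattern = '/^\d{1,3}$/';
$candidates = ['7', '42', '123', '1234', ''];

// One result per candidate, in the same order as $candidates.
$results = array_map(
    fn(string $candidate): int => preg_match($pattern, $candidate),
    $candidates
);

var_dump($results);
```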

{m,} The preceding character must appear at least m times, with no upper limit

Conclusion:
In the example above, \d{2,} requires the digits 0-9 after the word "drink" to appear at least twice, with no upper limit. Therefore $string1 fails to match, while $string2 and $string3 match successfully.
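A sketch of that behaviour, using the English word "drink" as a stand-in prefix. The pattern and the three strings are assumptions chosen to reproduce the results described.

```php
<?php
// Stand-in pattern: "drink" followed by at least two digits.
$pattern = '/drink\d{2,}/';

$string1 = 'drink1';      // only one digit
$string2 = 'drink12';     // two digits
$string3 = 'drink12345';  // more than two digits

$oneDigit   = preg_match($pattern, $string1);
$twoDigits  = preg_match($pattern, $string2);
$manyDigits = preg_match($pattern, $string3);

var_dump($oneDigit, $twoDigits, $manyDigits);
```

Note that the quantifier is written without a space: {2,} rather than {2, }.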

Matching rules

Basic pattern matching

Everything starts from the basics. Patterns are the most basic elements of regular expressions. They are a set of characters that describe the characteristics of a string. Patterns can be simple, consisting of ordinary strings, or very complex, often using special characters to represent a range of characters, recurrences, or to represent context. For example:

^once

This pattern contains a special character ^, which means that the pattern only matches strings starting with once. For example, this pattern matches the string "once upon a time" but does not match "There once was a man from New York". Just as the ^ symbol indicates the beginning, the $ symbol matches strings that end with a given pattern.

bucket$

This pattern matches "Who kept all of his cash in a bucket" but does not match "buckets". When the characters ^ and $ are used together, they represent an exact match (the string is identical to the pattern). For example:

^bucket$

This pattern matches only the string "bucket". If a pattern includes neither ^ nor $, it matches any string that contains the pattern. For example, the pattern

once

matches the string

There once was a man from New York
Who kept all of his cash in a bucket.
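The anchoring rules above can be checked with preg_match. This is a minimal sketch that uses the sentences from the text as test data; the variable names are made up.

```php
<?php
$limerick = 'There once was a man from New York';

$startAnchored   = preg_match('/^once/', 'once upon a time');  // starts with "once"
$startMissing    = preg_match('/^once/', $limerick);           // "once" is not first
$unanchored      = preg_match('/once/', $limerick);            // "once" appears somewhere
$endAnchored     = preg_match('/bucket$/', 'Who kept all of his cash in a bucket');
$endMissing      = preg_match('/bucket$/', 'buckets');         // ends with "s"
$exactMatch      = preg_match('/^bucket$/', 'bucket');
$exactWithExtras = preg_match('/^bucket$/', 'a bucket');       // extra text before

var_dump($startAnchored, $startMissing, $unanchored, $endAnchored,
         $endMissing, $exactMatch, $exactWithExtras);
```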

The letters (o-n-c-e) in this pattern are literal characters, that is, they represent the letters themselves, as do digits. Some other slightly more complex characters, such as punctuation marks and whitespace characters (spaces, tabs, etc.), require escape sequences. All escape sequences begin with a backslash (\). The escape sequence for the tab character is \t. So if we want to check whether a string starts with a tab character, we can use this pattern:

^\t

Similarly, \n means "new line" and \r means "carriage return". Other special symbols can be matched by putting a backslash in front of them. For example, the backslash itself is written \\, the period is written \., and so on.
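A brief sketch of these escape sequences in PHP. The sample strings are invented; pay attention to the PHP quoting, since single-quoted strings pass \t and \. through to the regex engine unchanged.

```php
<?php
$startsWithTab = preg_match('/^\t/', "\tindented");  // double quotes produce a real tab
$noLeadingTab  = preg_match('/^\t/', 'indented');

$hasPeriod = preg_match('/\./', 'file.txt');  // \. is a literal period
$noPeriod  = preg_match('/\./', 'filetxt');

// In a single-quoted PHP string, '\\\\' is two backslashes,
// which the regex engine reads as one literal backslash.
$hasBackslash = preg_match('/\\\\/', 'C:\\dir');

var_dump($startsWithTab, $noLeadingTab, $hasPeriod, $noPeriod, $hasBackslash);
```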

Character cluster

In web applications, regular expressions are usually used to validate user input. When a user submits a form, ordinary literal characters are not enough to determine whether the phone number, address, email address, credit card number and so on are valid.

So we need a more flexible way to describe the pattern we want: character clusters. To create a cluster representing all vowels, place all the vowels in square brackets:

[AaEeIiOoUu]

This pattern matches any vowel character, but can only represent one character. Use a hyphen to represent a range of characters, such as:

[a-z] //Match all lowercase letters
[A-Z] //Match all uppercase letters
[a-zA-Z] //Match all letters
[0-9] //Match all digits
[0-9\.\-] //Match all digits, periods and minus signs
[ \f\r\t\n] //Match all whitespace characters

Similarly, these only represent one character, which is very important. If you want to match a string consisting of a lowercase letter and a digit, such as "z2", "t6" or "g7", but not "ab2", "r2d3" or "b52", use this pattern:

^[a-z][0-9]$

Although [a-z] represents a range of 26 letters, here it stands for exactly one character: the pattern matches only strings whose first character is a lowercase letter.
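To check the pattern against the sample strings mentioned above, a loop like this can be used (a sketch; the variable names are arbitrary):

```php
<?php
// One lowercase letter followed by one digit, and nothing else.
$pattern = '/^[a-z][0-9]$/';

$results = [];
foreach (['z2', 't6', 'g7', 'ab2', 'r2d3', 'b52'] as $candidate) {
    $results[$candidate] = preg_match($pattern, $candidate);
}

var_dump($results);
```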

It was mentioned earlier that ^ represents the beginning of a string, but it also has another meaning. When ^ is used within a set of square brackets, it means "not" or "exclude" and is often used to eliminate a certain character. Using the previous example, we require that the first character cannot be a number:

^[^0-9][0-9]$

This pattern matches "&5", "g7" and "-2", but does not match "12" and "66". Here are a few examples of excluding specific characters:

[^a-z] //All characters except lowercase letters
[^\\\/\^] //All characters except (\)(/)(^)
[^\"\'] //All characters except double quotes (") and single quotes (')
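The negated cluster from the earlier example, ^[^0-9][0-9]$, can be tried out the same way. This is only a sketch using the strings listed in the text.

```php
<?php
// First character: anything except a digit. Second character: a digit.
$pattern = '/^[^0-9][0-9]$/';

$results = [];
foreach (['&5', 'g7', '-2', '12', '66'] as $candidate) {
    // Prefix the key so PHP does not turn numeric strings into integer keys.
    $results['s:' . $candidate] = preg_match($pattern, $candidate);
}

var_dump($results);
```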

The special character "." (dot, period) is used in regular expressions to represent any character except "new line". So the pattern "^.5$" matches any two-character string that ends with the digit 5 and starts with any character other than "new line". The pattern "." matches any string except the empty string and a string containing only a "new line".

PHP's regular expressions have some built-in common character clusters. The list is as follows:

Character cluster    Description

[[:alpha:]]    Any letter

[[:digit:]]    Any digit

[[:alnum:]]    Any letter or digit

[[:space:]]    Any whitespace character

[[:upper:]]    Any uppercase letter

[[:lower:]]    Any lowercase letter

[[:punct:]]    Any punctuation mark

[[:xdigit:]]    Any hexadecimal digit, equivalent to [0-9a-fA-F]
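These named clusters can be used with preg_match inside a bracket expression. A minimal sketch with invented sample values:

```php
<?php
$threeLetters   = preg_match('/^[[:alpha:]]{3}$/', 'abc');
$letterAndDigit = preg_match('/^[[:alpha:]]{3}$/', 'ab1');  // "1" is not a letter

$validHex   = preg_match('/^[[:xdigit:]]+$/', '1aF');
$invalidHex = preg_match('/^[[:xdigit:]]+$/', '1aG');       // "G" is not a hex digit

$containsSpace = preg_match('/[[:space:]]/', 'a b');

var_dump($threeLetters, $letterAndDigit, $validHex, $invalidHex, $containsSpace);
```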

Identify recurring occurrences


By now, you already know how to match a letter or a digit, but in many cases you may want to match a word or a group of digits. A word consists of several letters, and a group of digits consists of several single digits. The curly braces ({}) following a character or character cluster determine the number of times the preceding content is repeated.

Pattern    Description

^[a-zA-Z_]$    A single letter or underscore

^[[:alpha:]]{3}$    Any 3-letter word

^a$    The letter a

^a{4}$    aaaa

^a{2,4}$    aa, aaa or aaaa

^a{1,3}$    a, aa or aaa

^a{2,}$    A string of two or more a's

^a{2,}    Such as aardvark and aaab, but not apple

a{2,}    Such as baad and aaa, but not Nantucket

\t{2}    Two tab characters

.{2}    Any two characters

These examples describe three different uses of curly braces. A single number, {x}, means "the preceding character or character cluster appears exactly x times"; a number plus a comma, {x,}, means "the preceding content appears x or more times"; two comma-separated numbers, {x,y}, means "the preceding content appears at least x times, but not more than y times". We can extend the pattern to more words or numbers:

^[a-zA-Z0-9_]{1,}$ //All strings of one or more letters, digits or underscores
^[0-9]{1,}$ //All unsigned whole numbers
^\-{0,1}[0-9]{1,}$ //All integers
^[-]?[0-9]+\.?[0-9]+$ //All floating point numbers

The last example is not easy to understand, is it? Look at it this way: everything starts (^) with an optional minus sign ([-]?), followed by 1 or more digits ([0-9]+), then an optional decimal point (\.?), followed by 1 or more digits ([0-9]+), with nothing else after them ($). Below you will learn about simpler notations you can use.
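Here is a sketch that runs the integer and floating point patterns against a few made-up inputs. The helper names $isInteger and $isFloat are hypothetical. One thing the test values bring out is that the floating point pattern needs at least two digits in total, so a lone "5" is rejected.

```php
<?php
$integerPattern = '/^\-{0,1}[0-9]{1,}$/';
$floatPattern   = '/^[-]?[0-9]+\.?[0-9]+$/';

// Hypothetical helpers that turn preg_match's 1/0 into true/false.
$isInteger = fn(string $s): bool => preg_match($integerPattern, $s) === 1;
$isFloat   = fn(string $s): bool => preg_match($floatPattern, $s) === 1;

var_dump($isInteger('42'), $isInteger('-7'), $isInteger('4.2'));
var_dump($isFloat('3.14'), $isFloat('-0.5'), $isFloat('42'), $isFloat('5'), $isFloat('.5'));
```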

The special character "?" is equal to {0,1}; both mean "0 or 1 of the preceding content", or "the preceding content is optional". So the example just now can also be written as:

^\-?[0-9]{1,}\.?[0-9]{1,}$

The special character "*" is equal to {0,}; both mean "0 or more of the preceding content". Finally, the character "+" is equal to {1,}, which means "1 or more of the preceding content", so the 4 examples above can be written as:

^[a-zA-Z0-9_]+$ //All strings of one or more letters, digits or underscores
^[0-9]+$ //All unsigned whole numbers
^\-?[0-9]+$ //All integers
^\-?[0-9]+\.?[0-9]+$ //All floating point numbers

Of course, this doesn't technically reduce the complexity of the regular expressions, but it makes them easier to read.
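The equivalence of the shorthand and the curly-brace forms can be checked mechanically. The sketch below compares each pair on a handful of invented sample strings.

```php
<?php
// Each pair holds a curly-brace form and its shorthand.
$pairs = [
    ['/^a{0,1}$/', '/^a?$/'],
    ['/^a{0,}$/',  '/^a*$/'],
    ['/^a{1,}$/',  '/^a+$/'],
];
$samples = ['', 'a', 'aa', 'aaa', 'b'];

$allEqual = true;
foreach ($pairs as [$longForm, $shortForm]) {
    foreach ($samples as $sample) {
        // Both forms should agree on every sample.
        if (preg_match($longForm, $sample) !== preg_match($shortForm, $sample)) {
            $allEqual = false;
        }
    }
}

var_dump($allEqual);
```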



