Regular Expressions Introduction and Syntax
Why use regular expressions?
Typical search and replace operations require you to provide the exact text that matches the expected search results. While this technique may be sufficient for simple search and replace tasks on static text, it lacks flexibility, which makes searching dynamic text difficult, if not impossible.
By using regular expressions, you can:
1. Test patterns within strings.
For example, you can test the input string to see if a phone number pattern or a credit card number pattern occurs within the string. This is called data validation.
2. Replace text.
You can use regular expressions to identify specific text in a document, remove that text entirely, or replace it with other text.
3. Extract substrings from strings based on pattern matching.
You can find specific text in the document or input field.
For example, you may need to search the entire site, remove outdated material, and replace certain HTML formatting tags. In this case, regular expressions can be used to determine whether this material or this HTML formatting tag occurs in each file. This process narrows the list of affected files to those that contain material that needs to be removed or changed. Regular expressions can then be used to remove outdated material. Finally, regular expressions can be used to search and replace tags.
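The three uses listed above (test, replace, extract) can be sketched in a few lines. This is a minimal Python illustration using the standard `re` module; the sample sentence and the phone-like pattern are assumptions made up for the example, not part of the original text.

```python
import re

# Hypothetical sample input and a simple phone-like pattern (3 digits, hyphen, 4 digits).
text = "Call 555-0142 or 555-0199 before 5 pm."
phone = re.compile(r"\d{3}-\d{4}")

# 1. Test: does the pattern occur anywhere in the string?
has_phone = phone.search(text) is not None

# 2. Replace: mask every occurrence of the pattern.
masked = phone.sub("XXX-XXXX", text)

# 3. Extract: collect every substring that matches the pattern.
numbers = phone.findall(text)

print(has_phone)  # True
print(masked)     # Call XXX-XXXX or XXX-XXXX before 5 pm.
print(numbers)    # ['555-0142', '555-0199']
```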
Regular expressions - Syntax
A regular expression describes a string-matching pattern. It can be used to check whether a string contains a certain substring, to replace the matching substring, or to extract a substring that meets a certain condition from a string.
When listing directories, *.txt in dir *.txt or ls *.txt is not a regular expression, because the meaning of * here is different from the * in regular expressions.
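To make the difference between a shell wildcard and a regular expression concrete, here is a small Python sketch. It assumes the standard `fnmatch` module as a stand-in for the shell's wildcard matching; the file name is a made-up sample.

```python
import fnmatch
import re

name = "notes.txt"  # hypothetical file name

# Shell-style wildcard: * means "any run of characters".
glob_hit = fnmatch.fnmatch(name, "*.txt")

# Regular expression: * is a qualifier and needs something before it to repeat,
# so the bare pattern "*.txt" is rejected by the regex compiler.
try:
    re.compile("*.txt")
    regex_valid = True
except re.error:
    regex_valid = False

# A regular expression with the same intent: any characters, a literal dot, then "txt".
regex_hit = re.fullmatch(r".*\.txt", name) is not None

print(glob_hit, regex_valid, regex_hit)  # True False True
```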
The method of constructing a regular expression is the same as that of creating a mathematical expression. That is, small expressions can be combined together to create larger expressions using a variety of metacharacters and operators. The components of a regular expression can be a single character, a collection of characters, a range of characters, a selection between characters, or any combination of all of these components.
Regular expressions are text patterns composed of ordinary characters (such as the characters a through z) and special characters (called "metacharacters"). A pattern describes one or more strings to match when searching for text. A regular expression acts as a template that matches a character pattern with a searched string.
Normal characters
Normal characters include all printable and non-printable characters that are not explicitly designated as metacharacters. This includes all uppercase and lowercase letters, all digits, all punctuation, and some other symbols.
Non-printing characters
Non-printing characters can also be part of regular expressions. The following table lists the escape sequences that represent non-printing characters:
Character  Description
\cx  Matches the control character specified by x. For example, \cM matches a Control-M or carriage return character. The value of x must be one of A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
Special characters
Special characters are characters with a special meaning, such as the * in "*.txt" mentioned above, which, simply put, stands for any string. To find files with * in the file name, you need to escape the *, that is, put a \ before it: ls \*.txt. Many metacharacters require special treatment when you try to match them literally. To match these special characters, you must first "escape" them, that is, precede them with a backslash character (\). The following table lists the special characters in regular expressions:
Special character  Description
$  Matches the end position of the input string. If the RegExp object's Multiline property is set, $ also matches the position before '\n' or '\r'. To match the $ character itself, use \$.
( )  Marks the beginning and end of a subexpression. Subexpressions can be captured for later use. To match these characters, use \( and \).
*  Matches the preceding subexpression zero or more times. To match the * character, use \*.
+  Matches the preceding subexpression one or more times. To match the + character, use \+.
.  Matches any single character except the newline character \n. To match ., use \. (a backslash followed by a period).
[  Marks the beginning of a square bracket expression. To match [, use \[.
?  Matches the preceding subexpression zero or one time, or marks a non-greedy qualifier. To match the ? character, use \?.
\  Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character 'n', while '\n' matches a newline character. The sequence '\\' matches "\", and '\(' matches "(".
^  Matches the start position of the input string, unless it is used in a square bracket expression, in which case it negates the character set, so the expression matches characters that are not in the set. To match the ^ character itself, use \^.
{  Marks the start of a qualifier expression. To match {, use \{.
|  Indicates a choice between two alternatives. To match |, use \|.
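As an illustration of escaping, the Python sketch below compares an unescaped pattern with an escaped one. The sample string is an assumption for the example; `re.escape` is the standard-library helper that adds the backslashes automatically.

```python
import re

text = "Total: $5.00 (approx.)"  # hypothetical sample input

# Unescaped, "." matches any character, so the pattern 5.00 also matches "5x00".
loose = re.search(r"5.00", "5x00") is not None

# Escaped, each metacharacter ($ . ( )) is matched literally.
strict = re.search(r"\$5\.00 \(approx\.\)", text) is not None

# re.escape builds the escaped pattern from a literal string.
auto = re.search(re.escape("$5.00 (approx.)"), text) is not None

print(loose, strict, auto)  # True True True
```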
Qualifiers
Qualifiers specify how many times a given component of a regular expression must occur for a match to succeed. There are six regular expression qualifiers: *, +, ?, {n}, {n,}, and {n,m}.
Character  Description
*  Matches the preceding subexpression zero or more times. For example, zo* matches "z" and "zoo". * is equivalent to {0,}.
+  Matches the preceding subexpression one or more times. For example, zo+ matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?  Matches the preceding subexpression zero or one time. For example, do(es)? matches "do" or "does". ? is equivalent to {0,1}.
{n}  n is a non-negative integer. Matches exactly n times.
{n,}  n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but it matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}  m and n are both non-negative integers, where n <= m. Matches at least n times and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the two numbers.
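The qualifier examples in the table can be checked with a short Python sketch. The helper name `full` is hypothetical and exists only for this illustration.

```python
import re

def full(pattern, s):
    """Return True when the whole string s matches pattern (hypothetical helper)."""
    return re.fullmatch(pattern, s) is not None

# * : zero or more
star = [full(r"zo*", s) for s in ("z", "zo", "zoo")]
# + : one or more
plus = [full(r"zo+", s) for s in ("z", "zo", "zoo")]
# ? : zero or one
opt = [full(r"do(es)?", s) for s in ("do", "does", "doeses")]
# {n,} : at least n times
at_least = re.findall(r"o{2,}", "Bob foooood")
# {n,m} : between n and m times
bounded = re.search(r"o{1,3}", "fooooood").group()

print(star)      # [True, True, True]
print(plus)      # [False, True, True]
print(opt)       # [True, True, False]
print(at_least)  # ['ooooo']
print(bounded)   # ooo
```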
Since the number of chapters will likely exceed nine in a large input document, you need a way to handle two or three-digit chapter numbers. Qualifiers give you this ability. The following regular expression matches chapter titles with any number of digits:
/Chapter [1-9][0-9]*/
Note that the qualifier appears after the range expression, so it applies to the entire range expression, which in this case specifies only the digits 0 through 9 (inclusive).
The + qualifier is not used here because a digit is not necessarily required in the second or subsequent positions. The ? character is not used either, because it would limit chapter numbers to two digits. You need to match at least one digit after Chapter and a space character.
If you know that chapter numbers are limited to only 99 chapters, you can use the following expression to specify at least one but at most two digits.
/Chapter [0-9]{1,2}/
The disadvantage of the above expression is that for chapter numbers greater than 99 it still matches only the first two digits. Another drawback is that Chapter 0 also matches. A better expression that matches only one- or two-digit chapter numbers would be:
/Chapter [1-9][0-9]?/
or
/Chapter [1-9][0-9]{0,1}/
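The behaviour of the three chapter expressions can be compared side by side. The following Python sketch applies each one to a few made-up titles; the list of titles and the helper `matched` are assumptions for the illustration.

```python
import re

titles = ["Chapter 0", "Chapter 7", "Chapter 42", "Chapter 123"]  # hypothetical input

any_digits = re.compile(r"Chapter [1-9][0-9]*")
one_or_two = re.compile(r"Chapter [0-9]{1,2}")
strict_two = re.compile(r"Chapter [1-9][0-9]?")

def matched(pattern):
    """Return the matched text for each title, or None when nothing matches."""
    out = []
    for title in titles:
        m = pattern.search(title)
        out.append(m.group() if m else None)
    return out

print(matched(any_digits))  # [None, 'Chapter 7', 'Chapter 42', 'Chapter 123']
print(matched(one_or_two))  # ['Chapter 0', 'Chapter 7', 'Chapter 42', 'Chapter 12']
print(matched(strict_two))  # [None, 'Chapter 7', 'Chapter 42', 'Chapter 12']
```

Note how the two-digit expressions still find a match in "Chapter 123" but stop after the first two digits.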
The *, +, and ? qualifiers are all greedy: they match as much text as possible. Non-greedy, or minimal, matching can be achieved by adding a ? after them.
For example, you might search an HTML document for section titles enclosed in H1 tags. The text looks like this in your document:
<H1>Chapter 1 – Introduction to Regular Expressions</H1>
The following expression matches everything from the opening less-than sign (<) to the greater-than sign (>) that closes the H1 tag.
/<.*>/
If you only need to match the opening H1 tag, the following "non-greedy" expression matches only <H1>.
/<.*?>/
Placing a ? after a *, +, or ? qualifier converts the expression from a "greedy" expression to a "non-greedy", or minimal-match, expression.
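A quick way to see the difference between greedy and non-greedy matching is to run both expressions against the same input. This Python sketch assumes a sample H1 line similar to the one described in the text.

```python
import re

# Hypothetical sample line with an H1 heading.
html = "<H1>Chapter 1 - Introduction to Regular Expressions</H1>"

# Greedy: .* grabs as much as possible, so the match runs to the last ">".
greedy = re.search(r"<.*>", html).group()

# Non-greedy: .*? grabs as little as possible, so the match stops at the first ">".
lazy = re.search(r"<.*?>", html).group()

print(greedy)  # <H1>Chapter 1 - Introduction to Regular Expressions</H1>
print(lazy)    # <H1>
```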
Locators
Locators let you pin a regular expression to the beginning or end of a line. They also let you create regular expressions that match within a word, at the beginning of a word, or at the end of a word.
Locators describe the boundary of a string or a word. ^ and $ refer to the beginning and end of the string respectively, \b describes the leading or trailing boundary of a word, and \B represents a non-word boundary.
The locators of regular expressions are:
Character  Description
^  Matches the start position of the input string. If the RegExp object's Multiline property is set, ^ also matches the position after \n or \r.
$  Matches the end position of the input string.
\b  Matches a word boundary.
\B  Matches a non-word boundary.
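The following expression matches a chapter title only at the beginning of a line:
/^Chapter [1-9][0-9]{0,1}/
Adding $ also anchors the match to the end of the line, so the title must stand alone on its line:
/^Chapter [1-9][0-9]{0,1}$/
The following expression matches the first three characters of the word Chapter because they follow a word boundary:
/\bCha/
The following expression matches the last three characters of the word Chapter because they precede a word boundary:
/ter\b/
The following expression matches apt in Chapter but not in aptitude:
/\Bapt/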
The string apt occurs at non-word boundaries in the word Chapter, but at word boundaries in the word aptitude. For the \B non-word boundary operator, position does not matter because the match does not care whether it is the beginning or the end of a word.
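The locator expressions can be tried out as follows. This is a Python sketch; the multi-line sample text is an assumption, and `re.MULTILINE` plays the role of the Multiline property mentioned earlier.

```python
import re

# Hypothetical three-line input.
lines = "Chapter 1\nChapter 2 continued\nSee Chapter 3"

# ^ with re.MULTILINE anchors the match at the start of every line.
starts = re.findall(r"^Chapter [1-9][0-9]{0,1}", lines, re.MULTILINE)
# Adding $ requires the title to stand alone on its line.
alone = re.findall(r"^Chapter [1-9][0-9]{0,1}$", lines, re.MULTILINE)

# \b is a word boundary; \B is a position that is not a word boundary.
cha = re.search(r"\bCha", "Chapter") is not None
ter = re.search(r"ter\b", "Chapter") is not None
apt_inside = re.search(r"\Bapt", "Chapter") is not None
apt_start = re.search(r"\Bapt", "aptitude") is not None

print(starts)                 # ['Chapter 1', 'Chapter 2']
print(alone)                  # ['Chapter 1']
print(cha, ter)               # True True
print(apt_inside, apt_start)  # True False
```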
Alternation
Enclose all alternatives in parentheses and separate adjacent alternatives with |. Parentheses have a side effect, however: the related match is captured (cached). To avoid this side effect, place ?: before the first alternative.
?: is one of the non-capturing elements; the other two are ?= and ?!. These two carry additional meaning. ?= is a positive lookahead: it matches at any position where the text that follows matches the pattern inside the parentheses. ?! is a negative lookahead: it matches at any position where the text that follows does not match that pattern.
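The sketch below shows alternation with and without capturing, followed by the two lookahead forms. It is a Python illustration; the patterns and sample strings ("cat|dog", "Windows2000") are invented for the example.

```python
import re

# Capturing group: the chosen alternative is stored as group 1.
capturing = re.search(r"(cat|dog) food", "dog food")
# Non-capturing group: same alternation, but nothing is stored.
non_capturing = re.search(r"(?:cat|dog) food", "dog food")

# Positive lookahead: "Windows" only when followed by 95, 98, or 2000.
# The text inside the lookahead is not part of the match.
pos = re.search(r"Windows(?=95|98|2000)", "Windows2000")
# Negative lookahead: "Windows" only when NOT followed by one of those versions.
neg = re.search(r"Windows(?!95|98|2000)", "Windows2000")

print(capturing.groups())      # ('dog',)
print(non_capturing.groups())  # ()
print(pos.group())             # Windows
print(neg)                     # None
```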
Backreferences
Adding parentheses around a regular expression pattern, or around part of a pattern, causes the associated match to be stored in a temporary buffer. Each captured submatch is stored in the order in which it appears in the pattern, from left to right. Buffer numbers start at 1, and up to 99 captured subexpressions can be stored. Each buffer can be accessed using '\n', where n is a one- or two-digit decimal number that identifies the specific buffer.
Capturing can be suppressed with the non-capturing metacharacters '?:', '?=', or '?!', so that the related match is not saved.
One of the simplest and most useful applications of backreferences is the ability to find matches of two identical adjacent words in text. Take the following sentence as an example:
Is is the cost of gasoline going up up?
The above sentence obviously contains several repeated words. It would be useful to have a way to locate such repetitions without searching for each word separately. The following regular expression uses a single subexpression to achieve this:
/\b([a-z]+) \1\b/gi
The captured expression, as specified by [a-z]+, consists of one or more letters. The second part of the regular expression is a reference to the previously captured submatch, that is, the second occurrence of the word just matched by the parenthesized expression. \1 specifies the first submatch. The word boundary metacharacters ensure that only whole words are detected. Otherwise, phrases such as "is issued" or "this is" would be incorrectly matched by this expression.
The global flag (g) after the regular expression indicates that the expression is applied to as many matches as can be found in the input string. The case-insensitive flag (i) at the end of the expression specifies case-insensitivity. The multiline flag specifies that potential matches may occur on either side of a newline character.
Backreferences can also be used to break down a Uniform Resource Identifier (URI) into its components. Suppose you want to break the following URI into protocol (ftp, http, etc.), domain address, and page/path:
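The repeated-word expression can be exercised with a short Python sketch. Python has no /g flag, so `finditer` and `sub` are assumed here as the equivalents of a global search; `re.IGNORECASE` stands in for the i flag.

```python
import re

sentence = "Is is the cost of gasoline going up up?"

# \1 repeats whatever group 1 captured; \b keeps the match to whole words.
dup = re.compile(r"\b([a-z]+) \1\b", re.IGNORECASE)

# Find every pair of identical adjacent words.
repeats = [m.group() for m in dup.finditer(sentence)]

# Collapse each pair to a single word by replacing it with group 1.
cleaned = dup.sub(r"\1", sentence)

# Partial-word repetitions are not matched thanks to the word boundaries.
false_positive = dup.search("this is issued")

print(repeats)         # ['Is is', 'up up']
print(cleaned)         # Is the cost of gasoline going up?
print(false_positive)  # None
```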
http://www.w3cschool.cc:80/html/html-tutorial.html
The following regular expression provides this functionality:
/(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/
The first parenthesized subexpression captures the protocol portion of the web address. This subexpression matches any word that is followed by a colon and two forward slashes. The second parenthesized subexpression captures the domain address portion of the address. It matches one or more characters other than / and :. The third parenthesized subexpression captures the port number (if one is specified). It matches zero or more digits following a colon, and it can occur at most once. Finally, the fourth parenthesized subexpression captures the path and/or page information specified by the web address. It matches any sequence of characters that does not include the # or space character.
Applying the regular expression to the URI above, each submatch contains the following:
1) The first parenthesized subexpression contains "http"
2) The second parenthesized subexpression contains "www.w3cschool.cc"
3) The third parenthesized subexpression contains ":80"
4) The fourth parenthesized subexpression contains "/html/html-tutorial.html"
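The decomposition above can be reproduced with the following Python sketch. The pattern is the one from the text, written without the escaped slashes that the /.../ literal syntax requires; the variable names are chosen for the illustration.

```python
import re

uri = "http://www.w3cschool.cc:80/html/html-tutorial.html"

# Same four groups as in the text: protocol, domain, optional port, path.
pattern = re.compile(r"(\w+)://([^/:]+)(:\d*)?([^# ]*)")
protocol, domain, port, path = pattern.match(uri).groups()

print(protocol)  # http
print(domain)    # www.w3cschool.cc
print(port)      # :80
print(path)      # /html/html-tutorial.html
```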
Atoms
An atom is the smallest unit of a regular expression. Simply put, an atom is the content to be matched. A valid regular expression must contain at least one atom.
Explanation: the spaces, carriage returns, line feeds, 0-9, A-Za-z, Chinese characters, punctuation marks, and special symbols we see are all atoms.
Before working through the atom examples, let's first explain a function, preg_match:
int preg_match (string $regular, string $string[, array &$result])
Function: matches the $string variable against the pattern in the $regular variable. If a match is found, the function returns the number of matches (1, because preg_match stops after the first match) and puts the matched results into the $result variable. If no match is found, it returns 0.
Start and end
^ represents the beginning; $ represents the end
The following code matches any string that starts with date:
$str = 'date20150121';
// ^date anchors the match to the start of the string.
if (preg_match('/^date/', $str)) {
    echo 'Match successful';
} else {
    echo 'Match failed';
}
\w matches a letter, digit, or underscore character;
\d matches a digit (\D matches a non-digit). For example:
$str = 'date20150121';
// ^\w matches one word character at the start of the string.
if (preg_match('/^\w/', $str, $matches)) {
    print_r($matches);
} else {
    echo 'Match failed';
}
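For readers who want to try the same checks outside PHP, here is a rough Python counterpart. The helper `preg_match_like` is hypothetical: it only imitates the return shape described above (a count plus the matched results) and is not part of any library.

```python
import re

def preg_match_like(pattern, subject):
    """Hypothetical helper: return (1, [matches]) on a match, (0, []) otherwise."""
    m = re.search(pattern, subject)
    if m is None:
        return 0, []
    # Element 0 is the whole match, followed by any captured groups.
    return 1, [m.group(0), *m.groups()]

s = "date20150121"
print(preg_match_like(r"^date", s))  # (1, ['date'])
print(preg_match_like(r"^\w", s))    # (1, ['d'])
print(preg_match_like(r"^\d", s))    # (0, [])
```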
Specially identified atoms
Atom  Description
\d  Matches one character from 0-9
Example:
\d matches one character from 0-9
\D matches one character that is not 0-9
The match succeeds, because the matched character is not between 0 and 9.
\w matches one character from a-zA-Z0-9_
The match succeeds, and the underscore is matched.
\W matches one character that is not in a-zA-Z0-9_
The match fails, because every character in the string is in a-zA-Z0-9_ and nothing falls outside that set.
\s matches any whitespace character: \n, \t, \r, or a space
The match succeeds, because the string contains a carriage return.
\S matches any non-whitespace character
The match succeeds. Although the string contains spaces, carriage returns, and indentation, it also contains the non-whitespace character a.
[] matches one character from the specified range
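The behaviour of these shorthand atoms can be checked with the Python sketch below. The sample strings are made up for the illustration, and `re.ASCII` is passed so that \d and \w mean exactly 0-9 and a-zA-Z0-9_, as described above (by default Python also accepts other Unicode digits and letters).

```python
import re

# Each pattern is paired with a hypothetical sample string.
samples = [
    (r"\d", "abc7"),          # a digit is present
    (r"\D", "2024"),          # only digits, so no non-digit exists
    (r"\w", "_"),             # the underscore counts as a word character
    (r"\W", "abc_123"),       # only word characters, so \W finds nothing
    (r"\s", "line1\nline2"),  # a newline is whitespace
    (r"\S", " \t\n a "),      # 'a' is the only non-whitespace character
]

found = {}
for pattern, text in samples:
    m = re.search(pattern, text, re.ASCII)
    found[pattern] = m.group() if m else None

for pattern, result in found.items():
    print(pattern, repr(result))
```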
Conclusion: In the above example, [0-5] fails to match $string but matches $string1, because the first character of $string is 6, which is outside the range [0-5].
Conclusion:
$string and $string1 both match, because \w is equivalent to [a-zA-Z0-9_].
Conclusion:
$string, $string1, and $string2 match, but $string3 does not, because $string3 starts with d, which is outside the set [abc].
[^characters] matches one character that is not in the specified set
Conclusion:
1) The match fails for $string but succeeds for $string1, because there is a circumflex (^) inside the square brackets.
2) A circumflex at the start of a square bracket expression negates it: the expression matches any character that is not listed inside the brackets.
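The conclusions about square bracket expressions can be illustrated as follows. This Python sketch uses invented sample strings and `re.match`, which only tests the start of the string; it is an assumption about how the examples behave, not a reconstruction of the original test code.

```python
import re

# [0-5]: the first character must be a digit from 0 to 5.
in_range = [re.match(r"[0-5]", s) is not None for s in ("6abc", "3abc")]

# [abc]: the first character must be a, b, or c.
in_set = [re.match(r"[abc]", s) is not None
          for s in ("apple", "banana", "cherry", "date")]

# [^0-5]: the first character must NOT be a digit from 0 to 5.
negated = [re.match(r"[^0-5]", s) is not None for s in ("3abc", "6abc")]

print(in_range)  # [False, True]
print(in_set)    # [True, True, True, False]
print(negated)   # [False, True]
```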