Regular expression - syntax
A regular expression describes a string-matching pattern. It can be used to check whether a string contains a certain substring, to replace the matching substring, or to extract from a string a substring that meets a certain condition.
When listing directories, the *.txt in dir *.txt or ls *.txt is not a regular expression. It is a file-name wildcard, and the meaning of * there differs from the meaning of * in regular expressions.
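The difference can be seen in a minimal sketch. It assumes Python and its standard fnmatch and re modules (the tutorial itself names no language), and the file names are made up for illustration.

```python
import fnmatch
import re

# Hypothetical file names, chosen only for illustration.
names = ["notes.txt", "readme.md", "atxt"]

# File-name wildcard: * stands for "any run of characters".
glob_hits = [n for n in names if fnmatch.fnmatch(n, "*.txt")]

# Regular expression: . means "any character" and * means "repeat the
# previous item", so the equivalent pattern is .*\.txt (literal dot escaped).
regex_hits = [n for n in names if re.fullmatch(r".*\.txt", n)]

print(glob_hits)   # ['notes.txt']
print(regex_hits)  # ['notes.txt']

# Used directly as a regular expression, "*.txt" is rejected by Python's re:
# the leading * has nothing to repeat.
try:
    re.compile("*.txt")
    glob_is_valid_regex = True
except re.error:
    glob_is_valid_regex = False

print(glob_is_valid_regex)  # False
```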
The method of constructing a regular expression is the same as that of creating a mathematical expression. That is, small expressions can be combined together to create larger expressions using a variety of metacharacters and operators. The components of a regular expression can be a single character, a collection of characters, a range of characters, a selection between characters, or any combination of all of these components.
Regular expressions are text patterns composed of ordinary characters (such as the characters a through z) and special characters (called "metacharacters"). A pattern describes one or more strings to match when searching for text. A regular expression acts as a template that matches a character pattern with a searched string.
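The three uses mentioned above (checking, replacing, extracting) might look like this in practice. This is only a sketch, assuming Python's re module. The sample text and the pattern [0-9]+ are illustrative and not taken from the tutorial.

```python
import re

# Illustrative input, not from the tutorial.
text = "Order 66 shipped, order 7 pending"

# Check whether the text contains a match for the pattern.
has_number = re.search(r"[0-9]+", text) is not None

# Replace every matching substring.
masked = re.sub(r"[0-9]+", "#", text)

# Extract all matching substrings.
numbers = re.findall(r"[0-9]+", text)

print(has_number)  # True
print(masked)      # Order # shipped, order # pending
print(numbers)     # ['66', '7']
```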
Normal characters
Normal characters include all printable and non-printable characters that are not explicitly specified as metacharacters. This includes all uppercase and lowercase letters, all digits, all punctuation, and some other symbols.
Non-printing characters
Non-printing characters can also be part of regular expressions. The following table lists the escape sequences that represent non-printing characters:
Special characters
So-called special characters are characters that carry a special meaning, such as the * in "*.txt" mentioned above, which stands for any string. If you want to find files whose names contain a literal *, you need to escape the *, that is, put a \ before it: ls \*.txt.
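Escaping works the same way inside a regular expression. The sketch below assumes Python's re module and uses made-up file names. It compares an escaped pattern with an unescaped one and shows re.escape, a standard-library helper that adds the backslashes for you.

```python
import re

# Hypothetical file names, chosen only for illustration.
names = ["a*.txt", "ab.txt", "a.txt"]

# \* matches a literal asterisk and \. matches a literal dot.
literal = re.compile(r"a\*\.txt")
literal_hits = [n for n in names if literal.fullmatch(n)]

# Unescaped, a* means "zero or more a" and . means "any one character".
unescaped = re.compile(r"a*.txt")
unescaped_hits = [n for n in names if unescaped.fullmatch(n)]

# re.escape builds the escaped pattern from a plain string.
escaped_pattern = re.escape("a*.txt")

print(literal_hits)     # ['a*.txt']
print(unescaped_hits)   # ['a.txt']
print(escaped_pattern)  # a\*\.txt
```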
Many metacharacters require special treatment when trying to match them. To match these special characters, you must first "escape" the characters, that is, precede them with a backslash character (\). The following table lists the special characters in regular expressions:
Qualifier
A qualifier (more commonly called a quantifier) specifies how many times a given component of a regular expression must appear to satisfy a match. There are six forms: *, +, ?, {n}, {n,}, and {n,m}.
Regular expression qualifiers are:
Since the chapter number will likely exceed nine in a large input document, you need a way to handle two- or three-digit chapter numbers. Qualifiers give you this ability. The following regular expression matches chapter titles numbered with any number of digits:
/Chapter [1-9][0-9]*/
Note that the qualifier appears after the range expression, so it applies to the entire range expression, which in this case specifies only the digits 0 through 9 (inclusive).
The + qualifier is not used here because a digit is not necessarily needed in the second or subsequent position. The ? character is not used either, because it would limit chapter numbers to two digits. You need to match at least one digit after Chapter and a space character.
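To see how this pattern behaves, here is a small sketch assuming Python's re module, where the slash delimiters are dropped and the pattern is written as a raw string. The sample lines are invented for illustration.

```python
import re

chapter = re.compile(r"Chapter [1-9][0-9]*")

# Invented sample lines.
samples = ["Chapter 7", "Chapter 12", "Chapter 345", "Chapter 0", "Chapter"]

results = {}
for line in samples:
    m = chapter.search(line)
    # Store the matched text, or None when the pattern does not match.
    results[line] = m.group(0) if m else None
    print(repr(line), "->", results[line])
```

One, two and three digit numbers all match in full, while "Chapter 0" and a bare "Chapter" do not match, because [1-9] requires one nonzero digit.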
If you know that chapter numbers are limited to only 99 chapters, you can use the following expression to specify at least one but at most two digits.
/Chapter [0-9]{1,2}/
The disadvantage of the above expression is that a chapter number greater than 99 still matches, but only in its first two digits. Another drawback is that Chapter 0 also matches. A better expression for matching at most two digits is:
/Chapter [1-9][0-9]?/
or
/Chapter [1-9][0-9]{0,1}/
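The two drawbacks and the improvement can be checked side by side. This is a sketch assuming Python's re module. The names loose, strict and matched are hypothetical helpers introduced for the comparison.

```python
import re

loose = re.compile(r"Chapter [0-9]{1,2}")    # one or two digits, 0 allowed
strict = re.compile(r"Chapter [1-9][0-9]?")  # first digit must be 1-9

def matched(pattern, text):
    """Return the matched text, or None if there is no match."""
    m = pattern.search(text)
    return m.group(0) if m else None

comparison = {}
for text in ["Chapter 0", "Chapter 42", "Chapter 123"]:
    comparison[text] = (matched(loose, text), matched(strict, text))
    print(text, "->", comparison[text])
```

Both patterns still match only "Chapter 12" inside "Chapter 123". The stricter pattern only removes the match on "Chapter 0".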
The *, + and ? qualifiers are greedy: they match as much text as possible. Adding a ? after them makes the match non-greedy, or minimal.
For example, you might search an HTML document for section titles enclosed in H1 tags. The text looks like this in your document:
<H1>Chapter 1 – Introduction to Regular Expressions</H1>
The following expression matches everything from the opening less-than sign (<) to the greater-than sign (>) that closes the H1 tag.
/<.*>/
If you only need to match the opening H1 tag, the following "non-greedy" expression only matches <H1>.
/<.*?>/
Placing ? after a *, + or ? qualifier converts the expression from a "greedy" expression to a "non-greedy" expression, or minimal match.
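The greedy and non-greedy forms can be compared on the H1 example. This is a minimal sketch assuming Python's re module. The variable names are illustrative.

```python
import re

heading = "<H1>Chapter 1 – Introduction to Regular Expressions</H1>"

# Greedy: .* runs to the last > on the line, so the whole heading matches.
greedy = re.search(r"<.*>", heading).group(0)

# Non-greedy: .*? stops at the first >, so only the opening tag matches.
lazy = re.search(r"<.*?>", heading).group(0)

# Scanning the whole string with the non-greedy form finds each tag separately.
tags = re.findall(r"<.*?>", heading)

print(greedy == heading)  # True
print(lazy)               # <H1>
print(tags)               # ['<H1>', '</H1>']
```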
Locator
Locators (also called anchors) enable you to pin a regular expression to the beginning or end of a line. They also enable you to create regular expressions that match within a word, at the beginning of a word, or at the end of a word.
Locators describe the boundary of a string or a word. ^ and $ refer to the beginning and the end of the string respectively, \b describes the leading or trailing boundary of a word, and \B represents a non-word boundary.
Regular expression locators are:
Note: Qualifiers cannot be used with anchor points. Since there cannot be more than one position immediately before or after a newline or word boundary, expressions such as ^* are not allowed.
To match text at the beginning of a line of text, use the ^ character at the beginning of the regular expression. Do not confuse this use of ^ with the use inside bracket expressions.
To match text at the end of a line of text, use the $ character at the end of the regular expression.
To use anchor points when searching for chapter titles, the following regular expression matches a chapter title that has at most two trailing digits and appears at the beginning of the line:
/^Chapter [1-9][0-9]{0,1}/
A real chapter title not only appears at the beginning of the line, it is also the only text on the line, so it appears at both the beginning and the end of the same line. The following expression ensures that a match is a chapter title and not a cross-reference, by matching only when the pattern spans from the beginning to the end of a line of text.
/^Chapter [1-9][0-9]{0,1}$/
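The effect of anchoring one end versus both ends can be sketched as follows, assuming Python's re module. In Python, ^ and $ match at line boundaries only when re.MULTILINE is set, so that flag is an assumption added here. The sample text is invented.

```python
import re

# Invented sample: two real titles, one cross-reference, one longer heading.
text = "Chapter 12\nSee Chapter 3 for details\nChapter 7 overview\nChapter 99"

# Anchored at the start of a line only.
start_only = re.compile(r"^Chapter [1-9][0-9]{0,1}", re.MULTILINE)

# Anchored at both ends: the title must be the only text on its line.
whole_line = re.compile(r"^Chapter [1-9][0-9]{0,1}$", re.MULTILINE)

start_hits = start_only.findall(text)
line_hits = whole_line.findall(text)

print(start_hits)  # ['Chapter 12', 'Chapter 7', 'Chapter 99']
print(line_hits)   # ['Chapter 12', 'Chapter 99']
```

The cross-reference on the second line is rejected by both patterns, and "Chapter 7 overview" is rejected only once $ is added.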
Matching word boundaries works slightly differently, but it adds an important capability to regular expressions. A word boundary is the position between a word and a space. A non-word boundary is any other position. The following expression matches the first three characters of the word Chapter because these three characters appear after a word boundary:
/\bCha/
The position of the \b operator is critical. If it is at the beginning of the string to be matched, it looks for the match at the beginning of a word. If it is at the end of the string, it looks for the match at the end of a word. For example, the following expression matches the string ter in the word Chapter because it appears before a word boundary:
/ter\b/
The following expression matches the string apt in Chapter, but does not match the string apt in aptitude:
/\Bapt/
The string apt occurs at a non-word boundary in the word Chapter, but occurs at a word boundary in the word aptitude. For the \B non-word boundary operator, position does not matter because the match does not care whether it is the beginning or the end of a word.
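The three boundary patterns above can be tried directly. This sketch assumes Python's re module. The patterns are written as raw strings because in an ordinary Python string literal \b means a backspace character.

```python
import re

# \b before Cha: matches at the start of the word.
starts = re.findall(r"\bCha", "Chapter")

# \b after ter: matches at the end of the word.
ends = re.findall(r"ter\b", "Chapter")

# \B before apt: matches only where apt is NOT at a word boundary.
in_chapter = re.search(r"\Bapt", "Chapter") is not None
in_aptitude = re.search(r"\Bapt", "aptitude") is not None

print(starts)       # ['Cha']
print(ends)         # ['ter']
print(in_chapter)   # True
print(in_aptitude)  # False
```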
Selection
Enclose all the alternatives in parentheses and separate adjacent alternatives with |. Using parentheses has a side effect, however: the related match is captured (stored). You can put ?: before the first alternative to eliminate this side effect.
?: is one of the non-capturing elements; the other two are ?= and ?!. These two carry additional meaning. The former is a positive lookahead: it matches the search string at any position where the pattern inside the parentheses begins to match. The latter is a negative lookahead: it matches the search string at any position where the pattern inside the parentheses does not match.
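The sketch below illustrates selection, the non-capturing form and the two lookaheads. It assumes Python's re module. The sample strings and the Chapter|Section and Windows patterns are examples chosen for illustration, not taken from this tutorial.

```python
import re

text = "See Section 4"

# Capturing group: the chosen alternative is stored as group 1.
capturing = re.search(r"(Chapter|Section) [1-9]", text)

# Non-capturing group: same overall match, but nothing is stored.
non_capturing = re.search(r"(?:Chapter|Section) [1-9]", text)

print(capturing.group(0), capturing.groups())          # Section 4 ('Section',)
print(non_capturing.group(0), non_capturing.groups())  # Section 4 ()

versions = "Windows 10, Windows 7"

# Positive lookahead: "Windows" only where " 10" follows (not consumed).
followed_by_10 = [m.start() for m in re.finditer(r"Windows(?= 10)", versions)]

# Negative lookahead: "Windows" only where " 10" does NOT follow.
not_followed_by_10 = [m.start() for m in re.finditer(r"Windows(?! 10)", versions)]

print(followed_by_10)      # [0]
print(not_followed_by_10)  # [12]
```

The lookahead results are reported as start offsets because the matched text is "Windows" in both cases. The lookahead text itself is never part of the match.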
Backreference
Adding parentheses around a regular expression pattern or part of a pattern causes the associated match to be stored in a temporary buffer. Each captured submatch is stored in the order in which the subexpressions appear in the pattern, from left to right. Buffer numbers start at 1, and up to 99 captured subexpressions can be stored. Each buffer can be accessed using '\n', where n is a one- or two-digit decimal number that identifies the specific buffer.
Capturing can be suppressed with the non-capturing metacharacters '?:', '?=' or '?!', so that the related match is not saved.
One of the simplest and most useful applications of backreferences is the ability to find matches of two identical adjacent words in text. Take the following sentence as an example:
Is is the cost of gasoline going up up?
The above sentence obviously contains several repeated words. It would be nice to have a way to locate such sentences without having to search for each repeated word individually. The following regular expression uses a single subexpression to achieve this:
/\b([a-z]+) \1\b/gi
The captured expression, as specified by [a-z]+, consists of one or more letters. The second part of the regular expression is a reference to the previously captured submatch, that is, the second occurrence of the word, which must match exactly what the parenthesized expression matched. \1 specifies the first submatch. The word boundary metacharacters ensure that only whole words are detected. Otherwise, phrases such as "is issued" or "this is" would be incorrectly identified by this expression.
The global flag (g) after the regular expression indicates that the expression is applied to as many matches as can be found in the input string. The case-insensitive flag (i) at the end of the expression specifies case-insensitivity. A multiline flag (m), which is not used here, makes ^ and $ match at the start and end of each line rather than only of the whole string.
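Here is a sketch of the doubled-word search, assuming Python's re module. Python has no g flag: finditer and sub already scan the whole string, and the i flag becomes re.IGNORECASE. The name doubled is illustrative.

```python
import re

sentence = "Is is the cost of gasoline going up up?"

# \1 refers back to whatever the first parenthesized group matched.
doubled = re.compile(r"\b([a-z]+) \1\b", re.IGNORECASE)

# Every doubled word in the sentence.
pairs = [m.group(0) for m in doubled.finditer(sentence)]

# The word boundaries keep partial-word repeats from matching.
false_positive = doubled.search("this is issued")

# Collapse each doubled word to a single occurrence.
fixed = doubled.sub(r"\1", sentence)

print(pairs)           # ['Is is', 'up up']
print(false_positive)  # None
print(fixed)           # Is the cost of gasoline going up?
```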
Backreferences can also break down a Uniform Resource Identifier (URI) into its components. Suppose you want to break the following URI into protocol (ftp, http, etc.), domain address, and page/path:
http://m.sbmmt.com:80/html/html-tutorial.html
The following regular expression provides this functionality:
/(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/
The first parenthesized subexpression captures the protocol portion of the web address. It matches any word that precedes a colon and two forward slashes. The second parenthesized subexpression captures the domain address portion. It matches one or more characters other than / and :. The third parenthesized subexpression captures the port number, if one is specified. It matches a colon followed by zero or more digits, and it can occur at most once. Finally, the fourth parenthesized subexpression captures the path and/or page information specified by the web address. It matches any sequence of characters that does not include the # or space character.
Applying the regular expression to the URI above, each submatch contains the following:
The first parenthesized subexpression contains "http"
The second parenthesized subexpression contains "m.sbmmt.com"
The third parenthesized subexpression contains ":80"
The fourth parenthesized subexpression contains "/html/html-tutorial.html"
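The decomposition can be reproduced with a short sketch. It assumes Python's re module and assumes the URI begins with the http scheme reported for the first submatch. In Python the forward slashes need no escaping because the pattern is not slash-delimited.

```python
import re

# Assumed full form of the example URI, with the http scheme.
uri = "http://m.sbmmt.com:80/html/html-tutorial.html"

# Same four groups as the expression above: protocol, domain, port, path.
uri_pattern = re.compile(r"(\w+)://([^/:]+)(:\d*)?([^# ]*)")

parts = uri_pattern.match(uri).groups()
protocol, domain, port, path = parts

print(protocol)  # http
print(domain)    # m.sbmmt.com
print(port)      # :80
print(path)      # /html/html-tutorial.html
```

For real URL handling in Python, the standard urllib.parse.urlsplit function is usually a safer choice than a hand-written pattern. The regular expression is used here only to illustrate captured submatches.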
For more on regular expressions, see //m.sbmmt.com/regexp/regexp-tutorial.html
For more regular expression examples, see http://www.cnblogs.com/diony/archive/2010/12/16/1908499.html