A detailed explanation of regular expression metacharacters

php中世界最好的语言

Release： 2018-03-30 09:49:43

This article explains how to use regular expression metacharacters and what to watch out for when using them, with practical examples along the way.

Note: In all examples, the text matched by the regular expression is shown between 【 and 】 in the source text. Some examples are implemented in Java. Where a usage is specific to Java's own regular expression support, this is pointed out in the relevant place. All Java examples were tested under JDK 1.6.0_13.

1. Escape special characters

Metacharacters are characters that have a special meaning in regular expressions. Because of that special meaning, they cannot be used to represent themselves. You can escape a metacharacter by preceding it with a backslash, so that the resulting escape sequence matches the character itself rather than its special meaning. For example, to match [ and ], you must escape them as \[ and \].

Escaping a metacharacter requires the backslash character \, which means that \ is itself a metacharacter. To match a literal \, it must be escaped as \\, for example when matching a Windows file path.
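
The escaping rules above can be sketched in Java as follows. This is a minimal illustration, not part of the original examples: the class name EscapeDemo, its helper methods, and the sample strings are all hypothetical. Note that a Java string literal needs its own layer of escaping, so the regex \[ is written "\\[" and the regex \\ is written "\\\\".

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Returns true if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Regex \\ matches one literal backslash; the replacement is a plain slash.
    static String toForwardSlashes(String windowsPath) {
        return windowsPath.replaceAll("\\\\", "/");
    }

    public static void main(String[] args) {
        // Escaped brackets match the literal text [0].
        System.out.println(found("\\[0\\]", "var x = myArray[0];")); // true
        // Unescaped, [0] is a character set that matches a lone 0.
        System.out.println(found("[0]", "x0"));                      // true
        System.out.println(found("\\[0\\]", "x0"));                  // false
        // The Java literal below is the path C:\docs\a.txt.
        System.out.println(toForwardSlashes("C:\\docs\\a.txt"));     // C:/docs/a.txt
        // Pattern.quote escapes a whole literal string in one step.
        System.out.println(found(Pattern.quote("[0]"), "myArray[0]")); // true
    }
}
```

Pattern.quote is convenient when the text to match comes from user input and may contain metacharacters.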

2. Match white space characters

Metacharacters fall roughly into two groups: those used to match text (such as .) and those required by regular expression syntax (such as [ and ]).

When searching with regular expressions, we often need to match non-printing whitespace characters in the source text. For example, we may need to find all tab characters or all line breaks. Such characters are hard to type directly into a regular expression, so we use the special metacharacters listed below instead:

[\b]	Backspace (go back and delete one character). It is written inside a set because a bare \b is a word boundary in most regular expression engines
\f	Form feed character
\n	Line feed character
\r	Carriage return character
\t	Tab character (Tab key)
\v	Vertical Tab

Let's look at an example that removes blank lines from a file:

Text:

8 5 4 1 6 3 2 7 9
7 6 2 9 5 8 3 4 1
9 3 1 4 2 7 8 5 6

6 9 3 8 7 5 1 2 4
5 1 8 3 4 2 6 9 7
2 4 7 6 1 9 5 3 8

3 2 6 7 8 4 9 1 5
4 8 9 5 3 1 7 6 2
1 7 5 2 9 6 4 8 3

Regular expression: \r\n\r\n

Analysis: \r\n matches a carriage return + line feed pair, which Windows uses to mark the end of a line of text. Searching for \r\n\r\n matches two consecutive end-of-line markers, which is exactly where a blank line sits.

Note: Unix and Linux use only a line feed character to end a line of text. In other words, to match blank lines on Unix or Linux, \n\n is enough and no \r is needed. A regular expression that works on both Windows and Unix/Linux should contain an optional \r and a mandatory \n, that is, \r?\n\r?\n, which will be discussed in a later article.

The Java code is as follows:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

// The wrapper class name is illustrative.
public class BlankLineExample {
  public static void matchBlankLine() throws Exception {
    BufferedReader br = new BufferedReader(new FileReader(new File("E:/九宫格.txt")));
    StringBuilder sb = new StringBuilder();
    try {
      char[] cbuf = new char[1024];
      int len;
      // Read once per iteration and append only the characters actually read.
      while ((len = br.read(cbuf)) > 0) {
        sb.append(cbuf, 0, len);
      }
    } finally {
      br.close();
    }
    // In a Java string literal, \r and \n are already the CR and LF characters.
    String reg = "\r\n\r\n";
    System.out.println("Original content:\n" + sb.toString());
    System.out.println("After processing: -----------------------------");
    // Replace each pair of consecutive line endings with a single one.
    System.out.println(sb.toString().replaceAll(reg, "\r\n"));
  }
}

The running result is as follows:

Original content:
8 5 4 1 6 3 2 7 9
7 6 2 9 5 8 3 4 1
9 3 1 4 2 7 8 5 6

6 9 3 8 7 5 1 2 4
5 1 8 3 4 2 6 9 7
2 4 7 6 1 9 5 3 8

3 2 6 7 8 4 9 1 5
4 8 9 5 3 1 7 6 2
1 7 5 2 9 6 4 8 3

After processing: -----------------------------
8 5 4 1 6 3 2 7 9
7 6 2 9 5 8 3 4 1
9 3 1 4 2 7 8 5 6
6 9 3 8 7 5 1 2 4
5 1 8 3 4 2 6 9 7
2 4 7 6 1 9 5 3 8
3 2 6 7 8 4 9 1 5
4 8 9 5 3 1 7 6 2
1 7 5 2 9 6 4 8 3
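
The cross-platform pattern \r?\n\r?\n mentioned in the note can be tried out with a small sketch like the one below. The class and method names are hypothetical. The pattern is wrapped in one capturing group, (\r?\n)\r?\n, which is an addition made here so that the first line ending can be written back unchanged. The sketch assumes blank lines occur one at a time, as in the sample text. A run of several blank lines would need a repetition quantifier.

```java
class BlankLineDemo {
    // Two consecutive line endings, each with an optional carriage return.
    // "$1" in the replacement puts back the first line ending as it was.
    static String removeBlankLines(String text) {
        return text.replaceAll("(\\r?\\n)\\r?\\n", "$1");
    }

    public static void main(String[] args) {
        String windowsText = "8 5 4\r\n\r\n6 9 3";
        String unixText = "8 5 4\n\n6 9 3";
        // Both inputs end up as two adjacent lines with no blank line between.
        System.out.println(removeBlankLines(windowsText));
        System.out.println(removeBlankLines(unixText));
    }
}
```

Here the escapes are passed to the regex engine as \r and \n (hence the doubled backslashes), which gives the same result as embedding the raw characters in the string literal.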

3. Match specific character categories

Character sets (which match one of several characters) are the most common form of matching, and some frequently used sets can be replaced by special metacharacters. These metacharacters match a whole class of characters and are called class metacharacters. They are not strictly necessary, because you can always match a class of characters by listing them one by one or by defining a character range, but regular expressions built with them are concise and easy to read, and they are widely used in practice.

1. Match digits and non-digits

\d Any digit, equivalent to [0-9] or [0123456789]
\D Any non-digit, equivalent to [^0-9] or [^0123456789]
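
As a brief sketch of \d and \D in Java (the class name, method names, and sample inputs are made up for illustration):

```java
class DigitDemo {
    // Deletes every non-digit character, leaving only the digits.
    static String keepDigits(String s) {
        return s.replaceAll("\\D", "");
    }

    // String.matches requires the whole string to match the pattern.
    static boolean isFourDigits(String s) {
        return s.matches("\\d\\d\\d\\d");
    }

    public static void main(String[] args) {
        System.out.println(keepDigits("(555) 010-9999")); // 5550109999
        System.out.println(isFourDigits("2018"));         // true
        System.out.println(isFourDigits("201a"));         // false
    }
}
```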

2. Match alphanumeric and non-alphanumeric characters

Letters (A-Z, in either case), digits, and the underscore form a commonly used character set. The following metacharacters can be used:

\w Any letter (upper or lower case), digit, or underscore, equivalent to [0-9a-zA-Z_]
\W Any character that is not a letter, digit, or underscore, equivalent to [^0-9a-zA-Z_]
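
A short Java sketch of \w and \W, using hypothetical names and inputs. One use of \W is to replace everything outside the "word" set, for example to build a safe file name:

```java
class WordCharDemo {
    // Replaces each character that is not a letter, digit, or underscore.
    static String sanitize(String s) {
        return s.replaceAll("\\W", "_");
    }

    // Removes letters, digits, and underscores, showing what \w covers.
    static String stripWordChars(String s) {
        return s.replaceAll("\\w", "");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("report 2018-03.txt")); // report_2018_03_txt
        // The underscore counts as a word character; the hyphen does not.
        System.out.println(stripWordChars("a_b-c"));        // -
    }
}
```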

3. Match whitespace and non-whitespace characters

\s Any whitespace character, equivalent to [ \f\n\r\t\v]
\S Any non-whitespace character, equivalent to [^ \f\n\r\t\v]

Note: The backspace metacharacter [\b] is not covered by \s.
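
The following Java sketch (hypothetical names and inputs) shows \s and \S at work, and checks that a backspace character is left alone by \s. In the Java string literal "a\bb", the \b is the backspace character itself, not a regex escape.

```java
class WhitespaceDemo {
    // Removes spaces, tabs, carriage returns, line feeds, and similar.
    static String stripWhitespace(String s) {
        return s.replaceAll("\\s", "");
    }

    // Replaces every non-whitespace character with x, keeping the layout.
    static String maskNonWhitespace(String s) {
        return s.replaceAll("\\S", "x");
    }

    public static void main(String[] args) {
        System.out.println(stripWhitespace("a b\tc\r\n")); // abc
        System.out.println(maskNonWhitespace("a b"));      // x x
        // The backspace is not whitespace, so the string is unchanged.
        System.out.println(stripWhitespace("a\bb").equals("a\bb")); // true
    }
}
```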

4. Match hexadecimal or octal values

Hexadecimal: written with the prefix \x. For example, \x0A corresponds to ASCII character 10 (line feed) and is equivalent to \n.
Octal: written with the prefix \0, followed by a value of two or three digits. For example, \011 corresponds to ASCII character 9 (tab) and is equivalent to \t.
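
A minimal Java check of the hexadecimal and octal forms might look like this. The class name and helper are hypothetical. Character 65 (the letter A) is added here as an extra illustration next to the two values named in the text.

```java
class NumericEscapeDemo {
    // True when the whole input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        // \x0A is hexadecimal 0A = decimal 10, the line feed.
        System.out.println(matchesWhole("\\x0A", "\n")); // true
        // \011 is octal 11 = decimal 9, the tab.
        System.out.println(matchesWhole("\\011", "\t")); // true
        // Hexadecimal 41 and octal 101 both denote decimal 65, the letter A.
        System.out.println(matchesWhole("\\x41", "A"));  // true
        System.out.println(matchesWhole("\\0101", "A")); // true
    }
}
```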

4. Use POSIX character classes

POSIX character classes are a shorthand supported by many regular expression implementations. Java supports them too (with its own syntax), but JavaScript does not. The POSIX character classes are as follows:

[:alnum:]	Any letter or digit, equivalent to [a-zA-Z0-9]
[:alpha:]	Any letter, equivalent to [a-zA-Z]
[:blank:]	Space or tab character, equivalent to [ \t]
[:cntrl:]	ASCII control characters (ASCII 0 to 31, plus ASCII 127)
[:digit:]	Any digit, equivalent to [0-9]
[:graph:]	Any printable character, not including the space
[:lower:]	Any lowercase letter, equivalent to [a-z]
[:print:]	Any printable character
[:punct:]	Any character that belongs to neither [:alnum:] nor [:cntrl:]
[:space:]	Any whitespace character, including the space, equivalent to [ \f\n\r\t\v]
[:upper:]	Any uppercase letter, equivalent to [A-Z]
[:xdigit:]	Any hexadecimal digit, equivalent to [a-fA-F0-9]

POSIX character classes are written differently from the metacharacters we have seen so far. Let's look at an example that uses a regular expression to match a color value in a web page:

Text: background-color:#3636FF;height:30px;width:60px;">Test

Regular expression: #[[:xdigit:]][[:xdigit:]][[:xdigit:]][[:xdigit:]][[:xdigit:]][[:xdigit:]]

Result: background-color:【#3636FF】;height:30px;width:60px;">Test

Note: The pattern used here begins with [[ and ends with ]], which is required when using POSIX character classes. A POSIX class name must be enclosed between [: and :]. The outer [ and ] define a set, while the inner [ and ] are part of the POSIX character class itself.

Java writes POSIX character classes differently. Instead of being enclosed between [: and :], a class starts with \p and is enclosed between { and }, and the capitalization differs. Java also adds \p{ASCII}. The Java classes are as follows:

\p{Alnum}	Alphanumeric characters: [\p{Alpha}\p{Digit}]
\p{Alpha}	Alphabetic characters: [\p{Lower}\p{Upper}]
\p{ASCII}	All ASCII: [\x00-\x7F]
\p{Blank}	Space or tab character: [ \t]
\p{Cntrl}	Control characters: [\x00-\x1F\x7F]
\p{Digit}	Decimal digits: [0-9]
\p{Graph}	Visible characters: [\p{Alnum}\p{Punct}]
\p{Lower}	Lowercase alphabetic characters: [a-z]
\p{Print}	Printable characters: [\p{Graph}\x20]
\p{Punct}	Punctuation: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Space}	Whitespace characters: [ \t\n\x0B\f\r]
\p{Upper}	Uppercase alphabetic characters: [A-Z]
\p{XDigit}	Hexadecimal digits: [0-9a-fA-F]
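
The color-matching example can be written with Java's \p{XDigit} class roughly as follows. This is a sketch under the assumption that a color is a # followed by exactly six hexadecimal digits. The class name and method are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PosixColorDemo {
    // A '#' followed by six hexadecimal digits, written with \p{XDigit}.
    private static final Pattern COLOR = Pattern.compile(
        "#\\p{XDigit}\\p{XDigit}\\p{XDigit}\\p{XDigit}\\p{XDigit}\\p{XDigit}");

    // Returns the first color found in the text, or null if there is none.
    static String firstColor(String text) {
        Matcher m = COLOR.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String css = "background-color:#3636FF;height:30px;width:60px;";
        System.out.println(firstColor(css)); // #3636FF
    }
}
```

Compiling the pattern once into a static field avoids recompiling it on every call.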
