Detailed explanation of C# regular expression metacharacters

This article summarizes the metacharacters of C# regular expressions. A regular expression is an expression made up of characters, each of which expresses a rule. The characters fall into two types: ordinary characters and metacharacters. Ordinary characters keep their literal meaning and match text exactly, while metacharacters have a special meaning and stand for a class of characters.

Think of text as a stream of characters, each occupying a position. For example, in the regular expression "Room\d\d\d", the first four characters, Room, are ordinary characters. The following character \ is an escape character; together with the d after it, it forms the metacharacter \d, which means that any digit may appear at that position.

In regular expression terms: "Room\d\d\d" matches 7 characters in total and describes the class of strings that start with "Room" and end with three digits. Such a class of strings is called a pattern, also known as a regular pattern.
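As a minimal sketch of the "Room\d\d\d" pattern in use, the snippet below runs it against a made-up input string with Regex.Match from System.Text.RegularExpressions. The input text and the variable name roomMatch are assumptions chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// "Room" is matched literally; each \d matches one digit.
// The verbatim string (@"...") keeps the backslashes exactly as written.
Match roomMatch = Regex.Match("Meet me in Room101 at noon", @"Room\d\d\d");

Console.WriteLine(roomMatch.Success); // True
Console.WriteLine(roomMatch.Value);   // Room101
Console.WriteLine(roomMatch.Length);  // 7
```

The matched value is 7 characters long: four ordinary characters plus three digits.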

1. Escape characters

The escape character is \. It turns ordinary characters into metacharacters with special meanings, and turns metacharacters back into ordinary characters. Commonly used escape sequences are:

\t: Horizontal tab character

\v: Vertical tab character

\r: Carriage return

\n: Line feed

\\: Represents the character \, that is, it escapes the escape character \ into an ordinary \

\": Represents the character ". This is a C# string escape rather than a regex one: C# delimits strings with double quotes, so a double quote inside a string literal is written as \"
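Because both C# string literals and the regex engine use \ as an escape character, it helps to see the two layers side by side. The sketch below is illustrative only; the variable names and sample input are assumptions, while Regex.IsMatch and Regex.Escape are standard .NET methods.

```csharp
using System;
using System.Text.RegularExpressions;

// Regular string literal: the C# compiler consumes one level of backslashes,
// so the regex \d has to be written as "\\d".
string regularLiteral = "\\d+\\t\\d+";
// Verbatim string literal: backslashes reach the regex engine unchanged.
string verbatimLiteral = @"\d+\t\d+";

Console.WriteLine(regularLiteral == verbatimLiteral); // True

// \t in the pattern matches the tab character embedded in the input.
bool tabSeparated = Regex.IsMatch("12\t34", verbatimLiteral);
Console.WriteLine(tabSeparated); // True

// Regex.Escape turns metacharacters into literals, so "." becomes "\.".
string escapedDot = Regex.Escape("1.5");
Console.WriteLine(escapedDot); // 1\.5
```

Verbatim strings are usually the easier choice for patterns, since each backslash is written only once.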

2. Character class

During matching, the input text is treated as an ordered stream of characters. Character class metacharacters match characters, and the characters they match are consumed (this article also calls this "captured"): a character consumed by one metacharacter cannot be matched by another, and subsequent metacharacters can only match against the remaining text.

Commonly used character class metacharacters:

[char_group]: Match any character in the character group

[^char_group]: Matches any character that is not in the character group

[first-last]: Match any character in the character range from first to last, the character range includes first and last.

. : Wildcard, matches any character except \n

\w: Matches any word character. In .NET this means Unicode letters and digits plus the underscore; for ASCII text that is A-Z, a-z, 0-9 and _

\W: Matches any non-word character, that is, any character other than those matched by \w

\s: Matches any whitespace character

\S: Matches any non-whitespace character

\d: Matches any decimal digit

\D: Matches any character that is not a decimal digit

Note that escape characters are also character class metacharacters, and they consume characters during matching as well.
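The character classes listed above can be tried out with a short sketch like the following. The sample string and the Collect helper are hypothetical, introduced only to print every match of a pattern on one line.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string sample = "id_7: A-9 ok";

// Joins every match of a pattern into one comma-separated string.
string Collect(string pattern) =>
    string.Join(",", Regex.Matches(sample, pattern).Cast<Match>().Select(m => m.Value));

string words = Collect(@"\w+");           // runs of word characters, underscore included
string digits = Collect(@"\d");           // single decimal digits
string upper = Collect(@"[A-Z]");         // a range inside a character group
string punctuation = Collect(@"[^\w\s]"); // negated group: neither word nor whitespace

Console.WriteLine(words);       // id_7,A,9,ok
Console.WriteLine(digits);      // 7,9
Console.WriteLine(upper);       // A
Console.WriteLine(punctuation); // :,-
```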

3. Locators (anchors)

A locator matches a position rather than a character: whether the match succeeds depends on where in the text the engine currently is. Locators do not consume characters and are zero-width (width 0). Commonly used locators are:

^: By default, matches the start of the string; in multi-line mode, matches the start of each line;

$: By default, matches the end of the string, or the position before a \n at the end of the string; in multi-line mode, matches the end of each line, that is, the position before each \n, as well as the end of the string.

\A: Matches the beginning position of the string;

\Z: Match the end position of the string, or the position before \n at the end of the string;

\z: Match the end position of the string;

\G: Matches the position where the previous match ended;

\b: Matches a word boundary, that is, the start or end of a word;

\B: Matches any position that is not a word boundary, such as the middle of a word;
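To make the locators concrete, here is a small sketch that counts matches with and without RegexOptions.Multiline and compares \Z with \z. The sample text and variable names are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

string text = "cat\ncatalog\nbobcat";

// Default mode: ^ matches only at the start of the whole string.
int defaultCount = Regex.Matches(text, @"^cat").Count;
// Multi-line mode: ^ also matches at the start of every line.
int multilineCount = Regex.Matches(text, @"^cat", RegexOptions.Multiline).Count;
// \b on both sides: "cat" only as a whole word.
int wholeWordCount = Regex.Matches(text, @"\bcat\b").Count;
// \B in front: "cat" only when it does not start a word (as in "bobcat").
int insideWordCount = Regex.Matches(text, @"\Bcat").Count;

Console.WriteLine(defaultCount);    // 1
Console.WriteLine(multilineCount);  // 2
Console.WriteLine(wholeWordCount);  // 1
Console.WriteLine(insideWordCount); // 1

// \Z tolerates one trailing \n; \z requires the very end of the string.
bool upperZ = Regex.IsMatch("line\n", @"line\Z");
bool lowerZ = Regex.IsMatch("line\n", @"line\z");
Console.WriteLine(upperZ); // True
Console.WriteLine(lowerZ); // False
```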

4. Quantifiers, Greedy and Lazy

A quantifier limits how many times the preceding pattern may occur. Quantifiers have two modes: greedy and lazy. Greedy mode matches as many characters as possible, while lazy mode matches as few as possible. Quantifiers are greedy by default; append ? to a quantifier to make it lazy.

*: Occurs 0 or more times

+: Occurs 1 or more times

?: Occurs 0 or 1 times

{n}: Appears n times

{n,}: Appears at least n times

{n,m}: Appears n to m times

Note that a quantifier repeats the preceding metacharacter, not the text it matched. For example, \d{2} is equivalent to \d\d: it requires two digits but does not require them to be the same. To require two identical digits, a group and a reference to it must be used.
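The difference between greedy and lazy mode is easiest to see on text that contains the closing delimiter more than once. The following sketch uses an invented sample string; only the quantifier changes between the two patterns.

```csharp
using System;
using System.Text.RegularExpressions;

string html = "<b>one</b><b>two</b>";

// Greedy: .* grabs as much as possible, so the match runs to the last </b>.
string greedy = Regex.Match(html, @"<b>.*</b>").Value;
// Lazy: .*? grabs as little as possible, so the match stops at the first </b>.
string lazy = Regex.Match(html, @"<b>.*?</b>").Value;

Console.WriteLine(greedy); // <b>one</b><b>two</b>
Console.WriteLine(lazy);   // <b>one</b>

// \d{2} needs two adjacent digits, which do not have to be equal.
string twoDigits = Regex.Match("a1b47c", @"\d{2}").Value;
Console.WriteLine(twoDigits); // 47
```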

5. Grouping and capturing characters

Parentheses () not only delimit the scope of a subexpression but also create a group: the expression inside () is a group. Referencing a group requires the text at the reference to be exactly the same as the text the group matched. The basic syntax for defining a group:

(pattern)

This kind of group captures characters: the text it matches is consumed, so it cannot be matched again by other metacharacters, and subsequent metacharacters can only match the remaining text. The captured text is also stored so that it can be referenced later.

1. Group numbering and naming

By default, each group is automatically assigned a group number. Groups are numbered from left to right in the order in which their opening parentheses appear: the first group is number 1, the second is number 2, and so on. A group can also be given a name, in which case it is called a named group. Named groups are numbered automatically too; in .NET their numbers follow those of the unnamed groups. The syntax for naming a group is:

(?<name>pattern)

In general, groups are divided into named groups and numbered groups. A group can be referenced in two ways:

Reference a group by group name: \k<name>

Reference a group by group number: \number

Note that a group can only be referenced after it has been defined: reading the regular expression from left to right, the group definition must come first and the reference after it.

The syntax for referencing a group by number is "\number". For example, "\1" stands for the substring matched by group 1, "\2" for the substring matched by group 2, and so on.

For example, "<(.*?)>.*?</\1>" can match text of the form <tag>valid</tag>: at the point where the group is referenced, the text must be exactly the same as the text the group matched.
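Numbered and named references can be sketched as follows. The sample inputs and the group names name and body are assumptions made for illustration; \1 and \k<name> are the standard .NET reference syntaxes.

```csharp
using System;
using System.Text.RegularExpressions;

// (\d)\1: a digit followed by the same digit again.
Match repeated = Regex.Match("1223", @"(\d)\1");
Console.WriteLine(repeated.Value);           // 22
Console.WriteLine(repeated.Groups[1].Value); // 2

// Named groups: \k<name> must repeat exactly what the group "name" matched.
string tagPattern = @"<(?<name>\w+)>(?<body>.*?)</\k<name>>";
Match tag = Regex.Match("<em>valid</em>", tagPattern);
Console.WriteLine(tag.Groups["name"].Value); // em
Console.WriteLine(tag.Groups["body"].Value); // valid

// The closing tag differs from the opening one, so the reference fails.
bool mismatched = Regex.IsMatch("<em>valid</b>", tagPattern);
Console.WriteLine(mismatched); // False
```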

2. Group constructs

The group constructs are as follows:

(pattern): Captures the matched subexpression and assigns it a group number
(?<name>pattern): Captures the matched subexpression into a named group
(?:pattern): Non-capturing group; groups the subexpression without capturing it or assigning a group number
(?>pattern): Greedy (non-backtracking) group

3. Greedy grouping

A greedy group is also called a non-backtracking (atomic) group. It disables backtracking: the regular expression engine matches as many characters as it can inside the group and, if the rest of the pattern then fails to match, does not go back into the group to try a shorter match.

(?>pattern)
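A minimal sketch of what disabling backtracking means in practice: the two patterns below differ only in the (?>...) wrapper, and the input string is an assumption chosen so that the ordinary pattern needs to backtrack in order to succeed.

```csharp
using System;
using System.Text.RegularExpressions;

// a+ first takes "aaa", then gives one "a" back so that "ab" can match.
bool withBacktracking = Regex.IsMatch("aaab", @"a+ab");
// (?>a+) takes every "a" and refuses to give any back, so "ab" never fits.
bool nonBacktracking = Regex.IsMatch("aaab", @"(?>a+)ab");

Console.WriteLine(withBacktracking); // True
Console.WriteLine(nonBacktracking);  // False
```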

4. Alternation

| means "or": it matches either one of the two alternatives. Note that | splits the whole expression to its left and right into two alternatives, so parentheses are needed to restrict it to part of a pattern.

pattern1|pattern2
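Since | divides everything on its left from everything on its right, anchors attach to only one alternative unless parentheses are added. The sketch below illustrates this with invented inputs.

```csharp
using System;
using System.Text.RegularExpressions;

// Without parentheses the alternatives are "^cat" and "dog$".
string loose = @"^cat|dog$";
bool startsWithCat = Regex.IsMatch("catfish", loose);
bool endsWithDog = Regex.IsMatch("hotdog", loose);
bool neither = Regex.IsMatch("a cat and a dog show", loose);

// With parentheses both anchors apply to either alternative.
string strict = @"^(cat|dog)$";
bool strictCatfish = Regex.IsMatch("catfish", strict);
bool strictDog = Regex.IsMatch("dog", strict);

Console.WriteLine(startsWithCat); // True
Console.WriteLine(endsWithDog);   // True
Console.WriteLine(neither);       // False
Console.WriteLine(strictCatfish); // False
Console.WriteLine(strictDog);     // True
```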

6. Zero-width assertions

Zero-width means that the width is 0: what is matched is a position, so the substring tested by the assertion does not appear in the match result. An assertion is a test: the match succeeds only when the assertion is true.

Locators can match the start and end of a string or line (^ $) or the start and end of a word (\b). These metacharacters match only a position and require that position to satisfy a condition, rather than matching characters, so they are also zero-width assertions. The assertions described in this section go further: they can pin down a position with an arbitrary pattern rather than just a line or word boundary.

Regular expressions treat text as a stream of characters running from left to right. Looking to the right of the current position is called forward (lookahead), and looking to the left is called backward (lookbehind). An assertion that is true when the specified pattern matches is called a positive assertion; one that is true when the pattern does not match is called a negative assertion.

According to the direction and the polarity of the match, zero-width assertions fall into four types:

(?=pattern): forward (lookahead), positive assertion
(?!pattern): forward (lookahead), negative assertion
(?<=pattern): backward (lookbehind), positive assertion
(?<!pattern): backward (lookbehind), negative assertion

1. Forward positive assertion

A forward positive assertion requires that a pattern be present at the end (right side) of the matched text, but the substring matched by that pattern does not appear in the match result. Forward assertions usually appear at the right end of a regular expression, indicating that the text to the right must satisfy a specific pattern:

(?=subexpression)

A forward positive assertion can define a fuzzy match whose suffix must contain specific characters:

\b\w+(?=\sis\b)

Analysis of the regular expression:

\b: A word boundary

\w+: One or more word characters

(?=\sis\b): Forward positive assertion; \s is a whitespace character, is consists of ordinary characters that match exactly, and \b is a word boundary.

From this analysis, the text matched by the regular expression must be followed by the word is, where is stands alone as a word rather than being part of another word. For example:

Sunday is a weekend day matches the regular expression and the matched value is Sunday, whereas The island has beautiful birds does not match.
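The two sentences from the example can be checked with a short sketch like this one. The variable names are assumptions; the pattern and inputs come from the text above.

```csharp
using System;
using System.Text.RegularExpressions;

string pattern = @"\b\w+(?=\sis\b)";

// "Sunday" is followed by " is" as a separate word, so it matches.
Match sunday = Regex.Match("Sunday is a weekend day", pattern);
// In "island", "is" is followed by "l", so \b fails and nothing matches.
Match island = Regex.Match("The island has beautiful birds", pattern);

Console.WriteLine(sunday.Value);   // Sunday
Console.WriteLine(island.Success); // False
```

The text " is" is tested by the assertion but is not part of the matched value.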

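2. Backward positive assertion

A backward positive assertion requires that a pattern be present at the start (left side) of the matched text, but the substring matched by that pattern does not appear in the match result. Backward assertions usually appear at the left end of a regular expression, indicating that the text to the left must satisfy a specific pattern:

(?<=subexpression)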

A backward positive assertion can define a fuzzy match whose prefix must contain specific characters:

(?<=\b20)\d{2}\b

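Analysis of the regular expression:

(?<=\b20): Backward assertion; \b marks the start of a word and 20 consists of ordinary characters

\d{2}: Two digits, which need not be the same

\b: A word boundary

Text matched by this regular expression has the following form: a word that starts with 20 and ends with two more digits. Only those last two digits are returned as the match value.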
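As a rough illustration of (?<=\b20)\d{2}\b, the sketch below runs it over a made-up list of tokens and collects the matched values. The input string and variable names are assumptions.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string pattern = @"(?<=\b20)\d{2}\b";
string input = "2010 1999 2024 2005x";

// Only four-digit words starting with 20 qualify; "2005x" fails the final \b.
string[] suffixes = Regex.Matches(input, pattern)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

Console.WriteLine(string.Join(",", suffixes)); // 10,24
```

The prefix 20 is required by the assertion but is left out of each matched value.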

7. Extracting values from formatted text with regular expressions

Given the following JSON text, extract the values of the fields CustomerId, CustomerName, CustomerIdSource and CustomerType:

{"CustomerDetails":"[{\"CustomerId\":\"57512f19\",\"CustomerName\":\"cust xyz\",\"CustomerIdSource\":\"AadTenantId\",\"CustomerType\":\"Enterprise\"}]"}

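Note that when this text is written as a C# string literal, the double quotes and backslashes must themselves be escaped. Because the four fields follow the same extraction rule, one general pattern can extract them all: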

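using System.Text.RegularExpressions;

// Returns the value of the given key from text whose inner quotes appear as \".
public static string GetNestedItem(string txt, string key)
{
    // The lookbehind requires \"key\":\" and the lookahead stops at the closing \".
    string pat = string.Format("(?<=\\\\\"{0}\\\\\":\\\\\").*?(?=\\\\\")", Regex.Escape(key));
    return Regex.Match(txt, pat, RegexOptions.IgnoreCase).Value;
}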

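Analysis of the regular expression:

.*?: Lazy mode, matches as little text as possible

(?=\\\\\"): Forward assertion that matches the double quote closing the field value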
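A possible way to call GetNestedItem on the sample text is sketched below. The method body is repeated as a local function so that the snippet runs on its own, and the sample text is written as a verbatim string; the variable names are assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

// Same logic as GetNestedItem above, declared as a local function so the
// snippet is self-contained.
static string GetNestedItem(string txt, string key)
{
    string pat = string.Format("(?<=\\\\\"{0}\\\\\":\\\\\").*?(?=\\\\\")", key);
    return Regex.Match(txt, pat, RegexOptions.IgnoreCase).Value;
}

// Verbatim string: "" is one double quote and \ is a literal backslash,
// so \"" below is the two-character sequence \" from the sample text.
string json = @"{""CustomerDetails"":""[{\""CustomerId\"":\""57512f19\"",\""CustomerName\"":\""cust xyz\"",\""CustomerIdSource\"":\""AadTenantId\"",\""CustomerType\"":\""Enterprise\""}]""}";

string customerId = GetNestedItem(json, "CustomerId");
string customerName = GetNestedItem(json, "CustomerName");
string customerIdSource = GetNestedItem(json, "CustomerIdSource");
string customerType = GetNestedItem(json, "CustomerType");

Console.WriteLine(customerId);       // 57512f19
Console.WriteLine(customerName);     // cust xyz
Console.WriteLine(customerIdSource); // AadTenantId
Console.WriteLine(customerType);     // Enterprise
```

For anything beyond a quick extraction like this, a JSON parser is generally a more robust choice than a regular expression.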
