30-minute introductory tutorial on regular expressions
Thirty minutes should be enough. If you have no experience with regular expressions, please do not try to get through this introduction in 30 seconds, unless you are Superman :)
Don’t be intimidated by the complex expressions below. Just follow along step by step, and you will find that regular expressions are not as difficult as you imagined. Of course, if you finish this tutorial and find that you understood a lot but can remember almost nothing, that is normal. I think the chance that someone who has never seen regular expressions will remember more than 80% of the syntax mentioned here after one reading is zero. The goal is only to help you understand the basic principles; you will need to practice and use regular expressions a lot before you master them.
In addition to being an introductory tutorial, this article also tries to serve as a regular expression syntax reference that can be used in daily work. As far as the author’s own experience goes, this goal has been met well. After all, I can’t keep everything in my head myself, can I?
This article also contains some side notes, mainly to provide related information or to explain basic concepts to readers without a programming background. They can usually be skipped.
What exactly is a regular expression?
A character is the most basic unit that computer software handles when processing text: it may be a letter, a digit, a punctuation mark, a space, a newline, a Chinese character, and so on. A string is a sequence of zero or more characters. Text simply means written content, that is, strings. Saying that a string matches a regular expression usually means that part of the string (or several parts) satisfies the conditions the expression describes.
When writing programs or web pages that process strings, you often need to find strings that follow some complex rule. A regular expression is a tool for describing such rules. In other words, a regular expression is code that records a rule about text.
You have very likely used the wildcards for file searches under Windows/DOS, namely * and ?. If you wanted to find all Word documents in a directory, you would search for *.doc, where * is interpreted as any string. Like wildcards, regular expressions are a tool for matching text, but they can describe what you need far more precisely, at the cost of being more complicated. For example, you can write a regular expression that finds all strings that start with 0, followed by 2 or 3 digits, then a hyphen "-", and finally 7 or 8 digits (like 010-12345678 or 0376-7654321).
The best way to learn regular expressions is to start with examples. After understanding the examples, you can modify and experiment with them yourself. A number of simple examples are given below, and they are explained in detail.
Suppose you are searching for hi in an English novel. You can use the regular expression hi.
This is almost the simplest regular expression there is. It matches exactly this kind of string: two characters, the first being h and the second being i. Tools that handle regular expressions usually offer an option to ignore case. If that option is turned on, the expression matches any of hi, HI, Hi, and hI.
Unfortunately, many words contain the two consecutive characters hi, such as him, history, high and so on. If you search with hi, the hi inside those words will be found too. To find exactly the word hi, we should use \bhi\b.
\b is a special code defined by regular expressions (some people call it a metacharacter). It stands for the beginning or end of a word, that is, a word boundary. Although English words are usually separated by spaces, punctuation, or newlines, \b does not match any of these separating characters. It matches only a position.
To be more precise, \b matches a position where the character before it and the character after it are not both \w: one is and the other is not, or one of them does not exist.
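The difference between searching for hi and for \bhi\b can be tried out in a few lines. The sketch below uses Python's re module, which is an assumption of this example (the tutorial itself describes .NET behavior, and details can differ between engines); the sample sentence is made up for illustration.

```python
import re

# Made-up sample text for illustration.
text = "hi, him and history say Hi"

# Plain hi finds the two letters anywhere, including inside him and history.
plain = re.findall(r"hi", text)

# \bhi\b only finds hi when it stands alone as a word.
whole_word = re.findall(r"\bhi\b", text)

# With the ignore-case option, Hi is found as well.
any_case = re.findall(r"\bhi\b", text, re.IGNORECASE)

print(plain)       # ['hi', 'hi', 'hi']
print(whole_word)  # ['hi']
print(any_case)    # ['hi', 'Hi']
```

Note the r"..." raw string: it keeps Python from interpreting the backslash before the regular expression engine sees it.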
If you are looking for hi followed, not far behind, by Lucy, you should use \bhi\b.*\bLucy\b.
Here, . is another metacharacter; it matches any character except a newline. * is also a metacharacter, but it stands for neither a character nor a position. It stands for a quantity: it says that whatever comes before the * may be repeated any number of times in a row so that the whole expression matches. So .* together means any number of characters that are not newlines. Now the meaning of \bhi\b.*\bLucy\b is clear: first the word hi, then any number of arbitrary characters (but no newline), and finally the word Lucy.
The newline character is '\n', the character whose ASCII code is 10 (hexadecimal 0x0A).
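The claim that . does not cross a newline is easy to check. The following is a minimal sketch in Python (an assumed language; the sample sentences are invented). re.DOTALL is Python's own switch for letting . match newlines and is shown only for contrast.

```python
import re

pattern = re.compile(r"\bhi\b.*\bLucy\b")

# Invented sample strings.
same_line = "I said hi to my neighbour Lucy yesterday."
two_lines = "I said hi.\nLucy waved back."

on_one_line = pattern.search(same_line)
across_lines = pattern.search(two_lines)

# With re.DOTALL the . also matches '\n', so the match can span both lines.
with_dotall = re.search(r"\bhi\b.*\bLucy\b", two_lines, re.DOTALL)

print(on_one_line.group())        # hi to my neighbour Lucy
print(across_lines)               # None
print(repr(with_dotall.group()))  # 'hi.\nLucy'
```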
If other metacharacters are used at the same time, we can construct a more powerful regular expression. For example, the following example:
0\d\d-\d\d\d\d\d\d\d\d matches this kind of string: it starts with 0, followed by two digits, then a hyphen "-", and finally 8 digits (that is, a phone number in China; of course, this example only matches numbers with a 3-digit area code).
Here \d is a new metacharacter that matches one digit (0, or 1, or 2, or ...). - is not a metacharacter and matches only itself: the hyphen (or minus sign, or dash, or whatever you want to call it).
To avoid all that annoying repetition, we can also write the expression as 0\d{2}-\d{8}. The {2} ({8}) after \d means that the preceding \d must be matched exactly 2 times (8 times) in a row.
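To see that the long and the short form of the phone number expression behave the same way, here is a small sketch in Python (an assumed language). The sample numbers are illustrative, and search looks for a match anywhere in the string.

```python
import re

long_form = re.compile(r"0\d\d-\d\d\d\d\d\d\d\d")
short_form = re.compile(r"0\d{2}-\d{8}")

# Illustrative samples: only the first has a 3-digit area code
# starting with 0 followed by an 8-digit local number.
samples = ["010-12345678", "0376-7654321", "010-1234567", "110-12345678"]

results = {}
for s in samples:
    long_hit = long_form.search(s) is not None
    short_hit = short_form.search(s) is not None
    results[s] = (long_hit, short_hit)
    print(s, long_hit, short_hit)
```

Both expressions accept only 010-12345678 from this list, which is what you would expect if {2} and {8} are just shorthand for writing \d two and eight times.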
Other available testing tools:
RegexBuddy
Javascript regular expression online testing tool
If you don’t find regular expressions difficult to read and write, either you are a genius, or you are not from Earth. The syntax of regular expressions can be confusing, even for people who use it regularly. Because it is difficult to read and write and prone to errors, it is necessary to find a tool to test regular expressions.
Some details of regular expressions differ between environments. This tutorial describes the behavior of regular expressions under Microsoft .NET Framework 2.0, so I will introduce a .NET tool, Regex Tester. First make sure .NET Framework 2.0 is installed, then download Regex Tester. It is portable software that needs no installation: after downloading, open the archive and run RegexTester.exe directly.
[Screenshot: Regex Tester running]
Now you already know a few useful metacharacters, such as \b, ., *, and \d. Regular expressions have more of them. For example, \s matches any whitespace character, including spaces, tabs, newlines, Chinese full-width spaces, and so on. \w matches a letter, digit, underscore, or Chinese character.
Special handling of Chinese characters is supported by the regular expression engine that .NET provides. For other environments, please check the relevant documentation.
Here are some more examples:
\ba\w*\b matches words that start with the letter a: first the beginning of a word (\b), then the letter a, then any number of letters or digits (\w*), and finally the end of the word (\b).
Okay, now let’s talk about what a word means in a regular expression: one or more consecutive \w characters. Yes, this has little to do with the thousands of things of the same name that you have to memorize when learning English :)
\d+ matches 1 or more consecutive digits. The + here is a metacharacter similar to *; the difference is that * matches any number of repetitions (possibly 0), while + matches 1 or more repetitions.
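The two examples above, and the difference between * and +, can be sketched as follows. Python is an assumed language here, and the sentence is made up; re.fullmatch is a Python function that succeeds only if the whole string matches.

```python
import re

# Made-up sample sentence.
sentence = "an apple a day keeps 2 doctors and 1024 bananas away"

# Words that begin with the letter a.
words_starting_with_a = re.findall(r"\ba\w*\b", sentence)

# Runs of one or more digits.
digit_runs = re.findall(r"\d+", sentence)

# * allows zero repetitions, + needs at least one,
# so only \d* can match the empty string.
star_on_empty = re.fullmatch(r"\d*", "") is not None
plus_on_empty = re.fullmatch(r"\d+", "") is not None

print(words_starting_with_a)         # ['an', 'apple', 'a', 'and', 'away']
print(digit_runs)                    # ['2', '1024']
print(star_on_empty, plus_on_empty)  # True False
```

Note that bananas is not reported: its a characters are inside the word, so there is no \b in front of them.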
\b\w{6}\b matches a word of exactly 6 characters.
The metacharacters ^ (on the same key as the digit 6) and $ both match a position, somewhat like \b. ^ matches the beginning of the string you are searching, and $ matches its end. These two codes are very useful for validating input. For example, if a website requires the QQ number you enter to be 5 to 12 digits, you can use ^\d{5,12}$.
The {5,12} here is similar to the {2} introduced earlier, except that {2} means exactly 2 repetitions, while {5,12} means no fewer than 5 and no more than 12 repetitions; anything else does not match.
Because ^ and $ are used, the entire input string must match \d{5,12}; in other words, the whole input must be 5 to 12 digits. So if the QQ number entered matches this regular expression, it meets the requirement.
Regular expression engines usually provide a method that tests whether a given string matches a regular expression, such as RegExp.test() in JavaScript or Regex.IsMatch() in .NET. Matching here means checking whether any part of the string fits the expression. Without ^ and $, \d{5,12} used with such a method only guarantees that the string contains a run of 5 to 12 consecutive digits (possibly inside a longer run), not that the whole string is 5 to 12 digits.
Like the option to ignore case, some regular expression tools also have an option for multi-line processing. If it is turned on, ^ and $ mean the beginning and the end of a line instead.
If you want to find a metacharacter itself, for example . or *, there is a problem: you cannot write them directly, because they would be interpreted as something else. In that case you have to use \ to cancel the special meaning of these characters. So you should use \. and \*.
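The points above about ^ and $, the multi-line option, and escaping can be sketched in Python (an assumed language; the helper names contains_digit_run and is_valid_qq are hypothetical and exist only for this example).

```python
import re

def contains_digit_run(text):
    """Hypothetical helper: True if some part of text is 5 to 12 digits."""
    return re.search(r"\d{5,12}", text) is not None

def is_valid_qq(text):
    """Hypothetical helper: True only if the text as a whole is 5 to 12 digits."""
    return re.search(r"^\d{5,12}$", text) is not None

for candidate in ["12345", "abc12345def", "1234", "1234567890123"]:
    print(candidate, contains_digit_run(candidate), is_valid_qq(candidate))

# Multi-line option: ^ matches at the start of every line, not just the string.
two_lines = "first line\nsecond line"
default_mode = re.findall(r"^\w+", two_lines)
multiline_mode = re.findall(r"^\w+", two_lines, re.MULTILINE)
print(default_mode)    # ['first']
print(multiline_mode)  # ['first', 'second']

# Escaping: an unescaped . matches any character, \. only a literal dot.
dot_unescaped = re.search(r"a.c", "abc") is not None
dot_escaped = re.search(r"a\.c", "abc") is not None
print(dot_unescaped, dot_escaped)  # True False
```

One Python-specific detail: $ also matches just before a trailing newline, so "12345\n" would pass is_valid_qq. When that matters, re.fullmatch(r"\d{5,12}", text) is the stricter choice.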
Of course, to find \ itself, you also have to use \\.
For example: unibetter\.com matches unibetter.com, and C:\\Windows matches C:\Windows.
You have already seen *, +, {2}, and {5,12}, which are all ways to match repetition. All the qualifiers in regular expressions (codes that specify a quantity, such as *, {5,12}, etc.) are listed in the Repeat table near the end of this article.
Here are some examples of using repetition:
Windows\d+ matches Windows followed by 1 or more digits
^\w+ matches the first word of a line (or the first word of the whole string; which one depends on the option settings)
Finding digits, letters or digits, or whitespace is simple, because there are metacharacters for those sets of characters. But what if you want to match a set that has no predefined metacharacter (such as the vowels a, e, i, o, u)?
It is simple: just list the characters inside square brackets. [aeiou] matches any one English vowel, and [.?!] matches a punctuation mark (. or ? or !).
We can also easily specify a character range: [0-9] means the same as \d, one digit (as long as only the ASCII digits 0 to 9 are involved); similarly, [a-z0-9A-Z_] is completely equivalent to \w (if only English is considered).
Here is a more complex expression: \(?0\d{2}[) -]?\d{8}.
"(" and ")" are also metacharacters, which come up again in the discussion of grouping, so they need to be escaped here.
This expression matches phone numbers in several formats, such as (010)88886666, 022-22334455, or 02912345678. Let’s analyze it: first comes an escaped \(, which may appear 0 or 1 times (?), then a 0, followed by 2 digits (\d{2}), then one of ), -, or a space, which appears once or not at all (?), and finally 8 digits (\d{8}).
Unfortunately, this expression also matches "incorrect" formats such as 010)12345678 or (022-87654321. To solve this problem, we need branch conditions.
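The character classes and the phone number expression discussed above can be tried out with a short sketch. Python is an assumed language, the sample sentences are invented, and re.fullmatch is used so that each sample has to match as a whole.

```python
import re

# A character class matches any one of the listed characters.
vowels = re.findall(r"[aeiou]", "hello world")
marks = re.findall(r"[.?!]", "Really? Yes! Fine.")
print(vowels)  # ['e', 'o', 'o']
print(marks)   # ['?', '!', '.']

phone = re.compile(r"\(?0\d{2}[) -]?\d{8}")

# Formats the expression is meant to accept ...
intended = ["(010)88886666", "022-22334455", "02912345678"]
# ... and the malformed ones from the text that it accepts as well.
unintended = ["010)12345678", "(022-87654321"]

intended_hits = [s for s in intended if phone.fullmatch(s)]
unintended_hits = [s for s in unintended if phone.fullmatch(s)]
print(intended_hits)    # all three intended samples
print(unintended_hits)  # both malformed samples, which is the problem
```

The malformed samples slip through because the opening parenthesis and the closing character are each optional on their own; nothing ties them together.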
Branch conditions in a regular expression are several alternative rules; if any one of them is satisfied, it counts as a match. The way to write them is to separate the rules with |. Don’t understand? Never mind, look at the examples:
0\d{2}-\d{8}|0\d{3}-\d{7} matches two kinds of hyphen-separated phone numbers: one has a 3-digit area code and an 8-digit local number (such as 010-12345678), the other a 4-digit area code and a 7-digit local number (such as 0376-2233445).
\(0\d{2}\)[- ]?\d{8}|0\d{2}[- ]?\d{8} matches phone numbers with a 3-digit area code, where the area code may or may not be enclosed in parentheses, and the area code and the local number may be separated by a hyphen, a space, or nothing at all. You can try to use branch conditions to extend this expression to support 4-digit area codes as well.
\d{5}-\d{4}|\d{5} matches zip codes in the United States. The rule for U.S. zip codes is 5 digits, or 9 digits separated by a hyphen (5 digits, a hyphen, then 4 digits). This example is given because it illustrates a point: when using branch conditions, pay attention to the order of the conditions. If you change it to \d{5}|\d{5}-\d{4}, only 5-digit zip codes (and the first 5 digits of 9-digit zip codes) will be matched. The reason is that the conditions are tested from left to right, and once one branch is satisfied, the others are not considered.
We have already covered how to repeat a single character (just put the qualifier right after it); but what if you want to repeat several characters? You can use parentheses to mark a subexpression (also called a group); then you can specify how many times the subexpression repeats, and you can also do other things with subexpressions (introduced later).
(\d{1,3}\.){3}\d{1,3} is a simple expression for matching IP addresses. To understand it, analyze it in this order: \d{1,3} matches a number of 1 to 3 digits; (\d{1,3}\.){3} matches a 1-to-3-digit number followed by a period (together these form the group), repeated 3 times; and finally comes one more 1-to-3-digit number (\d{1,3}).
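The point made above about the order of branch conditions can be checked with the two zip code expressions. This is a sketch in Python (an assumed language); the phone expression at the end is one expression that fits the description of the two hyphen-separated phone formats, and the sample values are illustrative.

```python
import re

nine_first = re.compile(r"\d{5}-\d{4}|\d{5}")
five_first = re.compile(r"\d{5}|\d{5}-\d{4}")

zip_code = "12345-6789"

# Branches are tried from left to right; the first one that fits wins.
nine_first_match = nine_first.search(zip_code).group()
five_first_match = five_first.search(zip_code).group()
print(nine_first_match)  # 12345-6789
print(five_first_match)  # 12345

# Two alternatives for the two hyphen-separated phone formats.
phone = re.compile(r"0\d{2}-\d{8}|0\d{3}-\d{7}")
phone_samples = ["010-12345678", "0376-2233445", "0376-12345678"]
phone_hits = [s for s in phone_samples if phone.fullmatch(s)]
print(phone_hits)  # ['010-12345678', '0376-2233445']
```

With the 5-digit branch first, the longer alternative is never tried at that position, so the -6789 part is left unmatched.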
Unfortunately, (\d{1,3}\.){3}\d{1,3} also matches 256.300.888.999, an IP address that cannot exist. (No number in an IP address can be greater than 255. Don’t let the writers of the third season of "24" fool you...) If arithmetic comparison were available, this could be solved easily, but regular expressions provide no mathematical functions, so the only option is lengthy grouping, alternation, and character classes to describe a correct IP address: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?).
The key to understanding this expression is understanding 2[0-4]\d|25[0-5]|[01]?\d\d?. I won’t go into detail here; you should be able to work out its meaning yourself.
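The two IP address expressions can be compared side by side. The following is a minimal sketch in Python (an assumed language); the addresses are sample values, and re.fullmatch is used so that the whole string has to be an address.

```python
import re

simple_ip = re.compile(r"(\d{1,3}\.){3}\d{1,3}")
strict_ip = re.compile(
    r"((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)"
)

# Sample addresses: the last one has parts greater than 255.
addresses = ["192.168.0.1", "255.255.255.255", "0.0.0.0", "256.300.888.999"]

verdicts = {}
for address in addresses:
    simple_ok = simple_ip.fullmatch(address) is not None
    strict_ok = strict_ip.fullmatch(address) is not None
    verdicts[address] = (simple_ok, strict_ok)
    print(address, simple_ok, strict_ok)
```

Each alternative in the strict expression covers one range: 2[0-4]\d covers 200 to 249, 25[0-5] covers 250 to 255, and [01]?\d\d? covers everything from 0 to 199, so only 256.300.888.999 is rejected.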
Code    Description
.       Matches any character except a newline
\w      Matches a letter, digit, underscore, or Chinese character
\s      Matches any whitespace character
\d      Matches a digit
\b      Matches the beginning or end of a word
^       Matches the beginning of the string
$       Matches the end of the string
Repeat
Code/Syntax    Description
*              Repeat zero or more times
+              Repeat one or more times
?              Repeat zero or one time
{n}            Repeat n times
{n,}           Repeat n times or more
{n,m}          Repeat n to m times
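The qualifiers in the table above can be compared by applying each one to the letter a and checking which repeat counts it accepts. This is a sketch in Python (an assumed language), limited to strings of 0 to 5 letters for brevity.

```python
import re

# One pattern per row of the table, with n = 2 and m = 4.
patterns = [r"a*", r"a+", r"a?", r"a{2}", r"a{2,}", r"a{2,4}"]

accepted = {}
for p in patterns:
    # Record which lengths 0..5 of "aaa..." the pattern matches as a whole.
    accepted[p] = [n for n in range(6) if re.fullmatch(p, "a" * n)]
    print(p, accepted[p])
```

For example, a{2,4} accepts only the lengths 2, 3, and 4, while a{2,} accepts every length from 2 upward.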