I have worked with many languages, and two things I care about are how powerful a language's regular expressions are and how tightly they are integrated with its syntax. JavaScript does quite well here, since it at least has regex literals; the most powerful, of course, is Perl. I recently discovered that regular expressions in JavaScript behave somewhat differently from those in other languages and tools. You will almost never write or use the regexes discussed below, but they are still worth understanding. All code examples in this article were run in an ES5-compatible JavaScript environment, so the behavior in IE versions before IE9, in Firefox versions around 4, and so on may well differ from what I describe.
1. Empty character class
A character class that contains no characters, [], is called an empty character class (empty char class). I suspect you have never heard anyone call it that, because in other languages this syntax is illegal, and no documentation or tutorial bothers to discuss illegal syntax. Let me show how other languages and tools report the error:
$echo | grep '[]'
grep: Unmatched [ or [^
$echo | sed '/[]/'
sed: -e expression #1, character 4: unterminated address regular expression
$echo | awk '/[]/'
awk: cmd. line:1: /[]/
awk: cmd. line:1: ^ unterminated regexp
awk: cmd. line:1: error: Unmatched [ or [^ : /[]//
$echo | perl -ne '/[]/'
Unmatched [ in regex; marked by <-- HERE in m/[ <-- HERE ]/ at -e line 1.
$echo | ruby -ne '/[]/'
-e:1: empty char-class: /[]/
$python -c 'import re;re.match("[]","")'
Traceback (most recent call last):
File "
  File "E:\Python\lib\re.py", line 137, in match
    return _compile(pattern, flags).match(string)
  File "E:\Python\lib\re.py", line 244, in _compile
    raise error, v # invalid expression
sre_constants.error: unexpected end of regular expression
In JavaScript, the empty character class is a legal regex component, but its effect is "never match": it fails to match anything. It is equivalent to an empty negative lookahead, (?!):
js> "whatever\n".match(/[]/g) //Empty character class, never matches
null
js> "whatever\n".match(/(?!)/g) //Empty negative lookahead, never matches
null
Obviously, this kind of thing is of no use in JavaScript.
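As a small sketch of the "never match" behavior described above, the snippet below tries the empty character class and the empty negative lookahead against a few sample strings. The sample strings and variable names are made up for illustration.

```javascript
// Both patterns are legal in JavaScript and both fail on every input.
const emptyClass = /[]/;           // empty character class
const emptyNegLookahead = /(?!)/;  // empty negative lookahead

// Hypothetical sample inputs, including the empty string and a literal "[]".
const samples = ["", "a", "whatever\n", "[]"];

// For each sample, record whether either pattern matched.
const results = samples.map(s => [emptyClass.test(s), emptyNegLookahead.test(s)]);

for (const [i, s] of samples.entries()) {
  console.log(JSON.stringify(s), results[i]); // every pair is [false, false]
}
```

Note that even the string "[]" is not matched: the brackets in the pattern are class delimiters, not literal characters.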
2. Negated empty character class
A negated character class that contains no characters, [^], can be called a negative empty char class or an empty negative char class; either will do, because the term is my own invention. As with the empty character class described above, this syntax is also illegal in other languages:
$echo | grep '[^]'
grep: Unmatched [ or [^
$echo | sed '/[^]/'
sed: -e expression #1, character 5: unterminated address regular expression
$echo | awk '/[^]/'
awk: cmd. line:1: /[^]/
awk: cmd. line:1: ^ unterminated regexp
awk: cmd. line:1: error: Unmatched [ or [^: /[^]//
$echo | perl -ne '/[^]/'
Unmatched [ in regex; marked by <-- HERE in m/[ <-- HERE ^]/ at -e line 1.
$echo | ruby -ne '/[^]/'
-e:1: empty char-class: /[^]/
$python -c 'import re;re.match("[^]","")'
Traceback (most recent call last):
File "
  File "E:\Python\lib\re.py", line 137, in match
    return _compile(pattern, flags).match(string)
  File "E:\Python\lib\re.py", line 244, in _compile
    raise error, v # invalid expression
sre_constants.error: unexpected end of regular expression
In JavaScript, the negated empty character class is a legal regex component. Its effect is exactly the opposite of the empty character class: it matches any character, including the newline "\n", so it is equivalent to the common [\s\S] and [\w\W]:
js> "whatever\n".match(/[^]/g) //Negated empty character class, matches any character
["w", "h", "a", "t", "e", "v", "e", "r", "\n"]
js> "whatever\n".match(/[\s\S]/g) //Complementary character classes, match any character
["w", "h", "a", "t", "e", "v", "e", "r", "\n"]
Note that it cannot be called an "always matching" regex, because a character class needs one character to match. If the target string is empty, or has already been consumed by the part of the regex to its left, the match fails, for example:
js> /abc[^]/.test("abc") //There are no characters after c, and the match fails.
false
If you want to know what a real "always matching" regex looks like, see an article I translated earlier: "Empty" regular expressions
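The following sketch compares the negated empty character class with [\s\S] and with the dot, which by default does not match a newline. The input string is an arbitrary example chosen to include a newline.

```javascript
// Arbitrary sample input ending in a newline.
const text = "ab\n";

const viaNegatedEmpty = text.match(/[^]/g);   // any character, newline included
const viaComplement = text.match(/[\s\S]/g);  // the common equivalent idiom
const viaDot = text.match(/./g);              // the dot skips "\n" without the s flag

// [^] still has to consume one character, so it fails when nothing is left.
const needsOneChar = /abc[^]/.test("abc");

console.log(viaNegatedEmpty); // [ 'a', 'b', '\n' ]
console.log(viaComplement);   // [ 'a', 'b', '\n' ]
console.log(viaDot);          // [ 'a', 'b' ]
console.log(needsOneChar);    // false
```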
3. []] and [^]]
This one is fairly simple to state. In the regular expressions of Perl and some Linux commands, if a right square bracket immediately follows the left square bracket of a character class, as in []], the right square bracket is treated as an ordinary character, so the class matches only "]". In JavaScript, this regex is read as an empty character class followed by a right square bracket, and since the empty character class matches nothing, neither does the whole regex. [^]] is similar: in JavaScript it matches any character (the negated empty character class) followed by a right square bracket, such as "a]" or "b]", whereas in other languages it matches any character other than ].
$perl -e 'print "]" =~ /[]]/'
$js -e 'print(/[]]/.test("]"))'
false
$perl -e 'print "x" =~ /[^]]/'
$js -e 'print(/[^]]/.test("x"))'
false
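To make the two readings concrete, here is a short sketch that contrasts the JavaScript interpretation with an explicitly escaped bracket, which is one way to get the "literal ]" meaning in JavaScript. It assumes regexes without the u flag, as in the rest of this article; the variable names are illustrative.

```javascript
// JavaScript reads /[]]/ as: empty class, then a literal "]".
const emptyThenBracket = /[]]/.test("]");      // the empty class can never match

// Escaping the bracket gives the class the Perl/grep meaning.
const literalBracket = /[\]]/.test("]");

// /[^]]/ is: any one character, then a literal "]".
const anyThenBracketOnX = /[^]]/.test("x");    // no "]" follows "x"
const anyThenBracketOnPair = /[^]]/.test("a]");

// The escaped form means "any character except ]".
const notBracket = /[^\]]/.test("x");

console.log(emptyThenBracket, literalBracket);                      // false true
console.log(anyThenBracketOnX, anyThenBracketOnPair, notBracket);   // false true true
```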
4. The $ anchor
Some beginners think that $ matches the newline character "\n". That is completely wrong: $ is a zero-width assertion, so it can never match an actual character, only a position. The difference I want to point out occurs in non-multiline mode. You might think that in non-multiline mode $ simply matches the position after the last character. It is not that simple. In most other languages, if the last character of the target string is a newline "\n", $ also matches the position just before that newline, so it matches at both positions on either side of the trailing newline. Many languages also provide the anchors \Z and \z. If you know the difference between the two, you will see that in other languages (Perl, Python, PHP, Java, C#...) $ in non-multiline mode is equivalent to \Z, whereas in JavaScript $ in non-multiline mode is equivalent to \z (it matches only at the very end, whether or not the last character is a newline). Ruby is a special case because it is always in multiline mode, where $ matches the position before every newline, which naturally includes a newline at the very end. These points are also covered in Yu Sheng's book "Regular Guide".
$perl -e 'print "whatever\n" =~ s/$/replacement character/rg' //Global replacement
whateverreplacement character //The position before the newline character is replaced
replacement character //The position after the newline character is also replaced
$js -e 'print("whatever\n".replace(/$/g, "replacement character"))' //Global replacement
whatever
replacement character //Only the position after the newline character is replaced
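The sketch below shows the same difference from the JavaScript side. The marker string "<END>" is made up, and the lookahead (?=\n?$) is only one possible way to approximate the \Z behavior of other languages, not something the language provides under that name.

```javascript
const input = "whatever\n";

// Non-multiline $: only the very end of the string (like \z elsewhere).
const plainDollar = input.replace(/$/g, "<END>");

// Multiline $: before each newline and at the very end.
const multilineDollar = input.replace(/$/gm, "<END>");

// Assumed emulation of \Z: the end, or just before one trailing newline.
const emulatedZ = input.replace(/(?=\n?$)/g, "<END>");

// A consequence for validation: a trailing newline makes this test fail in JavaScript.
const endsWithWord = /whatever$/.test(input);

console.log(JSON.stringify(plainDollar));     // "whatever\n<END>"
console.log(JSON.stringify(multilineDollar)); // "whatever<END>\n<END>"
console.log(JSON.stringify(emulatedZ));       // "whatever<END>\n<END>"
console.log(endsWithWord);                    // false
```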
5. Forward reference
We all know that regular expressions have backreferences: a backslash followed by a number refers to the string matched by an earlier capture group, either to match it again or to use it in a replacement result (where it is written with $ instead). But there is a special case: what happens if a backreference is used before the capture group it refers to has even started (a group starts at its left parenthesis)? Take the regex /(\2(a)){2}/: (a) is the second capture group, yet \2 refers to its result from a position to its left. Since regexes are matched from left to right, this is where the title of this section, forward reference, comes from. It is not a rigorous term. Now think about what the following JavaScript code returns:
js> /(\2(a)){2}/.exec("aaa")
???
Before answering, let's look at how other languages behave. In other languages, writing a regex this way is basically invalid as well:
$echo aaa | grep '(\2(a)){2}'
$echo aaa | sed '/(\2(a)){2}/'
sed: -e expression #1, character 12: illegal backreference
$echo aaa | awk '/(\2(a)){2}/'
$echo aaa | perl -ne 'print /(\2(a)){2}/'
$echo aaa | ruby -ne 'print $_ =~ /(\2(a)){2}/'
$python -c 'import re;print re.match("(\2(a)){2}","aaa")'
None
awk reports no error because awk does not support backreferences, so \2 is interpreted as the character with ASCII code 2. Perl, Ruby and Python report no error either. I don't know why they were designed this way; presumably they all follow Perl. The effect is the same in each, though: in this situation the match can never succeed.
In JavaScript, not only is no error reported, the match actually succeeds. See whether the answer is what you expected:
js> /(\2(a)){2}/.exec("aaa")
["aa", "a", "a"]
In case you have forgotten what exec returns: the first element is the complete matched string, that is RegExp["$&"], followed by the content matched by each capture group, that is RegExp.$1 and RegExp.$2. Why does the match succeed, and what does the matching process look like? My understanding is:
Matching first enters the first capture group (the leftmost left parenthesis), whose first item to match is \2. At this point the second capture group (a) has not yet taken part in the match, so RegExp.$2 is still undefined and \2 matches an empty string, or a "position", to the left of the first a in the target string, just like ^ and other zero-width assertions. The point is that this match succeeds. Moving on, the second capture group (a) matches the first a in the target string and RegExp.$2 is set to "a"; then the first capture group ends (the rightmost right parenthesis) and RegExp.$1 is also "a". Next comes the quantifier {2}: starting from the position after the first a in the target string, a new round of matching (\2(a)) begins. Here is the key question: does \2 now match the value "a" that RegExp.$2 was given at the end of the first round? The answer is no. At the start of the new round the values of RegExp.$1 and RegExp.$2 are reset to undefined, so \2 behaves as it did the first time and successfully matches an empty string (it has no effect; the regex would behave the same without it). (a) then matches the second a in the target string, RegExp.$1 and RegExp.$2 become "a" again, and RegExp["$&"] becomes the complete match, the first two a's: "aa".
In early versions of Firefox (3.6), a new round of matching started by a quantifier did not clear the values of existing capture groups. In other words, in the second round \2 matched the second a, so:
js> /(\2(a)){2}/.exec("aaa")
["aaa", "aa", "a"]
In addition, a capture group ends only when its right parenthesis is reached. Take /(a\1){3}/: when \1 is used, the first capture group has started matching but has not finished yet. This is also a forward reference, so \1 still matches the empty string:
js> /(a\1){3}/.exec("aaa")
["aaa", "a"]
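The two cases above can be checked with a short script. This is only a sketch of the behavior described in this article for ES5-compatible engines; the variable names are illustrative, and Array.from is used just to drop the extra properties of the exec result before printing.

```javascript
// \2 appears before group 2 has captured anything, so it matches the empty string.
const forwardRef = Array.from(/(\2(a)){2}/.exec("aaa"));

// \1 appears inside group 1, which has started but not yet closed.
const selfRef = Array.from(/(a\1){3}/.exec("aaa"));

// For comparison: the same patterns with the backreference removed match identically.
const withoutForwardRef = Array.from(/((a)){2}/.exec("aaa"));
const withoutSelfRef = Array.from(/(a){3}/.exec("aaa"));

console.log(forwardRef);        // [ 'aa', 'a', 'a' ]
console.log(selfRef);           // [ 'aaa', 'a' ]
console.log(withoutForwardRef); // [ 'aa', 'a', 'a' ]
console.log(withoutSelfRef);    // [ 'aaa', 'a' ]
```

The comparison lines illustrate the point made earlier: in these patterns the backreference has no effect, and writing it or leaving it out gives the same result.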
Here is another example, with an explanation:
js> /(?:(f)(o)(o)|(b)(a)(r))*/.exec("foobar")
["foobar", undefined, undefined, undefined, "b", "a", "r"]
* is a quantifier. After the first round of matching: $1 is "f", $2 is "o", $3 is "o", $4 is undefined, $5 is undefined, $6 is undefined.
At the start of the second round of matching: all captured values are reset to undefined.
After the second round of matching: $1 is undefined, $2 is undefined, $3 is undefined, $4 is "b", $5 is "a", $6 is "r".
$& is set to "foobar", and the match ends.
One last question:
js> /(?:^(a)|\1(a)|(ab)){2}/.exec("aab")
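The reset of captures between rounds of a quantifier can be observed directly. The sketch below runs the "foobar" example and, for contrast, a single round over "foo" alone; the variable names are made up for illustration.

```javascript
const pattern = /(?:(f)(o)(o)|(b)(a)(r))*/;

// Two rounds: "foo" then "bar". Only the captures from the last round survive.
const twoRounds = Array.from(pattern.exec("foobar"));

// One round: only the first alternative ever runs, so groups 4 to 6 stay undefined.
const oneRound = Array.from(pattern.exec("foo"));

console.log(twoRounds); // [ 'foobar', undefined, undefined, undefined, 'b', 'a', 'r' ]
console.log(oneRound);  // [ 'foo', 'f', 'o', 'o', undefined, undefined, undefined ]
```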
????