Unicode Equivalents for \w and \b in Java Regular Expressions?
Java's implementation of the Perl-style character class shortcuts (\w, \b, \s, etc.) matches only ASCII characters by default. To match Unicode text correctly, these shortcuts need to be rewritten into Unicode-aware equivalents.
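To make the limitation concrete, here is a small sketch of the default behavior on a standard JDK, with no flags passed to the pattern. The class and method names are made up for this illustration.

```java
import java.util.regex.Pattern;

final class AsciiShortcutDemo {
    // True when every character of the input is a word character
    // under Java's default (ASCII-only) shortcut rules.
    static boolean isWord(String s) {
        return Pattern.matches("\\w+", s);
    }

    public static void main(String[] args) {
        System.out.println(isWord("resume"));             // prints true
        System.out.println(isWord("r\u00E9sum\u00E9"));   // prints false: U+00E9 is not ASCII
    }
}
```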
Solution:
Use a custom function that rewrites the following character class escapes in a pattern string before it is compiled:
\w \W \s \S \v \V \h \H \d \D \b \B \X \R
Rewritten Definitions:
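\s => [\u0009-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]
\S => [^\u0009-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]
\v => [\u000A-\u000D\u0085\u2028\u2029]
\V => [^\u000A-\u000D\u0085\u2028\u2029]
\h => [\u0009\u0020\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]
\H => [^\u0009\u0020\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]
\w => [\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]
\W => [^\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]
\b => (?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]))
\B => (?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]))
\d => \p{Nd}
\D => \P{Nd}
\R => (?:(?>\u000D\u000A)|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029])
\X => (?>\PM\pM*)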
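As a rough illustration of how these definitions can be dropped into Java code, the sketch below stores the Unicode-aware word, whitespace, and digit classes as string constants and uses them in place of the built-in shortcuts. The class, constant, and method names are assumptions made for this example; only the character classes themselves come from the definitions above.

```java
import java.util.regex.Pattern;

final class UnicodeClassDemo {
    // Unicode-aware stand-ins for the word, whitespace and digit shortcuts.
    static final String W =
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";
    static final String S =
        "[\\u0009-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E\\u2000-\\u200A"
        + "\\u2028\\u2029\\u202F\\u205F\\u3000]";
    static final String D = "\\p{Nd}";

    // True when the whole input consists of Unicode word characters.
    static boolean isWord(String s) {
        return Pattern.matches(W + "+", s);
    }

    // True when the whole input consists of Unicode decimal digits.
    static boolean isNumber(String s) {
        return Pattern.matches(D + "+", s);
    }

    // Splits on runs of Unicode whitespace, including the no-break space.
    static String[] splitOnWhitespace(String s) {
        return s.split(S + "+");
    }

    public static void main(String[] args) {
        System.out.println(isWord("r\u00E9sum\u00E9"));             // accented Latin letters
        System.out.println(isNumber("\u0663\u0664"));               // Arabic-Indic digits
        System.out.println(splitOnWhitespace("a\u00A0b").length);   // split on U+00A0
    }
}
```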
Boundary Considerations:
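Java's native \b and \B are not defined strictly in terms of its \w, so the built-in boundary test and the built-in word-character test can disagree. In the rewritten forms, \w stands for the Unicode-aware class. The rewritten \b is an alternation of lookarounds, (?:(?<=\w)(?!\w)|(?<!\w)(?=\w)), which matches at boundaries: positions where a word character precedes and none follows, or where none precedes and one follows.
The rewritten \B is the complementary alternation, (?:(?<=\w)(?=\w)|(?<!\w)(?!\w)), which matches at non-boundaries: positions where word characters lie on both sides, or on neither side.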
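The following sketch shows one way to assemble a Unicode-aware boundary from lookarounds around the Unicode-aware word class and use it for whole-word search. The names UnicodeBoundaryDemo, W, B, and containsWord are hypothetical, and the word class is assumed to be the rewritten \w definition given earlier.

```java
import java.util.regex.Pattern;

final class UnicodeBoundaryDemo {
    // Unicode-aware word class (assumed to be the rewritten word-character definition).
    static final String W =
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";

    // Boundary: word char behind and none ahead, or none behind and one ahead.
    static final String B =
        "(?:(?<=" + W + ")(?!" + W + ")|(?<!" + W + ")(?=" + W + "))";

    // True when 'word' occurs in 'text' with a Unicode-aware boundary on both sides.
    static boolean containsWord(String text, String word) {
        return Pattern.compile(B + Pattern.quote(word) + B).matcher(text).find();
    }

    public static void main(String[] args) {
        String word = "\u00E9t\u00E9";
        // Standalone word surrounded by spaces.
        System.out.println(containsWord("un \u00E9t\u00E9 chaud", word));   // prints true
        // Same letters embedded at the end of a longer word.
        System.out.println(containsWord("r\u00E9p\u00E9t\u00E9", word));    // prints false
    }
}
```

Because each lookaround inspects a single character class, the lookbehinds stay fixed-width, which Java's regex engine requires.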
Complete Rewriting Function:
Grab the source code to get the full rewriting function mentioned above.
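For orientation only, here is a heavily simplified sketch of what such a rewriting function might look like. It is not the full function referred to above: the class UnicodeRegexRewriter and its rewrite method are hypothetical names, it covers just the word, whitespace, and digit escapes, and it assumes the pattern contains no \Q...\E quoted sections.

```java
import java.util.regex.Pattern;

final class UnicodeRegexRewriter {
    private static final String W_BODY =
        "\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]";
    private static final String S_BODY =
        "\\u0009-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E\\u2000-\\u200A"
        + "\\u2028\\u2029\\u202F\\u205F\\u3000";

    // Returns the Unicode-aware replacement for an escape letter, or null if none.
    private static String replacementFor(char escape) {
        switch (escape) {
            case 'w': return "[" + W_BODY + "]";
            case 'W': return "[^" + W_BODY + "]";
            case 's': return "[" + S_BODY + "]";
            case 'S': return "[^" + S_BODY + "]";
            case 'd': return "\\p{Nd}";
            case 'D': return "\\P{Nd}";
            default:  return null;
        }
    }

    // Rewrites supported escapes; every other escape pair is copied through unchanged,
    // so an escaped backslash followed by a letter is left alone.
    static String rewrite(String pattern) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '\\' && i + 1 < pattern.length()) {
                char next = pattern.charAt(++i);
                String replacement = replacementFor(next);
                if (replacement != null) {
                    out.append(replacement);
                } else {
                    out.append(c).append(next);
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile(rewrite("\\w+\\s\\d+"));
        // Accented word, no-break space, Arabic-Indic digit.
        System.out.println(p.matcher("caf\u00E9\u00A0\u0663").matches());
    }
}
```

Consuming the backslash together with the character after it is what keeps an escaped backslash from being mistaken for the start of a shortcut.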