Rewriting \w and \b in Java Regexes for Unicode Compatibility
Java's \w and \b regular expression shortcuts have limited Unicode support. By default, \w matches only the ASCII characters [a-zA-Z_0-9]. To address this, you can rewrite these shortcuts using the following Unicode-aware definitions:
\w (words) => [\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]

\W (non-words) => [^\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]

\b (word boundary) => (?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]))

\B (non-word boundary) => (?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]))
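As a rough illustration of how these definitions might be used, the sketch below builds the word-character class and the word boundary as Java strings and extracts words from accented text. The class name `UnicodeWordDemo`, the constants `WORD` and `BOUNDARY`, and the helper `words` are hypothetical names chosen for this example; only the pattern text comes from the definitions above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWordDemo {
    // Unicode-aware word-character class, copied from the definition above.
    // Each regex backslash is doubled because this is a Java string literal.
    static final String WORD =
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";

    // Unicode-aware word boundary: a word character on exactly one side.
    static final String BOUNDARY =
        "(?:(?<=" + WORD + ")(?!" + WORD + ")|(?<!" + WORD + ")(?=" + WORD + "))";

    private static final Pattern WORDS =
        Pattern.compile(BOUNDARY + WORD + "+" + BOUNDARY);

    // Hypothetical helper: returns every maximal run of word characters.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = WORDS.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Accented letters are written as escapes so the file stays ASCII.
        String text = "\u00E9lan na\u00EFve caf\u00E9";
        System.out.println(words(text));

        // For contrast: the built-in shortcut with no flags set, which
        // treats the accented letters as non-word characters.
        Matcher ascii = Pattern.compile("\\w+").matcher(text);
        while (ascii.find()) {
            System.out.println(ascii.group());
        }
    }
}
```

Building the boundary from the `WORD` constant keeps the two definitions in sync, so a change to the character class carries over to the boundary automatically.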
Other Unicode-Aware Regexp Shortcuts:
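Regexp Shortcut | Unicode-Aware Definition
---|---
\s (whitespace) | [\u0009-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]
\S (non-whitespace) | [^\u0009-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]
\v (vertical whitespace) | [\u000A-\u000D\u0085\u2028\u2029]
\V (non-vertical whitespace) | [^\u000A-\u000D\u0085\u2028\u2029]
\h (horizontal whitespace) | [\u0009\u0020\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]
\H (non-horizontal whitespace) | [^\u0009\u0020\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]
\d (digits) | \p{Nd}
\D (non-digits) | \P{Nd}
\R (line break) | (?:(?>\u000D\u000A)\|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029])
\X (extended grapheme cluster) | (?>\PM\pM*)

In the \R row, \| is an escaped table pipe and stands for a plain | alternation in the actual pattern.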
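A minimal sketch of how some of the definitions in this table could be applied from Java, assuming they are stored as string constants with doubled backslashes. The class `UnicodeShortcutsDemo` and its helper methods are hypothetical names introduced for illustration; the pattern text is taken from the whitespace, digit, line-break, and grapheme rows.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeShortcutsDemo {
    // Whitespace class from the table, split across lines for readability.
    static final String SPACE =
        "[\\u0009-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E"
        + "\\u2000-\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000]";

    // Any decimal digit in any script.
    static final String DIGIT = "\\p{Nd}";

    // A CR LF pair counts as one break; otherwise any single break character.
    static final String LINEBREAK =
        "(?:(?>\\u000D\\u000A)|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])";

    // One non-mark character followed by any number of combining marks.
    static final String GRAPHEME = "(?>\\PM\\pM*)";

    // Hypothetical helper: splits on runs of Unicode whitespace.
    static String[] splitOnWhitespace(String s) {
        return s.split(SPACE + "+");
    }

    // Hypothetical helper: splits on line breaks, keeping empty lines.
    static String[] splitLines(String s) {
        return s.split(LINEBREAK, -1);
    }

    // Hypothetical helper: true if s consists only of decimal digits.
    static boolean isAllDigits(String s) {
        return s.matches(DIGIT + "+");
    }

    // Hypothetical helper: counts matches of the grapheme pattern.
    static int countGraphemes(String s) {
        Matcher m = Pattern.compile(GRAPHEME).matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // A no-break space and an ideographic space separate the fields.
        System.out.println(splitOnWhitespace("a\u00A0b\u3000c").length);
        // Mixed line endings: CR LF, LINE SEPARATOR, NEXT LINE.
        System.out.println(splitLines("a\r\nb\u2028c\u0085d").length);
        // Arabic-Indic digits three and four.
        System.out.println(isAllDigits("\u0663\u0664"));
        // The letter e with a combining acute accent, then the letter a.
        System.out.println(countGraphemes("e\u0301a"));
    }
}
```

The atomic group around the CR LF pair is what keeps a Windows-style line ending from being counted as two separate breaks, which would otherwise produce an empty line between them.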