Rewriting `\w` and `\b` in Java Regular Expressions for Unicode Compatibility

Java's `\w` and `\b` regex shortcuts have limited Unicode support: by default, `\w` matches only the ASCII characters `[a-zA-Z_0-9]`. To work around this, you can replace these shortcuts with the following Unicode-aware definitions:

`\w` (word characters) => `[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]`

`\W` (non-word characters) => `[^\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]`

`\b` (word boundary) => `(?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]))`

`\B` (non-word boundary) => `(?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]))`
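Because the same character class appears in every definition, it is easier to build the patterns from one string constant than to paste them by hand. The following is a minimal sketch of that idea, not a definitive implementation; the class name `UnicodeWordDemo`, the constants `WORD` and `BOUNDARY`, and the helper `findAll` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWordDemo {
    // Unicode-aware replacement for the word-character shortcut,
    // copied from the definition given in the text.
    static final String WORD =
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";

    // Unicode-aware word boundary: a word character on exactly one side.
    static final String BOUNDARY =
        "(?:(?<=" + WORD + ")(?!" + WORD + ")|(?<!" + WORD + ")(?=" + WORD + "))";

    // Hypothetical helper: collects every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // "naive cafe" written with i-diaeresis (U+00EF) and e-acute (U+00E9).
        String text = "na\u00EFve caf\u00E9";

        // The default shortcut stops at every non-ASCII letter: [na, ve, caf]
        System.out.println(findAll("\\w+", text));

        // The rewritten class keeps each word whole (two matches).
        System.out.println(findAll(WORD + "+", text));

        // The rewritten boundary anchors a whole-word search (one match).
        System.out.println(findAll(BOUNDARY + "caf\u00E9" + BOUNDARY, text));
    }
}
```

Composing the patterns from `WORD` keeps the four definitions in sync if the character class ever needs adjusting.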
Other Unicode-aware regex shortcuts:

Regex shortcut | Unicode-aware definition
---|---
`\s` (whitespace) | `[\u0009-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]`
`\S` (non-whitespace) | `[^\u0009-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]`
`\v` (vertical whitespace) | `[\u000A-\u000D\u0085\u2028\u2029]`
`\V` (non-vertical whitespace) | `[^\u000A-\u000D\u0085\u2028\u2029]`
`\h` (horizontal whitespace) | `[\u0009\u0020\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]`
`\H` (non-horizontal whitespace) | `[^\u0009\u0020\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]`
`\d` (digits) | `\p{Nd}`
`\D` (non-digits) | `\P{Nd}`
`\R` (line break) | `(?:(?>\u000D\u000A)\|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029])`
`\X` (extended grapheme cluster) | `(?>\PM\pM*)`
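As a rough illustration of how these definitions might be used, the sketch below wraps four of them in small helper methods. The class name `UnicodeShortcutsDemo` and all method and constant names are assumptions made for this example. Note that inside a Java string literal the escapes are written with a doubled backslash (`\\u000A`), so that the regex engine, not the compiler, interprets them.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeShortcutsDemo {
    // Whitespace, digit, line-break and grapheme-cluster definitions from the table.
    static final String SPACE =
        "[\\u0009-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E\\u2000-\\u200A"
        + "\\u2028\\u2029\\u202F\\u205F\\u3000]";
    static final String DIGIT = "\\p{Nd}";
    static final String LINE_BREAK =
        "(?:(?>\\u000D\\u000A)|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])";
    static final String CLUSTER = "(?>\\PM\\pM*)";

    // Splits on runs of Unicode whitespace.
    static String[] splitOnSpace(String s) {
        return s.split(SPACE + "+");
    }

    // True when the whole string consists of decimal digits from any script.
    static boolean allDigits(String s) {
        return Pattern.matches(DIGIT + "+", s);
    }

    // Splits on any line break, treating CR LF as a single break.
    static String[] splitLines(String s) {
        return s.split(LINE_BREAK);
    }

    // Counts base characters together with their combining marks.
    static int countClusters(String s) {
        Matcher m = Pattern.compile(CLUSTER).matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Words separated by a no-break space (U+00A0) and an ideographic space (U+3000).
        System.out.println(splitOnSpace("foo\u00A0bar\u3000baz").length); // 3

        // Arabic-Indic digits four and two.
        System.out.println(allDigits("\u0664\u0662")); // true

        // CR LF followed by a line separator (U+2028).
        System.out.println(splitLines("a\r\nb\u2028c").length); // 3

        // "e" plus a combining acute accent, then "a".
        System.out.println(countClusters("e\u0301a")); // 2
    }
}
```

For comparison, with no flags set, Java's built-in `\s` does not match U+00A0 and `\d` does not match the Arabic-Indic digits, so the equivalent calls using the default shortcuts return one piece and `false` respectively.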