How to Handle Unicode Characters in Java Regular Expressions Using \w and \b Equivalents?

Mary-Kate Olsen
Release: 2024-12-11 08:42:10

Unicode Equivalents for \w and \b in Java Regular Expressions?

By default, Java's Perl-style shortcuts (\w, \b, \s, etc.) do not handle Unicode text correctly: the character classes match only ASCII characters. To match Unicode characters correctly, rewrite these shortcuts as explicit Unicode-aware patterns.
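To make the limitation concrete, here is a minimal sketch that compares the default \w and \d with the rewritten definitions listed further down in this article. The class and method names (AsciiOnlyShortcuts, defaultWord, and so on) are made up for illustration and are not part of any library.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class AsciiOnlyShortcuts {
    // Rewritten \w class, copied from the definitions in this article.
    private static final String UNICODE_WORD =
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";

    // Default \w (no flags): only [a-zA-Z_0-9].
    static boolean defaultWord(String s) {
        return Pattern.compile("\\w+").matcher(s).matches();
    }

    // Rewritten \w: letters, marks, digits, and connector punctuation in any script.
    static boolean rewrittenWord(String s) {
        return Pattern.compile(UNICODE_WORD + "+").matcher(s).matches();
    }

    // Default \d (no flags): only [0-9].
    static boolean defaultDigits(String s) {
        return Pattern.compile("\\d+").matcher(s).matches();
    }

    // Rewritten \d: any decimal digit (general category Nd).
    static boolean rewrittenDigits(String s) {
        return Pattern.compile("\\p{Nd}+").matcher(s).matches();
    }

    public static void main(String[] args) {
        String word = "\u00E9l\u00E8ve";   // "élève"
        String digits = "\u0663\u0664";    // Arabic-Indic digits three and four
        System.out.println(defaultWord(word));       // false
        System.out.println(rewrittenWord(word));     // true
        System.out.println(defaultDigits(digits));   // false
        System.out.println(rewrittenDigits(digits)); // true
    }
}
```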

Solution:

Use a custom function to rewrite the following character class escapes:

\w \W \s \S \v \V \h \H \d \D \b \B \X \R

Rewritten Definitions:

\s => [\u0009-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]
\S => [^\u0009-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]

\v => [\u000A-\u000D\u0085\u2028\u2029]
\V => [^\u000A-\u000D\u0085\u2028\u2029]

\h => [\u0009\u0020\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]
\H => [^\u0009\u0020\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]

\w => [\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]
\W => [^\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]

\b => (?:(?<=[a-z0-9])(?![a-z0-9])|(?<![a-z0-9])(?=[a-z0-9]))
\B => (?:(?<=[a-z0-9])(?=[a-z0-9])|(?<![a-z0-9])(?![a-z0-9]))

\d => \p{Nd}
\D => \P{Nd}

\R => (?:(?>\u000D\u000A)|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029])

\X => (?>\PM\pM*)
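As a rough illustration of what such a rewriting function could look like, the sketch below handles only \w, \W, \d, and \D. It is not the article's full function: EscapeRewriter and rewrite are hypothetical names, and the scanner makes simplifying assumptions (no special handling for \Q...\E quoting, for escapes inside character classes, or for control escapes such as \cw).

```java
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical, simplified rewriter; not the full function the article refers to.
final class EscapeRewriter {
    // Body of the rewritten \w class, taken from the definitions above.
    private static final String WORD =
        "\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]";

    // Escape letter -> Unicode-aware replacement (only a subset of the list above).
    private static final Map<Character, String> REWRITES = Map.of(
        'w', "[" + WORD + "]",
        'W', "[^" + WORD + "]",
        'd', "\\p{Nd}",
        'D', "\\P{Nd}"
    );

    // Scans the regex left to right, consuming a backslash together with the
    // character after it, so an escaped backslash ("\\") never triggers a rewrite.
    static String rewrite(String regex) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                char next = regex.charAt(++i);
                String replacement = REWRITES.get(next);
                if (replacement != null) {
                    out.append(replacement);
                } else {
                    out.append(c).append(next); // leave other escapes untouched
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile(rewrite("\\w+-\\d+"));
        // "élève-٣٤": Latin letters with accents, a hyphen, Arabic-Indic digits.
        System.out.println(p.matcher("\u00E9l\u00E8ve-\u0663\u0664").matches()); // true
    }
}
```

The pattern is rewritten once, before Pattern.compile, so the matching code itself does not need to change.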

Boundary Considerations:

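Java's \b and \B are not defined purely in terms of \w. The rewritten \b is an alternation of two lookaround pairs, (?:(?<=\w)(?!\w)|(?<!\w)(?=\w)), where \w stands for the rewritten word class. It matches a boundary where: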

  • IF follows word ==> THEN doesn't precede word
  • ELSIF doesn't follow word ==> THEN does precede word

The rewritten \B uses the complementary lookaround pairs, (?:(?<=\w)(?=\w)|(?<!\w)(?!\w)), and matches a non-boundary where:

  • IF follows word ==> THEN does precede word
  • ELSIF doesn't follow word ==> THEN doesn't precede word
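The boundary logic above can be sketched with plain lookarounds. The example below builds a Unicode-aware \b from the rewritten word class and uses it to extract words. UnicodeBoundary and wordsIn are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class UnicodeBoundary {
    // Rewritten \w class from the definitions in this article.
    private static final String W =
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";

    // Boundary: (follows word AND doesn't precede word)
    //        OR (doesn't follow word AND precedes word).
    private static final String BOUNDARY =
        "(?:(?<=" + W + ")(?!" + W + ")|(?<!" + W + ")(?=" + W + "))";

    // Equivalent of \b\w+\b built entirely from the rewritten pieces.
    private static final Pattern WORD =
        Pattern.compile(BOUNDARY + W + "+" + BOUNDARY);

    // Returns every maximal run of word characters in the text.
    static List<String> wordsIn(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // "élève, naïve"
        System.out.println(wordsIn("\u00E9l\u00E8ve, na\u00EFve")); // [élève, naïve]
    }
}
```

Each lookaround inspects a single character, so the lookbehinds stay fixed-width, which Java requires to be bounded.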

Complete Rewriting Function:

Grab the source code to get the full rewriting function mentioned above.

Additional Features:

  • Unicode character input in logical code points
  • Convenience definitions for natural-language words, dashes, hyphens, and apostrophes
  • Augmentation of regex escapes and unescaping of string escapes
