Tokenizing Strings with Spaces, Excluding Quoted Substrings in Java
Separating a string based on spaces can be straightforward, but what if certain parts of the string are quoted and should be treated as a single token? In Java, you can achieve this nuanced splitting using regular expressions.
To handle this scenario, the following approach can be used:
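import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "Location \"Welcome to india\" Bangalore Channai \"IT city\" Mysore";
List<String> list = new ArrayList<>();
// In a Java string literal the regex escapes \S and \s must be written as \\S and \\s.
Matcher m = Pattern.compile("([^\"]\\S*|\".+?\")\\s*").matcher(str);
// Each successful find() yields one token in capture group 1.
while (m.find())
    list.add(m.group(1));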
Rather than splitting on whitespace, the regular expression matches one token at a time: either a run of non-whitespace characters or a complete quoted substring. Because each quoted substring is captured as a single token, phrases like "Welcome to india" remain intact.
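The regular expression can be understood as follows. The first alternative, [^"]\S*, matches a token that starts with any character other than a double quote and continues with zero or more non-whitespace characters. The second alternative, ".+?", matches an opening quote, then as few characters as possible (the reluctant quantifier +?), up to the next quote. The surrounding parentheses capture the token as group 1, and the trailing \s* consumes any whitespace after it so that the next match starts at the next token.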
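This allows strings containing quoted phrases to be tokenized correctly. For the example above, the resulting list contains six tokens: Location, "Welcome to india", Bangalore, Channai, "IT city", and Mysore. Note that the quoted tokens keep their surrounding quotation marks, because the quotes are part of capture group 1.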
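If the quotation marks are not wanted in the output, the same pattern can be wrapped in a small helper that strips them. The following is only a sketch under stated assumptions: the class name QuotedTokenizer and the method tokenize are hypothetical names chosen for illustration, and both the call to trim() (which keeps leading whitespace from being absorbed into the first token) and the quote stripping are additions that are not part of the original snippet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption made for this example.
final class QuotedTokenizer {

    // Same pattern as in the article, compiled once and reused.
    private static final Pattern TOKEN =
            Pattern.compile("([^\"]\\S*|\".+?\")\\s*");

    // Returns the tokens of input, with surrounding quotes removed
    // from quoted phrases (an addition to the original approach).
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        // trim() is an added safeguard: without it, leading whitespace
        // would be matched by [^"] and become part of the first token.
        Matcher m = TOKEN.matcher(input.trim());
        while (m.find()) {
            String token = m.group(1);
            // Only the second alternative can produce a token that
            // starts with a quote, so this strips quoted phrases only.
            if (token.length() >= 2
                    && token.startsWith("\"")
                    && token.endsWith("\"")) {
                token = token.substring(1, token.length() - 1);
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String str = "Location \"Welcome to india\" Bangalore "
                + "Channai \"IT city\" Mysore";
        // Prints one token per line.
        for (String token : tokenize(str)) {
            System.out.println(token);
        }
    }
}
```

Compiling the pattern once in a static field avoids recompiling it on every call. This sketch does not handle escaped quotes inside a quoted phrase, and an unterminated quote is not treated specially, so input of that kind would need a more elaborate pattern or a hand-written parser.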