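My project uses word segmentation, but the output contains a lot of irrelevant tokens, such as punctuation and assorted symbols.
I later added a Chinese stop-word list to the blacklist, which removes the vast majority of them, but a few odd symbols still get through.
So I am looking for a regex that accepts only digits, letters (case-insensitive), Chinese characters, or any combination of these.
For example:
Thanks.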
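Use the ranges \u4E00-\u9FA5 and \uF900-\uFA2D for Chinese characters, plus \w for letters and digits (note that \w also matches the underscore).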
public class RegexStuff {
    public static void main(String[] args) {
        // One or more characters, each either a CJK character or a \w character.
        String regex = "([\u4E00-\u9FA5\uF900-\uFA2D]|\\w)+";
        //Pattern pattern = Pattern.compile(regex);
        String str1 = "abcF";
        String str2 = "as212";
        String str3 = "das你好1d";
        String str4 = "34D4H好";
        String str5 = "大家";
        // String.matches requires the whole string to match the regex.
        System.out.println(str1.matches(regex)); // true
        System.out.println(str2.matches(regex)); // true
        System.out.println(str3.matches(regex)); // true
        System.out.println(str4.matches(regex)); // true
        System.out.println(str5.matches(regex)); // true
    }
}
Reference: http://blog.csdn.net/sww_simpcity/article/details/9082993
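If underscores should also be rejected, as the question seems to require, one possible variant is to spell out the digit and letter ranges instead of using \w, and to precompile the pattern so it can be reused for every token. The sketch below is only an illustration under those assumptions: the class name TokenFilter and the methods isCleanToken and keepCleanTokens are hypothetical names, not part of the original answer, and the CJK ranges are simply the ones quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (name chosen for illustration only).
class TokenFilter {
    // Digits, ASCII letters in both cases, and the two CJK ranges from the answer.
    // Unlike \w, this character class does not accept the underscore.
    private static final Pattern CLEAN_TOKEN =
            Pattern.compile("[0-9A-Za-z\\u4E00-\\u9FA5\\uF900-\\uFA2D]+");

    // True only if the entire token consists of allowed characters.
    static boolean isCleanToken(String token) {
        return token != null && CLEAN_TOKEN.matcher(token).matches();
    }

    // Returns the tokens that pass the check, preserving their order.
    static List<String> keepCleanTokens(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (isCleanToken(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens =
                Arrays.asList("das你好1d", "!", "大家", "a_b", "……", "34D4H好");
        System.out.println(keepCleanTokens(tokens)); // [das你好1d, 大家, 34D4H好]
    }
}
```

Applied to the segmenter output, keepCleanTokens drops pure punctuation as well as any token that mixes allowed characters with symbols. Whether such mixed tokens should be dropped or merely stripped of the symbols depends on the project.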