A web search for Python regular expressions returns tens of millions of results, which suggests strong demand for learning them. So what is a regular expression? It is a special sequence of characters that lets you check whether a string matches a given pattern. Below is a brief, example-based introduction to matching Chinese text with Python regular expressions, for anyone who needs it.
Regular expressions are not part of the Python programming language itself (http://www.maiziedu.com/course/python/). They are a powerful tool for processing strings, with their own syntax and an independent processing engine. They may be less efficient than str's built-in methods, but they are far more capable. Because of this independence, regular expression syntax is largely the same across the languages that provide it; the main difference is how much of the syntax each implementation supports. Don't worry, though: the unsupported parts are usually the rarely used ones.
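As a rough illustration of the trade-off between str methods and regular expressions, the sketch below does the same fixed-string tasks both ways, then one task that needs a pattern. The sample strings are made up for illustration, and no timing is measured here.

```python
import re

s = "hello world"

# Fixed-string checks and replacements: plain str methods are enough.
print(s.startswith("hello"))          # True
print(s.replace("world", "Python"))   # hello Python

# The same tasks with re: more general, but more machinery for a fixed string.
print(re.match(r"hello", s) is not None)   # True
print(re.sub(r"world", "Python", s))       # hello Python

# Splitting on any run of whitespace is awkward with a fixed separator,
# but easy to express as a pattern.
print(re.split(r"\s+", "a  b \t c"))   # ['a', 'b', 'c']
```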
Introduction to Python Regular Expressions
A regular expression is a special sequence of characters that can help you easily check whether a string matches a certain pattern.
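As a minimal sketch of checking a string against a pattern, the snippet below tests whether a string has the shape of a date. The pattern and the helper name looks_like_date are assumptions chosen for illustration, not part of the original text.

```python
import re

# Hypothetical helper: True if text has exactly the shape YYYY-MM-DD.
def looks_like_date(text):
    # fullmatch succeeds only when the whole string fits the pattern.
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", text) is not None

print(looks_like_date("2016-05-17"))    # True
print(looks_like_date("May 17, 2016"))  # False
```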
The re module has been part of Python since version 1.5 and provides Perl-style regular expression patterns. It brings full regular expression capabilities to the Python language.
The compile function generates a regular expression object from a pattern string and optional flag arguments. This object has a series of methods for regular expression matching and replacement.
The re module also provides functions identical to these methods, which take a pattern string as their first argument.
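The two calling styles described above can be compared side by side. This is a small sketch; the sample text and variable names are chosen here for illustration.

```python
import re

text = "Phone No. 010-87654321"

# Style 1: compile once, then call methods on the pattern object.
pattern = re.compile(r"\d+")
from_object = pattern.findall(text)

# Style 2: module-level function, pattern string as the first argument.
from_module = re.findall(r"\d+", text)

print(from_object)                 # ['010', '87654321']
print(from_object == from_module)  # True

# The optional flags argument of compile, here case-insensitive matching.
case_insensitive = re.compile(r"phone", re.IGNORECASE)
print(case_insensitive.search(text).group())  # Phone
```

Compiling is mainly a convenience when the same pattern is used many times; both styles give the same results.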
That is the background. Now let's look at how Python regular expressions match Chinese.
# -*- coding: utf-8 -*-
# Written for Python 3, where str is already Unicode.
import re

def findPart(regex, text, name):
    # Collect every non-overlapping match of regex in text.
    res = re.findall(regex, text)
    if res:
        print("There are %d %s parts:\n" % (len(res), name))
        for r in res:
            print("\t" + r)
        print()

text = " #who#helloworld#a中文x#"
# \w covers word characters; \u2E80-\u9FFF covers CJK symbols and ideographs.
findPart(r"#[\w\u2E80-\u9FFF]+#", text, "unicode chinese")
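On Python 3, where every str is already Unicode, \w on its own matches Chinese characters as well, so picking out only the Chinese part needs an explicit range. The following is a minimal sketch, assuming "Chinese" means the 4E00~9FFF ideograph range; the name han_run is made up for illustration.

```python
import re

text = " #who#helloworld#a中文x#"

# Assumption: "Chinese" means the basic CJK ideograph range 4E00-9FFF.
han_run = re.compile(r"[\u4e00-\u9fff]+")

# Only the run of ideographs is returned.
print(han_run.findall(text))      # ['中文']

# For comparison, \w+ on a Python 3 str also swallows the Chinese characters.
print(re.findall(r"\w+", text))   # ['who', 'helloworld', 'a中文x']
```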
Note:
Unicode ranges of several major non-English scripts (hexadecimal):
2E80~33FFh: CJK (Chinese, Japanese and Korean) symbols area. Contains Kangxi dictionary radicals, CJK supplementary radicals, phonetic symbols (Bopomofo), Japanese kana, Korean jamo, CJK symbols and punctuation, circled or parenthesized characters and numerals, months, as well as Japanese kana combinations, units, era names, months, dates, times, and so on.
3400~4DFFh: CJK Unified Ideographs Extension A, containing 6,582 CJK ideographs.
4E00~9FFFh: CJK Unified Ideographs, containing 20,902 CJK ideographs.
A000~A4FFh: Yi script area, containing the syllables and radicals of the Yi script of southern China.
AC00~D7FFh: Hangul syllables area, containing the syllables composed from Korean jamo.
F900~FAFFh: CJK Compatibility Ideographs, containing 302 CJK ideographs.
FB00~FFFDh: presentation forms area, containing Latin ligatures, Hebrew, Arabic, CJK vertical punctuation, small form variants, half-width forms, and full-width forms.
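The ranges listed above can be turned into a small lookup table and into regular expression character classes. This is only a sketch: the shortened labels and the helper names range_label and char_class are assumptions made for illustration.

```python
import re

# (label, first code point, last code point), taken from the list above.
RANGES = [
    ("CJK symbols, radicals and kana", 0x2E80, 0x33FF),
    ("CJK ideographs extension A", 0x3400, 0x4DFF),
    ("CJK unified ideographs", 0x4E00, 0x9FFF),
    ("Yi", 0xA000, 0xA4FF),
    ("Hangul syllables", 0xAC00, 0xD7FF),
    ("CJK compatibility ideographs", 0xF900, 0xFAFF),
    ("Presentation forms", 0xFB00, 0xFFFD),
]

def range_label(ch):
    """Return the label of the listed range containing ch, or None."""
    code = ord(ch)
    for label, first, last in RANGES:
        if first <= code <= last:
            return label
    return None

def char_class(first, last):
    """Build a character class such as [\\u4E00-\\u9FFF] for use with re."""
    return "[\\u%04X-\\u%04X]" % (first, last)

for ch in "中あ한a":
    print(ch, range_label(ch))

# Match runs of characters from the 4E00-9FFF range only.
ideographs = re.compile(char_class(0x4E00, 0x9FFF) + "+")
print(ideographs.findall("a中文x"))   # ['中文']
```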
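Example matching UTF-8 encoded bytes
#!/usr/bin/python3
# -*- coding: UTF-8 -*-
import re

# Encode both the text and the pattern to UTF-8 bytes so their types match.
message = '天人合一'.encode('utf8')
match = re.search('人'.encode('utf8'), message)
# group() returns bytes here; decode it to print the character itself.
print(match.group().decode('utf8'))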
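On Python 3 the encoding step is optional, because a str pattern can be applied to a str directly. The sketch below contrasts the two approaches and shows what happens when they are mixed; the variable names are chosen here for illustration.

```python
import re

message = "天人合一"
encoded = message.encode("utf8")

# str pattern on str text: positions are counted in characters.
str_match = re.search("人", message)
print(str_match.group(), str_match.span())   # 人 (1, 2)

# bytes pattern on bytes text: positions are counted in bytes.
bytes_match = re.search("人".encode("utf8"), encoded)
print(bytes_match.group().decode("utf8"), bytes_match.span())   # 人 (3, 6)

# Mixing a str pattern with bytes text raises TypeError.
try:
    re.search("人", encoded)
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)   # False
```

Matching on str is usually the simpler choice, since character ranges such as 4E00~9FFF only make sense on decoded text.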
Example in interactive mode
>>> import re
>>> s='Phone No. 010-87654321'
>>> r=re.compile(r'(\d+)-(\d+)')
>>> m=r.search(s)
>>>
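The session above stops after the search call. As a sketch of how the resulting match object could be inspected (this continuation is an assumption, not the original author's code), the same steps are repeated as a script:

```python
import re

s = "Phone No. 010-87654321"
r = re.compile(r"(\d+)-(\d+)")
m = r.search(s)

# group(0) is the whole match; group(1) and group(2) are the
# two parenthesized sub-patterns.
print(m.group(0))   # 010-87654321
print(m.group(1))   # 010
print(m.group(2))   # 87654321
print(m.groups())   # ('010', '87654321')
```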
This topic was reviewed and approved by Xiaobei on 2016-5-17 13:27