Character set

Through the examples above, we now have a preliminary understanding of regular expressions in Python. You may be wondering what the rules of regular expressions are and what each of the letters means.

There is no need to worry about that yet. A list of regular expression rules is given later in this chapter, and the same information is easy to find online. For now, let's deepen our understanding by looking at character sets.

A character set is a group of characters enclosed in a pair of square brackets "[]". A character set matches exactly one character, which may be any one of the characters listed in it.

For example, C[ET]O matches either CEO or CTO, because [ET] stands for a single E or a single T. Likewise, the [a-z] mentioned earlier stands for any one lowercase letter; the hyphen "-" defines a range of consecutive characters. A character set can contain several ranges, such as [0-9a-fA-F], which matches a single hexadecimal digit in either upper or lower case. Note that the order in which characters and ranges are listed has no effect on the match.
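
The patterns in this paragraph can be tried out directly. The following is a minimal sketch; the sample strings are made up for illustration and are not part of the original examples.

```python
import re

# Hypothetical sample text, chosen only to exercise the patterns.
text = 'The CEO met the CTO and the COO.'

# [ET] matches exactly one character: either E or T.
titles = re.findall(r'C[ET]O', text)
print(titles)  # ['CEO', 'CTO']

# Several ranges can share one set: any single hexadecimal digit.
hex_digits = re.findall(r'[0-9a-fA-F]', 'x1Fg')
print(hex_digits)  # ['1', 'F']

# The order of items inside the brackets does not change the result.
reordered = re.findall(r'[A-Fa-f0-9]', 'x1Fg')
print(reordered == hex_digits)  # True
```

COO is not matched because O is not listed in [ET], and x and g are skipped because they fall outside every range in the hexadecimal set.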

The point of all this is that the characters inside a pair of square brackets "[]" are in an "or" relationship. Let's look at an example:

import re
a = 'uav,ubv,ucv,uwv,uzv,ucv,uov'
# Character set
# Match strings where the character between u and v is a, b, or c
findall = re.findall('u[abc]v', a)
print(findall)
# Consecutive letters or digits can be written as a range with -
l = re.findall('u[a-c]v', a)
print(l)
# Match strings where the character between u and v is not a, b, or c
re_findall = re.findall('u[^abc]v', a)
print(re_findall)
Output result:
['uav', 'ubv', 'ucv', 'ucv']
['uav', 'ubv', 'ucv', 'ucv']
['uwv', 'uzv', 'uov']

This example uses a negated character set: if the left square bracket "[" is immediately followed by a caret "^", the character set is negated and matches any single character that is not listed. One thing to remember is that a negated character set must still match one character. For example, q[^u] does not mean "a q that is not followed by u". It means "a q followed by a character that is not u", so a q at the very end of the string does not match. Comparing the outputs in the example above makes this concrete.
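
The q[^u] case is easy to check for yourself. The following is a small sketch; the input strings are hypothetical and chosen only to show the difference.

```python
import re

# 'Iraq' ends with q, so there is no character left for [^u] to consume.
no_match = re.findall(r'q[^u]', 'Iraq')
print(no_match)  # []

# Here the q is followed by a space, which is "a character that is not u".
with_space = re.findall(r'q[^u]', 'Iraq is')
print(with_space)  # ['q ']

# A q followed by u is rejected.
followed_by_u = re.findall(r'q[^u]', 'quit')
print(followed_by_u)  # []
```

Note that the match in the second case is two characters long: the negated set consumes the space after the q.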

Regular expressions also predefine some shorthand character classes, such as \d, which matches any digit character. For ASCII text it is equivalent to [0-9] (in Python 3, \d in a str pattern also matches other Unicode decimal digits unless the re.ASCII flag is set). The example below shows that these shorthand classes behave like character sets.

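import re
a = 'uav_ubv_ucv_uwv_uzv_ucv_uov&123-456-789'
# Shorthand character classes
# Raw strings (r'...') keep Python from treating \d, \D and \w as string escapes
# \d is equivalent to [0-9] and matches every digit character
# \D is equivalent to [^0-9] and matches every non-digit character
findall1 = re.findall(r'\d', a)
findall2 = re.findall(r'[0-9]', a)
findall3 = re.findall(r'\D', a)
findall4 = re.findall(r'[^0-9]', a)
print(findall1)
print(findall2)
print(findall3)
print(findall4)
# \w matches any word character, including the underscore;
# for ASCII text it is equivalent to [A-Za-z0-9_]
findall5 = re.findall(r'\w', a)
findall6 = re.findall(r'[A-Za-z0-9_]', a)
print(findall5)
print(findall6)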

Output result:

['1', '2', '3', '4', '5', '6', '7', '8', '9']
['1', '2', '3', '4', '5', '6', '7', '8', '9']
['u', 'a', 'v', '_', 'u', 'b', 'v', '_', 'u', 'c', 'v', '_', 'u', 'w', 'v', '_', 'u', 'z', 'v', '_', 'u', 'c', 'v', '_', 'u', 'o', 'v', '&', '-', '-']
['u', 'a', 'v', '_', 'u', 'b', 'v', '_', 'u', 'c', 'v', '_', 'u', 'w', 'v', '_', 'u', 'z', 'v', '_', 'u', 'c', 'v', '_', 'u', 'o', 'v', '&', '-', '-']
['u', 'a', 'v', '_', 'u', 'b', 'v', '_', 'u', 'c', 'v', '_', 'u', 'w', 'v', '_', 'u', 'z', 'v', '_', 'u', 'c', 'v', '_', 'u', 'o', 'v', '1', '2', '3', '4', '5', '6', '7', '8', '9']
['u', 'a', 'v', '_', 'u', 'b', 'v', '_', 'u', 'c', 'v', '_', 'u', 'w', 'v', '_', 'u', 'z', 'v', '_', 'u', 'c', 'v', '_', 'u', 'o', 'v', '1', '2', '3', '4', '5', '6', '7', '8', '9']
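
One caveat on these equivalences: the outputs above agree because the input is pure ASCII. In Python 3, \d and \w in a str pattern are Unicode-aware by default, and the re.ASCII flag restricts them to the bracketed forms. The following sketch illustrates the difference; the sample strings are assumptions made for illustration.

```python
import re

# Hypothetical sample: three Arabic-Indic digits, a space, three ASCII digits.
sample = '\u0661\u0662\u0663 123'

unicode_digits = re.findall(r'\d', sample)
ascii_digits = re.findall(r'\d', sample, flags=re.ASCII)
bracket_digits = re.findall(r'[0-9]', sample)

print(len(unicode_digits))              # 6
print(ascii_digits)                     # ['1', '2', '3']
print(ascii_digits == bracket_digits)   # True

# \w behaves the same way: it accepts accented letters unless re.ASCII is set.
word_sample = '\u00e9_1'  # an accented e, an underscore, and a digit
unicode_words = re.findall(r'\w', word_sample)
ascii_words = re.findall(r'\w', word_sample, flags=re.ASCII)
print(len(unicode_words))  # 3
print(ascii_words)         # ['_', '1']
```

If you only ever want the ASCII digits 0 to 9, writing [0-9] explicitly or passing re.ASCII avoids surprises with international text.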
