This article shows how to screen resumes with Python. It defines a ReadDoc class to read Word files and a search_word function to filter them by keyword. I hope it will be helpful to everyone.
The resume screening task is as follows:
Known conditions:
We want to find the resume files that contain the specified keywords (such as Python or Java).
Implementation idea:
Read the Word files in batches (collecting their paths with glob), extract all of their readable content, and filter that content by keyword to get the paths of the target resumes.
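The batch-collection step can be sketched on its own. This is a minimal illustration, not the article's script: the helper name list_docx and the placeholder file names are assumptions made for the demo, which runs against a throwaway temporary folder.

```python
import glob
import os
import tempfile


def list_docx(folder):
    """Return the sorted paths of the .docx files directly inside folder."""
    pattern = os.path.join(folder, '*.docx')
    # glob.glob can also match directories, so keep regular files only.
    return sorted(p for p in glob.glob(pattern) if os.path.isfile(p))


# Demo on a temporary folder with empty placeholder files (hypothetical names).
with tempfile.TemporaryDirectory() as folder:
    for name in ('a.docx', 'b.docx', 'notes.txt'):
        open(os.path.join(folder, name), 'w').close()
    found = [os.path.basename(p) for p in list_docx(folder)]

print(found)  # ['a.docx', 'b.docx']
```

Matching on '*.docx' in the pattern removes the need for a separate suffix check later. The article's own script instead globs '*' and tests the suffix inside the loop.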
One thing to note here is that not all resumes are laid out as paragraphs. For example, a resume downloaded from the "Liepin" website is in "table format", while a resume downloaded from "boss" is in "paragraph format". Keep this in mind when reading the files. The demonstration script in this exercise works on a resume in "table format".
Here, we can define a dedicated "ReadDoc" class with two methods, one for reading "paragraphs" and one for reading "tables".
The practical case script is as follows:
# coding:utf-8
import os

from docx import Document


class ReadDoc(object):  # Define a ReadDoc class for reading Word files
    def __init__(self, path):  # The constructor takes the path of the Word file to read
        self.doc = Document(path)
        self.p_text = ''
        self.table_text = ''

        self.get_para()
        self.get_table()

    def get_para(self):  # Read the paragraphs of the Word file
        for p in self.doc.paragraphs:
            self.p_text += p.text + '\n'  # End each paragraph with a newline
        print(self.p_text)

    def get_table(self):  # Loop over the tables and read their contents
        for table in self.doc.tables:
            for row in table.rows:
                _cell_str = ''  # Collect the full text of one row
                for cell in row.cells:
                    _cell_str += cell.text + ','  # Separate the cells with ","
                self.table_text += _cell_str + '\n'  # End each row with a newline
        print(self.table_text)


if __name__ == '__main__':
    path = os.path.join(os.getcwd(), 'test_file/简历1.docx')
    doc = ReadDoc(path)
    print(doc)
Look at the running results of the ReadDoc class:
OK, the script above successfully reads the resume's Word document. Next, we filter the extracted content by keyword and pick out the resumes that contain all of the keywords.
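The keyword check can be sketched in isolation. The script in this article tests each keyword with a case-sensitive `in`, so 'python' would not match a resume that only says 'Python'. The helper name contains_all and its ignore_case flag are assumptions added here for illustration, not part of the original script.

```python
def contains_all(text, targets, ignore_case=True):
    """Return True when every keyword in targets occurs in text."""
    if ignore_case:
        # Lowercase both sides so 'python' also matches 'Python'.
        text = text.lower()
        targets = [t.lower() for t in targets]
    return all(t in text for t in targets)


sample = 'Skills: Python, Golang, React'  # placeholder resume text
print(contains_all(sample, ['python', 'golang']))         # True
print(contains_all(sample, ['python', 'golang'], False))  # False: the case differs
print(contains_all(sample, ['python', 'java']))           # False: 'java' is missing
```

Whether matching should ignore case depends on the keywords. This sketch makes it the default, which is an assumption.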
The practical case script is as follows:
# coding:utf-8
import glob
import os

from docx import Document


class ReadDoc(object):  # Define a ReadDoc class for reading Word files
    def __init__(self, path):  # The constructor takes the path of the Word file to read
        self.doc = Document(path)
        self.p_text = ''
        self.table_text = ''

        self.get_para()
        self.get_table()

    def get_para(self):  # Read the paragraphs of the Word file
        for p in self.doc.paragraphs:
            self.p_text += p.text + '\n'  # End each paragraph with a newline
        # print(self.p_text)  # Debug output: the paragraph content of the Word file

    def get_table(self):  # Loop over the tables and read their contents
        for table in self.doc.tables:
            for row in table.rows:
                _cell_str = ''  # Collect the full text of one row
                for cell in row.cells:
                    _cell_str += cell.text + ','  # Separate the cells with ","
                self.table_text += _cell_str + '\n'  # End each row with a newline
        # print(self.table_text)  # Debug output: the table content of the Word file


def search_word(path, targets):  # Pick out the matching resumes; path is a glob pattern, targets is a list of keywords
    result = glob.glob(path)
    final_result = []  # Empty list that will hold the matching file paths

    for i in result:  # Loop over the paths matched by glob
        isuse = True  # Whether the file qualifies
        if os.path.isfile(i):  # Check that it is a file
            if i.endswith('.docx'):  # If the suffix is "docx", instantiate ReadDoc for the file
                doc = ReadDoc(i)
                p_text = doc.p_text  # Get the content of the Word file
                table_text = doc.table_text
                all_text = p_text + table_text

                for target in targets:  # Check that every keyword is present
                    if target not in all_text:
                        isuse = False
                        break

                if not isuse:
                    continue
                final_result.append(i)
    return final_result


if __name__ == '__main__':
    path = os.path.join(os.getcwd(), '*')
    # "埋点" (event tracking) was added to "简历1.docx" on purpose to demonstrate the effect
    result = search_word(path, ['python', 'golang', 'react', '埋点'])
    print(result)
The running results are as follows:
The above is the detailed content of Python automation practice for screening resumes. For more information, please follow other related articles on the PHP Chinese website!