Multithreading - Why does a Python child thread wait for such a long time?

Question

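Background: I am running a crawler with 10 threads. Each thread first scrapes a specified number of proxies to use as its own proxy pool, then starts working. Problem: below are two lines from the crawler log. You can see that it waited 45 seconds at the task on the first line, even though all that happens there is printing a message, and I really don't understand why...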

高洛峰 · Answer

It turned out that the problem was mainly caused by improper use of SQLite. In the original design, a connection was opened and kept open until every proxy in the proxy pool had been verified and the required number of proxies had been scraped; only then was it closed. Every time a proxy was added, modified, or deleted, the change was written to the database file immediately, so SQLite's coarse-grained lock was held for long stretches.
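The locking behavior described above can be reproduced with a small sketch. It is not the crawler's actual code: the file name, the `proxies` table, and the sample addresses are assumptions made for illustration. One connection takes the write lock and keeps its transaction open, and a second connection that tries to write gives up with `sqlite3.OperationalError` once its `timeout` expires.

```python
import os
import sqlite3
import tempfile


def demo_write_lock():
    """Return (blocked_while_lock_held, succeeded_after_commit)."""
    # Hypothetical database file in a fresh temporary directory.
    path = os.path.join(tempfile.mkdtemp(), "proxies.db")

    # isolation_level=None: we issue BEGIN/COMMIT ourselves.
    holder = sqlite3.connect(path, isolation_level=None)
    holder.execute("CREATE TABLE proxies (addr TEXT)")  # hypothetical schema
    holder.execute("BEGIN IMMEDIATE")  # take the write lock and keep it
    holder.execute("INSERT INTO proxies VALUES ('10.0.0.1:8080')")

    # The second writer waits only 0.1 s instead of the default 5 s.
    other = sqlite3.connect(path, timeout=0.1, isolation_level=None)
    try:
        other.execute("INSERT INTO proxies VALUES ('10.0.0.2:8080')")
        blocked = False
    except sqlite3.OperationalError:  # "database is locked"
        blocked = True

    holder.execute("COMMIT")  # release the lock
    other.execute("INSERT INTO proxies VALUES ('10.0.0.2:8080')")
    count = other.execute("SELECT COUNT(*) FROM proxies").fetchone()[0]

    holder.close()
    other.close()
    return blocked, count == 2
```

Calling `demo_write_lock()` should return `(True, True)`: the second writer is refused while the first transaction is open and succeeds as soon as it is committed. With the default 5 second timeout, a thread in that position simply waits, which matches the long pauses seen in the log.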

After finding this, I changed the design. A new connection is now closed as soon as the stored proxies have been read. From then on, all added, updated, and deleted proxy data is buffered in class variables until the required number of proxies has been obtained. At that point a new connection is opened, the data is written with executemany, and the connection is closed again. With this change the scheduled task completes much faster.
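A minimal sketch of this buffering approach, under stated assumptions: the `ProxyBuffer` class, the `proxies` table, and the `addr`/`score` columns are hypothetical names, and the sketch keeps the pending changes in instance attributes guarded by a lock rather than in the class variables the original code used.

```python
import sqlite3
import threading


class ProxyBuffer:
    """Collect proxy changes in memory, then write them in one short transaction."""

    def __init__(self, db_path):
        self.db_path = db_path
        self._lock = threading.Lock()
        self._upserts = []  # pending (addr, score) tuples
        self._deletes = []  # pending (addr,) tuples

    def upsert(self, addr, score):
        with self._lock:
            self._upserts.append((addr, score))

    def delete(self, addr):
        with self._lock:
            self._deletes.append((addr,))

    def flush(self):
        """Write all pending changes and return (n_upserts, n_deletes)."""
        # Swap the buffers out under the lock so other threads can keep adding.
        with self._lock:
            upserts, self._upserts = self._upserts, []
            deletes, self._deletes = self._deletes, []

        conn = sqlite3.connect(self.db_path)
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "CREATE TABLE IF NOT EXISTS proxies "
                    "(addr TEXT PRIMARY KEY, score INTEGER)"
                )
                conn.executemany(
                    "INSERT OR REPLACE INTO proxies VALUES (?, ?)", upserts
                )
                conn.executemany("DELETE FROM proxies WHERE addr = ?", deletes)
        finally:
            conn.close()  # the connection lives only for the duration of the write
        return len(upserts), len(deletes)
```

Each worker thread calls `upsert` or `delete` while it verifies proxies, and `flush` is called once the required number of proxies has been collected, so the database is locked only for the duration of a single batch write.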

But I still don't understand why, in the original setup, thread scheduling let the thread that was blocked on the database keep holding resources instead of switching to another thread promptly.

伊谢尔伦 · Answer

So your threads are blocked on database writes. Since you are using SQLite, here is another trick for speeding up database writes:

import sqlite3

...
conn = sqlite3.connect('xxx.db')
cur = conn.cursor()
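cur.execute("CREATE TABLE IF NOT EXISTS names (col1, col2)")  # create the table (column names are placeholders)
cur.execute("PRAGMA synchronous = OFF")  # turn off disk synchronization
cur.execute("BEGIN TRANSACTION")  # start a transaction
cur.executemany("INSERT INTO names VALUES (?,?)", lst)  # batch-insert the crawled data; lst is a list of 2-tuples
conn.commit()  # commit the whole batch at once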
conn.close()
...

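Three techniques are used here to speed up SQLite writes:

Turning off disk synchronization (faster, but the database can be corrupted if the OS crashes or power is lost mid-write)
Wrapping the writes in a single SQLite transaction
Batch inserting with executemany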
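To check how much these techniques help on a particular machine, a rough timing sketch along the following lines could be used. The `timed_insert` helper, the `bench.db` file name, and the two-column `names` table are assumptions for illustration, and the measured times depend heavily on the disk and filesystem.

```python
import os
import sqlite3
import tempfile
import time


def timed_insert(rows, batched):
    """Insert rows into a fresh database; return (elapsed_seconds, row_count)."""
    path = os.path.join(tempfile.mkdtemp(), "bench.db")  # hypothetical file name
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE names (col1, col2)")  # placeholder columns

    start = time.perf_counter()
    if batched:
        # All three techniques: no disk sync, one transaction, executemany.
        conn.execute("PRAGMA synchronous = OFF")
        conn.executemany("INSERT INTO names VALUES (?, ?)", rows)
        conn.commit()
    else:
        # Baseline: one transaction, and one disk sync, per row.
        for row in rows:
            conn.execute("INSERT INTO names VALUES (?, ?)", row)
            conn.commit()
    elapsed = time.perf_counter() - start

    count = conn.execute("SELECT COUNT(*) FROM names").fetchone()[0]
    conn.close()
    return elapsed, count


if __name__ == "__main__":
    sample = [(str(i), i) for i in range(1000)]
    print("batched:  %.3f s" % timed_insert(sample, batched=True)[0])
    print("per-row:  %.3f s" % timed_insert(sample, batched=False)[0])
```

On most disks the per-row variant is expected to be noticeably slower, because every commit waits for the data to reach the disk.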

PS: If you have memory to spare, you can also put the database file in a tmpfs directory, which largely eliminates the impact of disk I/O (it is equivalent to writing directly to memory).