Each HTML page is a DOM tree. While crawling, record which node each sensitive word appears in and its position within that node, then compare that record against what is already stored in the database to finish the job.
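A rough sketch of that idea, using only Python's standard `html.parser`. The class name `SensitiveWordLocator`, the helper `locate`, and the choice to describe a node by its tag path are assumptions made for illustration. The tag path does not tell sibling nodes apart, so a real crawler would probably want a more precise node identifier.

```python
from html.parser import HTMLParser

# Tags with no closing tag; they must not be pushed on the path stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

class SensitiveWordLocator(HTMLParser):
    """Collects (node_path, word, offset) for every sensitive word found in text nodes."""

    def __init__(self, words):
        super().__init__()
        self.words = words
        self.path = []  # stack of currently open tags
        self.hits = []  # list of (node_path, word, offset_in_text_node)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.path.append(tag)

    def handle_endtag(self, tag):
        if tag in self.path:
            # Pop back to the matching open tag (tolerates unclosed children).
            while self.path.pop() != tag:
                pass

    def handle_data(self, data):
        node_path = "/".join(self.path)
        for word in self.words:
            start = data.find(word)
            while start != -1:
                self.hits.append((node_path, word, start))
                start = data.find(word, start + 1)

def locate(html, words):
    parser = SensitiveWordLocator(words)
    parser.feed(html)
    parser.close()
    return parser.hits
```

The returned tuples are what you would store and later compare against the database rows for the same page.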
Recording the IDs that have already been crawled is the right idea. It works for replies too: just store each reply's ID, timestamp, and any other identifying information.
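A minimal sketch of recording crawled replies, assuming SQLite as the database. The table name `seen_replies`, its columns, and the helper names `open_store` and `mark_if_new` are hypothetical, chosen only for this example.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) the store of replies that were already crawled."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen_replies ("
        "reply_id TEXT PRIMARY KEY, posted_at TEXT NOT NULL)"
    )
    return conn

def mark_if_new(conn, reply_id, posted_at):
    """Record the reply and return True if it is new, False if already seen."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO seen_replies (reply_id, posted_at) VALUES (?, ?)",
        (reply_id, posted_at),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted, 0 when the primary key already existed.
    return cur.rowcount == 1
```

The crawler would call `mark_if_new` for every reply it sees and only process the ones for which it returns `True`.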
Use a Bloom filter to check whether an ID has already been crawled.
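For anyone unfamiliar with it, here is a small Bloom filter sketch in plain Python. The sizes (`size_bits`, `num_hashes`) and the double-hashing scheme built on SHA-256 are assumptions for illustration, not tuned values. A Bloom filter never reports a stored ID as missing, but it can occasionally report an unseen ID as present, so treat a "yes" as "probably crawled".

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

Usage would look like `if post_id not in seen: seen.add(post_id); crawl(post_id)`, where `seen` is a `BloomFilter` and `crawl` is whatever your crawler does with a new ID.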