python - When writing a crawler, how do you tell whether something has already been crawled? - PHP Chinese Network Q&A



天蓬老师 2017-04-18 10:17:25


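I have never written a crawler before. My current project needs to crawl the posts in one board of a specified forum; if a keyword such as "机械XXX" appears in a post, the crawler should capture it and notify the administrator.

The problem is how to tell whether something has already been crawled. The first version of the crawler only fetches post content, not replies. My current idea is to periodically crawl the post IDs on the first 20 pages and check each one against the database, fetching a post only if its ID is not already stored. I'm not sure whether this approach is sound.

The second version will also have to crawl replies, though. How should that be implemented? Does anyone have ideas?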


All replies (3)

Peter_Zhu 2017-04-18 10:19:25 Floor 3

Each HTML page is a DOM tree. When crawling, record which node each sensitive word appears in and its position within that node, then compare that record against what is already stored in the database to tell whether the match is new.
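A minimal sketch of that idea, assuming a simple tag-name path is enough to identify a node. `KeywordLocator` and `find_keywords` are hypothetical names, and the path ignores sibling order, so two `<p>` elements under the same parent share one path:

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class KeywordLocator(HTMLParser):
    """Collect (keyword, tag path, character offset) for every keyword hit."""

    def __init__(self, keywords):
        super().__init__()
        self.keywords = keywords
        self.stack = []   # open tags from the root down to the current node
        self.hits = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        path = "/".join(self.stack)
        for keyword in self.keywords:
            start = data.find(keyword)
            while start != -1:
                self.hits.append((keyword, path, start))
                start = data.find(keyword, start + 1)


def find_keywords(html, keywords):
    parser = KeywordLocator(keywords)
    parser.feed(html)
    parser.close()
    return parser.hits
```

Each returned tuple could be stored next to the page URL and checked on the next crawl; a tuple that is not yet in the database would be treated as a new match.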


黄舟 2017-04-18 10:19:25 Floor 2

Use a Bloom filter.
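For readers who have not met one, here is a small Bloom filter sketch built only on the standard library. The class name, the default sizes, and the use of salted SHA-256 as the hash family are assumptions made for illustration, not something specified in this thread:

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array that answers 'possibly seen' or 'definitely not seen'."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a counter.
        data = item.encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(2, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

A crawler would test `post_id in seen` before fetching and call `seen.add(post_id)` afterwards. The filter never forgets an ID it was given, but it can occasionally report an unseen ID as seen, which here would mean a new post gets skipped. For a single board of one forum, an exact lookup in the database may be the simpler choice.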


刘奇 2017-04-18 10:19:25 Floor 1

The idea of recording the IDs that have already been crawled is right. The same approach works for replies: just record each reply's ID, time, and other identifying information.
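One way this could look, sketched with `sqlite3` from the standard library. The table layout and the names `open_seen_store` and `mark_if_new` are hypothetical, and the sketch assumes the forum exposes a stable ID for every post and reply:

```python
import sqlite3


def open_seen_store(path=":memory:"):
    """Open (or create) a table holding the IDs that were already crawled."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen ("
        " kind TEXT NOT NULL,"      # 'post' or 'reply'
        " item_id TEXT NOT NULL,"   # ID taken from the forum page
        " PRIMARY KEY (kind, item_id))"
    )
    return conn


def mark_if_new(conn, kind, item_id):
    """Return True the first time an ID is recorded, False on later calls."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO seen (kind, item_id) VALUES (?, ?)",
        (kind, str(item_id)),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted and 0 when it already existed.
    return cur.rowcount == 1
```

The crawler would call `mark_if_new(conn, "post", post_id)` for each ID found on the listing pages and fetch the content only when it returns `True`. Replies go through the same function with `kind="reply"`, so the second version needs no separate mechanism. Extra columns such as the reply time could be added to the table in the same way.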

