python - My crawler fetches page links; how do I tell which links are the newest?
黄舟 2017-04-18 09:46:23

As a Python practice project, I want to build a Weibo bot that automatically reposts news from a website.
I know I need to connect to the Weibo API and crawl the site's news links and titles.
But how do I extract only the latest news?
Here is my code, which prints every news item that passes my filter:

# Collect every <li> element that carries a data-label attribute
bar = soup.find_all('li', attrs={'data-label': True})
for item in bar:
    # Keep only items tagged with the keyword
    if u'巴塞罗那' in item['data-label'].split(','):
        print(item)

I want to print only the first item of the filtered results, but when I try, it prints len(bar) times and skips the filter rule entirely. How can I fix this?


All replies (4)
伊谢尔伦

Are you crawling the live-broadcast site?

You can set a variable lasttime to record the time of the last crawl

from datetime import datetime

# lasttime holds the datetime of the previous crawl; initialise it once,
# then update it after every crawl
lasttime = datetime.min

bar = soup.find_all('li', attrs={'data-label': True})
for item in bar:
    # The last 19 characters of the item text hold a "%Y-%m-%d %H:%M:%S" timestamp
    d = datetime.strptime(item.text[-19:], "%Y-%m-%d %H:%M:%S")
    if u'巴塞罗那' in item['data-label'].split(',') and d > lasttime:
        print(item)
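A self-contained sketch of this timestamp-comparison idea, stripped of the HTML parsing so it runs on plain data; `newer_items` is a hypothetical helper, and the (labels, text) pair layout and trailing-timestamp format are assumptions taken from the snippet above:

```python
from datetime import datetime

def newer_items(items, last_time, keyword=u"巴塞罗那"):
    """Return items tagged with `keyword` whose timestamp is after last_time.

    Each item is a (labels, text) pair where labels is a comma-separated
    tag string and text ends with a "%Y-%m-%d %H:%M:%S" timestamp.
    """
    fresh = []
    for labels, text in items:
        posted = datetime.strptime(text[-19:], "%Y-%m-%d %H:%M:%S")
        if keyword in labels.split(",") and posted > last_time:
            fresh.append((labels, text))
    return fresh

# Example: only the first item is both tagged with the keyword
# and newer than the last crawl.
items = [
    (u"足球,巴塞罗那", u"Messi scores twice 2016-10-18 21:30:00"),
    (u"足球,皇马", u"Real Madrid draw 2016-10-18 22:00:00"),
    (u"足球,巴塞罗那", u"Older match report 2016-10-17 10:00:00"),
]
print(newer_items(items, datetime(2016, 10, 18)))
```

After each crawl, set `last_time` to the newest timestamp you saw, so the next run only reports items posted since then.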
阿神

Actually this is a very common problem: deduplication. First, give each news item a unique identifier, such as a timestamp, or derive one from the URL format used by the zhibo8 live bar. From "http://news.zhibo8.cc/zuqiu/2016-10-18/5805df3d3422f" you can build:

20161018-5805df3d3422f

as the unique ID of the news item. Or, more strictly, prefix a category flag for football, e.g. 0:

0-20161018-5805df3d3422f

With a unique ID, the rest is much easier, and there are many approaches. For example, maintain a list in memory that stores, in order, the IDs of the news currently on the page. The next time you crawl, every item that appears before the first ID in your list is new. Then update the list, dropping the oldest entries: if n new items were added, delete the last n. That is efficient in both space and time.
If you also want to keep the old news, save the deleted entries to a database each time.
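The ID scheme above can be sketched as follows; `make_id` and `find_new` are hypothetical helper names, and the URL layout is assumed from the example link:

```python
def make_id(url):
    """Derive a unique ID like '20161018-5805df3d3422f' from a
    zhibo8-style news URL (.../zuqiu/2016-10-18/5805df3d3422f)."""
    date_part, slug = url.rstrip("/").split("/")[-2:]
    return date_part.replace("-", "") + "-" + slug

def find_new(page_ids, seen_ids):
    """Return the IDs that appear on the page before the first
    already-seen ID; those are the new items, in page order."""
    for i, pid in enumerate(page_ids):
        if pid in seen_ids:
            return page_ids[:i]
    return list(page_ids)  # no overlap with last crawl: whole page is new
```

After each crawl you would prepend the new IDs to your stored list and trim the same number of old entries off the end, keeping the list a fixed size.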

迷茫

Don’t all news pages have a time field?

大家讲道理

Your goal is to extract the latest news matching the keywords you set. The simplest approach is to call time.sleep(60), re-crawl the page after a minute, and compare with the previous result; anything not seen before is the latest news. Also, your question gives too little information.
