Web crawler - how can I filter out strings elegantly in Python?
黄舟 2017-04-18 10:15:45

For example, when I'm scraping a website, I want to filter out or replace some useless information directly, such as QQ numbers, phone numbers, or anything starting with www.

If the amount of data is small it's fine, I can just write something like this:

if "http" or "www" or "QQ" or "qq" in content:
    ....

But if the amount of data is large, isn't that going to be painful?
Do I really have to keep chaining conditions together with or?

What is the most elegant way to do this? I imagine using regular expressions would be better.

Because there is so much to match - QQ numbers, URLs, phone numbers and so on all need to be found and replaced.
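Something along these lines is what I have in mind; the patterns below are only rough guesses at QQ numbers, URLs and phone numbers, purely for illustration:

import re

# Rough, illustrative patterns only - real ones would need tuning
patterns = [
    r"https?://\S+",               # links starting with http/https
    r"www\.\S+",                   # bare www. addresses
    r"[Qq]{2}[:：]?\s*\d{5,11}",    # things that look like "QQ:12345678"
    r"1[3-9]\d{9}",                # mainland mobile phone numbers
]

combined = re.compile("|".join(patterns))

def clean(content):
    # Replace every match with an empty string in one pass
    return combined.sub("", content)

print(clean("联系我 QQ:12345678 或访问 www.example.com"))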


All replies (2)
伊谢尔伦

It depends on how much data you have. If the data volume is small, you can just keep the keywords in redis or in a configuration file; every time you crawl something, load all the keywords and run the replacements.
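A minimal sketch of that idea, assuming the keywords sit in a plain text file named keywords.txt (one per line - a name made up here); loading from redis would just replace the file-reading step:

import re

# Hypothetical keyword file: one literal keyword per line
with open("keywords.txt", encoding="utf-8") as f:
    keywords = [line.strip() for line in f if line.strip()]

# Escape each keyword so it is matched literally, then OR them together
pattern = re.compile("|".join(re.escape(k) for k in keywords))

def strip_keywords(content):
    # Remove every keyword occurrence in one pass over the string
    return pattern.sub("", content)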

But since this is a web crawler, if the keyword list and the strings to be filtered are both very large, efficiency will be a real concern even with regular expressions.

For example, say you have 100,000 keywords to filter out, and suppose you can combine them into 50,000 regular expressions (leaving aside whether you would write that many by hand or generate them automatically). Every crawled string is very long, and each one would have to loop through at least 50,000 regex matches. I don't think this brute-force approach is really viable.

Just a personal suggestion: have a look at this article: http://blog.jobbole.com/99910/ . It explains how to segment keywords and build a keyword index to get much more efficient queries; it describes Stack Overflow's tag engine.
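A very rough sketch of the "index the keywords instead of scanning them" idea from that article, assuming the keywords are whole tokens (a real tag engine is far more involved, and Chinese text would need proper word segmentation):

# Hypothetical keyword set: membership checks are O(1) per token,
# instead of scanning the full keyword list for every crawled string
keywords = {"http", "www", "qq", "tel"}

def contains_keyword(content):
    # Naive whitespace tokenisation, just to show the lookup pattern
    return any(token.lower() in keywords for token in content.split())

print(contains_keyword("contact me via QQ please"))  # True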

Or consider something heavyweight like ElasticSearch... there is obviously no way to cover that in the few dozen words here.

迷茫

What the person above said is correct, but if the data is small you can consider using any():

a = [1, 2]
b = [2, 3]
# any() returns True as soon as one element of a is found in b
if any(i in b for i in a):
    pass
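Applied to the original question, that same pattern might look like this (the keyword list is just an example):

keywords = ["http", "www", "QQ", "qq"]
content = "随便抓下来的一段文字 www.example.com"

if any(kw in content for kw in keywords):
    # the content contains at least one unwanted keyword
    pass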