Web crawler - how can I filter out strings elegantly in Python?
黄舟 2017-04-18 10:15:45

For example, when I'm scraping a website, I want to filter out or replace some useless information directly, such as QQ numbers, phone numbers, or anything starting with www.

If the amount of data is small it's fine, I can just write something like this:

if "http" or "www" or "QQ" or "qq" in content:
    ....

But if the amount of data is large, isn't that going to be painful?
Do I really have to keep chaining conditions together with or?

What is the most elegant way to do this? I imagine using regular expressions would be better.

Because there is so much to match - QQ numbers, URLs, phone numbers and so on all need to be found and replaced.
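Something along these lines is what I have in mind; the patterns below are only rough guesses at QQ numbers, URLs and phone numbers, purely for illustration:

import re

# Rough, illustrative patterns only - real ones would need tuning
patterns = [
    r"https?://\S+",               # links starting with http/https
    r"www\.\S+",                   # bare www. addresses
    r"[Qq]{2}[:：]?\s*\d{5,11}",    # things that look like "QQ:12345678"
    r"1[3-9]\d{9}",                # mainland mobile phone numbers
]

combined = re.compile("|".join(patterns))

def clean(content):
    # Replace every match with an empty string in one pass
    return combined.sub("", content)

print(clean("联系我 QQ:12345678 或访问 www.example.com"))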


All replies (2)
伊谢尔伦

It depends on how much data you have. If the data volume is small, you can just keep the keywords in redis or in a configuration file; every time you crawl something, load all the keywords and run the replacements.
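A minimal sketch of that idea, assuming the keywords sit in a plain text file named keywords.txt (one per line - a name made up here); loading from redis would just replace the file-reading step:

import re

# Hypothetical keyword file: one literal keyword per line
with open("keywords.txt", encoding="utf-8") as f:
    keywords = [line.strip() for line in f if line.strip()]

# Escape each keyword so it is matched literally, then OR them together
pattern = re.compile("|".join(re.escape(k) for k in keywords))

def strip_keywords(content):
    # Remove every keyword occurrence in one pass over the string
    return pattern.sub("", content)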

But since this is a web crawler, if the keyword list and the strings to be filtered are both very large, efficiency will be a real concern even with regular expressions.

For example, say you have 100,000 keywords to filter out, and suppose you can combine them into 50,000 regular expressions (leaving aside whether you would write that many by hand or generate them automatically). Every crawled string is very long, and each one would have to loop through at least 50,000 regex matches. I don't think this brute-force approach is really viable.

Just a personal suggestion: have a look at this article: http://blog.jobbole.com/99910/ . It explains how to segment keywords and build a keyword index to get much more efficient queries; it describes Stack Overflow's tag engine.
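A very rough sketch of the "index the keywords instead of scanning them" idea from that article, assuming the keywords are whole tokens (a real tag engine is far more involved, and Chinese text would need proper word segmentation):

# Hypothetical keyword set: membership checks are O(1) per token,
# instead of scanning the full keyword list for every crawled string
keywords = {"http", "www", "qq", "tel"}

def contains_keyword(content):
    # Naive whitespace tokenisation, just to show the lookup pattern
    return any(token.lower() in keywords for token in content.split())

print(contains_keyword("contact me via QQ please"))  # True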

Or consider something heavyweight like ElasticSearch... there is obviously no way to cover that in the few dozen words here.

迷茫

What the person above said is correct, but if the data is small you can consider using any():

a = [1, 2]
b = [2, 3]
# any() returns True as soon as one element of a is found in b
if any(i in b for i in a):
    pass
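Applied to the original question, that same pattern might look like this (the keyword list is just an example):

keywords = ["http", "www", "QQ", "qq"]
content = "随便抓下来的一段文字 www.example.com"

if any(kw in content for kw in keywords):
    # the content contains at least one unwanted keyword
    pass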