How can I search in parallel for the lines that contain keywords in Python?
高洛峰
高洛峰 2017-04-17 17:45:22

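I have several hundred thousand keywords in the file 4.txt, and I want to extract the lines of 3.txt that contain a keyword and save them to 5.txt.
File 3 has 2 million lines. The code below does what I need, but it is very slow: it still had not finished after a whole afternoon. Does anyone have a faster method?
How would I change it to run in parallel? I saw a post here about parallel processing, but my case is different: I need to read and search a single file, whereas that post processes multiple files in parallel.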

with open('3.txt', 'r') as f3, open('4.txt', 'r') as f4, open('result.txt', 'w') as f5:
    a = [line.strip() for line in f4.readlines()]
    for li in f3.readlines():
        new_line = li.strip().split()[1][:-2]
        for i in a:
            if i in new_line:
                f5.writelines(li)

All replies (3)
阿神

Without the actual files I can't give you a 100% guarantee, but I do have some suggestions for making your code more efficient:

(Maybe you will find that the improved code does not require a parallel solution at all)


First of all, a big problem is readlines(). This method reads every line of the file object in one go and returns them all as a list. That is poor for both speed and memory use: reading hundreds of thousands or millions of lines at once is a heavy load.

For a detailed analysis and discussion, see "Never call readlines() on a file".

(This passage from the article could almost serve as a warning.)

There are hundreds of questions on places like StackOverflow about the readlines method, and in every case, the answer is the same.
"My code is takes forever before it even gets started, but it's pretty fast once it gets going."
That's because you're calling readlines.
"My code seems to be worse than linear on the size of the input, even though it's just a simple loop."
That's because you're calling readlines.
"My code can't handle giant files because it runs out of memory."
That's because you're calling readlines.

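The conclusion: change every place that uses readlines.

Example:

with open('XXX', 'r') as f:
    for line in f.readlines():
        # do something...

should always be changed to:

with open('XXX', 'r') as f:
    for line in f:
        # do something...

Intuitively, this should be much more efficient, because iterating over the file object reads one line at a time instead of building a list of every line first.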
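As a rough illustration of that change, here is a minimal sketch that filters a file lazily and checks that it yields the same lines as the readlines() version. The helper name `matching_lines_lazy` and the demo data are made up for this example; they are not from the question.

```python
import os
import tempfile

def matching_lines_lazy(path, needle):
    """Yield the lines that contain needle, reading one line at a time."""
    with open(path, 'r') as f:
        for line in f:              # the file object is its own line iterator
            if needle in line:
                yield line

# Hypothetical demo data, written to a temporary file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('alpha\nbeta\nalphabet\n')
    demo_path = tmp.name

lazy_result = list(matching_lines_lazy(demo_path, 'alpha'))

# The readlines() version builds the whole list of lines before filtering.
with open(demo_path, 'r') as f:
    eager_result = [line for line in f.readlines() if 'alpha' in line]

os.remove(demo_path)
```

Both versions select the same lines; the lazy one simply never holds the full file in memory.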


Second, you use a list to look up the keywords, which is also quite inefficient:

for i in a:
    if i in new_line:

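To check whether new_line contains a keyword i, this walks the entire keyword list a. In ordinary cases that may be fine, but with hundreds of thousands of keywords, walking a once for every line wastes a great deal of time. Suppose a holds x keywords, f3 has y lines, and each line has z characters; the time spent here is then on the order of x*y*z (given the line counts of your files, that is an enormous number).

Simply using a container that looks items up by hash, such as a dictionary or a set, would certainly do better.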
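A minimal sketch of the difference, using hypothetical keywords. One caveat worth stating plainly: a set lookup tests whether a token is exactly equal to a keyword, so it only replaces the original loop if exact matching is what is wanted. The substring test `i in new_line` from the question still needs a scan over the keywords (or a multi-pattern matcher).

```python
keywords_list = ['apple', 'banana', 'cherry']   # hypothetical keywords
keywords_set = set(keywords_list)

def is_keyword_list(token):
    # Linear scan: compares token against each keyword in turn.
    return token in keywords_list

def is_keyword_set(token):
    # Hash lookup: average cost does not grow with the number of keywords.
    return token in keywords_set

def contains_keyword(text):
    # Substring semantics, as in the original loop: still one check per keyword.
    return any(k in text for k in keywords_set)
```

For example, `is_keyword_set('banana split')` is False because the whole string is not a keyword, while `contains_keyword('banana split')` is True.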


Finally, about your search itself:

for li in f3.readlines():
    new_line = li.strip().split()[1][:-2]
    for i in a:
        if i in new_line:
            f5.writelines(li)

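I don't quite follow this part: new_line looks like a substring, and you then match that string against the keywords?

Setting that aside, once a new_line that contains a keyword has been written out, there seems to be no reason to keep looping over a, unless you mean to write the line once for every keyword found in new_line. Otherwise, adding a break also speeds things up.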

I'd suggest changing your code to:

with open('3.txt') as f3, open('4.txt') as f4, open('result.txt', 'w') as f5:
    # A set gives hash-based membership tests
    keywords = set(line.strip() for line in f4)
    for line in f3:
        new_line = line.strip().split()[1][:-2]
        # Note: new_line is a str, so this loop visits its characters one by one;
        # to match the whole field against a keyword, test `new_line in keywords`.
        for word in new_line:
            if word in keywords:
                # line keeps its own newline, so write it unchanged
                f5.write(line)
                break

If I've misunderstood you, feel free to tell me and we can discuss it further. Intuitively, your problem can be solved without parallelism.
伊谢尔伦

Use an Aho-Corasick (AC) automaton.
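For readers unfamiliar with the idea, here is a minimal pure-Python sketch of an Aho-Corasick automaton. It is not the poster's code, and the function names `build_automaton` and `contains_any` are hypothetical. The automaton scans each line once and reports whether any keyword occurs as a substring, so the cost per line no longer depends on walking the whole keyword list.

```python
from collections import deque

def build_automaton(keywords):
    """Build trie transitions, failure links and output sets for the keywords."""
    goto = [{}]        # goto[state][char] -> next state
    fail = [0]         # fail[state] -> state to fall back to on a mismatch
    out = [set()]      # out[state] -> keywords that end at this state
    for word in keywords:
        if not word:
            continue   # skip empty keywords
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)
    # Breadth-first pass: compute failure links level by level.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]   # inherit keywords ending at the fallback
    return goto, fail, out

def contains_any(text, automaton):
    """Return True if any keyword occurs as a substring of text."""
    goto, fail, out = automaton
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if out[state]:
            return True
    return False

# Hypothetical demo: keep only the lines that contain some keyword.
automaton = build_automaton(['he', 'she', 'hers'])
lines = ['ushers\n', 'xyz\n', 'hello\n']
matched = [ln for ln in lines if contains_any(ln, automaton)]
```

In the setting of the question, the keywords would come from 4.txt and the lines from 3.txt; building the automaton happens once, before the scan.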

黄舟

Building on @dokelung's answer with slight modifications, this basically meets my requirements. The output differs somewhat from that of grep -f 4.txt 3.txt > 5.txt; I am comparing the two result files to see where they differ.

with open('3.txt') as f3, open('4.txt') as f4, open('result.txt', 'w') as f5:
    keywords = set(line.strip() for line in f4)
    for line in f3:
        new_line = line.strip().split()[1][:-2]
        if new_line in keywords:
            print(line.strip(), file=f5)
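One possible source of the difference, offered as an assumption since the files are not shown: grep -f treats each line of 4.txt as a pattern and prints every line of 3.txt in which a pattern matches anywhere, while the script above does an exact lookup on the sliced second field only. The sketch below contrasts the two behaviours on made-up data; `substring_anywhere` only approximates grep with fixed strings (grep -F), because plain grep -f interprets the patterns as regular expressions.

```python
keywords = {'abc', 'xyz'}          # hypothetical contents of 4.txt
lines = [                          # hypothetical contents of 3.txt
    'id1 abc-- tail\n',            # second field minus its last two chars is 'abc'
    'id2 zabcz-- tail\n',          # 'abc' appears only as a substring of the field
    'abc other-- tail\n',          # 'abc' appears in a different field
]

def exact_field_match(line):
    # Mirrors the script: slice the second field, then exact set lookup.
    return line.split()[1][:-2] in keywords

def substring_anywhere(line):
    # Approximates grep -F -f: any keyword anywhere in the line.
    return any(k in line for k in keywords)

exact = [ln for ln in lines if exact_field_match(ln)]
loose = [ln for ln in lines if substring_anywhere(ln)]
```

Here the exact-field version keeps only the first line, while the substring version keeps all three.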
            