


One article to understand the problem of large file sorting/external memory sorting
Question 1: A file contains 500 million lines, each line is a random integer, and all integers in the file need to be sorted.
Divide&Conquer (pide&Conquer), ReferenceBig data algorithm: Sorting 500 million data
For this one Total.txt with 500000000 lines is sorted, and the file size is 4.6G.
Every time 10,000 lines are read, sort and write them into a new subfile (quick sort is used here).
1. Split & Sort
#!/usr/bin/python2.7 import time def readline_by_yield(bfile): with open(bfile, 'r') as rf: for line in rf: yield line def quick_sort(lst): if len(lst) < 2: return lst pivot = lst[0] left = [ ele for ele in lst[1:] if ele < pivot ] right = [ ele for ele in lst[1:] if ele >= pivot ] return quick_sort(left) + [pivot,] + quick_sort(right) def split_bfile(bfile): count = 0 nums = [] for line in readline_by_yield(bfile): num = int(line) if num not in nums: nums.append(num) if 10000 == len(nums): nums = quick_sort(nums) with open('subfile/subfile{}.txt'.format(count+1),'w') as wf: wf.write('\n'.join([ str(i) for i in nums ])) nums[:] = [] count += 1 print count now = time.time() split_bfile('total.txt') run_t = time.time()-now print 'Runtime : {}'.format(run_t)
will generate 50,000 small files, each small file is about 96K in size.
During the execution of the program, the memory usage has been around 5424kB
It takes 94146 seconds to split the entire file.
2. Merge
#!/usr/bin/python2.7 # -*- coding: utf-8 -*- import os import time testdir = '/ssd/subfile' now = time.time() # Step 1 : 获取全部文件描述符 fds = [] for f in os.listdir(testdir): ff = os.path.join(testdir,f) fds.append(open(ff,'r')) # Step 2 : 每个文件获取第一行,即当前文件最小值 nums = [] tmp_nums = [] for fd in fds: num = int(fd.readline()) tmp_nums.append(num) # Step 3 : 获取当前最小值放入暂存区,并读取对应文件的下一行;循环遍历。 count = 0 while 1: val = min(tmp_nums) nums.append(val) idx = tmp_nums.index(val) next = fds[idx].readline() # 文件读完了 if not next: del fds[idx] del tmp_nums[idx] else: tmp_nums[idx] = int(next) # 暂存区保存1000个数,一次性写入硬盘,然后清空继续读。 if 1000 == len(nums): with open('final_sorted.txt','a') as wf: wf.write('\n'.join([ str(i) for i in nums ]) + '\n') nums[:] = [] if 499999999 == count: break count += 1 with open('runtime.txt','w') as wf: wf.write('Runtime : {}'.format(time.time()-now))
During the execution of the program, the memory usage has been around 240M
It took about 38 hours to merge less than 50 million rows of data...
Although the memory usage is reduced, the time is complicated The degree is too high; You can further reduce memory usage by reducing the number of files (increasing the number of rows stored in each small file).
Question 2: A file has 100 billion lines of data, each line is an IP address, and the IP addresses need to be sorted.
Convert IP address to numbers
# 方法一:手动计算 In [62]: ip Out[62]: '10.3.81.150' In [63]: ip.split('.')[::-1] Out[63]: ['150', '81', '3', '10'] In [64]: [ '{}-{}'.format(idx,num) for idx,num in enumerate(ip.split('.')[::-1]) ] Out[64]: ['0-150', '1-81', '2-3', '3-10'] In [65]: [256**idx*int(num) for idx,num in enumerate(ip.split('.')[::-1])] Out[65]: [150, 20736, 196608, 167772160] In [66]: sum([256**idx*int(num) for idx,num in enumerate(ip.split('.')[::-1])]) Out[66]: 167989654 In [67]: # 方法二:使用C扩展库来计算 In [71]: import socket,struct In [72]: socket.inet_aton(ip) Out[72]: b'\n\x03Q\x96' In [73]: struct.unpack("!I", socket.inet_aton(ip)) # !表示使用网络字节顺序解析, 后面的I表示unsigned int, 对应Python里的integer or long Out[73]: (167989654,) In [74]: struct.unpack("!I", socket.inet_aton(ip))[0] Out[74]: 167989654 In [75]: socket.inet_ntoa(struct.pack("!I", 167989654)) Out[75]: '10.3.81.150' In [76]:
Question 3: There is a 1.3GB file (a total of 100 million lines). Each line in it is a string. Please find duplicates in the file. The most frequent string.
Basic idea: Read large files iteratively, split the large files into multiple small files; and finally merge these small files.
Splitting rules:
Read large files iteratively, maintain a dictionary in memory, the key is a string, and the value is the number of times the string appears;
When the number of string types maintained by the dictionary reaches 10,000 (customizable), the dictionary is sorted from small to large by key , and then written into a small file, each line is key\tvalue;
Then clear the dictionary and continue reading until the large file is finished.
Merge rules:
First get the file descriptors of all small files, and then read out the first line (that is, each small file The string with the smallest ascii value of the file string) is compared.
Find the string with the smallest ascii value. If there are duplicates, add up the number of occurrences, and then store the current string and the total number of times into a list in memory.
Then move the read pointer of the file where the smallest string is located downward, that is, read another line from the corresponding small file for the next round of comparison.
When the number of lists in the memory reaches 10,000, the contents of the list are written to a final file at once and stored on the hard disk. At the same time, clear the list for subsequent comparisons.
Until all the small files have been read, the final file is a large file sorted in ascending order according to the string ascii value. The content of each line is the string\trepetition times,
The last iteration is to read the final file and find the one with the most repetitions.
1. Split
def readline_by_yield(bfile): with open(bfile, 'r') as rf: for line in rf: yield line def split_bfile(bfile): count = 0 d = {} for line in readline_by_yield(bfile): line = line.strip() if line not in d: d[line] = 0 d[line] += 1 if 10000 == len(d): text = '' for string in sorted(d): text += '{}\t{}\n'.format(string,d[string]) with open('subfile/subfile{}.txt'.format(count+1),'w') as wf: wf.write(text.strip()) d.clear() count += 1 text = '' for string in sorted(d): text += '{}\t{}\n'.format(string,d[string]) with open('subfile/subfile_end.txt','w') as wf: wf.write(text.strip()) split_bfile('bigfile.txt')
2. Merge
import os import json import time import traceback testdir = '/ssd/subfile' now = time.time() # Step 1 : 获取全部文件描述符 fds = [] for f in os.listdir(testdir): ff = os.path.join(testdir,f) fds.append(open(ff,'r')) # Step 2 : 每个文件获取第一行 tmp_strings = [] tmp_count = [] for fd in fds: line = fd.readline() string,count = line.strip().split('\t') tmp_strings.append(string) tmp_count.append(int(count)) # Step 3 : 获取当前最小值放入暂存区,并读取对应文件的下一行;循环遍历。 result = [] need2del = [] while True: min_str = min(tmp_strings) str_idx = [i for i,v in enumerate(tmp_strings) if v==min_str] str_count = sum([ int(tmp_count[idx]) for idx in str_idx ]) result.append('{}\t{}\n'.format(min_str,str_count)) for idx in str_idx: next = fds[idx].readline() # IndexError: list index out of range # 文件读完了 if not next: need2del.append(idx) else: next_string,next_count = next.strip().split('\t') tmp_strings[idx] = next_string tmp_count[idx] = next_count # 暂存区保存10000个记录,一次性写入硬盘,然后清空继续读。 if 10000 == len(result): with open('merged.txt','a') as wf: wf.write(''.join(result)) result[:] = [] # 注意: 文件读完需要删除文件描述符的时候, 需要逆序删除 need2del.reverse() for idx in need2del: del fds[idx] del tmp_strings[idx] del tmp_count[idx] need2del[:] = [] if 0 == len(fds): break with open('merged.txt','a') as wf: wf.write(''.join(result)) result[:] = []
Merge result analysis:
Dictionary size maintained in memory during splitting | Number of small files to be split | Number of file descriptors to be maintained during merging | Memory usage during merging | Merge takes time | |
First time | 10000 | 9000 | 9000 ~ 0 | 200M | The merge speed is slow and the completion time has not yet been calculated |
Second time | 100000 | 900 | 900 ~ 0 | 27M | The merge speed is fast, only 2572 seconds |
3. Find the most frequent occurrences Strings and their times
import time def read_line(filepath): with open(filepath,'r') as rf: for line in rf: yield line start_ts = time.time() max_str = None max_count = 0 for line in read_line('merged.txt'): string,count = line.strip().split('\t') if int(count) > max_count: max_count = int(count) max_str = string print(max_str,max_count) print('Runtime {}'.format(time.time()-start_ts))
The merged file has a total of 9999788 lines and a size of 256M; it takes 27 seconds to perform the search and occupies 6480KB of memory.
The above is the detailed content of One article to understand the problem of large file sorting/external memory sorting. For more information, please follow other related articles on the PHP Chinese website!

Hot AI Tools

Undress AI Tool
Undress images for free

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics











Polymorphism is a core concept in Python object-oriented programming, referring to "one interface, multiple implementations", allowing for unified processing of different types of objects. 1. Polymorphism is implemented through method rewriting. Subclasses can redefine parent class methods. For example, the spoke() method of Animal class has different implementations in Dog and Cat subclasses. 2. The practical uses of polymorphism include simplifying the code structure and enhancing scalability, such as calling the draw() method uniformly in the graphical drawing program, or handling the common behavior of different characters in game development. 3. Python implementation polymorphism needs to satisfy: the parent class defines a method, and the child class overrides the method, but does not require inheritance of the same parent class. As long as the object implements the same method, this is called the "duck type". 4. Things to note include the maintenance

A class method is a method defined in Python through the @classmethod decorator. Its first parameter is the class itself (cls), which is used to access or modify the class state. It can be called through a class or instance, which affects the entire class rather than a specific instance; for example, in the Person class, the show_count() method counts the number of objects created; when defining a class method, you need to use the @classmethod decorator and name the first parameter cls, such as the change_var(new_value) method to modify class variables; the class method is different from the instance method (self parameter) and static method (no automatic parameters), and is suitable for factory methods, alternative constructors, and management of class variables. Common uses include:

ListslicinginPythonextractsaportionofalistusingindices.1.Itusesthesyntaxlist[start:end:step],wherestartisinclusive,endisexclusive,andstepdefinestheinterval.2.Ifstartorendareomitted,Pythondefaultstothebeginningorendofthelist.3.Commonusesincludegetting

Parameters are placeholders when defining a function, while arguments are specific values passed in when calling. 1. Position parameters need to be passed in order, and incorrect order will lead to errors in the result; 2. Keyword parameters are specified by parameter names, which can change the order and improve readability; 3. Default parameter values are assigned when defined to avoid duplicate code, but variable objects should be avoided as default values; 4. args and *kwargs can handle uncertain number of parameters and are suitable for general interfaces or decorators, but should be used with caution to maintain readability.

Iterators are objects that implement __iter__() and __next__() methods. The generator is a simplified version of iterators, which automatically implement these methods through the yield keyword. 1. The iterator returns an element every time he calls next() and throws a StopIteration exception when there are no more elements. 2. The generator uses function definition to generate data on demand, saving memory and supporting infinite sequences. 3. Use iterators when processing existing sets, use a generator when dynamically generating big data or lazy evaluation, such as loading line by line when reading large files. Note: Iterable objects such as lists are not iterators. They need to be recreated after the iterator reaches its end, and the generator can only traverse it once.

There are many ways to merge two lists, and choosing the right way can improve efficiency. 1. Use number splicing to generate a new list, such as list1 list2; 2. Use = to modify the original list, such as list1 =list2; 3. Use extend() method to operate on the original list, such as list1.extend(list2); 4. Use number to unpack and merge (Python3.5), such as [list1,*list2], which supports flexible combination of multiple lists or adding elements. Different methods are suitable for different scenarios, and you need to choose based on whether to modify the original list and Python version.

The key to dealing with API authentication is to understand and use the authentication method correctly. 1. APIKey is the simplest authentication method, usually placed in the request header or URL parameters; 2. BasicAuth uses username and password for Base64 encoding transmission, which is suitable for internal systems; 3. OAuth2 needs to obtain the token first through client_id and client_secret, and then bring the BearerToken in the request header; 4. In order to deal with the token expiration, the token management class can be encapsulated and automatically refreshed the token; in short, selecting the appropriate method according to the document and safely storing the key information is the key.

Python's magicmethods (or dunder methods) are special methods used to define the behavior of objects, which start and end with a double underscore. 1. They enable objects to respond to built-in operations, such as addition, comparison, string representation, etc.; 2. Common use cases include object initialization and representation (__init__, __repr__, __str__), arithmetic operations (__add__, __sub__, __mul__) and comparison operations (__eq__, ___lt__); 3. When using it, make sure that their behavior meets expectations. For example, __repr__ should return expressions of refactorable objects, and arithmetic methods should return new instances; 4. Overuse or confusing things should be avoided.
