Introduction
Given two files a and b, with x and y rows of data respectively, where x and y are both on the order of billions, and a machine memory limit of 100M, how do we find the records that appear in both files?
Thoughts
The main difficulty is that this massive amount of data cannot be read into memory at once.
If it cannot be read in all at once, can it be read in multiple passes? And if so, how do we find the common values across those passes?
We can apply divide-and-conquer thinking to reduce the big problem to small ones. A given string always produces the same hash value, so we can hash each record, take the result modulo n, and scatter the records into n bucket files. How do we choose n? With 100M of memory, a PHP array can hold roughly 1 million entries, so if a and b each have about 1 billion rows, n must be at least 1000 (1 billion rows / 1 million entries per file) for each bucket to fit in memory.
With the records scattered this way, identical records from a and b are guaranteed to land in the same bucket file, and each bucket file can be read into memory. We can then find the duplicated records in each bucket in turn and append them to a single output file. The final result is the set of records common to files a and b.
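The bucket assignment is a one-liner; a minimal sketch (the function name bucketOf is illustrative, using the same crc32-of-md5 expression as the split code below):

```php
<?php
// Map a record to a bucket index in [0, n): identical strings always
// hash to the same value, so equal records end up in the same bucket.
function bucketOf(string $line, int $n): int
{
    return crc32(hash('md5', $line)) % $n;
}

// Equal records from different files land in the same bucket file.
var_dump(bucketOf("hello\n", 20) === bucketOf("hello\n", 20)); // bool(true)
```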
Finding the duplicated records within a small file is simple: use each row as a hash-table key and collect the keys whose count is >= 2. (Strictly speaking, a count >= 2 could also come from a record duplicated within a single source file; an exact solution would additionally track which file each line came from.)
Practical operation
Files with a billion rows are too large for a hands-on demonstration and would waste time to process; we only need to demonstrate the approach.
The problem is scaled down to: a 1M memory limit, with a and b each containing 100,000 rows of records. The memory limit can be enforced with PHP's ini_set('memory_limit', '1M');.
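It can be useful to check how much of the budget is actually in use at runtime; a small sketch (using a roomier 8M here so the snippet runs safely anywhere, where the article itself uses 1M):

```php
<?php
// The article caps memory with ini_set('memory_limit', '1M').
// Demonstrate setting the limit and inspecting current usage.
ini_set('memory_limit', '8M');

echo ini_get('memory_limit'), PHP_EOL;                  // the configured cap
printf("currently using %d bytes\n", memory_get_usage()); // bytes allocated so far
```

PHP aborts with a fatal error as soon as an allocation would exceed the configured limit.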
Generate test files
Fill the two files with random numbers:
```php
/**
 * Fill a file with random numbers.
 * Author: ClassmateLin
 * Email: classmatelin.site@gmail.com
 * Site: https://www.classmatelin.top
 * @param string $filename output file name
 * @param int $batch number of batches to generate
 * @param int $batchSize number of rows per batch
 */
function generate(string $filename, int $batch = 1000, int $batchSize = 10000)
{
    for ($i = 0; $i < $batch; $i++) {
        $str = '';
        for ($j = 0; $j < $batchSize; $j++) {
            $str .= rand($batch, $batchSize) . PHP_EOL; // one random number (between $batch and $batchSize) per row
        }
        file_put_contents($filename, $str, FILE_APPEND); // append the batch to the file
    }
}

generate('a.txt', 10); // 10 batches * 10,000 rows = 100,000 rows
generate('b.txt', 10);
```
Split the file
Split a.txt and b.txt into n files by hash modulo.
```php
/**
 * Scatter a file's records into n bucket files by hash modulo.
 * Author: ClassmateLin
 * Email: classmatelin.site@gmail.com
 * Site: https://www.classmatelin.top
 * @param string $filename input file name
 * @param int $mod modulus (number of bucket files)
 * @param string $dir output directory
 */
function splitFile(string $filename, int $mod = 20, string $dir = 'files')
{
    if (!is_dir($dir)) {
        mkdir($dir);
    }
    $fp = fopen($filename, 'r');
    while (!feof($fp)) {
        $line = fgets($fp);
        if ($line === false) { // guard: fgets can return false on the final read
            break;
        }
        $n = crc32(hash('md5', $line)) % $mod;            // hash, then take the modulus
        $filepath = $dir . '/' . $n . '.txt';             // bucket file path
        file_put_contents($filepath, $line, FILE_APPEND); // append the record to its bucket
    }
    fclose($fp);
}

splitFile('a.txt');
splitFile('b.txt');
```
Execute
Running splitFile on both inputs leaves 20 bucket files (0.txt through 19.txt) in the files directory.
Find duplicate records
Now we need to find the common records across the 20 bucket files. Since identical records always share a bucket, this reduces to finding the duplicated records within a single file, repeated 20 times.
Find the same record in a file:
```php
/**
 * Find the duplicated records in one file and append them to an output file.
 * Author: ClassmateLin
 * Email: classmatelin.site@gmail.com
 * Site: https://www.classmatelin.top
 * @param string $inputFilename input file path
 * @param string $outputFilename output file path
 */
function search(string $inputFilename, $outputFilename = 'output.txt')
{
    $table = [];
    $fp = fopen($inputFilename, 'r');
    while (!feof($fp)) {
        $line = fgets($fp);
        if ($line === false) { // guard: fgets can return false on the final read
            break;
        }
        !isset($table[$line]) ? $table[$line] = 1 : $table[$line]++; // initialize to 1, otherwise increment
    }
    fclose($fp);

    foreach ($table as $line => $count) {
        if ($count >= 2) { // keys seen 2 or more times are the common records
            file_put_contents($outputFilename, $line, FILE_APPEND);
        }
    }
}
```
Find the common records across all files:
```php
/**
 * Run search() on every file in a directory, appending results to one output file.
 * Author: ClassmateLin
 * Email: classmatelin.site@gmail.com
 * Site: https://www.classmatelin.top
 * @param string $dirs input directory
 * @param string $outputFilename output file path
 */
function searchAll($dirs = 'files', $outputFilename = 'output.txt')
{
    $files = scandir($dirs);
    foreach ($files as $file) {
        $filepath = $dirs . '/' . $file; // scandir() also returns '.' and '..', filtered out by is_file()
        if (is_file($filepath)) {
            search($filepath, $outputFilename);
        }
    }
}
```
This solves the space problem of processing large files. What about the time problem? Since the buckets are independent, a single machine can exploit multiple CPU cores by processing buckets in parallel, and if that is still not enough, the work can be spread across multiple servers.
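A minimal sketch of one way to use multiple cores, assuming the pcntl extension is available (PHP CLI on Linux/macOS); the function names here are illustrative, not part of the article's code. Each bucket file is independent, so worker processes can each take a disjoint chunk of the bucket list:

```php
<?php
// Split the list of bucket files into one chunk of work per worker.
function partitionWork(array $files, int $workers): array
{
    return array_chunk($files, (int)ceil(count($files) / max(1, $workers)));
}

// Process chunks in parallel with pcntl_fork() when available,
// falling back to sequential processing otherwise.
function runParallel(array $files, int $workers, callable $job): void
{
    $chunks = partitionWork($files, $workers);
    if (!function_exists('pcntl_fork')) {
        foreach ($chunks as $chunk) {
            $job($chunk); // no pcntl: just do the work in this process
        }
        return;
    }
    foreach ($chunks as $chunk) {
        if (pcntl_fork() === 0) { // child process handles its chunk
            $job($chunk);
            exit(0);
        }
    }
    while (pcntl_waitpid(-1, $status) > 0) { // parent waits for all children
        ;
    }
}
```

Note that concurrent workers should each append to their own output file (or serialize writes with flock) to avoid interleaved lines in a shared output.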
Complete code
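The complete program is just the functions above plus a driver. A self-contained, scaled-down sketch of the whole pipeline (row counts reduced so it runs in seconds; for simplicity it reads whole files with file(), where the full version above streams with fgets):

```php
<?php
// End-to-end sketch: generate -> split into buckets -> search each bucket.

function generate(string $filename, int $rows): void
{
    $str = '';
    for ($i = 0; $i < $rows; $i++) {
        $str .= rand(1, 5000) . PHP_EOL; // small value range guarantees duplicates
    }
    file_put_contents($filename, $str);
}

function splitFile(string $filename, int $mod = 20, string $dir = 'files'): void
{
    if (!is_dir($dir)) {
        mkdir($dir);
    }
    foreach (file($filename) as $line) {
        $n = crc32(hash('md5', $line)) % $mod; // same record -> same bucket
        file_put_contents($dir . '/' . $n . '.txt', $line, FILE_APPEND);
    }
}

function search(string $inputFilename, string $outputFilename = 'output.txt'): void
{
    $table = [];
    foreach (file($inputFilename) as $line) {
        $table[$line] = ($table[$line] ?? 0) + 1; // count occurrences per record
    }
    foreach ($table as $line => $count) {
        if ($count >= 2) { // repeated records are the common ones
            file_put_contents($outputFilename, $line, FILE_APPEND);
        }
    }
}

function searchAll(string $dirs = 'files', string $outputFilename = 'output.txt'): void
{
    foreach (scandir($dirs) as $file) {
        $filepath = $dirs . '/' . $file;
        if (is_file($filepath)) { // skips '.' and '..'
            search($filepath, $outputFilename);
        }
    }
}

generate('a.txt', 10000);
generate('b.txt', 10000);
splitFile('a.txt');
splitFile('b.txt');
searchAll();
echo "common records written to output.txt\n";
```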