We can estimate the size of each file as 5G × 64 B = 320 GB, which is far larger than 4 GB, so it is impossible to load either file fully into memory for processing. Consider a divide-and-conquer approach.
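The arithmetic behind the estimate, as a quick sanity check (assuming "5G" means 5 billion urls and 64 bytes per url):

```python
# Rough size estimate: 5 billion urls per file, 64 bytes each.
urls_per_file = 5 * 10**9
bytes_per_url = 64
file_size = urls_per_file * bytes_per_url
print(file_size / 10**9)          # 320.0 -> ~320 GB per file
print(file_size / 1000 / 10**6)   # 320.0 -> ~320 MB per small file after splitting
```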
Traverse file a, compute hash(url) % 1000 for each url, and write the url into one of 1000 small files (call them a0, a1, ..., a999) according to that value. Each small file is then roughly 320 MB. Traverse file b and distribute its urls into 1000 small files (b0, b1, ..., b999) using the same hash. After this step, any url that appears in both a and b must land in a corresponding pair of small files (a0 vs b0, a1 vs b1, ..., a999 vs b999); non-corresponding pairs (such as a0 vs b99) cannot share a url, because the same url always hashes to the same index. So we only need to find the common urls within each of the 1000 pairs of small files.
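A minimal sketch of the partitioning step in Python, assuming one url per line and hypothetical input files named "a" and "b" in the working directory. md5 stands in for the hash function because Python's built-in hash() is randomized between runs, and the same index must come out when processing both files:

```python
import hashlib

N_BUCKETS = 1000

def bucket_index(url: str) -> int:
    # Stable hash so the same url gets the same bucket index for file a and file b.
    return int(hashlib.md5(url.encode()).hexdigest(), 16) % N_BUCKETS

def split_file(path: str, prefix: str) -> None:
    """Stream `path` once and distribute its urls into N_BUCKETS small files
    named prefix0, prefix1, ..., so nothing has to fit in memory here."""
    # Note: keeping 1000 files open at once may require raising the
    # process open-file limit (ulimit -n) on some systems.
    outs = [open(f"{prefix}{i}", "w") for i in range(N_BUCKETS)]
    try:
        with open(path) as f:
            for line in f:
                url = line.strip()
                if url:
                    outs[bucket_index(url)].write(url + "\n")
    finally:
        for out in outs:
            out.close()

# split_file("a", "a")   # produces a0 ... a999
# split_file("b", "b")   # produces b0 ... b999
```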
For example, for the pair a0 vs b0, traverse a0 and insert each url into a hash_map. Then traverse b0; if a url from b0 is found in the hash_map, that url exists in both a and b, so just save it to the output file.
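A sketch of the per-pair comparison, assuming the bucket files produced by the split above; a Python set plays the role of the hash_map, since only membership matters here:

```python
def common_urls(a_bucket: str, b_bucket: str, out_path: str) -> None:
    """Load one small file of a into an in-memory set, then stream the
    corresponding small file of b and save every url found in both."""
    with open(a_bucket) as fa:
        seen = {line.strip() for line in fa if line.strip()}
    with open(b_bucket) as fb, open(out_path, "a") as out:
        for line in fb:
            url = line.strip()
            if url in seen:
                out.write(url + "\n")
                seen.discard(url)  # avoid duplicate output if b repeats a url

# Run the comparison for every corresponding pair a_i vs b_i:
# for i in range(1000):
#     common_urls(f"a{i}", f"b{i}", "common_urls.txt")
```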
If the split is uneven and some small files are still too large (for example, larger than 2 GB), those files can be split again in a similar way, but with a different hash function or modulus (re-using hash(url) % 1000 would send every url back into a single file); compare each resulting sub-file against the corresponding small file on the other side, or re-split that file with the same second-level hash so matching urls stay paired. A sketch of this re-split step follows.
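A sketch of re-splitting one oversized pair, assuming a pair (a_i, b_i) turned out too large; the salt "level2" and the sub-bucket count k = 16 are arbitrary illustrative choices. Salting the hash is what keeps the second-level split from collapsing back into one file, and applying the same salted hash to both sides keeps matching urls in corresponding sub-files:

```python
import hashlib

def sub_bucket(url: str, salt: str, k: int) -> int:
    # Salt the hash so urls that collided at the first level spread out again.
    return int(hashlib.md5((salt + url).encode()).hexdigest(), 16) % k

def resplit_pair(a_bucket: str, b_bucket: str, k: int = 16) -> list[tuple[str, str]]:
    """Re-split one oversized pair (a_i, b_i) into k corresponding sub-pairs,
    using the same salted hash on both sides so matching urls stay paired."""
    for path in (a_bucket, b_bucket):
        outs = [open(f"{path}_{j}", "w") for j in range(k)]
        try:
            with open(path) as f:
                for line in f:
                    url = line.strip()
                    if url:
                        outs[sub_bucket(url, salt="level2", k=k)].write(url + "\n")
        finally:
            for out in outs:
                out.close()
    return [(f"{a_bucket}_{j}", f"{b_bucket}_{j}") for j in range(k)]
```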
A Baidu interviewer asked me this question yesterday, so I studied it today.