Scenario description:
There are many lines of log text, compressed into TB-scale gzip files, one per day. A stream is used to transfer segments of each compressed file, which are then decompressed; the decompressed text is tokenized and indexed. Later, when a term is looked up, it is located to the file and segment that contain it, and that segment is streamed and decompressed again. (In effect, the goal is to build an Elasticsearch-like search engine on top of the existing compressed files.)
The problem is that what is received is not a complete compressed file but a block of binary data, so the received data cannot be decompressed on its own because information is missing.
What I want to implement now is the following: first, decompress the received stream data and restore it to complete data (the original log data is newline-separated, and for each piece of stream data you can obtain the pre-compression text and its offset into the corresponding file); second, since transmission and storage may corrupt the data, for each data stream, decompress as much data as possible when an error occurs.
Part of the relevant code is as follows: (modified from https://stackoverflow.com/que...)
```python
import traceback
import zlib

CHUNKSIZE = 30

d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS: expect a gzip header
with open('test.py.gz', 'rb') as f:
    buffer = f.read(CHUNKSIZE)
    i = 0
    while buffer:
        i += 1
        try:
            # skip two chunks
            if i < 3 or i > 4:
                outstr = d.decompress(buffer)
                print(b'*' * 10 + outstr + b'#' * 10)
        except Exception:
            traceback.print_exc()
        finally:
            buffer = f.read(CHUNKSIZE)
    outstr = d.flush()
    print(outstr)
```
After the chunks are skipped (i >= 3), every subsequent decompress call in the loop raises an error.
My conclusion is that if the stream is not contiguous (part of the data is skipped), then the subsequent data cannot be decompressed.
Question 1: Is there a way to correctly decompress each piece of received data? (Since this may involve gzip's compression algorithm and data structures, I am reading the relevant code. If the problem can be solved by appending some chunk in the transmission header, or some chunks before and after the data to be decompressed, that is acceptable.)
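Regarding Question 1: DEFLATE blocks can reference up to 32 KB of preceding data, so an arbitrary slice of a gzip file generally cannot be decompressed on its own. If you control how the files are compressed, one common workaround (used by formats such as BGZF and dictzip) is to insert full-flush points at segment boundaries and record their byte offsets; each segment between two flush points is then independently decompressible. A minimal sketch with Python's zlib (segment contents and sizes are illustrative):

```python
import zlib

# Compress two log segments with a full-flush point between them.
co = zlib.compressobj(wbits=-zlib.MAX_WBITS)  # raw deflate, no gzip header
seg1 = b"2023-01-01 first log line\n" * 50
seg2 = b"2023-01-01 second log line\n" * 50

part1 = co.compress(seg1) + co.flush(zlib.Z_FULL_FLUSH)  # flush resets history
part2 = co.compress(seg2) + co.flush(zlib.Z_FINISH)

# Record the offset: part2 starts at len(part1) in the file. Later,
# part2 can be decompressed on its own with a fresh decompressor,
# because the full flush cleared the back-reference window.
do = zlib.decompressobj(wbits=-zlib.MAX_WBITS)
restored = do.decompress(part2)
assert restored == seg2
```

With this layout, the index only needs to store (file, flush-point offset) per segment; the cost is slightly worse compression, since back-references cannot cross segment boundaries.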
Question 2:
If you cannot correctly decompress every part of the data received, how can you decompress as much data as possible?
I think you can implement a resume-on-error mechanism. Back up the current data stream before transmitting it, and have the receiver judge whether the current stream arrived completely; this requires that you are able to change the transmission protocol between sender and receiver. If an error occurs, the receiver immediately reports a FAIL message to the sender, and transmission resumes from the previous segment. If there is no error, it returns an OK message and the next segment can be sent. This guarantees data integrity. If the file is very large, you can back up more data segments in memory and make finer-grained judgments.
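This acknowledge/retransmit idea could be sketched as follows. This is a hypothetical in-memory simulation, not the asker's actual transport: `receive_stream` and the segment getters are made-up names, and the "resend" is simulated by retrying the same callable. The useful trick is `Decompress.copy()`, which lets you snapshot the decompressor state before each segment and roll back if it raises:

```python
import gzip
import zlib

def receive_stream(segment_getters, max_retries=3):
    """Decompress gzip segments one by one; on zlib.error, roll the
    decompressor back to its snapshot and request the segment again
    (here: call the getter again, which may return corrected bytes)."""
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    output = []
    for get_segment in segment_getters:      # each item returns bytes
        for attempt in range(max_retries):
            snapshot = d.copy()              # back up decompressor state
            try:
                output.append(d.decompress(get_segment()))
                break                        # OK: move on to next segment
            except zlib.error:
                d = snapshot                 # roll back, then "resend"
        else:
            raise RuntimeError("segment failed after retries")
    output.append(d.flush())
    return b"".join(output)

# Simulation: split a gzip blob into chunks; the first chunk arrives
# corrupted once (bad gzip header => guaranteed zlib.error), then correct.
data = b"log line\n" * 100
blob = gzip.compress(data)
chunks = [blob[i:i + 64] for i in range(0, len(blob), 64)]
attempts = {0: [b"\xff" * 8, chunks[0]]}

def make_getter(i):
    def get():
        if attempts.get(i):
            return attempts[i].pop(0)
        return chunks[i]
    return get

restored = receive_stream([make_getter(i) for i in range(len(chunks))])
assert restored == data
```

Note that a corrupted chunk in the middle of a DEFLATE stream is not always detected immediately by zlib, so in a real protocol you would also want a per-segment checksum (e.g. CRC32) to decide whether to send OK or FAIL.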
I'm not sure about the problem you describe, but some questions and answers on Stack Overflow may be helpful:
How can I decompress a gzip stream with zlib?
Python decompressing gzip chunk-by-chunk