Currently, the company uses HDFS to store logs uploaded by each node server. Because of historical issues, the logs are quite mixed. That is, all kinds of data will be stored in the log. A log file is about 200mb. Sometimes to filter some content, you need to use the cat command of hdfs based on the timestamp, and then grep the keyword. Then input it into a python script through stdin to do some processing on the relevant data.
Now I want to make it input the time, node, keyword and matching model to be queried, and then complete the task in one step. In this way, this gadget can be pushed to all people who need data, instead of asking operation and maintenance to query it every time.
So I started to study python related modules. This hdfs can be uploaded and downloaded, and the contents of directory files can be queried. But when it comes to reading this part, it becomes more troublesome.
with client.read(hdfs_path=....., other parameters) as reader:
content = reader.read()
然后再对content处理
This method is not feasible because there is too much content to be matched at one time, probably several gigabytes of data, and it is impossible to read it all and then process it. It must be filtered and processed during the reading process
I tried for line in content, but it couldn't match the content.
How should we solve this problem? Is it possible to use python to record the file paths that meet the conditions, then execute the HDFS command, and then transfer the matching data to python? That seems too troublesome, and the stability is definitely not good
Later I took a look and found out that when creating the client of the hdfs module, it was connected to the web management page of hadoop 50070. I thought about whether this module itself is not used for data analysis? Hope you can give me some help
What about multi-threading and parallel computing? Reading several Gb at once is naturally slow. Since it is a hadoop framework, it should be enough to use mapreduce well. This thing is probably not designed for speed.