From beginner to proficient, learn Linux redirection and pipeline tools to speed up your workflow!-LINUX-php.cn

Improving efficiency, tuning the operating system, and automating routine work are goals for every IT practitioner. On Linux, fluent use of redirection and pipes on the command line is an essential skill. This article explains how redirection and pipes work and how to use them, with examples.

I like Linux very much, and some of its designs are especially elegant. For example, a complex problem can be broken into several small ones, each solved with a ready-made tool, and the tools can be combined flexibly with pipes and redirection. Written as a shell script, this is very efficient.


This article shares some of the pitfalls I ran into when using redirection and pipes in practice. Understanding a few of the underlying principles makes writing scripts much more efficient.

Pitfalls of the > and >> redirection characters

Let's start with the first question: what happens if we run the following command?

$ cat file.txt > file.txt


We are reading from and writing to the same file, so it feels like nothing should happen, right?

In fact, the command above empties file.txt.

PS: Some Linux distributions may report an error directly. You can run cat < file.txt > file.txt to bypass this check.

As mentioned in the previous article on Linux processes and file descriptors, a program does not need to care where its standard input and output point. It is the shell that changes where a program's standard input and output point, using pipes and redirection symbols.

So when we run cat file.txt > file.txt, the shell first opens file.txt. Because the redirection symbol is >, the file's contents are cleared at that moment. The shell then sets the standard output of the cat command to file.txt, and only after that does cat start running.

That is the following process:

1. The shell opens file.txt and clears its contents.
2. The shell points the standard output of the cat command to file.txt.
3. The shell executes the cat command, which reads a file that is now empty.
4. The cat command has nothing to write to its standard output (file.txt).

So, the final result is that file.txt becomes an empty file.
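The steps above can be reproduced safely in a throwaway directory. The following is a minimal sketch; the variable names (work, size_after_true, size_after_cat) are made up for illustration. It first uses true, a command that writes nothing at all, to show that the truncation is done by the shell and not by the command being run.

```shell
# Work in a throwaway directory so no real file is touched.
work=$(mktemp -d)
printf 'line 1\nline 2\n' > "$work/file.txt"

# `true` writes nothing, yet the file ends up empty: the shell truncated it
# while setting up the redirection, before the command ran.
true > "$work/file.txt"
size_after_true=$(( $(wc -c < "$work/file.txt") ))

# Same setup again, this time with cat reading the file it writes to.
printf 'line 1\nline 2\n' > "$work/file.txt"
cat "$work/file.txt" > "$work/file.txt"
size_after_cat=$(( $(wc -c < "$work/file.txt") ))

echo "after true: $size_after_true bytes, after cat: $size_after_cat bytes"
rm -rf "$work"
```

Both sizes come out as 0, which matches the four steps: by the time cat gets to read, there is nothing left in the file.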

We know that > will clear the target file, and >> will append content to the end of the target file, so what will happen if the redirection symbol > is changed to >>?

$ echo hello world > file.txt # the file contains a single line
$ cat file.txt >> file.txt # this command loops forever


One line of content is first written into file.txt. After executing cat file.txt >> file.txt, the expected result should be two lines of content.

Unfortunately, the result is not what we expect. Instead, the command keeps appending hello world to file.txt in an infinite loop. The file quickly grows very large, and the command can only be stopped with Ctrl+C.

This is interesting, why is there an infinite loop? In fact, after a little analysis, you can think of the reason:

First, recall how the cat command behaves. If you run cat on its own, it reads keyboard input from the command line, and every time you press Enter it echoes what you typed. In other words, cat reads data line by line and outputs each piece as it goes.

The execution of cat file.txt >> file.txt therefore proceeds as follows:

1. Open file.txt and prepare to append content to the end of the file.
2. Point the standard output of the cat command to the file.txt file.
3. The cat command reads a line of content in file.txt and writes it to the standard output (append to the file.txt file).
4. Since a line of data has just been written, the cat command finds that there is still content that can be read in file.txt, and will repeat step 3.

This is like traversing a list while appending elements to it: the traversal never reaches the end, so the command loops forever.
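Running the real command is unpleasant because it never stops, so here is a bounded simulation of the same mechanism. This is only a sketch: it uses the shell's read builtin as a stand-in for cat, and max_appends is a safety cap introduced for this example (cat has no such limit). The reader keeps one open file descriptor on the file while a writer appends every line that was just read.

```shell
work=$(mktemp -d)
f="$work/file.txt"
printf 'hello world\n' > "$f"

exec 3< "$f"          # reader: open the file once and keep reading from it
appended=0
max_appends=5         # safety cap (an assumption of this sketch); cat has none
while [ "$appended" -lt "$max_appends" ] && IFS= read -r line <&3; do
    printf '%s\n' "$line" >> "$f"   # writer: append what was just read
    appended=$((appended + 1))
done
exec 3<&-             # close the reader

line_count=$(( $(wc -l < "$f") ))
echo "appended $appended times, file now has $line_count lines"
rm -rf "$work"
```

The loop only ends because of the cap. Every read succeeds, since the previous append has already put a new line in front of the reader, so the reader never sees the end of the file.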

Using the > redirection character together with the | pipe character

A common requirement is to keep the first N lines of a file and delete the rest.

On Linux, the head command extracts the first few lines of a file:

$ cat file.txt # file.txt contains five lines
1
2
3
4
5
$ head -n 2 file.txt # head reads the first two lines
1
2
$ cat file.txt | head -n 2 # head can also read from standard input
1
2


If we want to keep the first 2 lines of the file and delete the others, we may use the following command:

$ head -n 2 file.txt > file.txt


But this repeats the mistake described above. file.txt ends up empty, which is not what we need.

Can we avoid the pitfall by writing the command like this?

$ cat file.txt | head -n 2 > file.txt


The answer is no: the file's contents are still cleared.

What? Did the pipe spring a leak and lose all the data?

In the previous article, Linux processes and file descriptors, I also explained that a pipe essentially connects the standard output of one command to the standard input of the next, so that the output of the former becomes the input of the latter.

However, if you expect the command above to give the right result, it is probably because you assume that commands connected by a pipe are executed serially. This is a common mistake. In fact, commands connected by a pipe are executed in parallel.

You may think that the shell first runs cat file.txt, which reads the whole of file.txt normally, and then passes that content through the pipe to head -n 2 > file.txt.

By then file.txt has been cleared, but head does not read from the file. It reads from the pipe, so it should be able to write the two lines to file.txt correctly.

But this understanding is wrong. The shell runs the commands connected by a pipe in parallel. For example, consider the following command:

$ sleep 5 | sleep 5


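The shell starts both sleep processes at the same time, so the command sleeps for 5 seconds, not 10.

This is somewhat counterintuitive. Take this common command, for example: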
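The timing is easy to check. Here is a small sketch with shorter sleeps so that it finishes quickly; the variable names are illustrative, and date +%s only has one-second resolution, so the measurement is approximate.

```shell
start=$(date +%s)
sleep 2 | sleep 2          # both processes start together
end=$(date +%s)
elapsed=$((end - start))
echo "pipeline took about ${elapsed}s"
```

If the two commands ran one after the other, the pipeline would need at least 4 seconds. It takes about 2.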

$ cat filename | grep 'pattern'


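Intuition suggests that cat runs first and reads the entire contents of filename in one go, then hands them to grep to search.

In reality, cat and grep run at the same time. We still get the expected result because grep 'pattern' blocks while waiting on its standard input, and cat writes data to grep's standard input through a Linux pipe.

Running the following command gives a direct feel for cat and grep running at the same time, with grep processing what we type on the keyboard in real time: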

$ cat | grep 'pattern'
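A non-interactive way to see the same thing is to pipe a producer that never finishes on its own into a consumer that stops early. This sketch uses yes, which prints y forever, and head; the variable name first_three is made up for the example.

```shell
# yes never stops by itself. The pipeline still finishes, because head runs
# at the same time, exits after three lines, and the pipe then closes on yes.
first_three=$(yes | head -n 3)
printf '%s\n' "$first_three"
```

If the shell really ran yes to completion before starting head, this command would hang forever. Instead it prints three lines of y and returns at once.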


With all that said, let's return to the original problem:

$ cat file.txt | head -n 2 > file.txt


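cat and head run in parallel. Which one gets there first is not determined, so the result is not determined either.

If the head side runs first (more precisely, if the shell sets up its > redirection first), file.txt is emptied before cat can read anything. Conversely, if cat reads the file's contents first, we get the expected result.

However, in my experiment (repeating this race 10,000 times), the failure case where file.txt is emptied occurred far more often than the expected result. I am not yet sure why. It is probably related to how the Linux kernel implements processes and pipes.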
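The experiment can be sketched as a small loop that repeats the race and counts the outcomes. This is not the author's original script, only an illustration: the variable names are made up, and it uses 50 runs instead of 10,000 to stay quick. The counts differ from run to run and from machine to machine, so no expected numbers are given.

```shell
work=$(mktemp -d)
f="$work/file.txt"
runs=50        # the article used 10,000 repetitions; 50 keeps the sketch quick
emptied=0
kept=0
other=0

i=0
while [ "$i" -lt "$runs" ]; do
    printf '1\n2\n3\n4\n5\n' > "$f"     # restore the five-line file
    cat "$f" | head -n 2 > "$f"          # the racy command under test
    content=$(cat "$f")
    if [ -z "$content" ]; then
        emptied=$((emptied + 1))         # truncation won the race
    elif [ "$content" = "$(printf '1\n2')" ]; then
        kept=$((kept + 1))               # cat read the file in time
    else
        other=$((other + 1))             # anything unexpected
    fi
    i=$((i + 1))
done
echo "emptied: $emptied, first two lines kept: $kept, other: $other (of $runs)"
rm -rf "$work"
```

One plausible explanation for the imbalance, offered as an assumption and not something the article establishes: the shell opens and truncates the redirection target while it is still setting up head, before head itself is loaded, whereas cat has to be started and open the file before it can read anything.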

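Solution

Having covered how pipes and redirection behave, how do we avoid the pitfall of the file being emptied?

The most reliable approach is to never read and write the same file at the same time, and to go through a temporary file instead.

For example, to keep only the first two lines of file.txt, you can write:

# Write the data to a temporary file first, then overwrite the original file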

$ cat file.txt | head -n 2 > temp.txt && mv temp.txt file.txt


This is the simplest and most reliable method.
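In a script, a fixed name like temp.txt can collide with an existing file. A slightly more defensive sketch of the same idea uses mktemp to get a unique name. The paths and variable names here are illustrative, and the demo runs in a throwaway directory.

```shell
work=$(mktemp -d)
f="$work/file.txt"
printf '1\n2\n3\n4\n5\n' > "$f"

# Create a uniquely named temporary file next to the original.
tmp=$(mktemp "$work/tmp.XXXXXX")
if head -n 2 "$f" > "$tmp"; then
    mv "$tmp" "$f"        # replace the original only if head succeeded
else
    rm -f "$tmp"          # on failure, leave the original untouched
fi

result=$(cat "$f")
printf '%s\n' "$result"
rm -rf "$work"
```

This prints the lines 1 and 2. One thing to keep in mind is that mktemp creates the file readable and writable by its owner only, so after the mv the file carries those permissions. If the original permissions matter, restore them with chmod afterwards.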

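If you find this command too long, you can install the moreutils package with a package manager such as apt, brew, or yum. It provides a sponge command, which is used like this:

# Pass the data to sponge first, and let sponge write it to the original file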
$ cat file.txt | head -n 2 | sponge file.txt

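
As the name sponge suggests, it first "soaks up" all of its input and only writes it to file.txt at the end. The core idea is the same as with a temporary file: the sponge plays the role of the temporary file, which avoids opening the same file for reading and writing at the same time.

On Linux, redirection and pipes are very useful command-line tools that help us keep track of the system's state and information. Mastering them helps with system tuning and automation, and so with working more efficiently. I hope this article has given you a deeper understanding of how redirection and pipes work and how to use them.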
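To make the idea concrete, here is a rough shell approximation of what sponge does. The function name soak is hypothetical and is not part of moreutils; the real sponge is a separate program and may differ in its details. The sketch reads all of standard input into a temporary file and only then opens the target for writing.

```shell
# soak: hypothetical helper for illustration, not part of moreutils.
# It absorbs all of stdin first, and only afterwards writes to the target.
soak() {
    soak_tmp=$(mktemp) || return 1
    cat > "$soak_tmp" && cat "$soak_tmp" > "$1"
    soak_status=$?
    rm -f "$soak_tmp"
    return "$soak_status"
}

work=$(mktemp -d)
printf '1\n2\n3\n4\n5\n' > "$work/file.txt"
cat "$work/file.txt" | head -n 2 | soak "$work/file.txt"
result=$(cat "$work/file.txt")
printf '%s\n' "$result"
rm -rf "$work"
```

The target is not opened, and therefore not truncated, until the pipe has delivered its end of input. By then cat has already read the original contents, so the race described earlier cannot empty the file.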
