The Linux command line offers powerful text-processing tools, and many tasks can be handled by combining a few commands. This article shows how to deduplicate a text file line by line and sort the result by the number of repetitions. The main commands used are sort, uniq and cut: sort orders the lines, uniq collapses adjacent duplicate lines, and cut extracts columns (ranges of characters or fields) from each line.
Remove duplicate text lines and sort them by the number of repetitions
Example:
First, deduplicate the text lines and count the number of repetitions. Because uniq only collapses duplicates that are adjacent, the file is sorted first; the -c option makes uniq prefix each line with its repetition count.
$ sort test.txt | uniq -c
      2 Apple and Nokia.
      4 Hello World.
      1 I wanna buy an Apple device.
      1 My name is Friendfish.
      2 The Iphone of Apple company.
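The reason for sorting before uniq can be checked with a small sketch. The sample lines below are hypothetical (the article does not show the contents of test.txt), and the temporary file stands in for it.

```shell
# Hypothetical sample data; the article does not show test.txt itself.
tmp=$(mktemp)
printf '%s\n' 'Hello World.' 'Apple and Nokia.' 'Hello World.' 'Apple and Nokia.' > "$tmp"

# Without sorting, identical lines are never adjacent, so uniq keeps all four.
unsorted_count=$(uniq "$tmp" | wc -l | tr -d ' ')

# After sorting, duplicates sit next to each other and collapse to two lines.
sorted_count=$(sort "$tmp" | uniq | wc -l | tr -d ' ')

echo "uniq alone: $unsorted_count lines; sort | uniq: $sorted_count lines"
rm -f "$tmp"
```

The tr call only strips padding that some wc implementations add around the number.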
Sort lines of text by the number of repetitions.
sort -n compares lines by the numeric value at the beginning of each line. The default order is ascending; to sort in descending order, add the -r option (sort -rn).
$ sort test.txt | uniq -c | sort -rn
      4 Hello World.
      2 The Iphone of Apple company.
      2 Apple and Nokia.
      1 My name is Friendfish.
      1 I wanna buy an Apple device.
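With sort -rn, lines that share the same count end up in an order that depends on sort's fallback comparison. If a predictable order among ties matters, one possible variation (not part of the original recipe) is to give sort two keys: the count in descending numeric order, then the text in ascending order. The sample data is again hypothetical.

```shell
# Hypothetical sample data; the article does not show test.txt itself.
tmp=$(mktemp)
printf '%s\n' 'The Iphone of Apple company.' 'Hello World.' 'Apple and Nokia.' \
  'Hello World.' 'The Iphone of Apple company.' 'Apple and Nokia.' 'Hello World.' > "$tmp"

# Key 1: the count, numeric and descending. Key 2: the rest of the line, ascending.
# LC_ALL=C makes the text comparison byte-wise and therefore predictable.
ranked=$(LC_ALL=C sort "$tmp" | uniq -c | LC_ALL=C sort -k1,1nr -k2)
printf '%s\n' "$ranked"
rm -f "$tmp"
```

Here "Hello World." comes first with a count of 3, and the two lines with a count of 2 follow in alphabetical order.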
Remove the repetition count in front of each line.
The cut command operates on text lines column by column. In the output above, the repetition count and its padding occupy the first 8 characters of each line. Therefore, you can use cut -c 9- to keep the 9th and subsequent characters of each line, which removes the count.
$ sort test.txt | uniq -c | sort -rn | cut -c 9-
Hello World.
The Iphone of Apple company.
Apple and Nokia.
My name is Friendfish.
I wanna buy an Apple device.
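The fixed offset in cut -c 9- assumes the count column is exactly 8 characters wide, which may not hold for very large counts or for uniq implementations that pad differently. As a minimal sketch of an alternative (a swapped-in technique, not the article's method), sed can strip the leading padding, the digits, and the single following space by pattern instead of by position. The sample data is hypothetical.

```shell
# Hypothetical sample data; the article does not show test.txt itself.
tmp=$(mktemp)
printf '%s\n' 'Hello World.' 'Apple and Nokia.' 'Hello World.' 'Hello World.' \
  'Apple and Nokia.' 'My name is Friendfish.' > "$tmp"

# Count, rank by count, then remove "<spaces><digits><space>" from the start of each line.
deduped=$(sort "$tmp" | uniq -c | sort -rn | sed 's/^ *[0-9][0-9]* //')
printf '%s\n' "$deduped"
rm -f "$tmp"
```

This prints the three distinct lines from most to least frequent, regardless of how wide the count column is.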