I have multiple files (about 20 files, with 30000 lines and 32 columns) and I need to keep only the lines that start with the same string. I found these cases that are quite similar to what I need, but I don't know how to adapt them:
compare multiple files(more than two) with two different columns
In my case each file has a first column made of 12-character strings, and I need to keep only the lines starting with strings that are present in ALL the files (one output file for every input file, or a single output file like in the cases mentioned above, is fine). My files are like this:
file1:
-13 -5 0 19.3769 46.9197 1
-13 -4 -2 347.911 57.7232 1
-13 -4 -1 38.5696 39.0027 1
-13 -4 0 2227.39 124.894 1
-13 -3 -3 113.001 40.2117 1
-13 -3 -2 850.847 78.2881 1
file2:
-13 -5 0 2.19085 50.4632 1
-13 -4 -2 283.628 56.7731 1
-13 -4 -1 41.179 48.6423 1
-13 -4 0 1753.54 125.88 1
-13 -3 -3 28.2363 40.6518 1
-13 -3 -2 562.736 66.0301 1
-13 -3 -1 750.747 77.2795 1
Output file1:
-13 -5 0 19.3769 46.9197 1
-13 -4 -2 347.911 57.7232 1
-13 -4 -1 38.5696 39.0027 1
-13 -3 -3 113.001 40.2117 1
-13 -3 -2 850.847 78.2881 1
Output file2:
-13 -5 0 2.19085 50.4632 1
-13 -4 -2 283.628 56.7731 1
-13 -4 -1 41.179 48.6423 1
-13 -3 -3 28.2363 40.6518 1
-13 -3 -2 562.736 66.0301 1
1 Answer
One approach would be to first find all sets of 12 initial characters that are present in more than one file:
cut -c-12 file* | sort | uniq -c
The cut command above prints the first 12 characters of every line in every file whose name starts with file; these are then sorted, and uniq -c prefixes each distinct line with the number of times it was found. Running this on your example files returns:
$ cut -c-12 file* | sort | uniq -c
1 -13 -3 -1
2 -13 -3 -2
2 -13 -3 -3
2 -13 -4 0
2 -13 -4 -1
2 -13 -4 -2
2 -13 -5 0
So, all prefixes but the first appear in both files. Now, keep only those that appear the desired number of times (20 in your case, since a prefix present in all 20 files is counted 20 times):
cut -c-12 file* | sort | uniq -c | rev | sed -n 's/ 02 *$//p' | rev
rev simply prints its input reversed. I am using it here so that the count added by uniq -c becomes the last field of each line. The reversed output is then passed to sed, which is told to print only lines that end with a space, 02 (that is 20 reversed) and zero or more spaces, and to delete that trailing part. This keeps only the prefixes that appeared 20 times, and the final rev brings us back to the original format.
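To see why the sed expression looks for 02 rather than 20, here is what rev does to one made-up line of uniq -c output (the leading spaces that uniq -c adds end up as invisible trailing spaces):
$ echo '     20 -13 -3 -2' | rev
2- 3- 31- 02
The count 20 shows up reversed as 02 at the end of the line, which is exactly what the sed expression matches and strips before the final rev restores the prefix.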
You can now pass the whole thing to grep as a list of strings to search for:
$ grep -f <(cut -c-12 file* | sort | uniq -c |
rev | sed -n 's/ 02 *$//p' | rev) file*
-13 -5 0 19.3769 46.9197 1
-13 -4 -2 347.911 57.7232 1
-13 -4 -1 38.5696 39.0027 1
-13 -4 0 2227.39 124.894 1
-13 -3 -3 113.001 40.2117 1
-13 -3 -2 850.847 78.2881 1
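A side note: when grep is given more than one file it normally prefixes each matching line with the file name; if you prefer the bare lines without that prefix, add -h:
grep -h -f <(cut -c-12 file* | sort | uniq -c |
rev | sed -n 's/ 02 *$//p' | rev) file*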
If your shell doesn't support the <() format, you could save the results of the cut pipeline in a separate file and use that, or just run the whole thing in a loop:
cut -c-12 file* | sort | uniq -c | rev | sed -n 's/ 02 *$//p' | rev |
while IFS= read -r l; do grep -- "^$l" file*; done
To have each file's output in a separate file, use:
cut -c-12 file* | sort | uniq -c | rev | sed -n 's/ 02 *$//p' | rev > list
for f in file*; do grep -f list "$f" > "$f.new"; done
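Note that counting raw occurrences like this assumes each 12-character prefix appears at most once per file. If a prefix can repeat within a single file, one way around it (a sketch of the same idea, still assuming 20 files) is to de-duplicate each file's prefixes before counting:
# count how many files contain each prefix, not how many lines match it
for f in file*; do cut -c-12 "$f" | sort -u; done |
sort | uniq -c | rev | sed -n 's/ 02 *$//p' | rev > list
for f in file*; do grep -f list "$f" > "$f.new"; done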
- Thanks for answering; the problem is that I need to keep only the strings that are present in all the files. Say a string is present in 9 files out of 10, it has to be discarded. – Eleonora, Sep 25, 2015 at 15:35
- @Eleonora sorry about that, try the updated answer. – terdon, Sep 25, 2015 at 15:52
- rev's definitely not needed here: sed '/^[[:blank:]]\{1,\}20 /!d;//s///' – don_crissti, Sep 25, 2015 at 16:30
- @don_crissti I find the two revs clearer and simpler. – terdon, Sep 25, 2015 at 17:24
- @terdon thank you very much, this is exactly what I needed. It works perfectly. – Eleonora, Sep 28, 2015 at 7:02
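For completeness, don_crissti's variant above avoids the two rev calls by matching the count at the front of the uniq -c output instead; plugged into the same pipeline (again assuming 20 files) it would look like this:
cut -c-12 file* | sort | uniq -c | sed '/^[[:blank:]]\{1,\}20 /!d;//s///' > list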
- Shouldn't the output of file1 also contain -13 -4 -1 38.5696 39.0027 1? The string -13 -4 -1 is in both file1 and file2.