I have multiple files (about 20 files, with 30000 lines and 32 columns) and I need to keep only the lines that start with the same string. I found these cases that are quite similar to what I need, but I don't know how to adapt them:
compare multiple files(more than two) with two different columns
In my case each file has a first column made of 12-character strings, and I need to keep only the lines starting with strings that are present in ALL the files (one output file for every input file, or a single output file like in the cases mentioned above, is fine). My files are like this:
file1:
-13 -5 0 19.3769 46.9197 1
-13 -4 -2 347.911 57.7232 1
-13 -4 -1 38.5696 39.0027 1
-13 -4 0 2227.39 124.894 1
-13 -3 -3 113.001 40.2117 1
-13 -3 -2 850.847 78.2881 1
file2:
-13 -5 0 2.19085 50.4632 1
-13 -4 -2 283.628 56.7731 1
-13 -4 -1 41.179 48.6423 1
-13 -4 0 1753.54 125.88 1
-13 -3 -3 28.2363 40.6518 1
-13 -3 -2 562.736 66.0301 1
-13 -3 -1 750.747 77.2795 1
Output file1:
-13 -5 0 19.3769 46.9197 1
-13 -4 -2 347.911 57.7232 1
-13 -4 -1 38.5696 39.0027 1
-13 -3 -3 113.001 40.2117 1
-13 -3 -2 850.847 78.2881 1
Output file2:
-13 -5 0 2.19085 50.4632 1
-13 -4 -2 283.628 56.7731 1
-13 -4 -1 41.179 48.6423 1
-13 -3 -3 28.2363 40.6518 1
-13 -3 -2 562.736 66.0301 1
1 Answer
One approach would be to first find all sets of 12 initial characters that are present in more than one file:
cut -c-12 file* | sort | uniq -c
The cut command above prints the first 12 characters of every line in every file whose name starts with file; these are then sorted, and uniq -c prefixes each distinct line with the number of times it was found. Running this on your example files returns:
$ cut -c-12 file* | sort | uniq -c
1 -13 -3 -1
2 -13 -3 -2
2 -13 -3 -3
2 -13 -4 0
2 -13 -4 -1
2 -13 -4 -2
2 -13 -5 0
So, all prefixes but the first appear in both files. Now, keep only those that appear the desired number of times (20 in your case, since a prefix present in all 20 files is counted 20 times):
cut -c-12 file* | sort | uniq -c | rev | sed -n 's/ 02 *$//p' | rev
rev simply prints its input reversed. I am using it here so that the count added by uniq -c becomes the last field of each line. The reversed output is then passed to sed, which is told to print only lines that end with a space, 02 (that is 20 reversed) and zero or more spaces, and to delete that trailing part. This keeps only the prefixes that appeared 20 times, and the final rev brings us back to the original format.
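To see why the sed expression looks for 02 rather than 20, here is what rev does to one made-up line of uniq -c output (the leading spaces that uniq -c adds end up as invisible trailing spaces):
$ echo '     20 -13 -3 -2' | rev
2- 3- 31- 02
The count 20 shows up reversed as 02 at the end of the line, which is exactly what the sed expression matches and strips before the final rev restores the prefix.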
You can now pass the whole thing to grep as a list of strings to search for:
$ grep -f <(cut -c-12 file* | sort | uniq -c |
rev | sed -n 's/ 02 *$//p' | rev) file*
-13 -5 0 19.3769 46.9197 1
-13 -4 -2 347.911 57.7232 1
-13 -4 -1 38.5696 39.0027 1
-13 -4 0 2227.39 124.894 1
-13 -3 -3 113.001 40.2117 1
-13 -3 -2 850.847 78.2881 1
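A side note: when grep is given more than one file it normally prefixes each matching line with the file name; if you prefer the bare lines without that prefix, add -h:
grep -h -f <(cut -c-12 file* | sort | uniq -c |
rev | sed -n 's/ 02 *$//p' | rev) file*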
If your shell doesn't support the <() format, you could save the results of the cut pipeline in a separate file and use that, or just run the whole thing in a loop:
cut -c-12 file* | sort | uniq -c | rev | sed -n 's/ 02 *$//p' | rev |
while IFS= read -r l; do grep -- "^$l" file*; done
To have each file's output in a separate file, use:
cut -c-12 file* | sort | uniq -c | rev | sed -n 's/ 02 *$//p' | rev > list
for f in file*; do grep -f list "$f" > "$f.new"; done
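Note that counting raw occurrences like this assumes each 12-character prefix appears at most once per file. If a prefix can repeat within a single file, one way around it (a sketch of the same idea, still assuming 20 files) is to de-duplicate each file's prefixes before counting:
# count how many files contain each prefix, not how many lines match it
for f in file*; do cut -c-12 "$f" | sort -u; done |
sort | uniq -c | rev | sed -n 's/ 02 *$//p' | rev > list
for f in file*; do grep -f list "$f" > "$f.new"; done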
- Thanks for answering; the problem is that I need to keep only the strings that are present in all the files. Say a string is present in 9 files out of 10, it has to be discarded. – Eleonora, Sep 25, 2015 at 15:35
- @Eleonora sorry about that, try the updated answer. – terdon, Sep 25, 2015 at 15:52
- rev's definitely not needed here: sed '/^[[:blank:]]\{1,\}20 /!d;//s///' – don_crissti, Sep 25, 2015 at 16:30
- @don_crissti I find the two revs clearer and simpler. – terdon, Sep 25, 2015 at 17:24
- @terdon thank you very much, this is exactly what I needed. It works perfectly. – Eleonora, Sep 28, 2015 at 7:02
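For completeness, don_crissti's variant above avoids the two rev calls by matching the count at the front of the uniq -c output instead; plugged into the same pipeline (again assuming 20 files) it would look like this:
cut -c-12 file* | sort | uniq -c | sed '/^[[:blank:]]\{1,\}20 /!d;//s///' > list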
- Shouldn't the output of file1 also contain -13 -4 -1 38.5696 39.0027 1? The string -13 -4 -1 is in both file1 and file2.