I have two files, let's say
File1:
Locus_1
Locus_2
Locus_3
File2:
3 3 Locus_1 Locus_40 etc_849
3 2 Locus_2 Locus_94 *
2 2 Locus_6 Locus_1 *
2 3 Locus_3,Locus_4 Locus_50 *
3 3 Locus_9 Locus_3 etc_667
I want to run a grep -F with the lines of the first file as patterns, but only against the third column of the second file (in the original File2
the fields are separated by tabs), so that the output is:
Output:
3 3 Locus_1 Locus_40 etc_849
3 2 Locus_2 Locus_94 *
2 3 Locus_3,Locus_4 Locus_50 *
How can I do it?
Edit
To Chaos: no, the comma is not a mistake. I can have more than one Locus_* in a column - and in case the second Locus_* (the one after the comma) matches one of the lines of File1
I want it to be retrieved, too!
6 Answers
If grep is not necessary, one simple solution would be to use join for that:
$ join -1 1 -2 3 <(sort file1) <(sort -k3 file2)
Locus_1 3 3 Locus_40 etc_849
Locus_2 3 2 Locus_94 *
Locus_3 2 3 Locus_4 Locus_50 *
Explanation:
join -1 1 -2 3 : join the two files, using the first (and only) field of the first file and the third field of the second file; lines are printed when those fields are equal.
<(sort file1) : join needs sorted input.
<(sort -k3 file2) : the input must be sorted on the join field (the 3rd field here).
As the OP pointed out in a later edit, the ',' is not a typo, and if you run the stated
join -1 1 -2 3 <(sort File1) <(sort -k3 File2)
command with the OP's original example data, you will only get two lines. While join is generally a great tool for similar cases, in this case it is - in my opinion - probably better to go with an awk/perl/... solution rather than doing multiple preprocessing passes to get a suitable input for join. – IsoLinearCHiP, Jul 13, 2015 at 9:34
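The comment's point can be reproduced concretely. This is a minimal sketch (assuming bash for process substitution): the printf lines simply recreate the question's tab-separated sample data, and join compares the whole of field 3, so the comma-joined row is silently dropped.

```shell
# Recreate the question's sample data; fields separated by real tabs.
printf 'Locus_1\nLocus_2\nLocus_3\n' > File1
printf '3\t3\tLocus_1\tLocus_40\tetc_849\n' >  File2
printf '3\t2\tLocus_2\tLocus_94\t*\n'       >> File2
printf '2\t2\tLocus_6\tLocus_1\t*\n'        >> File2
printf '2\t3\tLocus_3,Locus_4\tLocus_50\t*\n' >> File2
printf '3\t3\tLocus_9\tLocus_3\tetc_667\n'  >> File2

# join tests the *entire* field 3 for equality, so "Locus_3,Locus_4"
# never equals "Locus_3": only the Locus_1 and Locus_2 rows come out.
join -1 1 -2 3 <(sort File1) <(sort -k3 File2)
```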
Adapting a solution from https://stackoverflow.com/a/9937241/1673337 you can use (g)awk to get:
awk 'NR==FNR{a[0ドル]=1;next} {for(i in a){if(3ドル~i){print;break}}}' File1 File2
which provides the given Output.
While you could craft a regex to feed into grep that only matches on the third column, I feel using awk at this point is more understandable.
The if(3ドル~i){print;break}
part takes care of printing the line only if the third column matches a line from File1 (the File1 lines are stored in the array a). See the linked post for an explanation of the rest.
Be aware that this reads the entire contents of File1 into memory; however, this should only be a concern if File1 is large, in which case you would want to optimize anyway because of the multiplicative nature of the comparison.
Why are you not doing: awk 'NR==FNR{a[0ドル]=1;next}; 3ドル in a' File1 File2 ? – joepd, Jul 13, 2015 at 9:10
Because that would not find the line with 'Locus_3,Locus_4'. I had tried that optimization first as well, but then noticed the difference in the case where the search terms were joined with the ','. – IsoLinearCHiP, Jul 13, 2015 at 9:22
The grep -F option searches for literal strings anywhere in the current line. By definition, literal means that you cannot use regular expressions to narrow your search down to just field 3 (TAB delimited).
You can, however, use grep -f to read your pattern file file1 - but you do need to modify it into a list of regular expressions first. Here is one way, using bash process substitution and sed to generate a list of standard regular expressions which grep -f can handle.
Using grep with Basic Regular Expressions:
grep -f <(sed 's/.*/^\\([^\t]\\+\t\\)\\{2\\}\\([^\t]\\+,\\)*&[,\t]/' file1) file2
For grep's Basic regex, file1 is dynamically converted to the following patterns (the literal tab characters render as spaces here):
^\([^ ]\+ \)\{2\}\([^ ]\+,\)*Locus_1[, ]
^\([^ ]\+ \)\{2\}\([^ ]\+,\)*Locus_2[, ]
^\([^ ]\+ \)\{2\}\([^ ]\+,\)*Locus_3[, ]
OR: Using grep -E with Extended Regular Expressions visually simplifies the code by avoiding the need for most backslashes in both grep and sed:
grep -Ef <(sed 's/.*/^([^\t]+\t){2}([^\t]+,)*&[,\t]/' file1) file2
For grep's Extended regex, file1 is dynamically converted to:
^([^ ]+ ){2}([^ ]+,)*Locus_1[, ]
^([^ ]+ ){2}([^ ]+,)*Locus_2[, ]
^([^ ]+ ){2}([^ ]+,)*Locus_3[, ]
The output (in both cases):
3 3 Locus_1 Locus_40 etc_849
3 2 Locus_2 Locus_94 *
2 3 Locus_3,Locus_4 Locus_50 *
Note that -f and -F can slow things down dramatically when file1 is large.
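As a self-contained sanity check of the extended-regex variant (assuming GNU sed, which understands \t in the replacement, and bash for process substitution):

```shell
# Recreate the tab-separated sample data from the question.
printf 'Locus_1\nLocus_2\nLocus_3\n' > File1
printf '3\t3\tLocus_1\tLocus_40\tetc_849\n3\t2\tLocus_2\tLocus_94\t*\n2\t2\tLocus_6\tLocus_1\t*\n2\t3\tLocus_3,Locus_4\tLocus_50\t*\n3\t3\tLocus_9\tLocus_3\tetc_667\n' > File2

# Each File1 line becomes an anchored pattern that skips exactly two
# tab-terminated fields before looking for the locus in field 3.
grep -Ef <(sed 's/.*/^([^\t]+\t){2}([^\t]+,)*&[,\t]/' File1) File2
```

On this data it prints exactly the three wanted rows: Locus_1 in field 4 of the third row and Locus_3 in field 4 of the last row do not match, because the pattern is anchored to field 3.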
It looks very simple in fact! – Archemar, Jul 8, 2015 at 14:15
@don_crissti: No, I meant it as shown. I can understand that -F would be faster than -f, because it is a case of literal vs. regex matching. My meaning was that they both tend to bog down with large pattern files vs., for example, awk using an array lookup. (I probably should have added that bit; I was rushing it towards the end :) – Peter.O, Jul 9, 2015 at 8:44
@mikeserv: Again, my rushing-it mode at work... I copied and pasted a commented-out test line. The . should be \t – Peter.O, Jul 9, 2015 at 8:50
@mikeserv: A notable point, and it is good you have mentioned it. As to why I excluded empty fields: basically there is no mention of empty fields in the question, and catering for possibly invalid data seemed unnecessary for showing the concept of how to use grep in the suggested manner. And where to stop? Does one also allow for empty/missing comma-separated components in field 3? – Peter.O, Jul 9, 2015 at 9:20
Well... you stop at the tab. It's tab delimited. So just get two of those and stop. But I think it's a lot to assume that the third field can only be populated if the first two are. About the commas - no, I don't think so. I think it's a little silly. It was a fun problem though. – mikeserv, Jul 9, 2015 at 9:22
A grep -P solution:
regexp=$( echo -n '('; < File1 tr '\n' '|' | sed 's/|$//'; echo ')' )
grep -P "^[^\s]+\s+[^\s]+\s+([^\s]*,)*$regexp" File2
Output:
3 3 Locus_1 Locus_40 etc_849
3 2 Locus_2 Locus_94 *
2 3 Locus_3,Locus_4 Locus_50 *
If your File1 may contain special regexp characters, you'll need to escape them:
regexp_escape() { ruby -pe '$_ = Regexp.escape($_.chomp("\n")) + "\n"'; }
regexp=$( echo -n '('; < File1 regexp_escape | tr '\n' '|' | sed 's/|$//'; echo ')' )
grep -P "^[^\s]+\s+[^\s]+\s+([^\s]*,)*$regexp" File2
Explanation:
The second line creates a string such as:
(Locus_1|Locus_2|Locus_3)
and
"^[^\s]+\s+[^\s]+\s+([^\s]*,)*"
means:
[word] [whitespace(s)] [word] [whitespace(s)] [(optional word followed by comma) zero or arbitrarily many times]
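The alternation-building step can be checked in isolation (assuming bash and GNU sed, which does not append a newline to an incomplete last line):

```shell
# Build the (Locus_1|Locus_2|Locus_3) alternation from File1:
# tr turns each newline into '|', sed drops the trailing '|'.
printf 'Locus_1\nLocus_2\nLocus_3\n' > File1
regexp=$( echo -n '('; < File1 tr '\n' '|' | sed 's/|$//'; echo ')' )
echo "$regexp"
```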
( t=$(printf \\t) ntt=[^$t]*$t ntc=[^$t,]*
### ^just makes it easy regardless of your sed version.
sed -ne"s/..*/^($ntt){2}($ntc,)*&(,$ntc)*$t/p" |
grep -Ef- ./File2
) <File1
3 3 Locus_1 Locus_40 etc_849
3 2 Locus_2 Locus_94 *
2 3 Locus_3,Locus_4 Locus_50 *
That will get a match for a line in File1 in the third column of File2 regardless of how many ($ntc,)*
groups precede or (,$ntc)*
groups follow it. It does depend, though, on there being no metacharacters in the search strings in File1. If there might be metachars in File1, then we have to clean it up, first:
( t=$(printf \\t) ntt=[^$t]*$t ntc=[^$t,]*
sed -ne's/[]?{(^$|*.+)}\[]/\\&/g' \
-e"s/..*/^($ntt){2}($ntc,)*&(,$ntc)*$t/p" |
grep -Ef- ./File2
) <File1
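The escaping pass can be exercised on its own; for a hypothetical search string containing metacharacters (the input line here is an invented example, not from the question's data):

```shell
# Each ERE metacharacter in the set ]?{(^$|*.+)}\[ gets a
# protecting backslash prepended, so "Locus.1+" becomes "Locus\.1\+".
printf 'Locus.1+\n' | sed -e 's/[]?{(^$|*.+)}\[]/\\&/g'
```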
To "grep" columns, awk is the tool of choice:
BEGIN { f="Locus_2" }
3ドル==f { print 0ドル; }
so you can loop through File1:
for x in `cat File1`
do awk -v X="$x" '3ドル~X { print 0ドル }' <File2
done
This will only search Locus_2, not Locus_1 and Locus_3. – Archemar, Jul 8, 2015 at 10:56
I didn't mean to solve it completely. This is a question-and-answer platform, not a "do my work" platform. – ikrabbe, Jul 8, 2015 at 11:08
awk 'FNR == NR { s[1ドル]=1 ; next ; } { if ( 3ドル in s ) print ; }' File1 File2