I have two files, let's say
File1:
Locus_1
Locus_2
Locus_3
File2:
3 3 Locus_1 Locus_40 etc_849
3 2 Locus_2 Locus_94 *
2 2 Locus_6 Locus_1 *
2 3 Locus_3,Locus_4 Locus_50 *
3 3 Locus_9 Locus_3 etc_667
I want to run a grep -F with the lines of the first file as patterns, but only against the third column of the second file (in the original File2
the fields are separated by tabs), so that the output is:
Output:
3 3 Locus_1 Locus_40 etc_849
3 2 Locus_2 Locus_94 *
2 3 Locus_3,Locus_4 Locus_50 *
How can I do it?
Edit
To Chaos: no, the comma is not a mistake. I can have more than one Locus_* in a column - and in case the second Locus_* (the one after the comma) matches one of the lines of File1
I want it to be retrieved, too!
6 Answers
If grep is not necessary, one simple solution would be to use join for that:
$ join -1 1 -2 3 <(sort file1) <(sort -k3 file2)
Locus_1 3 3 Locus_40 etc_849
Locus_2 3 2 Locus_94 *
Locus_3 2 3 Locus_4 Locus_50 *
Explanation:
join -1 1 -2 3 : join the two files, using the first (and only) field of the first file and the third field of the second file; lines are printed when those fields are equal.
<(sort file1) : join needs sorted input.
<(sort -k3 file2) : the input must be sorted on the join field (the 3rd field here).
As the OP pointed out in a later edit, the ',' is not a typo, and if you run the stated
join -1 1 -2 3 <(sort File1) <(sort -k3 File2)
command with the OP's original example data, you will only get two lines. While join is generally a great tool for similar cases, in this case it is - in my opinion - probably better to go with an awk/perl/... solution rather than doing multiple preprocessing passes to get a suitable input for join. – IsoLinearCHiP, Jul 13, 2015 at 9:34
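The comment's point can be reproduced concretely. This is a minimal sketch (assuming bash for process substitution): the printf lines simply recreate the question's tab-separated sample data, and join compares the whole of field 3, so the comma-joined row is silently dropped.

```shell
# Recreate the question's sample data; fields separated by real tabs.
printf 'Locus_1\nLocus_2\nLocus_3\n' > File1
printf '3\t3\tLocus_1\tLocus_40\tetc_849\n' >  File2
printf '3\t2\tLocus_2\tLocus_94\t*\n'       >> File2
printf '2\t2\tLocus_6\tLocus_1\t*\n'        >> File2
printf '2\t3\tLocus_3,Locus_4\tLocus_50\t*\n' >> File2
printf '3\t3\tLocus_9\tLocus_3\tetc_667\n'  >> File2

# join tests the *entire* field 3 for equality, so "Locus_3,Locus_4"
# never equals "Locus_3": only the Locus_1 and Locus_2 rows come out.
join -1 1 -2 3 <(sort File1) <(sort -k3 File2)
```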
Adapting a solution from https://stackoverflow.com/a/9937241/1673337 you can use (g)awk to get:
awk 'NR==FNR{a[0ドル]=1;next} {for(i in a){if(3ドル~i){print;break}}}' File1 File2
which provides the given Output.
While you could craft a regex to feed into grep that only matches on the third column, I feel using awk at this point is more understandable.
The if(3ドル~i){print;break}
part takes care of printing the line only if the third column matches a line from File1 (the File1 lines are stored in the array a). See the linked post for an explanation of the rest.
Be aware that this reads the entire contents of File1 into memory; however, this should only be a concern if File1 is large, in which case you would want to optimize anyway because of the multiplicative nature of the comparison.
Why are you not doing: awk 'NR==FNR{a[0ドル]=1;next}; 3ドル in a' File1 File2 ? – joepd, Jul 13, 2015 at 9:10
Because that would not find the line with 'Locus_3,Locus_4'. I had tried that optimization first as well, but then noticed the difference in the case where the search terms were joined with the ','. – IsoLinearCHiP, Jul 13, 2015 at 9:22
The grep -F option searches for literal strings anywhere in the current line. By definition, literal means that you cannot use regular expressions to narrow your search down to just field 3 (TAB delimited).
You can, however, use grep -f to read your pattern file file1 - but you do need to modify it into a list of regular expressions first. Here is one way, using bash process substitution and sed to generate a list of standard regular expressions which grep -f can handle.
Using grep with Basic Regular Expressions:
grep -f <(sed 's/.*/^\\([^\t]\\+\t\\)\\{2\\}\\([^\t]\\+,\\)*&[,\t]/' file1) file2
For grep's Basic regex, file1 is dynamically converted to the following patterns (the literal tab characters render as spaces here):
^\([^ ]\+ \)\{2\}\([^ ]\+,\)*Locus_1[, ]
^\([^ ]\+ \)\{2\}\([^ ]\+,\)*Locus_2[, ]
^\([^ ]\+ \)\{2\}\([^ ]\+,\)*Locus_3[, ]
OR: Using grep -E with Extended Regular Expressions visually simplifies the code by avoiding the need for most backslashes in both grep and sed:
grep -Ef <(sed 's/.*/^([^\t]+\t){2}([^\t]+,)*&[,\t]/' file1) file2
For grep's Extended regex, file1 is dynamically converted to:
^([^ ]+ ){2}([^ ]+,)*Locus_1[, ]
^([^ ]+ ){2}([^ ]+,)*Locus_2[, ]
^([^ ]+ ){2}([^ ]+,)*Locus_3[, ]
The output (in both cases):
3 3 Locus_1 Locus_40 etc_849
3 2 Locus_2 Locus_94 *
2 3 Locus_3,Locus_4 Locus_50 *
Note that -f and -F can slow things down dramatically when file1 is large.
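As a self-contained sanity check of the extended-regex variant (assuming GNU sed, which understands \t in the replacement, and bash for process substitution):

```shell
# Recreate the tab-separated sample data from the question.
printf 'Locus_1\nLocus_2\nLocus_3\n' > File1
printf '3\t3\tLocus_1\tLocus_40\tetc_849\n3\t2\tLocus_2\tLocus_94\t*\n2\t2\tLocus_6\tLocus_1\t*\n2\t3\tLocus_3,Locus_4\tLocus_50\t*\n3\t3\tLocus_9\tLocus_3\tetc_667\n' > File2

# Each File1 line becomes an anchored pattern that skips exactly two
# tab-terminated fields before looking for the locus in field 3.
grep -Ef <(sed 's/.*/^([^\t]+\t){2}([^\t]+,)*&[,\t]/' File1) File2
```

On this data it prints exactly the three wanted rows: Locus_1 in field 4 of the third row and Locus_3 in field 4 of the last row do not match, because the pattern is anchored to field 3.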
It looks very simple in fact! – Archemar, Jul 8, 2015 at 14:15
@don_crissti: No, I meant it as shown. I can understand that -F would be faster than -f, because it is a case of literal vs. regex matching. My meaning was that they both tend to bog down with large pattern files vs., for example, awk using an array lookup. (I probably should have added that bit; I was rushing it towards the end :) – Peter.O, Jul 9, 2015 at 8:44
@mikeserv: Again, my rushing-it mode at work... I copied and pasted a commented-out test line. The . should be \t – Peter.O, Jul 9, 2015 at 8:50
@mikeserv: A notable point, and it is good you have mentioned it. As to why I excluded empty fields: basically there is no mention of empty fields in the question, and catering for possibly invalid data seemed unnecessary for showing the concept of how to use grep in the suggested manner. And where to stop? Does one also allow for empty/missing comma-separated components in field 3? – Peter.O, Jul 9, 2015 at 9:20
Well... you stop at the tab. It's tab delimited. So just get two of those and stop. But I think it's a lot to assume that the third field can only be populated if the first two are. About the commas - no, I don't think so. I think it's a little silly. It was a fun problem though. – mikeserv, Jul 9, 2015 at 9:22
A grep -P solution:
regexp=$( echo -n '('; < File1 tr '\n' '|' | sed 's/|$//'; echo ')' )
grep -P "^[^\s]+\s+[^\s]+\s+([^\s]*,)*$regexp" File2
Output:
3 3 Locus_1 Locus_40 etc_849
3 2 Locus_2 Locus_94 *
2 3 Locus_3,Locus_4 Locus_50 *
If your File1 may contain special regexp characters, you'll need to escape them:
regexp_escape() { ruby -pe '$_ = Regexp.escape($_.chomp("\n")) + "\n"'; }
regexp=$( echo -n '('; < File1 regexp_escape | tr '\n' '|' | sed 's/|$//'; echo ')' )
grep -P "^[^\s]+\s+[^\s]+\s+([^\s]*,)*$regexp" File2
Explanation:
The second line creates a string such as:
(Locus_1|Locus_2|Locus_3)
and
"^[^\s]+\s+[^\s]+\s+([^\s]*,)*"
means:
[word] [whitespace(s)] [word] [whitespace(s)] [(optional word followed by comma) zero or arbitrarily many times]
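The alternation-building step can be checked in isolation (assuming bash and GNU sed, which does not append a newline to an incomplete last line):

```shell
# Build the (Locus_1|Locus_2|Locus_3) alternation from File1:
# tr turns each newline into '|', sed drops the trailing '|'.
printf 'Locus_1\nLocus_2\nLocus_3\n' > File1
regexp=$( echo -n '('; < File1 tr '\n' '|' | sed 's/|$//'; echo ')' )
echo "$regexp"
```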
( t=$(printf \\t) ntt=[^$t]*$t ntc=[^$t,]*
### ^just makes it easy regardless of your sed version.
sed -ne"s/..*/^($ntt){2}($ntc,)*&(,$ntc)*$t/p" |
grep -Ef- ./File2
) <File1
3 3 Locus_1 Locus_40 etc_849
3 2 Locus_2 Locus_94 *
2 3 Locus_3,Locus_4 Locus_50 *
That will get a match for a line in File1 in the third column of File2 regardless of how many ($ntc,)*
groups precede or (,$ntc)*
groups follow it. It does depend, though, on there being no metacharacters in the search strings in File1. If there might be metachars in File1, then we have to clean it up, first:
( t=$(printf \\t) ntt=[^$t]*$t ntc=[^$t,]*
sed -ne's/[]?{(^$|*.+)}\[]/\\&/g' \
-e"s/..*/^($ntt){2}($ntc,)*&(,$ntc)*$t/p" |
grep -Ef- ./File2
) <File1
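The escaping pass can be exercised on its own; for a hypothetical search string containing metacharacters (the input line here is an invented example, not from the question's data):

```shell
# Each ERE metacharacter in the set ]?{(^$|*.+)}\[ gets a
# protecting backslash prepended, so "Locus.1+" becomes "Locus\.1\+".
printf 'Locus.1+\n' | sed -e 's/[]?{(^$|*.+)}\[]/\\&/g'
```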
To "grep" columns, awk is the tool of choice:
BEGIN { f="Locus_2" }
3ドル==f { print 0ドル; }
so you can loop through File1:
for x in `cat File1`
do awk -v X="$x" '3ドル~X { print 0ドル }' <File2
done
This will only search Locus_2, not Locus_1 and Locus_3. – Archemar, Jul 8, 2015 at 10:56
I didn't mean to solve it completely. This is a question-and-answer platform, not a "do my work" platform. – ikrabbe, Jul 8, 2015 at 11:08
awk 'FNR == NR { s[1ドル]=1 ; next ; } { if ( 3ドル in s ) print ; }' File1 File2