Search for pattern and append line to another file

Question 1

I have a file like this (five tab-separated columns)

head allKO.txt
Metabolism Carbohydrate metabolism Glycolisis K07448
Metabolism Protein metabolism protesome K02217

and I want to search for the pattern (string) in column 5 in the file KEGG.annotations, and, if it is found, I want to print in another file both the line from KEGG.annotations where the pattern was found and all the columns of allKO.txt. The file where I'm looking for the pattern is:

head KEGG.annotations
>aai:AARI_24510 proP; proline/betaine transporter; K03762 MFS transporter, MHS family, proline/betaine transporter
>aai:AARI_26600 ferritin-like protein; K02217 ferritin [EC:1.16.3.1]
>aai:AARI_28260 hypothetical protein
>aai:AARI_29060 ABC drug resistance transporter, inner membrane subunit; K09686 antibiotic transport system permease protein
>aai:AARI_29070 ABC drug resistance transporter, ATP-binding subunit (EC:3.6.3.-); K09687 antibiotic transport system ATP-binding protein
>aai:AARI_29650 hypothetical protein
>aai:AARI_32480 iron-siderophore ABC transporter ATP-binding subunit (EC:3.6.3.-); K02013 iron complex transport system ATP-binding protein [EC:3.6.3.34]
>aai:AARI_33320 mrr; restriction system protein Mrr; K07448 restriction system protein

I want something like this:

Metabolism Carbohydrate metabolism Glycolisis K07448 >aai:AARI_33320 mrr; restriction system protein Mrr; K07448 restriction system
Metabolism Protein metabolism proteasome K02217 >aai:AARI_26600 ferritin-like protein; K02217 ferritin [EC:1.16.3.1]

Note that the >aai:AARI_33320 mrr; restriction ... text appended to the first line is the eighth line of KEGG.annotations, which is the one that contains K07448 (the ID field, i.e. the fifth field, from the first line of allKO.txt).

How can I modify this code in order to use my pattern file? This works with a pattern file with only one column containing the specific pattern to find.

while read pat; do
 grep "$pat" --label="$pat" -H < KEGG.annotations;
done < allKO.txt > test1

Question 2

Please show the difference between allKO.txt and the output you want. At present it is very difficult to work out.

Question 3

I can't understand what you want. Your output looks identical to allKO.txt. The third column of allKO.txt is the word metabolism, and that pattern doesn't appear anywhere in KEGG.annotations.

Question 4

@Barmar the file allKO.txt is a sort of database. In the file KEGG.annotations, I have the annotations of my genes (within the string I have the ID starting with K0). The same IDs are in the third column of allKO.txt. The first lines of the output I want are just an example; not all the KOs in allKO.txt will be present in KEGG.annotations.

Question 5

I want to search for the ID in column 3 of allKO.txt in KEGG.annotations and, if it is found, I want to add the other columns of allKO.txt.

Question 6

The column boundaries in allKO.txt are not clear. To me the first line looks like this: Column 1 = Metabolism, Column 2 = Carbohydrate, Column 3 = metabolism, Column 4 = Glycolisis, Column 5 = K07448. So if you want to search for the pattern in column 3, you want to search for metabolism. If that's not what you mean, please clarify the question.

Question 8

This seems to do what you seem to be asking for:

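while read -r w1 w2 w3 w4 ID
do
    # Print the allKO.txt fields followed by a space, without a newline.
    printf "%s " "$w1 $w2 $w3 $w4 $ID"
    # grep prints any matching lines; if there are none, end the line ourselves.
    if ! grep "$ID" KEGG.annotations
    then
        echo
    fi
done < allKO.txt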

This will write output to the screen. Add an output (>) redirection (e.g., > test1) to the last line to capture the output in a file.

Based on your examples, the key/ID field ("pattern") is the fifth of five fields in the allKO.txt file, so we read w1 w2 w3 w4 ID. You say this is a tab-delimited file; I’m assuming that none of the fields contain spaces.
Write (printf) the line (i.e., the fields) from allKO.txt, with a space at the end but no terminating newline.
Search (grep) the KEGG.annotations file for the ID (fifth field from the line from allKO.txt). These will be complete lines (including newlines).
If the grep fails, write a newline, since the printf didn’t.

With this approach, lines from allKO.txt whose ID isn’t present in KEGG.annotations are simply written to the output unchanged:

Metabolism Protein metabolism proteasome K02217 >aai:AARI_26600 ferritin-like protein; K02217 ferritin [EC:1.16.3.1]
This ID doesn’t exist: K99999

and IDs that exist more than once are written as additional lines (not repeating the data from allKO.txt):

Metabolism Protein metabolism proteasome K02217 >aai:AARI_26600 ferritin-like protein; K02217 ferritin [EC:1.16.3.1]
This is a hypothetical additional line from KEGG.annotations that mentions "K02217".
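
The answer above assumes that no field contains spaces. If some fields did contain spaces, one way to drop that assumption would be to split on tabs only. A minimal sketch, assuming bash (for the $'\t' quoting) and a genuinely tab-delimited input; the file names sample.tsv and fields.out and the four-column sample line are made up for illustration:

```shell
#!/bin/bash
# Hypothetical tab-delimited sample: the second field contains a space.
printf 'Metabolism\tCarbohydrate metabolism\tGlycolisis\tK07448\n' > sample.tsv

# IFS is set to a tab for this read only, so spaces inside a field are preserved.
while IFS=$'\t' read -r c1 c2 c3 ID; do
    printf 'ID=%s second=%s\n' "$ID" "$c2"
done < sample.tsv > fields.out

cat fields.out
```

One caveat: a tab counts as IFS whitespace, so consecutive tabs are treated as a single delimiter and empty fields would be silently dropped.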

Question 9

Hi Scott, thanks for the suggestion, but it's not exactly what I need. The file allKO.txt has 1 million rows and many of the IDs are not present in my KEGG.annotations, so I don't want to add so many lines that I don't need. Also, many IDs in allKO.txt are present several times in KEGG.annotations, and I need them to be printed with all the other information each time.
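
A minimal sketch of how the loop could be adapted to that requirement, assuming five whitespace-separated fields with no embedded spaces: a line from allKO.txt is printed only when its ID is found, and its columns are repeated in front of every matching annotation. The sample files are created inline purely for illustration, and the second K02217 annotation line is hypothetical.

```shell
#!/bin/bash
# Sample data for illustration only; the last annotation line is hypothetical.
cat > allKO.txt <<'EOF'
Metabolism Carbohydrate metabolism Glycolisis K07448
Metabolism Protein metabolism protesome K02217
This ID doesn't exist: K99999
EOF
cat > KEGG.annotations <<'EOF'
>aai:AARI_26600 ferritin-like protein; K02217 ferritin [EC:1.16.3.1]
>aai:AARI_28260 hypothetical protein
>aai:AARI_33320 mrr; restriction system protein Mrr; K07448 restriction system protein
>xxx:HYPOTHETICAL_1 second ferritin; K02217 ferritin [EC:1.16.3.1]
EOF

while read -r w1 w2 w3 w4 ID; do
    # Skip blank or short lines: an empty pattern would match every annotation.
    [ -n "$ID" ] || continue
    # -F treats the ID as a fixed string rather than a regular expression.
    grep -F -e "$ID" KEGG.annotations |
    while IFS= read -r hit; do
        # Repeat the allKO.txt columns in front of each matching annotation.
        printf '%s %s %s %s %s %s\n' "$w1" "$w2" "$w3" "$w4" "$ID" "$hit"
    done
done < allKO.txt > test1
```

This still starts one grep per line of allKO.txt, so it may be slow with a million rows.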

score 0 · Accepted Answer · 2014-11-18 04:50:09Z

You could work with the code you already have. Store the line into an array and match for the fifth element:

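while read -r line; do
    # Skip blank lines.
    [ -z "$line" ] && continue
    # Split the line on whitespace; the ID is the fifth word (index 4).
    patlist=($line)
    pat=${patlist[4]}
    # --label replaces the "(standard input)" name that -H prints before each match.
    grep -F -H --label="$line" -e "$pat" < KEGG.annotations
done < allKO.txt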

returns:

Metabolism Carbohydrate metabolism Glycolisis K07448:>aai:AARI_33320 mrr; restriction system protein Mrr; K07448 restriction system protein
Metabolism Protein metabolism protesome K02217:>aai:AARI_26600 ferritin-like protein; K02217 ferritin [EC:1.16.3.1]
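
Because the loop starts one grep per line of allKO.txt, it may be slow on a very large ID list. A possible single-pass alternative is sketched below with awk; it is not part of the answer above. It assumes that the ID is the last whitespace-separated field of each allKO.txt line, that each ID occurs only once in allKO.txt, and that IDs appear in KEGG.annotations as standalone words (unlike grep, it does not match substrings). The sample files are created inline for illustration only.

```shell
#!/bin/bash
# Sample data for illustration only.
cat > allKO.txt <<'EOF'
Metabolism Carbohydrate metabolism Glycolisis K07448
Metabolism Protein metabolism protesome K02217
EOF
cat > KEGG.annotations <<'EOF'
>aai:AARI_26600 ferritin-like protein; K02217 ferritin [EC:1.16.3.1]
>aai:AARI_28260 hypothetical protein
>aai:AARI_33320 mrr; restriction system protein Mrr; K07448 restriction system protein
EOF

awk '
    # First file (allKO.txt): remember each non-empty line under its ID, the last field.
    NR == FNR { if (NF) ko[$NF] = $0; next }
    # Second file (KEGG.annotations): look up every word in the ID table.
    {
        for (i = 1; i <= NF; i++)
            if ($i in ko)
                print ko[$i], $0
    }
' allKO.txt KEGG.annotations > test1
```

The output follows the order of KEGG.annotations rather than that of allKO.txt, and the two parts are joined by a space instead of a colon.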
