I have a file like this (five tab-separated columns)
head allKO.txt
Metabolism Carbohydrate metabolism Glycolisis K07448
Metabolism Protein metabolism protesome K02217
and I want to search for the pattern (string) in column 5 in the file KEGG.annotations, and, if it is found, I want to print in another file both the line from KEGG.annotations where the pattern was found and all the columns of allKO.txt.
The file where I'm looking for the pattern is:
head KEGG.annotations
>aai:AARI_24510 proP; proline/betaine transporter; K03762 MFS transporter, MHS family, proline/betaine transporter
>aai:AARI_26600 ferritin-like protein; K02217 ferritin [EC:1.16.3.1]
>aai:AARI_28260 hypothetical protein
>aai:AARI_29060 ABC drug resistance transporter, inner membrane subunit; K09686 antibiotic transport system permease protein
>aai:AARI_29070 ABC drug resistance transporter, ATP-binding subunit (EC:3.6.3.-); K09687 antibiotic transport system ATP-binding protein
>aai:AARI_29650 hypothetical protein
>aai:AARI_32480 iron-siderophore ABC transporter ATP-binding subunit (EC:3.6.3.-); K02013 iron complex transport system ATP-binding protein [EC:3.6.3.34]
>aai:AARI_33320 mrr; restriction system protein Mrr; K07448 restriction system protein
I want something like this:
Metabolism Carbohydrate metabolism Glycolisis K07448 >aai:AARI_33320 mrr; restriction system protein Mrr; K07448 restriction system
Metabolism Protein metabolism proteasome K02217 >aai:AARI_26600 ferritin-like protein; K02217 ferritin [EC:1.16.3.1]
Note that the >aai:AARI_33320 mrr; restriction ... text that is appended to the first line is the eighth line from KEGG.annotations, which is the one that contains K07448 (the ID field, i.e. the fifth field, from the first line of allKO.txt).
How can I modify this code in order to use my pattern file? This works with a pattern file with only one column containing the specific pattern to find.
while read pat; do
grep "$pat" --label="$pat" -H < KEGG.annotations;
done < allKO.txt > test1
2 Answers
You could work with the code you already have. Store the line in an array and match on its fifth element:
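while read -r line; do
    # Skip empty lines.
    [ -z "$line" ] && continue
    # Split the line on whitespace; the ID is the fifth element (index 4).
    read -r -a patlist <<< "$line"
    pat=${patlist[4]}
    # With -H, grep prefixes each match with the label (the whole allKO.txt line).
    grep "$pat" --label="$line" -H < KEGG.annotations
done < allKO.txt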
returns:
Metabolism Carbohydrate metabolism Glycolisis K07448:>aai:AARI_33320 mrr; restriction system protein Mrr; K07448 restriction system protein
Metabolism Protein metabolism protesome K02217:>aai:AARI_26600 ferritin-like protein; K02217 ferritin [EC:1.16.3.1]
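The output above has a colon between the allKO.txt line and the match, because that is how grep -H prints its label. If a space is preferred, as in the desired output in the question, one possible variation is to print the prefix yourself. The following is only a sketch: the function name join_by_grep is made up for illustration, it assumes the ID is the last whitespace-separated field of each allKO.txt line, and it does a plain substring search (grep -F), just like the loop above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: join_by_grep KO_FILE ANNOTATION_FILE
# For every line of KO_FILE, print "<KO line> <annotation line>" once per
# annotation line that contains the KO line's last field (assumed to be the ID).
join_by_grep() {
    local line id hit
    while IFS= read -r line; do
        # Strip everything up to the last whitespace character: what remains is the ID.
        id=${line##*[[:space:]]}
        # Skip empty lines and lines ending in whitespace (no usable ID).
        [ -z "$id" ] && continue
        # -F: treat the ID as a fixed string, not a regular expression.
        grep -F -- "$id" "$2" | while IFS= read -r hit; do
            printf '%s %s\n' "$line" "$hit"
        done
    done < "$1"
}
```

It could be called as join_by_grep allKO.txt KEGG.annotations > test1. Lines of allKO.txt whose ID is not found produce no output, and an ID found several times produces one output line per match.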
This seems to do what you are asking for:
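while read -r w1 w2 w3 w4 ID
do
    # Print the allKO.txt fields followed by a space, without a newline.
    printf "%s " "$w1 $w2 $w3 $w4 $ID"
    # grep prints every matching line (each ending in a newline);
    # if nothing matches, terminate the line ourselves.
    if ! grep "$ID" KEGG.annotations
    then
        echo
    fi
done < allKO.txt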
This will write output to the screen. Add an output (>) redirection (e.g., > test1) to the last line to capture the output in a file.
- Based on your examples, the key/ID field ("pattern") is the fifth of five fields in the allKO.txt file, so we read w1 w2 w3 w4 ID. You say this is a tab-delimited file; I’m assuming that none of the fields contain spaces.
- Write (printf) the line (i.e., the fields) from allKO.txt, with a space at the end but no terminating newline.
- Search (grep) the KEGG.annotations file for the ID (fifth field from the line from allKO.txt). Any matches are printed as complete lines (including newlines).
- If the grep fails, write a newline, since the printf didn’t.

As a result, lines whose ID isn’t present in KEGG.annotations are simply written to the output:

Metabolism Protein metabolism proteasome K02217 >aai:AARI_26600 ferritin-like protein; K02217 ferritin [EC:1.16.3.1]
This ID doesn’t exist: K99999

and IDs that exist more than once are written as additional lines (not repeating the data from allKO.txt):

Metabolism Protein metabolism proteasome K02217 >aai:AARI_26600 ferritin-like protein; K02217 ferritin [EC:1.16.3.1]
This is a hypothetical additional line from KEGG.annotations that mentions "K02217".
Hi Scott, thanks for the suggestion, but it's not exactly what I need. The file allKO has 1 million rows and many of the IDs are not present in my KEGG.annotations, so I don't want to add so many lines that I don't need. Also, many IDs in allKO are present several times in KEGG.annotations, and I need them to be printed with all the other information more than once. – Francesca de Filippis, Nov 18, 2014 at 1:54
allKO.txt. The third column of allKO.txt is the word metabolism; that pattern doesn't appear anywhere in KEGG.annotations.
Column 1 = Metabolism, Column 2 = Carbohydrate, Column 3 = metabolism, Column 4 = Glycolisis, Column 5 = K07448. So if you want to search for the pattern in column 3, you want to search for metabolism. If that's not what you mean, please clarify the question.