I have a file, combined.txt, that looks like this:
GO_GLUTAMINE_FAMILY_AMINO_ACID_METABOLIC_PROCESS
REACTOME_APC_CDC20_MEDIATED_DEGRADATION_OF_NEK2A
LEE_METASTASIS_AND_RNA_PROCESSING_UP
RB_DN.V1_UP
REACTOME_ABORTIVE_ELONGATION_OF_HIV1_TRANSCRIPT_IN_THE_ABSENCE_OF_TAT
...
In my current directory I have multiple .xls files named after the lines in combined.txt, for example: GO_GLUTAMINE_FAMILY_AMINO_ACID_METABOLIC_PROCESS.xls
From those .xls files I want to retrieve every value in the column named GENE_TITLE for the rows that have "Yes" in the column named "METRIC SCORE".
Those files look like this:
NAME PROBE GENE SYMBOL GENE_TITLE RANK IN GENE LIST RANK METRIC SCORE RUNNING ES CORE ENRICHMENT
row_0 MKI67 null null 51 3.389514923095703 0.06758767 Yes
row_1 CDCA8 null null 96 2.8250465393066406 0.123790346 Yes
row_2 NUSAP1 null null 118 2.7029471397399902 0.17939204 Yes
row_3 H2AFX null null 191 2.3259851932525635 0.22256653 Yes
row_4 DLGAP5 null null 193 2.324765920639038 0.2718671 Yes
row_5 SMC2 null null 229 2.2023487091064453 0.31562105 No
row_6 CKS1B null null 279 2.0804455280303955 0.3555722 No
row_7 UBE2C null null 403 1.816525936126709 0.38350475 No
In the output file, each line would contain just:
GO_GLUTAMINE_FAMILY_AMINO_ACID_METABOLIC_PROCESS 51 96 118 191 193
<name of the particular line in combined.txt> <list of all entries in GENE_TITLE which have METRIC SCORE=Yes>
What I tried so far is:
grep -iw -f combined.txt *.xls > out1
I also tried this, but here I am neither using the information from combined.txt nor selecting the values labeled "Yes"; it just extracts the 5th column from all files:
awk '{ a[FNR] = (a[FNR] ? a[FNR] FS : "") $5 } END { for(i=1;i<=FNR;i++) print a[i] }' $(ls -1v *.xls) > out2
This is maybe a little bit closer, but still not there:
awk 'BEGIN {ORS=" "} BEGINFILE{print FILENAME} {print $5 " " $8} ENDFILE{ printf("\n")}' *.xls > out3
I am getting something like:
GENE_TITLE GENE 1 Yes 4 Yes 11 Yes 23 Yes 49 Yes 76 Yes 85 Yes 118 No 161 No....
GENE_TITLE GENE 0 Yes 16 No 28 Yes 51 Yes 63 No 96 Yes 182 Yes 191 Yes
...
So in my desired output, "GENE_TITLE GENE" would be replaced by the name of the file the values came from (without the .xls suffix), followed by the values: 0 Yes 16 No 28 Yes 51 Yes 63 No 96... but without the entries that have "No".
UPDATE
I did get the file I needed, but I wrote the ugliest code possible (see below). If someone has something a little more elegant, please do share.
This is how I got it:
awk '{print FILENAME " "$5 " "$8}' *.xls | awk '!/^ranked/' | awk '!/^gsea/'| awk '!/^gene/' | awk '$3!="No" {print $1 " " $2}' | awk '$2!="GENE_TITLE" {print}' |awk -v ncr=4 '{$1=substr($1,0,length($1)-ncr)}1' | awk -F' ' -v OFS=' ' '{x=$1;$1="";a[x]=a[x]$0}END{for(x in a)print x,a[x]}'>out3
grep -iw -f combined.txt out3 > ENTR_combined_SET.txt
2 Answers
xargs -I {} awk '$8 == "Yes" { title = title OFS $5 } END { print substr(FILENAME,1,length(FILENAME)-4), title }' {}.xls <combined.txt
This uses xargs to execute an awk program for each name listed in your combined.txt file.
The awk program is given whatever name is read from the combined.txt file, with .xls added onto the end of the name, as its input file.
The awk program collects the data from the 5th column for each row whose 8th column is Yes. This string is then printed together with the filename with its last four characters (the file name suffix) chopped off.
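As a rough sketch of how the command above could be packaged for reuse, the function below wraps the same xargs and awk pipeline. The name collect_yes is hypothetical, and the sketch assumes whitespace-separated fields with "Yes" as the 8th field, as in the sample rows from the question. One small difference from the command above: because title already begins with OFS, printing FILENAME and title with a comma between them yields two spaces after the name, so this variant concatenates them instead. The header row needs no special handling in the sample shown, since its 8th whitespace-separated field is GENE rather than Yes.

```shell
# collect_yes LIST
# For every name listed in LIST, run awk on NAME.xls and print the name
# followed by the 5th field of each row whose 8th field is "Yes".
# (collect_yes is a hypothetical helper name, not part of the answer.)
collect_yes() {
    xargs -I {} awk '
        # Accumulate matching 5th-column values, each preceded by OFS.
        $8 == "Yes" { title = title OFS $5 }
        # Strip the 4-character ".xls" suffix; title supplies the separator.
        END { print substr(FILENAME, 1, length(FILENAME) - 4) title }
    ' {}.xls < "$1"
}
```

It would be called as collect_yes combined.txt > out from the directory that holds the .xls files. A name without a matching .xls file makes awk report an error for that name.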
- Hi, how would I change this command so that it prints the file name as it is and the 2nd column, which is called "PROBE", instead of the "GENE_TITLE" that I have now? – anikaM, Apr 23, 2019 at 18:19
- @anikaM You would change $5 to $2 and use just FILENAME instead of substr(...). – Apr 23, 2019 at 18:36
- Is it like this: xargs -I {} awk '$8 == "Yes" { title = title OFS $2 } END { print FILENAME, title }' {}.xls < combined.txt – anikaM, Apr 23, 2019 at 19:01
- @anikaM I believe so, yes. – Apr 23, 2019 at 19:06
Bash script:
#!/bin/bash
# read combined.txt line by line
while read -r line; do
    # skip missing file ${line}.xls
    [ ! -f "$line".xls ] && continue
    # echo line and one space character (without newline)
    echo -n "$line " >> out
    # get 5th column if line ends with "Yes" and optional whitespace at end of line
    # replace newline '\n' with space ' '
    sed -nE 's/^\S+\s+\S+\s+\S+\s+\S+\s+(\S+).*\sYes\s*$/\1/p' "$line".xls | tr '\n' ' ' >> out
    # add newline
    echo >> out
done < combined.txt
In one line:
while read -r line; do [ ! -f "$line".xls ] && continue; echo -n "$line " >> out; sed -nE 's/^\S+\s+\S+\s+\S+\s+\S+\s+(\S+).*\sYes\s*$/\1/p' "$line".xls | tr '\n' ' ' >> out; echo >> out; done < combined.txt
Note that each line in out will have one additional space character at the end of the line.
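If the trailing space matters, one possible variation is to let awk print a separator before each value rather than after it. The sketch below is an alternative to the sed and tr pipeline, not the script above: the function name list_yes_fields is hypothetical, it writes to standard output instead of appending to out, and it assumes "Yes" is the last whitespace-separated field of a matching row.

```shell
# list_yes_fields LIST
# For each name in LIST that has a matching NAME.xls file, print the name
# followed by the 5th field of every row whose last field is "Yes".
# (list_yes_fields is a hypothetical helper name.)
list_yes_fields() {
    while IFS= read -r line; do
        # Skip names without a corresponding .xls file.
        [ -f "$line.xls" ] || continue
        # Print the name without a newline or trailing space.
        printf '%s' "$line"
        # Prefix each value with a space, then end the line once at the end.
        awk '$NF == "Yes" { printf " %s", $5 } END { print "" }' "$line.xls"
    done < "$1"
}
```

Called as list_yes_fields combined.txt > out, this overwrites out on each run, whereas the script above appends to it.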
python
(or a similar language). It will make your code more readable and easier to maintain.