2

I have one file: combined.txt like this:

GO_GLUTAMINE_FAMILY_AMINO_ACID_METABOLIC_PROCESS
REACTOME_APC_CDC20_MEDIATED_DEGRADATION_OF_NEK2A
LEE_METASTASIS_AND_RNA_PROCESSING_UP
RB_DN.V1_UP
REACTOME_ABORTIVE_ELONGATION_OF_HIV1_TRANSCRIPT_IN_THE_ABSENCE_OF_TAT
...

and in my current directory I have multiple .xls files which are named like lines in combined.txt, for example: GO_GLUTAMINE_FAMILY_AMINO_ACID_METABOLIC_PROCESS.xls

In those .xls files I want to retrieve everything in column named: GENE_TITLE for which I have "Yes" in column named: "METRIC SCORE"

those files look like:

 NAME PROBE GENE SYMBOL GENE_TITLE RANK IN GENE LIST RANK METRIC SCORE RUNNING ES CORE ENRICHMENT
row_0 MKI67 null null 51 3.389514923095703 0.06758767 Yes
row_1 CDCA8 null null 96 2.8250465393066406 0.123790346 Yes
row_2 NUSAP1 null null 118 2.7029471397399902 0.17939204 Yes
row_3 H2AFX null null 191 2.3259851932525635 0.22256653 Yes
row_4 DLGAP5 null null 193 2.324765920639038 0.2718671 Yes
row_5 SMC2 null null 229 2.2023487091064453 0.31562105 No
row_6 CKS1B null null 279 2.0804455280303955 0.3555722 No
row_7 UBE2C null null 403 1.816525936126709 0.38350475 No

And in the output file I would have just in every line:

 GO_GLUTAMINE_FAMILY_AMINO_ACID_METABOLIC_PROCESS 51 96 118 191 193
<name of the particular line in combined.txt> <list of all entries in GENE_TITLE which have METRIC SCORE=Yes>

What I tried so far is:

grep -iw -f combined.txt *.xls > out1

I also tried this but here I am not using information from combined.txt neither getting values labeled with "Yes" just extracting 5th column from all files

awk '{ a[FNR] = (a[FNR] ? a[FNR] FS : "") 5ドル } END { for(i=1;i<=FNR;i++) print a[i] }' $(ls -1v *.xls) > out2

this is maybe a little bit closer but still not there:

awk 'BEGIN {ORS=" "} BEGINFILE{print FILENAME} {print 5ドル " " 8ドル} ENDFILE{ printf("\n")}' *.xls > out3

I am getting something like:

GENE_TITLE GENE 1 Yes 4 Yes 11 Yes 23 Yes 49 Yes 76 Yes 85 Yes 118 No 161 No....
GENE_TITLE GENE 0 Yes 16 No 28 Yes 51 Yes 63 No 96 Yes 182 Yes 191 Yes
... 

so my desired output would have instead of "GENE_TITLE GENE" the name of the file from where it did grab those values (without .xls suffix) : 0 Yes 16 No 28 Yes 51 Yes 63 No 96...not including the one which have "No"

UPDATE

I did get the file I needed but I wrote the ugliest code possible (see bellow). If someone has something a little bit more elegant please do share.

This is how I got it:

awk '{print FILENAME " "5ドル " "8ドル}' *.xls | awk '!/^ranked/' | awk '!/^gsea/'| awk '!/^gene/' | awk '3ドル!="No" {print 1ドル " " 2ドル}' | awk '2ドル!="GENE_TITLE" {print}' |awk -v ncr=4 '{1ドル=substr(1,0,ドルlength(1ドル)-ncr)}1' | awk -F' ' -v OFS=' ' '{x=1ドル;1ドル="";a[x]=a[x]0ドル}END{for(x in a)print x,a[x]}'>out3
grep -iw -f combined.txt out3 > ENTR_combined_SET.txt
asked Apr 15, 2019 at 22:06
1
  • To be honest, this seems complicated enough that you might want to switch to python (or a similar language). It will make your code more readable and easier to maintain. Commented Apr 16, 2019 at 7:37

2 Answers 2

0
xargs -I {} awk '8ドル == "Yes" { title = title OFS 5ドル } END { print substr(FILENAME,1,length(FILENAME)-4), title }' {}.xls <combined.txt

This uses xargs to execute an awk program for each name listed in your combined.txt file.

The awk program is given whatever names is read from the combined.txt file with .xls added onto the end of the name as its input file.

The awk program collects the data from the 5th column for each row whose 8th column is Yes. This string is then printed together with the filename with its last four characters (the file name suffix) chopped off.

answered Apr 22, 2019 at 21:03
4
  • Hi How would I change this command so that it prints me file name as it is and 2nd column which is called "PROBE", instead of "GENE_TITLE" that I have now? Commented Apr 23, 2019 at 18:19
  • @anikaM You would change 5ドル to 2ドル and use just FILENAME instead of substr(...). Commented Apr 23, 2019 at 18:36
  • Is it like this: xargs -I {} awk '8ドル == "Yes" { title = title OFS 2ドル } END { print FILENAME, title }' {}.xls < combined.txt Commented Apr 23, 2019 at 19:01
  • @anikaM I believe so, yes. Commented Apr 23, 2019 at 19:06
0

Bash script:

#!/bin/bash
# read combined.txt line by line
while read -r line; do
 # skip missing file ${line}.xls
 [ ! -f "$line".xls ] && continue
 # echo line and one space character (without newline)
 echo -n "$line " >> out
 # get 5th column if line ends with "Yes" and optional whitespace at end of line
 # replace newline '\n' with space ' '
 sed -nE 's/^\S+\s+\S+\s+\S+\s+\S+\s+(\S+).*\sYes\s*$/1円/p' "$line".xls | tr '\n' ' ' >> out
 # add newline
 echo >> out
done < combined.txt

in one line:

while read -r line; do [ ! -f "$line".xls ] && continue; echo -n "$line " >> out; sed -nE 's/^\S+\s+\S+\s+\S+\s+\S+\s+(\S+).*\sYes\s*$/1円/p' "$line".xls | tr '\n' ' ' >> out; echo >> out; done < combined.txt

Note that each line in out will have one additional space character at the end of the line.

answered Apr 16, 2019 at 8:28
0

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.