I have a very big file and I want to remove every column whose value is 9 on all rows.
Sample input:
1 5 8 3 5 9 5 7 6 9
2 5 7 4 2 9 7 6 3 1
5 9 7 4 1 9 5 7 9 1
I want to delete any column where the value, on all rows, is 9 (the number of columns is very large, so I cannot check them one by one: first column = 9, second column = 9, etc.). I need a dynamic script.
The output should look like this:
1 5 8 3 5 5 7 6 9
2 5 7 4 2 7 6 3 1
5 9 7 4 1 5 7 9 1
I am new to this and have tried many things without success.
How can I do it?
Thanks for your help.
- You "tried many things". Can you explain further, please? – Chris Davies, Mar 23, 2015 at 12:34
- Do you want to remove only the first occurrence of 9, as in your sample, or all of them? – Archemar, Mar 23, 2015 at 12:41
- All columns, Archemar. If a column's value is 9 on all rows, then I want to remove that column. – John, Mar 23, 2015 at 12:45
- So, individual 9s are not a problem? You only want to remove a column if it consists exclusively of 9s? – terdon, Mar 23, 2015 at 13:05
- How big is your big file? Is it so big that manipulating it in an array will use up all your free memory? – Peter.O, Mar 23, 2015 at 17:40
5 Answers
In Python:
#! /usr/bin/env python3
import sys
# Get the numbers
numbers = [[int(x) for x in line.strip().split()] for line in sys.stdin]
# Get indexes of 9 in sets for each row
index_9 = (set(x for x, num in enumerate(line) if num == 9) for line in numbers)
common_column = next(index_9).intersection(*index_9)
for line in numbers:
    print(' '.join(str(num) for x, num in enumerate(line) if x not in common_column))
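The script reads from standard input, so one way to run it against the sample file (assuming it is saved as, say, strip9cols.py; the file name is just a placeholder) is:
python3 strip9cols.py < file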
This awk method assumes that each row has the same number of fields (as shown in the example given in the question). It also assumes there are no empty fields.
cat <<EOF >file
1 5 8 3 5 9 5 7 6 9
2 5 7 4 2 9 7 6 3 1
5 9 7 4 1 9 5 7 9 1
EOF
awk '{ for (c=1; c<=NF; c++) a[NR,c] = $c }
END {
  for (c=1; c<=NF; c++) {
    vc = ""                       # values in column c, space-separated
    for (r=1; r<=NR; r++) {
      vc = vc " " a[r,c]
    }
    # if removing every 9 and every space leaves nothing, the column is all 9s
    if ( ! gensub(/[9 ]/, "", "g", vc) ) {
      for (r=1; r<=NR; r++) {
        a[r,c] = ""               # blank out the column
      }
    }
  }
  for (r=1; r<=NR; r++) {
    for (c=1; c<=NF; c++) {
      if (a[r,c] != "") printf "%s ", a[r,c]
    }
    print ""
  }
}' file
# output
1 5 8 3 5 5 7 6 9
2 5 7 4 2 7 6 3 1
5 9 7 4 1 5 7 9 1
Here's a possible approach using bash / GNU coreutils that doesn't require much storage:
cut the file column-by-column and record the indexes of any columns that do not consist entirely of 9s; if you know how many columns your file has (in this case 10) it could be as simple as
for ((i=1;i<11;i++)); do
  [[ $(cut -d' ' -f${i} file | sed '/^9$/d' | wc -l) -eq 0 ]] || a+=($i)
done
(using the fact that only columns consisting entirely of 9s have length 0 after all the 9s are deleted); then pass the list of columns to be retained to a further cut command, using a change of IFS to turn the array into a comma-separated list:
(IFS=, ; cut -d' ' -f"${a[*]}" file)
If your version of cut supports the --complement flag, you could record the columns that do contain all 9s and cut all except those:
for ((i=1;i<11;i++)); do
[[ $(cut -d' ' -f${i} file | sed '/^9$/d' | wc -l) -eq 0 ]] && a+=($i)
done
(IFS=, ; cut -d' ' --complement -f"${a[*]}" file)
- +1, I had the same in mind. You could simplify the first part: cut -d' ' -f${i} infile | grep -qv '^9$' && a+=($i) (or || a+=($i) if using cut with --complement). Now, it'd be interesting to run the four working solutions here against a big file and see which one is faster... – don_crissti, Mar 24, 2015 at 2:04
- @don_crissti thanks - I like the grep idea - much neater. With the complement version, one could maybe even make use of the -m option to quit processing as soon as it finds a non-9 entry? i.e. ... | grep -vqm1 '^9$' || a+=($i). Might speed it up a bit further. – steeldriver, Mar 24, 2015 at 3:39
- Indeed, that should speed it up (it does work the other way around too: grep -vqm1 '^9$' && a+=($i) for your first approach, but if you want to keep it portable then you'll have to do without -m1 & --complement). – don_crissti, Mar 24, 2015 at 11:54
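Putting those suggestions together, here is a possible combined sketch of the --complement variant (my own assembly of the ideas above, assuming GNU cut and grep, a 10-column file, and a guard for the case where no column is all 9s):
a=()
for ((i=1;i<11;i++)); do
  # add column i to the delete list only if it contains nothing but 9s
  cut -d' ' -f${i} file | grep -vqm1 '^9$' || a+=($i)
done
if ((${#a[@]})); then
  (IFS=, ; cut -d' ' --complement -f"${a[*]}" file)
else
  cat file
fi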
You can try this with awk:
awk '{ t[NR] = 0ドル; if (NR == 1) { for (i = 1; i <= NF; i++) if ($i == 9) { met[i] = 1 } } else { for (i = 1; i <= NF; i++) { if (met[i] != 1 || $i != 9) { met[i] = 0; } } } }
END { for (i = 1; i <= NR; i++) { n = split(t[i], a); for (z = 1; z <= NF; z++) if (met[z] != 1) { printf("%s ", a[z]); } print "" } }' file
Or, as in this answer, we can construct the parameters to cut, which is faster:
awk '{ t[NR] = 0ドル; if (NR == 1) { for (i = 1; i <= NF; i++) if ($i == 9) { met[i] = 1 } } else { for (i = 1; i <= NF; i++) { if (met[i] != 1 || $i != 9) { met[i] = 0; } } } }
END { c = 0; s = " -f"; for ( i = 1; i <= NF; i++) { if (met[i] == 1) { if (c == 0) s = s " " i; else s = s "," i; c++; } } s = s " -d\" \" "; if (c != 0) { system("cut --complement " s " " FILENAME); } else { system("cat " FILENAME) } }' file
And of course both are open for criticism.
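For the sample input from the question, column 6 is the only one that is 9 on every row, so the command that the END block builds and runs would effectively be:
cut --complement -f 6 -d" " file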
- Good! The cut version is approximately 3 times faster than the split version, and even that one is significantly faster than what I came up with (+1). BTW, you seem to have an extra print s, c; in the cut version. – Peter.O, Mar 24, 2015 at 1:41
Given the info in the question I can currently come up with:
awk '{for (i=1; i<=NF; i++){ a[i]+=$i; b[i]=b[i]" " $i}} END{for (i=1; i<=NF; i++) if (a[i]/NR!=9) {printf "%s\n", b[i]}}' same-column-values
The script iterates through the file, accumulating each column's sum in array 'a' and appending each value to the corresponding entry of array 'b'. After the file is completely read, the array of sums is iterated over, and if the sum divided by the number of records (NR) does not equal 9, the corresponding line in array 'b' is printed.
This gets me an output of
1 2 5
5 5 9
8 7 7
3 4 4
5 2 1
5 7 5
7 6 7
6 3 9
9 1 1
A drawback of this is that the output is transposed: each output line is one column of the original file, so it has to be read top-to-bottom and translated back to left-to-right.
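If you need the original row-oriented layout back, the transposed output can be turned around with another small awk pass; a minimal sketch, assuming the output above was saved to a file called transposed (the name is just an example):
awk '{ for (i=1; i<=NF; i++) row[i] = row[i] " " $i }
END { for (i=1; i<=NF; i++) print substr(row[i], 2) }' transposed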
Alternatively, you can get a list of the columns that contain only 9s with the following command:
awk '{for (i=1; i<=NF; i++){ a[i]+=$i; b[i]=b[i]" " $i}} END{for (i=1; i<=NF; i++) if (a[i]/NR==9){print i; }}' same-column-values
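That list of column numbers can then be handed to cut --complement (as in the answers above) to produce the row-oriented output directly; a rough sketch, assuming GNU cut and the same input file, and inheriting the sum-based test with the caveat raised in the comment below:
cols=$(awk '{for (i=1; i<=NF; i++) a[i]+=$i}
END{for (i=1; i<=NF; i++) if (a[i]/NR==9) print i}' same-column-values | paste -sd, -)
if [ -n "$cols" ]; then
  cut -d' ' --complement -f"$cols" same-column-values
else
  cat same-column-values
fi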
- The add-and-divide method can easily give false positives, e.g. 3 lines of a file may have a column with 8+9+10=27, which gives the same result as 3 lines containing 9+9+9=27. – Peter.O, Mar 23, 2015 at 18:47