I have a very big file and I want to remove every column whose value is 9 on all rows.
Sample input:
1 5 8 3 5 9 5 7 6 9
2 5 7 4 2 9 7 6 3 1
5 9 7 4 1 9 5 7 9 1
I want to delete any column where the value, on all rows, is 9 (the number of columns is very large, so I cannot check them one by one: first column = 9, second column = 9, etc.). I need a dynamic script.
The output should look like this:
1 5 8 3 5 5 7 6 9
2 5 7 4 2 7 6 3 1
5 9 7 4 1 5 7 9 1
I am new to this and have tried many things without success.
How can I do it?
Thanks for your help.
- You "tried many things". Can you explain further, please? – Chris Davies, Mar 23, 2015 at 12:34
- Do you want to remove only the first occurrence of 9, as in your sample, or all of them? – Archemar, Mar 23, 2015 at 12:41
- All columns, Archemar. If a column's value is 9 on all rows, then I want to remove that column. – John, Mar 23, 2015 at 12:45
- So, individual 9s are not a problem? You only want to remove a column if it consists exclusively of 9s? – terdon, Mar 23, 2015 at 13:05
- How big is your big file? Is it so big that manipulating it in an array will use up all your free memory? – Peter.O, Mar 23, 2015 at 17:40
5 Answers
In Python:
#! /usr/bin/env python3
import sys
# Get the numbers
numbers = [[int(x) for x in line.strip().split()] for line in sys.stdin]
# Get indexes of 9 in sets for each row
index_9 = (set(x for x, num in enumerate(line) if num == 9) for line in numbers)
common_column = next(index_9).intersection(*index_9)
for line in numbers:
    print(' '.join(str(num) for x, num in enumerate(line) if x not in common_column))
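The script reads from standard input, so one way to run it against the sample file (assuming it is saved as, say, strip9cols.py; the file name is just a placeholder) is:
python3 strip9cols.py < file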
This awk method assumes that each row has the same number of fields (as shown in the example given in the question). It also assumes there are no empty fields.
cat <<EOF >file
1 5 8 3 5 9 5 7 6 9
2 5 7 4 2 9 7 6 3 1
5 9 7 4 1 9 5 7 9 1
EOF
awk '{ for (c=1; c<=NF; c++) a[NR,c] = $c }
END {
  for (c=1; c<=NF; c++) {
    vc = ""                       # values in column c, space-separated
    for (r=1; r<=NR; r++) {
      vc = vc " " a[r,c]
    }
    # if removing every 9 and every space leaves nothing, the column is all 9s
    if ( ! gensub(/[9 ]/, "", "g", vc) ) {
      for (r=1; r<=NR; r++) {
        a[r,c] = ""               # blank out the column
      }
    }
  }
  for (r=1; r<=NR; r++) {
    for (c=1; c<=NF; c++) {
      if (a[r,c] != "") printf "%s ", a[r,c]
    }
    print ""
  }
}' file
# output
1 5 8 3 5 5 7 6 9
2 5 7 4 2 7 6 3 1
5 9 7 4 1 5 7 9 1
Here's a possible approach using bash / GNU coreutils that doesn't require much storage:
cut the file column-by-column and record the indexes of any columns that do not consist entirely of 9s; if you know how many columns your file has (in this case 10) it could be as simple as
for ((i=1;i<11;i++)); do
  [[ $(cut -d' ' -f${i} file | sed '/^9$/d' | wc -l) -eq 0 ]] || a+=($i)
done
(using the fact that only columns consisting entirely of 9s have length 0 after all the 9s are deleted); then pass the list of columns to be retained to a further cut command, using a change of IFS to turn the array into a comma-separated list:
(IFS=, ; cut -d' ' -f"${a[*]}" file)
If your version of cut supports the --complement flag, you could record the columns that do contain all 9s and cut all except those:
for ((i=1;i<11;i++)); do
[[ $(cut -d' ' -f${i} file | sed '/^9$/d' | wc -l) -eq 0 ]] && a+=($i)
done
(IFS=, ; cut -d' ' --complement -f"${a[*]}" file)
- +1, I had the same in mind. You could simplify the first part: cut -d' ' -f${i} infile | grep -qv '^9$' && a+=($i) (or || a+=($i) if using cut with --complement). Now, it'd be interesting to run the four working solutions here against a big file and see which one is faster... – don_crissti, Mar 24, 2015 at 2:04
- @don_crissti thanks - I like the grep idea - much neater. With the complement version, one could maybe even make use of the -m option to quit processing as soon as it finds a non-9 entry? i.e. ... | grep -vqm1 '^9$' || a+=($i). Might speed it up a bit further. – steeldriver, Mar 24, 2015 at 3:39
- Indeed, that should speed it up (it does work the other way around too: grep -vqm1 '^9$' && a+=($i) for your first approach, but if you want to keep it portable then you'll have to do without -m1 & --complement). – don_crissti, Mar 24, 2015 at 11:54
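Putting those suggestions together, here is a possible combined sketch of the --complement variant (my own assembly of the ideas above, assuming GNU cut and grep, a 10-column file, and a guard for the case where no column is all 9s):
a=()
for ((i=1;i<11;i++)); do
  # add column i to the delete list only if it contains nothing but 9s
  cut -d' ' -f${i} file | grep -vqm1 '^9$' || a+=($i)
done
if ((${#a[@]})); then
  (IFS=, ; cut -d' ' --complement -f"${a[*]}" file)
else
  cat file
fi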
You can try this with awk:
awk '{ t[NR] = 0ドル; if (NR == 1) { for (i = 1; i <= NF; i++) if ($i == 9) { met[i] = 1 } } else { for (i = 1; i <= NF; i++) { if (met[i] != 1 || $i != 9) { met[i] = 0; } } } }
END { for (i = 1; i <= NR; i++) { n = split(t[i], a); for (z = 1; z <= NF; z++) if (met[z] != 1) { printf("%s ", a[z]); } print "" } }' file
Or, as in this answer, we can construct the parameters to cut, which is faster:
awk '{ t[NR] = 0ドル; if (NR == 1) { for (i = 1; i <= NF; i++) if ($i == 9) { met[i] = 1 } } else { for (i = 1; i <= NF; i++) { if (met[i] != 1 || $i != 9) { met[i] = 0; } } } }
END { c = 0; s = " -f"; for ( i = 1; i <= NF; i++) { if (met[i] == 1) { if (c == 0) s = s " " i; else s = s "," i; c++; } } s = s " -d\" \" "; if (c != 0) { system("cut --complement " s " " FILENAME); } else { system("cat " FILENAME) } }' file
And of course both are open for criticism.
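For the sample input from the question, column 6 is the only one that is 9 on every row, so the command that the END block builds and runs would effectively be:
cut --complement -f 6 -d" " file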
- Good! The cut version is approximately 3 times faster than the split version, and even that one is significantly faster than what I came up with (+1). BTW, you seem to have an extra print s, c; in the cut version. – Peter.O, Mar 24, 2015 at 1:41
Given the info in the question I can currently come up with:
awk '{for (i=1; i<=NF; i++){ a[i]+=$i; b[i]=b[i]" " $i}} END{for (i=1; i<=NF; i++) if (a[i]/NR!=9) {printf "%s\n", b[i]}}' same-column-values
The script iterates through the file, accumulating each column's sum in array 'a' and appending each value to the corresponding entry of array 'b'. After the file is completely read, the array of sums is iterated over, and if the sum divided by the number of records (NR) does not equal 9, the corresponding line in array 'b' is printed.
This gets me an output of
1 2 5
5 5 9
8 7 7
3 4 4
5 2 1
5 7 5
7 6 7
6 3 9
9 1 1
A drawback of this is that the output is transposed: each output line is one column of the original file, so it has to be read top-to-bottom and translated back to left-to-right.
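If you need the original row-oriented layout back, the transposed output can be turned around with another small awk pass; a minimal sketch, assuming the output above was saved to a file called transposed (the name is just an example):
awk '{ for (i=1; i<=NF; i++) row[i] = row[i] " " $i }
END { for (i=1; i<=NF; i++) print substr(row[i], 2) }' transposed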
Alternatively, you can get a list of the columns that contain only 9s with the following command:
awk '{for (i=1; i<=NF; i++){ a[i]+=$i; b[i]=b[i]" " $i}} END{for (i=1; i<=NF; i++) if (a[i]/NR==9){print i; }}' same-column-values
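That list of column numbers can then be handed to cut --complement (as in the answers above) to produce the row-oriented output directly; a rough sketch, assuming GNU cut and the same input file, and inheriting the sum-based test with the caveat raised in the comment below:
cols=$(awk '{for (i=1; i<=NF; i++) a[i]+=$i}
END{for (i=1; i<=NF; i++) if (a[i]/NR==9) print i}' same-column-values | paste -sd, -)
if [ -n "$cols" ]; then
  cut -d' ' --complement -f"$cols" same-column-values
else
  cat same-column-values
fi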
- The add-and-divide method can easily give false positives, e.g. 3 lines of a file may have a column with 8+9+10=27, which gives the same result as 3 lines containing 9+9+9=27. – Peter.O, Mar 23, 2015 at 18:47