Remove fields containing specific string

Question 1

I have a file, file1, containing multiple tab-separated fields. I would like to remove only the fields containing a specific string, in my case the underscore character _ (without removing the whole row):

cat file1
357M 2054_
357_ 154= 1900_
511_ 419X 1481_ 34=

I would like to obtain the following:

cat file2
357M
154=
419X 34=

I managed to remove the fields as follows:

cat file1 | perl -pe 's/\w+_\s*//g'
357M 154= 419X 34=

But the format is not good, because I would like not to alter the number of columns.

I also tried:

cat file1 | sed 's/[0-9]*_//g'
357M
 154=
 419X 34=

But I would like to get rid of those empty columns.

A brute force approach that actually also works:

cat file1 | sed 's/[0-9]*_//g' | tr -s '\t' '\t' | sed 's/^[ \t]*//g'
357M
154=
419X 34=

This last command: (1) removes all fields containing an underscore; (2) replaces multiple tabs in a row with just one tab; (3) removes leading tabs. Not so elegant though.

Any suggestions?

Question 2

How did "357M" become "357=" (and 419X become 419= .. and so on). Your input and output don't appear to match the requirements...

Question 3

my bad - wrong copypaste. edited

Question 4

Consider:

sed 's/[^\t]*_//; s/\t[^\t]*_/\t/g' < input

This performs two (conditional) substitutions:

The first removes the first run of (zero or more) non-tab characters that is followed by an underscore, replacing it with nothing.
The second replaces a tab followed by (zero or more) non-tab characters and an underscore with a single tab, and repeats that for every match on the line.

The first substitution is needed to catch a leading field, which has no tab in front of it; the second sweeps up the rest.

This leaves the original fields in place in their columns:

357M
 154=
 419X 34=

To strip the fields completely, simply include the tabs in the search-and-replace text:

sed 's/[^\t]*_\t//; s/\t[^\t]*_//g' < input

Results in:

357M
154=
419X 34=

Question 5

I thought the requirements were to not keep the fields in the columns... your output doesn't match the 'file2' example.

Question 6

good point, Stephen - thank you. I've updated the answer accordingly.

Question 7

This assumes the underscore MUST be the final character of a field for the field to be removed. (And if it's not, strange things will occur.) This is true in the provided input, but is not stated as a requirement.

Question 8

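Please note that \t is a GNU sed extension. In POSIX sed a backslash is an ordinary character inside a bracket expression, so [^\t] matches any character other than a backslash or the letter t, not "any non-tab character".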
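For a sed without the \t extension, one possible workaround is to keep a literal tab in a shell variable and splice it into the script. This is only a sketch: the variable name tab and the function name strip_fields are made up for illustration, and the command keeps the same assumption as the answer above, namely that an unwanted field always ends in an underscore.

```shell
# Hold a literal tab in a variable so the sed script needs no \t escape.
tab=$(printf '\t')

# Same two substitutions as the answer above, written with literal tabs.
# Assumption: an unwanted field always ends in "_" (true for the sample data).
strip_fields() {
    sed -e "s/[^${tab}]*_${tab}//" \
        -e "s/${tab}[^${tab}]*_//g"
}

printf '357M\t2054_\n357_\t154=\t1900_\n511_\t419X\t1481_\t34=\n' | strip_fields
```

Because the first substitution has no g flag, a line that starts with two unwanted fields in a row (or consists of a single unwanted field) is still not fully cleaned, exactly as with the \t version.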

Question 9

You could use this simple sed.

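sed 's/\w*_\s*//g;/^$/d' infile.txt

The g flag makes the substitution apply to every matching field on the line, not just the first. /^$/d deletes lines left empty by the substitution, i.e. lines that consisted only of a field ending in an underscore (such as foo_ or a lone _).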

Giving result:

357M
154=
419X 34=

Question 10

FWIW, this may leave a trailing TAB if the last field ends in an _

Question 11

@StephenHarris Not trailing TABs but empty lines. I have updated my answer. thanks
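To make the trailing-TAB point concrete, here is a hedged variant of the same idea that also trims whitespace left at the end of a line. It uses POSIX character classes in place of \w and \s, and the function name drop_underscore_words is invented for this sketch. It still assumes that the unwanted fields are made of word characters and end in an underscore.

```shell
# Variant of the one-liner above, written with POSIX character classes.
# Assumption: unwanted fields consist of letters, digits and "_" and end in "_".
drop_underscore_words() {
    # 1) remove each such field plus the whitespace that follows it
    # 2) trim whitespace left at the end of the line
    # 3) drop lines that are now empty
    sed -e 's/[[:alnum:]_]*_[[:space:]]*//g' \
        -e 's/[[:space:]]*$//' \
        -e '/^$/d'
}

printf '357M\t2054_\n357_\t154=\t1900_\n511_\t419X\t1481_\t34=\nfoo_\n' | drop_underscore_words
```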

Question 12

There's always the "brute force and ignorance" approach.

1. Strip out the bad fields.
2. Convert multiple tabs to a single tab.
3. Remove a single tab from the front of the line.
4. Remove a single tab from the end of the line.

It's not smart, it's not clever, but it works.

In the following, TAB means the literal TAB character

sed -e 's/[0-9]*_//g' -e 's/TABTAB/TAB/g' -e 's/^TAB//' -e 's/TAB$//'

eg

$ cat x
357M 2054_
357_ 154= 1900_
511_ 419X 1481_ 34=
$ sed -e 's/[0-9]*_//g' -e 's/ / /g' -e 's/^ //' -e 's/ $//' < x
357M
154=
419X 34=
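One thing to watch: replacing TABTAB with TAB turns a run of three tabs into two, which can happen when two neighbouring interior fields are both removed. A sketch that collapses a run of any length, assuming (as above) that the bad fields are digits followed by an underscore; the names tab and brute_force are only for illustration:

```shell
# Literal tab, so the script works without relying on \t in sed.
tab=$(printf '\t')

brute_force() {
    # 1) strip the bad fields (digits followed by "_")
    # 2) collapse every run of one or more tabs into a single tab
    # 3) remove a leading tab
    # 4) remove a trailing tab
    sed -e 's/[0-9]*_//g' \
        -e "s/${tab}${tab}*/${tab}/g" \
        -e "s/^${tab}//" \
        -e "s/${tab}\$//"
}

printf '357M\t2054_\n357_\t154=\t1900_\n511_\t419X\t1481_\t34=\n' | brute_force
```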

Question 13

awk:

awk 'a=""; {for(i=1; i<=NF; ++i) {if($i ~ /[MX=]$/) a=(a?a"\t":"")$i}; \
 if(a) print a}' file.txt

a="" resets the variable a to the empty string for each record, making it record-specific; used as a pattern, the assignment evaluates to false, so it triggers no default print
for(i=1; i<=NF; ++i) {if($i ~ /[MX=]$/) a=(a?a"\t":"")$i} iterates over the fields and checks whether each field ends in M, X or =; if so, it appends the field to a, inserting a tab first if a already holds a previously saved field
if(a) print a prints a if it is not empty

Golfed:

awk 'a="";{for(i=1;i<=NF;++i)if($i~/[MX=]$/)a=(a?a"\t":"")$i;if(a)print a}'

Example:

% cat file.txt 
357M 2054_
357_ 154= 1900_
511_ 419X 1481_ 34=
% awk 'a=""; {for(i=1; i<=NF; ++i) {if($i ~ /[MX=]$/) a=(a?a"\t":"")$i}; if(a) print a}' file.txt
357M
154=
419X 34=
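The /[MX=]$/ test is tied to the endings that happen to occur in the sample. If the real requirement is "drop any field that contains an underscore", a possible variant is to test for the underscore directly. This is a sketch under that assumption, with tab as the only field separator; the function name keep_clean_fields is hypothetical.

```shell
keep_clean_fields() {
    awk -F '\t' '{
        out = ""; n = 0
        for (i = 1; i <= NF; i++)
            if (index($i, "_") == 0)            # field has no underscore
                out = (n++ ? out "\t" : "") $i  # join kept fields with tabs
        if (n) print out                        # skip rows with nothing left
    }'
}

printf '357M\t2054_\n357_\t154=\t1900_\n511_\t419X\t1481_\t34=\n' | keep_clean_fields
```

Counting the kept fields in n, rather than testing the string a itself, avoids depending on how awk treats a value such as 0 in a boolean context.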

Question 14

This would be somewhat easier if you were concerned only with interior fields (i.e., not the first or last field on a line). But you want to look at all the fields. So I have a solution that makes it look like we’re not handling the last field on each line:

sed -e 's/$/\t/' -e 's/[^\t]*_[^\t]*\t//g' -e 's/\t$//'

This:

1. Adds a tab at the end of every line (thus creating, in effect, an (n+1)th field, which is null).
2. Finds all fields (strings of non-tab characters) that contain an _ and removes them, and the following tab, by replacing them with nothing. This works on the nth field (i.e., the last field on the original line) because step 1 added a tab at the end.
3. Removes the superfluous tab from the end of the line.

This has the feature (which I know you didn’t ask for, but you might appreciate once you see that it’s available) that it preserves null fields:

$ cat file3
The brown jumps the dog.
 quick fox over lazy
Four and_ years
 score seven ago...
$ (the_above_command) file3
The brown jumps the dog.
 quick fox over lazy
Four years
 score seven ago...

P.S. Depending on what version of sed you have, you may need to type actual tabs into the command instead of \t. Or, if you’re using bash, you can use $'...' for the sed command strings which contain \t.
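As an illustration of the "actual tabs" route, the same three expressions can be built around a literal tab stored in a shell variable. This is a sketch only; the variable tab and the function name remove_underscore_fields are not part of the command above.

```shell
# Literal tab produced by printf, usable in any POSIX shell.
tab=$(printf '\t')

remove_underscore_fields() {
    # 1) append a tab so that every field, including the last, ends in one
    # 2) delete each field containing "_" together with its trailing tab
    # 3) remove the tab added in step 1
    sed -e "s/\$/${tab}/" \
        -e "s/[^${tab}]*_[^${tab}]*${tab}//g" \
        -e "s/${tab}\$//"
}

printf 'Four\tand_\tyears\n\tscore\tseven\tago...\n' | remove_underscore_fields
```

The second input line starts with an empty field, which comes through unchanged.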

score 4 · Accepted Answer · 2017-08-30 01:25:19Z · the two-substitution sed answer given in full under Question 4 above

