3

I have file1, which contains multiple tab-separated fields. I would like to remove only the fields that contain a specific string, in my case the underscore character _ (without removing the whole row):

cat file1
357M 2054_
357_ 154= 1900_
511_ 419X 1481_ 34=

I would like to obtain the following:

cat file2
357M
154=
419X 34=

I managed to remove the fields as follows:

cat file1 | perl -pe 's/\w+_\s*//g'
357M 154= 419X 34=

But the format is wrong: the output lines are joined into one (\s* also matches the newline at the end of each line), whereas I would like to keep one output line per input line.

I also tried:

cat file1 | sed 's/[0-9]*_//g'
357M
 154=
 419X 34=

But I would like to get rid of those empty columns.

A brute force approach that actually also works:

cat file1 | sed 's/[0-9]*_//g' | tr -s '\t' '\t' | sed 's/^[ \t]*//g'
357M
154=
419X 34=

This last command (1) removes all fields containing an underscore; (2) replaces multiple tabs in a row with just one tab; (3) removes leading tabs. Not so elegant, though.

Any suggestions?

Peter Mortensen
asked Aug 30, 2017 at 0:42
2
  • How did "357M" become "357=" (and 419X become 419= .. and so on). Your input and output don't appear to match the requirements... Commented Aug 30, 2017 at 1:06
  • my bad - wrong copypaste. edited Commented Aug 30, 2017 at 1:09

5 Answers

4

Consider:

sed 's/[^\t]*_//; s/\t[^\t]*_/\t/g' < input

This performs two (conditional) substitutions:

  • the first replaces the first run of zero or more non-tab characters followed by an underscore with nothing
  • the second replaces a tab, followed by zero or more non-tab characters and an underscore, with a single tab, and (because of the g flag) does so for every match on the line.

The first search is needed in order to find leading fields that should be removed; the second sweeps up the rest.

This leaves the original fields in place in their columns:

357M
 154=
 419X 34=

To strip the fields completely, simply include the tabs in the search-and-replace text:

sed 's/[^\t]*_\t//; s/\t[^\t]*_//g' < input

Results in:

357M
154=
419X 34=
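
As one of the comments points out, \t inside a bracket expression is not portable. A minimal sketch of the same two substitutions for a POSIX shell follows; it assumes nothing about sed beyond basic regular expressions, and the TAB variable and the strip_underscore_fields function name are my own, made up for illustration:

```shell
# TAB holds a literal tab, so the sed script does not rely on \t.
TAB=$(printf '\t')

# Hypothetical helper: same two substitutions as the answer above.
strip_underscore_fields() {
    # 1st expression: remove the first field ending in "_" together with the tab after it.
    # 2nd expression: remove every remaining field ending in "_" together with the tab before it.
    sed -e "s/[^${TAB}]*_${TAB}//" -e "s/${TAB}[^${TAB}]*_//g"
}

# Sample input from the question, built with printf so the tabs are real.
printf '357M\t2054_\n357_\t154=\t1900_\n511_\t419X\t1481_\t34=\n' | strip_underscore_fields
```

Like the original, this only handles fields whose last character is the underscore.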
answered Aug 30, 2017 at 1:25
4
  • I thought the requirements were to not keep the fields in the columns... your output doesn't match the 'file2' example. Commented Aug 30, 2017 at 1:55
  • good point, Stephen - thank you. I've updated the answer accordingly. Commented Aug 30, 2017 at 1:58
  • This assumes the underscore MUST be the final character of a field for the field to be removed. (And if it's not, strange things will occur.) This is true in the provided input, but is not stated as a requirement. Commented Aug 30, 2017 at 3:48
  • Please note that [^\t] will work with GNU sed only. Commented Aug 30, 2017 at 6:25
3

You could use this simple sed.

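sed 's/\w*_\s*//g;/^$/d' infile.txt

The g flag applies the substitution to every matching field on the line rather than only the first. /^$/d then deletes lines left empty, which happens when a line contained only fields ending in an underscore (such as foo_ or a lone _).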

Giving result:

357M
154=
419X 34=
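
\w and \s are GNU extensions, and the substitution can leave a tab behind when the last field is removed. A rough sketch that uses POSIX character classes instead and also trims trailing whitespace (the drop_trailing_underscore_fields name is hypothetical, and this is only one way to do it):

```shell
# Hypothetical helper, POSIX character classes instead of \w and \s.
drop_trailing_underscore_fields() {
    # 1) remove each run of word characters ending in "_" plus any whitespace after it
    # 2) trim whitespace left at the end of the line
    # 3) delete lines that ended up empty
    sed -e 's/[[:alnum:]_]*_[[:space:]]*//g' \
        -e 's/[[:space:]]*$//' \
        -e '/^$/d'
}

printf '357M\t2054_\n357_\t154=\t1900_\n511_\t419X\t1481_\t34=\n' | drop_trailing_underscore_fields
```

As with the original, a field such as a_b would only lose its a_ part, so this still assumes the underscore ends the field.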
answered Aug 30, 2017 at 6:45
2
  • FWIW, this may leave a trailing TAB if the last field ends in an _ Commented Aug 31, 2017 at 0:40
  • @StephenHarris Not trailing TABs but empty lines. I have updated my answer. thanks Commented Aug 31, 2017 at 2:48
2

There's always the "brute force and ignorance" approach.

  • Strip out the bad fields
  • Convert multiple tabs to a single tab
  • Remove a single tab from the front of the line
  • Remove a single tab from the end of the line

It's not smart, it's not clever, but it works.

In the following, TAB means the literal TAB character

sed -e 's/[0-9]*_//g' -e 's/TABTAB/TAB/g' -e 's/^TAB//' -e 's/TAB$//'

For example:

$ cat x
357M 2054_
357_ 154= 1900_
511_ 419X 1481_ 34=
$ sed -e 's/[0-9]*_//g' -e 's/TABTAB/TAB/g' -e 's/^TAB//' -e 's/TAB$//' < x
357M
154=
419X 34=
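
One caveat: replacing TABTAB with TAB only halves a run of tabs, so three or more adjacent removed fields would still leave an empty column. A small sketch of the same four steps that collapses runs of any length, assuming a POSIX shell (the TAB variable and the brute_force name are mine, for illustration only):

```shell
# TAB holds a literal tab character.
TAB=$(printf '\t')

# Hypothetical helper: the four brute-force steps in one sed call.
brute_force() {
    # 1) strip digits followed by "_"
    # 2) collapse any run of one or more tabs into a single tab
    # 3) remove a leading tab
    # 4) remove a trailing tab
    sed -e 's/[0-9]*_//g' \
        -e "s/${TAB}${TAB}*/${TAB}/g" \
        -e "s/^${TAB}//" \
        -e "s/${TAB}\$//"
}

# Two adjacent bad fields in the middle of the line leave three tabs in a row.
printf '511_\t419X\t1481_\t1900_\t34=\n' | brute_force
```

This should print 419X and 34= separated by a single tab.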
answered Aug 30, 2017 at 1:17
2

awk:

awk 'a=""; {for(i=1; i<=NF; ++i) {if($i ~ /[MX=]$/) a=(a?a"\t":"")$i}; \
 if(a) print a}' file.txt
  • a="" resets the variable a to the empty string at the start of each record, so its contents are specific to the current record. Used as a pattern, the assignment evaluates to an empty (false) value, so it never triggers awk's default print.

  • for(i=1; i<=NF; ++i) {if($i ~ /[MX=]$/) a=(a?a"\t":"")$i} iterates over the fields and checks whether each field ends in M, X or =; if so, it appends the field to a, with a tab separating it from any previously saved field

  • if(a) print a prints a if it is not empty


Golfed:

awk 'a="";{for(i=1;i<=NF;++i)if($i~/[MX=]$/)a=(a?a"\t":"")$i;if(a)print a}'

Example:

% cat file.txt 
357M 2054_
357_ 154= 1900_
511_ 419X 1481_ 34=
% awk 'a=""; {for(i=1; i<=NF; ++i) {if($i ~ /[MX=]$/) a=(a?a"\t":"")$i}; if(a) print a}' file.txt
357M
154=
419X 34=
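
The test above keeps fields ending in M, X or =, which fits the sample data. If the real rule is "drop any field containing an underscore", a sketch along these lines might be closer (drop_fields, out and n are names I made up; it assumes tab-separated input):

```shell
# Hypothetical helper: keep only the fields that contain no underscore.
drop_fields() {
    awk -F '\t' -v OFS='\t' '{
        out = ""; n = 0                    # reset for every record
        for (i = 1; i <= NF; i++)
            if ($i !~ /_/)                 # field has no underscore: keep it
                out = (n++ ? out OFS : "") $i
        if (n) print out                   # skip records with nothing left
    }'
}

printf '357M\t2054_\n357_\t154=\t1900_\n511_\t419X\t1481_\t34=\n' | drop_fields
```

Counting kept fields in n, rather than testing the string itself, is just a design choice so that the decision to print does not depend on what the kept fields contain.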
answered Aug 30, 2017 at 1:33
1

This would be somewhat easier if you were concerned only with interior fields (i.e., not the first or last field on a line). But you want to look at all the fields. So I have a solution that makes it look like we’re not handling the last field on each line:

sed -e 's/$/\t/' -e 's/[^\t]*_[^\t]*\t//g' -e 's/\t$//'

This

  1. Adds a tab at the end of every line (thus creating, in effect, an n+1 th field, which is null).
  2. Finds all fields (strings of non-tab characters) that contain an _ and removes them, and the following tab, by replacing them with nothing. This works on the n th field (i.e., the last field on the original line) because step 1 added a tab at the end.
  3. Removes the superfluous tab from the end of the line.

This has the feature (which I know you didn’t ask for, but you might appreciate once you see that it’s available) that it preserves null fields:

$ cat file3
The brown jumps the dog.
 quick fox over lazy
Four and_ years
 score seven ago...
$ (the_above_command) file3
The brown jumps the dog.
 quick fox over lazy
Four years
 score seven ago...

P.S. Depending on what version of sed you have, you may need to type actual tabs into the command instead of \t. Or, if you’re using bash, you can use $'...' for the sed command strings which contain \t.
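
For a sed that does not understand \t, the three steps can be sketched with a shell variable holding a literal tab. This is only an illustration under that assumption; TAB and remove_keep_nulls are hypothetical names:

```shell
# TAB holds a literal tab character.
TAB=$(printf '\t')

# Hypothetical helper: the same three steps as the command above.
remove_keep_nulls() {
    # 1) append a tab so that every field, including the last, is followed by one
    # 2) remove each field containing "_" together with the tab that follows it
    # 3) remove the tab that step 1 appended
    sed -e "s/\$/${TAB}/" \
        -e "s/[^${TAB}]*_[^${TAB}]*${TAB}//g" \
        -e "s/${TAB}\$//"
}

# The second line has an empty first and third field; both should survive.
printf 'Four\tand_\tyears\n\tscore\t\tseven\n' | remove_keep_nulls
```

The first line should come out as Four and years separated by one tab, and the second line should be unchanged, empty fields included.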

answered Aug 30, 2017 at 1:51
