I have file1 containing multiple tab-separated fields. I would like to remove only the fields that contain a specific string, in my case the underscore character _, without removing the whole row:
cat file1
357M 2054_
357_ 154= 1900_
511_ 419X 1481_ 34=
I would like to obtain the following:
cat file2
357M
154=
419X 34=
I managed to remove the fields as follows:
cat file1 | perl -pe 's/\w+_\s*//g'
357M 154= 419X 34=
But the format is not right, because the rows are merged into a single line and I would like to keep them separate.
I also tried:
cat file1 | sed 's/[0-9]*_//g'
357M
154=
419X 34=
But I would like to get rid of those empty columns.
A brute force approach that actually also works:
cat file1 | sed 's/[0-9]*_//g' | tr -s '\t' '\t' | sed 's/^[ \t]*//g'
357M
154=
419X 34=
This last command (1) removes all fields containing an underscore, (2) replaces multiple tabs in a row with just one tab, and (3) removes leading tabs. Not so elegant, though.
Any suggestions?
- How did "357M" become "357=" (and 419X become 419=, and so on)? Your input and output don't appear to match the requirements... – Stephen Harris, Aug 30, 2017 at 1:06
- My bad, wrong copy-paste. Edited. – amina, Aug 30, 2017 at 1:09
5 Answers
Consider:
sed 's/[^\t]*_//; s/\t[^\t]*_/\t/g' < input
This performs two substitutions:
- the first replaces the first run of zero or more non-tab characters followed by an underscore with nothing;
- the second replaces a tab followed by zero or more non-tab characters and an underscore with a single tab, and repeats that for every match on the line.
The first substitution is needed to catch a leading field that should be removed; the second sweeps up the rest.
This leaves the original fields in place in their columns:
357M
154=
419X 34=
To strip the fields completely, simply include the tabs in the search-and-replace text:
sed 's/[^\t]*_\t//; s/\t[^\t]*_//g' < input
Results in:
357M
154=
419X 34=
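As a hedged sketch, both variants can be tried on the question's sample data with a literal tab kept in a shell variable. That variable is an assumption made here so the expressions do not rely on sed understanding \t; the function names are hypothetical, and tr is used only to make the tabs visible as |.

```shell
# Assumption: a literal tab in a variable replaces \t, so this does not need GNU sed.
tab=$(printf '\t')

# Sample data from the question, rebuilt with real tabs.
sample() {
    printf '357M\t2054_\n'
    printf '357_\t154=\t1900_\n'
    printf '511_\t419X\t1481_\t34=\n'
}

# Variant 1: blank the matching fields but keep their tabs (columns stay in place).
keep_columns() {
    sed "s/[^${tab}]*_//; s/${tab}[^${tab}]*_/${tab}/g"
}

# Variant 2: remove the matching fields together with a neighbouring tab.
strip_fields() {
    sed "s/[^${tab}]*_${tab}//; s/${tab}[^${tab}]*_//g"
}

sample | keep_columns | tr '\t' '|'
# 357M|
# |154=|
# |419X||34=

sample | strip_fields | tr '\t' '|'
# 357M
# 154=
# 419X|34=
```

Both variants still assume that the underscore is the last character of a field, as in the sample input.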
- I thought the requirements were to not keep the fields in the columns... your output doesn't match the 'file2' example. – Stephen Harris, Aug 30, 2017 at 1:55
- Good point, Stephen, thank you. I've updated the answer accordingly. – Aug 30, 2017 at 1:58
- This assumes the underscore MUST be the final character of a field for the field to be removed. (And if it's not, strange things will occur.) This is true in the provided input, but is not stated as a requirement. – Wildcard, Aug 30, 2017 at 3:48
- Please note that [^\t] will work with GNU sed only. – Philippos, Aug 30, 2017 at 6:25
You could use this simple sed command:
sed 's/\w*_\s*//g;/^$/d' infile.txt
The /^$/d part deletes lines that end up empty, which happens when a line consists of a single field ending in an underscore (foo_) or of _ alone.
Giving result:
357M
154=
419X 34=
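When the last field of a line ends in an underscore, removing it leaves the tab that preceded it. A minimal sketch of one way to trim that tab follows. It assumes POSIX character classes in place of \w and \s, and the helper name is hypothetical. Like the command above, it removes only the word characters up to and including the underscore, so it expects the underscore to end the field.

```shell
# Hypothetical helper; [[:alnum:]_] and [[:space:]] stand in for \w and \s.
drop_underscore_fields() {
    sed -e 's/[[:alnum:]_]*_[[:space:]]*//g' \
        -e 's/[[:space:]]*$//' \
        -e '/^$/d'
}

# tr makes any remaining tabs visible as '|'.
printf '357M\t2054_\nfoo_\n511_\t419X\t1481_\t34=\n' |
    drop_underscore_fields | tr '\t' '|'
# 357M
# 419X|34=
```

The second expression strips trailing whitespace after the fields are removed, and the third drops lines that became empty (here the line holding only foo_).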
- FWIW, this may leave a trailing TAB if the last field ends in an _. – Stephen Harris, Aug 31, 2017 at 0:40
- @StephenHarris Not trailing TABs but empty lines. I have updated my answer. Thanks. – αғsнιη, Aug 31, 2017 at 2:48
There's always the "brute force and ignorance" approach.
- Strip out the bad fields
- Convert multiple tabs to a single tab
- Remove a single tab from the front of the line
- Remove a single tab from the end of the line
It's not smart, it's not clever, but it works.
In the following, TAB means the literal TAB character:
sed -e 's/[0-9]*_//g' -e 's/TABTAB/TAB/g' -e 's/^TAB//' -e 's/TAB$//'
For example:
$ cat x
357M 2054_
357_ 154= 1900_
511_ 419X 1481_ 34=
$ sed -e 's/[0-9]*_//g' -e 's/ / /g' -e 's/^ //' -e 's/ $//' < x
357M
154=
419X 34=
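Replacing tab pairs once per position is enough for the sample, but a run of three or more tabs (two or more adjacent removed fields) would still leave doubled tabs. A hedged variation of the same four steps squeezes a whole run in one go. It assumes a literal tab held in a shell variable, and the function name is hypothetical.

```shell
# Assumption: a literal tab in a variable takes the place of the typed TAB character.
tab=$(printf '\t')

brute_force() {
    sed -e 's/[0-9]*_//g' \
        -e "s/${tab}${tab}*/${tab}/g" \
        -e "s/^${tab}//" \
        -e "s/${tab}\$//"
}

# Two removed fields in a row at the start and in the middle of the line.
printf '1_\t2_\t3_\tA\t4_\t5_\tB\n' | brute_force | tr '\t' '|'
# A|B
```

The pattern TAB followed by zero or more TABs matches any run of tabs, so each run collapses to exactly one tab before the leading and trailing ones are stripped.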
Using awk:
awk 'a=""; {for(i=1; i<=NF; ++i) {if($i ~ /[MX=]$/) a=(a?a"\t":"")$i}; \
if(a) print a}' file.txt
- a="" sets variable a to null for the current record, i.e. it makes a record-specific. Because the assignment evaluates to an empty (false) value, it does not trigger awk's default print either.
- for(i=1; i<=NF; ++i) {if($i ~ /[MX=]$/) a=(a?a"\t":"")$i} iterates over the fields and checks whether each field ends in M, X or =; if so, it appends the field to a, with a tab separating it from any previously saved field.
- if(a) print a prints a if it is not null.
Golfed:
awk 'a="";{for(i=1;i<=NF;++i)if($i~/[MX=]$/)a=(a?a"\t":"")$i;if(a)print a}'
Example:
% cat file.txt
357M 2054_
357_ 154= 1900_
511_ 419X 1481_ 34=
% awk 'a=""; {for(i=1; i<=NF; ++i) {if($i ~ /[MX=]$/) a=(a?a"\t":"")$i}; if(a) print a}' file.txt
357M
154=
419X 34=
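The command above keeps fields by their last character (M, X or =), which fits the sample. A hedged alternative sketch tests for the underscore itself, so it does not depend on how the wanted fields end. It assumes tab-separated input, and the function name is hypothetical.

```shell
# Hypothetical helper: keep every field that does not contain an underscore.
drop_fields() {
    awk -F '\t' '{
        out = ""; n = 0
        for (i = 1; i <= NF; i++)
            if ($i !~ /_/)                       # field has no underscore: keep it
                out = out (n++ ? "\t" : "") $i   # tab before every field but the first
        if (n) print out                         # skip lines with nothing left
    }' "$@"
}

printf '357M\t2054_\n357_\t154=\t1900_\n511_\t419X\t1481_\t34=\n' |
    drop_fields | tr '\t' '|'
# 357M
# 154=
# 419X|34=
```

Counting kept fields in n, rather than testing the string itself, means a line whose only surviving field is 0 is still printed.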
This would be somewhat easier if you were concerned only with interior fields (i.e., not the first or last field on a line). But you want to look at all the fields. So I have a solution that makes it look like we’re not handling the last field on each line:
sed -e 's/$/\t/' -e 's/[^\t]*_[^\t]*\t//g' -e 's/\t$//'
This
1. Adds a tab at the end of every line (thus creating, in effect, an (n+1)th field, which is null).
2. Finds all fields (strings of non-tab characters) that contain an _ and removes them, along with the following tab, by replacing them with nothing. This works on the nth field (i.e., the last field on the original line) because step 1 added a tab at the end.
3. Removes the superfluous tab from the end of the line.
This has the feature (which I know you didn’t ask for, but you might appreciate once you see that it’s available) that it preserves null fields:
$ cat file3
The brown jumps the dog.
quick fox over lazy
Four and_ years score seven ago...
$ (the_above_command) file3
The brown jumps the dog.
quick fox over lazy
Four years score seven ago...
P.S. Depending on what version of sed you have, you may need to type actual tabs into the command instead of \t. Or, if you’re using bash, you can use $'...' for the sed command strings which contain \t.
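One way to get actual tabs into the command without typing them is to keep a tab in a shell variable. The sketch below applies that to the three-step command. The variable and function names are assumptions made for illustration, and tr is used only to show tabs as |.

```shell
# Assumption: a literal tab in a variable replaces every \t in the command.
tab=$(printf '\t')

remove_underscore_fields() {
    sed -e "s/\$/${tab}/" \
        -e "s/[^${tab}]*_[^${tab}]*${tab}//g" \
        -e "s/${tab}\$//"
}

# The first line has a null second field; it survives the removal of b_c.
printf 'a\t\tb_c\td\n_x\ty\n' | remove_underscore_fields | tr '\t' '|'
# a||d
# y
```

Because the underscore may sit anywhere between the two non-tab runs, this form also removes fields such as b_c where the underscore is not the last character.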