Let's say I have a program blackbox, and a file with the following contents:
in this file
this line contains =TAG=
so does =TAG= this one
as =TAG= does this other line
this line does not
nor does this line
=TAG= here again
gone again
How do I run blackbox on only the lines containing =TAG=?
Note 1: One way is to use a while read loop, but this is considered bad practice. So what is the canonical, right way to do this (if there is one)?
Note 2: Of course, if I were just editing text, a solution with AWK or sed would be appropriate, but blackbox might have desired side effects. This question is for those situations where I need to execute another process.
Note 3: You might ask what happens if blackbox is something like nl or sort, where running it on multiple lines together has a different outcome from running a new process on each line.
In that case, I want to be able to do it each of these three ways:
Block-wise: replace each block of contiguous lines containing =TAG= with the result of blackbox on that block. Expected output with blackbox=nl:
in this file
1 this line contains =TAG=
2 so does =TAG= this one
3 as =TAG= does this other line
this line does not
nor does this line
1 =TAG= here again
gone again
Line-wise: replace each line containing =TAG= with the result of blackbox on that line. Expected output with blackbox=nl:
in this file
1 this line contains =TAG=
1 so does =TAG= this one
1 as =TAG= does this other line
this line does not
nor does this line
1 =TAG= here again
gone again
Continuously: send all the lines containing =TAG= to a blackbox process, and replace each block with the lines that would be newly printed before blackbox received the next block. Expected output with blackbox=nl:
in this file
1 this line contains =TAG=
2 so does =TAG= this one
3 as =TAG= does this other line
this line does not
nor does this line
4 =TAG= here again
gone again
(If we used sort instead, all matching lines would end up sorted in the last matching block, because they wouldn't get printed until the end.)
I haven't found any questions on here asking about the general problem, but these are all special cases of this problem:
- How to write script prefix braces with backslashes (line-wise)
- Edit file based on existence of a string (line-wise)
- Conditionally replace lines of file1 with the corresponding lines of file2 (line-wise)
- Possible to speed this while read bash script up? (line-wise)
- Replace string with contents of a file using sed (line-wise)
- change newline with space on certain condition with sed (block-wise)
- Sorting block of lines matching only the first one (block-wise)
- Remove lines matching pattern, plus any lines following it matching a different pattern (block-wise)
- Replace lines matching a pattern with lines from another file in order (continuously)
- Something like `paste` but with a vertical alignment after a delimiter? (continuously)
- Pass multiple command line arguments to an executable with text files (continuously)
- Delete Nth line from each line matching a pattern (continuously)
2 Answers
Here's one way to do the first thing you asked for (Block-wise), assuming blackbox is nl:
$ cat tst.sh
#!/usr/bin/env bash
awk '
BEGIN { cmd = "nl" }
/=TAG=/ {
print | cmd
next
}
{
close(cmd)
print
}
' "${@:--}"
$ ./tst.sh file
in this file
1 this line contains =TAG=
2 so does =TAG= this one
3 as =TAG= does this other line
this line does not
nor does this line
1 =TAG= here again
gone again
If you want to do it 1 line at a time as in your 2nd request (Line-wise), just move the close():
$ cat tst.sh
#!/usr/bin/env bash
awk '
BEGIN { cmd = "nl" }
/=TAG=/ {
print | cmd
close(cmd)
next
}
{
print
}
' "${@:--}"
$ ./tst.sh file
in this file
1 this line contains =TAG=
1 so does =TAG= this one
1 as =TAG= does this other line
this line does not
nor does this line
1 =TAG= here again
gone again
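Both scripts above hard-code nl as the command. As a minimal sketch (not part of the original answer), the two variants could be folded into one shell function that takes the mode and the command as arguments. The function name run_on_tagged and its MODE argument are hypothetical, and the command is assumed to be a single string that sh can run as-is:

```shell
# Usage: run_on_tagged MODE FILE 'COMMAND STRING'
#   MODE "block": one COMMAND process per run of contiguous matching lines
#   MODE "line":  one COMMAND process per matching line
# Hypothetical helper; the body runs in a subshell so its variables stay local.
run_on_tagged() (
    mode=$1 file=$2 cmd=$3
    export mode cmd               # awk reads both through ENVIRON
    awk '
        BEGIN {
            cmd = ENVIRON["cmd"]
            linewise = (ENVIRON["mode"] == "line")
        }
        /=TAG=/ {
            print | cmd                  # starts cmd if it is not already running
            if (linewise) close(cmd)     # line-wise: finish cmd after every line
            next
        }
        {
            close(cmd)                   # block-wise: finish cmd when a block ends
            print
        }
    ' "$file"
)
```

For example, run_on_tagged block file nl should behave like the first script and run_on_tagged line file nl like the second. Passing the values through the environment rather than with -v avoids awk interpreting backslash escapes in them.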
GNU awk also supports Coprocesses if you need more control over where the output of the external command appears in the overall output or it needs any other additional processing before printing it.
And talking of which... here's a way to implement your 3rd request, Continuously, using GNU awk for coprocesses:
$ cat tst.sh
#!/usr/bin/env bash
awk '
BEGIN { cmd = "nl" }
{ lines[NR] = $0 }
/=TAG=/ {
print |& cmd
tbds[NR]
}
END {
close(cmd, "to")
for ( i=1; i<=NR; i++ ) {
if ( i in tbds ) {
if ( (cmd |& getline line) > 0 ) {
lines[i] = line
}
}
print lines[i]
}
close(cmd)
}
' "${@:--}"
$ ./tst.sh file
in this file
1 this line contains =TAG=
2 so does =TAG= this one
3 as =TAG= does this other line
this line does not
nor does this line
4 =TAG= here again
gone again
That assumes your input is small enough to fit in memory. If it isn't, you can do the same thing with a 2-pass approach, reading the input twice instead of storing it in the lines[] array.
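A rough sketch of that 2-pass idea follows. It is an illustration, not the answer's own code. The function name continuous_two_pass is hypothetical. It assumes the input is a regular file that can be read twice, that the command prints exactly one output line per input line, and that the command is passed as separate words after the file name:

```shell
# Usage: continuous_two_pass FILE COMMAND [ARG...]
# Hypothetical helper; the body runs in a subshell so its variables stay local.
continuous_two_pass() (
    file=$1
    shift
    tmp=$(mktemp) || exit 1

    # Pass 1: send only the matching lines to a single COMMAND process
    # and save everything it prints.
    awk '/=TAG=/' "$file" | "$@" > "$tmp"

    # Pass 2: re-read the input and replace each matching line with the
    # next saved output line. Nothing is held in memory.
    tmp=$tmp awk '
        BEGIN { tmp = ENVIRON["tmp"] }
        /=TAG=/ && (getline line < tmp) > 0 { $0 = line }
        { print }
    ' "$file"
    status=$?

    rm -f "$tmp"
    exit "$status"
)
```

Calling continuous_two_pass file nl should number the matching lines 1 to 4 across both blocks, as in the output above. Because the command has finished before any of its output is read, neither buffering nor a full pipe is an issue here.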
The above might break if the pipe fills up (thanks @Stéphane Chazelas for pointing that out). If your system has stdbuf then you can do the following to make nl output line-buffered:
$ cat tst.sh
#!/usr/bin/env bash
awk '
BEGIN { cmd = "stdbuf -oL nl" }
/=TAG=/ {
print |& cmd
if ( (cmd |& getline line) > 0 ) {
$0 = line
}
}
{ print }
' "${@:--}"
$ ./tst.sh file
in this file
1 this line contains =TAG=
2 so does =TAG= this one
3 as =TAG= does this other line
this line does not
nor does this line
4 =TAG= here again
gone again
With a non-GNU awk you can write to a temp file instead of piping to a process and then read from the tempfile instead of reading from the process, e.g.:
$ cat tst.sh
#!/usr/bin/env bash
tmp=$(mktemp) || exit 1
trap 'rm -f "$tmp"; exit' EXIT
tmp="$tmp" awk '
BEGIN {
tmp = ENVIRON["tmp"]
cmd = "nl > \047" tmp "\047"
}
{ lines[NR] = $0 }
/=TAG=/ {
print | cmd
tbds[NR]
}
END {
close(cmd)
for ( i=1; i<=NR; i++ ) {
if ( i in tbds ) {
if ( (getline line < tmp) > 0 ) {
lines[i] = line
}
}
print lines[i]
}
}
' "${@:--}"
$ ./tst.sh file
in this file
1 this line contains =TAG=
2 so does =TAG= this one
3 as =TAG= does this other line
this line does not
nor does this line
4 =TAG= here again
gone again
or, again, use stdbuf in the same way as we do for the coprocess solution.
If you're ever considering using getline, make sure to read http://awk.freeshell.org/AllAboutGetline first.
- Your co-process approach will deadlock when the pipes get full as you're only getting the output in the end. That and the tmp-file alternative also assume that cmd produces one line of output for each line of input. – Stéphane Chazelas, Nov 22, 2024 at 12:12
- Note that -v tmp="$tmp" assumes $TMPDIR doesn't contain backslash characters. \047 assumes an ASCII-based system. – Stéphane Chazelas, Nov 22, 2024 at 12:13
- @StéphaneChazelas thanks. I fixed the tmp name issue, and I'm OK with leaving the \047 for ASCII '. I expect anyone on a non-ASCII system can figure out the equivalent or another way to ensure the shell doesn't interpret the tmp file name if there's an issue. As for the co-process filling up the pipe: fair enough, I've added a fix for that after reading your perl script. And yes, I'm assuming 1 line of output from the command for 1 line of input. – Ed Morton, Nov 22, 2024 at 14:05
- Yeah, this rules! For the continuous one, the way I imagined it in the case where k lines map to k'>k output lines, we tell the script: "while you process these k lines, if blackbox prints anything now, put those lines here." So for example if blackbox were the awk script { print $0; print $0 }, the continuous variant would be identical to the block-wise variant. If it were sort, all the matching lines would go at the end. – wobtax, Nov 22, 2024 at 17:09
- Are you saying that you'd WANT all the lines to go at the end? Actually, I see you accepted an answer to this question so please just ask a new question with an example demonstrating your requirements. – Ed Morton, Nov 22, 2024 at 22:00
With perl:
$ perl -ne '
if (/=TAG=/) {
STDOUT->flush;
open $cmd, "|-", "nl" unless $cmd;
print $cmd $_;
} else {
close $cmd;
undef $cmd;
print;
}' file
in this file
1 this line contains =TAG=
2 so does =TAG= this one
3 as =TAG= does this other line
this line does not
nor does this line
1 =TAG= here again
gone again
$ perl -ne '
if (/=TAG=/) {
STDOUT->flush;
open $cmd, "|-", "nl";
print $cmd $_;
close $cmd;
} else {
print;
}' file
in this file
1 this line contains =TAG=
1 so does =TAG= this one
1 as =TAG= does this other line
this line does not
nor does this line
1 =TAG= here again
gone again
$ perl -MTime::HiRes=usleep -ne '
if (/=TAG=/) {
STDOUT->flush;
unless ($cmd) {
open $cmd, "|-", "stdbuf", "-oL", "nl";
$cmd->autoflush;
}
print $cmd $_;
usleep(10000)
} else {print}' file
in this file
1 this line contains =TAG=
2 so does =TAG= this one
3 as =TAG= does this other line
this line does not
nor does this line
4 =TAG= here again
gone again
The latter assumes the command does not buffer its output (which we avoid for nl here, or rather replace with Line output buffering, using stdbuf -oL as found on GNU or FreeBSD systems), and that upon reading a line of input, it produces the corresponding line(s) of output within the next 10,000 microseconds (0.01 second).
Here the command is hardcoded as "nl" or "stdbuf", "-oL", "nl" (which you could also write qw(stdbuf -oL nl)) inside the perl code. For a command and its arguments to be passed as extra arguments to perl instead:
perl -ne '
BEGIN {@cmd = @ARGV; @ARGV = ()}
# rest of the code where the open call would become:
open $cmd, "|-", @cmd
' stdbuf -oL nl < file
Instead of a simple command, you can make it any shell code, either by using the code above with sh -c 'any shell code' as the simple command, or by using the open $cmd, "|any shell code" syntax [1].
[1] Actually, in modern versions of perl, perl will skip executing a shell if the command is simple enough.
- These work great, including the expected behavior if I replace nl with sort. I'm not familiar with perl: how do I set the variable $cmd? – wobtax, Nov 22, 2024 at 16:32
- @wobtax, the one you'd expect as hinted by the line(s). close $cmd waits for cmd to finish, and the usleep(10000) is meant to give cmd enough time to output what it has to output; you'd need to adjust it if cmd may take longer. – Stéphane Chazelas, Nov 22, 2024 at 16:48
- Oops, I just deleted my comment because I answered my own question by trying it out. (The question was what it would do if the executable wrote more than one line per line.) Yep, works as expected! – wobtax, Nov 22, 2024 at 16:50