Run program only on matching lines

Question 1

Let's say I have a program blackbox, and a file with the following contents:

in this file
 this line contains =TAG=
 so does =TAG= this one
 as =TAG= does this other line
 this line does not
nor does this line
 =TAG= here again
 gone again

How do I run blackbox on only the lines containing =TAG=?

Note 1: One way is to use a while read loop, but this is considered bad practice. So what is the canonical, right way to do this (if there is one)?

Note 2: Of course, if I were just editing text, a solution with AWK or sed would be appropriate—but blackbox might have desired side effects. This question is for those situations where I need to execute another process.

Note 3: You might ask what happens if blackbox is something like nl or sort—where running it on multiple lines together has a different outcome from running a new process on each line. In that case, I want to be able to do it each of these three ways:

Block-wise: replace each block of contiguous lines containing =TAG= with the result of blackbox on that block.

Expected output with blackbox = nl:

in this file
1 this line contains =TAG=
2 so does =TAG= this one
3 as =TAG= does this other line
 this line does not
nor does this line
1 =TAG= here again
 gone again

Line-wise: replace each line containing =TAG= with the result of blackbox on that line.

Expected output with blackbox = nl:

in this file
1 this line contains =TAG=
1 so does =TAG= this one
1 as =TAG= does this other line
 this line does not
nor does this line
1 =TAG= here again
 gone again

Continuously: send all the lines containing =TAG= to a blackbox process, and replace each block with the lines that would be newly printed before blackbox received the next block.

Expected output with blackbox = nl:
```
in this file
1 this line contains =TAG=
2 so does =TAG= this one
3 as =TAG= does this other line
 this line does not
nor does this line
4 =TAG= here again
 gone again
```
(If we used sort instead, all matching lines would end up sorted in the last matching block, because they wouldn't get printed until the end.)

I haven't found any questions on here asking about the general problem, but these are all special cases of this problem:

How to write script prefix braces with backslashes (line-wise)
Edit file based on existence of a string (line-wise)
Conditionally replace lines of file1 with the corresponding lines of file2 (line-wise)
Possible to speed this while read bash script up? (line-wise)
Replace string with contents of a file using sed (line-wise)
change newline with space on certain condition with sed (block-wise)
Sorting block of lines matching only the first one (block-wise)
Remove lines matching pattern, plus any lines following it matching a different pattern (block-wise)
Replace lines matching a pattern with lines from another file in order (continuously)
Something like `paste` but with a vertical alignment after a delimiter? (continuously)
Pass multiple command line arguments to an executable with text files (continuously)
Delete Nth line from each line matching a pattern (continuously)

Question 2

You can execute another process in awk using getline, FWIW
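
For what it's worth, a minimal line-wise sketch of that idea might look like the following. It assumes `tr a-z A-Z` as a stand-in for blackbox; the function name `linewise_getline` and the `shquote` helper are hypothetical names made up for this illustration, and the matching line is handed to the command through `printf` on a shell command line that awk builds.

```shell
# Line-wise sketch: run the command once per matching line via `cmd | getline`.
# `tr a-z A-Z` is only a stand-in for blackbox; shquote() is a hypothetical helper.
linewise_getline() {
    awk '
        # Wrap s in single quotes so the shell passes it through literally
        # (\047 is an ASCII single quote).
        function shquote(s) {
            gsub(/\047/, "\047\\\047\047", s)
            return "\047" s "\047"
        }
        /=TAG=/ {
            cmd = "printf \"%s\\n\" " shquote($0) " | tr a-z A-Z"
            # Read every line the command prints, then close it so the
            # next matching line starts a fresh process.
            while ( (cmd | getline out) > 0 ) {
                print out
            }
            close(cmd)
            next
        }
        { print }
    ' "$@"
}
```

Because awk reads the command's output itself and prints it, the replacement lines land in the right place without relying on two processes sharing stdout. The cost is one shell plus one command per matching line.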

Question 3

Note that while a while read loop is bad practice for processing text, it's not necessarily bad practice for sequencing calls to tools (that's what a shell exists to do). The grey area is a large file where you need to call a tool on just small parts of it: that's when you need to decide whether to use a while read loop for part of it and, if so, for how much of it, and how you'll separate that part from the rest.
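
As a rough sketch of the "sequencing" use, here is the line-wise variant written as a plain shell loop. It assumes `tr a-z A-Z` as a stand-in for blackbox, and `linewise_loop` is a hypothetical name; the shell only decides which lines go to the tool and does no text processing itself.

```shell
# Line-wise with a shell loop: one blackbox process per matching line.
# `tr a-z A-Z` is only a stand-in for blackbox.
linewise_loop() {
    # IFS= and -r keep leading blanks and backslashes intact;
    # the || test also handles a final line with no trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            *=TAG=*) printf '%s\n' "$line" | tr a-z A-Z ;;
            *)       printf '%s\n' "$line" ;;
        esac
    done
}
```

Usage would be `linewise_loop < file`. It starts a new process for every matching line, so it is only reasonable when matching lines are few or the side effects of blackbox are the point.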

Question 4

@EdMorton Thanks for your help and advice on improving the question—I wasn't really sure how far to generalize it, and I think I erred too far on the general side, i.e., "write me an encyclopedia entry for this type of problem." I do want to label the expected output more clearly, but otherwise I'll leave it as is.

Question 5

Here's one way to do the first thing you asked for (Block-wise), assuming blackbox is nl:

$ cat tst.sh
#!/usr/bin/env bash
awk '
 BEGIN { cmd = "nl" }
 /=TAG=/ {
 print | cmd
 next
 }
 {
 close(cmd)
 print
 }
' "${@:--}"

$ ./tst.sh file
in this file
 1 this line contains =TAG=
 2 so does =TAG= this one
 3 as =TAG= does this other line
 this line does not
nor does this line
 1 =TAG= here again
 gone again

If you want to do it 1 line at a time as in your 2nd request (Line-wise), just move the close():

$ cat tst.sh
#!/usr/bin/env bash
awk '
 BEGIN { cmd = "nl" }
 /=TAG=/ {
 print | cmd
 close(cmd)
 next
 }
 {
 print
 }
' "${@:--}"

$ ./tst.sh file
in this file
 1 this line contains =TAG=
 1 so does =TAG= this one
 1 as =TAG= does this other line
 this line does not
nor does this line
 1 =TAG= here again
 gone again

GNU awk also supports Coprocesses if you need more control over where the output of the external command appears in the overall output or it needs any other additional processing before printing it.

And talking of which... here's a way to implement your 3rd request, Continuously, using GNU awk for coprocesses:

$ cat tst.sh
#!/usr/bin/env bash
awk '
 BEGIN { cmd = "nl" }
 { lines[NR] = $0 }
 /=TAG=/ {
 print |& cmd
 tbds[NR]
 }
 END {
 close(cmd, "to")
 for ( i=1; i<=NR; i++ ) {
 if ( i in tbds ) {
 if ( (cmd |& getline line) > 0 ) {
 lines[i] = line
 }
 }
 print lines[i]
 }
 close(cmd)
 }
' "${@:--}"

$ ./tst.sh file
in this file
 1 this line contains =TAG=
 2 so does =TAG= this one
 3 as =TAG= does this other line
 this line does not
nor does this line
 4 =TAG= here again
 gone again

That assumes your input fits in memory. If it doesn't, you can do the same thing with a 2-pass approach, reading the input twice instead of storing it in the lines[] array.
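
A minimal sketch of such a 2-pass approach, under these assumptions: the input is a regular file that can be read twice, the command prints exactly one line of output per line of input, and `blackbox` here is a hypothetical stand-in function that numbers its input (similar in spirit to nl). The name `two_pass` is also made up for the illustration.

```shell
# Stand-in for the real command: prefix each input line with a running count.
blackbox() {
    awk '{ print NR ": " $0 }'
}

# Two-pass "continuous" sketch: awk never holds the whole input in memory.
two_pass() {
    file=$1
    tmp=$(mktemp) || return 1
    # Pass 1: send every matching line through a single blackbox process.
    grep -e '=TAG=' -- "$file" | blackbox > "$tmp"
    # Pass 2: replace each matching line with the next line of saved output.
    tmp="$tmp" awk '
        BEGIN { tmp = ENVIRON["tmp"] }
        /=TAG=/ { if ( (getline line < tmp) > 0 ) $0 = line }
        { print }
    ' "$file"
    rm -f "$tmp"
}
```

Called as `two_pass file`, the numbering continues across blocks because all matching lines go to one process. It still writes the command's output to a temp file, so the memory saving applies to awk, not to disk.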

The coprocess script above might break if the pipe fills up (thanks @Stéphane Chazelas for pointing that out). If your system has stdbuf, you can do the following to make nl output line-buffered:

$ cat tst.sh
#!/usr/bin/env bash
awk '
 BEGIN { cmd = "stdbuf -oL nl" }
 /=TAG=/ {
 print |& cmd
 if ( (cmd |& getline line) > 0 ) {
 $0 = line
 }
 }
 { print }
' "${@:--}"

$ ./tst.sh file
in this file
 1 this line contains =TAG=
 2 so does =TAG= this one
 3 as =TAG= does this other line
 this line does not
nor does this line
 4 =TAG= here again
 gone again

With a non-GNU awk you can write to a temp file instead of piping to a process and then read from the tempfile instead of reading from the process, e.g.:

$ cat tst.sh
#!/usr/bin/env bash
tmp=$(mktemp) || exit 1
trap 'rm -f "$tmp"; exit' EXIT
tmp="$tmp" awk '
 BEGIN {
 tmp = ENVIRON["tmp"]
 cmd = "nl > \047" tmp "\047"
 }
 { lines[NR] = $0 }
 /=TAG=/ {
 print | cmd
 tbds[NR]
 }
 END {
 close(cmd)
 for ( i=1; i<=NR; i++ ) {
 if ( i in tbds ) {
 if ( (getline line < tmp) > 0 ) {
 lines[i] = line
 }
 }
 print lines[i]
 }
 }
' "${@:--}"

$ ./tst.sh file
in this file
 1 this line contains =TAG=
 2 so does =TAG= this one
 3 as =TAG= does this other line
 this line does not
nor does this line
 4 =TAG= here again
 gone again

or, again, use stdbuf in the same way as we do for the coprocess solution.

If you're ever considering using getline make sure to read http://awk.freeshell.org/AllAboutGetline first.

Question 6

Your co-process approach will deadlock when the pipes get full as you're only getting the output in the end. That and the tmp-file alternative also assume that cmd produces one line of output for each line of input.

Question 7

Note that -v tmp="$tmp" assumes $TMPDIR doesn't contain backslash characters. \047 assumes an ASCII-based system.

Question 8

@StéphaneChazelas thanks. I fixed the tmp name issue, and I'm OK with leaving the \047 for ASCII ', I expect anyone on a non-ASCII system can figure out the equivalent or another way to ensure the shell doesn't interpret the tmp file name if there's an issue. As for the co-process filling up the pipe - fair enough, I've added a fix for that after reading your perl script. And yes, I'm assuming 1 line of output from the command for 1 line of input.

Question 9

Yeah, this rules! For the continuous one, the way I imagined it in the case where k lines map to k'>k output lines, we tell the script: "while you process these k lines, if blackbox prints anything now, put those lines here." So for example if blackbox were the awk script { print $0; print $0 }, the continuous variant would be identical to the block-wise variant. If it were sort, all the matching lines would go at the end.

Question 10

Are you saying that you'd WANT all the lines to go at the end? Actually, I see you accepted an answer to this question so please just ask a new question with an example demonstrating your requirements.

Question 11

With perl:

$ perl -ne '
 if (/=TAG=/) {
 STDOUT->flush;
 open $cmd, "|-", "nl" unless $cmd;
 print $cmd $_;
 } else {
 close $cmd;
 undef $cmd;
 print;
 }' file
in this file
 1 this line contains =TAG=
 2 so does =TAG= this one
 3 as =TAG= does this other line
 this line does not
nor does this line
 1 =TAG= here again
 gone again

$ perl -ne '
 if (/=TAG=/) {
 STDOUT->flush;
 open $cmd, "|-", "nl";
 print $cmd $_;
 close $cmd;
 } else {
 print;
 }' file
in this file
 1 this line contains =TAG=
 1 so does =TAG= this one
 1 as =TAG= does this other line
 this line does not
nor does this line
 1 =TAG= here again
 gone again

$ perl -MTime::HiRes=usleep -ne '
 if (/=TAG=/) {
 STDOUT->flush;
 unless ($cmd) {
 open $cmd, "|-", "stdbuf", "-oL", "nl";
 $cmd->autoflush;
 }
 print $cmd $_;
 usleep(10000)
 } else {print}' file
in this file
 1 this line contains =TAG=
 2 so does =TAG= this one
 3 as =TAG= does this other line
 this line does not
nor does this line
 4 =TAG= here again
 gone again

The latter assumes the command does not buffer its output (for nl, we ask for line-buffered output instead, using stdbuf -oL as found on GNU or FreeBSD systems), and that, upon reading a line of input, it produces the corresponding line(s) of output within the next 10,000 microseconds (0.01 second).

Here the command is hardcoded as "nl" or "stdbuf", "-oL", "nl" (which you could also write qw(stdbuf -oL nl)) inside the perl code. For a command and its argument to be passed as extra arguments to perl instead:

perl -ne '
 BEGIN {@cmd = @ARGV; @ARGV = ()}
 # rest of the code where the open call would become:
 open $cmd, "|-", @cmd
 ' stdbuf -oL nl < file

Instead of a simple command, you can make it any shell code, either by using the code above with sh -c 'any shell code' as the simple command, or by using the open $cmd, "|any shell code" syntax [1].

[1] Actually, in modern versions of perl, perl will skip executing a shell if the command is simple enough.

Question 12

These work great, including the expected behavior if I replace nl with sort. I'm not familiar with perl—how do I set the variable $cmd?

Question 13

@wobtax, the one you'd expect as hinted by the line(s). close $cmd waits for cmd to finish and that usleep(10000) is meant to give enough time for cmd to output what it has to output, you'd need to adjust if cmd may take longer.

Question 14

Oops, I just deleted my comment because I answered my own question by trying it out. (The question was what it would do if the executable wrote more than one line per line.) Yep, works as expected!

Ed Morton · Accepted Answer · 2024-11-22 00:27:11Z

