Regex pattern with grep for expression2 AFTER expression1

Question 1

I am trying to find which of a set of HTML files contain a heading with the word Agent, followed anywhere later in the file by the name of a particular agent.

So typically something like

<h3>Agent</h3>
<p>Blah blah blah </p>
<p>Their agent is XYZ Corp.</p>

should be found

But I can't guarantee any regularity in the markup or content between the heading and the instance of XYZ Corp. So in something like DOS I might search for 'Agent*XYZ', meaning

- match the string 'Agent'
- followed by anything
- followed by the string 'XYZ'

How do I write that in grep on Ubuntu? I have tried

grep -lc 'Agent*XYZ' *.html
grep -lc 'Agent.*?XYZ' *.html

both without success. I can find the pattern manually in more than one of the files, so I know it exists.

TIA

Question 2

Possibly related: How can I "grep" patterns across multiple lines?

score 0 · Answer 1 · 2017-01-31 13:19:58Z

Something like this should work for your case:

$ cat d2.txt
<h3>Agent</h3>
<p>Blah blah blah </p>
<p>Their agent is XYZ Corp.</p>
$ grep -i 'agent' d2.txt   # -i ignores case; prints every line that contains "agent"
<h3>Agent</h3>
<p>Their agent is XYZ Corp.</p>
$ grep -iE 'agent.*XYZ' d2.txt   # "agent" followed later on the same line by "XYZ"
<p>Their agent is XYZ Corp.</p>
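
Note that grep works line by line, so the pattern above only matches when both strings sit on the same line. For the multi-line case in the question, here is a minimal sketch, assuming GNU grep built with PCRE support (the -P option). The sample file names are made up for illustration.

```shell
# Work in a scratch directory so the sample files do not clutter anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample files: one has XYZ after the heading, one only before it.
printf '%s\n' '<h3>Agent</h3>' '<p>Blah blah blah </p>' '<p>Their agent is XYZ Corp.</p>' > sample_match.html
printf '%s\n' '<p>XYZ Corp.</p>' '<h3>Agent</h3>' '<p>Nobody yet</p>' > sample_nomatch.html

# -z    read each file as one NUL-terminated record, so newlines are ordinary characters
# -P    use Perl-compatible regexes; (?s) lets . match newlines as well
# -l    print only the names of files that match
matches=$(grep -lzP '(?s)<h3>Agent</h3>.*XYZ' sample_match.html sample_nomatch.html)
echo "$matches"
```

Against real data you would replace the two sample names with *.html. Dropping -c from the original attempts is deliberate: -l already reports the file names, and a count adds nothing to that.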

score 0 · Answer 2 · 2017-01-31 21:11:53Z

Assuming the h3 heading always occurs on a line separate from the name of the agent, sed seems to be able to do what you ask.

Given the input file

some data
at the top
<h3>Agent</h3>
<p>Blah blah blah </p>
<p>Their agent is XYZ Corp.</p>
some data
at the bottom

the command

sed -n '\#<h3>Agent</h3>#,/XYZ/p' input.html

will generate

<h3>Agent</h3>
<p>Blah blah blah </p>
<p>Their agent is XYZ Corp.</p>

The sed command prints every line from the first line matching <h3>Agent</h3> through the next line matching XYZ, inclusive. The odd-looking \#...# around the first regular expression is sed's syntax for a custom delimiter; I used it to avoid escaping the / in the pattern.
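
To get a list of matching files rather than the matching lines, the same range can be wrapped in a loop. This is only a sketch under a few assumptions: the sample file names are invented, and because sed prints through to the end of the file when XYZ never appears after the heading, the output is piped through grep -q to confirm that XYZ really is in the range.

```shell
# Work in a scratch directory with two hypothetical sample files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'some data' '<h3>Agent</h3>' '<p>Blah blah blah </p>' '<p>Their agent is XYZ Corp.</p>' > with_agent.html
printf '%s\n' '<p>XYZ Corp.</p>' '<h3>Agent</h3>' '<p>Nobody yet</p>' > without_agent.html

# For each file, print the heading-to-XYZ range and test whether XYZ occurs in it.
found=$(
    for f in with_agent.html without_agent.html; do
        if sed -n '\#<h3>Agent</h3>#,/XYZ/p' "$f" | grep -q 'XYZ'; then
            echo "$f"
        fi
    done
)
echo "$found"
```

In real use the loop would iterate over *.html instead of the two sample names.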
