XML context grepping

Question 1

Below are the contents of my file:

<A>
<number>100</number>
<name>Word1</name>
</A>
<A>
<number>101</number>
<name> Word2</name>
</A>

If I grep for Word1, I would like to see the following output:

<A>
<number>100</number>
<name>Word1</name>
</A>

If I grep for Word2, I would like to see the following output:

<A>
<number>101</number>
<name>Word2</name>
</A>

Could someone help with this, please?

Question 2

What did you try?

Question 3

Will it always be formatted so you can just keep 2 lines before and 1 after? If not, are there other constraints on how it is formatted? If not, you'll probably have to use an XML parser. But if so, then this can likely be done with grep or a related tool. Also, are you limited to using features of grep that are required by POSIX, or are you willing to use extended features provided by your grep implementation? If you're willing to use extended features then what OS is this and what does grep -V show? Please edit with details.

Question 4

Have a look at xmlgrep.

Question 5

I tried the command below, which gives me the output I'm expecting, but I need something more robust that keeps working even if the format of the content changes: grep -A1 -B2 Word1 Filename
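The fixed -B2/-A1 context only works while every record has exactly two lines before the match and one after. A minimal sketch of a looser, still line-oriented approach with awk follows. It assumes that each <A> and </A> tag sits on a line of its own (as in the sample) but allows any number of lines in between. The helper name awk_block and the temporary sample file are made up for illustration, and this is not an XML parser: CDATA sections, comments or nested <A> elements would confuse it.

```shell
#!/bin/sh
# Sketch only. awk_block is a hypothetical helper name.
# Assumption: <A> and </A> each appear alone on their own line.
awk_block() {
    # $1 = literal word to look for, $2 = file to search
    # (awk -v interprets backslash escapes in the word)
    awk -v word="$1" '
        /^[ \t]*<A>[ \t]*$/ { block = ""; inside = 1 }   # start a new record
        inside              { block = block $0 "\n" }    # collect its lines
        /^[ \t]*<\/A>[ \t]*$/ {
            # index() does a plain substring search, not a regex match
            if (inside && index(block, word)) printf "%s", block
            inside = 0
        }
    ' "$2"
}

# Recreate the sample from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<A>
<number>100</number>
<name>Word1</name>
</A>
<A>
<number>101</number>
<name> Word2</name>
</A>
EOF

awk_block Word1 "$sample"
awk_block Word2 "$sample"
```

Because the search is a substring test over the whole record, the leading space in front of Word2 in the sample does not prevent a match.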

Question 6

↑ That's good information. Please add it to your question.

Question 7

If this is part of a well-formed XML document, you can extract the required part with an XML parser.

To satisfy the well-formedness requirement, I've wrapped your XML fragment in <root> and </root>.

xmlstarlet sel -t -c '//A[name="Word1"]' -n file.xml

If you cannot change the file itself, you can add the wrapper on the fly:

( echo '<root>'; cat file.xml; echo '</root>' ) | xmlstarlet sel -t -c '//A[name="Word1"]' -n

In either case, the output is this:

<A>
<number>100</number>
<name>Word1</name>
</A>

Question 8

Note that it wouldn't work for Word2 because of the extra space before it in the sample. Note that it still outputs an empty line if there's no match. xmlstarlet sel -t -m '//A[name="Word1"]' -c . -n or xmlstarlet sel -t -m '//A[contains(name,"Word2")]' -c . -n may be better.

Question 9

@StéphaneChazelas I was hoping that leading space was a typo. (Yes, yes I know...) BTW why -m '//...contains()' -c . rather than -c '//...contains()'?

Question 10

See also xmlstarlet sel -t -m '//name[contains(.,"Word2")]' -c .. -n for the parent of name nodes that contain Word2.

Question 11

-c //... -n outputs an empty line if there's no match. -m //... -c . -n doesn't.
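Putting the pieces from this thread together, the explicit <root> wrapper and the -m ... -c . form that prints nothing when there is no match, could look roughly like the sketch below. The helper name xml_block is hypothetical, and the search word is pasted straight into the XPath string, so this assumes the word contains no double quote.

```shell
#!/bin/sh
# Sketch only. xml_block is a hypothetical helper name; it assumes
# xmlstarlet is installed under that command name.
xml_block() {
    # $1 = text to look for inside <name>, $2 = file holding the <A> fragments
    # The fragments are wrapped in <root>...</root> so the parser sees
    # a single top-level element.
    { echo '<root>'; cat "$2"; echo '</root>'; } |
        xmlstarlet sel -t -m "//A[contains(name,\"$1\")]" -c . -n
}

# Usage, with file.xml as in the answer above:
#   xml_block Word2 file.xml
```

Using contains() rather than name="..." means the leading space before Word2 in the sample does not matter, at the cost of also matching longer names such as Word20.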

Question 12

With pcregrep:

<file.xml pcregrep -Mo '(?s)<A>(?:.(?!</A>))*Word1.*?</A>'

With GNU grep:

<file.xml grep -zPo '(?s)<A>(?:.(?!</A>))*Word1.*?</A>' | tr '\0' '\n'

(though that means the whole file is loaded in memory and assumes it doesn't contain NUL bytes).

Some PCRE operators:

(?s) turns on the s flag (. matches even line delimiters)
.(?!</A>) any character provided it's not followed by </A>.
.*? non-greedy version of .*
(?:...) just grouping.
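As a rough illustration, the GNU grep variant can be wrapped in a small function so that the word becomes a parameter. This is only a sketch: it assumes GNU grep built with PCRE support, the helper name grep_block and the temporary sample file are made up, and the \Q...\E quoting (which makes PCRE treat the word literally) is an addition that is not part of the command above.

```shell
#!/bin/sh
# Sketch only: needs GNU grep with -P (PCRE) support.
# grep_block is a hypothetical helper name.
grep_block() {
    # $1 = literal word to look for, $2 = file to search
    # -z reads the whole file as one record, so . can cross line breaks.
    # \Q...\E quotes the word literally (breaks if the word contains \E).
    grep -zPo "(?s)<A>(?:.(?!</A>))*\Q$1\E.*?</A>" "$2" | tr '\0' '\n'
}

# Recreate the sample from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<A>
<number>100</number>
<name>Word1</name>
</A>
<A>
<number>101</number>
<name> Word2</name>
</A>
EOF

grep_block Word1 "$sample"
grep_block Word2 "$sample"
```

The negative lookahead is what stops a search for Word2 from starting at the first <A>: the match cannot run past the first </A>, so the regex engine has to restart at the second <A>.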

It's fooled by things like <![CDATA[</A>]]>, and it wouldn't find a Word2 written as <![CDATA[W]]>ord2 or as &#87;ord2; for those you'd need an XML parser. But then an XML parser would need valid XML input, which your sample is not unless you enclose it in a top-level element. It would also need to read the file in full (but then again that's generally your lot when working with that format) and would potentially transform the content (expanding the <![CDATA[...]]> sections and some &...; sequences). And an XPath expression would make it difficult to find those Word1 anywhere, including in comments, XML tags or attributes.

Accepted answer (the xmlstarlet answer under Question 7 above): Chris Davies, answered 2017-11-21 12:07:31Z

Comments on the accepted answer (shown above as Questions 8 to 11):

Question 8: Stéphane Chazelas, Nov 21, 2017 at 17:01
Question 9: Chris Davies, Nov 21, 2017 at 17:06
Question 10: Stéphane Chazelas, Nov 21, 2017 at 17:12
Question 11: Stéphane Chazelas, Nov 22, 2017 at 8:32
