Multiline pattern match using sed, awk or grep [duplicate]

Question 1

Is it possible to do a multiline pattern match using sed, awk or grep? For example, I would like to get all the lines between { and }.

So it should be able to match

 1. {}
 2. {.....}
 3. {.....

 .....}

Initially the question used <p> as an example. I have edited the question to use { and } instead.

Question 2

afaik, you can do it with perl regex but not with sed/awk/grep.

Question 3

@forcefsck> You can do multiline pattern matching with 'sed' and 'awk', but in both cases you need more than a single command...

Question 4

don't ask like "is it possible to use sed to do ...." you can use sed to do anything within the area of text processing. LOL

Question 5

@CiroSantilli - there's nothing wrong with a similar Q showing up on the various SE sites, only if the original poster posted the identical Q on multiple sites.

Question 6

@sim I did not mean to imply that =)

Question 7

While I agree with the advice above, that you'll want to get a parser for anything more than tiny or completely ad-hoc, it is (barely ;-) possible to match multi-line blocks between curly braces with sed.

Here's a debugging version of the sed code

sed -n '/[{]/,/[}]/{
 p
 /[}]/a\
 end of block matching brace
 }' *.txt

Some notes,

-n means 'no default print lines as processed'.
'p' means now print the line.
The construct /[{]/,/[}]/ is a range expression (an address range). sed skips lines until one matches the first pattern (/[{]/), then applies the commands between the { } in the sed code to that line and to every following line, up to and including the next line that matches the 2nd pattern (/[}]/). In this case the commands are 'p' and the debugging code (not explained here; use it, mod it or take it out as works best for you).
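
A minimal way to watch the range expression at work, assuming a small made-up input (the `sample` text and the variable names are illustrative, not from the original post):

```shell
# Hypothetical sample input: one multi-line block surrounded by other text.
sample='before
{ first line
  second line }
after'

# Print only the lines from the first one matching "{" through the
# next one matching "}"; everything outside the range is dropped by -n.
out=$(printf '%s\n' "$sample" | sed -n '/[{]/,/[}]/p')
printf '%s\n' "$out"
```

With this input, only the two lines of the block should be printed; "before" and "after" fall outside the range.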

You can remove the /[}]/a\ end of block debugging when you prove to your satisfaction that the code is really matching blocks delimited by {,}.

This code sample will skip over anything not inside a curly brace pair. It will, as noted by others above, be easily confused if you have any extra {,} embedded in strings, reg-exps, etc., OR where the closing brace is on the same line as the opening brace (with thanks to fred.bear).

I hope this helps.

Question 8

> When a range expression matches the first pattern, it will only start searching for a match to the second pattern after it has finished processing that current line... This means that if { and } are on the same line, things are going to get messy... here is a test script which shows it:

div==========; echo $div in; text="fred\n{block 1}\nbetty\n{block 2 line 1\n block 2 line 2}\nbarney"; echo -e "$text\n$div out"; echo -e "$text" |sed -n '/[{]/,/[}]/{'$'\n''p'$'\n''/[}]/a\'$'\n''end of block matching brace'$'\n''}'; echo "$div"

... The "betty" line shouldn't be there.
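
One possible way around the same-line problem, sketched here as an assumption rather than as part of the original answer, is to replace the range with an explicit loop: on a line containing {, keep appending lines with N until the pattern space also contains }, then print. It reuses the sample text from the test script above and still does not handle nested braces:

```shell
# Same sample text as the test script above, fed through printf for portability.
text='fred
{block 1}
betty
{block 2 line 1
 block 2 line 2}
barney'

# For a line containing "{": while the pattern space has no "}",
# append the next input line (N) and loop back to label a; then print.
out=$(printf '%s\n' "$text" | sed -n '/[{]/{
:a
/[}]/!{
N
ba
}
p
}')
printf '%s\n' "$out"
```

Because the check for } happens on the same line that matched {, "{block 1}" closes immediately and "betty" is no longer swept into the output.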

Question 9

@fred.bear :You're definitely right. I have extended my cautionary last paragraph to mention this. Thanks!

Question 10

Do you have a link to docs explaining this "range expression"? Because I cannot find it anywhere, I only get stuff about e.g [1-9] type "range expressions" when searching for that term. EDIT: Ahh maybe here? pement.org/sed/sedfaq4.html#s4.23.1

Question 11

You can use the -M (multiline) option for pcregrep:

pcregrep -M '\{(\s*.*\s*)*\}' test.txt

\s is whitespace (including newlines), so this matches zero or more occurrences of (whitespace followed by .* followed by whitespace), all enclosed in braces.

Update:

This should do the non-greedy matching:

pcregrep -n -M '\{(\n*.*?\n*)*?\}' test.txt

Question 12

> It seems like a handy tool... Yes, it is being greedy... Can you show how to invert the greedy nature? ... and I noticed in my Ubuntu man pcregrep: ...8K characters are available for forward matching, and 8K for previous matching...

Question 13

Adding a ? after a quantifier makes it non-greedy. (asdf)* is greedy, and (asdf)*? is non greedy.
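
POSIX regular expressions, as used by sed, awk and grep -E, have no *? quantifier. When the closing delimiter is a single character, a negated bracket expression gives a similar "stop at the first closer" effect. A small sketch, assuming a grep that supports -o (GNU and BSD grep do; the sample line and variable names are made up):

```shell
line='a {x} b {y} c'

# Greedy: .* runs to the last "}" on the line, so both blocks come back as one match.
greedy=$(printf '%s\n' "$line" | grep -oE '[{].*[}]')

# Negated class: [^}]* cannot cross a "}", so each block is matched separately.
per_block=$(printf '%s\n' "$line" | grep -oE '[{][^}]*[}]')

printf 'greedy: %s\n' "$greedy"
printf 'per block:\n%s\n' "$per_block"
```

This only works within a single line and only because } is one character; for multi-character closers or multi-line input, a PCRE tool such as pcregrep is the simpler route.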

Question 14

> Thanks.. It's brilliant... It works as "advertised" and with (optional) line numbers! :)

Question 15

Thanks for mentioning pcregrep! This was the only tool with which I succeeded in eliminating arbitrary multiline patterns from multiline input strings (with pcregrep -v -M -F -- "$pattern").

Question 16

XML-like expressions (infinitely recursive tags) are not a 'regular language' and therefore cannot be parsed with regular expressions (regex). Here's why:

https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/

http://www.perlmonks.org/?node_id=668353

https://stackoverflow.com/questions/1379524/textual-protocol-which-is-not-a-regular-language

Question 17

FYI, I used <p></p> just as an example. Let's change this to a C block {,}. Will I still be able to do pattern matching to extract a C block, which can span multiple lines?

Question 18

@johnsamuel: One problem is that unless you can fully parse a particular language, you can't tell if "{" (for example) is part of a comment or a quoted literal or is actually the start of your "block"... It only takes one misinterpretation to upset everything

Question 19

(1) The slogan about non-regular languages concerns a technical notion of regular expressions that is more limited than most regex engines now in use. (2) The question was about what can be done with sed or awk, not with their regex engines specifically. And these languages are Turing complete. I'm not saying that writing anything beyond a trivial parser in them is going to be pretty or efficient.
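
As a rough illustration of awk going beyond what a single regex can do, here is a sketch that tracks nesting depth by counting braces per line. The sample input and the variable names (`depth`, `inside`, `opens`, `closes`) are made up for illustration, and braces inside strings or comments are counted too, so this is no substitute for a real parser:

```shell
# Hypothetical C-like input with one nested block.
src='int f() {
  if (x) { y(); }
}
int g;'

out=$(printf '%s\n' "$src" | awk '{
    inside = (depth > 0)            # were we inside a block before this line?
    opens  = gsub(/[{]/, "&")       # count "{" on this line (text unchanged)
    closes = gsub(/[}]/, "&")       # count "}" on this line
    if (inside || opens) print      # line belongs to, or starts, a block
    depth += opens - closes
    if (depth < 0) depth = 0        # ignore stray closers
}')
printf '%s\n' "$out"
```

With this input the three lines of f() are printed and "int g;" is skipped, because the depth returns to zero only after the outer closing brace.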

Question 20

parser.awk:

#!/usr/bin/awk -f
function die(msg) { print msg > "/dev/stderr"; exit 1 }
BEGIN {
    # opener, closer and mode are supplied on the command line with -v
    FS=opener
    if (mode=="l") linewise=1
    else if (mode=="i") trim_closer=length(closer)
    else if (mode!="a") die("mode must be one of: l,i,a")
}
{
    live=level
    for (f=1; f<=NF; f++) {
        # every field after the first was preceded by an opener
        if (f>1) {
            live=++level
            if (mode=="i" && level>1 || mode=="a") printf "%s", opener
        }
        cur=$f
        level-=gsub(closer, "", cur)
        if (level<0) die("Unbalanced")
        if (!linewise) {
            cur=$f
            if (sub(".*" closer, "", cur)) printf "%s",
                substr($f, 1, length($f) - length(cur) - (level ? 0 : trim_closer))
            else if (live) printf "%s", $f
        }
    }
    if (live) {
        if (linewise) print
        else print ""
    }
}
END { if (level>0) die("Unbalanced") }

Call as awk -v'opener={' -v'closer=}' -v'mode=a' -f parser.awk. If mode is a, it prints the brackets and contents of all outermost, balanced {...}; if mode is i, it prints only their contents; if mode is l, it prints complete lines where an outermost {...} begins, remains open, or closes.

Question 21

Regular expressions cannot find matching nested parentheses.

If you are certain that there will be no pair of parentheses nested inside the one you are searching, you can search until the first closing one. For example:

sed -r 's#\{([^}])\}#\1#'

This will replace all the text from '{' to '}' with what's between them.

Question 22

> s#\{([^}])\}#\1# will only match a single non-} char... It needs a zero-to-many * wildcard after the closing square bracket ]*... and also, sed always operates only on a single line, unless you do some buffer hold of the line and sub-process subsequent lines until you find a matching '}'
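
The correction described in this comment can be sketched as follows, assuming a sed that accepts -E for extended regular expressions (GNU and BSD sed do; the original answer used the GNU-specific -r). The g flag is an addition here so that every pair on the line is handled, and the sample input is made up:

```shell
# [^}]* matches zero or more non-"}" characters, so whole {...} groups
# on one line are replaced by their contents.
out=$(printf '%s\n' 'a {b} c {dd} e' | sed -E 's#[{]([^}]*)[}]#\1#g')
printf '%s\n' "$out"
```

This still works one line at a time, so a block whose closing brace is on a later line is left untouched.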

Accepted answer: the sed range-expression answer above, by shellter (2011-03-28 22:01:10Z).

