Is it possible to do a multiline pattern match using sed, awk or grep? For example, I would like to get all the lines between { and }, so it should be able to match:
1. {}
2. {.....}
3. {.....
.....}
Initially the question used <p> as an example. I have edited it to use { and } instead.
- afaik, you can do it with perl regex but not with sed/awk/grep. – forcefsck, Mar 28, 2011 at 11:19
- @forcefsck: You can do multiline pattern matching with 'sed' and 'awk', but in both cases you need more than a single command... – Peter.O, Mar 29, 2011 at 15:03
- Don't ask "is it possible to use sed to do ....": you can use sed to do anything within the area of text processing. LOL – zinking, Oct 6, 2013 at 3:13
- @CiroSantilli - there's nothing wrong with a similar Q showing up on the various SE sites, only if the original poster posted the identical Q on multiple sites. – slm ♦, Sep 16, 2014 at 1:34
- @sim I did not mean to imply that =) – Ciro Santilli OurBigBook.com, Sep 16, 2014 at 6:28
5 Answers
While I agree with the advice above, that you'll want to get a parser for anything more than tiny or completely ad-hoc, it is (barely ;-) possible to match multi-line blocks between curly braces with sed.
Here's a debugging version of the sed code
sed -n '/[{]/,/[}]/{
p
/[}]/a\
end of block matching brace
}' *.txt
Some notes:
- -n means 'do not print lines by default as they are processed'.
- 'p' means print the current line.
- The construct /[{]/,/[}]/ is a range expression. It means: scan until a line matches the first pattern (/[{]/), then keep going until a line matches the second pattern (/[}]/), and perform whatever actions you find between the { } in the sed code on every line in that range. In this case those are 'p' and the debugging code (not explained here; use it, modify it, or take it out as works best for you).
You can remove the /[}]/a\ end of block debugging when you prove to your satisfaction that the code is really matching blocks delimited by {,}.
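With the debugging lines removed, the range expression reduces to a one-liner. The following is only a minimal sketch on made-up input; the `sample` variable and the `extract_blocks` function name are illustrative and not part of the original answer.

```shell
# Hypothetical sample input: one brace block spanning two lines.
sample='fred
{ line 1
  line 2 }
barney'

# Print only the lines from a line matching { through the next line matching }.
extract_blocks() {
  sed -n '/[{]/,/[}]/p'
}

printf '%s\n' "$sample" | extract_blocks
```

On this input only the two lines of the block should be printed; "fred" and "barney" fall outside the range.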
This code sample will skip over anything not inside a curly brace pair. It will, as noted by others above, be easily confused if you have any extra {,} embedded in strings, regexps, etc., OR where the closing brace is on the same line as the opening one (with thanks to fred.bear).
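The same-line caveat can be reproduced on a small input. The sketch below first shows the plain range expression swallowing an unrelated line, then one possible workaround that treats single-line blocks separately. The function names and the sample text are assumptions for illustration, and the workaround still assumes no nested braces and no braces inside strings or comments.

```shell
# Hypothetical sample: one block closed on its own line, one spanning two lines.
text='fred
{block 1}
betty
{block 2 line 1
 block 2 line 2}
barney'

# Plain range: the end pattern is only tested from the line AFTER the start
# line, so "{block 1}" opens a range that runs on to the next line with a }.
blocks_naive() {
  sed -n '/[{]/,/[}]/p'
}

# Possible workaround: start a range only when a { is left unclosed at the
# end of the line, and print single-line {...} blocks with a separate command.
blocks_same_line() {
  sed -n -e '/[{][^}]*$/,/[}]/{p;d;}' -e '/[{].*[}]/p'
}

printf '%s\n' "$text" | blocks_naive      # includes the stray "betty" line
printf '%s\n' "$text" | blocks_same_line  # "betty" is no longer printed
```

The `d` inside the range keeps a line from being printed a second time by the single-line command.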
I hope this helps.
- When a range expression matches the first pattern, it will only start searching for a match to the second pattern after it has finished processing that current line... This means that if { and } are on the same line, things are going to get messy... here is a test script which shows it:
div==========; echo $div in; text="fred\n{block 1}\nbetty\n{block 2 line 1\n block 2 line 2}\nbarney"; echo -e "$text\n$div out"; echo -e "$text" |sed -n '/[{]/,/[}]/{'$'\n''p'$'\n''/[}]/a\'$'\n''end of block matching brace'$'\n''}'; echo "$div"
... The "betty" line shouldn't be there. – Peter.O, Mar 29, 2011 at 13:54
- @fred.bear: You're definitely right. I have extended my cautionary last paragraph to mention this. Thanks! – shellter, Mar 29, 2011 at 17:13
- Do you have a link to docs explaining this "range expression"? Because I cannot find it anywhere, I only get stuff about e.g. [1-9] type "range expressions" when searching for that term. EDIT: Ahh maybe here? pement.org/sed/sedfaq4.html#s4.23.1 – Ben Farmer, Nov 12, 2023 at 22:40
You can use the -M (multiline) option for pcregrep:
pcregrep -M '\{(\s*.*\s*)*\}' test.txt
\s is whitespace (including newlines), so this matches zero or more occurrences of (whitespace followed by .* followed by whitespace), all enclosed in braces.
Update:
This should do the non-greedy matching:
pcregrep -n -M '\{(\n*.*?\n*)*?\}' test.txt
- It seems like a handy tool... Yes, it is being greedy... Can you show how to invert the greedy nature? ... and I noticed in my Ubuntu man pcregrep: ...8K characters are available for forward matching, and 8K for previous matching... – Peter.O, Mar 29, 2011 at 14:33
- Adding a ? after a quantifier makes it non-greedy. (asdf)* is greedy, and (asdf)*? is non-greedy. – Cooper, Mar 30, 2011 at 13:17
- Thanks.. It's brilliant... It works as "advertised" and with (optional) line numbers! :) – Peter.O, Mar 30, 2011 at 14:31
- Thanks for mentioning pcregrep! This was the only tool with which I succeeded in eliminating arbitrary multiline patterns in multiline input strings (with pcregrep -v -M -F -- "$pattern"). – stefanct, Mar 2, 2016 at 23:19
XML-like expressions (infinitely recursive tags) are not a 'regular language' and therefore cannot be parsed with regular expressions (regex). Here's why:
https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/
http://www.perlmonks.org/?node_id=668353
https://stackoverflow.com/questions/1379524/textual-protocol-which-is-not-a-regular-language
- FYI, I have used <p></p> just as an example. Let's change this to a C block {,}. Will I still be able to do a pattern match to extract a C block, which can span multiple lines? – user6106, Mar 28, 2011 at 12:38
- @johnsamuel: One problem is that unless you can fully parse a particular language, you can't tell if "{" (for example) is part of a comment or a quoted literal or is actually the start of your "block"... It only takes one misinterpretation to upset everything. – Peter.O, Mar 28, 2011 at 13:08
- (1) The slogan about non-regular languages concerns a technical notion of regular expressions that is more limited than most regex engines now in use. (2) The question was about what can be done with sed or awk, not with their regex engines specifically. And these languages are Turing complete. I'm not saying that writing anything beyond a trivial parser in them is going to be pretty or efficient. – dubiousjim, Apr 19, 2012 at 21:20
parser.awk:
#!/usr/bin/awk -f
function die(msg) { print msg > "/dev/stderr"; exit 1 }
BEGIN {
    FS=opener
    if (mode=="l") linewise=1
    else if (mode=="i") trim_closer=length(closer)
    else if (mode!="a") die("mode must be one of: l,i,a")
}
{
    live=level
    for (f=1; f<=NF; f++) {
        if (f>1) {
            live=++level
            if (mode=="i" && level>1 || mode=="a") printf "%s", opener
        }
        cur=$f
        level-=gsub(closer, "", cur)
        if (level<0) die("Unbalanced")
        if (!linewise) {
            cur=$f
            if (sub(".*" closer, "", cur)) printf "%s",
                substr($f, 1, length($f) - length(cur) - (level ? 0 : trim_closer))
            else if (live) printf "%s", $f
        }
    }
    if (live) {
        if (linewise) print
        else print ""
    }
}
END { if (level>0) die("Unbalanced") }
Call as awk -v'opener={' -v'closer=}' -v'mode=a' -f parser.awk. If mode is a, it prints the brackets and contents of all outermost, balanced {...}; if mode is i, it prints only their contents; if mode is l, it prints complete lines where an outermost {...} begins, remains open, or closes.
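The core idea, tracking the nesting depth while scanning, can also be shown in a much smaller form. What follows is not parser.awk but a separate, simplified sketch that only covers something like mode a for single-character { and }; the `outer_blocks` function name and the sample input are made up for illustration.

```shell
# Simplified depth-counting sketch: print each outermost {...} block,
# braces included, one block per output record.
outer_blocks() {
  awk '
    {
      n = length($0)
      for (i = 1; i <= n; i++) {
        c = substr($0, i, 1)
        if (c == "{") depth++                 # entering a (possibly nested) block
        if (depth > 0) buf = buf c            # collect text only while inside a block
        if (c == "}" && depth > 0 && --depth == 0) {
          print buf                           # outermost block just closed
          buf = ""
        }
      }
      if (depth > 0) buf = buf "\n"           # block continues on the next line
    }
    END { if (depth > 0) { print "Unbalanced" > "/dev/stderr"; exit 1 } }
  '
}

printf '%s\n' 'a {b {c} d} e' '{x' 'y}' | outer_blocks
```

Unlike parser.awk, this sketch silently ignores a stray } that appears outside any block, and like any approach that does not parse the language it is fooled by braces in strings or comments.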
Regular expressions cannot find matching nested parentheses.
If you are certain that there will be no pair of parentheses nested inside the one you are searching, you can search until the first closing one. For example:
sed -r 's#\{([^}])\}#\1#'
This will replace all the text from '{' to '}' with what's between them.
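A minimal sketch of this single-line substitution, written with a `*` quantifier after the bracket expression so that any number of non-} characters can sit between the braces. The `strip_braces` name and the sample strings are illustrative assumptions; -E is used here as the portable spelling of -r.

```shell
# Replace each {...} on a line with its contents. [^}]* matches zero or more
# characters that are not a closing brace, so nested pairs are not handled.
strip_braces() {
  sed -E 's#\{([^}]*)\}#\1#g'
}

printf '%s\n' 'foo {bar} baz' | strip_braces

# sed processes one line at a time, so a block split across lines is untouched.
printf '%s\n' '{split' 'block}' | strip_braces
```

The second call illustrates why this form only helps when the opening and closing braces are on the same line.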
- s#\{([^}])\}#\1# will only match a single non-} char... It needs a zero-to-many * wildcard after the closing square bracket, i.e. ]* ... and also, sed always operates only on a single line, unless you do some buffer hold of the line and sub-process subsequent lines until you find a matching '}' – Peter.O, Mar 29, 2011 at 8:53