Regular Expression for parsing PHP annotations containing multiple lines and markup

Question 1

I am writing a regular expression for parsing PHP annotations in a flexible way. I'd like to accomplish all the goals listed below with a single expression, so I would appreciate any suggestions about its quality (in terms of corner cases, performance, best practices and correctness).

PCRE expression:

/[\*\s]*@(?P<name>\w+[\\\w]*?)(?:\s|\()
(?P<value>
 (?:[\/\w\s\"\<\>\_\#\=\-\.\'\{\}:;,\*\(\)\[\]]*[^\R\*\s\/\)])
)?
(?:\s | $|\))/gsxmu

Goals of the regular expression:

List all annotations with their values
Values can be multi-lined and have markup (html, json or markdown)
The initial space + * of each line should be removed from the value
Many annotations can be in the same line
Annotation names can be namespaced

Sample PHPDoc string:

/**
 * Description
 * 
 * @Tag name name @annotation beee @aaf dsfsd fgdg
 * @Tag name name {"json":"dfsf"}
 * @Tag asdasd <html> #markdown ==markdown== __markdown__
 * - markdown
 * > mark 1
.mark
"string"
'string'
 * @Annotation()
 * @Tag name name @annotation beee @aaf dsfsd fgdg <markdown> #markdown ==markdown==
 * @a() @b("name") @c()
 * @Annotation\Name("var1()", "var2") @n("name()_name")
 * @Annotation(["var1", "var2"], "var3")
 * @Annotation\Filter\Name(["var1", "var2"], "var3", {"var4": "var5"})
 * @Annotation(
 * ["GET", "POST"]
 ) @Name({"name": "Tomas"})
 * @Tag name name
 */

Expected result:

<?php
array(
 [
 'name' => 'Tag',
 'value' => 'name name'
 ],
 [
 'name' => 'annotation',
 'value' => 'beee'
 ],
 [
 'name' => 'aaf',
 'value' => 'dsfsd fgdg'
 ]
 // ... and so on... (see live example)
);

A live demo can be found here.

Question 2

Why is \R there?

Answer 1

If you're using it in PHP code, you are not restricted to / as the pattern delimiter. Using another character, such as ~ or %, frees you from writing \/ everywhere, which shortens the pattern.
I don't know what \R is meant to do there; I suspect it was supposed to be \r. If so, you do not need it at all. (Inside a character class, \R does not act as the "any newline" sequence anyway.)
For the annotation name, you can specify that it starts with a \w character, followed by a lazy match on the [\w\\] character set. Even using [\w\\]+ for the name would not be wrong.
Inside a character class, you do not need to escape anything other than the closing square bracket (]), the backslash itself, and the caret (^) when it is the first character. The hyphen (-) needs no escape if you move it to either the beginning or the end. So, the whole value group reduces to
```
(?P<value>(?:[-/\w\s"<>_#=.'{}:;,*()[\]]*[^*\s/)]))?
```
PS: You might need to escape at most one of ' or ", depending on how you quote the pattern in your code.
In the above, it appears you want to match everything up to the next @ character, so [^@]* would be my next suggestion.
Towards the end of your pattern, you have (?:\s | $|\)), which could become (?:\s+|$|\)) or simply (?:\s+|\)), since you would not need the anchor at all.
You are not accounting for a string like:
```
@name (something)
```
where you receive name as the name and \s\s\s(something as the resulting value (\s stands for a literal space). This can be kept in check by using \s*\(? instead of (?:\s|\(). This might not be the behaviour the user intends, which is why I kept this suggestion for the end.
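To illustrate the delimiter and escaping points above, here is a minimal sketch comparing the fully escaped character class from the question with the trimmed-down one. The anchors (^ and $), the + quantifier and the sample strings are my own additions for the demonstration; they are not part of the original pattern.

```php
<?php
// The question's character class with every escape kept, using "/" as delimiter.
// (In a single-quoted PHP string, \' is needed to embed the quote itself.)
$escaped = '/^[\/\w\s\"\<\>\_\#\=\-\.\'\{\}:;,\*\(\)\[\]]+$/';

// The same class with only the necessary escape (\]) and "~" as delimiter,
// so "/" no longer needs a backslash. The hyphen is moved to the front.
$unescaped = '~^[-/\w\s"<>_#=.\'{}:;,*()[\]]+$~';

// Hypothetical sample values: two made only of listed characters, one with "@".
$samples = ['name {"json": "x"} <b> #h', 'a-b.c_d', '@x'];

foreach ($samples as $sample) {
    // Both patterns should agree on every input.
    printf(
        "%-28s escaped=%d unescaped=%d\n",
        $sample,
        preg_match($escaped, $sample),
        preg_match($unescaped, $sample)
    );
}
```

Both patterns describe the same set of characters, so they should give the same verdict for each sample; only the readability differs.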

[*\s]* # any run of whitespace or asterisk characters
@ # followed by the @ sign, which
(?P<name>[\w\\]+) # has a string of \w or \ characters following it
\s*\(? # separated by spaces and maybe an opening parenthesis
 (?P<value> # store the value attached to the named annotation
 (?:
 [^@]* # value is made of any characters other than @
 [^*\s/)] # but does not end with *, whitespace, / or )
 )
 )? # the value is optional
(?:\s|$|\)) # succeeded by one of these
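
As a rough sketch of how the pattern above could be used from PHP, the snippet below runs it with preg_match_all. The function name parseAnnotations, the ~ delimiter, the xmu flag set and the short sample docblock are assumptions of mine, not something taken from the question.

```php
<?php
/**
 * Hypothetical helper: returns a list of ['name' => ..., 'value' => ...] pairs.
 */
function parseAnnotations(string $docBlock): array
{
    // Nowdoc keeps the backslashes untouched; "~" avoids escaping "/".
    $pattern = <<<'RE'
~
[*\s]*                      # leading whitespace or asterisks
@
(?P<name>[\w\\]+)           # annotation name, possibly namespaced
\s*\(?                      # optional spaces and opening parenthesis
(?P<value>
  (?:[^@]*[^*\s/)])         # anything up to the next annotation sign
)?
(?:\s|$|\))                 # whitespace, end of line or closing parenthesis
~xmu
RE;

    preg_match_all($pattern, $docBlock, $matches, PREG_SET_ORDER);

    $result = [];
    foreach ($matches as $m) {
        $result[] = [
            'name'  => $m['name'],
            // The value group is optional, so it may be missing from the match.
            'value' => $m['value'] ?? '',
        ];
    }
    return $result;
}

$doc = <<<'DOC'
/**
 * @Tag name name @annotation beee
 * @a()
 */
DOC;

print_r(parseAnnotations($doc));
```

Note that because [^@]* also matches newlines and asterisks, a value that continues on a following line would still contain the leading " * " of that line, so this sketch does not yet meet the goal of stripping it.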

hjpotter92 · Answer 1 · 2015-09-27 05:20:35Z

