I am writing a regular expression for parsing PHP annotations in a flexible way. I'd like to accomplish all the goals listed below with a single expression, so I would appreciate any suggestions about its quality (in terms of corner cases, performance, best practices and correctness).
PCRE expression:
/[\*\s]*@(?P<name>\w+[\\\w]*?)(?:\s|\()
(?P<value>
(?:[\/\w\s\"\<\>\_\#\=\-\.\'\{\}:;,\*\(\)\[\]]*[^\R\*\s\/\)])
)?
(?:\s | $|\))/gsxmu
Goals of the regular expression:
- List all annotations with their values
- Values can be multi-lined and have markup (html, json or markdown)
- The initial space + * of each line should be removed from the value
- Many annotations can be in the same line
- Annotation names can be namespaced
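A note on the third goal: a capture group always holds one contiguous slice of the input, so the leading space + * of continuation lines cannot be dropped from inside the captured value by the match itself. A minimal sketch of a separate clean-up step follows; the helper name stripDocblockMargin is hypothetical and not part of the original pattern.

```php
<?php
// Hypothetical helper: removes the " * " margin that starts each
// continuation line of an already captured multi-line value.
function stripDocblockMargin(string $value): string
{
    // \R       any line break sequence (valid outside a character class)
    // [ \t]*   indentation before the asterisk
    // \*       the margin asterisk itself
    // [ \t]?   at most one space or tab after the asterisk
    $clean = preg_replace('~\R[ \t]*\*[ \t]?~', "\n", $value);

    // preg_replace() returns null on failure; fall back to the raw value.
    return $clean ?? $value;
}

$raw = "first line\n * second line\n * third";
$clean = stripDocblockMargin($raw);
```

Note that this sketch treats every asterisk that opens a line as margin, so a Markdown bullet written with * directly at the start of a continuation line would lose its marker.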
Sample PHPDoc string:
/**
 * Description
 *
 * @Tag name name @annotation beee @aaf dsfsd fgdg
 * @Tag name name {"json":"dfsf"}
 * @Tag asdasd <html> #markdown ==markdown== __markdown__
 * - markdown
 * > mark 1 .mark "string" 'string'
 * @Annotation()
 * @Tag name name @annotation beee @aaf dsfsd fgdg <markdown> #markdown ==markdown==
 * @a() @b("name") @c()
 * @Annotation\Name("var1()", "var2") @n("name()_name")
 * @Annotation(["var1", "var2"], "var3")
 * @Annotation\Filter\Name(["var1", "var2"], "var3", {"var4": "var5"})
 * @Annotation(
 * ["GET", "POST"] ) @Name({"name": "Tomas"})
 * @Tag name name
 */
Expected result:
<?php
array(
    [ 'name' => 'Tag', 'value' => 'name name' ],
    [ 'name' => 'annotation', 'value' => 'beee' ],
    [ 'name' => 'aaf', 'value' => 'dsfsd fgdg' ]
    // ... and so on... (see live example)
);
A live demo can be found here.
1 Answer
- If you are using it in PHP code, you are not restricted to / as the pattern delimiter. Using any other character, such as ~ or %, frees you from writing \/ everywhere, which shortens the pattern.
- I don't know what \R is meant to do there, but I think it was supposed to be \r. If so, you do not need it at all.
- For the annotation name, you can just specify that it starts with a \w character, followed by a lazy match on the [\w\\] character set. Even using [\w\\]+ as the whole name would not be wrong.
- Inside a character set, you do not need to escape any characters other than the backslash itself, the closing square bracket (]), and the caret (^) if it is the first character. You can move the hyphen (-) to either the beginning or the end. So the whole value group reduces to:
  (?P<value>(?:[-/\w\s"<>_#=.'{}:;,*()[\]]*[^*\s/)]))?
  PS: You might need to escape at most one of ' or ", depending on how you quote the pattern in your code.
- In the above, it appears you want to match everything up to the next @ character, so [^@]* would be my next suggestion.
- Towards the end of your pattern, you have (?:\s | $|\)), which could become (?:\s+|$|\)), or simply (?:\s+|\)), since you would not need the anchor at all.
- You are not accounting for a string like
  @name    (something)
  where you receive name as the name and \s\s\s(something as the value (\s stands for a literal space). This can be kept in check by using \s*\(? instead of (?:\s|\(). This might not be the behaviour you intend, which is why I kept this suggestion for the end.
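Two of the points above are easy to check in isolation. The sketch below uses deliberately simplified patterns (a bare (.*) stands in for the value group, so the trailing parenthesis is not trimmed); it is only meant to illustrate the delimiter choice and the difference between (?:\s|\() and \s*\(?, not to reproduce the full expression.

```php
<?php
// 1. Delimiter choice: with "~" as the delimiter, "/" needs no escaping.
$slash = preg_match('/^[\/\w]+$/', 'a/b'); // "/" delimiter forces \/
$tilde = preg_match('~^[/\w]+$~', 'a/b');  // same class, no escape needed

// 2. Separator between name and value, on an input with four spaces
//    before the opening parenthesis.
$input = '@name    (something)';

// (?:\s|\() consumes exactly one character, so the three remaining
// spaces and the parenthesis leak into the value.
preg_match('~@(\w+)(?:\s|\()(.*)~', $input, $strict);

// \s*\(? consumes all the spaces and the optional parenthesis.
preg_match('~@(\w+)\s*\(?(.*)~', $input, $loose);

var_dump($strict[2], $loose[2]);
```

With the first separator the captured value starts with three spaces and the opening parenthesis; with the second it starts at something.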
[*\s]*             # any run of whitespace or asterisk characters
@                  # followed by the @ sign, which
(?P<name>[\w\\]+)  # has a string of \w or \ characters following it
\s*\(?             # separated by spaces and maybe an opening parenthesis
(?P<value>         # store the value attached to the named annotation
  (?:
    [^@]*          # value is made of any characters other than @
    [^*\s/)]       # but does not end with *, whitespace, / or )
  )
)?                 # the value is optional
(?:\s|$|\))        # succeeded by one of these
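A minimal sketch of how this pattern might be used from PHP, assuming ~ as the delimiter and the x and u modifiers. The function name parseAnnotations and the sample docblock are hypothetical. The g flag shown in the question is not a PHP modifier; preg_match_all() already returns every match.

```php
<?php
// Hypothetical wrapper around the pattern above.
function parseAnnotations(string $doc): array
{
    // Nowdoc keeps every backslash literal, so the pattern reads as written.
    $pattern = <<<'REGEX'
~
[*\s]*               # any run of whitespace or asterisks
@                    # the @ sign
(?P<name>[\w\\]+)    # annotation name, possibly namespaced
\s*\(?               # optional spaces and optional opening parenthesis
(?P<value>
    (?:
        [^@]*        # anything up to the next @
        [^*\s/)]     # but not ending in *, whitespace, / or )
    )
)?
(?:\s|$|\))          # followed by whitespace, end of input or )
~xu
REGEX;

    preg_match_all($pattern, $doc, $matches, PREG_SET_ORDER);

    $result = [];
    foreach ($matches as $m) {
        // An unmatched optional group is absent from the match array.
        $result[] = ['name' => $m['name'], 'value' => $m['value'] ?? ''];
    }
    return $result;
}

$doc = "/**\n * @Tag name name\n * @Annotation\\Name(\"var1\")\n */";
print_r(parseAnnotations($doc));
```

For this input the sketch should yield Tag with the value name name, and Annotation\Name with the value "var1" (quotes included, since the pattern does not strip them).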