With REG_NEWLINE, POSIX says: "A <newline> in string shall not be matched by a period outside a bracket expression or by any form of a non-matching list"
In BRE, ^ is an anchor at the beginning of an expression, optionally it may be an anchor at the beginning of a subexpression and must be treated as a literal otherwise. Previously musl treated ^ in subexpressions as literal, but at least glibc and gnu sed treats it as an anchor and that's the more useful behaviour: it can always be escaped to get back the literal meaning. Same for $ at the end of a subexpression. Portable BRE should not rely on this, but there are sed commands in build scripts which do. This changes the meaning of the BREs: \(^a\) \(a\|^b\) \(a$\) \(a$\|b\)
the num_submatches field of some ast nodes was not initialized in tre_add_tag_{left,right}, but was accessed later. this was a benign bug since the uninitialized values were never used (these values are created during tre_add_tags and copied around during tre_expand_ast where they are also used in computations, but nothing in the final tnfa depends on them).
This is a workaround to treat * as literal * at the start of a BRE. Ideally ^ would be treated as an anchor at the start of any BRE subexpression and similarly $ would be an anchor at the end of any subexpression. This is not required by the standard and hard to do with the current code, but it's the existing practice. If it is changed, * should be treated as literal after such anchor as well.
commit 7eaa76fc2e7993582989d3838b1ac32dd8abac09 made * invalid at the start of a BRE subexpression, but it should be accepted as literal * there according to the standard. This patch does not fix subexpressions starting with ^*.
10k elements stack is increased to 1000k, otherwise tnfa creation fails for reasonable sized patterns: a single literal char can add 7 elements to this stack, so regcomp of an 1500 char long pattern (with only litral chars) fails with REG_ESPACE. (the new limit allows about < 150k chars, this arbitrary limit allows most command line regex usage.) ideally there would be no upper bound: regcomp dynamically reallocates this buffer, every reallocation checks for allocation failure and at the end this stack is freed so there is no reason for special bound. however that may have unwanted effect on regcomp and regexec runtime so this is a conservative change.
These are undefined escape sequences by the standard, but often used in sed scripts.
The goto logic was hard to follow and modify. This is in preparation for the BRE \+ and \? support.
The standard does not define semantics for \| in BRE, but some code depends on it meaning alternation. Empty alternative expression is allowed to be consistent with ERE. Based on a patch by Rob Landley.