musl - musl - an implementation of the standard library for Linux-based systems

index : musl
musl - an implementation of the standard library for Linux-based systems
summary refs log tree commit diff
path: root/src/regex/regcomp.c
AgeCommit message (Collapse)AuthorLines
2017年03月21日regex: fix newline matching with negated brackets Julien Ramseier-0/+14
With REG_NEWLINE, POSIX says: "A <newline> in string shall not be matched by a period outside a bracket expression or by any form of a non-matching list"
2016年12月16日handle ^ and $ in BRE subexpression start and end as anchors Szabolcs Nagy-9/+12
In BRE, ^ is an anchor at the beginning of an expression, optionally it may be an anchor at the beginning of a subexpression and must be treated as a literal otherwise. Previously musl treated ^ in subexpressions as literal, but at least glibc and gnu sed treats it as an anchor and that's the more useful behaviour: it can always be escaped to get back the literal meaning. Same for $ at the end of a subexpression. Portable BRE should not rely on this, but there are sed commands in build scripts which do. This changes the meaning of the BREs: \(^a\) \(a\|^b\) \(a$\) \(a$\|b\)
2016年05月22日fix the use of uninitialized value in regcomp Szabolcs Nagy-0/+2
the num_submatches field of some ast nodes was not initialized in tre_add_tag_{left,right}, but was accessed later. this was a benign bug since the uninitialized values were never used (these values are created during tre_add_tags and copied around during tre_expand_ast where they are also used in computations, but nothing in the final tnfa depends on them).
2016年03月02日fix ^* at the start of a complete BRE Szabolcs Nagy-0/+4
This is a workaround to treat * as literal * at the start of a BRE. Ideally ^ would be treated as an anchor at the start of any BRE subexpression and similarly $ would be an anchor at the end of any subexpression. This is not required by the standard and hard to do with the current code, but it's the existing practice. If it is changed, * should be treated as literal after such anchor as well.
2016年03月02日fix * at the start of a BRE subexpression Szabolcs Nagy-4/+0
commit 7eaa76fc2e7993582989d3838b1ac32dd8abac09 made * invalid at the start of a BRE subexpression, but it should be accepted as literal * there according to the standard. This patch does not fix subexpressions starting with ^*.
2016年01月31日regex: increase the stack tre uses for tnfa creation Szabolcs Nagy-1/+1
10k elements stack is increased to 1000k, otherwise tnfa creation fails for reasonable sized patterns: a single literal char can add 7 elements to this stack, so regcomp of an 1500 char long pattern (with only litral chars) fails with REG_ESPACE. (the new limit allows about < 150k chars, this arbitrary limit allows most command line regex usage.) ideally there would be no upper bound: regcomp dynamically reallocates this buffer, every reallocation checks for allocation failure and at the end this stack is freed so there is no reason for special bound. however that may have unwanted effect on regcomp and regexec runtime so this is a conservative change.
2016年01月30日regex: simplify the {,} repetition parsing logic Szabolcs Nagy-20/+19
2016年01月30日regex: treat \+, \? as repetitions in BRE Szabolcs Nagy-1/+5
These are undefined escape sequences by the standard, but often used in sed scripts.
2016年01月30日regex: rewrite the repetition parsing code Szabolcs Nagy-30/+29
The goto logic was hard to follow and modify. This is in preparation for the BRE \+ and \? support.
2016年01月30日regex: treat \| in BRE as alternation Szabolcs Nagy-2/+17
The standard does not define semantics for \| in BRE, but some code depends on it meaning alternation. Empty alternative expression is allowed to be consistent with ERE. Based on a patch by Rob Landley.
2016年01月30日regex: reject repetitions in some cases with REG_BADRPT Szabolcs Nagy-3/+12
Previously repetitions were accepted after empty expressions like in (*|?)|{2}, but in BRE the handling of * and \{\} were not consistent: they were accepted as literals in some cases and repetitions in others. It is better to treat repetitions after an empty expression as an error (this is allowed by the standard, and glibc mostly does the same). This is hard to do consistently with the current logic so the new rule is: Reject repetitions after empty expressions, except after assertions ^*, $? and empty groups ()+ and never treat them as literals. Empty alternation (|a) is undefined by the standard, but it can be useful so that should be accepted.
2016年01月30日regex: clean up position accounting for literal nodes Szabolcs Nagy-4/+2
This should not change the meaning of the code, just make the intent clearer: advancing position is tied to adding a new literal.
2015年09月24日regcomp: propagate allocation failures Szabolcs Nagy-1/+2
The error code of an allocating function was not checked in tre_add_tag.
2015年03月27日regex: fix character class repetitions Szabolcs Nagy-0/+5
Internally regcomp needs to copy some iteration nodes before translating the AST into TNFA representation. Literal nodes were not copied correctly: the class type and list of negated class types were not copied so classes were ignored (in the non-negated case an ignored char class caused the literal to match everything). This affects iterations when the upper bound is finite, larger than one or the lower bound is larger than one. So eg. the EREs [[:digit:]]{2} [^[:space:]ab]{1,4} were treated as .{2} [^ab]{1,4} The fix is done with minimal source modification to copy the necessary fields, but the AST preparation and node handling code of tre will need to be cleaned up for clarity.
2015年03月23日do not treat 0円 as a backref in BRE Szabolcs Nagy-1/+1
The valid BRE backref tokens are 1円 .. 9,円 and 0 is not a special character either so 0円 is undefined by the standard. Such undefined escaped characters are treated as literal characters currently, following existing practice, so 0円 is the same as 0.
2015年03月20日suppress backref processing in ERE regcomp Rich Felker-1/+1
one of the features of ERE is that it's actually a regular language and does not admit expressions which cannot be matched in linear time. introduction of \n backref support into regcomp's ERE parsing was unintentional.
2015年03月20日fix memory-corruption in regcomp with backslash followed by high byte Rich Felker-1/+1
the regex parser handles the (undefined) case of an unexpected byte following a backslash as a literal. however, instead of correctly decoding a character, it was treating the byte value itself as a character. this was not only semantically unjustified, but turned out to be dangerous on archs where plain char is signed: bytes in the range 252-255 alias the internal codes -4 through -1 used for special types of literal nodes in the AST.
2014年09月13日rewrite the regex pattern parser in regcomp Szabolcs Nagy-1081/+634
The new code is a bit simpler and the generated code is about 1KB smaller (on i386). The basic design was kept including internal interfaces, TNFA generation was not touched. The old tre parser had various issues: [^aa-z] negated overlapping ranges in a bracket expression were handled incorrectly (eg [^aa-z] was handled as [^a] instead of [^a-z]) a{,2} missing lower bound in a counted repetition should be an error, but it was accepted with broken semantics: a{,2} was treated as a{0,3}, the new parser rejects it a{999,} large min count was not rejected (a{5000,} failed with REG_ESPACE due to reaching a stack limit), the new parser enforces the RE_DUP_MAX limit \xff regcomp used to accept a pattern with illegal sequences in it (treated them as empty expression so p\xffq matched pq) the new parser rejects such patterns with REG_BADPAT or REG_ERANGE [^b-fD-H] with REG_ICASE old parser turned this into [^b-fB-F] because of the negated overlapping range issue (see above), the new parser treats it as [^b-hB-H], POSIX seems to require [^d-fD-F], but practical implementations do case-folding first and negate the character set later instead of the other way around. (Supporting the posix way efficiently would require significant changes so it was left as is, it is unclear if any application actually expects the posix behaviour, this issue is raised on the austingroup tracker: http://austingroupbugs.net/view.php?id=872 ). another case-insensitive matching issue is that unicode case folding rules can group more than two characters together while towupper and towlower can only work for a pair of upper and lower case characters, this is a limitation of POSIX so it is not fixed. invalid bracket and brace expressions may return different error codes now (REG_ERANGE instead of REG_EBRACK or REG_BADBR instead of REG_EBRACE) otherwise the new parser should be compatible with the old one. regcomp should be able to handle arbitrary pattern input if the pattern length is limited, the only exception is the use of large repetition counts (eg. (a{255}){255}) which require exp amount of memory and there is no easy workaround.
2013年12月12日include cleanups: remove unused headers and add feature test macros Szabolcs Nagy-1/+0
2013年10月07日fix allocation sizes in regcomp Szabolcs Nagy-4/+4
sizeof had incorrect argument in a few places, the size was always large enough so the issue was not critical.
2013年01月15日remove unused "params" related code from regex Szabolcs Nagy-20/+11
some structs and functions had reference to the params feature of tre that is not used by the code anymore
2012年09月06日use restrict everywhere it's required by c99 and/or posix 2008 Rich Felker-1/+1
to deal with the fact that the public headers may be used with pre-c99 compilers, __restrict is used in place of restrict, and defined appropriately for any supported compiler. we also avoid the form [restrict] since older versions of gcc rejected it due to a bug in the original c99 standard, and instead use the form *restrict.
2012年05月13日remove some no-op end of string tests from regex parser Rich Felker-4/+0
these are cruft from the original code which used an explicit string length rather than null termination. i blindly converted all the checks to null terminator checks, without noticing that in several cases, the subsequent switch statement would automatically handle the null byte correctly.
2012年05月13日another BRE fix: in ^*, * is literal Rich Felker-0/+2
i don't understand why this has to be conditional on being in BRE mode, but enabling this code unconditionally breaks a huge number of ERE test cases.
2012年05月07日fix error checking for \ at end of regex (this was broken previously) Rich Felker-1/+1
2012年05月07日fix copy and paste error in regex code causing mishandling of \) in BRE Rich Felker-1/+1
2012年05月07日fix regex breakage in last commit (failure to handle empty regex, etc.) Rich Felker-4/+1
2012年05月07日fix ugly bugs in TRE regex parser Rich Felker-60/+31
1. * in BRE is not special at the beginning of the regex or a subexpression. this broke ncurses' build scripts. 2. \\( in BRE is a literal \ followed by a literal (, not a literal \ followed by a subexpression opener. 3. the ^ in \\(^ in BRE is a literal ^ only at the beginning of the entire BRE. POSIX allows treating it as an anchor at the beginning of a subexpression, but TRE's code for checking if it was at the beginning of a subexpression was wrong, and fixing it for the sake of supporting a non-portable usage was too much trouble when just removing this non-portable behavior was much easier. this patch also moved lots of the ugly logic for empty atom checking out of the default/literal case and into new cases for the relevant characters. this should make parsing faster and make the code smaller. if nothing else it's a lot more readable/logical. at some point i'd like to revisit and overhaul lots of this code...
2012年04月13日remove invalid code from TRE Rich Felker-14/+0
TRE wants to treat + and ? after a +, ?, or * as special; ? means ungreedy and + is reserved for future use. however, this is non-conformant. although redundant, these redundant characters have well-defined (no-op) meaning for POSIX ERE, and are actually _literal_ characters (which TRE is wrongly ignoring) in POSIX BRE mode. the simplest fix is to simply remove the unneeded nonstandard functionality. as a plus, this shaves off a small amount of bloat.
2012年03月20日upgrade to latest upstream TRE regex code (0.8.0) Rich Felker-775/+822
the main practical results of this change are 1. the regex code is no longer subject to LGPL; it's now 2-clause BSD 2. most (all?) popular nonstandard regex extensions are supported I hesitate to call this a "sync" since both the old and new code are heavily modified. in one sense, the old code was "more severely" modified, in that it was actively hostile to non-strictly-conforming expressions. on the other hand, the new code has eliminated the useless translation of the entire regex string to wchar_t prior to compiling, and now only converts multibyte character literals as needed. in the future i may use this modified TRE as a basis for writing the long-planned new regex engine that will avoid multibyte-to-wide character conversion entirely by compiling multibyte bracket expressions specific to UTF-8.
2011年06月16日duplicate re_nsub in LSB/glibc ABI compatible location Rich Felker-1/+1
2011年02月12日initial check-in, version 0.5.0 v0.5.0 Rich Felker-0/+3362
generated by cgit v1.2.1 (git 2.18.0) at 2025年09月05日 05:12:13 +0000

AltStyle によって変換されたページ (->オリジナル) /