Package: grep;
Reported by: sur-behoffski <sur_behoffski <at> grouse.com.au>
Date: 2022年6月19日 05:40:02 UTC
Severity: wishlist
To reply to this bug, email your comments to 56079 AT debbugs.gnu.org.
the display of automated, internal messages from the tracker.
View this report as an mbox folder, status mbox, maintainer mbox
bug-grep <at> gnu.org:bug#56079; Package grep.
(2022年6月19日 05:40:02 GMT) Full text and rfc822 format available.sur-behoffski <sur_behoffski <at> grouse.com.au>:bug-grep <at> gnu.org.
(2022年6月19日 05:40:02 GMT) Full text and rfc822 format available.Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
From: sur-behoffski <sur_behoffski <at> grouse.com.au> To: bug-grep <at> gnu.org Subject: Missing performance optimisation: Word start/end tests Date: 2022年6月19日 15:08:46 +0930
G'day, I'm in the throes of a massive rewrite of "hstbm", which combined a very-quick-and-dirty lexer, no parser, and an optimised Boyer-Moore-style search where I had made some incremental improvements. The only release is at savannah.non-gnu.org: https://savannah.nongnu.org/projects/hstbm It's been over six years since the first and only release; Lua fans will note that I have had another project that has been active over that time, intended to help people use a scientific/technical toolkit on a range of GNU/Linux machines. -- I'm now in the process of trialling an all-singing, all-dancing lexer, with the philosophy that it tries to capture the pattern syntax and semantics, without resorting to parser constructs such as an AST. [I'm currently at a hairy point where the meaning of characters such as "^" can vary, based on constructs such as "(" (start-of-group) and/or "|" (alternation)... where does lexing stop and parsing start?!] One thing that is captured is predicates e.g. relating to IS_WORD: "IS_WORD_YES" (0x01), "IS_WORD_NO" (0x10), and "IS_WORD_MAYBE" (0x11). I've found some patterns containing word start/end boundary checks that are impossible to match in practice, e.g.: a\<b [abc]\>[def] Grep does not recognise these cases, and so spends time ploughing through the text for a match that can never occur. My lexing code, in contrast, sees the "IS_WORD_YES" "\<" "IS_WORD_YES" (or, equivalently, pairs of "IS_WORD_NO"), and arranges the lexical token stream such that the very first token is (effectively) MATCH_FAILED -- without any effort to inspect the haystack buffer. This can reduce runtimes for large haystack input from seconds to milliseconds. While this is not a terribly common case, it's an easy item to check for; it's possible that, in the future, patterns may become less hand-crafted and more machine-crafted, and so this case may become more relevant. cheers, sur-behoffski (Brenton Hoff) programmer, Grouse Software ["sur-" means "meta-", it's a commentary on a peculiar Australian event: See "Tony Abbott" + "Captain's Pick" + "Prince Philip". Absolutely no disrespect is intended to Honour-receivers at any level; I am grateful for your service, and how you have enriched society.]
bug-grep <at> gnu.org:bug#56079; Package grep.
(2022年6月19日 15:27:02 GMT) Full text and rfc822 format available.Message #8 received at 56079 <at> debbugs.gnu.org (full text, mbox):
From: Paul Eggert <eggert <at> cs.ucla.edu> To: sur-behoffski <sur_behoffski <at> grouse.com.au> Cc: 56079 <at> debbugs.gnu.org Subject: Re: bug#56079: Missing performance optimisation: Word start/end tests Date: 2022年6月19日 10:26:43 -0500
On 6/19/22 00:38, sur-behoffski wrote: > a\<b > [abc]\>[def] > Yes, and this comes up even in strict POSIX without GNU extensions, as "a^" is a valid extended regular expression that cannot match anything. I don't know whether dfa.c, regex.c, etc. optimize for this, but if not it would be nice if they did. For example, 'grep -E "a^" FOO' can silently exit with status 1 without even opening FOO.
Paul Eggert <eggert <at> cs.ucla.edu>
to control <at> debbugs.gnu.org.
(2022年7月02日 22:27:02 GMT) Full text and rfc822 format available.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.