-
Notifications
You must be signed in to change notification settings - Fork 128
Conversation
RegexReader transforms #"foo"iL into (re2 "foo" {:ignore_case :literal})
...xp is refered directly
mpenet
commented
Feb 18, 2016
Interesting work!
One thing that (I believe) is problematic with this approach is the "freeing" of resources. You are never calling cre2_delete or cre2_opt_delete, you leave that to the user (which is fine, there's no other way when doing ffi for now). http://marcomaggi.github.io/docs/cre2.html#matching
It seems that one idiomatic way of handling this in pixie would be to wrap regex (and options) instances in a type and implement IDisposable (a bit like here:
Lines 9 to 14 in 189395f
The user still has to release resources manually but at least it's a bit more explicit/obvious with this protocol, and I guess down the road pixie could/might handle this internally once no ref to an instance remains (this is probably the idea behind IDisposable).
mpenet
commented
Feb 18, 2016
One could also imagine we'd have a generic regex protocol, that cre2 would implement with that type I mention. That would unify a bit the interface for if/when people make other implementations (pcre2 etc).
mks-m
commented
Feb 18, 2016
@mpenet you're absolutely right about freeing resources, it's one large problem i was hoping somebody would point me to a decent solution (which you did, thanks!). i imagine it should also be possible to somehow hook into GC and tell which function to use to free certain objects?
protocol is a good idea as well, i wasn't focusing much on how to do everything properly. one problem with different regexp engines is how to deal with "late binding" of pixie.regex/regexp function that implements literal conversion into native object? if somebody refers to re2 and passes literal into namespace that refers pcre - that obviously won't work.
mpenet
commented
Feb 18, 2016
Right, I didn't think about the literal, when you expand #"" you can just translate into a fn call that does the right thing, which would use the default active re engine (that can be set in a global var). and have that re-pattern fn have 2 signature with one that allows to specify another engine.
Something like this (dont mind the awful naming):
(def ^:dynamic *default-re-engine* :cre2) ;; an "open" engine registry (defmulti re-engine (fn [k s] k)) ;; add cre2 to registry (defmethod re-engine :cre2 [_ s] (new cre2-re-engine s)) ;; dispatch on the right engine constructor via registry (defn re-pattern ([s] (re-pattern s *default-re-engine*)) ([s kw] (re-engine kw s))) (defprotocol IRegex ... (re-find [r s]) (re-seq [r s])) (deftype cre2-re-engine [...] IRegex ... IDisposable ... ))
Just thinking out loud, there might be better approaches, but that way you can change impl using binding, a redef of the default-re-engine or simply by passing the engine kw to re-pattern manually)
halgari
commented
Feb 18, 2016
You're right IDispose doesn't hook into the GC, but IFinalize does (with
the associated -finalize! function). With a caveat: the GC can't look
outside of its own heap, so if you enable -finalize! on and that object is
stored outside the semantic bounds of a function call, then you may get a
segfault. Pixie makes sure to to never GC arguments to a FFI function while
that function is being called, but if a argument is stashed away somewhere
on the C heap, the GC may try to free it after the call is finished.
All that to say, I think IFinalize should work just fine for regexes.
On Thu, Feb 18, 2016 at 5:43 AM, Max Penet notifications@github.com wrote:
Right, I didn't think about the literal, when you expand #"" you can just
translate into a fn call that does the right thing, which would use the
default active re engine (that can be set in a global var). and have that
re-pattern fn have 2 signature with one that allows to specify another
engine.Something like this (dont mind the awful naming):
(def default-re-engine :cre2)
(set-dynamic! default-re-engine)
;; an "open" engine registry
(defmulti re-engine (fn [k opts])
;; add cre2 to registry
(defmethod re-engine :cre2 [_ opts](new cre2-re-engine opts))
;; dispatch on the right engine constructor via registry
(defn re-pattern
([s](re-pattern s default-re-engine)
([s kw](init-re-engine kw))(defprotocol IRegex
...
(re-find [r s])
(re-seq [r s]))(deftype cre2-re-engine [...]
IRegex
...IDisposable
... ))Just thinking out loud, there might be better approaches.
—
Reply to this email directly or view it on GitHub
#498 (comment).
"One of the main causes of the fall of the Roman Empire was that–lacking
zero–they had no way to indicate successful termination of their C
programs."
(Robert Firth)
mpenet
commented
Feb 18, 2016
@halgari awesome, I didn't know about IFinalize, thanks for the tip!
- some renames - defprotocol IRegex - deftype CRE2Regex implementing IFinalize and IRegex
mks-m
commented
Feb 19, 2016
@mpenet @halgari thanks for insights, i've added deftype and defprotocol and multimethod dispatch to current (only) engine.
i've also changed literal expansion into (pixie.re/regex ...) - which means to have literals working one needs to always refer to pixie.re namespace. what do you think is best way to address this? maybe (in-ns pixie.stdlib) just before (defn regex ...) so that regex is always in stdlib?
i think expected functionality is for literals to be available in any part of codebase but not sure how to achieve that properly.
927c66f to
ecd68b9
Compare
mks-m
commented
Feb 25, 2016
bit stuck with alloc'ing pointer to / array of cre2_string_t structs that will contain capture groups
example usage in cre2 library: https://github.com/marcomaggi/cre2/blob/master/src/cre2.cpp#L222
how do i do that in pixie?
right now i'm just doing this: 4c61558#diff-a532eb022eaed3bf076b7d519289551aR63
thomasmulvaney
commented
Feb 25, 2016
I'm not sure if there is a good way for allocating a pointer to an array in pixie at the moment. I don't think there is a nice way of creating an array for a given struct/type either. Perhaps we could start a list of FFI things to work on as a github issue?
mks-m
commented
Mar 15, 2016
@thomasmulvaney i'm afraid my ffi-fu is not strong enough to properly describe what are we missing using the right language but i would like to put more effort into this pr if it has any chance of being merged. will appreciate any help with ffi things even if it is just issue description and maybe some links to get me started.
thomasmulvaney
commented
Mar 18, 2016
@keymone I pulled down this branch last night. Nice work!
To extract the strings from the matches we need an ffi function which when given a CharP and a length returns a pixie string which is the result of reading from the pointer to length.
I also noticed that ffi-infer thinks the :data part of the cre2_string_t is a VoidP rather than a CharP so we should add a function to ffi-infer that lets one cast between pointer types.
I have had a look at implementing these 2 functions.
thomasmulvaney
commented
Apr 6, 2016
I've made some progress with handling charps: master...thomasmulvaney:charp
Currently this seems to only work when run interactively. When compiled, charp->str crashes. I'm guessing because whatever part of memory the pointer was addressing is no longer a char. I'm not sure if I'll have much time to look into this further until next week.
the reduce implementation was creating some lazy lists
first of all, this is not merge-ready. this PR introduces regexp literal and builds cre2 C library that wraps google's re2. to get all of it working you need:
make re2_cre2- this will build re2 and cre2 wrapper in externals/ln -s 'pwd'/externals/cre2/build/.libs/libcre2* /usr/local/lib/- i have no idea what i'm doingpixie-vm -c pixie/regex.pxi- this might require something likeexport DYLD_LIBRARY_PATH='pwd'/externals/cre2/build/.libs, see previous pointafter this is done, fire up your repl and run
(ns user (:require [pixie.regex :refer [regexp match]]))this PR introduces regexp literal like in clojure:
#"123"- is translated in RegexReader into(regexp "123" #{})#"123"iL- is translated into(regexp "123" #{:ignore_case :literal})most options available in
re2are supportedthis is how far i got and i will certainly need help from somebody more knowledgable in building and using C libraries.