[Crm114-discuss] unifying LEARN/CLASSIFY invocation (was: Re: [Crm114-general] Mixed 64-bit system GerH binaries / BillYscripts)

From: Paolo <oo...@us...> - 2008-09-03 07:27:40

On Wed, Sep 03, 2008 at 05:03:35AM +0200, Ger Hobbelt wrote:
...
> Given 60 classes (= CSS files), Paolo can have his KISS and I can eat
> my pie too. Simple.
let me stress once again that I question the _requirement_. 
 
> The set passed to classify is a set and should be passed to learn as
right, if we had vector/array struct it'd be 'natural' ...
> isolate (:c:) /class1 | class2 | and so on .../
...
> classify (:*:c:) [message]
which is a fake vector, works on strict assumptions on how to name 
var/classes.
Like in other situations, having true array data structure would be quite
useful.
> learn (:*:c:) (index) [message]
> 
> Both look good to _me_. ;-)
agreed, provided that
! learn (:*:s:) [message] <i flags>
where :s: is just a subset (1 as limit, so that 'i' can be dropped) of the 
N classes in use, is (remains) legal (where allowed). 
 
> Because you always pass along the whole set at script level, the
> classifier code (both learn and classify implementation) gets to pick
there's no need for that, where's the binding between script level and 
classifiers implementation? eg I can define N classes, but use any subset
for both LEARN / CLASSIFY at any point to my taste/needs, with the limit
of the actual classifier's requirement:
!# use classes: one two three four five six seven
! learn (one two three four five six seven) <i flags> [msg_x]
! learn (three four five) <i flags> [msg_y]
! learn (one) <flags> [msg1]
! ...
! classify (one two three four five six seven) <flags>
! classify (five six seven) <flags>
! classify (three six seven) <flags>
! classify (one three four six seven) <flags>
! classify (six) <flags> (cm)	# class membership -> cm, unsupported atm
...
> what they want/need, you get the chance to apply filters & processes
> in learn that are simply impossible right now PLUS you don't have to
that's C level, SVM wants 3 because it uses 3 in both cases.
> worry anymore either which classifier you're gonna use because today
> all the bloody buggers require their own particular incantation when
> it comes to number of css files (classes) passed to learn.
there are categories of classifiers that have same requirements wrt
#classes and params. Now suppose the actual classes are compatible, but
one classifier needs 1+ extras (eg SVM) and I want to compare classifiers,
then it'd be nice to do (SVM case, forget for now actual class compatibility):
! learn (a b a_v_b) <svm flags>		# wants all 3
! classify (a b a_v_b) <svm flags> (s_svm)	# wants all 3
! classify (a b) <xxx flags> (s_xxx)	# can't use the extra a_v_b
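
A minimal sketch of the per-classifier requirement described above, written in Python rather than CRM114 script. The table REQUIRED_FILES and the function check_class_list are hypothetical names invented for illustration. Only the SVM count of 3 comes from this thread; the counts for the placeholder classifier "xxx" are assumptions read off the example lines.

```python
# Hypothetical table: how many statistics files each classifier expects.
# "svm" wanting 3 for both operations is stated in the thread; the "xxx"
# counts are assumptions made only for this illustration.
REQUIRED_FILES = {
    "svm": {"learn": 3, "classify": 3},
    "xxx": {"learn": 1, "classify": 2},
}


def check_class_list(classifier, operation, classes):
    """Return the class list unchanged if its size suits the classifier."""
    wanted = REQUIRED_FILES[classifier][operation]
    if len(classes) != wanted:
        raise ValueError(
            f"{classifier} {operation} wants {wanted} file(s), "
            f"got {len(classes)}"
        )
    return list(classes)


if __name__ == "__main__":
    # SVM accepts the extra a_v_b file; the other classifier does not.
    print(check_class_list("svm", "classify", ["a", "b", "a_v_b"]))
    print(check_class_list("xxx", "classify", ["a", "b"]))
```

The check lives with the classifier, not with the script, which is one way to read the point that the requirement is a C-level matter.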
> So no unified ... mess; I'd say it's unified ... structure / design.
maybe, but that's not as simple as saying :
define:	N classes
hence:	LEARN(1 2 ... N)
	CLASSIFY(1 2 ... N)
which might turn into a mess, or better shift the mess from one place to
another.
> Cost for Trever @ 60 classes? nil.
wasn't thinking of run time cost, but script readability.
> You save far more time when you find a way to reduce disc I/O cache
> misses on your memory-mapped CSS files, even when you achieve such a
> feat for learn alone (which would be rather weird and besides, unless
> you 'Train Everything', optimizing classify is the winner). I have a
yes, though once N classes get mmapped for a CLASSIFY a single class LEARN
can check for it and won't mmap() again, and msync() can be deferred
iff other processes that use same class(es) do that via shared mem.
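
The "check for it and don't map again" idea can be sketched as a small cache keyed by file path. This is a Python illustration, not CRM114's C code; the class MappedFileCache and its method names are hypothetical, and deferring the sync is modeled simply by calling sync_all() later instead of after every write.

```python
import mmap
import os


class MappedFileCache:
    """Keep statistics files mapped so a later LEARN reuses CLASSIFY's maps."""

    def __init__(self):
        self._maps = {}      # real path -> (file object, mmap object)
        self.map_calls = 0   # how often a file was actually mapped

    def get(self, path):
        key = os.path.realpath(path)
        if key not in self._maps:
            f = open(key, "r+b")
            # Length 0 maps the whole file; the file must not be empty.
            self._maps[key] = (f, mmap.mmap(f.fileno(), 0))
            self.map_calls += 1
        return self._maps[key][1]

    def sync_all(self):
        # flush() plays the role that msync() plays in C.
        for _, mapped in self._maps.values():
            mapped.flush()

    def close(self):
        for f, mapped in self._maps.values():
            mapped.close()
            f.close()
        self._maps.clear()
```

A CLASSIFY over N files followed by a LEARN on one of them would call get() N+1 times but map only N files.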
> Want some real, achievable gain? convert crm114 to play 'server', i.e.
> permanently loaded and CSS files (close to) permanently mapped in
yes yes yes yes - the endless daemon saga :)
> invocation of crm114 and the moment the script *tokenizer* kicks in.
> You're not even *executing* script yet by then! The rest (8%) is
> spread across tokenizing ('compiling the [small!] script'), tokenized
> script code execution, wrap-up and unidentified fluff elsewhere.
> Believe me, if I'd see an easy way to kick that bugger into higher
> gear, you'd already have it.
yeah, maybe the ability to run pre-compiled scripts can be good idea 
for a number of applications.
> seriously considering hacking crm114 into becoming mod_crm114, i.e. an
> Apache2 plugin: you get the server, the socket I/O and the
like Apache's Lucene and derivatives.
> live in there like a wicked PHP-alike server-side scripting language
> and you will definitely achieve instant notoriety. ;-)
and support headache ;)
> Anyhow, I don't see any good reason why the learn (classes) argument
> cannot be identical to the related classify (classes) argument, except
see above: CAN but definitely should not be a MUST.
> ONE: strict adherence to 'backwards compatibility' at CRM114 script
just one good reason.
-- 
paolo

Re: [Crm114-discuss] unifying LEARN/CLASSIFY invocation (was: Re: [Crm114-general] Mixed 64-bit system GerH binaries / BillYscripts)

From: Ger H. <ge...@ho...> - 2008-09-03 10:28:19

On Wed, Sep 3, 2008 at 9:27 AM, Paolo <oo...@us...> wrote:
> On Wed, Sep 03, 2008 at 05:03:35AM +0200, Ger Hobbelt wrote:
> ...
>> Given 60 classes (= CSS files), Paolo can have his KISS and I can eat
>> my pie too. Simple.
>
> let me stress once again that I question the _requirement_.
[...]
> agreed, provided that
>
> ! learn (:*:s:) [message] <i flags>
>
> where :s: is just a subset (1 as limit, so that 'i' can be dropped) of the
> N classes in use, is (remains) legal (where allowed).
Yes, that should be possible in my line of thought. Assuming "you know
what you're doing" i.e. are aware of classifier internals, you can do
this in the new 'learn'. Take existing OSB for example (*forget* my
'delta' stuff for a sec there), which touches only a single CSS file
on learn, then
learn (A|B) <1>
is identical to
learn (A) <1>
is identical to
learn (A)
is identical to
learn (A|B)
because <1> is a possible 'default' -- though that might be a
disputable thing - I'd rather see an error report, because learn (A|B)
isn't 'obviously' going to teach the way of A.
The thing I'm really after is that at script level
learn (A|B|...) <i>
is supported for _all_ classifiers. When you're doing smart stuff
script-wise where you like to code
learn (A)
while your classify code is
classify (A|B|C|D|E|F|..)
fine. The bit of 'cut at pipe, pick the ones you want' code I envision
can handle it, so you've got options script-code-wise.
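
A rough Python model of that 'cut at pipe, pick the ones you want' step, not the actual C code. The function names split_classes and pick_learn_target are made up for this sketch, and the index is assumed to be 1-based as in the <1> examples above. When several classes are given without an index, the sketch takes the stricter route and reports an error instead of defaulting to <1>.

```python
def split_classes(spec):
    """Cut an 'A|B|C' class-set string at the pipes, trimming whitespace."""
    names = [name.strip() for name in spec.split("|")]
    if not all(names):
        raise ValueError(f"empty class name in {spec!r}")
    return names


def pick_learn_target(spec, index=None):
    """Return the class a learn should train; index is 1-based, like <1>."""
    names = split_classes(spec)
    if index is None:
        if len(names) == 1:
            return names[0]  # a set of one needs no index
        raise ValueError("several classes given but no index")
    if not 1 <= index <= len(names):
        raise ValueError(f"index {index} outside 1..{len(names)}")
    return names[index - 1]


if __name__ == "__main__":
    print(pick_learn_target("A|B", 1))        # same target as the line below
    print(pick_learn_target("A"))
    print(split_classes("A|B|C|D|E|F"))       # what classify would receive
```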
In other words: a 'set' of one, is still a set in my book. That you as
a script writer might want to take that thought to the edge (set of 1)
is fine with me. I always appreciate that kind of craftiness. It's
just that the starting point shifts for people new to this: keep the
set around and apply to both classify and learn equally. When you are
ready to read the fine print in the manual, you can decide to use 'set
of 1' as a valid 'fringe case' (fringe from script-language structural
point of view).
What I *need* is learn (A|B) support for classifiers that don't have
it yet (OSB and friends) and currently there's no possibility for
coding
learn (A|B) <i osb>
so I am prevented from testing my ideas for the classifier itself.
>> what they want/need, you get the chance to apply filters & processes
>> in learn that are simply impossible right now PLUS you don't have to
>
> that's C level, SVM wants 3 because it uses 3 in both cases.
Aicks! You _got_ me there. Forgot the 3rd one in SVM. DANG! Still a
remaining 'oddity' hence. :-((
No good answer there except mumbling about the implicit 'variable
size' of a 'set' as I approach it.
> ! learn (a b a_v_b) <svm flags> # wants all 3
> ! classify (a b a_v_b) <svm flags> (s_svm) # wants all 3
> ! classify (a b) <xxx flags> (s_xxx) # can't use the extra a_v_b
>
>> So no unified ... mess; I'd say it's unified ... structure / design.
>
> maybe, but that's not as simple as saying :
> define: N classes
> hence: LEARN(1 2 ... N)
> CLASSIFY(1 2 ... N)
> which might turn into a mess, or better shift the mess from one place to
> another.
Sure it's a shift: out of the [script] language, so it's 'black
boxing' learn as it is classify, and into the [C] code.
I think for general use it's less mess because you need to 'remember'
less about the script language and the 'learn' interface, because
apart from the extra index (in a sense you're _feeding_ it the pR
which would pop out of classify as a result) it's exactly like
classify. I really like language layout where general use requires the
least number of 'rules' and 'details' to be remembered: it makes for a
simpler language overall which is good for me as I work with multiple
languages and a limited brain. ;-)
(This learn/classify stuff is - in a way - comparable to old
discussions about 'coding standards' and such for Pascal or C, where
there's a class of folks that say: "you can skip the braces/begin-end
and the semicolons so you should" while I am clearly with the folks
that say: "don't matter what you do, always apply the same structure:
braces/begin-end and semicolons and stuff, unless it is _prohibited_
by the language". Right now 'learn (A)' is prohibiting me from using
'learn (A|B)'. I think that bit didn't make it through last night.)
>> Cost for Trever @ 60 classes? nil.
>
> wasn't thinking of run time cost, but script readability.
Same here. But it seemed to me Trever was starting to worry that
performance would drop, if ever so slightly, if we introduced this.
I mentioned it in case others were going to think it mattered.
> yes, though once N classes get mmapped for a CLASSIFY a single class LEARN
> can check for it and won't mmap() again, and msync() can be deferred
> iff other processes that use same class(es) do that via shared mem.
Yep. When you construct your scripts to handle classification and
subsequent learning in the same crm114 instance, you get that
advantage today.
A (very limited) 'server'-y approach doable right now is writing a
script which loops, waiting for messages available on disc or stdin,
and keep on processing them one after the other in the same instance:
you have the 'CSS stays in mem' benefit then as well (note: ignoring
how to code for cutting up stdin into messages and/or poll/wait for
disc-based messages here - that's another subject)
>> Want some real, achievable gain? convert crm114 to play 'server', i.e.
>> permanently loaded and CSS files (close to) permanently mapped in
>
> yes yes yes yes - the endless daemon saga :)
[...]
> yeah, maybe the ability to run pre-compiled scripts can be good idea
> for a number of applications.
You mean a kind of .java p-coded crm114 scripts, i.e. a real crm114
*compiler* (.crm --> .114 binary file) and, er, accompanying 'virtual
machine'? Oh boy, the table rises here. ;-P But that's just the geek
in me getting all excited. It's not on my list of 'things worth doing @
mid/short-term' though, but fun anyway. A crude/cheap way might be an
option to 'dump' and 'load' tokenized script as it leaves the crm114
tokenizer going to the execution unit. Tokenize once, run multiple
times.
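
The crude dump/load variant might look like the following Python sketch. Everything here is an assumption for illustration: the whitespace tokenize() is only a stand-in for the real crm114 tokenizer, the JSON dump format is invented, and load_tokens is a hypothetical name. The dump is reused only while a hash of the script source still matches.

```python
import hashlib
import json
import os


def tokenize(source):
    """Stand-in tokenizer: split on whitespace. Not CRM114's real tokenizer."""
    return source.split()


def load_tokens(script_path, cache_path):
    """Return (tokens, cache_hit), reusing a dump if the source is unchanged."""
    with open(script_path, "r", encoding="utf-8") as f:
        source = f.read()
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()

    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            cached = json.load(f)
        if cached.get("digest") == digest:
            return cached["tokens"], True   # tokenizer skipped

    tokens = tokenize(source)
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump({"digest": digest, "tokens": tokens}, f)
    return tokens, False
```

The sketch still reads and hashes the script on every run, so it only saves the tokenizing step itself; whether that pays off depends on script size.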
It's not worth it for me (I run tiny scripts) but all the folks out
there enjoying mailreaver and friends might get some good delight out
of that as mailreaver/mailtrainer are _significant_ sized scripts.
>> seriously considering hacking crm114 into becoming mod_crm114, i.e. an
>> Apache2 plugin: you get the server, the socket I/O and the
>
> like Apache's Lucene and derivatives.
Sorta. Yes.
>> live in there like a wicked PHP-alike server-side scripting language
>> and you will definitely achieve instant notoriety. ;-)
>
> and support headache ;)
I like my native Americans ;-))
Granted, moving from 1.3 to 2.0/2.2 wasn't easy for a mod_xxx, but
still I like it way more than 'roll your own [TCP-based] server'
again: linking it to Apache (and no, despite the fact that I do
Win32/64, I don't think I'll be the go-to guy if you want IIS plugin
support: IIS6 is nice, in a way, and has good performance, but I run
Apache on Windows for free projects and only do IIS for paying
customers. Got to draw the line _somewhere_. If they open-source IIS,
I'll reconsider that statement.)
Anyhow, the crm114 scripts would still be there as they are right now;
I would just take the std I/O and bend it so stdin = request and
stdout (and stderr?) == response. Maybe add a touch of XML if you want
to have a freeze-dried instant low-cal 'web service' (which is hot
stuff these days, but rather old wine in fashionable new Walmart bags
if you ask me, but then folks don't seem to study IT history anymore)
Why Apache really? Because I can then 'lean on' the stick provided by
them when I need to scale up: distributed servers, pardon, *services*,
and the whole bloody lot are documented already. Besides, my purposes
lead me towards a production environment as a 'web backend' anyhow, so
why not bolt it to the web server itself? Yup, doing so requires some
understanding of the Apache API interfacing and that's raising the
tech level by +1, but at least you can be spared some significant
intricacies regarding TCP/server performance tactics at server level.
It's fun to write it, but in this case, my feeling was it's faster to
go for mod_crm114 in dev time. And yes: that's 'faster' regarding a
_production quality_ mod_crm114 compared to _production quality_
crm114d (note the 'd').
(For free as well: SSL secured communications with the crm114
'service' - which might be something to cheer the 'remote services'
folks up quite a bit.)
Anyway, I don't 'do' the alpha release of mod_crm114 in one week, nor
can I deliver alpha stage crm114d in the same timeframe, so it'll
probably stay a great idea over whiskey on Friday as I don't see Bill
getting his hands on a particular red phone booth with free access
either. ;-)
> see above: CAN but definitely should not be a MUST.
Does my approach of 'set' as described at start of this email match
your CAN, or does it still sound like MUST to you?
>> ONE: strict adherence to 'backwards compatibility' at CRM114 script
>
> just one good reason.
well.... ;-)
-- 
Met vriendelijke groeten / Best regards,
Ger Hobbelt
--------------------------------------------------
web: http://www.hobbelt.com/
 http://www.hebbut.net/
mail: ge...@ho...
mobile: +31-6-11 120 978
--------------------------------------------------
