Hi, I wanted to know whether the C-callable library, LIBCRM114, would work for ideographic languages like Chinese, Korean or Japanese. As these languages do not have word boundaries, how would tokenization work? Is there a workaround, like converting the ISO-2022 encoding into UTF-8 before training and classifying? Or is there some other solution? Please provide feedback. -Viks
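A minimal preprocessing sketch for the question above, in Python: CRM114's default tokenizer splits on runs of printable non-blank characters, so unsegmented CJK text tends to collapse into a handful of very long tokens. One workaround (an assumption here, not something LIBCRM114 does for you) is to convert the encoding and insert artificial token boundaries, such as overlapping character bigrams, before handing the text to learn/classify. The codec name and the bigram scheme are illustrative choices; a real segmenter such as MeCab would likely do better.

    # Sketch: decode ISO-2022 input, re-encode as UTF-8, and insert spaces so a
    # whitespace/graph-character tokenizer sees word-like units. The character
    # bigram segmentation is a naive stand-in, not part of LIBCRM114.
    def prepare_cjk(raw_bytes, source_encoding="iso2022_jp"):
        text = raw_bytes.decode(source_encoding)
        # Overlapping character bigrams joined by spaces; swap in a proper
        # segmenter if one is available.
        bigrams = [text[i:i + 2] for i in range(len(text) - 1)] or [text]
        return " ".join(bigrams).encode("utf-8")

    if __name__ == "__main__":
        sample = "これはテストです".encode("iso2022_jp")
        print(prepare_cjk(sample).decode("utf-8"))

The resulting space-delimited UTF-8 text can then be passed to whichever learn/classify interface is in use, the command-line crm or the C library.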
Hello Bill, Thanks for your reply. I'm not a programmer, but I can find my way around Linux and do basic bash scripting. I looked at both mailfilter.crm and mailreaver.crm and was, with my current knowledge of the crm language, a bit overwhelmed at the prospect of modifying any one of those to my needs. So I would much prefer any command-line scripts that I could modify to test this out. Best Lars Sorensen On Dec 12, 2012, at 3:33 PM, ws...@me... wrote: > > Yes, CRM114 can do multi-class sorting; one of the test cases actually > does that (four classes, I believe). > > Now, a question: do you want to do this from command-line, or are you a > C programmer? The reason I ask is that we have two user-compatible but > NOT binary-compatible CRM114's now. > > - There's the command-line version, which has its own language; > > - There's the C-callable library (written in ANSI C) - you call it from > a program you write. (yes, there's example code, including, if I > recall correctly - four-class examples) > > Which would you prefer? > > - Bill
Hello, I have an email account that receives a fairly high volume of daily emails (500-800), and would like to categorize/classify these emails automatically into about 100 categories/folders. For the last two months I have been trying out POPFile (http://getpopfile.org/) with some limited success. After inspecting keywords and decision trees in POPFile, it seems to me that a classifier using phrases for classification might classify this type of email better than the Naive Bayes implementation in POPFile. As I'm not a programmer, but trying to learn, I have been searching for preexisting tools that might work for what I want to achieve. Searching the web, I see that leaves me with two options: CRM114 or OSBF-lua as classifiers, and as I understand it CRM114 now uses the OSBF classifier as the default! Are there any implementations/scripts out there that allow multiple classes for general email sorting using CRM114 or OSBF-lua as the classification engine? It seems from what I read that this should be possible, but I'm unable to find any practical implementations to test with. As I understand it, both mailfilter.crm and mailreaver.crm use only 3 classifications: 1. spam, 2. nonspam, 3. unsure, so I presume these would not be useful for me in this regard. I could use some advice on how to go about this the right way. Are there any scripts or tools out there that will do general email classification with CRM114 or OSBF-lua and that could be implemented with maildrop or procmail on a Linux OS? Any ideas or pointers would be greatly appreciated. Best Lars Sorensen
Hi, I'm trying to use crm114 on our mail server to filter bounced messages into these categories: user_unknown, host_not_found, relay_denied, mailbox_full, mailbox_blocked, detected_as_spam, on_vacation, message_too_large, not_a_bounce, unknown. I'm using the learn and classify commands from this script: https://github.com/samdeane/code-snippets/blob/master/python/crm.py

categorization: "<osb unique microgroom>"
learn: "'-{ learn %s( %s) }'"
classify: "'-{ isolate (:stats:); classify %s( %s) (:stats:); match [:stats:] (:: :best: :prob:) /Best match to file .. \(%s\/([[:graph:]]+)\\%s\) prob: ([0-9.]+)/; output /:*:best:\\t:*:prob:/ }'"

My question is: which categorization method would you suggest to achieve this kind of filtering? Thanks, Matthieu
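A sketch of how the two templates above can be combined for this kind of multi-class sorting, in Python: each category gets its own .css file, the classify call names all of them, and the match pulls out the best-scoring file. The category names, flags, and file layout are illustrative assumptions taken from the message above, not a tested configuration.

    import subprocess

    FLAGS = "<osb unique microgroom>"
    CATEGORIES = ["user_unknown", "host_not_found", "relay_denied", "mailbox_full",
                  "detected_as_spam", "not_a_bounce", "unknown"]
    CSS = " ".join(c + ".css" for c in CATEGORIES)

    def learn(message, category):
        # Train the message into the named category's .css file; stdin is the
        # default data window for an inline crm program.
        prog = "-{ learn %s (%s.css) }" % (FLAGS, category)
        subprocess.run(["crm", prog], input=message, check=True)

    def classify(message):
        # Classify against all category files and parse the "Best match" line,
        # mirroring the classify template quoted above.
        prog = ("-{ isolate (:stats:); classify %s (%s) (:stats:); "
                "match [:stats:] (:: :best: :prob:) "
                r"/Best match to file .. \(([[:graph:]]+)\.css\) prob: ([0-9.]+)/; "
                r"output /:*:best:\t:*:prob:/ }") % (FLAGS, CSS)
        out = subprocess.run(["crm", prog], input=message,
                             capture_output=True, check=True).stdout
        best, prob = out.decode().strip().split("\t")
        return best, float(prob)

learn(msg, "mailbox_full") trains one category at a time, and classify(msg) returns the best-matching category name plus its probability, which a maildrop or procmail recipe can then act on. As for which classifier flags to pick, that choice is left to the thread; <osb unique microgroom> is simply what the quoted script defaults to.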
The only spot where he seems to be aware of incoming news/messages is his Facebook, and someone tried to reach him (Simon Vans-Colina) there. No lights there, though. The point is that, if this guy's work was ever an open-source project, I was wondering whether anyone had a piece of it, or any other implementation of CRM114 for CV classification in recruiting. Thanks again! Alejandro On Thu, 20-11-2008 at 10:39 +0100, Gerrit E.G. Hobbelt wrote: > > I see he's on LinkedIn; did you try to reach him there?
Sorry, can't help you out. I see he's on LinkedIn; did you try to reach him there? Take care, Ger Alejandro Fernandez Japkin wrote: > Hello everyone, > > I'm in the middle of a hurry that includes implementing CRM114 > as a CV -resume- classifier for hiring purposes. Is of my understanding > that someone named "Simon Vans-Colina" was involved in some tool > on this subject, but the few links available over the net are just dead. > Is there *anyone with *any information about this? I'd really appreciate > a straight answer, since i'm running out of time and i want this > monkey off my back. Writing from scratch is not an option at the point i > am. > > Thanks really -a lot > > > Alejandro -- Met vriendelijke groeten / Best regards, Ger Hobbelt -------------------------------------------------- web: http://www.hobbelt.com/ http://www.hebbut.net/ mail: ge...@ho... mobile: +31-6-11 120 978 --------------------------------------------------
Hello everyone, I'm in a hurry to implement CRM114 as a CV (resume) classifier for hiring purposes. It is my understanding that someone named "Simon Vans-Colina" was involved in some tool on this subject, but the few links available over the net are just dead. Is there *anyone* with *any* information about this? I'd really appreciate a straight answer, since I'm running out of time and I want this monkey off my back. Writing from scratch is not an option at the point I'm at. Thanks, really, a lot. Alejandro
On Wed, Sep 3, 2008 at 3:09 PM, Bill Yerazunis <ws...@me...> wrote: > More like: > > LEARN ( c1.stat c2.stat | c5.stat ... c127.stat) < osbf unique> [my.txt] > > which means "train my.txt in as a positive example in statistics files C1 > and C2, and as a negative example in files C5 through C127". If a > file is not found, initialize it as "osbf unique", otherwise use > the self identification in the file to choose the correct learning > method. Whoa. I am probably OD'ing on Microsoft Excel right now so my 'grok' is down to zero, but can you please run that "self identification" bit by me again? Or is that something along the lines of 'open file, read header, check classifier id+config in there, *then* jump to classifier? (Which can be done, if you provide the 'csscreate' script opcode or some such (which is only a stupid stub in GerH now, btw) which is then to be used to 'create/set-up' any new CSS file. (mailreaver's 'learn zilch' trick to create css on the fly has to be replaced then with such a csscreate opcode.) Am I thinking too 'classical/procedural' here regarding learn? Anyway, from what I read in your text is that you're going for something like this: assume message M which will be classified, then [unidentified intelligent code] will train message M as 'spam' or 'ham' --> code assuming auto-ID'ing classifier as described above so no attributes needed: classify (S|H) [M] ... learn (S|H) [M] --> learn as spam (left side is 'S'pam CSS files, right is 'H'am CSS) ... learn (H|S) [M] --> learn as ham (because now 'H'am is at left) which means you rotate the S/H CSS file[s] [collections] around that | pipe symbol there. That would be identical - I think - to Paolo's learn (S|H) <1> [M] --> learn '1st' side == left side == spam ... learn (S|H) <2> [M] --> learn '2nd' side == ham Now for multiclass A|B|C|D|... it would probably work the same, you just rotate the proper class E {A,B,C,D,...} (E == element of, no math symbols in email) to the front while Paolo's would send along the proper 'index' value as an attribute or some such. If it's like that, I'd rather have the 'indexed' variant instead of the 'rotated around | pipe' style because it would take one isolated var only to schlepp that bunch around and it saves on possible if/else conditionals as well, because I might be able to blunty derive index i E {<1>, <2>, ..} from a previously determined pR using a bit of :@: math, but that's just me. The 'rotating' style is auto-backwards compatible (while keeping 'details' like <refute> outside that equation for now) when you have 'optional pipe' instead of 'required pipe' (and provided "you know what you are doing" caveat applies to script writer). Meanwhile, SVM still has 2 pipes and 3 files where anybody else uses A|B (1 pipe, 2 files) for same, so there's still a bit of 'irregularity' there to my mind, but then I probably should stick to looking at lotsa numbers in rectangles instead of attempting brain activity today. -- Met vriendelijke groeten / Best regards, Ger Hobbelt -------------------------------------------------- web: http://www.hobbelt.com/ http://www.hebbut.net/ mail: ge...@ho... mobile: +31-6-11 120 978 --------------------------------------------------
On Wed, Sep 3, 2008 at 9:27 AM, Paolo <oo...@us...> wrote: > On Wed, Sep 03, 2008 at 05:03:35AM +0200, Ger Hobbelt wrote: > ... >> Given 60 classes (= CSS files), Paolo can have his KISS and I can eat >> my pie too. Simple. > > let me stress once again that I question the _requirement_. [...] > agreed, provided that > > ! learn (:*:s:) [message] <i flags> > > where :s: is just a subset (1 as limit, so that 'i' can be dropped) of the > N classes in use, is (remains) legal (where allowed). Yes, that should be possible in my line of thought. Assuming "you know what you're doing" i.e. are aware of classifier internals, you can do this in the new 'learn'. Take existing OSB for example (*forget* my 'delta' stuff for a sec there), which touches only a single CSS file on learn, then learn (A|B) <1> is identical to learn (A) <1> is identical to learn (A) is identical to learn (A|B) because <1> is a possible 'default' -- though that might be a disputable thing - I'd rather see an error report, because learn (A|B) isn't 'obviously' going to teach the way of A. The thing I'm really after is that at script level learn (A|B|...) <i> is supported for _all_ classifiers. When you're doing smart stuff script-wise where you like to code learn (A) while you classify code is classify (A|B|C|D|E|F|..) fine. The bit of 'cut at pipe, pick the ones you want' code I envision can handle it, so you've got options script-code-wise. In other words: a 'set' of one, is still a set in my book. That you as a script writer might want to take that thought to the edge (set of 1) is fine with me. I always appreciate that kind of craftiness. It's just that the starting point shifts for people new to this: keep the set around and apply to both classify and learn equally. When you are ready to read the fine print in the manual, you can decide to use 'set of 1' as a valid 'fringe case' (fringe from script-language structural point of view). What I *need* is learn (A|B) support for classifiers that don't have it yet (OSB and friends) and currently there's no possibility for coding learn (A|B) <i osb> so I am prevented from testing my ideas for the classifier itself. >> what they want/need, you get the chance to apply filters & processes >> in learn that are simply impossible right now PLUS you don't have to > > that's C level, SVM wants 3 because it uses 3 in both cases. Aicks! You _got_ me there. Forgot the 3rd one in SVM. DANG! Still a remaining 'oddity' hence. :-(( No good answer there expect mumbling about the implicit 'variable size' of a 'set' as I approach it. > ! learn (a b a_v_b) <svm flags> # wants all 3 > ! classify (a b a_v_b) <svm flags> (s_svm) # wants all 3 > ! classify (a b) <xxx flags> (s_xxx) # can't use the extra a_v_b > >> So no unified ... mess; I'd say it's unified ... structure / design. > > maybe, but that's not as simple as saying : > define: N classes > hence: LEARN(1 2 ... N) > CLASSIFY(1 2 ... N) > which might turn into a mess, or better shift the mess from one place to > another. Sure it's a shift: out of the [script] language, so it's 'black boxing' learn as it is classify, and into the [C] code. I think for general use it's less mess because you need to 'remember' less about the script language and the 'learn' interface, because apart from the extra index (in a sense you're _feeding_ it the pR which would pop out of classify as a result) it's exactly like classify. 
I really like language layout where general use requires the least number of 'rules' and 'details' to be remembered: it makes for a simpler language overall which is good for me as I work with multiple languages and a limited brain. ;-) (This learn/classify stuff is - in a way - comparable to old discussions about 'coding standards' and such for Pascal or C, where there's a class of folks that say: "you can skip the braces/begin-end and the semicolons so you should" while I am clearly with the folks that say: "don't matter what you do, always apply the same structure: braces/begin-end and semicolons and stuff, unless it is _prohibited_ by the language". Right now 'learn (A)' is prohibiting me from using 'learn (A|B)'. I think that bit didn't make it through last night.) >> Cost for Trever @ 60 classes? nil. > > wasn't thinking of run time cost, but script readability. Same here. But Trever was starting to worry, it seemed to me, performance would drop, if ever so slightly, if we'd be introducing this. And in case others were going to think it mattered. > yes, though once N classes get mmaped for a CLASSIFY a single class LEARN > can check for it and won't mmap() again, and mmsync() can be deferred > iff other processes that use same class(es) do that via shared mem. Yep. When you construct your scripts to handle classification and subsequent learning in the same crm114 instance, you get that advantage today. A (very limited) 'server'-y approach doable right now is writing a script which loops, waiting for messages available on disc or stdin, and keep on processing them one after the other in the same instance: you have the 'CSS stays in mem' benefit then as well (note: ignoring how to code for cutting up stdin into messages and/or poll/wait for disc-based messages here - that's another subject) >> Want some real, achievable gain? convert crm114 to play 'server', i.e. >> permanently loaded and CSS files (close to) permanently mapped in > > yes yes yes yes - the endless daemon saga :) [...] > yeah, maybe the ability to run pre-compiled scripts can be good idea > for a number of applications. You mean a kind of .java p-coded crm114 scripts, i.e. a real crm114 *compiler* (.crm --> .114 binary file) and, er, accompanying 'virtual machine'? Oh boy, the table rises here. ;-P But that's just the geek in me getting all exited. It's not on my list of 'things worth doing @ mid/short-term' though, but fun anyway. A crude/cheap way might be an option to 'dump' and 'load' tokenized script as it leaves the crm114 tokenizer going to the execution unit. Tokenize once, run multiple times. It's not worth it for me (I ran tiny scripts) but all the folks out there enjoying mailreaver and friends might get some good delight out of that as mailreaver/mailtrainer are _significant_ sized scripts. >> seriously considering hacking crm114 into becoming mod_crm114, i.e. an >> Apache2 plugin: you get the server, the socket I/O and the > > like Apache's Lucene and derivatives. Sorta. Yes. >> live in there like a wicked PHP-alike server-side scripting language >> and you will definitely achieve instant notoriety. 
;-) > > and support headache ;) I like my native Americans ;-)) Granted, moving from 1.3 to 2.0/2.2 wasn't easy for a mod_xxx, but still I like it way more than 'roll your own [TCP-based] server' again: linking it to Apache (and no, despite the fact that I do Win32/64, I don't think I'll be the go-to guy if you want IIS plugin support: IIS6 is nice, in a way, and has good performance, but I run Apache on Windows for free projects and only do IIS for paying customers. Got to draw the line _somewhere_. If they open-source IIS, I'll reconsider that statement.) Anyhow, the crm114 scripts would still be there as they are right now; I would just take the std I/O and bend it so stdin = request and stdout (and stderr?) == response. Maybe add a touch of XML if you want to have a freeze-dried instant low-cal 'web service' (which is hot stuff these days, but rather old wine in fashionable new Walmart bags if you ask me, but then folks don't seem to study IT history anymore) Why Apache really? Because I can then 'lean on' the stick provided by them when I need to scale up: distributed servers, pardon, *services*, and the whole bloody lot are documented already. Besides, my purposes lead me towards a production environment as a 'web backend' anyhow, so why not bolt it to the web server itself? Yup, doing so requires some understanding of the Apache API interfacing and that's raising the tech level by +1, but at least you can be spared some significant intricacies regarding TCP/server performance tactics at server level. It's fun to write it, but in this case, my feeling was it's faster to go for mod_crm114 in dev time. And yes: that's 'faster' regarding a _production quality_ mod_crm114 compared to _production quality_ crm114d (note the 'd'). (For free as well: SSL secured communications with the crm114 'service' - which might be something to cheer the 'remote services' folks up quite a bit.) Anyway, I don't 'do' the alpha release of mod_crm114 in one week, nor can I deliver alpha stage crm114d in the same timeframe, so it'll probably stay a great idea over whiskey on Friday as I don't see Bill getting his hands on a particular red phone booth with free access either. ;-) > see above: CAN but definitely should not be a MUST. Does my approach of 'set' as described at start of this email match your CAN, or does it still sound like MUST to you? >> ONE: strict adherence to 'backwards compatibility' at CRM114 script > > just one good reason. well.... ;-) -- Met vriendelijke groeten / Best regards, Ger Hobbelt -------------------------------------------------- web: http://www.hobbelt.com/ http://www.hebbut.net/ mail: ge...@ho... mobile: +31-6-11 120 978 --------------------------------------------------
On Wed, Sep 03, 2008 at 05:03:35AM +0200, Ger Hobbelt wrote: ... > Given 60 classes (= CSS files), Paolo can have his KISS and I can eat > my pie too. Simple. let me stress once again that I question the _requirement_. > The set passed to classify is a set and should be passed to learn as right, if we had vector/array struct it'd be 'natural' ... > isolate (:c:) /class1 | class2 | and so on .../ ... > classify (:*:c:) [message] which is a fake vector, works on strict assumptions on how to name var/classes. Like in other situations, having true array data structure would be quite useful. > learn (:*:c:) (index) [message] > > Both look good to _me_. ;-) agreed, provided that ! learn (:*:s:) [message] <i flags> where :s: is just a subset (1 as limit, so that 'i' can be dropped) of the N classes in use, is (remains) legal (where allowed). > Because you always pass along the whole set at script level, the > classifier code (both learn and classify implementation) gets to pick there's no need for that, where's the binding between script level and classifiers implementation? eg I can define N classes, but use any subset for both LEARN / CLASSIFY at any point to my taste/needs, with the limit of the actual classifier's requirement: !# use classes: one two three four five six seven ! learn (one two three four five six seven) <i flags> [msg_x] ! learn (three four five) <i flags> [msg_y] ! learn (one) <flags> [msg1] ! ... ! classify (one two three four five six seven) <flags> ! classify (five six seven) <flags> ! classify (three six seven) <flags> ! classify (one three four six seven) <flags> ! classify (six) <flags> (cm) # class membership -> cm, unsupported atm ... > what they want/need, you get the chance to apply filters & processes > in learn that are simply impossible right now PLUS you don't have to that's C level, SVM wants 3 because it uses 3 in both cases. > worry anymore either which classifier you're gonna use because today > all the bloody buggers require their own particular incantation when > it comes to number of css files (classes) passed to learn. there are categories of classifiers that have same requirements wrt #classes and params. Now suppose the actual classes are compatible, but one classifier needs 1+ extras (eg SVM) and I want to compare classifiers, then it'd be nice to do (SVM case, forget 4now actual class compatibility): ! learn (a b a_v_b) <svm flags> # wants all 3 ! classify (a b a_v_b) <svm flags> (s_svm) # wants all 3 ! classify (a b) <xxx flags> (s_xxx) # can't use the extra a_v_b > So no unified ... mess; I'd say it's unified ... structure / design. maybe, but that's not as simple as saying : define: N classes hence: LEARN(1 2 ... N) CLASSIFY(1 2 ... N) which might turn into a mess, or better shift the mess from one place to another. > Cost for Trever @ 60 classes? nil. wasn't thinking of run time cost, but script readability. > You save far more time when you find a way to reduce disc I/O cache > misses on your memory-mapped CSS files, even when you achieve such a > feat for learn alone (which would be rather weird and besides, unless > you 'Train Everything', optimizing classify is the winner). I have a yes, though once N classes get mmaped for a CLASSIFY a single class LEARN can check for it and won't mmap() again, and mmsync() can be deferred iff other processes that use same class(es) do that via shared mem. > Want some real, achievable gain? convert crm114 to play 'server', i.e. 
> permanently loaded and CSS files (close to) permanently mapped in yes yes yes yes - the endless daemon saga :) > invocation of crm114 and the moment the script *tokenizer* kicks in. > You're not even *executing* script yet by then! The rest (8%) is > spread across tokenizing ('compiling the [small!] script'), tokenized > script code execution, wrap-up and unidentified fluff elsewhere. > Believe me, if I'd see an easy way to kick that bugger into higher > gear, you'd already have it. yeah, maybe the ability to run pre-compiled scripts can be good idea for a number of applications. > seriously considering hacking crm114 into becoming mod_crm114, i.e. an > Apache2 plugin: you get the server, the socket I/O and the like Apache's Lucene and derivatives. > live in there like a wicked PHP-alike server-side scripting language > and you will definitely achieve instant notoriety. ;-) and support headache ;) > Anyhow, I don't see any good reason why the learn (classes) argument > cannot be identical to the related classify (classes) argument, except see above: CAN but definitely should not be a MUST. > ONE: strict adherence to 'backwards compatibility' at CRM114 script just one good reason. -- paolo
On Tue, Jul 22, 2008 at 1:21 AM, Chris Babcock <cba...@as...> wrote: > The answer to these questions might very well be, "test and measure". > If that's the case, I appreciate pointer to whatever help is available for the methodology since CRM114 and working with text classification in general are new to me. (I hate Perl, but I'm pretty handy with sed... Go figure.) > > How do you calculate the hardware requirements, especially the size of > CSS files needed, with a CRM114 program? Okay, let's try my hand at this on the quick. First of all, there's the classifier you pick. Different classifier, different behaviour, different size requirements. A few of them require unlimited space (CSS files grow a little every time), most of them have fixed space requirements. I'm a bottom-up guy most of the time, so let's start bottom up for some goodness. All 'production quality' classifiers (that's OSB and friends) are based on a fixed-sized hash table. Since the method chosen to store stuff in that hash table is the 'linear probe' algorithm, you should never ever try to get beyond the 50% fill rate point as then the hash table performance is QUICKLY deteriorating (quite a few papers on that; 50% isn't 'hard' but a 'rule of thumb' number). Since the hash table is filled with hash elements and we assume a reasonable quality hash here (flat distribution in N dimensions and bla bla bla), best case is a flat fill. To satisfy the mechanical engineer in me who's learned there's no such thing as a 'best case' in daily practice unless you get your lifetime's joss delivered in a single day, we add a fudge factor and after sucking on my thumb (mjam) I'd say a fill of about 20-30% would be swell. Performance? Can be assumed to be about near flat (O(1)); don't have 'live numbers' on this one, but my guess is bigger CSS files will be slower due to more chances at disc and CPU cache misses while poking at the hash table entries. Fast disc I/O helps there. RAID5, maybe RAID6 or other dandiness... Now one element is one hash plus a number, clocking in at 4 bytes each, so at N elements, that's N*8 bytes disc space per 'feature'. Given a 32-bit box and a safety margin of 2 for signed versus unsigned queerness -- NOT to be mistaken with the use of signed versus unsigned int discussed elsewhere ;-) -- that's a max size of 2GB / 8 = 2**(31-3) = 2**28 elements, then use 25% fill because it's a nice number (1/2**2) that means we can store 2**26 elements max on 32-bit 'without noticeable loss of performance'. (Thanks to the 2G instead of 4G edge I also have a fighting chance at getting this to actually /work/ on such a box as CRM114's using memory mapping and we can't eat all for just the CSS file. No space left for the binary and misc data there.) Ah! But to classify you need two CSS files at least! And given our memory mapping is done all at the same time, I'll dial down that max(N) number to 2**25. Because you can smell the napalm from here when looking at your numbers, let's quickly see what 64-bit has on offer in 'best case'; and that would include additional money for harddisc technology researchers an' all: 2**(63-3-2-1) ==> max(N) storage capacity at 2**57 which would mean you're good to go, topping out somewhere beyond 0.125 ExaFeatures (where one Feature is one CRM114 token a.k.a. 'hash'). If you get that kinda space, could I maybe charge you please for a measly commission fee in the form of .00001% of your disc space, yes? MY problems are solved then. ;-) So far the 'practical' limits. 
Now from your side of the fence: Taking 17*5*(175!/158!) on faith (this is my morning coffee, and it's gotta be fast, so I DO believe) at 6K sized docs? Hmmmm.... Let's just assume one doc is one(1) Feature (it probably isn't but what the heck, my backbone already feels where this is going; trying to beat Big Blue at it, are we, eh? :-)) ) that would mean, say, n!/m! =~= 100**(n-m) for m >= 100 here (and that's a BIG lie! but a really sweet one.) ==> we're going to be hit by a feed of over 17*5*(100**(175-158)) =? and since we're ballparking here like there's no tomorrow, that'd be somewhere beyond 100**(175+1-158)=100**18 == 10**36 which is somewhere over the rainbow and beyond an Exa SQUARED. Like the backbone already knew: ...OOPS?! Not to be the bee in your bonnet - I like the idea! :-)) - but (a) all them Features are never ever gonna fit, even when you get unlimited sponsoring by Hitachi and IBM, heck, you /buy/ them, and (b) assuming for now that (pre-)calculating/learning/whatever one such item takes about a single modern day CPU clock tick, i.e. ~ 1.0**-9 seconds, which is rather optimistic and out the /other/ side, you'd /still/ be at it when the Four Horsemen are having a snack on our offspring. Of course, we can make the bugger 'learn on the job' (don't we all?) and then it turns into the question of 'lifetime': how much do you want it to learn and how good should the bugger be at playing Diplomacy... in the end? Because there's surely to be found 'pathways' in that data a.k.a. 'successful strategies'. Guestimating what learning _those_ will cost is _way_ beyond the morning coffee, though. Sooooo... getting that Diplomacy-playing Big Blue going somewhere during /this/ lifetime, brute force ain't gonna cut it. Assuming the above was kicking in an open stabledoor (but fun!), the plus benefit of it all is that we have one practical usable result here: if you know how many different words ('features') you want this Bayes box to 'remember', you can take that number (N), multiply it by 4*8=32 for a 25% filled OSB[F].Markov/... classifier and your advised/preferred CSS file size would be N*32 bytes. To be eligible for one(1) yes/no style classification question ("is it or isn't it?"), that takes two(2) CSS files, so total disc /cost/ would be about N*64 bytes, excluding a negligible bit of header icing on the cake. Of course, that doesn't say nothing at what a 'feature' would BE in your case; in email it's generally one word, but that's also an 'it depends...' so there's lots of puzzling to do before you hit the Bayes box. Big question before plugging it in: exactly _what_ are we going to feed the animal? (See also a blurb about stocks analysis a few months ago in this ML. Simply plonking in raw data ain't gonna cut it. Same here.) On another note - before I run: if you want 'win/loose/draw' three-ways or other 'multiway' decisions, it is theoretically (and practically) possible with CRM114; for every extra choice you have to add one(1) more CSS file (and a | pipe symbol in your script). Multiway weighting is 'supported' in the code but I haven't heard about anybody actively using it since I first popped by in autumn 2007 so software-wise YMMV, Caveat Emptor, pick your classifier wisely and all that and here's a rabbit foot as well. Cause you're gonna _need_ it. Having unsettled you sufficiently, I'm exit left outa here. The laboring masses and all that. Still, I love your idea. Tip if you want to pursue this: check out what the chess boys have been doing. 
Same problem; smaller scale (cough). > > One of my long range projects is to write an AI for the boardgame > Diplomacy using CRM114. The approach is to archive combinations of turn > results and moves sorted by how favorable the outcome was. The program > builds movement sets by parsing game results to determine the > disposition of its units then consults a movement matrix to generate > all possible order sets. Each movement set and result combination is > submitted to the classifier to determine how closely it "resembles" > winning combinations from games in its training. > > What do I need to know in order to estimate the necessary size of the > CSS file? There's ~175 unit dispositions and an average of 5 possible > destinations for each unit. A well trained classifier which has > not eliminated any trivial cases will have no more than 17*5*(175!/158!) > documents of ~6 KB each - consisting of an order set and the results that produced it - in the corpus for each of the 7 game powers and 5 game phases. I need to be able to determine the optimal classification granularity (number of categories to sort to). Logically, I think that I would get the best speed and accuracy sorting to "win" and "not a win", but that only holds true as long as the classification file is within program limits without microgrooming. Dividing the classification files into more > outcomes - "win", "draw (by size of draw)" and "elimination" - doesn't reduce the size of the "win" CSS file, nor does any other outcome-based > refinements in classification. If I divide the classification according > to a metric of game progress then I can effectively reduce the size of > the CSS files at the expense of calculating that metric each turn. Are > there any guidelines for determining how the size of the CSS files > affects classification speeds? > > Chris > > > > > > > ------------------------------------------------------------------------- > This SF.Net email is sponsored by the Moblin Your Move Developer's challenge > Build the coolest Linux based applications with Moblin SDK & win great prizes > Grand prize is a trip for two to an Open Source event anywhere in the world > http://moblin-contest.org/redirect.php?banner_id=100&url=/ > _______________________________________________ > Crm114-discuss mailing list > Crm...@li... > https://lists.sourceforge.net/lists/listinfo/crm114-discuss > > -- Met vriendelijke groeten / Best regards, Ger Hobbelt -------------------------------------------------- web: http://www.hobbelt.com/ http://www.hebbut.net/ mail: ge...@ho... mobile: +31-6-11 120 978 --------------------------------------------------
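To make the rule of thumb at the end of that answer concrete (8 bytes per hash bucket, roughly 25% table fill, two .css files for a yes/no decision), here is the arithmetic in Python; the one-million-feature figure is an arbitrary illustration, not a recommendation.

    # Worked example of the sizing rule of thumb above: 8 bytes per bucket,
    # ~25% table fill, two .css files for a binary decision.
    def css_bytes(distinct_features, fill=0.25, classes=2):
        buckets_per_class = distinct_features / fill   # 4x headroom at 25% fill
        return int(buckets_per_class * 8 * classes)    # 8 bytes per bucket

    print(css_bytes(1_000_000))   # 64000000, i.e. the N*64 bytes mentioned above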
The answer to these questions might very well be, "test and measure". If that's the case, I appreciate pointer to whatever help is available for the methodology since CRM114 and working with text classification in general are new to me. (I hate Perl, but I'm pretty handy with sed... Go figure.) How do you calculate the hardware requirements, especially the size of CSS files needed, with a CRM114 program? One of my long range projects is to write an AI for the boardgame Diplomacy using CRM114. The approach is to archive combinations of turn results and moves sorted by how favorable the outcome was. The program builds movement sets by parsing game results to determine the disposition of its units then consults a movement matrix to generate all possible order sets. Each movement set and result combination is submitted to the classifier to determine how closely it "resembles" winning combinations from games in its training. What do I need to know in order to estimate the necessary size of the CSS file? There's ~175 unit dispositions and an average of 5 possible destinations for each unit. A well trained classifier which has not eliminated any trivial cases will have no more than 17*5*(175!/158!) documents of ~6 KB each - consisting of an order set and the results that produced it - in the corpus for each of the 7 game powers and 5 game phases. I need to be able to determine the optimal classification granularity (number of categories to sort to). Logically, I think that I would get the best speed and accuracy sorting to "win" and "not a win", but that only holds true as long as the classification file is within program limits without microgrooming. Dividing the classification files into more outcomes - "win", "draw (by size of draw)" and "elimination" - doesn't reduce the size of the "win" CSS file, nor does any other outcome-based refinements in classification. If I divide the classification according to a metric of game progress then I can effectively reduce the size of the CSS files at the expense of calculating that metric each turn. Are there any guidelines for determining how the size of the CSS files affects classification speeds? Chris
Hi, I assume that the numbers you reported are for the testset which was NOT trained as the numbers are lower than 70% of 266. Anyway, I would not worry too much about your numbers in relation to crm114 performance. Nothing which makes my eyebrows go up. I'm rather surprised crm114 got this far on its own, really. The problem lies elsewhere as it looks like you are running into the same /fundamental/ issue as I did when I decided to use crm114 for my signal analysis. The basic two questions you should answer for yourself first and foremost are: 1a- how do Bayesian and other statistical filters like crm114 work EXACTLY? (I refer to recent discussion in the crm114-developer mailing list (Bill/Paolo/Ger) where crm114 innards are explained and discussed using the analogy of a sandbox, green and red balls and a gold ball. It's way too much to reproduce here, but read up on that and make sure you understand what's going on. Research the algorithms used by crm114, before you continue. Key element to understand is how crm114 compares data elements to arrive at similarity figures. Which leads to question 1b- ask yourself where in your data is the 'equality' / identity in elements in the evaluated inputs, which is a low level engineering question derived from the second major question: 2- what are the metrics I want crm114 to compare to help me arrive at the answers which I seek? And which answer am I looking for, really? NOTE: express answers in both functional goals (for yourself) and technical implementation terms, because you are designing the automation of a 'human' system here, so you must be able to instruct the computer what to do /exactly/ what you want it to do to emulate the human process you try to model. Tip of the week: This implies, technically speaking, that you /may/ find you need to preprocess your data. I give this rather generic answer, because I believe it will help you far more in understanding the core of what you are doing than when I focus on a little detail (symptom) in your email and maybe up your successrate right now. Understanding what is going on in there is mandatory for anyone wishing to use statistical filters in a domain where they have not been 'preconfigured' by other researchers for you. > another. Is my understanding correct? Also, I found each time crm114 is made > to learn the same thing, it produces different classification result on > testing case. Is there a correct behavior? A few bits of info are lacking to answer this, but when there's no randomness involved in any way, the process should be completely reproducible, i.e. provide you with the same results after every complete re-run. Some learning methods (when you use mailtrainer for instance) /may/ employ randomizer learn ordering, which will jolt results for test sets; more so for small test sets like yours. Of course, further questions and results are welcomed. Best regards, Ger Hobbelt On Fri, Apr 18, 2008 at 9:48 PM, Weide Zhang <wz...@gm...> wrote: > > > Hi, I am using crm114 to do text mining on stock annual report 10K to make > prediction on their performances. The sample has 266 rows, each containing 1 > column indicating their annual report segment, and the other indicating > whether or not they perform better in that year compared to the industry > average. I use 70% of the data(data before 2006) as training and I tried > different training method. > > Below are the correct number for each category('good' and 'bad' meaning > perform better or worse). 
I use the python wrapper found on the crm114 > wiki. The accuracy is quite low and I notice that for osbf, there are no bad > case that are classified correctly and for markov, only 2 good cases are > classified correctly. It seems that the algorithms is biased one over > another. Is my understanding correct? Also, I found each time crm114 is made > to learn the same thing, it produces different classification result on > testing case. Is there a correct behavior? > > > good bad > entropy corr 14 6 > total 18 19 > > markov corr 2 14 > total 18 19 > > osb corr 10 4 > total 18 19 > > osbf corr 18 0 > total 18 19 > > Thanks for your answer, > > Weide > ------------------------------------------------------------------------- > This SF.net email is sponsored by the 2008 JavaOne(SM) Conference > Don't miss this year's exciting event. There's still time to save 100ドル. > Use priority code J8TL2D2. > > http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone > _______________________________________________ > Crm114-discuss mailing list > Crm...@li... > https://lists.sourceforge.net/lists/listinfo/crm114-discuss > > -- Met vriendelijke groeten / Best regards, Ger Hobbelt -------------------------------------------------- web: http://www.hobbelt.com/ http://www.hebbut.net/ mail: ge...@ho... mobile: +31-6-11 120 978 --------------------------------------------------
Hi, I am using crm114 to do text mining on stock annual reports (10-K) to make predictions about their performance. The sample has 266 rows, each containing one column with the annual report segment and another indicating whether or not the company performed better that year than the industry average. I use 70% of the data (data before 2006) as training and I tried different training methods. Below are the correct counts for each category ('good' and 'bad' meaning perform better or worse). I use the python wrapper found on the crm114 wiki. The accuracy is quite low, and I notice that for osbf no bad cases are classified correctly and for markov only 2 good cases are classified correctly. It seems that the algorithms are biased toward one class over the other. Is my understanding correct? Also, I found that each time crm114 is made to learn the same thing, it produces different classification results on the test cases. Is that the correct behavior?

                 good   bad
entropy  corr      14     6
         total     18    19
markov   corr       2    14
         total     18    19
osb      corr      10     4
         total     18    19
osbf     corr      18     0
         total     18    19

Thanks for your answer, Weide
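For context, the counts above imply the following overall accuracies on the 37-document test set (18 good plus 19 bad); this is plain arithmetic on the reported numbers, not a re-run of the experiment.

    # Overall accuracy implied by the per-class correct counts above.
    results = {"entropy": (14, 6), "markov": (2, 14), "osb": (10, 4), "osbf": (18, 0)}
    for name, (good_ok, bad_ok) in results.items():
        print("%-8s %.0f%%" % (name, 100.0 * (good_ok + bad_ok) / 37))
    # entropy 54%, markov 43%, osb 38%, osbf 49%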
Thanks for your response. But if I have, for example, two words (N=2) and put that in the formula, the resulting weight is 16 (2^(2*2)) and not 4. Where is my mistake? -----Original Message----- From: crm...@li... [mailto:crm...@li...] On Behalf Of Paolo Sent: Monday, January 07, 2008 9:58 PM To: crm...@li... Subject: Re: [Crm114-discuss] Question about the weighting formula in the plateau paper On Fri, Jan 04, 2008 at 04:41:40PM +0100, Tobias Schneider wrote: > Weight = 2^2N > > Thus, for features containing 1, 2, 3, 4, and 5 words, the weights of > those features would be 1, 4, 16, 64, and 256 respectively." > > > What does the variable N in the weighting formula stand for? I think you get the answer in the following slide: (3) the 2^2N weighting means that weights were 1, 4, 16, 64, 256, ... for the span lengths of 1, 2, 3, 4, 5 ... words Thus N stands for the number of words in the N-gram. HTH -- paolo GPG/PGP id:0x1D5A11A4 - 04FC 8EB9 51A1 5158 1425 BC12 EA57 3382 1D5A 11A4 - 9/11: the outrageous deception and ongoing coverup: http://911review.org -
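One way to reconcile the numbers in this thread, offered as a reading of the listed values rather than a statement of what the paper intended: the sequence 1, 4, 16, 64, 256 matches 2^(2(N-1)), i.e. 4^(N-1), when N counts the words in the feature, so a two-word feature gets weight 4; the 2^2N form reproduces the same sequence only if N starts at 0.

    # The listed weights 1, 4, 16, 64, 256 for features of 1..5 words match
    # 4 ** (N - 1) == 2 ** (2 * (N - 1)); this is arithmetic on the quoted
    # sequence, not a claim about the paper's intended notation.
    listed = [1, 4, 16, 64, 256]
    assert listed == [4 ** (n - 1) for n in range(1, 6)]
    assert listed == [2 ** (2 * n) for n in range(0, 5)]   # same sequence, N from 0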
On Fri, Jan 04, 2008 at 04:41:40PM +0100, Tobias Schneider wrote: > Weight = 2^2N > > Thus, for features containing 1, 2, 3, 4, and 5 words, the weights of > those features would be 1, 4, 16, 64, and 256 respectively." > > > What does the variable N in the weighting formula stand for? I think you get the answer in the following slide: (3) the 2^2N weighting means that weights were 1, 4, 16, 64, 256, ... for the span lengths of 1, 2, 3, 4, 5 ... words Thus N stands for the number of words in the N-gram. HTH -- paolo GPG/PGP id:0x1D5A11A4 - 04FC 8EB9 51A1 5158 1425 BC12 EA57 3382 1D5A 11A4 - 9/11: the outrageous deception and ongoing coverup: http://911review.org -
I read the paper "The Spam-Filtering Accuracy Plateau at 99.9% Accuracy and How to Get Past It." and I have a question about the following part: "In this experiment, we used superincreasing weights as determined by the formula Weight = 2^2N Thus, for features containing 1, 2, 3, 4, and 5 words, the weights of those features would be 1, 4, 16, 64, and 256 respectively." What does the variable N in the weighting formula stand for?
Hi list, I recently upgraded from 20070320 to 20070810. Since that upgrade, I get a very large number of false positives, which previously was not the case. I have been training-on-errors for almost three weeks, but crm114 still classifies almost every mail as spam. What's weird is that I thought the cut-off point was 0 and negative scores would be indicative of spam, positives would be ham, but this seems not to be the case: I have messages with a score of 10 marked GOOD and a score of 16 marked SPAM. I am using mailreaver with :clf: /osb unique microgroom/ Does anyone have any advice? -- martin | http://madduck.net/ | http://two.sentenc.es/ eleventh law of acoustics: in a minimum-phase system there is an inextricable link between frequency response, phase response and transient response, as they are all merely transforms of one another. this combined with minimalization of open-loop errors in output amplifiers and correct compensation for non-linear passive crossover network loading can lead to a significant decrease in system resolution lost. however, of course, this all means jack when you listen to pink floyd. spamtraps: mad...@ma...
On Sun, Aug 26, 2007 at 09:54:18PM +0200, martin f krafft wrote: > I upgraded to 20070810-BlameTheSegfault and started to see errors ... > /usr/bin/crm: *ERROR* > This file should have learncounts, but doesn't, and the learncount slot is busy. It's hosed. Time to die. ... > What's going on? It seems to work fine with 20070320. weird ... did you change anything in mailfilter.cf along with the upgrade? what's the :clf: in use? how/when did you make the .css in use? -- paolo PS: this is rather matter for -general ML than -discuss
I upgraded to 20070810-BlameTheSegfault and started to see errors like this whenever I used mailreaver to train spam/ham: ERROR: mailreaver.crm broke. Here's the error: ERROR: /usr/bin/crm: *ERROR* This file should have learncounts, but doesn't, and the learncount slot is busy. It's hosed. Time to die. Sorry, but this program is very sick and probably should be killed off. This happened at line 529 of file /usr/share/crm114/mailreaver.crm What's going on? It seems to work fine with 20070320. -- martin; (greetings from the heart of the sun.) \____ echo mailto: !#^."<*>"|tr "<*> mailto:" net@madduck "the truth is rarely pure and never simple. modern life would be very tedious if it were either, and modern literature a complete impossibility!" -- oscar wilde spamtraps: mad...@ma...
Paolo wrote: >> Brilliant idea! Hadn't thought about the 'head -x', but I _like_ it. I >> was thinking about maybe 1 or 2 Kbytes reserved for the header anyway, >> > these are ideas that floated in ML threads long ago. Note that OSBF makes > room for 4k header. > Yes, I saw there was some version checking and header code in there already. BTW, 'man head' on my box doesn't give a -x option. Is that an option to read until the EOF (or NUL?) character in an ASCII file? > ok, ok - no b2b ;) > Sorry, recalled some 'cool hacking' sessions of long past that went pear shaped as nobody could'handle' it. With 20-20 hindsight it was an exercise in complexity capability (how much nasty little details can you handle all at once). > no, if you put the classes in 2 sets like > ! classify (classA_1 classA_2 ... | classB_1 classB_2 ...) > you get a scalar (success|fail, but still all pR values). If you insted > say > ! classify (classA_1 classA_2 ... classB_1 classB_2 ...) > (note no '|') ie run in 'stats-only' - you get just the pR vector. I think > you can use that for building your fuzzifier, either in CRM or your favourite > prog.lang. A tricky point is that pR is normalized, so that it cannot be > used as class-membership function as is; an artifice could be to add a > class 'AnythingElse', ie the complement to the set of your classes. > I've copied this to my project notes. At the moment, the details of this are beyond my grasp, but that will change when I move away from the code cleanup into the actual algorithmic material of crm114. Thank you for this tip for it gives me a direction to investigate. > note that not all classifiers work well for N >2, nor those that are > *supposed* to work have been thoroughly tested. > I already suspected that much. That's why I don't mind going through all the code: I expect I'll need this exercise later on. > well, crm114 is a jit engine + classifiers plugged-in (bolted-in, at present). > [...] > I think that, if none of the (pR output from) current classifiers fits your > task, it'd be relatively easy to hack one of them into a new one, which > would be named eg f-osb (Fuzzy-OSB) or even OSBG (OSB-Gerrit) ;). > :-) heh, OSBG, now that would be something. Seriously though, I immediately recognized the plug/bolting in features when I first had a look at the crm114 code. Of course, a bit less of a copy&paste approach would have been 'nice' from a certain design point of view, but given the research nature of this type of tool (as Bill put it so eloquently somewhere: 'spam is a moving target') copy&paste is a very good approach (you can always refactor the sections that have stabilized). Besides, there are very nice tools out there to ease diff&merge-ing source files, so it's not much of a hassle to keep them in sync for now (like I did with my copy of SVM vs SKS: SKS seems to have started as an utterly stripped version of SVM, but the behaviour is _very_ similar so I merged the SVM code back in, just so I have lesser diffs to look at when cross-checking SKS vs SVM after a code change in either one of them. Ger
On Tue, Aug 07, 2007 at 08:37:09PM +0200, Gerrit E.G. Hobbelt wrote: > Right now, as I see it, you can't provide hard guarantees that > conversions will work (and I suspect that, given my goal with crm114, Sure you can: Reaver Cache. That works across versions, across classifiers, etc. -- Raul
On Tue, Aug 07, 2007 at 08:37:09PM +0200, Gerrit E.G. Hobbelt wrote: > > > Brilliant idea! Hadn't thought about the 'head -x', but I _like_ it. I > was thinking about maybe 1 or 2 Kbytes reserved for the header anyway, these are ideas that floated in ML threads long ago. Note that OSBF makes room for 4k header. > I've seen the CSV interformat and I was thinking about using that. No > bin-2-bin direct stuff, as that would complicate matters beyond control: ... > I've done direct bin-2-bin conversions in the past, but they're a true > support nightmare. It's doable, but you can have someone spend a serious ok, ok - no b2b ;) > <off-topic> ... > and a _learning_ 'fuzzy' discriminator, which has to wade through a slew > of 'crap' to arrive at a 'proper' rule or decision. Here I'm more > interested in decision _vectors_ (rather small ones) than _scalars_, but > I'll tackle that hurdle when I've got crm114 to a state where I can > really dive into the classifiers themselves, because I believe right now > it only supports single output bits(scalar) (pR?) but I'm not entirely no, if you put the classes in 2 sets like ! classify (classA_1 classA_2 ... | classB_1 classB_2 ...) you get a scalar (success|fail, but still all pR values). If you insted say ! classify (classA_1 classA_2 ... classB_1 classB_2 ...) (note no '|') ie run in 'stats-only' - you get just the pR vector. I think you can use that for building your fuzzifier, either in CRM or your favourite prog.lang. A tricky point is that pR is normalized, so that it cannot be used as class-membership function as is; an artifice could be to add a class 'AnythingElse', ie the complement to the set of your classes. > The problem for me is that I need to understand/learn the algorithm note that not all classifiers work well for N >2, nor those that are *supposed* to work have been thoroughly tested. > I've got the idea, I have a 'feeling' that this is the right direction, > but it's really still just guesswork regarding feasibility so far. well, crm114 is a jit engine + classifiers plugged-in (bolted-in, at present). The whole thing about pR is how you measure the stats for X against the N classes, which is just a bunch of lines that can be tweaked at pleasure. ... > I don't mind too much if crm114 doesn't work out for goal #2 - though it > would be a serious setback - as there's still the spam filter feature I think that, if none of the (pR output from) current classifiers fits your task, it'd be relatively easy to hack one of them into a new one, which would be named eg f-osb (Fuzzy-OSB) or even OSBG (OSB-Gerrit) ;). -- paolo
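A sketch of the 'stats-only' idea above, wrapping a command-line crm call from Python: classify against all the class files with no '|' separator and pull the per-class pR values out of :stats:. The exact layout of the :stats: text varies with classifier and version, so the file names, flags, and the regex here are assumptions to adapt, not a documented format.

    import re
    import subprocess

    CLASSES = ["classA_1.css", "classA_2.css", "classB_1.css"]   # illustrative names

    def pr_vector(message, flags="<osb unique>"):
        # No '|' between the files: a stats-only classify, per the explanation above.
        prog = ("-{ isolate (:stats:); classify %s (%s) (:stats:); "
                "output /:*:stats:/ }") % (flags, " ".join(CLASSES))
        out = subprocess.run(["crm", prog], input=message,
                             capture_output=True, check=True).stdout.decode()
        # Assumed pattern: each per-file line names the file and carries a "pR:" value.
        return {f: float(v)
                for f, v in re.findall(r"\((\S+\.css)\).*?pR:\s*(-?[0-9.]+)", out)}

Such a pR vector could then feed a fuzzifier or an 'AnythingElse' complement class as Paolo suggests, outside of CRM114 itself.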
Paolo wrote: > On Mon, Aug 06, 2007 at 09:31:29PM +0200, Gerrit E.G. Hobbelt wrote: > >> - start each file with a versioned header (I'll come back to that later) >> > > that's well established for Fidelis' OSBF > I saw. It's just that I'm looking for a rather more generic solution, which is copy&paste-able when anyone (probably Bill) feels like adding other classifiers to crm114. Say some sort of 'file format/coding practice' thing: rip if off the other classifiers and just add your own classifier constant (so no fancy footwork with index [0] in the data arrays itself or anything like that). >> a) import all acceptable data, or >> > > there's a catch, as the original arch on which to do the export 1st might > not be avail anymore ... > Heh :-) That's where I refer to the legalese in there: 'sorry sir, it's _forward_ compatible as of this release' ;-) The whole point is that I'm trying to get at a mechanism which clearly identifies the data, both in type and version, so that we can develop a 'sure fire' and sane conversion. This while keeping in mind that design/devel/test time is a rather limited resource, so the 'management decision' may well turn out to be to forego the availability of a complete 'conversion' for specific versions (and that may include crm file versions predating this versioning mechanism). Right now, as I see it, you can't provide hard guarantees that conversions will work (and I suspect that, given my goal with crm114, I'll need that sort of thing), as you have several classifiers and software versions, while there's no way to tell them apart in a _guaranteed_ manner: all one can go on is some version info (OSBF et al) and a bit of heuristics. And 'it may work' isn't an option for me when I'm going to employ crm114, so I like to be able to _specifically_ test (and thus support) crm software versions and classifiers. Longwinded paragraphs cut short: I want to end up with a chart which tells me: "You've got crm114 release X and are using classifier C, well, we do support a 'full data transfer' for the current crm114 release." and maybe an additional (sub-)chart which says: "And incidentally, when you have crm114 running on system S, you can also _share_ that classifier's data on system type T using our import/export-based sync system." These charts have three ticks in each cell of their matrices: (a) may work (a.k.a. there's code for this in there) + (b) tested, a.k.a. we got word it works + (c) supported, a.k.a. you may bother us / complain when it isn't working. No tick in your cell on those charts means: you're SOL. Time for a retraining and ditching of the old files, probably. This would solve the problem of the ever lasting questions: can I keep my files or should I start from scratch? For folks that cannot retrain as they go, this 'charted' approach will provide them with a clear decision chart: can/should I upgrade, or shouldn't I? >> b) report the incompatibility and hence the need to 'recreate/relearn' >> the files. >> > > ... and b) might not always be an option. > See above. I'm well aware of that. I'm driving at a mechanism which allows everyone to clearly see when and what can/has been done. That includes you (J.R. User) helping the crm114 team by adding export/import support for those situations where the chart says 'not available' while you need that sort of thing. That also includes collecting and archiving feedback on [user] test results: did their transfer/upgrade work out ok? 
It's added work, but the benefit is that the upgrade process (and the
decision to upgrade) can be fully automated in the end: for unmanned
systems, only upgrade when our locally used version + classifier has a
tested (and supported?) data migration path towards the new crm114 upgrade
release.

> yep, but I'd consider a bug (which might be just a TODO) a conversion
> util/function which is unable to properly convert our own stuff from arch1
> to arch2, both ways, whatever arch* are.
> Such converters won't be exactly trivial (byte swapping, aligning, padding,
> etc) but feasible.

That's where the limited design/devel resourcing comes into play: I don't
mind if the 'standard' decision is NOT to support/provide a data conversion
path. That's understandable, as we don't have an unlimited supply of dev
power. But when we do choose to provide a conversion path, it should be
clearly identifiable. (Someone may need it and can help Bill, you and the
others by putting in the dev effort there, just like I'm reviving the Win32
port and adding error checking and stuff along the way.)

And, BTW, I've written that sort of cross-platform stuff quite often. It
gets a bit wicked when you need to convert VAX/VMS Fortran floating point
values to PC/x86 IEEE format, for instance. ;-)) Otherwise, it's just really
careful coding and a bit of proper up-front thinking. And then keeping a
lookout for register/word-size issues (e.g. 32- vs. 64-bit) throughout the
crm implementation, which is the hard part. Padding, endianness, etc. can be
handled rather easily: define a 'special struct' with all the basic types in
there and load it with a special byte sequence; that gives you endianness
and alignment for all basic types. Floating point values need a bit of
special treatment when you travel outside the IEEE realm, but that's doable
too. Not trivial, though, indeed. (There's a sketch of such a probe just
after this message.)

>> The binary format header will include these information items (at least):
>>
>> - the crm version used to create the file
>> - the platform (integer size and format (endianness), floating point size
>>   and format, structure alignment, etc.)
>> - the classifier used to create the file
>> - the data content type (some classifiers use multiple files)
>> - space for future expansion (this is a research tool too: allow folks
>>   to add their own stuff which may not fit the header items above)
>
> +file-format version and, since there'll be plenty of space, plain-text
> file-format blurb and summary file-stats, so that head -x css would be
> just fine to report the relevant things.

Brilliant idea! Hadn't thought about the 'head -x', but I _like_ it. I was
thinking about maybe 1 or 2 Kbytes reserved for the header anyway, so, yes,
plenty of space for a little informational text up front. A few Kbytes won't
hurt.

+file-format: yes. In case we find the format needs to be changed again
(hopefully not before 2038 ;-) ). Another very good point.

>> The approach includes the existence of an export/import tool to convert
>> the data to/from a cross-platform portable text format, where
>
> that's the current CSV inter-format, though the converter should be able
> to do it at once binary-2-binary.

I've seen the CSV interformat and I was thinking about using that.
No bin-2-bin direct stuff, as that would complicate matters beyond control:
given the 'cross-platform' tack, it would mean that a developer would have
to code - and maintain - software which includes a table of file layout
definitions, one for each supported platform (and probably each crm release
version too).

Compare this to databases: right now I'm in a project where I've found that
Oracle cannot copy database files as-is across patch versions (that's the
ultra-minor version number), let alone move the binary database files as-is
onto different Unix architectures (HPUX vs. Linux, of course with different
CPUs too). And that makes sense! The point? When Oracle DBAs are used to
export-dumping and importing databases running in the many-multi-Gigabyte
range to provide a migration/upgrade path for the data stored therein, I'd
like to do _exactly_ the same. That means: use the CSV format (probably
augmented) as an intermediate. (Or XML when I feel like getting fancy and
really 21st century ;-) )

I've done direct bin-2-bin conversions in the past, but they're a true
support nightmare. It's doable, but you can have someone spend a serious
chunk of his/her life on that alone. And when that person quits supporting
it, you're SOL as a tool provider, really. (Imagine your customers use a
platform which you didn't support just yet. Maybe even a new CPU type. Can
your _design_ of the bin2bin handle that? Or do you need to spend a
significant amount of devel effort just to add the generation of these
new-CPU-type files to your ware?)

The easy way out is to provide all your customers with a single, portable
format: they've got the software built on their own machines, and who better
than the machine itself to convert to/from that portable format? Thus, the
conversion effort is off-loaded to the compiler vendor, who has to cope with
it anyway (sscanf/printf/etc.; there's a small sketch of the idea just after
this message). XML is a good example of a solution invented to solve
precisely this issue (cross-platform, cross-version, cross-X-whatever data
transfer). We might even consider using XML as a replacement for the CSV
format, though XML tends to be rather, er, obese when it comes to data file
sizes. XML is hierarchical, so we can easily store our header info and crm
classifier data in there, while nicely separated/organized.

> for spam filtering, it's easier (and usually better) to start from scratch,
> but in other applications hash DBs might be precious stuff, so as people
> extend crm114's use to other tasks, such a tool might become highly
> desirable.

Yes, indeed. Verily.

<off-topic>

I have looked around at software supporting Bayesian/Markovian/etc.
statistics and selected crm114 because it looked like it had the right
amount of 'vim' (i.e. a lively dev community) while offering a feature set
which might cover my needs - or get very close indeed.

I intend to use crm114 for spam filtering (when combined with xmail) and for
a second purpose: I'm not going to disclose what it is exactly, but think of
it as a sort of fuzzy decision-making / monitoring process, which is a bit
of a cross-breed between a constraint-driven scheduler and a _learning_
'fuzzy' discriminator, which has to wade through a slew of 'crap' to arrive
at a 'proper' rule or decision. Here I'm more interested in decision
_vectors_ (rather small ones) than _scalars_, but I'll tackle that hurdle
when I've got crm114 to a state where I can really dive into the classifiers
themselves, because I believe right now it only supports single output bits
(scalar) (pR?)
but I'm not entirely sure there (lacking sufficient algorithm
understanding). Anyway, I guess the 'vector solution' would be to use
multiple crm (file) instances in parallel: one pR for each decision item in
the output vector. Of course, that's a crude way, so the 'clean' approach I
was originally aiming for was to convert crm114 into a library which could
be called/used from within my own special-purpose software. Alas, that's not
a Q4 2007 target anyway. ;-)

The problem for me is that I need to understand/learn the algorithm
internals for this advanced statistics stuff, as that is new to me and I
want to understand what it's actually doing, i.e. how this stuff arrives at
a decision, as I need to understand the implicit restrictions on the
classifiers (and learning methods). Let's just say I don't want to join the
masses who can't handle the meaning and implications of 'statistical
significance', such as by just grabbing a likely classifier and 'slapping it
on'. I fear that would cause some serious burn in the long term. You may
have seen from my work so far that I'm a bit paranoid at times
^H^H^H^H^H^H^H acutely aware of failure conditions, and it would be utterly
stupid to fall into that beartrap at a systems level by grabbing this tool
and applying it to a problem without really understanding where and what the
limitations of the various parts are. I've met too many design decisions
_not_ to worry.

I've got the idea, I have a 'feeling' that this is the right direction, but
it's really still just guesswork regarding feasibility so far. I arrived at
crm114 while I had been looking for a decision filter which could easily
handle _huge_ inputs for tiny outputs (spam: input = whole emails, output
vector size = 1), produce consistent and significant decisions (spam: > 99%
filter success rate in a very short learning period) while including a good
'learning' mode: somehow I don't think Bayesian is the bee's knees when it
comes to my second goal. And it has been shown it's certainly not the end of
it for spam either. And besides, crm114 isn't written in Perl (or some other
interpreted language). Which in my world is a big plus. ;-)

I don't mind too much if crm114 doesn't work out for goal #2 - though it
would be a serious setback - as there's still the spam filter feature which
is useful to me. So I don't mind spending some time on this baby to push it
to a level where I can sit back, have a beer and say "yeah! Looks good,
feels good. Let's do it!"

</off-topic>

Best regards,

Ger
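Gerrit's 'special struct' platform probe, mentioned a few paragraphs up,
might look roughly like this minimal C sketch. The struct names and the set
of probed types are illustrative only; this is not crm114's actual header
code.

    #include <stdio.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical 'special struct' platform probe: a leading char followed
     * by the type of interest, so that offsetof() reveals the padding the
     * compiler inserts (i.e. the type's effective alignment). */
    struct align_int    { char pad; int    x; };
    struct align_long   { char pad; long   x; };
    struct align_double { char pad; double x; };

    int main(void)
    {
        unsigned int pattern = 0x01020304u;
        unsigned char b[sizeof pattern];

        /* Byte order: check which byte of the known pattern comes first. */
        memcpy(b, &pattern, sizeof pattern);
        printf("byte order : %s\n",
               b[0] == 0x01 ? "big-endian" :
               b[0] == 0x04 ? "little-endian" : "other/mixed");

        printf("int        : size %zu, align %zu\n",
               sizeof(int),    offsetof(struct align_int, x));
        printf("long       : size %zu, align %zu\n",
               sizeof(long),   offsetof(struct align_long, x));
        printf("double     : size %zu, align %zu\n",
               sizeof(double), offsetof(struct align_double, x));
        return 0;
    }

Values like these are exactly what a file header could record, so an import
tool can tell how the writing machine laid out its data.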
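In the same spirit, here is a minimal sketch of the 'off-load the conversion
to the C library' point from the same message: each machine prints and
re-parses values as text instead of shipping raw structs. The record layout
(hash, count, weight) is made up for the example; it is not the actual CSV
inter-format.

    #include <stdio.h>

    /* Portable text round-trip: the local C library handles word size,
     * endianness and float layout on each side.  The field set below is
     * purely illustrative. */
    int main(void)
    {
        unsigned long hash   = 0xDEADBEEFUL;
        unsigned long count  = 42;
        double        weight = 0.125;
        char line[128];

        /* "export" on machine A */
        snprintf(line, sizeof line, "%lu,%lu,%.17g\n", hash, count, weight);

        /* "import" on machine B */
        unsigned long h2, c2;
        double w2;
        if (sscanf(line, "%lu,%lu,%lf", &h2, &c2, &w2) == 3)
            printf("round-tripped: hash=%lu count=%lu weight=%.17g\n",
                   h2, c2, w2);
        return 0;
    }

Printing doubles with %.17g preserves them exactly across the round trip on
IEEE-754 machines, which is the common case being discussed.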
On Mon, Aug 06, 2007 at 09:31:29PM +0200, Gerrit E.G. Hobbelt wrote:
>
> - start each file with a versioned header (I'll come back to that later)

that's well established for Fidelis' OSBF

> The way to provide the forward portability would be through providing an
> export/import mechanism (already exists for a few formats: cssdump)
...
> The versioned header should contain enough information for an
> export/import function to operate correctly:
> a) import all acceptable data, or

there's a catch, as the original arch on which to do the export 1st might
not be avail anymore ...

> b) report the incompatibility and hence the need to 'recreate/relearn'
> the files.

... and b) might not always be an option.

> Especially (b) is important as that'd enable (automated) upgrades to
> properly interact with the users: one would then be able to select

yep, but I'd consider a bug (which might be just a TODO) a conversion
util/function which is unable to properly convert our own stuff from arch1
to arch2, both ways, whatever arch* are.
Such converters won't be exactly trivial (byte swapping, aligning, padding,
etc) but feasible.

> The binary format header will include these information items (at least):
>
> - the crm version used to create the file
> - the platform (integer size and format (endianness), floating point size
>   and format, structure alignment, etc.)
> - the classifier used to create the file
> - the data content type (some classifiers use multiple files)
> - space for future expansion (this is a research tool too: allow folks
>   to add their own stuff which may not fit the header items above)

+file-format version and, since there'll be plenty of space, plain-text
file-format blurb and summary file-stats, so that head -x css would be
just fine to report the relevant things.

> The approach includes the existence of an export/import tool to convert
> the data to/from a cross-platform portable text format, where

that's the current CSV inter-format, though the converter should be able
to do it at once binary-2-binary.

> What are your thoughts on this matter? Is this worth pursuing (and hence
> augmenting the code to support such a header from now on) or is this,
> well...

for spam filtering, it's easier (and usually better) to start from scratch,
but in other applications hash DBs might be precious stuff, so as people
extend crm114's use to other tasks, such a tool might become highly
desirable.

-- 
paolo
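For readers skimming the archive, one possible shape for the versioned
header being proposed in this thread, written out as a C struct. All field
names, sizes and the 4 KiB figure are illustrative guesses assembled from
the list above plus Paolo's additions; this is not an existing crm114
format.

    #include <stdint.h>

    /* One possible shape for the versioned .css file header under
     * discussion.  Field names, sizes and the 4 KiB total are illustrative
     * guesses, not an actual crm114 format. */
    #define CSS_HEADER_SIZE 4096      /* plenty of room; 'head'-able text  */

    struct css_file_header {
        char     magic[8];            /* identifies the file type          */
        uint32_t header_version;      /* the file-format version itself    */
        char     crm_version[32];     /* crm114 release that wrote the file*/

        /* platform description of the writing machine */
        uint8_t  sizeof_int;
        uint8_t  sizeof_long;
        uint8_t  sizeof_float;
        uint8_t  sizeof_double;
        uint8_t  endianness;          /* 0 = little, 1 = big, 2 = other    */
        uint8_t  struct_align;        /* worst-case alignment when writing */

        uint32_t classifier_id;       /* which classifier created the file */
        uint32_t content_type;        /* which of its files this one is    */

        char     info_text[1024];     /* plain-text blurb + summary stats  */
        uint8_t  reserved[256];       /* research/user expansion space; on
                                       * disk the header block would be
                                       * padded out to CSS_HEADER_SIZE     */
    };

An import tool could then compare the stored platform fields against a
locally generated probe (like the one sketched earlier in the thread) and
decide whether the data can be taken as-is or has to go through the text
export/import route.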