
crm114-discuss — For discussion of CRM114 in theory and practice



From: Vikram K. <vik...@gm...> - 2013年04月18日 10:24:53
Hi,
I wanted to know whether the C-callable library, LIBCRM114, would work
for ideographic languages like Chinese, Korean, or Japanese.
Since these languages do not have word boundaries, how would the tokenization work?
Is there a way around this, such as converting the ISO-2022 encoding into UTF-8
and then training and classifying?
Or is there some other solution?
Please provide feedback.
-Viks
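A common workaround - an assumption here, not something the LIBCRM114 docs prescribe - is to segment ideographic text into overlapping character bigrams before training or classifying, so that a word-boundary tokenizer sees space-separated 'words' again. A minimal Python sketch:

    # Sketch: turn unsegmented CJK text into space-separated character
    # bigrams so a word-boundary tokenizer has something to split on.
    # Assumes the input is already Unicode (e.g. converted from
    # ISO-2022 with iconv); the bigram size is a free parameter.
    def cjk_bigrams(text: str) -> str:
        chars = [c for c in text if not c.isspace()]
        return " ".join(a + b for a, b in zip(chars, chars[1:]))

    print(cjk_bigrams("这是一个测试"))  # -> "这是 是一 一个 个测 测试"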
From: Lars B. <la...@da...> - 2012年12月12日 15:38:23
Hello Bill,
Thanks for your reply.
I'm not a programmer, but I can find my way around Linux and do basic bash scripting.
I looked at both mailfilter.crm and mailreaver.crm and was, with my current knowledge of the CRM language,
a bit overwhelmed at the prospect of modifying either of them to my needs.
So I would much prefer any command-line scripts that I could modify to test this out.
Best
Lars Sorensen
On Dec 12, 2012, at 3:33 PM, ws...@me... wrote:
> 
> Yes, CRM114 can do multi-class sorting; one of the test cases actually
> does that (four classes, I believe).
> 
> Now, a question: do you want to do this from command-line, or are you a
> C programmer? The reason I ask is that we have two user-compatible but
> NOT binary-compatible CRM114's now.
> 
> - There's the command-line version, which has its own language;
> 
> - There's the C-callable library (written in ANSI C) - you call it from
> a program you write. (Yes, there's example code, including, if I
> recall correctly, four-class examples.)
> 
> Which would you prefer?
> 
> - Bill
From: Lars B. <la...@da...> - 2012年12月12日 14:09:09
Hello,
I have an email account that receives a fairly high volume of daily emails (500-800), and I would like to categorize/classify these emails automatically into about 100 categories/folders.
For the last two months I have been trying out POPFile (http://getpopfile.org/) with some limited success.
After inspecting keywords and decision trees in POPFile, it seems to me that a classifier using phrases for classification might classify this type of email better than the Naive Bayes implementation in POPFile.
As I'm not a programmer, but trying to learn, I have been searching for preexisting tools that might work for what I want to achieve.
Searching the web leaves me with two options, CRM114 or OSBF-lua, as classifiers, and as I understand it, CRM114 now uses the OSBF classifier as the default!
Are there any implementations/scripts out there that allow multiple classes for general email sorting, using CRM114 or OSBF-lua as the classification engine?
From what I read this should be possible, but I'm unable to find any practical implementations to test with.
As I understand it, both mailfilter.crm and mailreaver.crm use only three classifications (1. spam, 2. nonspam, 3. unsure), so I presume these would not be useful for me in this regard.
I could use some advice on how to go about this the right way.
Are there any scripts or tools out there that will do general email classification with CRM114 or OSBF-lua and that could be integrated with maildrop or procmail on a Linux OS?
Any ideas or pointers would be greatly appreciated.
Best
Lars Sorensen
From: Matthieu <m...@tt...> - 2011年06月30日 11:08:23
Hi,
I'm trying to use crm114 on our mail server to filter bounced messages
into categories:
user_unknown
host_not_found
relay_denied
mailbox_full
mailbox_blocked
detected_as_spam
on_vacation
message_too_large
not_a_bounce
unknown
I'm using the learn and classify commands from this script:
https://github.com/samdeane/code-snippets/blob/master/python/crm.py
categorization: "<osb unique microgroom>"
learn: "'-{ learn %s( %s) }'"
classify: "'-{ isolate (:stats:); classify %s( %s) (:stats:); match [:stats:] (:: :best: :prob:) /Best match to file .. \(%s\/([[:graph:]]+)\\%s\) prob: ([0-9.]+)/; output /:*:best:\\t:*:prob:/ }'"
My question is: which categorization method would you suggest to achieve
this kind of filtering?
thanks,
Matthieu
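For reference, those templates expand into an ordinary inline crm program. A hedged sketch of the multi-class classify call driven from Python (a sketch only: the .css file names follow the category list above, and the output-parsing regex is adapted from the quoted crm.py snippet, so it may need adjusting for a given CRM114 version):

    import subprocess

    CATEGORIES = ["user_unknown", "host_not_found", "relay_denied",
                  "mailbox_full", "mailbox_blocked", "detected_as_spam",
                  "on_vacation", "message_too_large", "not_a_bounce",
                  "unknown"]

    def classify_bounce(message: bytes, ext: str = ".css") -> str:
        # Classify against one .css file per category in a single call
        # and print only the name of the best-matching statistics file.
        files = " ".join(c + ext for c in CATEGORIES)
        program = ("-{ isolate (:stats:); "
                   "classify <osb unique microgroom> ( " + files + " ) (:stats:); "
                   "match [:stats:] (:: :best:) "
                   "/Best match to file .. \\(([[:graph:]]+)\\)/; "
                   "output /:*:best:/ }")
        result = subprocess.run(["crm", program], input=message,
                                capture_output=True, check=True)
        return result.stdout.decode().strip()

As for which classifier flags to pick: that is exactly the question being asked, so the <osb unique microgroom> above is simply the setting already quoted from crm.py.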
From: Alejandro F. J. <ar...@gm...> - 2008年11月23日 16:06:59
The only spot where he (Simon Vans-Colina) seems to be aware of incoming
news/messages is his Facebook, and someone already tried to reach him
there. No lights there, though.
The point is that, if this guy's work was ever an open-source project,
I was wondering if anyone has a piece of it, or any other implementation
of CRM114 for classifying CVs for recruiting.
Thanks again!
Alejandro
On Thu, 2008-11-20 at 10:39 +0100, Gerrit E.G. Hobbelt wrote:
> 
> I see he's on LinkedIn; did you try to reach him there?
From: Gerrit E.G. H. <ge...@ho...> - 2008年11月20日 09:54:52
Sorry, can't help you out.
I see he's on LinkedIn; did you try to reach him there?
Take care,
Ger
Alejandro Fernandez Japkin wrote:
> Hello everyone,
>
> I'm in the middle of a hurry that includes implementing CRM114
> as a CV (resume) classifier for hiring purposes. It is my understanding
> that someone named "Simon Vans-Colina" was involved in some tool
> on this subject, but the few links available over the net are just dead.
> Is there *anyone* with *any* information about this? I'd really appreciate
> a straight answer, since I'm running out of time and I want this
> monkey off my back. Writing from scratch is not an option at the point
> I am at.
>
> Thanks, really - a lot
>
>
> Alejandro
>
>
-- 
Met vriendelijke groeten / Best regards,
Ger Hobbelt
--------------------------------------------------
web: http://www.hobbelt.com/
 http://www.hebbut.net/
mail: ge...@ho...
mobile: +31-6-11 120 978
--------------------------------------------------
From: Alejandro F. J. <ar...@gm...> - 2008年11月18日 22:50:36
Hello everyone,
I'm in the middle of a hurry that includes implementing CRM114
as a CV (resume) classifier for hiring purposes. It is my understanding
that someone named "Simon Vans-Colina" was involved in some tool
on this subject, but the few links available over the net are just dead.
Is there *anyone* with *any* information about this? I'd really appreciate
a straight answer, since I'm running out of time and I want this
monkey off my back. Writing from scratch is not an option at the point
I am at.
Thanks, really - a lot
Alejandro
From: Ger H. <ge...@ho...> - 2008年09月03日 15:08:54
On Wed, Sep 3, 2008 at 3:09 PM, Bill Yerazunis <ws...@me...> wrote:
> More like:
>
> LEARN ( c1.stat c2.stat | c5.stat ... c127.stat) < osbf unique> [my.txt]
>
> which means "train my.txt in as a positive example in statistics files C1
> and C2, and as a negative example in files C5 through C127". If a
> file is not found, initialize it as "osbf unique", otherwise use
> the self identification in the file to choose the correct learning
> method.
Whoa. I am probably OD'ing on Microsoft Excel right now so my 'grok'
is down to zero, but can you please run that "self identification" bit
by me again?
Or is that something along the lines of 'open file, read header, check
classifier id+config in there, *then* jump to classifier'? (Which can
be done, if you provide a 'csscreate' script opcode or some such
(which is only a stupid stub in GerH now, btw), which is then to be
used to 'create/set up' any new CSS file. mailreaver's 'learn zilch'
trick to create css files on the fly would then have to be replaced
with such a csscreate opcode.)
Am I thinking too 'classical/procedural' here regarding learn? Anyway,
what I read in your text is that you're going for something like
this:
assume message M which will be classified; then [unidentified
intelligent code] will train message M as 'spam' or 'ham' --> code
assuming an auto-ID'ing classifier as described above, so no attributes
needed:
classify (S|H) [M]
...
learn (S|H) [M] --> learn as spam (left side is 'S'pam CSS files,
right is 'H'am CSS)
...
learn (H|S) [M] --> learn as ham (because now 'H'am is at left)
which means you rotate the S/H CSS file[s] [collections] around that |
pipe symbol there.
That would be identical - I think - to Paolo's
learn (S|H) <1> [M] --> learn '1st' side == left side == spam
...
learn (S|H) <2> [M] --> learn '2nd' side == ham
Now for multiclass A|B|C|D|... it would probably work the same: you
just rotate the proper class E {A,B,C,D,...} (E == element of; no
math symbols in email) to the front, while Paolo's would send along the
proper 'index' value as an attribute or some such.
If it's like that, I'd rather have the 'indexed' variant instead of
the 'rotated around the | pipe' style, because it would take only one
isolated var to schlepp that bunch around, and it saves on possible
if/else conditionals as well, because I might be able to bluntly derive
index i E {<1>, <2>, ..} from a previously determined pR using a bit
of :@: math, but that's just me. The 'rotating' style is
auto-backwards-compatible (while keeping 'details' like <refute>
outside that equation for now) when you have an 'optional pipe' instead
of a 'required pipe' (and provided the "you know what you are doing"
caveat applies to the script writer).
Meanwhile, SVM still has 2 pipes and 3 files where everybody else uses
A|B (1 pipe, 2 files) for the same, so there's still a bit of
'irregularity' there to my mind; but then, I probably should stick to
looking at lotsa numbers in rectangles instead of attempting brain
activity today.
-- 
Met vriendelijke groeten / Best regards,
Ger Hobbelt
--------------------------------------------------
web: http://www.hobbelt.com/
 http://www.hebbut.net/
mail: ge...@ho...
mobile: +31-6-11 120 978
--------------------------------------------------
On Wed, Sep 3, 2008 at 9:27 AM, Paolo <oo...@us...> wrote:
> On Wed, Sep 03, 2008 at 05:03:35AM +0200, Ger Hobbelt wrote:
> ...
>> Given 60 classes (= CSS files), Paolo can have his KISS and I can eat
>> my pie too. Simple.
>
> let me stress once again that I question the _requirement_.
[...]
> agreed, provided that
>
> ! learn (:*:s:) [message] <i flags>
>
> where :s: is just a subset (1 as limit, so that 'i' can be dropped) of the
> N classes in use, is (remains) legal (where allowed).
Yes, that should be possible in my line of thought. Assuming "you know
what you're doing", i.e. you are aware of classifier internals, you can
do this in the new 'learn'. Take existing OSB for example (*forget* my
'delta' stuff for a sec there), which touches only a single CSS file
on learn; then
learn (A|B) <1>
is identical to
learn (A) <1>
is identical to
learn (A)
is identical to
learn (A|B)
because <1> is a possible 'default' -- though that might be a
disputable thing: I'd rather see an error report, because learn (A|B)
isn't 'obviously' going to teach the way of A.
The thing I'm really after is that at script level
learn (A|B|...) <i>
is supported for _all_ classifiers. When you're doing smart stuff
script-wise where you like to code
learn (A)
while your classify code is
classify (A|B|C|D|E|F|..)
fine. The bit of 'cut at pipe, pick the ones you want' code I envision
can handle it, so you've got options script-code-wise.
In other words: a 'set' of one is still a set in my book. That you as
a script writer might want to take that thought to the edge (a set of 1)
is fine with me; I always appreciate that kind of craftiness. It's
just that the starting point shifts for people new to this: keep the
set around and apply it to both classify and learn equally. When you are
ready to read the fine print in the manual, you can decide to use a 'set
of 1' as a valid 'fringe case' (fringe from a script-language structural
point of view).
What I *need* is learn (A|B) support for classifiers that don't have
it yet (OSB and friends), and currently there's no possibility of
coding
learn (A|B) <i osb>
so I am prevented from testing my ideas for the classifier itself.
>> what they want/need, you get the chance to apply filters & processes
>> in learn that are simply impossible right now PLUS you don't have to
>
> that's C level, SVM wants 3 because it uses 3 in both cases.
Aicks! You _got_ me there. Forgot the 3rd one in SVM. DANG! Hence still
a remaining 'oddity'. :-((
No good answer there except mumbling about the implicit 'variable
size' of a 'set' as I approach it.
> ! learn (a b a_v_b) <svm flags> # wants all 3
> ! classify (a b a_v_b) <svm flags> (s_svm) # wants all 3
> ! classify (a b) <xxx flags> (s_xxx) # can't use the extra a_v_b
>
>> So no unified ... mess; I'd say it's unified ... structure / design.
>
> maybe, but that's not as simple as saying:
> define: N classes
> hence: LEARN(1 2 ... N)
> CLASSIFY(1 2 ... N)
> which might turn into a mess, or rather shift the mess from one place to
> another.
Sure, it's a shift: out of the [script] language, so it's 'black
boxing' learn just as it is classify, and into the [C] code.
I think for general use it's less mess, because you need to 'remember'
less about the script language and the 'learn' interface: apart
from the extra index (in a sense you're _feeding_ it the pR
which would pop out of classify as a result), it's exactly like
classify. I really like language layouts where general use requires the
least number of 'rules' and 'details' to be remembered: it makes for a
simpler language overall, which is good for me, as I work with multiple
languages and a limited brain. ;-)
(This learn/classify stuff is, in a way, comparable to old
discussions about 'coding standards' and such for Pascal or C, where
there's a class of folks who say "you can skip the braces/begin-end
and the semicolons, so you should", while I am clearly with the folks
who say "doesn't matter what you do, always apply the same structure:
braces/begin-end and semicolons and stuff, unless it is _prohibited_
by the language". Right now 'learn (A)' is prohibiting me from using
'learn (A|B)'. I think that bit didn't make it through last night.)
>> Cost for Trever @ 60 classes? nil.
>
> wasn't thinking of run time cost, but script readability.
Same here. But Trever was starting to worry, it seemed to me, that
performance would drop, if ever so slightly, if we introduced
this. And in case others were going to think it mattered.
> yes, though once N classes get mmapped for a CLASSIFY a single-class LEARN
> can check for it and won't mmap() again, and msync() can be deferred
> iff other processes that use the same class(es) do that via shared mem.
Yep. When you construct your scripts to handle classification and
subsequent learning in the same crm114 instance, you get that
advantage today.
A (very limited) 'server'-y approach doable right now is writing a
script which loops, waiting for messages available on disc or stdin,
and keeps on processing them one after the other in the same instance:
you have the 'CSS stays in mem' benefit then as well. (Note: I'm
ignoring how to code the cutting up of stdin into messages and/or the
polling/waiting for disc-based messages here - that's another subject.)
>> Want some real, achievable gain? convert crm114 to play 'server', i.e.
>> permanently loaded and CSS files (close to) permanently mapped in
>
> yes yes yes yes - the endless daemon saga :)
[...]
> yeah, maybe the ability to run pre-compiled scripts can be a good idea
> for a number of applications.
You mean a kind of .java p-coded crm114 script, i.e. a real crm114
*compiler* (.crm --> .114 binary file) and, er, an accompanying 'virtual
machine'? Oh boy, the table rises here. ;-P But that's just the geek
in me getting all excited. It's not on my list of 'things worth doing @
mid/short-term', though, but fun anyway. A crude/cheap way might be an
option to 'dump' and 'load' the tokenized script as it leaves the crm114
tokenizer on its way to the execution unit. Tokenize once, run multiple
times.
It's not worth it for me (I run tiny scripts), but all the folks out
there enjoying mailreaver and friends might get some good delight out
of that, as mailreaver/mailtrainer are _significantly_ sized scripts.
>> seriously considering hacking crm114 into becoming mod_crm114, i.e. an
>> Apache2 plugin: you get the server, the socket I/O and the
>
> like Apache's Lucene and derivatives.
Sorta. Yes.
>> live in there like a wicked PHP-alike server-side scripting language
>> and you will definitely achieve instant notoriety. ;-)
>
> and support headache ;)
I like my native Americans ;-))
Granted, moving from 1.3 to 2.0/2.2 wasn't easy for a mod_xxx, but
I still like it way more than 'rolling your own [TCP-based] server'
again: link it to Apache instead. (And no, despite the fact that I do
Win32/64, I don't think I'll be the go-to guy if you want IIS plugin
support: IIS6 is nice, in a way, and has good performance, but I run
Apache on Windows for free projects and only do IIS for paying
customers. Got to draw the line _somewhere_. If they open-source IIS,
I'll reconsider that statement.)
Anyhow, the crm114 scripts would still be there as they are right now;
I would just take the std I/O and bend it so that stdin == request and
stdout (and stderr?) == response. Maybe add a touch of XML if you want
a freeze-dried, instant, low-cal 'web service' (which is hot
stuff these days, but rather old wine in fashionable new Walmart bags
if you ask me; but then, folks don't seem to study IT history anymore).
Why Apache, really? Because I can then 'lean on' the stick provided by
them when I need to scale up: distributed servers, pardon, *services*,
and the whole bloody lot are documented already. Besides, my purposes
lead me towards a production environment as a 'web backend' anyhow, so
why not bolt it to the web server itself? Yup, doing so requires some
understanding of the Apache API interfacing, and that raises the
tech level by +1, but at least you can be spared some significant
intricacies regarding TCP/server performance tactics at the server level.
It's fun to write, but in this case my feeling was that it's faster,
in dev time, to go for mod_crm114. And yes: that's 'faster' regarding a
_production quality_ mod_crm114 compared to a _production quality_
crm114d (note the 'd').
(For free as well: SSL-secured communications with the crm114
'service' - which might be something to cheer the 'remote services'
folks up quite a bit.)
Anyway, I can't 'do' the alpha release of mod_crm114 in one week, nor
can I deliver an alpha-stage crm114d in the same timeframe, so it'll
probably stay a great idea over whiskey on Friday, as I don't see Bill
getting his hands on a particular red phone booth with free access
either. ;-)
> see above: CAN but definitely should not be a MUST.
Does my approach of 'set' as described at start of this email match
your CAN, or does it still sound like MUST to you?
>> ONE: strict adherence to 'backwards compatibility' at CRM114 script
>
> just one good reason.
well.... ;-)
-- 
Met vriendelijke groeten / Best regards,
Ger Hobbelt
--------------------------------------------------
web: http://www.hobbelt.com/
 http://www.hebbut.net/
mail: ge...@ho...
mobile: +31-6-11 120 978
--------------------------------------------------
On Wed, Sep 03, 2008 at 05:03:35AM +0200, Ger Hobbelt wrote:
...
> Given 60 classes (= CSS files), Paolo can have his KISS and I can eat
> my pie too. Simple.
let me stress once again that I question the _requirement_. 
 
> The set passed to classify is a set and should be passed to learn as
right, if we had a vector/array struct it'd be 'natural' ...
> isolate (:c:) /class1 | class2 | and so on .../
...
> classify (:*:c:) [message]
which is a fake vector and works on strict assumptions about how to name
vars/classes.
Like in other situations, having a true array data structure would be
quite useful.
> learn (:*:c:) (index) [message]
> 
> Both look good to _me_. ;-)
agreed, provided that
! learn (:*:s:) [message] <i flags>
where :s: is just a subset (1 as limit, so that 'i' can be dropped) of the 
N classes in use, is (remains) legal (where allowed). 
 
> Because you always pass along the whole set at script level, the
> classifier code (both learn and classify implementation) gets to pick
there's no need for that; where's the binding between script level and
classifier implementation? eg I can define N classes, but use any subset
for both LEARN / CLASSIFY at any point to my taste/needs, within the
limits of the actual classifier's requirements:
!# use classes: one two three four five six seven
! learn (one two three four five six seven) <i flags> [msg_x]
! learn (three four five) <i flags> [msg_y]
! learn (one) <flags> [msg1]
! ...
! classify (one two three four five six seven) <flags>
! classify (five six seven) <flags>
! classify (three six seven) <flags>
! classify (one three four six seven) <flags>
! classify (six) <flags> (cm)	# class membership -> cm, unsupported atm
...
> what they want/need, you get the chance to apply filters & processes
> in learn that are simply impossible right now PLUS you don't have to
that's C level, SVM wants 3 because it uses 3 in both cases.
> worry anymore either which classifier you're gonna use because today
> all the bloody buggers require their own particular incantation when
> it comes to number of css files (classes) passed to learn.
there are categories of classifiers that have the same requirements wrt
#classes and params. Now suppose the actual classes are compatible, but
one classifier needs 1+ extras (eg SVM) and I want to compare classifiers;
then it'd be nice to do (SVM case; forget for now actual class compatibility):
! learn (a b a_v_b) <svm flags>		# wants all 3
! classify (a b a_v_b) <svm flags> (s_svm)	# wants all 3
! classify (a b) <xxx flags> (s_xxx)	# can't use the extra a_v_b
> So no unified ... mess; I'd say it's unified ... structure / design.
maybe, but that's not as simple as saying:
define:	N classes
hence:	LEARN(1 2 ... N)
	CLASSIFY(1 2 ... N)
which might turn into a mess, or rather shift the mess from one place to
another.
> Cost for Trever @ 60 classes? nil.
wasn't thinking of run time cost, but script readability.
> You save far more time when you find a way to reduce disc I/O cache
> misses on your memory-mapped CSS files, even when you achieve such a
> feat for learn alone (which would be rather weird and besides, unless
> you 'Train Everything', optimizing classify is the winner). I have a
yes, though once N classes get mmapped for a CLASSIFY a single-class LEARN
can check for it and won't mmap() again, and msync() can be deferred
iff other processes that use the same class(es) do that via shared mem.
> Want some real, achievable gain? convert crm114 to play 'server', i.e.
> permanently loaded and CSS files (close to) permanently mapped in
yes yes yes yes - the endless daemon saga :)
> invocation of crm114 and the moment the script *tokenizer* kicks in.
> You're not even *executing* script yet by then! The rest (8%) is
> spread across tokenizing ('compiling the [small!] script'), tokenized
> script code execution, wrap-up and unidentified fluff elsewhere.
> Believe me, if I'd see an easy way to kick that bugger into higher
> gear, you'd already have it.
yeah, maybe the ability to run pre-compiled scripts can be a good idea
for a number of applications.
> seriously considering hacking crm114 into becoming mod_crm114, i.e. an
> Apache2 plugin: you get the server, the socket I/O and the
like Apache's Lucene and derivatives.
> live in there like a wicked PHP-alike server-side scripting language
> and you will definitely achieve instant notoriety. ;-)
and support headache ;)
> Anyhow, I don't see any good reason why the learn (classes) argument
> cannot be identical to the related classify (classes) argument, except
see above: CAN but definitely should not be a MUST.
> ONE: strict adherence to 'backwards compatibility' at CRM114 script
just one good reason.
-- 
paolo
From: Ger H. <ge...@ho...> - 2008年07月22日 06:00:52
On Tue, Jul 22, 2008 at 1:21 AM, Chris Babcock <cba...@as...> wrote:
> The answer to these questions might very well be, "test and measure".
If that's the case, I'd appreciate a pointer to whatever help is available on the methodology, since CRM114 and working with text classification in general are new to me. (I hate Perl, but I'm pretty handy with sed... Go figure.)
>
> How do you calculate the hardware requirements, especially the size of
> CSS files needed, with a CRM114 program?
Okay, let me try my hand at this on the quick.
First of all, there's the classifier you pick. Different classifier,
different behaviour, different size requirements. A few of them
require unlimited space (CSS files grow a little every time); most of
them have fixed space requirements.
I'm a bottom-up guy most of the time, so let's start bottom-up for
some goodness.
All 'production quality' classifiers (that's OSB and friends) are
based on a fixed-size hash table. Since the method chosen to store
stuff in that hash table is the 'linear probe' algorithm, you should
never, ever try to get beyond the 50% fill-rate point, as beyond that
the hash table's performance deteriorates QUICKLY (there are quite a
few papers on that; 50% isn't 'hard' but a rule-of-thumb number).
Since the hash table is filled with hash elements and we assume a
reasonable-quality hash here (flat distribution in N dimensions and
bla bla bla), the best case is a flat fill. To satisfy the mechanical
engineer in me, who's learned there's no such thing as a 'best case' in
daily practice unless you get your lifetime's joss delivered in a
single day, we add a fudge factor, and after sucking on my thumb (mjam)
I'd say a fill of about 20-30% would be swell.
Performance? Can be assumed to be about flat (O(1)); I don't have
'live numbers' on this one, but my guess is that bigger CSS files will
be slower due to more chances of disc and CPU cache misses while poking
at the hash table entries. Fast disc I/O helps there. RAID5, maybe
RAID6 or other dandiness...
Now, one element is one hash plus a number, clocking in at 4 bytes
each, so at N elements, that's N*8 bytes of disc space per 'feature'.
Given a 32-bit box and a safety margin of 2 for signed-versus-unsigned
queerness -- NOT to be mistaken for the use of signed versus unsigned
int discussed elsewhere ;-) -- that's a max size of 2GB / 8 =
2**(31-3) = 2**28 elements; then use a 25% fill because it's a nice
number (1/2**2), which means we can store 2**26 elements max on 32-bit
'without noticeable loss of performance'. (Thanks to the 2G instead of
4G edge, I also have a fighting chance of getting this to actually
/work/ on such a box, as CRM114 is using memory mapping and we can't
eat it all for just the CSS file. No space left for the binary and misc
data there.)
Ah! But to classify you need at least two CSS files! And given that our
memory mapping is done all at the same time, I'll dial that
max(N) number down to 2**25.
Because you can smell the napalm from here when looking at your
numbers, let's quickly see what 64-bit has on offer in the 'best case',
and that would include additional money for harddisc technology
researchers an' all: 2**(63-3-2-1) ==> max(N) storage capacity at
2**57, which would mean you're good to go, topping out somewhere beyond
0.125 ExaFeatures (where one Feature is one CRM114 token, a.k.a.
'hash'). If you get that kinda space, could I maybe charge you a measly
commission fee in the form of .00001% of your disc space,
yes? MY problems are solved then. ;-)
So far the 'practical' limits.
Now from your side of the fence:
Taking 17*5*(175!/158!) on faith (this is my morning coffee, and it's
gotta be fast, so I DO believe) at 6K-sized docs? Hmmmm.... Let's just
assume one doc is one (1) Feature (it probably isn't, but what the
heck; my backbone already feels where this is going; trying to beat Big
Blue at it, are we, eh? :-)) ). That would mean, say, n!/m! =~= 100**(n-m)
for m >= 100 here (and that's a BIG lie! but a really sweet one) ==>
we're going to be hit by a feed of over 17*5*(100**(175-158)), and
since we're ballparking here like there's no tomorrow, that'd be
somewhere beyond 100**(175+1-158) = 100**18 == 10**36, which is somewhere
over the rainbow and beyond an Exa SQUARED.
Like the backbone already knew: ...OOPS?!
Not to be the bee in your bonnet - I like the idea! :-)) - but (a) all
them Features are never ever gonna fit, even when you get unlimited
sponsoring by Hitachi and IBM, heck, even if you /buy/ them, and (b)
assuming for now that (pre-)calculating/learning/whatever one such item
takes about a single modern-day CPU clock tick, i.e. ~10**-9 seconds,
which is rather optimistic, then even out the /other/ side you'd
/still/ be at it when the Four Horsemen are having a snack on our
offspring.
Of course, we can make the bugger 'learn on the job' (don't we all?),
and then it turns into a question of 'lifetime': how much do you
want it to learn, and how good should the bugger be at playing
Diplomacy... in the end? Because there are surely 'pathways' to be
found in that data, a.k.a. 'successful strategies'. Guesstimating what
learning _those_ will cost is _way_ beyond the morning coffee, though.
Sooooo... to get that Diplomacy-playing Big Blue going sometime
during /this/ lifetime, brute force ain't gonna cut it. Assuming the
above was kicking in an open stable door (but fun!), the plus benefit
of it all is that we have one practical, usable result here:
if you know how many different words ('features') you want this Bayes
box to 'remember', you can take that number (N), multiply it by 4*8=32
for a 25%-filled OSB[F]/Markov/... classifier, and your
advised/preferred CSS file size would be N*32 bytes. To be eligible
for one (1) yes/no-style classification question ("is it or isn't
it?"), that takes two (2) CSS files, so the total disc /cost/ would be
about N*64 bytes, excluding a negligible bit of header icing on the
cake.
Of course, that says nothing about what a 'feature' would BE in
your case; in email it's generally one word, but that's also an 'it
depends...', so there's lots of puzzling to do before you hit the Bayes
box. Big question before plugging it in: exactly _what_ are we going
to feed the animal? (See also a blurb about stock analysis a few
months ago on this ML. Simply plonking in raw data ain't gonna cut it.
Same here.)
On another note - before I run: if you want 'win/lose/draw'
three-way or other 'multiway' decisions, it is theoretically (and
practically) possible with CRM114; for every extra choice you have to
add one (1) more CSS file (and a | pipe symbol in your script).
Multiway weighting is 'supported' in the code, but I haven't heard
of anybody actively using it since I first popped by in autumn 2007,
so software-wise YMMV, caveat emptor, pick your classifier wisely and
all that - and here's a rabbit's foot as well, 'cause you're gonna
_need_ it.
Having unsettled you sufficiently, I'm exiting stage left outa here. The
laboring masses and all that. Still, I love your idea. A tip if you want
to pursue this: check out what the chess boys have been doing. Same
problem; smaller scale (cough).
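The rule of thumb above boils down to a one-line calculation. A sketch in Python (the 8-bytes-per-bucket and 25%-fill figures come straight from the text above, not from measuring any particular CRM114 build):

    def css_size_bytes(n_features: int, fill: float = 0.25,
                       bucket_bytes: int = 8, n_classes: int = 2) -> int:
        # One 8-byte bucket (4-byte hash + 4-byte count) per feature,
        # scaled up by 1/fill so the hash table stays ~25% full; a
        # yes/no decision needs two CSS files, hence n_classes=2.
        buckets_per_class = int(n_features / fill)
        return buckets_per_class * bucket_bytes * n_classes

    # e.g. one million distinct features -> about 64 MB for a two-class pair
    print(css_size_bytes(1_000_000))  # 64000000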
>
> One of my long range projects is to write an AI for the boardgame
> Diplomacy using CRM114. The approach is to archive combinations of turn
> results and moves sorted by how favorable the outcome was. The program
> builds movement sets by parsing game results to determine the
> disposition of its units then consults a movement matrix to generate
> all possible order sets. Each movement set and result combination is
> submitted to the classifier to determine how closely it "resembles"
> winning combinations from games in its training.
>
> What do I need to know in order to estimate the necessary size of the
> CSS file? There's ~175 unit dispositions and an average of 5 possible
> destinations for each unit. A well trained classifier which has
> not eliminated any trivial cases will have no more than 17*5*(175!/158!)
> documents of ~6 KB each - consisting of an order set and the results that produced it - in the corpus for each of the 7 game powers and 5 game phases. I need to be able to determine the optimal classification granularity (number of categories to sort to). Logically, I think that I would get the best speed and accuracy sorting to "win" and "not a win", but that only holds true as long as the classification file is within program limits without microgrooming. Dividing the classification files into more
> outcomes - "win", "draw (by size of draw)" and "elimination" - doesn't reduce the size of the "win" CSS file, nor do any other outcome-based
> refinements in classification. If I divide the classification according
> to a metric of game progress then I can effectively reduce the size of
> the CSS files at the expense of calculating that metric each turn. Are
> there any guidelines for determining how the size of the CSS files
> affects classification speeds?
>
> Chris
-- 
Met vriendelijke groeten / Best regards,
Ger Hobbelt
--------------------------------------------------
web: http://www.hobbelt.com/
 http://www.hebbut.net/
mail: ge...@ho...
mobile: +31-6-11 120 978
--------------------------------------------------
From: Chris B. <cba...@as...> - 2008年07月21日 23:21:30
The answer to these questions might very well be, "test and measure".
If that's the case, I'd appreciate a pointer to whatever help is available on the methodology, since CRM114 and working with text classification in general are new to me. (I hate Perl, but I'm pretty handy with sed... Go figure.)
How do you calculate the hardware requirements, especially the size of
CSS files needed, with a CRM114 program?
One of my long range projects is to write an AI for the boardgame
Diplomacy using CRM114. The approach is to archive combinations of turn
results and moves sorted by how favorable the outcome was. The program
builds movement sets by parsing game results to determine the
disposition of its units then consults a movement matrix to generate
all possible order sets. Each movement set and result combination is
submitted to the classifier to determine how closely it "resembles"
winning combinations from games in its training.
What do I need to know in order to estimate the necessary size of the
CSS file? There's ~175 unit dispositions and an average of 5 possible
destinations for each unit. A well trained classifier which has
not eliminated any trivial cases will have no more than 17*5*(175!/158!)
documents of ~6 KB each - consisting of an order set and the results that produced it - in the corpus for each of the 7 game powers and 5 game phases. I need to be able to determine the optimal classification granularity (number of categories to sort to). Logically, I think that I would get the best speed and accuracy sorting to "win" and "not a win", but that only holds true as long as the classification file is within program limits without microgrooming. Dividing the classification files into more
outcomes - "win", "draw (by size of draw)" and "elimination" - doesn't reduce the size of the "win" CSS file, nor do any other outcome-based
refinements in classification. If I divide the classification according
to a metric of game progress then I can effectively reduce the size of
the CSS files at the expense of calculating that metric each turn. Are
there any guidelines for determining how the size of the CSS files
affects classification speeds?
Chris
 
From: Ger H. <ge...@ho...> - 2008年04月19日 00:35:27
Hi,
I assume that the numbers you reported are for the test set, which was
NOT trained on, as the numbers are lower than 70% of 266.
Anyway, I would not worry too much about your numbers in relation to
crm114's performance. Nothing here makes my eyebrows go up. I'm rather
surprised crm114 got this far on its own, really.
The problem lies elsewhere as it looks like you are running into the
same /fundamental/ issue as I did when I decided to use crm114 for my
signal analysis.
The two basic questions you should answer for yourself, first and foremost, are:
1a - How do Bayesian and other statistical filters like crm114 work
EXACTLY? (I refer to a recent discussion on the crm114-developers mailing
list (Bill/Paolo/Ger) where crm114's innards are explained and discussed
using the analogy of a sandbox, green and red balls, and a gold ball.
It's way too much to reproduce here, but read up on that and make sure
you understand what's going on. Research the algorithms used by
crm114 before you continue. The key element to understand is how crm114
compares data elements to arrive at similarity figures.) Which leads to
the question:
1b - Ask yourself where in your data the 'equality'/identity between
elements in the evaluated inputs lies; this is a low-level engineering
question derived from the second major question:
2 - What are the metrics I want crm114 to compare to help me arrive at
the answers I seek? And which answer am I looking for, really?
NOTE: express the answers in both functional goals (for yourself) and
technical implementation terms, because you are designing the
automation of a 'human' system here, so you must be able to instruct
the computer to do /exactly/ what you want it to do to emulate
the human process you are trying to model.
Tip of the week: this implies, technically speaking, that you /may/
find you need to preprocess your data.
I give this rather generic answer because I believe it will help you
far more in understanding the core of what you are doing than if I
focus on a little detail (symptom) in your email and maybe up your
success rate right now. Understanding what is going on in there is
mandatory for anyone wishing to use statistical filters in a domain
where they have not been 'preconfigured' for you by other researchers.
> another. Is my understanding correct? Also, I found that each time crm114
> is made to learn the same thing, it produces a different classification
> result on the testing cases. Is this the correct behavior?
A few bits of info are lacking to answer this, but when there's no
randomness involved in any way, the process should be completely
reproducible, i.e. provide you with the same results after every
complete re-run. Some learning methods (when you use mailtrainer, for
instance) /may/ employ randomized learn ordering, which will jolt
results for test sets; more so for small test sets like yours.
Of course, further questions and results are welcomed.
Best regards,
Ger Hobbelt
On Fri, Apr 18, 2008 at 9:48 PM, Weide Zhang <wz...@gm...> wrote:
>
>
> Hi, I am using crm114 to do text mining on stock annual report 10-Ks to make
> predictions on their performance. The sample has 266 rows, each containing one
> column with the annual report segment and another indicating
> whether or not they performed better in that year compared to the industry
> average. I use 70% of the data (data before 2006) for training, and I tried
> different training methods.
>
> Below are the correct counts for each category ('good' and 'bad' meaning
> perform better or worse). I use the python wrapper found on the crm114
> wiki. The accuracy is quite low, and I notice that for osbf, no bad
> cases are classified correctly, and for markov, only 2 good cases are
> classified correctly. It seems that the algorithms are biased toward one
> class over
> another. Is my understanding correct? Also, I found that each time crm114
> is made to learn the same thing, it produces a different classification
> result on the testing cases. Is this the correct behavior?
>
>
>                good  bad
> entropy corr     14    6
>         total    18   19
>
> markov  corr      2   14
>         total    18   19
>
> osb     corr     10    4
>         total    18   19
>
> osbf    corr     18    0
>         total    18   19
>
> Thanks for your answer,
>
> Weide
-- 
Met vriendelijke groeten / Best regards,
Ger Hobbelt
--------------------------------------------------
web: http://www.hobbelt.com/
 http://www.hebbut.net/
mail: ge...@ho...
mobile: +31-6-11 120 978
--------------------------------------------------
From: Weide Z. <wz...@gm...> - 2008年04月18日 19:48:34
Hi, I am using crm114 to do text mining on stock annual report 10-Ks to make predictions on their performance. The sample has 266 rows, each containing one column with the annual report segment and another indicating whether or not they performed better in that year compared to the industry average. I use 70% of the data (data before 2006) for training, and I tried different training methods.
Below are the correct counts for each category ('good' and 'bad' meaning perform better or worse). I use the python wrapper found on the crm114 wiki. The accuracy is quite low, and I notice that for osbf, no bad cases are classified correctly, and for markov, only 2 good cases are classified correctly. It seems that the algorithms are biased toward one class over another. Is my understanding correct? Also, I found that each time crm114 is made to learn the same thing, it produces a different classification result on the testing cases. Is this the correct behavior?
               good  bad
entropy corr     14    6
        total    18   19

markov  corr      2   14
        total    18   19

osb     corr     10    4
        total    18   19

osbf    corr     18    0
        total    18   19
Thanks for your answer,
Weide 
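For reference, the raw counts in the table above translate into overall accuracies barely better than a coin flip on this 37-message test set. A quick tally (a sketch; the counts are read straight from the table):

    results = {           # (correct good, correct bad); totals are 18 and 19
        "entropy": (14, 6),
        "markov": (2, 14),
        "osb": (10, 4),
        "osbf": (18, 0),
    }
    for clf, (good, bad) in results.items():
        print(f"{clf:8s} overall accuracy = {(good + bad) / 37:.0%}")
    # entropy 54%, markov 43%, osb 38%, osbf 49%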
From: Tobias S. <tob...@fr...> - 2008年01月07日 22:21:31
Thanks for your response.
But if I have, for example, two words (N=2) and put that into the formula,
the resulting weight is 16 (2^(2*2)) and not 4.
Where is my mistake?
-----Original Message-----
From: crm...@li...
[mailto:crm...@li...] On behalf of Paolo
Sent: Monday, January 07, 2008 9:58 PM
To: crm...@li...
Subject: Re: [Crm114-discuss] Question about the weighting formula in
the plateau paper
On Fri, Jan 04, 2008 at 04:41:40PM +0100, Tobias Schneider wrote:
> Weight = 2^2N
>
> Thus, for features containing 1, 2, 3, 4, and 5 words, the weights of
> those features would be 1, 4, 16, 64, and 256 respectively."
>
>
> What does the variable N in the weighting formula stand for?
I think you get the answer in the following slide:
(3) the 2^2N weighting means that weights were
 1, 4, 16, 64, 256, ...
for the span lengths of 1, 2, 3, 4, 5 ... words
Thus N stands for the number of words in the N-gram.
HTH
--
 paolo

 GPG/PGP id:0x1D5A11A4 - 04FC 8EB9 51A1 5158 1425 BC12 EA57 3382 1D5A 11A4
 - 9/11: the outrageous deception and ongoing coverup: http://911review.org -
From: Paolo <oo...@us...> - 2008年01月07日 20:58:37
On Fri, Jan 04, 2008 at 04:41:40PM +0100, Tobias Schneider wrote:
> Weight = 2^2N 
> 
> Thus, for features containing 1, 2, 3, 4, and 5 words, the weights of
> those features would be 1, 4, 16, 64, and 256 respectively."
> 
> 
> What does the variable N in the weighting formula stand for?
I think you get the answer in the following slide:
(3) the 2^2N weighting means that weights were 
 1, 4, 16, 64, 256, ... 
for the span lengths of 1, 2, 3, 4, 5 ... words 
Thus N stands for the number of words in the N-gram.
HTH
-- 
 paolo
 
 GPG/PGP id:0x1D5A11A4 - 04FC 8EB9 51A1 5158 1425 BC12 EA57 3382 1D5A 11A4
 - 9/11: the outrageous deception and ongoing coverup: http://911review.org -
From: Tobias S. <tob...@fr...> - 2008年01月04日 15:41:52
I read the paper "The Spam-Filtering Accuracy Plateau at 99.9% Accuracy and
How to Get Past It." and I have a question about the following part:
 
"In this experiment, we used superincreasing weights as determined by the
formula 
Weight = 2^2N
Thus, for features containing 1, 2, 3, 4, and 5 words, the weights of those
features would be 1, 4, 16, 64, and 256 respectively."
 
What does the variable N in the weighting formula stand for?
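A worked check of the two candidate readings side by side may help here: 2^(2N) gives 4, 16, 64, ... for N = 1, 2, 3, while the sequence actually listed (1, 4, 16, 64, 256) is what 2^(2(N-1)), i.e. 4^(N-1), produces:

    for n in range(1, 6):
        print(f"N={n}:  2^(2N) = {2 ** (2 * n):5d}   2^(2(N-1)) = {2 ** (2 * (n - 1)):4d}")
    # N=1:  2^(2N) =     4   2^(2(N-1)) =    1
    # N=2:  2^(2N) =    16   2^(2(N-1)) =    4
    # N=3:  2^(2N) =    64   2^(2(N-1)) =   16
    # ...the weights listed in the paper follow the second column.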
 
From: martin f k. <ma...@ma...> - 2007年11月20日 13:14:41
Hi list,
I recently upgraded from 20070320 to 20070810. Since that upgrade,
I get a very large number of false positives, which previously was
not the case.
I have been training-on-errors for almost three weeks, but crm114
still classifies almost every mail as spam. What's weird is that
I thought the cut-off point was 0 and negative scores would be
indicative of spam and positive ones of ham, but this seems not to be
the case: I have messages with a score of 10 rated GOOD and a score of
16 rated SPAM.
I am using mailreaver with :clf: /osb unique microgroom/
Does anyone have any advice?
--
martin | http://madduck.net/ | http://two.sentenc.es/
eleventh law of acoustics:
 in a minimum-phase system there is an inextricable link between
 frequency response, phase response and transient response, as they
 are all merely transforms of one another. this combined with
 minimalization of open-loop errors in output amplifiers and correct
 compensation for non-linear passive crossover network loading can
 lead to a significant decrease in system resolution lost. however,
 of course, this all means jack when you listen to pink floyd.
spamtraps: mad...@ma...
From: Paolo <oo...@us...> - 2007年08月27日 22:11:07
On Sun, Aug 26, 2007 at 09:54:18PM +0200, martin f krafft wrote:
> I upgraded to 20070810-BlameTheSegfault and started to see errors
...
> /usr/bin/crm: *ERROR* 
> This file should have learncounts, but doesn't, and the learncount slot is busy. It's hosed. Time to die.
... 
> What's going on? It seems to work fine with 20070320.
weird ... did you change anything in mailfilter.cf along with the upgrade?
what's the :clf: in use?
how/when did you make the .css in use?
--
paolo
PS: this is rather a matter for the -general ML than -discuss
From: martin f k. <ma...@ma...> - 2007年08月26日 19:54:30
I upgraded to 20070810-BlameTheSegfault and started to see errors
like this whenever I used mailreaver to train spam/ham:
 ERROR: mailreaver.crm broke. Here's the error:
ERROR:
/usr/bin/crm: *ERROR*
 This file should have learncounts, but doesn't, and the learncount slot is busy. It's hosed. Time to die.
 Sorry, but this program is very sick and probably should be killed off.
This happened at line 529 of file /usr/share/crm114/mailreaver.crm
What's going on? It seems to work fine with 20070320.
--
martin; (greetings from the heart of the sun.)
 \____ echo mailto: !#^."<*>"|tr "<*> mailto:" net@madduck
"the truth is rarely pure and never simple. modern life would be very
 tedious if it were either, and modern literature a complete
 impossibility!"
 -- oscar wilde
spamtraps: mad...@ma...
From: Gerrit E.G. H. <Ger...@be...> - 2007年08月08日 21:51:14
Paolo wrote:
>> Brilliant idea! Hadn't thought about the 'head -x', but I _like_ it. I 
>> was thinking about maybe 1 or 2 Kbytes reserved for the header anyway, 
>> 
> these are ideas that floated in ML threads long ago. Note that OSBF makes
> room for 4k header.
> 
Yes, I saw there was some version checking and header code in there already.
BTW, 'man head' on my box doesn't give a -x option. Is that an option to 
read until the EOF (or NUL?) character in an ASCII file?
> ok, ok - no b2b ;)
> 
Sorry, I recalled some 'cool hacking' sessions of long past that went
pear-shaped as nobody could 'handle' it. With 20-20 hindsight it was an
exercise in complexity capability (how many nasty little details can
you handle all at once).
> no, if you put the classes in 2 sets like
> ! classify (classA_1 classA_2 ... | classB_1 classB_2 ...)
> you get a scalar (success|fail, but still all pR values). If you instead
> say
> ! classify (classA_1 classA_2 ... classB_1 classB_2 ...)
> (note no '|') ie run in 'stats-only' - you get just the pR vector. I think
> you can use that for building your fuzzifier, either in CRM or your favourite
> prog.lang. A tricky point is that pR is normalized, so that it cannot be
> used as class-membership function as is; an artifice could be to add a 
> class 'AnythingElse', ie the complement to the set of your classes.
> 
I've copied this to my project notes. At the moment, the details of this
are beyond my grasp, but that will change when I move away from the code
cleanup into the actual algorithmic material of crm114.
Thank you for this tip; it gives me a direction to investigate.
> note that not all classifiers work well for N >2, nor those that are 
> *supposed* to work have been thoroughly tested.
> 
I already suspected that much. That's why I don't mind going through all 
the code: I expect I'll need this exercise later on.
> well, crm114 is a jit engine + classifiers plugged-in (bolted-in, at present).
> [...]
> I think that, if none of the (pR output from) current classifiers fits your
> task, it'd be relatively easy to hack one of them into a new one, which 
> would be named eg f-osb (Fuzzy-OSB) or even OSBG (OSB-Gerrit) ;).
> 
:-) heh, OSBG, now that would be something.
Seriously though, I immediately recognized the plug/bolting in features 
when I first had a look at the crm114 code.
Of course, a bit less of a copy&paste approach would have been 'nice' 
from a certain design point of view, but given the research nature of 
this type of tool (as Bill put it so eloquently somewhere: 'spam is a 
moving target') copy&paste is a very good approach (you can always 
refactor the sections that have stabilized).
Besides, there are very nice tools out there to ease diff&merge-ing
source files, so it's not much of a hassle to keep them in sync for now
(like I did with my copy of SVM vs. SKS: SKS seems to have started as an
utterly stripped version of SVM, but the behaviour is _very_ similar, so
I merged the SVM code back in, just so I have fewer diffs to look at
when cross-checking SKS vs. SVM after a code change in either one of them).
Ger
From: Raul M. <mo...@ma...> - 2007年08月08日 16:00:33
On Tue, Aug 07, 2007 at 08:37:09PM +0200, Gerrit E.G. Hobbelt wrote:
> Right now, as I see it, you can't provide hard guarantees that 
> conversions will work (and I suspect that, given my goal with crm114, 
Sure you can: Reaver Cache.
That works across versions, across classifiers, etc.
-- 
Raul
From: Paolo <oo...@us...> - 2007年08月08日 08:41:51
On Tue, Aug 07, 2007 at 08:37:09PM +0200, Gerrit E.G. Hobbelt wrote:
> > 
> Brilliant idea! Hadn't thought about the 'head -x', but I _like_ it. I 
> was thinking about maybe 1 or 2 Kbytes reserved for the header anyway, 
these are ideas that floated in ML threads long ago. Note that OSBF makes
room for 4k header.
> I've seen the CSV interformat and I was thinking about using that. No 
> bin-2-bin direct stuff, as that would complicate matters beyond control: 
...
> I've done direct bin-2-bin conversions in the past, but they're a true 
> support nightmare. It's doable, but you can have someone spend a serious 
ok, ok - no b2b ;)
> <off-topic>
...
> and a _learning_ 'fuzzy' discriminator, which has to wade through a slew 
> of 'crap' to arrive at a 'proper' rule or decision. Here I'm more 
> interested in decision _vectors_ (rather small ones) than _scalars_, but 
> I'll tackle that hurdle when I've got crm114 to a state where I can 
> really dive into the classifiers themselves, because I believe right now 
> it only supports single output bits(scalar) (pR?) but I'm not entirely 
no, if you put the classes in 2 sets like
! classify (classA_1 classA_2 ... | classB_1 classB_2 ...)
you get a scalar (success|fail, but still all pR values). If you instead
say
! classify (classA_1 classA_2 ... classB_1 classB_2 ...)
(note no '|') ie run in 'stats-only' - you get just the pR vector. I think
you can use that for building your fuzzifier, either in CRM or your favourite
prog.lang. A tricky point is that pR is normalized, so that it cannot be
used as class-membership function as is; an artifice could be to add a 
class 'AnythingElse', ie the complement to the set of your classes.
> The problem for me is that I need to understand/learn the algorithm 
note that not all classifiers work well for N >2, nor those that are 
*supposed* to work have been thoroughly tested.
> I've got the idea, I have a 'feeling' that this is the right direction, 
> but it's really still just guesswork regarding feasibility so far.
well, crm114 is a jit engine + classifiers plugged-in (bolted-in, at present).
The whole thing about pR is how you measure the stats for X against the
N classes, which is just a bunch of lines that can be tweaked at pleasure.
...
> I don't mind too much if crm114 doesn't work out for goal #2 - though it 
> would be a serious setback - as there's still the spam filter feature 
I think that, if none of the (pR output from) current classifiers fits your
task, it'd be relatively easy to hack one of them into a new one, which 
would be named eg f-osb (Fuzzy-OSB) or even OSBG (OSB-Gerrit) ;).
--
paolo
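As a rough illustration of the fuzzifier idea (a sketch assuming pR behaves like a log10 likelihood ratio; the class names and numbers are invented): collect one pR per class from a stats-only classify, include the 'AnythingElse' complement class mentioned above, and renormalize into membership grades:

    def memberships(pr: dict[str, float]) -> dict[str, float]:
        # Treat each pR as a log10 ratio: exponentiate, then renormalize
        # so all classes (including the catch-all complement) sum to 1.
        weights = {c: 10.0 ** v for c, v in pr.items()}
        total = sum(weights.values())
        return {c: w / total for c, w in weights.items()}

    print(memberships({"classA": 2.0, "classB": -1.5, "AnythingElse": 0.0}))
    # classA ~0.99, classB ~0.0003, AnythingElse ~0.01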
From: Gerrit E.G. H. <Ger...@be...> - 2007年08月07日 18:37:13
Paolo wrote:
> On Mon, Aug 06, 2007 at 09:31:29PM +0200, Gerrit E.G. Hobbelt wrote:
> 
>> - start each file with a versioned header (I'll come back to that later)
>> 
>
> that's well established for Fidelis' OSBF
> 
I saw. It's just that I'm looking for a rather more generic solution,
which is copy&paste-able when anyone (probably Bill) feels like adding
other classifiers to crm114. Say, some sort of 'file format / coding
practice' thing: rip it off the other classifiers and just add your own
classifier constant (so no fancy footwork with index [0] in the data
arrays themselves or anything like that).
>> a) import all acceptable data, or
>> 
>
> there's a catch, as the original arch on which to do the export 1st might
> not be avail anymore ...
> 
Heh :-) That's where I refer to the legalese in there: 'sorry sir, it's 
_forward_ compatible as of this release' ;-)
The whole point is that I'm trying to get at a mechanism which clearly 
identifies the data, both in type and version, so that we can develop a 
'sure fire' and sane conversion.
This while keeping in mind that design/devel/test time is a rather 
limited resource, so the 'management decision' may well turn out to be 
to forego the availability of a complete 'conversion' for specific 
versions (and that may include crm file versions predating this 
versioning mechanism).
Right now, as I see it, you can't provide hard guarantees that 
conversions will work (and I suspect that, given my goal with crm114, 
I'll need that sort of thing), as you have several classifiers and 
software versions, while there's no way to tell them apart in a 
_guaranteed_ manner: all one can go on is some version info (OSBF et al) 
and a bit of heuristics. And 'it may work' isn't an option for me when 
I'm going to employ crm114, so I like to be able to _specifically_ test 
(and thus support) crm software versions and classifiers.
Longwinded paragraphs cut short:
I want to end up with a chart which tells me: "You've got crm114 release 
X and are using classifier C, well, we do support a 'full data transfer' 
for the current crm114 release."
and maybe an additional (sub-)chart which says: "And incidentally, when 
you have crm114 running on system S, you can also _share_ that 
classifier's data on system type T using our import/export-based sync 
system."
These charts have three ticks in each cell of their matrices: (a) may 
work (a.k.a. there's code for this in there) + (b) tested, a.k.a. we got 
word it works + (c) supported, a.k.a. you may bother us / complain when 
it isn't working.
No tick in your cell on those charts means: you're SOL. Time for a 
retraining and ditching of the old files, probably.
This would solve the problem of the everlasting question: can I keep
my files, or should I start from scratch?
For folks that cannot retrain as they go, this 'charted' approach will 
provide them with a clear decision chart: can/should I upgrade, or 
shouldn't I?
>> b) report the incompatibility and hence the need to 'recreate/relearn' 
>> the files.
>> 
>
> ... and b) might not always be an option.
> 
See above. I'm well aware of that. I'm driving at a mechanism which 
allows everyone to clearly see when and what can/has been done.
That includes you (J.R. User) helping the crm114 team by adding 
export/import support for those situations where the chart says 'not 
available' while you need that sort of thing.
That also includes collecting and archiving feedback on [user] test 
results: did their transfer/upgrade work out ok?
It's added work, but the benefit is that the upgrade process (and the
decision to upgrade) can be fully automated in the end: for unmanned
systems, only upgrade when the locally used version + classifier has a
tested (and supported?) data migration path towards the new crm114
upgrade release.
> yep, but I'd consider it a bug (which might be just a TODO) if a conversion
> util/function is unable to properly convert our own stuff from arch1
> to arch2, both ways, whatever arch* are.
> Such converters won't be exactly trivial (byte swapping, aligning, padding,
> etc.) but feasible.
> 
That's where the limited design/devel resourcing comes into play: I
don't mind if the 'standard' decision is NOT to support/provide a data
conversion path; that's understandable, as we don't have an unlimited
supply of dev power.
But when we do choose to provide a conversion path, it should be clearly
identifiable. (Someone may need it and can help Bill, you and the others
by putting in the dev effort there, just like I'm reviving the Win32
port and adding error checking and such along the way.)
And, BTW, I've written that sort of cross-platform stuff quite often
before. It gets a bit wicked when you need to convert VAX/VMS Fortran
floating point values to PC/x86 IEEE format, for instance. ;-))
Otherwise, it's just really careful coding and a bit of proper up-front
thinking, plus keeping a lookout for register/word-size issues (e.g.
32- vs. 64-bit) throughout the crm implementation, which is the hard part.
Padding, endianness, etc. can be handled rather easily: define a 'special
struct' with all the basic types in there and load it with a special
byte sequence: that gives you endianness and alignment for all basic
types. Floating point values need a bit of special treatment when you
travel outside the IEEE realm, but that's doable too. Not trivial,
though, indeed.
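A minimal sketch of that probe in plain C (nothing crm114-specific here;
the struct layout is just an example): write a known byte pattern through
an integer and let offsetof() report the alignments. A converter could
stash these values in the header and compare them on read-back:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    struct probe {
        char    c;
        int32_t i;   /* offsetof(i) reveals the int alignment */
        char    c2;
        double  d;   /* offsetof(d) reveals the double alignment */
    };

    int main(void)
    {
        uint32_t magic = 0x01020304u;
        const unsigned char *p = (const unsigned char *)&magic;

        printf("integer byte order: %s\n",
               p[0] == 0x04 ? "little-endian" :
               p[0] == 0x01 ? "big-endian" : "other/mixed");
        printf("int32 alignment:    %zu\n", offsetof(struct probe, i));
        printf("double alignment:   %zu\n", offsetof(struct probe, d));
        printf("struct size:        %zu (padding included)\n",
               sizeof(struct probe));
        return 0;
    }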
>> The binary format header will include these information items (at least):
>>
>> - the crm version used to create the file
>> - the platform (integer size and format (endianness), floating point size 
>> and format, structure alignment, etc.)
>> - the classifier used to create the file
>> - the data content type (some classifiers use multiple files)
>> - space for future expansion (this is a research tool too: allow folks 
>> to add their own stuff which may not fit the header items above)
>> 
>
> +file-format version and, since there'll be plenty of space, plain-text
> file-format blurb and summary file-stats, so that head -x css would be
> just fine to report the relevant things.
> 
Brilliant idea! I hadn't thought about the 'head -x', but I _like_ it. I
was thinking about reserving maybe 1 or 2 Kbytes for the header anyway,
so, yes, plenty of room for a little informational text up front. A few
Kbytes won't hurt.
+file-format: yes, in case we find the format needs to be changed again
(hopefully not before 2038 ;-) ). Another very good point.
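To make that concrete, here's one possible shape for such a header in C.
Every name, size and field below is an invented placeholder (it is neither
Fidelis' actual OSBF header nor a committed crm114 format); the point is
the pattern: fixed-width fields, a platform fingerprint, a plain-text
blurb for 'head', and an explicit reserve:

    #include <stdint.h>
    #include <stdio.h>

    #define CSS_HDR_BYTES 2048   /* the 1-2 Kbyte reserve discussed above */

    struct css_file_header {
        char     magic[12];            /* e.g. "CRM114css" + NULs */
        uint32_t file_format_version;  /* bump when the layout changes */
        uint32_t crm_version;          /* crm release that wrote the file */
        uint32_t classifier_id;        /* one constant per classifier */
        uint32_t content_type;         /* which of a classifier's files */
        uint8_t  int_endianness;       /* platform fingerprint ... */
        uint8_t  float_format;         /* ... e.g. IEEE-754 vs. VAX */
        uint8_t  sizeof_long;
        uint8_t  struct_align;
        char     blurb[512];           /* human-readable summary text */
        uint8_t  reserved[CSS_HDR_BYTES - 544]; /* future expansion */
    };

    int main(void)
    {
        /* the fields above occupy 544 bytes, so this prints 2048 */
        printf("header size: %zu bytes\n", sizeof(struct css_file_header));
        return 0;
    }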
>> The approach includes the existence of an export/import tool to convert 
>> the data to/from a cross-platform portable text format, where 
>> 
>
> that's the current CSV inter-format, though the converter should be able
> to do it directly, binary-2-binary.
> 
I've seen the CSV inter-format and I was thinking about using that. No 
bin-2-bin direct stuff, as that would complicate matters beyond control: 
given the 'cross-platform' tack, it would mean that a developer would 
have to code - and maintain - software which includes a table of file 
layout definitions, one for each supported platform (and probably the 
crm release version too).
Compare this to databases: right now I'm in a project where I've found
that Oracle cannot copy database files as-is across patch versions
(that's the ultra-minor version number), let alone move the binary
database files as-is onto different Unix architectures (HP-UX vs.
Linux, of course with different CPUs too). And that makes sense!
The point? When Oracle DBAs are used to export-dumping and importing
databases in the many-multi-gigabyte range to provide a
migration/upgrade path for the data stored therein, I'd like to do
_exactly_ the same. That means: use the CSV format (probably augmented)
as an intermediate. (Or XML, when I feel like getting fancy and really
21st century ;-) )
I've done direct bin-2-bin conversions in the past, but they're a true 
support nightmare. It's doable, but you can have someone spend a serious 
chunk of his/her life on that alone. And when that person quits 
supporting it, you're SOL as a tool provider, really. (Imagine your 
customers use a platform which you didn't support just yet. Maybe a new 
CPU type even. Can your _design_ of the bin2bin handle that? Or do you 
need to spend a significant amount of devel effort just to add the 
generation of these new-CPU-type files to your ware?)
The easy way out is to provide all your customers with a single,
portable format: they've got the software built on their own machines,
and who better than the machine itself to convert to/from that portable
format? Thus, the conversion effort is off-loaded to the compiler
vendor, who has to cope with it anyway. (sscanf/printf/etc.)
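A tiny sketch of that division of labour: one bucket per line, with
fprintf/fscanf doing all the platform-specific text conversion. The
(hash, key, value) record layout is invented for illustration - it is
not the real css bucket format:

    #include <stdio.h>

    struct bucket { unsigned long hash, key; float value; };

    static int export_bucket(FILE *out, const struct bucket *b)
    {
        /* %.9g keeps enough digits to round-trip a 32-bit float */
        return fprintf(out, "%lu,%lu,%.9g\n",
                       b->hash, b->key, (double)b->value) > 0 ? 0 : -1;
    }

    static int import_bucket(FILE *in, struct bucket *b)
    {
        double v;
        if (fscanf(in, "%lu,%lu,%lf", &b->hash, &b->key, &v) != 3)
            return -1;   /* malformed or truncated record */
        b->value = (float)v;
        return 0;
    }

    int main(void)
    {
        struct bucket out = { 0xdeadbeefUL, 42UL, 3.14f }, in;
        FILE *f = tmpfile();

        if (!f || export_bucket(f, &out) != 0) return 1;
        rewind(f);
        if (import_bucket(f, &in) == 0)
            printf("round-tripped: %lu,%lu,%g\n",
                   in.hash, in.key, in.value);
        fclose(f);
        return 0;
    }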
XML is a good example of a solution invented to solve precisely this
issue (cross-platform, cross-version, cross-X-whatever data transfer).
We might even consider using XML as a replacement for the CSV format,
though XML tends to be rather, er, obese when it comes to data file
sizes. XML is hierarchical, so we can easily store our header info and
crm classifier data in there, nicely separated and organized.
> for spam filtering, it's easier (and usually better) to start from scratch,
> but in other applications hash DBs might be precious stuff, so as people
> extend crm114 use to other tasks, such a tool might become highly desirable.
> 
Yes, indeed. Verily.
<off-topic>
I have looked around at software supporting Bayesian/Markovian/etc. 
statistics and selected crm114 because it looked like it had the right 
amount of 'vim' (i.e. a lively dev community) while offering a feature 
set which might cover my needs - or get very close indeed.
I intend to use crm114 for spam filtering (when combined with xmail) and 
for a second purpose: I'm not going to disclose what it is exactly, but 
think of it as a sort of fuzzy decision-making / monitoring process, 
which is a bit of a cross-breed between a constraint-driven scheduler 
and a _learning_ 'fuzzy' discriminator, which has to wade through a slew 
of 'crap' to arrive at a 'proper' rule or decision. Here I'm more 
interested in decision _vectors_ (rather small ones) than _scalars_, but 
I'll tackle that hurdle when I've got crm114 to a state where I can 
really dive into the classifiers themselves, because I believe right now 
it only supports a single scalar output (pR?), but I'm not entirely
sure there (lacking sufficient algorithm understanding). Anyway, I guess
the 'vector solution' would be to use multiple crm (file) instances in
parallel: one pR for each decision item in the output vector. Of course,
that's a crude way, so the 'clean' approach I was originally aiming for
was to convert crm114 into a library which could be called/used from
within my own special-purpose software. Alas, that's not a Q4 2007
target anyway. ;-)
The problem for me is that I need to understand/learn the algorithm
internals: this advanced statistics stuff is new to me, and I want to
know what it's actually doing, i.e. how it arrives at a decision, so
that I grasp the implicit restrictions on the classifiers (and learning
methods). Let's just say I don't want to join the masses who can't
handle the meaning and implications of 'statistical significance' by
just grabbing a likely-looking classifier and 'slapping it on'. I fear
that would cause some serious burn in the long term.
You may have seen from my work so far that I'm a bit paranoid at times
^H^H^H^H^H^H^H acutely aware of failure conditions, and it would be
utterly stupid to fall into that beartrap at a systems level by grabbing
this tool and applying it to a problem without really understanding
where and what the limitations of the various parts are. I've met too
many design decisions _not_ to worry.
I've got the idea, I have a 'feeling' that this is the right direction, 
but it's really still just guesswork regarding feasibility so far.
I arrived at crm114 while looking for a decision filter which could
easily handle _huge_ inputs for tiny outputs (spam: input = whole
emails, output vector size = 1) and produce consistent and significant
decisions (spam: > 99% filter success rate after a very short learning
period), while including a good 'learning' mode: somehow I don't think
Bayesian is the bee's knees when it comes to my second goal. And it has
been shown that it's certainly not the last word for spam either.
And besides, crm114 isn't written in Perl (or some other interpreted 
language). Which in my world is a big plus. ;-)
I don't mind too much if crm114 doesn't work out for goal #2 - though it 
would be a serious setback - as there's still the spam filter feature 
which is useful to me. So I don't mind spending some time on this baby 
to push it to a level where I can sit back, have a beer and say "yeah! 
Looks good, feels good. Let's do it!"
</off-topic>
Best regards,
Ger
From: Paolo <oo...@us...> - 2007年08月06日 23:31:45
On Mon, Aug 06, 2007 at 09:31:29PM +0200, Gerrit E.G. Hobbelt wrote:
> 
> - start each file with a versioned header (I'll come back to that later)
that's well established for Fidelis' OSBF
> The way to provide the forward portability would be through providing an 
> export/import mechanism (already exists for a few formats: cssdump) 
...
> The versioned header should contain enough information for an 
> export/import function to operate correctly:
> a) import all acceptable data, or
there's a catch, as the original arch on which to do the export 1st might
not be avail anymore ...
> b) report the incompatibility and hence the need to 'recreate/relearn' 
> the files.
... and b) might not always be an option.
 
> Especially (b) is important as that'd enable (automated) upgrades to 
> properly interact with the users: one would then be able to select 
yep, but I'd consider it a bug (which might be just a TODO) if a conversion
util/function is unable to properly convert our own stuff from arch1
to arch2, both ways, whatever arch* are.
Such converters won't be exactly trivial (byte swapping, aligning, padding,
etc.) but feasible.
> The binary format header will include these information items (at least):
> 
> - the crm version used to create the file
> - the platform (integer size and format (endianness), floating point size 
> and format, structure alignment, etc.)
> - the classifier used to create the file
> - the data content type (some classifiers use multiple files)
> - space for future expansion (this is a research tool too: allow folks 
> to add their own stuff which may not fit the header items above)
+file-format version and, since there'll be plenty of space, plain-text
file-format blurb and summary file-stats, so that head -x css would be
just fine to report the relevant things.
> The approach includes the existence of an export/import tool to convert 
> the data to/from a cross-platform portable text format, where 
that's the current CSV inter-format, though the converter should be able
to do it directly, binary-2-binary.
> What are your thoughts on this matter? Is this worth persuing (and hence 
> augmenting the code to support such a header from now on) or is this, 
> well...
for spam filtering, it's easier (and usually better) to start from scratch,
but in other applications hash DBs might be precious stuff, so as people
extend crm114 use to other tasks, such a tool might become highly desirable.
--
paolo
