Hi, I wanted to know whether the C-callable library, LIBCRM114, would work for ideographic languages like Chinese, Korean or Japanese. As these languages do not have word boundaries, how would tokenization work? Is there a workaround, like converting the ISO-2022 encoding into UTF-8 before training and classifying? Or is there some other solution? Please provide feedback. -Viks
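A minimal preprocessing sketch for the question above, in Python: CRM114's default tokenizer splits on runs of printable non-blank characters, so unsegmented CJK text tends to collapse into a handful of very long tokens. One workaround (an assumption here, not something LIBCRM114 does for you) is to convert the encoding and insert artificial token boundaries, such as overlapping character bigrams, before handing the text to learn/classify. The codec name and the bigram scheme are illustrative choices; a real segmenter such as MeCab would likely do better.

    # Sketch: decode ISO-2022 input, re-encode as UTF-8, and insert spaces so a
    # whitespace/graph-character tokenizer sees word-like units. The character
    # bigram segmentation is a naive stand-in, not part of LIBCRM114.
    def prepare_cjk(raw_bytes, source_encoding="iso2022_jp"):
        text = raw_bytes.decode(source_encoding)
        # Overlapping character bigrams joined by spaces; swap in a proper
        # segmenter if one is available.
        bigrams = [text[i:i + 2] for i in range(len(text) - 1)] or [text]
        return " ".join(bigrams).encode("utf-8")

    if __name__ == "__main__":
        sample = "これはテストです".encode("iso2022_jp")
        print(prepare_cjk(sample).decode("utf-8"))

The resulting space-delimited UTF-8 text can then be passed to whichever learn/classify interface is in use, the command-line crm or the C library.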
Hello Bill, Thanks for your reply. I'm not a programmer, but I can find my way around Linux and do basic bash scripting. I looked at both mailfilter.crm and mailreaver.crm and was, with my current knowledge of the crm language, a bit overwhelmed at the prospect of modifying any one of those to my needs. So I would much prefer any command-line scripts that I could modify to test this out. Best Lars Sorensen On Dec 12, 2012, at 3:33 PM, ws...@me... wrote: > > Yes, CRM114 can do multi-class sorting; one of the test cases actually > does that (four classes, I believe). > > Now, a question: do you want to do this from command-line, or are you a > C programmer? The reason I ask is that we have two user-compatible but > NOT binary-compatible CRM114's now. > > - There's the command-line version, which has its own language; > > - There's the C-callable library (written in ANSI C) - you call it from > a program you write. (yes, there's example code, including, if I > recall correctly - four-class examples) > > Which would you prefer? > > - Bill
Hello, I have an email account that receives a fairly high volume of daily emails (500-800), and would like to categorize/classify these emails automatically into about 100 categories/folders. For the last two months I have been trying out POPFile (http://getpopfile.org/) with some limited success. After inspecting keywords and decision trees in POPFile, it seems to me that a classifier using phrases for classification might classify this type of email better than the Naive Bayes implementation in POPFile. As I'm not a programmer, but trying to learn, I have been searching for preexisting tools that might work for what I want to achieve. Searching the web, I see that leaves me with two options: CRM114 or OSBF-lua as classifiers, and as I understand it CRM114 now uses the OSBF classifier as the default! Are there any implementations/scripts out there that allow multiple classes for general email sorting using CRM114 or OSBF-lua as the classification engine? It seems from what I read that this should be possible, but I'm unable to find any practical implementations to test with. As I understand it, both mailfilter.crm and mailreaver.crm use only 3 classifications: 1. spam, 2. nonspam, 3. unsure, so I presume these would not be useful for me in this regard. I could use some advice on how to go about this the right way. Are there any scripts or tools out there that will do general email classification with CRM114 or OSBF-lua and that could be implemented with maildrop or procmail on a Linux OS? Any ideas or pointers would be greatly appreciated. Best Lars Sorensen
Hi, I'm trying to use crm114 on our mail server to filter bounced messages into these categories: user_unknown, host_not_found, relay_denied, mailbox_full, mailbox_blocked, detected_as_spam, on_vacation, message_too_large, not_a_bounce, unknown. I'm using the learn and classify commands from this script: https://github.com/samdeane/code-snippets/blob/master/python/crm.py

categorization: "<osb unique microgroom>"
learn: "'-{ learn %s( %s) }'"
classify: "'-{ isolate (:stats:); classify %s( %s) (:stats:); match [:stats:] (:: :best: :prob:) /Best match to file .. \(%s\/([[:graph:]]+)\\%s\) prob: ([0-9.]+)/; output /:*:best:\\t:*:prob:/ }'"

My question is: which categorization method would you suggest to achieve this kind of filtering? Thanks, Matthieu
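A sketch of how the two templates above can be combined for this kind of multi-class sorting, in Python: each category gets its own .css file, the classify call names all of them, and the match pulls out the best-scoring file. The category names, flags, and file layout are illustrative assumptions taken from the message above, not a tested configuration.

    import subprocess

    FLAGS = "<osb unique microgroom>"
    CATEGORIES = ["user_unknown", "host_not_found", "relay_denied", "mailbox_full",
                  "detected_as_spam", "not_a_bounce", "unknown"]
    CSS = " ".join(c + ".css" for c in CATEGORIES)

    def learn(message, category):
        # Train the message into the named category's .css file; stdin is the
        # default data window for an inline crm program.
        prog = "-{ learn %s (%s.css) }" % (FLAGS, category)
        subprocess.run(["crm", prog], input=message, check=True)

    def classify(message):
        # Classify against all category files and parse the "Best match" line,
        # mirroring the classify template quoted above.
        prog = ("-{ isolate (:stats:); classify %s (%s) (:stats:); "
                "match [:stats:] (:: :best: :prob:) "
                r"/Best match to file .. \(([[:graph:]]+)\.css\) prob: ([0-9.]+)/; "
                r"output /:*:best:\t:*:prob:/ }") % (FLAGS, CSS)
        out = subprocess.run(["crm", prog], input=message,
                             capture_output=True, check=True).stdout
        best, prob = out.decode().strip().split("\t")
        return best, float(prob)

learn(msg, "mailbox_full") trains one category at a time, and classify(msg) returns the best-matching category name plus its probability, which a maildrop or procmail recipe can then act on. As for which classifier flags to pick, that choice is left to the thread; <osb unique microgroom> is simply what the quoted script defaults to.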
The only spot where he seems to be aware of incoming news/messages is his Facebook, and someone tried to reach him (Simon Vans-Colina) there. No lights there, though. The point is that, if this guy's work was ever an open-source project, I was wondering whether anyone had a piece of it, or any other implementation of CRM114 for CV classification in recruiting. Thanks again! Alejandro On Thu, 20-11-2008 at 10:39 +0100, Gerrit E.G. Hobbelt wrote: > > I see he's on LinkedIn; did you try to reach him there?
Sorry, can't help you out. I see he's on LinkedIn; did you try to reach him there? Take care, Ger Alejandro Fernandez Japkin wrote: > Hello everyone, > > I'm in the middle of a hurry that includes implementing CRM114 > as a CV -resume- classifier for hiring purposes. Is of my understanding > that someone named "Simon Vans-Colina" was involved in some tool > on this subject, but the few links available over the net are just dead. > Is there *anyone with *any information about this? I'd really appreciate > a straight answer, since i'm running out of time and i want this > monkey off my back. Writing from scratch is not an option at the point i > am. > > Thanks really -a lot > > > Alejandro -- Met vriendelijke groeten / Best regards, Ger Hobbelt -------------------------------------------------- web: http://www.hobbelt.com/ http://www.hebbut.net/ mail: ge...@ho... mobile: +31-6-11 120 978 --------------------------------------------------
Hello everyone, I'm in a hurry to implement CRM114 as a CV (resume) classifier for hiring purposes. It is my understanding that someone named "Simon Vans-Colina" was involved in some tool on this subject, but the few links available over the net are just dead. Is there *anyone* with *any* information about this? I'd really appreciate a straight answer, since I'm running out of time and I want this monkey off my back. Writing from scratch is not an option at the point I'm at. Thanks, really, a lot. Alejandro
On Wed, Sep 3, 2008 at 3:09 PM, Bill Yerazunis <ws...@me...> wrote: > More like: > > LEARN ( c1.stat c2.stat | c5.stat ... c127.stat) < osbf unique> [my.txt] > > which means "train my.txt in as a positive example in statistics files C1 > and C2, and as a negative example in files C5 through C127". If a > file is not found, initialize it as "osbf unique", otherwise use > the self identification in the file to choose the correct learning > method. Whoa. I am probably OD'ing on Microsoft Excel right now so my 'grok' is down to zero, but can you please run that "self identification" bit by me again? Or is that something along the lines of 'open file, read header, check classifier id+config in there, *then* jump to classifier? (Which can be done, if you provide the 'csscreate' script opcode or some such (which is only a stupid stub in GerH now, btw) which is then to be used to 'create/set-up' any new CSS file. (mailreaver's 'learn zilch' trick to create css on the fly has to be replaced then with such a csscreate opcode.) Am I thinking too 'classical/procedural' here regarding learn? Anyway, from what I read in your text is that you're going for something like this: assume message M which will be classified, then [unidentified intelligent code] will train message M as 'spam' or 'ham' --> code assuming auto-ID'ing classifier as described above so no attributes needed: classify (S|H) [M] ... learn (S|H) [M] --> learn as spam (left side is 'S'pam CSS files, right is 'H'am CSS) ... learn (H|S) [M] --> learn as ham (because now 'H'am is at left) which means you rotate the S/H CSS file[s] [collections] around that | pipe symbol there. That would be identical - I think - to Paolo's learn (S|H) <1> [M] --> learn '1st' side == left side == spam ... learn (S|H) <2> [M] --> learn '2nd' side == ham Now for multiclass A|B|C|D|... it would probably work the same, you just rotate the proper class E {A,B,C,D,...} (E == element of, no math symbols in email) to the front while Paolo's would send along the proper 'index' value as an attribute or some such. If it's like that, I'd rather have the 'indexed' variant instead of the 'rotated around | pipe' style because it would take one isolated var only to schlepp that bunch around and it saves on possible if/else conditionals as well, because I might be able to blunty derive index i E {<1>, <2>, ..} from a previously determined pR using a bit of :@: math, but that's just me. The 'rotating' style is auto-backwards compatible (while keeping 'details' like <refute> outside that equation for now) when you have 'optional pipe' instead of 'required pipe' (and provided "you know what you are doing" caveat applies to script writer). Meanwhile, SVM still has 2 pipes and 3 files where anybody else uses A|B (1 pipe, 2 files) for same, so there's still a bit of 'irregularity' there to my mind, but then I probably should stick to looking at lotsa numbers in rectangles instead of attempting brain activity today. -- Met vriendelijke groeten / Best regards, Ger Hobbelt -------------------------------------------------- web: http://www.hobbelt.com/ http://www.hebbut.net/ mail: ge...@ho... mobile: +31-6-11 120 978 --------------------------------------------------
On Wed, Sep 3, 2008 at 9:27 AM, Paolo <oo...@us...> wrote: > On Wed, Sep 03, 2008 at 05:03:35AM +0200, Ger Hobbelt wrote: > ... >> Given 60 classes (= CSS files), Paolo can have his KISS and I can eat >> my pie too. Simple. > > let me stress once again that I question the _requirement_. [...] > agreed, provided that > > ! learn (:*:s:) [message] <i flags> > > where :s: is just a subset (1 as limit, so that 'i' can be dropped) of the > N classes in use, is (remains) legal (where allowed). Yes, that should be possible in my line of thought. Assuming "you know what you're doing" i.e. are aware of classifier internals, you can do this in the new 'learn'. Take existing OSB for example (*forget* my 'delta' stuff for a sec there), which touches only a single CSS file on learn, then learn (A|B) <1> is identical to learn (A) <1> is identical to learn (A) is identical to learn (A|B) because <1> is a possible 'default' -- though that might be a disputable thing - I'd rather see an error report, because learn (A|B) isn't 'obviously' going to teach the way of A. The thing I'm really after is that at script level learn (A|B|...) <i> is supported for _all_ classifiers. When you're doing smart stuff script-wise where you like to code learn (A) while you classify code is classify (A|B|C|D|E|F|..) fine. The bit of 'cut at pipe, pick the ones you want' code I envision can handle it, so you've got options script-code-wise. In other words: a 'set' of one, is still a set in my book. That you as a script writer might want to take that thought to the edge (set of 1) is fine with me. I always appreciate that kind of craftiness. It's just that the starting point shifts for people new to this: keep the set around and apply to both classify and learn equally. When you are ready to read the fine print in the manual, you can decide to use 'set of 1' as a valid 'fringe case' (fringe from script-language structural point of view). What I *need* is learn (A|B) support for classifiers that don't have it yet (OSB and friends) and currently there's no possibility for coding learn (A|B) <i osb> so I am prevented from testing my ideas for the classifier itself. >> what they want/need, you get the chance to apply filters & processes >> in learn that are simply impossible right now PLUS you don't have to > > that's C level, SVM wants 3 because it uses 3 in both cases. Aicks! You _got_ me there. Forgot the 3rd one in SVM. DANG! Still a remaining 'oddity' hence. :-(( No good answer there expect mumbling about the implicit 'variable size' of a 'set' as I approach it. > ! learn (a b a_v_b) <svm flags> # wants all 3 > ! classify (a b a_v_b) <svm flags> (s_svm) # wants all 3 > ! classify (a b) <xxx flags> (s_xxx) # can't use the extra a_v_b > >> So no unified ... mess; I'd say it's unified ... structure / design. > > maybe, but that's not as simple as saying : > define: N classes > hence: LEARN(1 2 ... N) > CLASSIFY(1 2 ... N) > which might turn into a mess, or better shift the mess from one place to > another. Sure it's a shift: out of the [script] language, so it's 'black boxing' learn as it is classify, and into the [C] code. I think for general use it's less mess because you need to 'remember' less about the script language and the 'learn' interface, because apart from the extra index (in a sense you're _feeding_ it the pR which would pop out of classify as a result) it's exactly like classify. 
I really like language layout where general use requires the least number of 'rules' and 'details' to be remembered: it makes for a simpler language overall which is good for me as I work with multiple languages and a limited brain. ;-) (This learn/classify stuff is - in a way - comparable to old discussions about 'coding standards' and such for Pascal or C, where there's a class of folks that say: "you can skip the braces/begin-end and the semicolons so you should" while I am clearly with the folks that say: "don't matter what you do, always apply the same structure: braces/begin-end and semicolons and stuff, unless it is _prohibited_ by the language". Right now 'learn (A)' is prohibiting me from using 'learn (A|B)'. I think that bit didn't make it through last night.) >> Cost for Trever @ 60 classes? nil. > > wasn't thinking of run time cost, but script readability. Same here. But Trever was starting to worry, it seemed to me, performance would drop, if ever so slightly, if we'd be introducing this. And in case others were going to think it mattered. > yes, though once N classes get mmaped for a CLASSIFY a single class LEARN > can check for it and won't mmap() again, and mmsync() can be deferred > iff other processes that use same class(es) do that via shared mem. Yep. When you construct your scripts to handle classification and subsequent learning in the same crm114 instance, you get that advantage today. A (very limited) 'server'-y approach doable right now is writing a script which loops, waiting for messages available on disc or stdin, and keep on processing them one after the other in the same instance: you have the 'CSS stays in mem' benefit then as well (note: ignoring how to code for cutting up stdin into messages and/or poll/wait for disc-based messages here - that's another subject) >> Want some real, achievable gain? convert crm114 to play 'server', i.e. >> permanently loaded and CSS files (close to) permanently mapped in > > yes yes yes yes - the endless daemon saga :) [...] > yeah, maybe the ability to run pre-compiled scripts can be good idea > for a number of applications. You mean a kind of .java p-coded crm114 scripts, i.e. a real crm114 *compiler* (.crm --> .114 binary file) and, er, accompanying 'virtual machine'? Oh boy, the table rises here. ;-P But that's just the geek in me getting all exited. It's not on my list of 'things worth doing @ mid/short-term' though, but fun anyway. A crude/cheap way might be an option to 'dump' and 'load' tokenized script as it leaves the crm114 tokenizer going to the execution unit. Tokenize once, run multiple times. It's not worth it for me (I ran tiny scripts) but all the folks out there enjoying mailreaver and friends might get some good delight out of that as mailreaver/mailtrainer are _significant_ sized scripts. >> seriously considering hacking crm114 into becoming mod_crm114, i.e. an >> Apache2 plugin: you get the server, the socket I/O and the > > like Apache's Lucene and derivatives. Sorta. Yes. >> live in there like a wicked PHP-alike server-side scripting language >> and you will definitely achieve instant notoriety. 
;-) > > and support headache ;) I like my native Americans ;-)) Granted, moving from 1.3 to 2.0/2.2 wasn't easy for a mod_xxx, but still I like it way more than 'roll your own [TCP-based] server' again: linking it to Apache (and no, despite the fact that I do Win32/64, I don't think I'll be the go-to guy if you want IIS plugin support: IIS6 is nice, in a way, and has good performance, but I run Apache on Windows for free projects and only do IIS for paying customers. Got to draw the line _somewhere_. If they open-source IIS, I'll reconsider that statement.) Anyhow, the crm114 scripts would still be there as they are right now; I would just take the std I/O and bend it so stdin = request and stdout (and stderr?) == response. Maybe add a touch of XML if you want to have a freeze-dried instant low-cal 'web service' (which is hot stuff these days, but rather old wine in fashionable new Walmart bags if you ask me, but then folks don't seem to study IT history anymore) Why Apache really? Because I can then 'lean on' the stick provided by them when I need to scale up: distributed servers, pardon, *services*, and the whole bloody lot are documented already. Besides, my purposes lead me towards a production environment as a 'web backend' anyhow, so why not bolt it to the web server itself? Yup, doing so requires some understanding of the Apache API interfacing and that's raising the tech level by +1, but at least you can be spared some significant intricacies regarding TCP/server performance tactics at server level. It's fun to write it, but in this case, my feeling was it's faster to go for mod_crm114 in dev time. And yes: that's 'faster' regarding a _production quality_ mod_crm114 compared to _production quality_ crm114d (note the 'd'). (For free as well: SSL secured communications with the crm114 'service' - which might be something to cheer the 'remote services' folks up quite a bit.) Anyway, I don't 'do' the alpha release of mod_crm114 in one week, nor can I deliver alpha stage crm114d in the same timeframe, so it'll probably stay a great idea over whiskey on Friday as I don't see Bill getting his hands on a particular red phone booth with free access either. ;-) > see above: CAN but definitely should not be a MUST. Does my approach of 'set' as described at start of this email match your CAN, or does it still sound like MUST to you? >> ONE: strict adherence to 'backwards compatibility' at CRM114 script > > just one good reason. well.... ;-) -- Met vriendelijke groeten / Best regards, Ger Hobbelt -------------------------------------------------- web: http://www.hobbelt.com/ http://www.hebbut.net/ mail: ge...@ho... mobile: +31-6-11 120 978 --------------------------------------------------
On Wed, Sep 03, 2008 at 05:03:35AM +0200, Ger Hobbelt wrote: ... > Given 60 classes (= CSS files), Paolo can have his KISS and I can eat > my pie too. Simple. let me stress once again that I question the _requirement_. > The set passed to classify is a set and should be passed to learn as right, if we had vector/array struct it'd be 'natural' ... > isolate (:c:) /class1 | class2 | and so on .../ ... > classify (:*:c:) [message] which is a fake vector, works on strict assumptions on how to name var/classes. Like in other situations, having true array data structure would be quite useful. > learn (:*:c:) (index) [message] > > Both look good to _me_. ;-) agreed, provided that ! learn (:*:s:) [message] <i flags> where :s: is just a subset (1 as limit, so that 'i' can be dropped) of the N classes in use, is (remains) legal (where allowed). > Because you always pass along the whole set at script level, the > classifier code (both learn and classify implementation) gets to pick there's no need for that, where's the binding between script level and classifiers implementation? eg I can define N classes, but use any subset for both LEARN / CLASSIFY at any point to my taste/needs, with the limit of the actual classifier's requirement: !# use classes: one two three four five six seven ! learn (one two three four five six seven) <i flags> [msg_x] ! learn (three four five) <i flags> [msg_y] ! learn (one) <flags> [msg1] ! ... ! classify (one two three four five six seven) <flags> ! classify (five six seven) <flags> ! classify (three six seven) <flags> ! classify (one three four six seven) <flags> ! classify (six) <flags> (cm) # class membership -> cm, unsupported atm ... > what they want/need, you get the chance to apply filters & processes > in learn that are simply impossible right now PLUS you don't have to that's C level, SVM wants 3 because it uses 3 in both cases. > worry anymore either which classifier you're gonna use because today > all the bloody buggers require their own particular incantation when > it comes to number of css files (classes) passed to learn. there are categories of classifiers that have same requirements wrt #classes and params. Now suppose the actual classes are compatible, but one classifier needs 1+ extras (eg SVM) and I want to compare classifiers, then it'd be nice to do (SVM case, forget 4now actual class compatibility): ! learn (a b a_v_b) <svm flags> # wants all 3 ! classify (a b a_v_b) <svm flags> (s_svm) # wants all 3 ! classify (a b) <xxx flags> (s_xxx) # can't use the extra a_v_b > So no unified ... mess; I'd say it's unified ... structure / design. maybe, but that's not as simple as saying : define: N classes hence: LEARN(1 2 ... N) CLASSIFY(1 2 ... N) which might turn into a mess, or better shift the mess from one place to another. > Cost for Trever @ 60 classes? nil. wasn't thinking of run time cost, but script readability. > You save far more time when you find a way to reduce disc I/O cache > misses on your memory-mapped CSS files, even when you achieve such a > feat for learn alone (which would be rather weird and besides, unless > you 'Train Everything', optimizing classify is the winner). I have a yes, though once N classes get mmaped for a CLASSIFY a single class LEARN can check for it and won't mmap() again, and mmsync() can be deferred iff other processes that use same class(es) do that via shared mem. > Want some real, achievable gain? convert crm114 to play 'server', i.e. 
> permanently loaded and CSS files (close to) permanently mapped in yes yes yes yes - the endless daemon saga :) > invocation of crm114 and the moment the script *tokenizer* kicks in. > You're not even *executing* script yet by then! The rest (8%) is > spread across tokenizing ('compiling the [small!] script'), tokenized > script code execution, wrap-up and unidentified fluff elsewhere. > Believe me, if I'd see an easy way to kick that bugger into higher > gear, you'd already have it. yeah, maybe the ability to run pre-compiled scripts can be good idea for a number of applications. > seriously considering hacking crm114 into becoming mod_crm114, i.e. an > Apache2 plugin: you get the server, the socket I/O and the like Apache's Lucene and derivatives. > live in there like a wicked PHP-alike server-side scripting language > and you will definitely achieve instant notoriety. ;-) and support headache ;) > Anyhow, I don't see any good reason why the learn (classes) argument > cannot be identical to the related classify (classes) argument, except see above: CAN but definitely should not be a MUST. > ONE: strict adherence to 'backwards compatibility' at CRM114 script just one good reason. -- paolo
On Tue, Jul 22, 2008 at 1:21 AM, Chris Babcock <cba...@as...> wrote: > The answer to these questions might very well be, "test and measure". > If that's the case, I appreciate pointer to whatever help is available for the methodology since CRM114 and working with text classification in general are new to me. (I hate Perl, but I'm pretty handy with sed... Go figure.) > > How do you calculate the hardware requirements, especially the size of > CSS files needed, with a CRM114 program? Okay, let's try my hand at this on the quick. First of all, there's the classifier you pick. Different classifier, different behaviour, different size requirements. A few of them require unlimited space (CSS files grow a little every time), most of them have fixed space requirements. I'm a bottom-up guy most of the time, so let's start bottom up for some goodness. All 'production quality' classifiers (that's OSB and friends) are based on a fixed-sized hash table. Since the method chosen to store stuff in that hash table is the 'linear probe' algorithm, you should never ever try to get beyond the 50% fill rate point as then the hash table performance is QUICKLY deteriorating (quite a few papers on that; 50% isn't 'hard' but a 'rule of thumb' number). Since the hash table is filled with hash elements and we assume a reasonable quality hash here (flat distribution in N dimensions and bla bla bla), best case is a flat fill. To satisfy the mechanical engineer in me who's learned there's no such thing as a 'best case' in daily practice unless you get your lifetime's joss delivered in a single day, we add a fudge factor and after sucking on my thumb (mjam) I'd say a fill of about 20-30% would be swell. Performance? Can be assumed to be about near flat (O(1)); don't have 'live numbers' on this one, but my guess is bigger CSS files will be slower due to more chances at disc and CPU cache misses while poking at the hash table entries. Fast disc I/O helps there. RAID5, maybe RAID6 or other dandiness... Now one element is one hash plus a number, clocking in at 4 bytes each, so at N elements, that's N*8 bytes disc space per 'feature'. Given a 32-bit box and a safety margin of 2 for signed versus unsigned queerness -- NOT to be mistaken with the use of signed versus unsigned int discussed elsewhere ;-) -- that's a max size of 2GB / 8 = 2**(31-3) = 2**28 elements, then use 25% fill because it's a nice number (1/2**2) that means we can store 2**26 elements max on 32-bit 'without noticeable loss of performance'. (Thanks to the 2G instead of 4G edge I also have a fighting chance at getting this to actually /work/ on such a box as CRM114's using memory mapping and we can't eat all for just the CSS file. No space left for the binary and misc data there.) Ah! But to classify you need two CSS files at least! And given our memory mapping is done all at the same time, I'll dial down that max(N) number to 2**25. Because you can smell the napalm from here when looking at your numbers, let's quickly see what 64-bit has on offer in 'best case'; and that would include additional money for harddisc technology researchers an' all: 2**(63-3-2-1) ==> max(N) storage capacity at 2**57 which would mean you're good to go, topping out somewhere beyond 0.125 ExaFeatures (where one Feature is one CRM114 token a.k.a. 'hash'). If you get that kinda space, could I maybe charge you please for a measly commission fee in the form of .00001% of your disc space, yes? MY problems are solved then. ;-) So far the 'practical' limits. 
Now from your side of the fence: Taking 17*5*(175!/158!) on faith (this is my morning coffee, and it's gotta be fast, so I DO believe) at 6K sized docs? Hmmmm.... Let's just assume one doc is one(1) Feature (it probably isn't but what the heck, my backbone already feels where this is going; trying to beat Big Blue at it, are we, eh? :-)) ) that would mean, say, n!/m! =~= 100**(n-m) for m >= 100 here (and that's a BIG lie! but a really sweet one.) ==> we're going to be hit by a feed of over 17*5*(100**(175-158)) =? and since we're ballparking here like there's no tomorrow, that'd be somewhere beyond 100**(175+1-158)=100**18 == 10**36 which is somewhere over the rainbow and beyond an Exa SQUARED. Like the backbone already knew: ...OOPS?! Not to be the bee in your bonnet - I like the idea! :-)) - but (a) all them Features are never ever gonna fit, even when you get unlimited sponsoring by Hitachi and IBM, heck, you /buy/ them, and (b) assuming for now that (pre-)calculating/learning/whatever one such item takes about a single modern day CPU clock tick, i.e. ~ 1.0**-9 seconds, which is rather optimistic and out the /other/ side, you'd /still/ be at it when the Four Horsemen are having a snack on our offspring. Of course, we can make the bugger 'learn on the job' (don't we all?) and then it turns into the question of 'lifetime': how much do you want it to learn and how good should the bugger be at playing Diplomacy... in the end? Because there's surely to be found 'pathways' in that data a.k.a. 'successful strategies'. Guestimating what learning _those_ will cost is _way_ beyond the morning coffee, though. Sooooo... getting that Diplomacy-playing Big Blue going somewhere during /this/ lifetime, brute force ain't gonna cut it. Assuming the above was kicking in an open stabledoor (but fun!), the plus benefit of it all is that we have one practical usable result here: if you know how many different words ('features') you want this Bayes box to 'remember', you can take that number (N), multiply it by 4*8=32 for a 25% filled OSB[F].Markov/... classifier and your advised/preferred CSS file size would be N*32 bytes. To be eligible for one(1) yes/no style classification question ("is it or isn't it?"), that takes two(2) CSS files, so total disc /cost/ would be about N*64 bytes, excluding a negligible bit of header icing on the cake. Of course, that doesn't say nothing at what a 'feature' would BE in your case; in email it's generally one word, but that's also an 'it depends...' so there's lots of puzzling to do before you hit the Bayes box. Big question before plugging it in: exactly _what_ are we going to feed the animal? (See also a blurb about stocks analysis a few months ago in this ML. Simply plonking in raw data ain't gonna cut it. Same here.) On another note - before I run: if you want 'win/loose/draw' three-ways or other 'multiway' decisions, it is theoretically (and practically) possible with CRM114; for every extra choice you have to add one(1) more CSS file (and a | pipe symbol in your script). Multiway weighting is 'supported' in the code but I haven't heard about anybody actively using it since I first popped by in autumn 2007 so software-wise YMMV, Caveat Emptor, pick your classifier wisely and all that and here's a rabbit foot as well. Cause you're gonna _need_ it. Having unsettled you sufficiently, I'm exit left outa here. The laboring masses and all that. Still, I love your idea. Tip if you want to pursue this: check out what the chess boys have been doing. 
Same problem; smaller scale (cough). > > One of my long range projects is to write an AI for the boardgame > Diplomacy using CRM114. The approach is to archive combinations of turn > results and moves sorted by how favorable the outcome was. The program > builds movement sets by parsing game results to determine the > disposition of its units then consults a movement matrix to generate > all possible order sets. Each movement set and result combination is > submitted to the classifier to determine how closely it "resembles" > winning combinations from games in its training. > > What do I need to know in order to estimate the necessary size of the > CSS file? There's ~175 unit dispositions and an average of 5 possible > destinations for each unit. A well trained classifier which has > not eliminated any trivial cases will have no more than 17*5*(175!/158!) > documents of ~6 KB each - consisting of an order set and the results that produced it - in the corpus for each of the 7 game powers and 5 game phases. I need to be able to determine the optimal classification granularity (number of categories to sort to). Logically, I think that I would get the best speed and accuracy sorting to "win" and "not a win", but that only holds true as long as the classification file is within program limits without microgrooming. Dividing the classification files into more > outcomes - "win", "draw (by size of draw)" and "elimination" - doesn't reduce the size of the "win" CSS file, nor does any other outcome-based > refinements in classification. If I divide the classification according > to a metric of game progress then I can effectively reduce the size of > the CSS files at the expense of calculating that metric each turn. Are > there any guidelines for determining how the size of the CSS files > affects classification speeds? > > Chris > > > > > > > ------------------------------------------------------------------------- > This SF.Net email is sponsored by the Moblin Your Move Developer's challenge > Build the coolest Linux based applications with Moblin SDK & win great prizes > Grand prize is a trip for two to an Open Source event anywhere in the world > http://moblin-contest.org/redirect.php?banner_id=100&url=/ > _______________________________________________ > Crm114-discuss mailing list > Crm...@li... > https://lists.sourceforge.net/lists/listinfo/crm114-discuss > > -- Met vriendelijke groeten / Best regards, Ger Hobbelt -------------------------------------------------- web: http://www.hobbelt.com/ http://www.hebbut.net/ mail: ge...@ho... mobile: +31-6-11 120 978 --------------------------------------------------
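To make the rule of thumb at the end of that answer concrete (8 bytes per hash bucket, roughly 25% table fill, two .css files for a yes/no decision), here is the arithmetic in Python; the one-million-feature figure is an arbitrary illustration, not a recommendation.

    # Worked example of the sizing rule of thumb above: 8 bytes per bucket,
    # ~25% table fill, two .css files for a binary decision.
    def css_bytes(distinct_features, fill=0.25, classes=2):
        buckets_per_class = distinct_features / fill   # 4x headroom at 25% fill
        return int(buckets_per_class * 8 * classes)    # 8 bytes per bucket

    print(css_bytes(1_000_000))   # 64000000, i.e. the N*64 bytes mentioned above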
The answer to these questions might very well be, "test and measure". If that's the case, I appreciate pointer to whatever help is available for the methodology since CRM114 and working with text classification in general are new to me. (I hate Perl, but I'm pretty handy with sed... Go figure.) How do you calculate the hardware requirements, especially the size of CSS files needed, with a CRM114 program? One of my long range projects is to write an AI for the boardgame Diplomacy using CRM114. The approach is to archive combinations of turn results and moves sorted by how favorable the outcome was. The program builds movement sets by parsing game results to determine the disposition of its units then consults a movement matrix to generate all possible order sets. Each movement set and result combination is submitted to the classifier to determine how closely it "resembles" winning combinations from games in its training. What do I need to know in order to estimate the necessary size of the CSS file? There's ~175 unit dispositions and an average of 5 possible destinations for each unit. A well trained classifier which has not eliminated any trivial cases will have no more than 17*5*(175!/158!) documents of ~6 KB each - consisting of an order set and the results that produced it - in the corpus for each of the 7 game powers and 5 game phases. I need to be able to determine the optimal classification granularity (number of categories to sort to). Logically, I think that I would get the best speed and accuracy sorting to "win" and "not a win", but that only holds true as long as the classification file is within program limits without microgrooming. Dividing the classification files into more outcomes - "win", "draw (by size of draw)" and "elimination" - doesn't reduce the size of the "win" CSS file, nor does any other outcome-based refinements in classification. If I divide the classification according to a metric of game progress then I can effectively reduce the size of the CSS files at the expense of calculating that metric each turn. Are there any guidelines for determining how the size of the CSS files affects classification speeds? Chris
Hi, I assume that the numbers you reported are for the testset which was NOT trained as the numbers are lower than 70% of 266. Anyway, I would not worry too much about your numbers in relation to crm114 performance. Nothing which makes my eyebrows go up. I'm rather surprised crm114 got this far on its own, really. The problem lies elsewhere as it looks like you are running into the same /fundamental/ issue as I did when I decided to use crm114 for my signal analysis. The basic two questions you should answer for yourself first and foremost are: 1a- how do Bayesian and other statistical filters like crm114 work EXACTLY? (I refer to recent discussion in the crm114-developer mailing list (Bill/Paolo/Ger) where crm114 innards are explained and discussed using the analogy of a sandbox, green and red balls and a gold ball. It's way too much to reproduce here, but read up on that and make sure you understand what's going on. Research the algorithms used by crm114, before you continue. Key element to understand is how crm114 compares data elements to arrive at similarity figures. Which leads to question 1b- ask yourself where in your data is the 'equality' / identity in elements in the evaluated inputs, which is a low level engineering question derived from the second major question: 2- what are the metrics I want crm114 to compare to help me arrive at the answers which I seek? And which answer am I looking for, really? NOTE: express answers in both functional goals (for yourself) and technical implementation terms, because you are designing the automation of a 'human' system here, so you must be able to instruct the computer what to do /exactly/ what you want it to do to emulate the human process you try to model. Tip of the week: This implies, technically speaking, that you /may/ find you need to preprocess your data. I give this rather generic answer, because I believe it will help you far more in understanding the core of what you are doing than when I focus on a little detail (symptom) in your email and maybe up your successrate right now. Understanding what is going on in there is mandatory for anyone wishing to use statistical filters in a domain where they have not been 'preconfigured' by other researchers for you. > another. Is my understanding correct? Also, I found each time crm114 is made > to learn the same thing, it produces different classification result on > testing case. Is there a correct behavior? A few bits of info are lacking to answer this, but when there's no randomness involved in any way, the process should be completely reproducible, i.e. provide you with the same results after every complete re-run. Some learning methods (when you use mailtrainer for instance) /may/ employ randomizer learn ordering, which will jolt results for test sets; more so for small test sets like yours. Of course, further questions and results are welcomed. Best regards, Ger Hobbelt On Fri, Apr 18, 2008 at 9:48 PM, Weide Zhang <wz...@gm...> wrote: > > > Hi, I am using crm114 to do text mining on stock annual report 10K to make > prediction on their performances. The sample has 266 rows, each containing 1 > column indicating their annual report segment, and the other indicating > whether or not they perform better in that year compared to the industry > average. I use 70% of the data(data before 2006) as training and I tried > different training method. > > Below are the correct number for each category('good' and 'bad' meaning > perform better or worse). 
I use the python wrapper found on the crm114 > wiki. The accuracy is quite low and I notice that for osbf, there are no bad > case that are classified correctly and for markov, only 2 good cases are > classified correctly. It seems that the algorithms is biased one over > another. Is my understanding correct? Also, I found each time crm114 is made > to learn the same thing, it produces different classification result on > testing case. Is there a correct behavior? > > > good bad > entropy corr 14 6 > total 18 19 > > markov corr 2 14 > total 18 19 > > osb corr 10 4 > total 18 19 > > osbf corr 18 0 > total 18 19 > > Thanks for your answer, > > Weide > ------------------------------------------------------------------------- > This SF.net email is sponsored by the 2008 JavaOne(SM) Conference > Don't miss this year's exciting event. There's still time to save 100ドル. > Use priority code J8TL2D2. > > http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone > _______________________________________________ > Crm114-discuss mailing list > Crm...@li... > https://lists.sourceforge.net/lists/listinfo/crm114-discuss > > -- Met vriendelijke groeten / Best regards, Ger Hobbelt -------------------------------------------------- web: http://www.hobbelt.com/ http://www.hebbut.net/ mail: ge...@ho... mobile: +31-6-11 120 978 --------------------------------------------------
Hi, I am using crm114 to do text mining on stock annual reports (10-K) to make predictions about their performance. The sample has 266 rows, each containing one column with the annual report segment and another indicating whether or not the company performed better that year than the industry average. I use 70% of the data (data before 2006) as training and I tried different training methods. Below are the correct counts for each category ('good' and 'bad' meaning perform better or worse). I use the python wrapper found on the crm114 wiki. The accuracy is quite low, and I notice that for osbf no bad cases are classified correctly and for markov only 2 good cases are classified correctly. It seems that the algorithms are biased toward one class over the other. Is my understanding correct? Also, I found that each time crm114 is made to learn the same thing, it produces different classification results on the test cases. Is that the correct behavior?

                 good   bad
entropy  corr      14     6
         total     18    19
markov   corr       2    14
         total     18    19
osb      corr      10     4
         total     18    19
osbf     corr      18     0
         total     18    19

Thanks for your answer, Weide
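For context, the counts above imply the following overall accuracies on the 37-document test set (18 good plus 19 bad); this is plain arithmetic on the reported numbers, not a re-run of the experiment.

    # Overall accuracy implied by the per-class correct counts above.
    results = {"entropy": (14, 6), "markov": (2, 14), "osb": (10, 4), "osbf": (18, 0)}
    for name, (good_ok, bad_ok) in results.items():
        print("%-8s %.0f%%" % (name, 100.0 * (good_ok + bad_ok) / 37))
    # entropy 54%, markov 43%, osb 38%, osbf 49%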
Thanks for your response. But if I have, for example, two words (N=2) and put that in the formula, the resulting weight is 16 (2^(2*2)) and not 4. Where is my mistake? -----Original Message----- From: crm...@li... [mailto:crm...@li...] On Behalf Of Paolo Sent: Monday, January 07, 2008 9:58 PM To: crm...@li... Subject: Re: [Crm114-discuss] Question about the weighting formula in the plateau paper On Fri, Jan 04, 2008 at 04:41:40PM +0100, Tobias Schneider wrote: > Weight = 2^2N > > Thus, for features containing 1, 2, 3, 4, and 5 words, the weights of > those features would be 1, 4, 16, 64, and 256 respectively." > > > What does the variable N in the weighting formula stand for? I think you get the answer in the following slide: (3) the 2^2N weighting means that weights were 1, 4, 16, 64, 256, ... for the span lengths of 1, 2, 3, 4, 5 ... words Thus N stands for the number of words in the N-gram. HTH -- paolo GPG/PGP id:0x1D5A11A4 - 04FC 8EB9 51A1 5158 1425 BC12 EA57 3382 1D5A 11A4 - 9/11: the outrageous deception and ongoing coverup: http://911review.org -
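One way to reconcile the numbers in this thread, offered as a reading of the listed values rather than a statement of what the paper intended: the sequence 1, 4, 16, 64, 256 matches 2^(2(N-1)), i.e. 4^(N-1), when N counts the words in the feature, so a two-word feature gets weight 4; the 2^2N form reproduces the same sequence only if N starts at 0.

    # The listed weights 1, 4, 16, 64, 256 for features of 1..5 words match
    # 4 ** (N - 1) == 2 ** (2 * (N - 1)); this is arithmetic on the quoted
    # sequence, not a claim about the paper's intended notation.
    listed = [1, 4, 16, 64, 256]
    assert listed == [4 ** (n - 1) for n in range(1, 6)]
    assert listed == [2 ** (2 * n) for n in range(0, 5)]   # same sequence, N from 0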
On Fri, Jan 04, 2008 at 04:41:40PM +0100, Tobias Schneider wrote: > Weight = 2^2N > > Thus, for features containing 1, 2, 3, 4, and 5 words, the weights of > those features would be 1, 4, 16, 64, and 256 respectively." > > > What does the variable N in the weighting formula stand for? I think you get the answer in the following slide: (3) the 2^2N weighting means that weights were 1, 4, 16, 64, 256, ... for the span lengths of 1, 2, 3, 4, 5 ... words Thus N stands for the number of words in the N-gram. HTH -- paolo GPG/PGP id:0x1D5A11A4 - 04FC 8EB9 51A1 5158 1425 BC12 EA57 3382 1D5A 11A4 - 9/11: the outrageous deception and ongoing coverup: http://911review.org -
I read the paper "The Spam-Filtering Accuracy Plateau at 99.9% Accuracy and How to Get Past It." and I have a question about the following part: "In this experiment, we used superincreasing weights as determined by the formula Weight = 2^2N Thus, for features containing 1, 2, 3, 4, and 5 words, the weights of those features would be 1, 4, 16, 64, and 256 respectively." What does the variable N in the weighting formula stand for?
Hi list, I recently upgraded from 20070320 to 20070810. Since that upgrade, I get a very large number of false positives, which previously was not the case. I have been training-on-errors for almost three weeks, but crm114 still classifies almost every mail as spam. What's weird is that I thought the cut-off point was 0 and negative scores would be indicative of spam, positives would be ham, but this seems not to be the case: I have messages with a score of 10 marked GOOD and a score of 16 marked SPAM. I am using mailreaver with :clf: /osb unique microgroom/ Does anyone have any advice? -- martin | http://madduck.net/ | http://two.sentenc.es/ eleventh law of acoustics: in a minimum-phase system there is an inextricable link between frequency response, phase response and transient response, as they are all merely transforms of one another. this combined with minimalization of open-loop errors in output amplifiers and correct compensation for non-linear passive crossover network loading can lead to a significant decrease in system resolution lost. however, of course, this all means jack when you listen to pink floyd. spamtraps: mad...@ma...
On Sun, Aug 26, 2007 at 09:54:18PM +0200, martin f krafft wrote: > I upgraded to 20070810-BlameTheSegfault and started to see errors ... > /usr/bin/crm: *ERROR* > This file should have learncounts, but doesn't, and the learncount slot is busy. It's hosed. Time to die. ... > What's going on? It seems to work fine with 20070320. weird ... did you change anything in mailfilter.cf along with the upgrade? what's the :clf: in use? how/when did you make the .css in use? -- paolo PS: this is rather matter for -general ML than -discuss
I upgraded to 20070810-BlameTheSegfault and started to see errors like this whenever I used mailreaver to train spam/ham: ERROR: mailreaver.crm broke. Here's the error: ERROR: /usr/bin/crm: *ERROR* This file should have learncounts, but doesn't, and the learncount slot is busy. It's hosed. Time to die. Sorry, but this program is very sick and probably should be killed off. This happened at line 529 of file /usr/share/crm114/mailreaver.crm What's going on? It seems to work fine with 20070320. -- martin; (greetings from the heart of the sun.) \____ echo mailto: !#^."<*>"|tr "<*> mailto:" net@madduck "the truth is rarely pure and never simple. modern life would be very tedious if it were either, and modern literature a complete impossibility!" -- oscar wilde spamtraps: mad...@ma...
Paolo wrote: >> Brilliant idea! Hadn't thought about the 'head -x', but I _like_ it. I >> was thinking about maybe 1 or 2 Kbytes reserved for the header anyway, >> > these are ideas that floated in ML threads long ago. Note that OSBF makes > room for 4k header. > Yes, I saw there was some version checking and header code in there already. BTW, 'man head' on my box doesn't give a -x option. Is that an option to read until the EOF (or NUL?) character in an ASCII file? > ok, ok - no b2b ;) > Sorry, recalled some 'cool hacking' sessions of long past that went pear shaped as nobody could'handle' it. With 20-20 hindsight it was an exercise in complexity capability (how much nasty little details can you handle all at once). > no, if you put the classes in 2 sets like > ! classify (classA_1 classA_2 ... | classB_1 classB_2 ...) > you get a scalar (success|fail, but still all pR values). If you insted > say > ! classify (classA_1 classA_2 ... classB_1 classB_2 ...) > (note no '|') ie run in 'stats-only' - you get just the pR vector. I think > you can use that for building your fuzzifier, either in CRM or your favourite > prog.lang. A tricky point is that pR is normalized, so that it cannot be > used as class-membership function as is; an artifice could be to add a > class 'AnythingElse', ie the complement to the set of your classes. > I've copied this to my project notes. At the moment, the details of this are beyond my grasp, but that will change when I move away from the code cleanup into the actual algorithmic material of crm114. Thank you for this tip for it gives me a direction to investigate. > note that not all classifiers work well for N >2, nor those that are > *supposed* to work have been thoroughly tested. > I already suspected that much. That's why I don't mind going through all the code: I expect I'll need this exercise later on. > well, crm114 is a jit engine + classifiers plugged-in (bolted-in, at present). > [...] > I think that, if none of the (pR output from) current classifiers fits your > task, it'd be relatively easy to hack one of them into a new one, which > would be named eg f-osb (Fuzzy-OSB) or even OSBG (OSB-Gerrit) ;). > :-) heh, OSBG, now that would be something. Seriously though, I immediately recognized the plug/bolting in features when I first had a look at the crm114 code. Of course, a bit less of a copy&paste approach would have been 'nice' from a certain design point of view, but given the research nature of this type of tool (as Bill put it so eloquently somewhere: 'spam is a moving target') copy&paste is a very good approach (you can always refactor the sections that have stabilized). Besides, there are very nice tools out there to ease diff&merge-ing source files, so it's not much of a hassle to keep them in sync for now (like I did with my copy of SVM vs SKS: SKS seems to have started as an utterly stripped version of SVM, but the behaviour is _very_ similar so I merged the SVM code back in, just so I have lesser diffs to look at when cross-checking SKS vs SVM after a code change in either one of them. Ger
On Tue, Aug 07, 2007 at 08:37:09PM +0200, Gerrit E.G. Hobbelt wrote: > Right now, as I see it, you can't provide hard guarantees that > conversions will work (and I suspect that, given my goal with crm114, Sure you can: Reaver Cache. That works across versions, across classifiers, etc. -- Raul
On Tue, Aug 07, 2007 at 08:37:09PM +0200, Gerrit E.G. Hobbelt wrote: > > > Brilliant idea! Hadn't thought about the 'head -x', but I _like_ it. I > was thinking about maybe 1 or 2 Kbytes reserved for the header anyway, these are ideas that floated in ML threads long ago. Note that OSBF makes room for 4k header. > I've seen the CSV interformat and I was thinking about using that. No > bin-2-bin direct stuff, as that would complicate matters beyond control: ... > I've done direct bin-2-bin conversions in the past, but they're a true > support nightmare. It's doable, but you can have someone spend a serious ok, ok - no b2b ;) > <off-topic> ... > and a _learning_ 'fuzzy' discriminator, which has to wade through a slew > of 'crap' to arrive at a 'proper' rule or decision. Here I'm more > interested in decision _vectors_ (rather small ones) than _scalars_, but > I'll tackle that hurdle when I've got crm114 to a state where I can > really dive into the classifiers themselves, because I believe right now > it only supports single output bits(scalar) (pR?) but I'm not entirely no, if you put the classes in 2 sets like ! classify (classA_1 classA_2 ... | classB_1 classB_2 ...) you get a scalar (success|fail, but still all pR values). If you insted say ! classify (classA_1 classA_2 ... classB_1 classB_2 ...) (note no '|') ie run in 'stats-only' - you get just the pR vector. I think you can use that for building your fuzzifier, either in CRM or your favourite prog.lang. A tricky point is that pR is normalized, so that it cannot be used as class-membership function as is; an artifice could be to add a class 'AnythingElse', ie the complement to the set of your classes. > The problem for me is that I need to understand/learn the algorithm note that not all classifiers work well for N >2, nor those that are *supposed* to work have been thoroughly tested. > I've got the idea, I have a 'feeling' that this is the right direction, > but it's really still just guesswork regarding feasibility so far. well, crm114 is a jit engine + classifiers plugged-in (bolted-in, at present). The whole thing about pR is how you measure the stats for X against the N classes, which is just a bunch of lines that can be tweaked at pleasure. ... > I don't mind too much if crm114 doesn't work out for goal #2 - though it > would be a serious setback - as there's still the spam filter feature I think that, if none of the (pR output from) current classifiers fits your task, it'd be relatively easy to hack one of them into a new one, which would be named eg f-osb (Fuzzy-OSB) or even OSBG (OSB-Gerrit) ;). -- paolo
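A sketch of the 'stats-only' idea above, wrapping a command-line crm call from Python: classify against all the class files with no '|' separator and pull the per-class pR values out of :stats:. The exact layout of the :stats: text varies with classifier and version, so the file names, flags, and the regex here are assumptions to adapt, not a documented format.

    import re
    import subprocess

    CLASSES = ["classA_1.css", "classA_2.css", "classB_1.css"]   # illustrative names

    def pr_vector(message, flags="<osb unique>"):
        # No '|' between the files: a stats-only classify, per the explanation above.
        prog = ("-{ isolate (:stats:); classify %s (%s) (:stats:); "
                "output /:*:stats:/ }") % (flags, " ".join(CLASSES))
        out = subprocess.run(["crm", prog], input=message,
                             capture_output=True, check=True).stdout.decode()
        # Assumed pattern: each per-file line names the file and carries a "pR:" value.
        return {f: float(v)
                for f, v in re.findall(r"\((\S+\.css)\).*?pR:\s*(-?[0-9.]+)", out)}

Such a pR vector could then feed a fuzzifier or an 'AnythingElse' complement class as Paolo suggests, outside of CRM114 itself.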
Paolo wrote: > On Mon, Aug 06, 2007 at 09:31:29PM +0200, Gerrit E.G. Hobbelt wrote: > >> - start each file with a versioned header (I'll come back to that later) >> > > that's well established for Fidelis' OSBF > I saw. It's just that I'm looking for a rather more generic solution, which is copy&paste-able when anyone (probably Bill) feels like adding other classifiers to crm114. Say some sort of 'file format/coding practice' thing: rip if off the other classifiers and just add your own classifier constant (so no fancy footwork with index [0] in the data arrays itself or anything like that). >> a) import all acceptable data, or >> > > there's a catch, as the original arch on which to do the export 1st might > not be avail anymore ... > Heh :-) That's where I refer to the legalese in there: 'sorry sir, it's _forward_ compatible as of this release' ;-) The whole point is that I'm trying to get at a mechanism which clearly identifies the data, both in type and version, so that we can develop a 'sure fire' and sane conversion. This while keeping in mind that design/devel/test time is a rather limited resource, so the 'management decision' may well turn out to be to forego the availability of a complete 'conversion' for specific versions (and that may include crm file versions predating this versioning mechanism). Right now, as I see it, you can't provide hard guarantees that conversions will work (and I suspect that, given my goal with crm114, I'll need that sort of thing), as you have several classifiers and software versions, while there's no way to tell them apart in a _guaranteed_ manner: all one can go on is some version info (OSBF et al) and a bit of heuristics. And 'it may work' isn't an option for me when I'm going to employ crm114, so I like to be able to _specifically_ test (and thus support) crm software versions and classifiers. Longwinded paragraphs cut short: I want to end up with a chart which tells me: "You've got crm114 release X and are using classifier C, well, we do support a 'full data transfer' for the current crm114 release." and maybe an additional (sub-)chart which says: "And incidentally, when you have crm114 running on system S, you can also _share_ that classifier's data on system type T using our import/export-based sync system." These charts have three ticks in each cell of their matrices: (a) may work (a.k.a. there's code for this in there) + (b) tested, a.k.a. we got word it works + (c) supported, a.k.a. you may bother us / complain when it isn't working. No tick in your cell on those charts means: you're SOL. Time for a retraining and ditching of the old files, probably. This would solve the problem of the ever lasting questions: can I keep my files or should I start from scratch? For folks that cannot retrain as they go, this 'charted' approach will provide them with a clear decision chart: can/should I upgrade, or shouldn't I? >> b) report the incompatibility and hence the need to 'recreate/relearn' >> the files. >> > > ... and b) might not always be an option. > See above. I'm well aware of that. I'm driving at a mechanism which allows everyone to clearly see when and what can/has been done. That includes you (J.R. User) helping the crm114 team by adding export/import support for those situations where the chart says 'not available' while you need that sort of thing. That also includes collecting and archiving feedback on [user] test results: did their transfer/upgrade work out ok? 
It's added work, but the benefit is that the upgrade process (and the
decision to upgrade) can be fully automated in the end: for unmanned
systems, only upgrade when our locally used version + classifier has a
tested (and supported?) data migration path towards the new crm114 upgrade
release.

> yep, but I'd consider a bug (which might be just a TODO) a conversion
> util/function which is unable to properly convert our own stuff from arch1
> to arch2, both ways, whatever arch* are.
> Such converters won't be exactly trivial (byte swapping, aligning, padding,
> etc) but feasible.

That's where the limited design/devel resourcing comes into play: I don't
mind if the 'standard' decision is NOT to support/provide a data conversion
path. That's understandable, as we don't have an unlimited supply of dev
power. But when we do choose to provide a conversion path, it should be
clearly identifiable. (Someone may need it and can help Bill, you and the
others by putting in the dev effort there, just like I'm reviving the Win32
port and adding error checking and stuff along the way.)

And, BTW, I've written that sort of cross-platform stuff quite often. It
gets a bit wicked when you need to convert VAX/VMS Fortran floating point
values to PC/x86 IEEE format, for instance. ;-)) Otherwise, it's just really
careful coding and a bit of proper up-front thinking. And then keeping a
lookout for register/word-size issues (e.g. 32- vs. 64-bit) throughout the
crm implementation, which is the hard part. Padding, endianness, etc. can be
handled rather easily: define a 'special struct' with all the basic types in
there and load it with a special byte sequence; that gives you endianness
and alignment for all basic types. Floating point values need a bit of
special treatment when you travel outside the IEEE realm, but that's doable
too. Not trivial, though, indeed. (There's a sketch of such a probe just
after this message.)

>> The binary format header will include these information items (at least):
>>
>> - the crm version used to create the file
>> - the platform (integer size and format (endianness), floating point size
>>   and format, structure alignment, etc.)
>> - the classifier used to create the file
>> - the data content type (some classifiers use multiple files)
>> - space for future expansion (this is a research tool too: allow folks
>>   to add their own stuff which may not fit the header items above)
>
> +file-format version and, since there'll be plenty of space, plain-text
> file-format blurb and summary file-stats, so that head -x css would be
> just fine to report the relevant things.

Brilliant idea! Hadn't thought about the 'head -x', but I _like_ it. I was
thinking about maybe 1 or 2 Kbytes reserved for the header anyway, so, yes,
plenty of space for a little informational text up front. A few Kbytes won't
hurt.

+file-format: yes. In case we find the format needs to be changed again
(hopefully not before 2038 ;-) ). Another very good point.

>> The approach includes the existence of an export/import tool to convert
>> the data to/from a cross-platform portable text format, where
>
> that's the current CSV inter-format, though the converter should be able
> to do it at once binary-2-binary.

I've seen the CSV interformat and I was thinking about using that.
No bin-2-bin direct stuff, as that would complicate matters beyond control:
given the 'cross-platform' tack, it would mean that a developer would have
to code - and maintain - software which includes a table of file layout
definitions, one for each supported platform (and probably each crm release
version too).

Compare this to databases: right now I'm in a project where I've found that
Oracle cannot copy database files as-is across patch versions (that's the
ultra-minor version number), let alone move the binary database files as-is
onto different Unix architectures (HPUX vs. Linux, of course with different
CPUs too). And that makes sense! The point? When Oracle DBAs are used to
export-dumping and importing databases running in the many-multi-Gigabyte
range to provide a migration/upgrade path for the data stored therein, I'd
like to do _exactly_ the same. That means: use the CSV format (probably
augmented) as an intermediate. (Or XML when I feel like getting fancy and
really 21st century ;-) )

I've done direct bin-2-bin conversions in the past, but they're a true
support nightmare. It's doable, but you can have someone spend a serious
chunk of his/her life on that alone. And when that person quits supporting
it, you're SOL as a tool provider, really. (Imagine your customers use a
platform which you didn't support just yet. Maybe even a new CPU type. Can
your _design_ of the bin2bin handle that? Or do you need to spend a
significant amount of devel effort just to add the generation of these
new-CPU-type files to your ware?)

The easy way out is to provide all your customers with a single, portable
format: they've got the software built on their own machines, and who better
than the machine itself to convert to/from that portable format? Thus, the
conversion effort is off-loaded to the compiler vendor, who has to cope with
it anyway (sscanf/printf/etc.; there's a small sketch of the idea just after
this message). XML is a good example of a solution invented to solve
precisely this issue (cross-platform, cross-version, cross-X-whatever data
transfer). We might even consider using XML as a replacement for the CSV
format, though XML tends to be rather, er, obese when it comes to data file
sizes. XML is hierarchical, so we can easily store our header info and crm
classifier data in there, while nicely separated/organized.

> for spam filtering, it's easier (and usually better) to start from scratch,
> but in other applications hash DBs might be precious stuff, so as people
> extend crm114's use to other tasks, such a tool might become highly
> desirable.

Yes, indeed. Verily.

<off-topic>

I have looked around at software supporting Bayesian/Markovian/etc.
statistics and selected crm114 because it looked like it had the right
amount of 'vim' (i.e. a lively dev community) while offering a feature set
which might cover my needs - or get very close indeed.

I intend to use crm114 for spam filtering (when combined with xmail) and for
a second purpose: I'm not going to disclose what it is exactly, but think of
it as a sort of fuzzy decision-making / monitoring process, which is a bit
of a cross-breed between a constraint-driven scheduler and a _learning_
'fuzzy' discriminator, which has to wade through a slew of 'crap' to arrive
at a 'proper' rule or decision. Here I'm more interested in decision
_vectors_ (rather small ones) than _scalars_, but I'll tackle that hurdle
when I've got crm114 to a state where I can really dive into the classifiers
themselves, because I believe right now it only supports single output bits
(scalar) (pR?)
but I'm not entirely sure there (lacking sufficient algorithm
understanding). Anyway, I guess the 'vector solution' would be to use
multiple crm (file) instances in parallel: one pR for each decision item in
the output vector. Of course, that's a crude way, so the 'clean' approach I
was originally aiming for was to convert crm114 into a library which could
be called/used from within my own special-purpose software. Alas, that's not
a Q4 2007 target anyway. ;-)

The problem for me is that I need to understand/learn the algorithm
internals for this advanced statistics stuff, as that is new to me and I
want to understand what it's actually doing, i.e. how this stuff arrives at
a decision, as I need to understand the implicit restrictions on the
classifiers (and learning methods). Let's just say I don't want to join the
masses who can't handle the meaning and implications of 'statistical
significance', such as by just grabbing a likely classifier and 'slapping it
on'. I fear that would cause some serious burn in the long term. You may
have seen from my work so far that I'm a bit paranoid at times
^H^H^H^H^H^H^H acutely aware of failure conditions, and it would be utterly
stupid to fall into that beartrap at a systems level by grabbing this tool
and applying it to a problem without really understanding where and what the
limitations of the various parts are. I've met too many design decisions
_not_ to worry.

I've got the idea, I have a 'feeling' that this is the right direction, but
it's really still just guesswork regarding feasibility so far. I arrived at
crm114 while I had been looking for a decision filter which could easily
handle _huge_ inputs for tiny outputs (spam: input = whole emails, output
vector size = 1), produce consistent and significant decisions (spam: > 99%
filter success rate in a very short learning period) while including a good
'learning' mode: somehow I don't think Bayesian is the bee's knees when it
comes to my second goal. And it has been shown it's certainly not the end of
it for spam either. And besides, crm114 isn't written in Perl (or some other
interpreted language). Which in my world is a big plus. ;-)

I don't mind too much if crm114 doesn't work out for goal #2 - though it
would be a serious setback - as there's still the spam filter feature which
is useful to me. So I don't mind spending some time on this baby to push it
to a level where I can sit back, have a beer and say "yeah! Looks good,
feels good. Let's do it!"

</off-topic>

Best regards,

Ger
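Gerrit's 'special struct' platform probe, mentioned a few paragraphs up,
might look roughly like this minimal C sketch. The struct names and the set
of probed types are illustrative only; this is not crm114's actual header
code.

    #include <stdio.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical 'special struct' platform probe: a leading char followed
     * by the type of interest, so that offsetof() reveals the padding the
     * compiler inserts (i.e. the type's effective alignment). */
    struct align_int    { char pad; int    x; };
    struct align_long   { char pad; long   x; };
    struct align_double { char pad; double x; };

    int main(void)
    {
        unsigned int pattern = 0x01020304u;
        unsigned char b[sizeof pattern];

        /* Byte order: check which byte of the known pattern comes first. */
        memcpy(b, &pattern, sizeof pattern);
        printf("byte order : %s\n",
               b[0] == 0x01 ? "big-endian" :
               b[0] == 0x04 ? "little-endian" : "other/mixed");

        printf("int        : size %zu, align %zu\n",
               sizeof(int),    offsetof(struct align_int, x));
        printf("long       : size %zu, align %zu\n",
               sizeof(long),   offsetof(struct align_long, x));
        printf("double     : size %zu, align %zu\n",
               sizeof(double), offsetof(struct align_double, x));
        return 0;
    }

Values like these are exactly what a file header could record, so an import
tool can tell how the writing machine laid out its data.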
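In the same spirit, here is a minimal sketch of the 'off-load the conversion
to the C library' point from the same message: each machine prints and
re-parses values as text instead of shipping raw structs. The record layout
(hash, count, weight) is made up for the example; it is not the actual CSV
inter-format.

    #include <stdio.h>

    /* Portable text round-trip: the local C library handles word size,
     * endianness and float layout on each side.  The field set below is
     * purely illustrative. */
    int main(void)
    {
        unsigned long hash   = 0xDEADBEEFUL;
        unsigned long count  = 42;
        double        weight = 0.125;
        char line[128];

        /* "export" on machine A */
        snprintf(line, sizeof line, "%lu,%lu,%.17g\n", hash, count, weight);

        /* "import" on machine B */
        unsigned long h2, c2;
        double w2;
        if (sscanf(line, "%lu,%lu,%lf", &h2, &c2, &w2) == 3)
            printf("round-tripped: hash=%lu count=%lu weight=%.17g\n",
                   h2, c2, w2);
        return 0;
    }

Printing doubles with %.17g preserves them exactly across the round trip on
IEEE-754 machines, which is the common case being discussed.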
On Mon, Aug 06, 2007 at 09:31:29PM +0200, Gerrit E.G. Hobbelt wrote:
>
> - start each file with a versioned header (I'll come back to that later)

that's well established for Fidelis' OSBF

> The way to provide the forward portability would be through providing an
> export/import mechanism (already exists for a few formats: cssdump)
...
> The versioned header should contain enough information for an
> export/import function to operate correctly:
> a) import all acceptable data, or

there's a catch, as the original arch on which to do the export 1st might
not be avail anymore ...

> b) report the incompatibility and hence the need to 'recreate/relearn'
> the files.

... and b) might not always be an option.

> Especially (b) is important as that'd enable (automated) upgrades to
> properly interact with the users: one would then be able to select

yep, but I'd consider a bug (which might be just a TODO) a conversion
util/function which is unable to properly convert our own stuff from arch1
to arch2, both ways, whatever arch* are.
Such converters won't be exactly trivial (byte swapping, aligning, padding,
etc) but feasible.

> The binary format header will include these information items (at least):
>
> - the crm version used to create the file
> - the platform (integer size and format (endianness), floating point size
>   and format, structure alignment, etc.)
> - the classifier used to create the file
> - the data content type (some classifiers use multiple files)
> - space for future expansion (this is a research tool too: allow folks
>   to add their own stuff which may not fit the header items above)

+file-format version and, since there'll be plenty of space, plain-text
file-format blurb and summary file-stats, so that head -x css would be
just fine to report the relevant things.

> The approach includes the existence of an export/import tool to convert
> the data to/from a cross-platform portable text format, where

that's the current CSV inter-format, though the converter should be able
to do it at once binary-2-binary.

> What are your thoughts on this matter? Is this worth pursuing (and hence
> augmenting the code to support such a header from now on) or is this,
> well...

for spam filtering, it's easier (and usually better) to start from scratch,
but in other applications hash DBs might be precious stuff, so as people
extend crm114's use to other tasks, such a tool might become highly
desirable.

-- 
paolo
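For readers skimming the archive, one possible shape for the versioned
header being proposed in this thread, written out as a C struct. All field
names, sizes and the 4 KiB figure are illustrative guesses assembled from
the list above plus Paolo's additions; this is not an existing crm114
format.

    #include <stdint.h>

    /* One possible shape for the versioned .css file header under
     * discussion.  Field names, sizes and the 4 KiB total are illustrative
     * guesses, not an actual crm114 format. */
    #define CSS_HEADER_SIZE 4096      /* plenty of room; 'head'-able text  */

    struct css_file_header {
        char     magic[8];            /* identifies the file type          */
        uint32_t header_version;      /* the file-format version itself    */
        char     crm_version[32];     /* crm114 release that wrote the file*/

        /* platform description of the writing machine */
        uint8_t  sizeof_int;
        uint8_t  sizeof_long;
        uint8_t  sizeof_float;
        uint8_t  sizeof_double;
        uint8_t  endianness;          /* 0 = little, 1 = big, 2 = other    */
        uint8_t  struct_align;        /* worst-case alignment when writing */

        uint32_t classifier_id;       /* which classifier created the file */
        uint32_t content_type;        /* which of its files this one is    */

        char     info_text[1024];     /* plain-text blurb + summary stats  */
        uint8_t  reserved[256];       /* research/user expansion space; on
                                       * disk the header block would be
                                       * padded out to CSS_HEADER_SIZE     */
    };

An import tool could then compare the stored platform fields against a
locally generated probe (like the one sketched earlier in the thread) and
decide whether the data can be taken as-is or has to go through the text
export/import route.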