crm114-discuss

From: Gerrit E.G. H. <Ger...@be...> - 2007-08-06 19:59:24
L.S.,
I've been thinking a bit about this binary/native file data stuff and 
had an idea I'd like to discuss.
First off, I'm pretty sure by now that 'native binary format' for the 
storage is a Good Thing(tm) as it makes CRM act _fast_. (And that's 
precisely what I need.)
Why is it 'good'? Because crm uses a very fast mechanism to access the 
files called memory mapping (mmap(3) or the Win32 equivalent). Having 
crm translate/convert those bytes to/from 'native format' would either 
slow down the memory mapped file access (each read/write would have to 
convert int/double formats when mmap is used) or we'd have to throw 
memory mapping out the window, i.e. load every file into memory, convert 
once on load and once on save (for learn), increasing both startup 
overhead and memory usage. Both these methods, aimed at 
providing a 'portable binary file format', would have serious impact on 
crm performance IMHO.
(And it just might make the code a bit less legible too, but that can be 
handled in other ways.)
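To make that concrete, here's a minimal C sketch of the access pattern 
being described (the struct and file layout are illustrative stand-ins, 
not crm114's actual css format, and error checking is omitted): once the 
file is mmap()ed, a struct pointer aliases the raw bytes and every field 
read is a plain memory access, with no per-read decode step.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct {           /* illustrative stand-in for a css bucket */
    unsigned int hash;
    unsigned int key;
    unsigned int value;
} BUCKET;

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    /* map the file: pages are faulted in lazily, nothing is bulk-loaded */
    BUCKET *b = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    /* direct native access: a byte-order/format conversion here would
       turn every single lookup into a decode step */
    printf("bucket[0].hash = %u\n", b[0].hash);
    munmap(b, st.st_size);
    close(fd);
    return 0;
}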
Then another thought struck me: while scanning through my (rather short) 
archive of the mailing list, I've seen a few questions about the 
'portability' of these files across multiple CRM versions. (e.g. the 
64-bit discussion with regards to a new Debian build, IIRC)
I don't have a proper solution for resolving the latter, but we might 
consider ways to ensure crm binary formats are 'forwards portable' 
across platforms _and_ versions/releases. (forwards portable = anything 
before this moment may be 'broken', but from now on we can offer 
portability)
The idea is this (it has been done in a minor way before within crm):
- start each file with a versioned header (I'll come back to that later)
- the rest of the file is as it is right now: the hashes and all the 
other goodies, stored in a native format (int, double, etc.) for _fast_ 
access/use, every time.
The way to provide the forward portability would be through providing an 
export/import mechanism (already exists for a few formats: cssdump) 
which can be used to export binary files to a portable (preferably 
text?) format, which can then be imported into another crm: different 
version and/or platform.
A bit of a snag there: I've found that the current way of calculating 
the hashes is not portable, i.e. does not deliver the exact same values 
on different platforms. That has to be fixed before 'cross-platform' 
would be feasible. (As I don't see a 'hash converter' happening: hashes 
are essentially a one-way trip, so there's no way to convert hashes that 
were calculated based on slightly different bit patterns.)
I'm not sure fixing the strnhash() code so as to calculate the same hash 
bitpattern on every platform is quite enough to ensure cross-platform 
compatibility though: the data in the files is sometimes derived from 
the hashes: that would imply those 'derivation paths' would also have to 
be bit-compatible across platforms. (I'm not sure, have to check again 
there if we stored derived data next to config params and hash series.)
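For illustration only, this is the shape a bit-identical-everywhere hash 
takes (FNV-1a here, as a familiar example; this is not crm114's 
strnhash()): fixed-width types, so the result cannot depend on whether 
'long' is 32 or 64 bits on the host, and byte-wise input, so host 
endianness never enters the calculation.

#include <stddef.h>
#include <stdint.h>

uint32_t portable_hash(const char *str, size_t len)
{
    uint32_t h = 2166136261u;          /* FNV-1a offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= (uint8_t)str[i];          /* byte-wise: endian-neutral */
        h *= 16777619u;                /* FNV prime; wraps mod 2^32 */
    }
    return h;
}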
Let's assume however that we've got the hashes/import/export thing 
covered. And even if we don't we would be well aware of the limitations 
then.
That leaves that 'versioned header' I mentioned.
The versioned header should contain enough information for an 
export/import function to operate correctly:
a) import all acceptable data, or
b) report the incompatibility and hence the need to 'recreate/relearn' 
the files.
Especially (b) is important as that'd enable (automated) upgrades to 
properly interact with the users: one would then be able to select 
whether to commence with an 'incompatible' upgrade or delay it until a 
later, opportune moment.
To make it all cross compatible, the header would need to include these 
bits of info:
- actual crm version
- the classifier used
- the platform (little/big endian, 32/64/... bits, etc.)
- data type (just to make sure we're not messing with the wrong sort of 
stuff)
so the im/exporter can handle possible data content / format changes 
and/or data compatibility checks.
The header has to be in a cross-platform portable format so that every 
crm toolset out there can recognize copied/moved files and act 
accordingly: report or process the data.
Given the 'memory mapped' file access approach of crm, this would also 
mean that the header is fixed length (bytes). The software can then 
easily jump over this header to get at the data itself.
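As a sketch only (field names and sizes here are hypothetical, not a 
proposed final layout), such a fixed-length header could look like this 
in C. The header fields themselves would be serialized in one agreed 
byte order so any platform can parse the header, while everything after 
the fixed header stays native:

#include <stdint.h>

#define CRM_HEADER_SIZE 1024   /* fixed: readers skip this many bytes */

typedef struct {
    char     magic[8];         /* file identification, e.g. "CRM114\0\0" */
    uint32_t header_version;   /* version of this header layout itself */
    uint32_t crm_version;      /* crm release that created the file */
    uint32_t classifier;       /* which classifier wrote the data */
    uint32_t data_type;        /* which of the classifier's files this is */
    uint8_t  endianness;       /* 0 = little, 1 = big */
    uint8_t  int_size;         /* sizeof(int) on the writing platform */
    uint8_t  long_size;
    uint8_t  double_size;
    uint8_t  alignment;        /* struct member alignment of the writer */
    uint8_t  reserved[CRM_HEADER_SIZE - 29];   /* expansion + text blurb */
} CRM_FILE_HEADER;

/* Note: a writer would emit these fields byte by byte in the agreed
   order rather than fwrite() the struct, since dumping the struct raw
   would reintroduce the very endianness/padding problem the header is
   meant to describe. Readers address the payload at map + CRM_HEADER_SIZE. */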
Conclusion
-----------
I suggest the introduction of a cross-platform binary format, fixed 
width header for all crm binary files (css et al) to provide (at least) 
forward compatibility.
For the remainder, all these files remain as is: data will be stored in 
a native format for fast access.
The binary format header will include these information items (at least):
- the crm version used to create the file
- the platform (integer size and format (endianness), floating point size 
and format, structure alignment, etc.)
- the classifier used to create the file
- the data content type (some classifiers use multiple files)
- space for future expansion (this is a research tool too: allow folks 
to add their own stuff which may not fit the header items above)
The approach includes the existence of an export/import tool to convert 
the data to/from a cross-platform portable text format, where 
applicable. At least, the im/exporter will then be able to accurately 
predict whether you can 'migrate' your data or need to start with a 
clean slate.
This also depends on the sophistication of the im/exporter, of course, 
but the least benefit is a quick but accurate decision on 
'migrateability' of your data. Which, I believe, doesn't exist now, 
unless you find 'better always rebuild from scratch' a good choice for 
every major/minor software upgrade. ;-)
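To make 'quick but accurate' concrete: the decision could be as small as 
this kind of check (a hypothetical helper, building on the header struct 
sketched above):

#include <string.h>

typedef enum { MIGRATE_OK, MIGRATE_CONVERT, MIGRATE_REBUILD } verdict;

/* 'theirs' describes the file on disk, 'ours' the running build */
verdict check_migratability(const CRM_FILE_HEADER *theirs,
                            const CRM_FILE_HEADER *ours)
{
    if (memcmp(theirs->magic, ours->magic, 8) != 0)
        return MIGRATE_REBUILD;      /* pre-header file: no guarantees */
    if (theirs->classifier != ours->classifier ||
        theirs->data_type  != ours->data_type)
        return MIGRATE_REBUILD;      /* wrong kind of data entirely */
    if (theirs->crm_version == ours->crm_version &&
        theirs->endianness  == ours->endianness &&
        theirs->alignment   == ours->alignment)
        return MIGRATE_OK;           /* same release + platform: use as-is */
    return MIGRATE_CONVERT;          /* identified: export/import applies */
}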
What are your thoughts on this matter? Is this worth pursuing (and hence 
augmenting the code to support such a header from now on) or is this, 
well...
Best regards,
Ger
From: Paolo <oo...@us...> - 2007-08-06 23:31:45
On Mon, Aug 06, 2007 at 09:31:29PM +0200, Gerrit E.G. Hobbelt wrote:
> 
> - start each file with a versioned header (I'll come back to that later)
that's well established for Fidelis' OSBF
> The way to provide the forward portability would be through providing an 
> export/import mechanism (already exists for a few formats: cssdump) 
...
> The versioned header should contain enough information for an 
> export/import function to operate correctly:
> a) import all acceptable data, or
there's a catch, as the original arch on which to do the export 1st might
not be avail anymore ...
> b) report the incompatibility and hence the need to 'recreate/relearn' 
> the files.
... and b) might not always be an option.
 
> Especially (b) is important as that'd enable (automated) upgrades to 
> properly interact with the users: one would then be able to select 
yep, but I'd consider it a bug (which might be just a TODO) if a conversion
util/function is unable to properly convert our own stuff from arch1
to arch2, both ways, whatever arch* are. 
Such converters won't be exactly trivial (byte swapping, aligning, padding, 
etc) but feasible.
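The byte-swap half, in isolation, is only a few lines of C (a sketch; the
real work is applying it, plus re-padding and re-aligning, to every field
of every record):

#include <stdint.h>

/* reassemble a 32-bit value written on an opposite-endian machine */
static uint32_t swap32(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) |
           ((x & 0x0000FF00u) <<  8) |
           ((x & 0x00FF0000u) >>  8) |
           ((x & 0xFF000000u) >> 24);
}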
> The binary format header will include these information items (at least):
> 
> - the crm version used to create the file
> - the platform (integer size and format(endianess), floating point size 
> and format, structure alignment, etc.)
> - the classifier used to create the file
> - the data content type (some classifiers use multiple files)
> - space for future expansion (this is a research tool too: allow folks 
> to add their own stuff which may not fit the header items above)
+file-format version and, since there'll be plenty of space, plain-text
file-format blurb and summary file-stats, so that head -x css would be
just fine to report the relevant things.
> The approach includes the existence of an export/import tool to convert 
> the data to/from a cross-platform portable text format, where 
that's the current CSV inter-format, though the converter should be able
to do it at once binary-2-binary.
> What are your thoughts on this matter? Is this worth persuing (and hence 
> augmenting the code to support such a header from now on) or is this, 
> well...
for spam filtering, it's easier (and usually better) to start from scratch,
but in other applications hashes DB might be precious stuff, so as people
extend crm114 use to other tasks, such a tool might become highly desirable.
--
paolo
From: Gerrit E.G. H. <Ger...@be...> - 2007-08-07 18:37:13
Paolo wrote:
> On Mon, Aug 06, 2007 at 09:31:29PM +0200, Gerrit E.G. Hobbelt wrote:
> 
>> - start each file with a versioned header (I'll come back to that later)
>> 
>
> that's well established for Fidelis' OSBF
> 
I saw. It's just that I'm looking for a rather more generic solution, 
which is copy&paste-able when anyone (probably Bill) feels like adding 
other classifiers to crm114. Say some sort of 'file format/coding 
practice' thing: rip it off the other classifiers and just add your own 
classifier constant (so no fancy footwork with index [0] in the data 
arrays themselves or anything like that).
>> a) import all acceptable data, or
>> 
>
> there's a catch, as the original arch on which to do the export 1st might
> not be avail anymore ...
> 
Heh :-) That's where I refer to the legalese in there: 'sorry sir, it's 
_forward_ compatible as of this release' ;-)
The whole point is that I'm trying to get at a mechanism which clearly 
identifies the data, both in type and version, so that we can develop a 
'sure fire' and sane conversion.
This while keeping in mind that design/devel/test time is a rather 
limited resource, so the 'management decision' may well turn out to be 
to forego the availability of a complete 'conversion' for specific 
versions (and that may include crm file versions predating this 
versioning mechanism).
Right now, as I see it, you can't provide hard guarantees that 
conversions will work (and I suspect that, given my goal with crm114, 
I'll need that sort of thing), as you have several classifiers and 
software versions, while there's no way to tell them apart in a 
_guaranteed_ manner: all one can go on is some version info (OSBF et al) 
and a bit of heuristics. And 'it may work' isn't an option for me when 
I'm going to employ crm114, so I'd like to be able to _specifically_ test 
(and thus support) crm software versions and classifiers.
Long-winded paragraphs cut short:
I want to end up with a chart which tells me: "You've got crm114 release 
X and are using classifier C, well, we do support a 'full data transfer' 
for the current crm114 release."
and maybe an additional (sub-)chart which says: "And incidentally, when 
you have crm114 running on system S, you can also _share_ that 
classifier's data on system type T using our import/export-based sync 
system."
These charts can have up to three ticks in each cell of their matrices: (a) may 
work (a.k.a. there's code for this in there) + (b) tested, a.k.a. we got 
word it works + (c) supported, a.k.a. you may bother us / complain when 
it isn't working.
No tick in your cell on those charts means: you're SOL. Time for a 
retraining and ditching of the old files, probably.
This would solve the problem of the everlasting questions: can I keep 
my files or should I start from scratch?
For folks that cannot retrain as they go, this 'charted' approach will 
provide them with a clear decision chart: can/should I upgrade, or 
shouldn't I?
>> b) report the incompatibility and hence the need to 'recreate/relearn' 
>> the files.
>> 
>
> ... and b) might not always be an option.
> 
See above. I'm well aware of that. I'm driving at a mechanism which 
allows everyone to clearly see when and what can/has been done.
That includes you (J.R. User) helping the crm114 team by adding 
export/import support for those situations where the chart says 'not 
available' while you need that sort of thing.
That also includes collecting and archiving feedback on [user] test 
results: did their transfer/upgrade work out ok?
It's added work, but the benefit is that the upgrade process (and the 
decision to upgrade) can be fully automated in the end for unmanned 
systems: only upgrade when our locally used version + classifier has a 
tested (and supported?) data migration path towards this new crm114 
upgrade release.
> yep, but I'd consider it a bug (which might be just a TODO) if a conversion
> util/function is unable to properly convert our own stuff from arch1
> to arch2, both ways, whatever arch* are. 
> Such converters won't be exactly trivial (byte swapping, aligning, padding, 
> etc) but feasible.
> 
That's where the limited design/devel resourcing comes into play: I 
don't mind if the 'standard' decision is NOT to support/provide a data 
conversion path. That's understandable, as we don't have an 
unlimited supply of dev power.
But when we do choose to provide a conversion path, it's clearly 
identifiable (someone may need it and can help Bill, you, and the others 
by putting in the dev effort there, just like I'm reviving the Win32 port 
and adding error checking and such along the way).
And, BTW, I've written that sort of cross-platform stuff before. It 
gets a bit wicked when you need to convert VAX/VMS Fortran 
floating point values to PC/x86 IEEE format, for instance. ;-)) 
Otherwise, it's just really careful coding and a bit of proper up-front 
thinking. And then keeping a lookout for register/word-size issues (e.g. 
32- vs. 64-bit) throughout the crm implementation, which is the hard part.
Padding, endianness, etc. can be handled rather easily: define a 'special 
struct' with all the basic types in there and load it with a special 
byte sequence: that gives you endianness and alignment for all basic 
types. Floating point values need a bit of special treatment when you 
travel outside the IEEE realm, but that's doable too. Not trivial, 
though, indeed.
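A minimal sketch of that probe idea (assuming C99; the members chosen are
illustrative): write known values in, then inspect the raw bytes for
endianness and use offsetof() for padding/alignment:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct probe {
    uint8_t  c;     /* forces padding before the wider members */
    uint32_t i;
    double   d;
};

int main(void)
{
    struct probe p = { 0xAA, 0x01020304u, 1.0 };
    const uint8_t *raw = (const uint8_t *)&p.i;

    /* little-endian stores the 0x04 byte first, big-endian the 0x01 */
    printf("endianness: %s\n", raw[0] == 0x04 ? "little" : "big");
    printf("offsets: i=%zu d=%zu, sizeof(probe)=%zu\n",
           offsetof(struct probe, i), offsetof(struct probe, d),
           sizeof(struct probe));
    return 0;
}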
>> The binary format header will include these information items (at least):
>>
>> - the crm version used to create the file
>> - the platform (integer size and format(endianess), floating point size 
>> and format, structure alignment, etc.)
>> - the classifier used to create the file
>> - the data content type (some classifiers use multiple files)
>> - space for future expansion (this is a research tool too: allow folks 
>> to add their own stuff which may not fit the header items above)
>> 
>
> +file-format version and, since there'll be plenty of space, plain-text
> file-format blurb and summary file-stats, so that head -x css would be
> just fine to report the relevant things.
> 
Brilliant idea! Hadn't thought about the 'head -x', but I _like_ it. I 
was thinking about maybe 1 or 2 Kbytes reserved for the header anyway, 
so, yes, plenty of space for a little informational text up front. A few 
Kbytes won't hurt.
+file-format: yes. In case we find the format needs to be changed again 
(hopefully not before 2038 ;-) ). Another very good point.
>> The approach includes the existence of an export/import tool to convert 
>> the data to/from a cross-platform portable text format, where 
>> 
>
> that's the current CSV inter-format, though the converter should be able
> to do it at once binary-2-binary.
> 
I've seen the CSV interformat and I was thinking about using that. No 
bin-2-bin direct stuff, as that would complicate matters beyond control: 
given the 'cross-platform' tack, it would mean that a developer would 
have to code - and maintain - software which includes a table of file 
layout definitions, one for each supported platform (and probably the 
crm release version too).
Compare this to databases: right now I'm in a project where I've found 
that Oracle cannot copy database files as-is across patch versions 
(that's the ultra-minor version number), let alone move the binary 
database files as-is onto different unix architectures (HPUX vs. 
Linux, of course with different CPUs too). And that makes sense!
The point? When Oracle DBAs are used to export-dumping and importing 
databases running in the many-multi-Gigabyte range to provide a 
migration/upgrade path for the data stored therein, I'd like to do 
_exactly_ the same. That means: use the CSV format (probably augmented) 
as an intermediate. (Or XML when I feel like getting fancy and really 
21st century ;-) )
I've done direct bin-2-bin conversions in the past, but they're a true 
support nightmare. It's doable, but you can have someone spend a serious 
chunk of his/her life on that alone. And when that person quits 
supporting it, you're SOL as a tool provider, really. (Imagine your 
customers use a platform which you didn't support just yet. Maybe a new 
CPU type even. Can your _design_ of the bin2bin handle that? Or do you 
need to spend a significant amount of devel effort just to add the 
generation of these new-CPU-type files to your ware?)
The easy way out is to provide all your customers with a single, 
portable format: they've got the software built on their own machines 
and who better than the machine itself can convert to/from that portable 
format? Thus, the conversion effort is off-loaded to the compiler 
vendor, who has to cope with it anyway. (sscanf/printf/etc.)
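A sketch of what that off-loading looks like (the three-field record is
illustrative, not the actual CSV inter-format): each machine converts
only between its own native representation and plain text, so neither
side ever needs to know the other's binary layout.

#include <stdint.h>
#include <stdio.h>

void export_record(FILE *out, uint32_t hash, uint32_t key, uint32_t value)
{
    /* native binary -> portable text, one record per line */
    fprintf(out, "%lu,%lu,%lu\n",
            (unsigned long)hash, (unsigned long)key, (unsigned long)value);
}

int import_record(FILE *in, uint32_t *hash, uint32_t *key, uint32_t *value)
{
    unsigned long h, k, v;
    /* portable text -> whatever 'native' means on *this* machine */
    if (fscanf(in, "%lu,%lu,%lu", &h, &k, &v) != 3)
        return -1;
    *hash = (uint32_t)h; *key = (uint32_t)k; *value = (uint32_t)v;
    return 0;
}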
XML is a good example of a solution invented to solve precisely this 
issue. (cross-platform, cross-version, cross-X-whatever data transfer)
We might even consider using XML as a replacement for the CSV format, 
though XML tends to be rather, er, obese, when it comes to data file 
sizes. XML is hierarchical, so we can easily store our header info and 
crm classifier data in there, while nicely separated/organized.
> for spam filtering, it's easier (and usually better) to start from scratch,
> but in other applications hashes DB might be precious stuff, so as people
> extends crm114 use to other tasks, such tool might become highly desirable.
> 
Yes, indeed. Verily.
<off-topic>
I have looked around at software supporting Bayesian/Markovian/etc. 
statistics and selected crm114 because it looked like it had the right 
amount of 'vim' (i.e. a lively dev community) while offering a feature 
set which might cover my needs - or get very close indeed.
I intend to use crm114 for spam filtering (when combined with xmail) and 
for a second purpose: I'm not going to disclose what it is exactly, but 
think of it as a sort of fuzzy decision-making / monitoring process, 
which is a bit of a cross-breed between a constraint-driven scheduler 
and a _learning_ 'fuzzy' discriminator, which has to wade through a slew 
of 'crap' to arrive at a 'proper' rule or decision. Here I'm more 
interested in decision _vectors_ (rather small ones) than _scalars_, but 
I'll tackle that hurdle when I've got crm114 to a state where I can 
really dive into the classifiers themselves, because I believe right now 
it only supports single output bits (scalar) (pR?) but I'm not entirely 
sure there (lacking sufficient algorithm understanding). Anyway, I guess 
the 'vector solution' would be to use multiple crm (file) instances in 
parallel: one pR for each decision item in the output vector. Of course, 
that's a crude way, so the 'clean' approach I was originally aiming for 
was to convert crm114 into a library which could be called/used from within my 
own special purpose software. Alas, that's not a Q4 2007 target anyway. ;-)
The problem for me is that I need to understand/learn the algorithm 
internals for this advanced statistics stuff as that is new to me and I 
want to understand what it's actually doing, i.e. how this stuff arrives 
at a decision, as I need to understand the implicit restrictions on the 
classifiers (and learning methods). Let's just say I don't want to join 
the masses who can't handle the meaning and implications of 'statistical 
significance', such as by just grabbing a likely classifier and 
'slapping it on'. I fear that would cause some serious burn in the long 
term.
You may have seen from my work so far, that I'm a bit paranoid at times 
^H^H^H^H^H^H^H acutely aware of failure conditions, and it would be 
utterly stupid to fall into that beartrap at a systems level by grabbing 
this tool and applying it to a problem without really understanding 
where and what the limitations of the various parts are. I've met too 
many design decisions _not_ to worry.
I've got the idea, I have a 'feeling' that this is the right direction, 
but it's really still just guesswork regarding feasibility so far.
I arrived at crm114 while I had been looking for a decision filter which 
could easily handle _huge_ inputs for tiny outputs (spam: input = whole 
emails, output vector size = 1), produce consistent and significant 
decisions (spam: > 99% filter success rate in a very short learning 
period) while including a good 'learning' mode: somehow I don't think 
Bayesian is the bee's knees when it comes to my second goal. And it has 
been shown it's certainly not the end of it for spam either.
And besides, crm114 isn't written in Perl (or some other interpreted 
language). Which in my world is a big plus. ;-)
I don't mind too much if crm114 doesn't work out for goal #2 - though it 
would be a serious setback - as there's still the spam filter feature 
which is useful to me. So I don't mind spending some time on this baby 
to push it to a level where I can sit back, have a beer and say "yeah! 
Looks good, feels good. Let's do it!"
</off-topic>
Best regards,
Ger
From: Paolo <oo...@us...> - 2007-08-08 08:41:51
On Tue, Aug 07, 2007 at 08:37:09PM +0200, Gerrit E.G. Hobbelt wrote:
> > 
> Brilliant idea! Hadn't thought about the 'head -x', but I _like_ it. I 
> was thinking about maybe 1 or 2 Kbytes reserved for the header anyway, 
these are ideas that floated in ML threads long ago. Note that OSBF makes
room for 4k header.
> I've seen the CSV interformat and I was thinking about using that. No 
> bin-2-bin direct stuff, as that would complicate matters beyond control: 
...
> I've done direct bin-2-bin conversions in the past, but they're a true 
> support nightmare. It's doable, but you can have someone spend a serious 
ok, ok - no b2b ;)
> <off-topic>
...
> and a _learning_ 'fuzzy' discriminator, which has to wade through a slew 
> of 'crap' to arrive at a 'proper' rule or decision. Here I'm more 
> interested in decision _vectors_ (rather small ones) than _scalars_, but 
> I'll tackle that hurdle when I've got crm114 to a state where I can 
> really dive into the classifiers themselves, because I believe right now 
> it only supports single output bits(scalar) (pR?) but I'm not entirely 
no, if you put the classes in 2 sets like
! classify (classA_1 classA_2 ... | classB_1 classB_2 ...)
you get a scalar (success|fail, but still all pR values). If you instead
say
! classify (classA_1 classA_2 ... classB_1 classB_2 ...)
(note no '|') ie run in 'stats-only' - you get just the pR vector. I think
you can use that for building your fuzzifier, either in CRM or your favourite
prog.lang. A tricky point is that pR is normalized, so that it cannot be
used as a class-membership function as-is; an artifice could be to add a 
class 'AnythingElse', ie the complement to the set of your classes.
> The problem for me is that I need to understand/learn the algorithm 
note that not all classifiers work well for N > 2, nor have those that are 
*supposed* to work been thoroughly tested.
> I've got the idea, I have a 'feeling' that this is the right direction, 
> but it's really still just guesswork regarding feasibility so far.
well, crm114 is a jit engine + classifiers plugged-in (bolted-in, at present).
The whole thing about pR is how you measure the stats for X against the
N classes, which is just a bunch of lines that can be tweaked at pleasure.
...
> I don't mind too much if crm114 doesn't work out for goal #2 - though it 
> would be a serious setback - as there's still the spam filter feature 
I think that, if none of the (pR output from) current classifiers fits your
task, it'd be relatively easy to hack one of them into a new one, which 
would be named eg f-osb (Fuzzy-OSB) or even OSBG (OSB-Gerrit) ;).
--
paolo
From: Gerrit E.G. H. <Ger...@be...> - 2007-08-08 21:51:14
Paolo wrote:
>> Brilliant idea! Hadn't thought about the 'head -x', but I _like_ it. I 
>> was thinking about maybe 1 or 2 Kbytes reserved for the header anyway, 
>> 
> these are ideas that floated in ML threads long ago. Note that OSBF makes
> room for 4k header.
> 
Yes, I saw there was some version checking and header code in there already.
BTW, 'man head' on my box doesn't give a -x option. Is that an option to 
read until the EOF (or NUL?) character in an ASCII file?
> ok, ok - no b2b ;)
> 
Sorry, recalled some 'cool hacking' sessions of long past that went 
pear-shaped as nobody could 'handle' it. With 20-20 hindsight it was an 
exercise in complexity capability (how many nasty little details can 
you handle all at once).
> no, if you put the classes in 2 sets like
> ! classify (classA_1 classA_2 ... | classB_1 classB_2 ...)
> you get a scalar (success|fail, but still all pR values). If you instead
> say
> ! classify (classA_1 classA_2 ... classB_1 classB_2 ...)
> (note no '|') ie run in 'stats-only' - you get just the pR vector. I think
> you can use that for building your fuzzifier, either in CRM or your favourite
> prog.lang. A tricky point is that pR is normalized, so that it cannot be
> used as a class-membership function as-is; an artifice could be to add a 
> class 'AnythingElse', ie the complement to the set of your classes.
> 
I've copied this to my project notes. At the moment, the details of this 
are beyond my grasp, but that will change when I move away from the code 
cleanup into the actual algorithmic material of crm114.
Thank you for this tip; it gives me a direction to investigate.
> note that not all classifiers work well for N > 2, nor have those that are 
> *supposed* to work been thoroughly tested.
> 
I already suspected that much. That's why I don't mind going through all 
the code: I expect I'll need this exercise later on.
> well, crm114 is a jit engine + classifiers plugged-in (bolted-in, at present).
> [...]
> I think that, if none of the (pR output from) current classifiers fits your
> task, it'd be relatively easy to hack one of them into a new one, which 
> would be named eg f-osb (Fuzzy-OSB) or even OSBG (OSB-Gerrit) ;).
> 
:-) heh, OSBG, now that would be something.
Seriously though, I immediately recognized the plug/bolting in features 
when I first had a look at the crm114 code.
Of course, a bit less of a copy&paste approach would have been 'nice' 
from a certain design point of view, but given the research nature of 
this type of tool (as Bill put it so eloquently somewhere: 'spam is a 
moving target') copy&paste is a very good approach (you can always 
refactor the sections that have stabilized).
Besides, there are very nice tools out there to ease diff&merge-ing 
source files, so it's not much of a hassle to keep them in sync for now 
(like I did with my copy of SVM vs SKS: SKS seems to have started as an 
utterly stripped version of SVM, but the behaviour is _very_ similar, so 
I merged the SVM code back in, just so I have fewer diffs to look at 
when cross-checking SKS vs SVM after a code change in either one of them).
Ger
From: Raul M. <mo...@ma...> - 2007-08-08 16:00:33
On Tue, Aug 07, 2007 at 08:37:09PM +0200, Gerrit E.G. Hobbelt wrote:
> Right now, as I see it, you can't provide hard guarantees that 
> conversions will work (and I suspect that, given my goal with crm114, 
Sure you can: Reaver Cache.
That works across versions, across classifiers, etc.
-- 
Raul