Hi,
I assume that the numbers you reported are for the test set, which was
NOT trained on, as the totals are lower than 70% of 266.
Anyway, I would not worry too much about your numbers in relation to
crm114 performance. Nothing which makes my eyebrows go up. I'm rather
surprised crm114 got this far on its own, really.
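For reference, this is how the counts in your table work out. A quick
sketch in Python; the dictionary layout and the function names are mine
and purely illustrative. The 'majority baseline' is what you would
score by always answering with the larger class:

```python
# Counts copied from the table in the quoted message:
# (correctly classified, total) per class, for each classifier.
results = {
    "entropy": {"good": (14, 18), "bad": (6, 19)},
    "markov":  {"good": (2, 18),  "bad": (14, 19)},
    "osb":     {"good": (10, 18), "bad": (4, 19)},
    "osbf":    {"good": (18, 18), "bad": (0, 19)},
}


def accuracy(per_class):
    """Fraction of all test cases that were classified correctly."""
    correct = sum(c for c, _ in per_class.values())
    total = sum(t for _, t in per_class.values())
    return correct / total


def majority_baseline(per_class):
    """Accuracy obtained by always predicting the larger class."""
    totals = [t for _, t in per_class.values()]
    return max(totals) / sum(totals)


for name, per_class in results.items():
    print(f"{name:8s} accuracy={accuracy(per_class):.3f} "
          f"baseline={majority_baseline(per_class):.3f}")
```

With only 37 test cases, none of the four methods ends up more than
five cases away from that baseline.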
The problem lies elsewhere as it looks like you are running into the
same /fundamental/ issue as I did when I decided to use crm114 for my
signal analysis.
The two basic questions you should answer for yourself first and foremost are:
1a- how do Bayesian and other statistical filters like crm114 work
EXACTLY? (I refer to the recent discussion on the crm114-developer
mailing list (Bill/Paolo/Ger) where the crm114 innards are explained
and discussed using the analogy of a sandbox, green and red balls and
a gold ball.) It's way too much to reproduce here, but read up on that
and make sure you understand what's going on. Research the algorithms
used by crm114 before you continue. The key element to understand is
how crm114 compares data elements to arrive at similarity figures.
Which leads to question
1b- ask yourself where in your data the 'equality' / identity is
between elements of the evaluated inputs, which is a low-level
engineering question derived from the second major question:
2- what are the metrics I want crm114 to compare to help me arrive at
the answers I seek? And which answer am I looking for, really?
NOTE: express the answers both as functional goals (for yourself) and
in technical implementation terms, because you are designing the
automation of a 'human' system here, so you must be able to instruct
the computer /exactly/ what to do to emulate the human process you are
trying to model.
Tip of the week: This implies, technically speaking, that you /may/
find you need to preprocess your data.
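To make questions 1b and 2 a bit more tangible, here is a rough sketch
in Python. It is NOT crm114 code: sparse_bigrams() is only loosely
modeled on the idea of pairing nearby tokens, and preprocess() with its
<money>/<pct>/<num> placeholders is a hypothetical normalization that I
made up for illustration. Whether throwing away the actual figures is
right for you depends entirely on your answer to question 2.

```python
import re


def sparse_bigrams(text, window=5):
    """Pair each token with each of the next (window - 1) tokens,
    tagged with the distance between them (illustrative features only)."""
    tokens = text.split()
    features = set()
    for i, first in enumerate(tokens):
        for gap in range(1, window):
            if i + gap < len(tokens):
                features.add((first, gap, tokens[i + gap]))
    return features


def shared_fraction(a, b):
    """Fraction of features the two texts have in common (Jaccard index)."""
    fa, fb = sparse_bigrams(a), sparse_bigrams(b)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


def preprocess(text):
    """Hypothetical normalization: lowercase, replace figures by placeholders."""
    text = text.lower()
    text = re.sub(r"\$\s?\d[\d,]*(\.\d+)?", " <money> ", text)
    text = re.sub(r"\d[\d,]*(\.\d+)?\s?%", " <pct> ", text)
    text = re.sub(r"\d[\d,]*(\.\d+)?", " <num> ", text)
    return " ".join(text.split())


a = "Revenue increased 12.4% to $3.1 billion"
b = "Revenue increased 9.8% to $2.7 billion"

raw = shared_fraction(a, b)
normalized = shared_fraction(preprocess(a), preprocess(b))
print(f"raw overlap={raw:.2f}  after preprocessing={normalized:.2f}")
```

To a human both sentences say the same kind of thing, but on the raw
text most features differ because every figure is a different token.
After normalization the two inputs produce identical features.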
I give this rather generic answer because I believe it will help you
far more in understanding the core of what you are doing than if I
focused on a little detail (symptom) in your email and maybe upped
your success rate right now. Understanding what is going on in there
is mandatory for anyone wishing to use statistical filters in a domain
where they have not been 'preconfigured' for you by other researchers.
> another. Is my understanding correct? Also, I found each time crm114 is made
> to learn the same thing, it produces different classification result on
> testing case. Is there a correct behavior?
A few bits of info are lacking to answer this, but when there's no
randomness involved in any way, the process should be completely
reproducible, i.e. it should give you the same results after every
complete re-run. Some learning methods (when you use mailtrainer, for
instance) /may/ employ a randomized learning order, which will shift
the results for test sets; more so for small test sets like yours.
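If you drive the training from your own script, one way to rule out
ordering effects is to fix the order yourself. A minimal sketch in
Python; training_order() is a hypothetical helper of mine, not part of
crm114, mailtrainer or the python wrapper:

```python
import random


def training_order(samples, seed=None):
    """Return the samples in shuffled order; a fixed seed gives the
    same order on every run, seed=None gives a fresh order each time."""
    rng = random.Random(seed)
    order = list(samples)
    rng.shuffle(order)
    return order


samples = [f"doc{i:03d}" for i in range(10)]

# Same seed, same order: a complete re-run trains in the same sequence.
assert training_order(samples, seed=42) == training_order(samples, seed=42)

# Size of the test set in the quoted table: 18 'good' + 19 'bad' cases.
test_set_size = 18 + 19
print(f"one flipped test case = {100 / test_set_size:.1f} percentage points")
```

On a test set this small every single case that flips moves the
accuracy by almost three percentage points, so even a minor change in
learning order is clearly visible in the totals.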
Of course, further questions and results are welcomed.
Best regards,
Ger Hobbelt
On Fri, Apr 18, 2008 at 9:48 PM, Weide Zhang <wz...@gm...> wrote:
>
>
> Hi, I am using crm114 to do text mining on stock annual report 10K to make
> prediction on their performances. The sample has 266 rows, each containing 1
> column indicating their annual report segment, and the other indicating
> whether or not they perform better in that year compared to the industry
> average. I use 70% of the data(data before 2006) as training and I tried
> different training method.
>
> Below are the numbers of correctly classified cases for each category
> ('good' and 'bad' meaning performing better or worse). I use the python
> wrapper found on the crm114
> wiki. The accuracy is quite low and I notice that for osbf, there are no bad
> case that are classified correctly and for markov, only 2 good cases are
> classified correctly. It seems that the algorithms is biased one over
> another. Is my understanding correct? Also, I found each time crm114 is made
> to learn the same thing, it produces different classification result on
> testing case. Is there a correct behavior?
>
>
>                   good   bad
> entropy  corr       14     6
>          total      18    19
>
> markov   corr        2    14
>          total      18    19
>
> osb      corr       10     4
>          total      18    19
>
> osbf     corr       18     0
>          total      18    19
>
> Thanks for your answer,
>
> Weide
--
Met vriendelijke groeten / Best regards,
Ger Hobbelt
--------------------------------------------------
web: http://www.hobbelt.com/
http://www.hebbut.net/
mail: ge...@ho...
mobile: +31-6-11 120 978
--------------------------------------------------