This is a continuation of the discussion in the crm114-general thread. Previous reply by Christian Siefkes:

> Well, it converts the probability of a document T = (F_1, F_2, ..., F_n)
> into the probabilities of generating all the _features_ F_k (k from 1
> to n) in this document.
>
> P(T is spam | F_k) = P(F_k is in spam) * P(T is spam [the prior
> probability]) / P(F_k is in any document)
>
> Where P(F_k is in any document) = P(F_k is in spam) * P(T is spam) +
> P(F_k is in nonspam) * P(T is nonspam)
>
> In the case of CRM, P(F_k is in [non]spam) is calculated by the
> local probability formula given in the Plateau Paper.
>
> We start with uniform priors ( P(T is spam) = P(T is nonspam) = 0.5 )
> and apply this formula for each feature F_k (k from 1 to n) in the
> document T.
>
> So illegal documents in D_5 do not affect these calculations because
> the probability of generating documents is not used. There is no
> leakage of probability mass.

We're still not discussing the same thing. I agree with the formula stated above. This is just a calculation on D_5. The point where things break down is when you take the expression P(T is spam | F_k) and use it without modification for real documents, ie on the set D.

It is true that the set D is basically a subset of D_5, but P must still be converted into some probability on D, which you haven't yet addressed. I am claiming that the restriction of P to the set D is not suitable for this.

In case this doesn't convince you, let me be more explicit: On D_5, you can calculate P(T is spam | T = \phi(t) ), which you've done above, for any fixed real document t \in D. But crm114 only outputs a conditional probability for real documents. The set D_5 is an internal aspect of its algorithm, but has no other relation to crm114's inputs or outputs.
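For concreteness, the per-feature update in the quoted formula can be sketched in a few lines of Python. The local probabilities P(F_k | class) below are invented stand-ins for illustration, not values computed by CRM114's local probability formula:

```python
# A sketch of the quoted per-feature Bayes update, starting from the
# uniform prior P(spam) = 0.5. The per-feature likelihoods are made up.
def posterior_spam(p_f_spam, p_f_ham, prior=0.5):
    """Fold Bayes' rule over the features F_1 ... F_n of one document."""
    p = prior
    for ps, ph in zip(p_f_spam, p_f_ham):
        # P(spam | F_k) = P(F_k|spam) P(spam) / P(F_k in any document),
        # where the denominator is P(F_k|spam)P(spam) + P(F_k|ham)P(ham).
        p = ps * p / (ps * p + ph * (1 - p))
    return p

# Two features; the posterior drifts away from 0.5 as evidence accumulates.
print(posterior_spam([0.8, 0.6], [0.2, 0.5]))
```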
As a machine, crm114 works like this:

real documents -> crm114 -> probabilities of real documents

So you are saying that the correct thing to do is to output the numerical value computed on D_5, ie you output

Prob( t is spam | t = (t_1,...,t_n) ) =(def) P( T is spam | T = \phi(t_1,...,t_n) )

But this is inconsistent on D, because if I feed crm114 every possible real document t \in D, then it will output a set of "conditional probabilities"

(*) Prob( t is spam | t = (t_1,...,t_n) ) for all possible documents t.

But this family of conditional probabilities is inconsistent. There is no measure Prob on D which is compatible with them.

If you don't believe me, I'll give you an outline proof of this (just ask). But let's first see if you follow me so far.

-- Laird Breyer.
Hi,

On Mar 28 2004, Laird Breyer wrote:
> We're still not discussing the same thing. I agree with the formula
> stated above. This is just a calculation on D_5. The point where
> things break down is when you take the expression P(T is spam | F_k)
> and use it without modification for real documents, ie on the set D.
>
> It is true that the set D is basically a subset of D_5, but P must
> still be converted into some probability on D, which you haven't yet
> addressed. I am claiming that the restriction of P to the set D is
> not suitable for this.
>
> In case this doesn't convince you, let me be more explicit: On D_5,
> you can calculate P(T is spam | T = \phi(t) ), which you've done
> above, for any fixed real document t \in D.

Yes.

> But crm114 only outputs a conditional probability for real
> documents. The set D_5 is an internal aspect of its algorithm,
> but has no other relation to crm114's inputs or outputs.
> As a machine, crm114 works like this:
>
> real documents -> crm114 -> probabilities of real documents
>
> So you are saying that the correct thing to do is to output
> the numerical value computed on D_5, ie you output
>
> Prob( t is spam | t = (t_1,...,t_n) ) =(def)
> P( T is spam | T = \phi(t_1,...,t_n) )

Yes. Maybe not "the" correct thing, but I think it's legitimate and consistent.

> But this is inconsistent on D, because if I feed crm114 every possible
> real document t \in D, then it will output a set of "conditional
> probabilities"
>
> (*) Prob( t is spam | t = (t_1,...,t_n) ) for all possible documents t.

Hm, this would be an infinite set, because the length of documents (number of features) is not limited; the number of possible different features isn't limited either. But you could certainly do so for any possible document t...

> But this family of conditional probabilities is inconsistent.
> There is no measure Prob on D which is compatible with them.
>
> If you don't believe me, I'll give you an outline proof of this
> (just ask).
> But let's first see if you follow me so far.

Please do so -- I still don't have a clue why it should be inconsistent.

Bye
	Christian

------------ Christian Siefkes -----------------------------------------
| Email: chr...@si... | Web: http://www.siefkes.net/
| Graduate School in Distributed IS: http://www.wiwi.hu-berlin.de/gkvi/
-------------------- Offline P2P: http://www.leihnetzwerk.de/ ----------
Companies go through periods in which middle and upper managers vie to
out-do each other in thinking up ways to improve near-term performance
(quarterly earnings) by sacrificing the longer term. This is usually
called "bottom-line consciousness," but we prefer to give it another
name: "eating the seed corn."
	-- Tom DeMarco and Timothy Lister, Peopleware
Hi Christian,

Here's the proof (a bit long, so I didn't want to include it earlier).

Let's first assume that D has finite cardinality |D|. This is the case if the vocabulary is finite and documents have a maximum size. We can take the limit afterwards.

Claim: Assume |D| is finite. Define for any t \in D

> > Prob( t is spam | t = (t_1,...,t_n) ) =(def)
> > P( T is spam | T = \phi(t_1,...,t_n) )

Then there exists no probability measure "Prob" defined on D such that the above are conditional probabilities for all documents t \in D.

Proof:

Let M be the 2x|D| matrix, where the row label represents {spam,notspam} and the column label represents an enumeration of the documents t \in D:

M_st =(def) Prob(d is "s" | d = t) =(def) P(T is "s" | T = \phi(t)).

Let N be the |D|x2 matrix, where the row label is a document and the column label is in {spam,notspam}:

N_tr =(def) Prob(d = t | d is "r") = \prod_k P(T_k = \phi(t)_k | T is "r").

The last product is because on D_5, the terms T_k are independent, conditionally on the spam/notspam class. I'm only defining these matrices so I don't have to write long formulas.

If the measure Prob on D exists and is a probability, then we must certainly have

Prob(d is "s") = \sum_t Prob(d is "s" | d = t) Prob(d = t)
              = \sum_{t,r} Prob(d is "s" | d = t) Prob(d = t | d is "r") Prob(d is "r"),

ie v = MNv, where v is the column vector v = (Prob(d is "spam"), Prob(d is "notspam"))^T and we are summing over the real documents t \in D only.

Let A = MN; then v is an eigenvector with eigenvalue 1, and v has the special form v = (a, 1 - a)^T with 0 <= a <= 1. Now

v_i = \sum_j A_ij v_j, and \sum_i v_i = 1 gives

\sum_j (\sum_i A_ij) v_j = 1, which simplifies to

(*) a (\sum_i A_{i,spam}) + (1-a) (\sum_i A_{i,notspam}) = 1.

The elements of A are of course all nonnegative, and I claim there is no solution with 0 <= a <= 1. If no such a exists, then there is no measure Prob on D.
Here is why no such a exists: Let M_5 and N_5 be defined exactly like M and N, but on D_5, with respect to the probability measure P which we agree exists. The only difference is that on D_5 there are many more documents, but still a finite number.

Let A_5 = (M_5)(N_5). You can see that \sum_i (A_5)_{i,spam} = 1, because

\sum_i (A_5)_{i,spam}
  = \sum_{T \in D_5} [ P(spam|T) P(T|spam) + P(notspam|T) P(T|spam) ]
  = \sum_{T \in D_5} P(T|spam) = 1,

where we use P(notspam|T) = 1 - P(spam|T). Similarly, you can show that \sum_i (A_5)_{i,notspam} = 1.

But now, notice that

(**) 1 = \sum_i (A_5)_{i,r} > \sum_i A_{i,r},

so in (*), both sums are < 1. It follows that no a with 0 <= a <= 1 can exist. []

Now that we know that no Prob measure which equals P can exist for D finite, consider the infinite case. Notice that the proof does not use the finite nature of D at all. The finite sums on t can be replaced by infinite sums, and can be interchanged by Fubini's theorem, because all terms are nonnegative. All we need is to be able to enumerate the t's on D and D_5. Moreover, the inequality in (**) is strict even in the infinite case (easy, just use a nonreal document in D_5 with positive P probability). So the same proof works in the infinite case.

-- Laird Breyer.
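The column-sum argument is easy to check numerically on a toy vocabulary. The sketch below uses two binary features and invented probabilities, with one document arbitrarily declared "illegal"; it only illustrates the inequality between the A_5 and A column sums, and has nothing to do with CRM114's actual feature extraction:

```python
from itertools import product

# Toy setup: 2 binary features, so D_5 = all 4 combinations and
# D = a strict subset of 3 "real" documents. Probabilities are invented.
p_spam = [0.8, 0.3]   # P(feature_k = 1 | spam)
p_ham = [0.4, 0.6]    # P(feature_k = 1 | notspam)

def likelihood(doc, p):
    """P(doc | class) under per-feature independence (an entry of N)."""
    out = 1.0
    for bit, pk in zip(doc, p):
        out *= pk if bit else 1 - pk
    return out

D5 = list(product([0, 1], repeat=2))
D = D5[:3]   # pretend (1, 1) is an "illegal" document missing from D

def column_sums(docs):
    """Column sums of A = M N restricted to the given document set."""
    spam_col = ham_col = 0.0
    for t in docs:
        ls, lh = likelihood(t, p_spam), likelihood(t, p_ham)
        m_spam = ls / (ls + lh)   # P(spam | t); the uniform prior cancels
        m_ham = 1 - m_spam        # P(notspam | t)
        spam_col += (m_spam + m_ham) * ls
        ham_col += (m_spam + m_ham) * lh
    return spam_col, ham_col

print(column_sums(D5))  # both sums ~ 1: the A_5 columns sum to 1
print(column_sums(D))   # both sums strictly < 1, so (*) has no solution
```

Because the columns of M sum to 1, each column sum of A collapses to \sum_t P(t | r), which is 1 over D_5 but loses the mass of the missing document over D.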
Hi Laird,

On Mar 28 2004, Laird Breyer wrote:
> Here's the proof (a bit long, so I didn't want to include it earlier).

Hm, there are some points where I'm not sure what you mean. Not sure if we'll find the time to sort this out, but maybe...

> Let's first assume that D has finite cardinality |D|. This is the case
> if the vocabulary is finite and documents have a maximum size. We can
> take the limit afterwards.
>
> Claim: Assume |D| is finite. Define for any t \in D
>
> > > Prob( t is spam | t = (t_1,...,t_n) ) =(def)
> > > P( T is spam | T = \phi(t_1,...,t_n) )
>
> Then there exists no probability measure "Prob" defined on D such
> that the above are conditional probabilities for all documents t \in D.
>
> Proof:
>
> Let M be the 2x|D| matrix, where the row label represents
> {spam,notspam} and the column label represents an enumeration of the
> documents t \in D.
>
> M_st =(def) Prob(d is "s" | d = t) =(def) P(T is "s" | T = \phi(t)),
>
> Let N be the |D|x2 matrix, where the row label is a document and the
> column label is in {spam,notspam}:
>
> N_tr =(def) Prob(d = t | d is "r") = \prod_k P(T_k = \phi(t)_k | T is "r")

This looks more or less as if you're applying the transformation function \phi to each feature t_k in the document t? But you can only apply this function to the whole document t, otherwise you'll get wrong results...

> The last product is because on D_5, the terms T_k are independent,
> conditionally on the spam/notspam class. I'm only defining these
> matrices so I don't have to write long formulas.
>
> If the measure Prob on D exists and is a probability, then we must
> certainly have
>
> Prob(d is "s") = \sum_t Prob(d is "s"| d = t) Prob(d = t)

What exactly is it you're postulating here?

> = \sum_{t,r} Prob(d is "s"| d = t) Prob(d = t | d is "r") Prob(d is "r")
> v = MNv
>
> where v is the column vector v = (Prob(d is "spam"), Prob(d is "notspam"))^T

What's T here? The set of features T_k in the document T = \phi(t)?
On Mar 29 2004, Christian Siefkes wrote:
> > Let M be the 2x|D| matrix, where the row label represents
> > {spam,notspam} and the column label represents an enumeration of the
> > documents t \in D.
> >
> > M_st =(def) Prob(d is "s" | d = t) =(def) P(T is "s" | T = \phi(t)),
> >
> > Let N be the |D|x2 matrix, where the row label is a document and the
> > column label is in {spam,notspam}:
> >
> > N_tr =(def) Prob(d = t | d is "r") = \prod_k P(T_k = \phi(t)_k | T is "r")
>
> This looks more or less as if you're applying the transformation function
> \phi to each feature t_k in the document t? But you can only apply this
> function to the whole document t, otherwise you'll get wrong results...

Sorry, this was meant to be as succinct as possible, but it is what you expect. If \phi(t) = U, then \phi(t)_k = U_k, where U = (U_1,...,U_|D_5|). Each element of the matrix N_{tr} has a fixed full document t and a fixed label r. I could have written this as

N_{tr} = \prod_k P( G_k = T_k | G is "r"), where T = \phi(t) = (T_1,...,T_n).

> > The last product is because on D_5, the terms T_k are independent,
> > conditionally on the spam/notspam class. I'm only defining these
> > matrices so I don't have to write long formulas.
> >
> > If the measure Prob on D exists and is a probability, then we must
> > certainly have
> >
> > Prob(d is "s") = \sum_t Prob(d is "s"| d = t) Prob(d = t)
>
> What exactly is it you're postulating here?

I am saying: if a probability measure Prob exists on D, then such a measure follows all the usual rules of probability theory. The above is sometimes called the law of total probability, and has the form

Prob(A) = \sum_i Prob(A|B_i) Prob(B_i).

This equation is true for all probabilities on D, whenever the B_i form an exhaustive partition of D. This is the case with the choice B_t = {d = t}.
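A tiny numeric sketch of the law of total probability, on an invented joint distribution over two classes and two documents (the numbers are arbitrary):

```python
# Sanity check of the law of total probability on a tiny finite space.
# The joint distribution over (class, document) is invented for illustration.
joint = {("spam", "t1"): 0.2, ("spam", "t2"): 0.3,
         ("notspam", "t1"): 0.1, ("notspam", "t2"): 0.4}

docs = ("t1", "t2")
# Marginal Prob(d = t) and conditional Prob(spam | d = t).
p_doc = {t: sum(p for (c, u), p in joint.items() if u == t) for t in docs}
p_spam_given = {t: joint[("spam", t)] / p_doc[t] for t in docs}

# Prob(spam) = \sum_t Prob(spam | d = t) Prob(d = t), with B_t = {d = t}.
total = sum(p_spam_given[t] * p_doc[t] for t in docs)
print(total)  # recovers the marginal Prob(spam) = 0.2 + 0.3
```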
> > = \sum_{t,r} Prob(d is "s"| d = t) Prob(d = t | d is "r") Prob(d is "r")
> > v = MNv
> >
> > where v is the column vector v = (Prob(d is "spam"), Prob(d is "notspam"))^T
>
> What's T here? The set of features T_k in the document T = \phi(t)?

Oops, that T is for "transpose", because v is a column vector and I can't write columns in an email. Sorry, you can probably ignore that. I'm also using the latex convention that _ indicates a subscript and ^ indicates a superscript. Writing mathematical notation in emails is terribly inefficient.

-- Laird Breyer.