This is a continuation of the discussion in the crm114-general thread. Previous reply by Christian Siefkes:

> Well, it converts the probability of a document T = (F_1, F_2, ..., F_n)
> into the probabilities of generating all the _features_ F_k (k from 1
> to n) in this document.
>
> P(T is spam | F_k) = P(F_k is in spam) * P(T is spam [the prior
> probability]) / P(F_k is in any document)
>
> Where P(F_k is in any document) = P(F_k is in spam) * P(T is spam) +
> P(F_k is in nonspam) * P(T is nonspam)
>
> In the case of CRM, P(F_k is in [non]spam) is calculated by the
> local probability formula given in the Plateau Paper.
>
> We start with uniform priors ( P(T is spam) = P(T is nonspam) = 0.5 )
> and apply this formula for each feature F_k (k from 1 to n) in the
> document T.
>
> So illegal documents in D_5 do not affect these calculations because
> the probability of generating documents is not used. There is no
> leakage of probability mass.

We're still not discussing the same thing. I agree with the formula stated above. This is just a calculation on D_5. The point where things break down is when you take the expression P(T is spam | F_k) and use it without modification for real documents, ie on the set D.

It is true that the set D is basically a subset of D_5, but P must still be converted into some probability on D, which you haven't yet addressed. I am claiming that the restriction of P to the set D is not suitable for this.

In case this doesn't convince you, let me be more explicit: On D_5, you can calculate P(T is spam | T = \phi(t) ), which you've done above, for any fixed real document t \in D. But crm114 only outputs a conditional probability for real documents. The set D_5 is an internal aspect of its algorithm, but has no other relation to crm114's inputs or outputs.
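For concreteness, the per-feature update in the quoted formula can be sketched in a few lines of Python. The local probabilities P(F_k | class) below are invented stand-ins for illustration, not values computed by CRM114's local probability formula:

```python
# A sketch of the quoted per-feature Bayes update, starting from the
# uniform prior P(spam) = 0.5. The per-feature likelihoods are made up.
def posterior_spam(p_f_spam, p_f_ham, prior=0.5):
    """Fold Bayes' rule over the features F_1 ... F_n of one document."""
    p = prior
    for ps, ph in zip(p_f_spam, p_f_ham):
        # P(spam | F_k) = P(F_k|spam) P(spam) / P(F_k in any document),
        # where the denominator is P(F_k|spam)P(spam) + P(F_k|ham)P(ham).
        p = ps * p / (ps * p + ph * (1 - p))
    return p

# Two features; the posterior drifts away from 0.5 as evidence accumulates.
print(posterior_spam([0.8, 0.6], [0.2, 0.5]))
```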
As a machine, crm114 works like this:

real documents -> crm114 -> probabilities of real documents

So you are saying that the correct thing to do is to output the numerical value computed on D_5, ie you output

Prob( t is spam | t = (t_1,...,t_n) ) =(def) P( T is spam | T = \phi(t_1,...,t_n) )

But this is inconsistent on D, because if I feed crm114 every possible real document t \in D, then it will output a set of "conditional probabilities"

(*) Prob( t is spam | t = (t_1,...,t_n) ) for all possible documents t.

But this family of conditional probabilities is inconsistent. There is no measure Prob on D which is compatible with them.

If you don't believe me, I'll give you an outline proof of this (just ask). But let's first see if you follow me so far.

-- Laird Breyer.
Hi,

On Mar 28 2004, Laird Breyer wrote:
> We're still not discussing the same thing. I agree with the formula
> stated above. This is just a calculation on D_5. The point where
> things break down is when you take the expression P(T is spam | F_k)
> and use it without modification for real documents, ie on the set D.
>
> It is true that the set D is basically a subset of D_5, but P must
> still be converted into some probability on D, which you haven't yet
> addressed. I am claiming that the restriction of P to the set D is
> not suitable for this.
>
> In case this doesn't convince you, let me be more explicit: On D_5,
> you can calculate P(T is spam | T = \phi(t) ), which you've done
> above, for any fixed real document t \in D.

Yes.

> But crm114 only outputs a conditional probability for real
> documents. The set D_5 is an internal aspect of its algorithm,
> but has no other relation to crm114's inputs or outputs.
> As a machine, crm114 works like this:
>
> real documents -> crm114 -> probabilities of real documents
>
> So you are saying that the correct thing to do is to output
> the numerical value computed on D_5, ie you output
>
> Prob( t is spam | t = (t_1,...,t_n) ) =(def)
> P( T is spam | T = \phi(t_1,...,t_n) )

Yes. Maybe not "the" correct thing, but I think it's legitimate and consistent.

> But this is inconsistent on D, because if I feed crm114 every possible
> real document t \in D, then it will output a set of "conditional
> probabilities"
>
> (*) Prob( t is spam | t = (t_1,...,t_n) ) for all possible documents t.

Hm, this would be an infinite set, because the length of documents (number of features) is not limited; the number of possible different features isn't limited either. But you could certainly do so for any possible document t...

> But this family of conditional probabilities is inconsistent.
> There is no measure Prob on D which is compatible with them.
>
> If you don't believe me, I'll give you an outline proof of this
> (just ask).
> But let's first see if you follow me so far.

Please do so -- I still don't have a clue why it should be inconsistent.

Bye
	Christian

------------ Christian Siefkes -----------------------------------------
| Email: chr...@si... | Web: http://www.siefkes.net/
| Graduate School in Distributed IS: http://www.wiwi.hu-berlin.de/gkvi/
-------------------- Offline P2P: http://www.leihnetzwerk.de/ ----------
Companies go through periods in which middle and upper managers vie to
out-do each other in thinking up ways to improve near-term performance
(quarterly earnings) by sacrificing the longer term. This is usually
called "bottom-line consciousness," but we prefer to give it another
name: "eating the seed corn."
	-- Tom DeMarco and Timothy Lister, Peopleware
Hi Christian,

Here's the proof (a bit long, so I didn't want to include it earlier).

Let's first assume that D has finite cardinality |D|. This is the case if the vocabulary is finite and documents have a maximum size. We can take the limit afterwards.

Claim: Assume |D| is finite. Define for any t \in D

> > Prob( t is spam | t = (t_1,...,t_n) ) =(def)
> > P( T is spam | T = \phi(t_1,...,t_n) )

Then there exists no probability measure "Prob" defined on D such that the above are conditional probabilities for all documents t \in D.

Proof:

Let M be the 2x|D| matrix, where the row label represents {spam,notspam} and the column label represents an enumeration of the documents t \in D:

M_st =(def) Prob(d is "s" | d = t) =(def) P(T is "s" | T = \phi(t)).

Let N be the |D|x2 matrix, where the row label is a document and the column label is in {spam,notspam}:

N_tr =(def) Prob(d = t | d is "r") = \prod_k P(T_k = \phi(t)_k | T is "r").

The last product is because on D_5, the terms T_k are independent, conditionally on the spam/notspam class. I'm only defining these matrices so I don't have to write long formulas.

If the measure Prob on D exists and is a probability, then we must certainly have

Prob(d is "s") = \sum_t Prob(d is "s" | d = t) Prob(d = t)
              = \sum_{t,r} Prob(d is "s" | d = t) Prob(d = t | d is "r") Prob(d is "r"),

ie v = MNv, where v is the column vector v = (Prob(d is "spam"), Prob(d is "notspam"))^T and we are summing over the real documents t \in D only.

Let A = MN; then v is an eigenvector with eigenvalue 1, and v has the special form v = (a, 1 - a)^T with 0 <= a <= 1. Now

v_i = \sum_j A_ij v_j, and \sum_i v_i = 1 gives

\sum_j (\sum_i A_ij) v_j = 1, which simplifies to

(*) a (\sum_i A_{i,spam}) + (1-a) (\sum_i A_{i,notspam}) = 1.

The elements of A are of course all nonnegative, and I claim there is no solution with 0 <= a <= 1. If no such a exists, then there is no measure Prob on D.
Here is why no such a exists: Let M_5 and N_5 be defined exactly like M and N, but on D_5, with respect to the probability measure P which we agree exists. The only difference is that on D_5 there are many more documents, but still a finite number.

Let A_5 = (M_5)(N_5). You can see that \sum_i (A_5)_{i,spam} = 1, because

\sum_i (A_5)_{i,spam}
  = \sum_{T \in D_5} [ P(spam|T) P(T|spam) + P(notspam|T) P(T|spam) ]
  = \sum_{T \in D_5} P(T|spam) = 1,

where we use P(notspam|T) = 1 - P(spam|T). Similarly, you can show that \sum_i (A_5)_{i,notspam} = 1.

But now, notice that

(**) 1 = \sum_i (A_5)_{i,r} > \sum_i A_{i,r},

so in (*), both sums are < 1. It follows that no a with 0 <= a <= 1 can exist. []

Now that we know that no Prob measure which equals P can exist for D finite, consider the infinite case. Notice that the proof does not use the finite nature of D at all. The finite sums on t can be replaced by infinite sums, and can be interchanged by Fubini's theorem, because all terms are nonnegative. All we need is to be able to enumerate the t's on D and D_5. Moreover, the inequality in (**) is strict even in the infinite case (easy, just use a nonreal document in D_5 with positive P probability). So the same proof works in the infinite case.

-- Laird Breyer.
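The column-sum argument is easy to check numerically on a toy vocabulary. The sketch below uses two binary features and invented probabilities, with one document arbitrarily declared "illegal"; it only illustrates the inequality between the A_5 and A column sums, and has nothing to do with CRM114's actual feature extraction:

```python
from itertools import product

# Toy setup: 2 binary features, so D_5 = all 4 combinations and
# D = a strict subset of 3 "real" documents. Probabilities are invented.
p_spam = [0.8, 0.3]   # P(feature_k = 1 | spam)
p_ham = [0.4, 0.6]    # P(feature_k = 1 | notspam)

def likelihood(doc, p):
    """P(doc | class) under per-feature independence (an entry of N)."""
    out = 1.0
    for bit, pk in zip(doc, p):
        out *= pk if bit else 1 - pk
    return out

D5 = list(product([0, 1], repeat=2))
D = D5[:3]   # pretend (1, 1) is an "illegal" document missing from D

def column_sums(docs):
    """Column sums of A = M N restricted to the given document set."""
    spam_col = ham_col = 0.0
    for t in docs:
        ls, lh = likelihood(t, p_spam), likelihood(t, p_ham)
        m_spam = ls / (ls + lh)   # P(spam | t); the uniform prior cancels
        m_ham = 1 - m_spam        # P(notspam | t)
        spam_col += (m_spam + m_ham) * ls
        ham_col += (m_spam + m_ham) * lh
    return spam_col, ham_col

print(column_sums(D5))  # both sums ~ 1: the A_5 columns sum to 1
print(column_sums(D))   # both sums strictly < 1, so (*) has no solution
```

Because the columns of M sum to 1, each column sum of A collapses to \sum_t P(t | r), which is 1 over D_5 but loses the mass of the missing document over D.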
Hi Laird,

On Mar 28 2004, Laird Breyer wrote:
> Here's the proof (a bit long, so I didn't want to include it earlier).

Hm, there are some points where I'm not sure what you mean. Not sure if we'll find the time to sort this out, but maybe...

> Let's first assume that D has finite cardinality |D|. This is the case
> if the vocabulary is finite and documents have a maximum size. We can
> take the limit afterwards.
>
> Claim: Assume |D| is finite. Define for any t \in D
>
> > > Prob( t is spam | t = (t_1,...,t_n) ) =(def)
> > > P( T is spam | T = \phi(t_1,...,t_n) )
>
> Then there exists no probability measure "Prob" defined on D such
> that the above are conditional probabilities for all documents t \in D.
>
> Proof:
>
> Let M be the 2x|D| matrix, where the row label represents
> {spam,notspam} and the column label represents an enumeration of the
> documents t \in D.
>
> M_st =(def) Prob(d is "s" | d = t) =(def) P(T is "s" | T = \phi(t)),
>
> Let N be the |D|x2 matrix, where the row label is a document and the
> column label is in {spam,notspam}:
>
> N_tr =(def) Prob(d = t | d is "r") = \prod_k P(T_k = \phi(t)_k | T is "r")

This looks more or less as if you're applying the transformation function \phi to each feature t_k in the document t? But you can only apply this function to the whole document t, otherwise you'll get wrong results...

> The last product is because on D_5, the terms T_k are independent,
> conditionally on the spam/notspam class. I'm only defining these
> matrices so I don't have to write long formulas.
>
> If the measure Prob on D exists and is a probability, then we must
> certainly have
>
> Prob(d is "s") = \sum_t Prob(d is "s"| d = t) Prob(d = t)

What exactly is it you're postulating here?

> = \sum_{t,r} Prob(d is "s"| d = t) Prob(d = t | d is "r") Prob(d is "r")
> v = MNv
>
> where v is the column vector v = (Prob(d is "spam"), Prob(d is "notspam"))^T

What's T here? The set of features T_k in the document T = \phi(t)?
On Mar 29 2004, Christian Siefkes wrote:
> > Let M be the 2x|D| matrix, where the row label represents
> > {spam,notspam} and the column label represents an enumeration of the
> > documents t \in D.
> >
> > M_st =(def) Prob(d is "s" | d = t) =(def) P(T is "s" | T = \phi(t)),
> >
> > Let N be the |D|x2 matrix, where the row label is a document and the
> > column label is in {spam,notspam}:
> >
> > N_tr =(def) Prob(d = t | d is "r") = \prod_k P(T_k = \phi(t)_k | T is "r")
>
> This looks more or less as if you're applying the transformation function
> \phi to each feature t_k in the document t? But you can only apply this
> function to the whole document t, otherwise you'll get wrong results...

Sorry, this was meant to be as succinct as possible, but it is what you expect. If \phi(t) = U, then \phi(t)_k = U_k, where U = (U_1,...,U_|D_5|). Each element of the matrix N_{tr} has a fixed full document t and a fixed label r. I could have written this as

N_{tr} = \prod_k P( G_k = T_k | G is "r"), where T = \phi(t) = (T_1,...,T_n).

> > The last product is because on D_5, the terms T_k are independent,
> > conditionally on the spam/notspam class. I'm only defining these
> > matrices so I don't have to write long formulas.
> >
> > If the measure Prob on D exists and is a probability, then we must
> > certainly have
> >
> > Prob(d is "s") = \sum_t Prob(d is "s"| d = t) Prob(d = t)
>
> What exactly is it you're postulating here?

I am saying: if a probability measure Prob exists on D, then such a measure follows all the usual rules of probability theory. The above is sometimes called the law of total probability, and has the form

Prob(A) = \sum_i Prob(A|B_i) Prob(B_i).

This equation is true for all probabilities on D, whenever the B_i form an exhaustive partition of D. This is the case with the choice B_t = {d = t}.
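A tiny numeric sketch of the law of total probability, on an invented joint distribution over two classes and two documents (the numbers are arbitrary):

```python
# Sanity check of the law of total probability on a tiny finite space.
# The joint distribution over (class, document) is invented for illustration.
joint = {("spam", "t1"): 0.2, ("spam", "t2"): 0.3,
         ("notspam", "t1"): 0.1, ("notspam", "t2"): 0.4}

docs = ("t1", "t2")
# Marginal Prob(d = t) and conditional Prob(spam | d = t).
p_doc = {t: sum(p for (c, u), p in joint.items() if u == t) for t in docs}
p_spam_given = {t: joint[("spam", t)] / p_doc[t] for t in docs}

# Prob(spam) = \sum_t Prob(spam | d = t) Prob(d = t), with B_t = {d = t}.
total = sum(p_spam_given[t] * p_doc[t] for t in docs)
print(total)  # recovers the marginal Prob(spam) = 0.2 + 0.3
```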
> > = \sum_{t,r} Prob(d is "s"| d = t) Prob(d = t | d is "r") Prob(d is "r")
> > v = MNv
> >
> > where v is the column vector v = (Prob(d is "spam"), Prob(d is "notspam"))^T
>
> What's T here? The set of features T_k in the document T = \phi(t)?

Oops, that T is for "transpose", because v is a column vector and I can't write columns in an email. Sorry, you can probably ignore that. I'm also using the latex convention that _ indicates a subscript and ^ indicates a superscript. Writing mathematical notation in emails is terribly inefficient.

-- Laird Breyer.