I am working on a project, and at the moment I have just finished the majority of the Bayes Rating/Classifier.
I have two sets of methods I am most concerned about. Method set 1, will select a certain number of extraneous elements (which have probabilities furthest from 0.5), and only take those into account for classification.
public double GetBayesRating(int numberOfExtremes)
{
return GetBayesRating(numberOfExtremes, 0.5);
}
public double GetBayesRating(int numberOfExtremes, double messageSpamProbability)
{
// Formula based on https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering#Other_expression_of_the_formula_for_combining_individual_probabilities
// We used the natural-log method for the reasons that section describes.
List<SpamKeyword> topKeywords;
if (messageSpamProbability <= 0 || messageSpamProbability >= 1)
throw new ArgumentOutOfRangeException("messageSpamProbability", messageSpamProbability, "The parameter messageSpamProbability must be between the values of 0 and 1.");
if (numberOfExtremes > 0)
topKeywords = keywordsMatched.OrderByDescending(x => Math.Abs(x.SpamProbability)).Take(numberOfExtremes).ToList();
else
topKeywords = keywordsMatched;
double n = 0;
for (int i = 0; i < topKeywords.Count; i++)
n += (Math.Log(1 - topKeywords[i].GetSpamProbability(messageSpamProbability)) - Math.Log(topKeywords[i].GetSpamProbability(messageSpamProbability)));
return 1 / (1 + Math.Pow(Math.E, n));
}
The second set, can allow you to drill down into how many extremes you want above/below the median (0.5). That is, if you want the six highest and four lowest, then it will only take those (or how many it can find until it hits 0.5) into account.
public double GetBayesRating(int numberAboveMiddle, int numberBelowMiddle)
{
return GetBayesRating(numberAboveMiddle, numberBelowMiddle, 0.5);
}
public double GetBayesRating(int numberAboveMiddle, int numberBelowMiddle, double messageSpamProbability)
{
// Formula based on https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering#Other_expression_of_the_formula_for_combining_individual_probabilities
if (numberAboveMiddle <= 0)
throw new ArgumentOutOfRangeException("numberAboveMiddle", numberAboveMiddle, "The parameter numberAboveMiddle must be greater than 0.");
if (numberBelowMiddle <= 0)
throw new ArgumentOutOfRangeException("numberAboveMiddle", numberBelowMiddle, "The parameter numberBelowMiddle must be greater than 0.");
if (messageSpamProbability <= 0 || messageSpamProbability >= 1)
throw new ArgumentOutOfRangeException("messageSpamProbability", messageSpamProbability, "The parameter messageSpamProbability must be between the values of 0 and 1.");
List<SpamKeyword> topKeywordsAbove = keywordsMatched.OrderByDescending(x => x.SpamProbability).Take(numberAboveMiddle).ToList();
List<SpamKeyword> topKeywordsBelow = keywordsMatched.OrderBy(x => x.SpamProbability).Take(numberBelowMiddle).ToList();
List<SpamKeyword> topKeywords = new List<SpamKeyword>();
for (int i = 0; i < topKeywordsAbove.Count; i++)
if (topKeywordsAbove[i].SpamProbability >= 0.5)
topKeywords.Add(topKeywordsAbove[i]);
for (int i = 0; i < topKeywordsBelow.Count; i++)
if (topKeywordsBelow[i].SpamProbability <= 0.5 && !topKeywords.Contains(topKeywordsBelow[i]))
topKeywords.Add(topKeywordsBelow[i]);
double n = 0;
for (int i = 0; i < topKeywords.Count; i++)
n += (Math.Log(1 - topKeywords[i].GetSpamProbability(messageSpamProbability)) - Math.Log(topKeywords[i].GetSpamProbability(messageSpamProbability)));
return 1 / (1 + Math.Pow(Math.E, n));
}
The code should be self-explanatory. I'm most curious about the last method, as I am wondering if there is a shorter, more concise, more readable way to make it do what it does.
The code all works, and I think it's calculating the correct values based on my input criteria (which consists of 1000 strings to examine), but it seems a bit archaic to me.
Also, do note, I have full access to all of the C#6.0 and .NET 4.6 feature-sets. If there is anything in particular from either of those sets that would help me (I don't think there is, I've read a lot of the C#6.0 additions and did not notice anything that would assist me here, doesn't mean there isn't anything though), then I would greatly appreciate it being pointed out.
Also, another note, the GetBayesRating(int, double)
and GetBayesRating(int, int, double)
methods both have maintainability indexes of 57
and 47
respectively. I would potentially like to improve this. (Of course, I also have a method that has maintainability index of 22, but that is another story.)
Also, as requested, the relevant bits of the SpamKeyword
class:
public class SpamKeyword
{
private string term;
private double spamProbability;
private double weight;
public string Term { get { return term; } private set { if (!string.IsNullOrWhiteSpace(term)) { term = value; } else { throw new ArgumentException("An invalid value was passed for Term: the value was either null, whitespace, or an empty string."); } } }
public double SpamProbability { get { return spamProbability; } private set { spamProbability = value; } }
public double Weight { get { return weight; } private set { weight = value; } }
public SpamKeyword(string term, double spamProbability, double weight)
{
this.term = term;
this.spamProbability = spamProbability;
this.weight = weight;
}
public double GetSpamProbability(double messageSpamProbability)
{
return spamProbability * messageSpamProbability / (spamProbability * messageSpamProbability + (1 - spamProbability) * (1 - messageSpamProbability));
}
public bool IsEmpty()
{
return this.term == null && this.spamProbability == 0 && this.weight == 0;
}
}
2 Answers 2
Using LINQ can help you a lot here. Wherever you see a loop, there's a good chance you can replace it with LINQ to improve readability and safety. Let's look at the following loop:
double n = 0; for (int i = 0; i < topKeywords.Count; i++) n += (Math.Log(1 - topKeywords[i].GetSpamProbability(messageSpamProbability)) - Math.Log(topKeywords[i].GetSpamProbability(messageSpamProbability)));
First of all, topKeywords[i].GetSpamProbability(messageSpamProbability)
is used twice and would be more readable if it were stored in a variable:
double n = 0;
for (int i = 0; i < topKeywords.Count; i++)
{
var spamProbability = topKeywords[i].GetSpamProbability(messageSpamProbability);
n += (Math.Log(1 - spamProbability) - Math.Log(spamProbability));
}
Here is the version using LINQ:
double n = topKeywords
.Select(x => x.GetSpamProbability(messageSpamProbability))
.Select(x => Math.Log(1 - x) - Math.Log(x))
.Sum();
Now let's look at the following code in your final GetBayesRating
method:
List<SpamKeyword> topKeywordsAbove = keywordsMatched.OrderByDescending(x => x.SpamProbability).Take(numberAboveMiddle).ToList(); List<SpamKeyword> topKeywordsBelow = keywordsMatched.OrderBy(x => x.SpamProbability).Take(numberBelowMiddle).ToList(); List<SpamKeyword> topKeywords = new List<SpamKeyword>(); for (int i = 0; i < topKeywordsAbove.Count; i++) if (topKeywordsAbove[i].SpamProbability >= 0.5) topKeywords.Add(topKeywordsAbove[i]); for (int i = 0; i < topKeywordsBelow.Count; i++) if (topKeywordsBelow[i].SpamProbability <= 0.5 && !topKeywords.Contains(topKeywordsBelow[i])) topKeywords.Add(topKeywordsBelow[i]);
I was going to suggest replacing the for loops with LINQ queries using Where
, but then I realized that topKeywordsAbove
and topKeywordsBelow
are ordered, so you could just add a TakeWhile
call to the initial queries. Note that you'll have to change <= 5
to < 5
so there isn't any overlap. And now that you're not accessing the collections by index and you're only accessing them once, you don't need to convert them to Lists.
var topKeywordsAbove =
keywordsMatched
.OrderByDescending(x => x.SpamProbability)
.Take(numberAboveMiddle)
.TakeWhile(x => x.SpamProbability >= 0.5);
var topKeywordsBelow =
keywordsMatched
.OrderBy(x => x.SpamProbability)
.Take(numberBelowMiddle)
.TakeWhile(x => x.SpamProbability < 0.5);
var topKeywords = topKeywordsAbove.Concat(topKeywordsBelow);
Finally, you can extract the end of your two big methods into a method:
private double GetBayesRating(IEnumerable<SpamKeyword> topKeywords, double messageSpamProbability)
{
double n = topKeywords
.Select(x => x.GetSpamProbability(messageSpamProbability))
.Select(x => Math.Log(1 - x) - Math.Log(x))
.Sum();
return 1 / (1 + Math.Pow(Math.E, n));
}
-
\$\begingroup\$ I very much like all these suggestions - I just put them all in, and I did not see a lack of performance. (It took 3ms longer than the previous methods on the first 500 test strings, but that is statistically irrelevant.) The only bit I don't like is the user of the
var
keyword, I'm not overly fond of it. Though, here, it's hard to avoid it. That said, it'll do in this context just fine. \$\endgroup\$Der Kommissar– Der Kommissar2015年07月08日 16:53:28 +00:00Commented Jul 8, 2015 at 16:53 -
\$\begingroup\$ @EBrown: I'm glad you approve! Regarding the
var
keyword, go with whatever you find more readable. I used it because IEnumerable<SpamKeyword> is verbose, but I could see how usingvar
in this instance could make keeping track of types difficult. \$\endgroup\$Risky Martin– Risky Martin2015年07月08日 17:24:20 +00:00Commented Jul 8, 2015 at 17:24 -
\$\begingroup\$ I tend to try to avoid it where possible, here it makes sense due to the nature of the code, but I never use it in situations where I have multiple
IEnumerable
objects that have different types specified, in situations where I am making changes to it that could cast to unexpected types (thinks it's anIEnumerable
when I want it to be aList
), or for basic types likedouble
,int
, etc. That said, in this situation I am not against it because it doesn't meet any of those criteria. And I greatly appreciate your response - you just taught me a little more LINQ. :) \$\endgroup\$Der Kommissar– Der Kommissar2015年07月08日 17:27:58 +00:00Commented Jul 8, 2015 at 17:27
Just a quick nitpick, but it could help if you need to debug
if (numberAboveMiddle <= 0) throw new ArgumentOutOfRangeException("numberAboveMiddle", numberAboveMiddle, "The parameter numberAboveMiddle must be greater than 0."); if (numberBelowMiddle <= 0) throw new ArgumentOutOfRangeException("numberAboveMiddle", numberBelowMiddle, "The parameter numberBelowMiddle must be greater than 0.");
the parameter name in the second ArgumentOutOfRangeException
should be "numberBelowMiddle
".
Explore related questions
See similar questions with these tags.