Friday, March 19, 2010
Spam detection
Suppose you need to classify a set of web pages based solely on their textual content. What approach would you adopt?
Random commentary about Machine Learning, BigData, Spark, Deep Learning, C++, STL, Boost, Perl, Python, Algorithms, Problem Solving and Web Search
1 comment:
The approach that everyone always does? Think up a bunch of features and train a classifier?
That's just about every paper on this topic these days. All the papers look like this: here are a couple hundred features we tried, here are the dozen that mattered, and here are the types of classifiers we tried, but it didn't make much difference which kind we used.
I think the more interesting work on this topic is when we include other data such as real-time user behavior data in the classification. That's when things start to get exciting.
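For readers who have not seen the "think up features and train a classifier" baseline mentioned in the comment, here is a minimal sketch of it in Python. It is only an illustration under stated assumptions: the features are plain word counts, the classifier is a multinomial Naive Bayes with add-one smoothing, and the tiny training set, the `spam`/`ham` labels, and the names `tokenize` and `NaiveBayesTextClassifier` are all made up for this example. Real systems in the literature use far richer features and much more data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (bag-of-words features)."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesTextClassifier:
    """Hypothetical toy multinomial Naive Bayes over word counts."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # additive (Laplace) smoothing
        self.class_doc_counts = Counter()        # documents seen per label
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_doc_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        total_docs = sum(self.class_doc_counts.values())
        # Words never seen in training carry no evidence, so skip them.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_doc_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            denom = (sum(self.word_counts[label].values())
                     + self.alpha * len(self.vocab))
            for tok in tokens:
                score += math.log(
                    (self.word_counts[label][tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up page snippets, purely for illustration.
train_texts = [
    "cheap pills buy now limited offer",
    "win money now click here free prize",
    "free casino bonus click now",
    "lecture notes on graph algorithms and data structures",
    "python tutorial for parsing text files",
    "conference schedule for the machine learning workshop",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

clf = NaiveBayesTextClassifier().fit(train_texts, train_labels)
print(clf.predict("click here for a free prize"))
print(clf.predict("notes on python algorithms"))
```

Swapping in a different classifier here would mostly mean replacing `predict`, which fits the comment's observation that the choice of classifier tends to matter less than the choice of features.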