Antonio Gulli's coding playground: Spam detection

Friday, March 19, 2010

Spam detection

Suppose you need to classify a set of web pages based solely on their textual content. What approach would you adopt?

Posted by codingplayground at 11:36 PM


1 comment:

Greg Linden March 22, 2010 at 8:28 AM
The approach that everyone always takes? Think up a bunch of features and train a classifier?

That's just about every paper on this topic these days. All the papers look like this: Here's a couple hundred features we tried, here are the dozen that mattered, here are the types of classifiers we tried but it didn't make much difference which kind we used.

I think the more interesting work on this topic is when we include other data such as real-time user behavior data in the classification. That's when things start to get exciting.
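As a rough illustration of the "think up features and train a classifier" approach mentioned in the comment above, here is a minimal sketch that uses bag-of-words counts as features and a multinomial Naive Bayes classifier with add-one smoothing. The choice of Naive Bayes, the class name `NaiveBayesTextClassifier`, the tokenizer, and the toy training pages are all assumptions made for illustration; neither the post nor the comment names a specific method or dataset.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Hypothetical multinomial Naive Bayes over bag-of-words features."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()                       # all tokens seen in training

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior of the class.
            score = math.log(self.doc_counts[label] / total_docs)
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training pages, invented purely for this example.
train_docs = [
    "cheap pills buy now limited offer",
    "win money now click here free prize",
    "lecture notes on clustering algorithms",
    "k-means implementation in C++ with examples",
]
train_labels = ["spam", "spam", "ham", "ham"]

clf = NaiveBayesTextClassifier().fit(train_docs, train_labels)
print(clf.predict("free prize click now"))         # prints "spam"
print(clf.predict("notes on k-means clustering"))  # prints "ham"
```

In practice the text features would likely be only one part of the feature set, and other signals, such as the user behavior data the comment mentions, could be added as further inputs to the same kind of classifier.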

