Writing An Effective Spam Filter
Written By corenominal on May. 6, 2008.
2 Comments
This is my first Chawlk note. I only registered as a new Chawlk user this morning and, to be honest, I was not sure that I would be overly interested in the site/service. However, there seems to be a good mix of users and content on the site, and it occurred to me that the Chawlk notes service might be a good place to post "Dear Lazy Web" type posts. So here goes...
Dear Lazy Web
I am currently writing a new spam filter for the user comments system on the website [crunchbang.org]. I am following the same principles as described by Mr Snook. Do you have any experience of writing something similar and, if so, do you have any tips for effectively separating the spam from the ham?
Best Regards,
Philip

Ozone42
Written May. 6, 2008
Sisyphus comes to mind
The point system you linked to isn't bad, but I wonder about the -10 points for "Interesting." Perhaps if that was the majority of the comment made.
I think the most important thing to keep in mind is not to make it hard on the real people commenting. If you catch 90% of the spam and it's still easy to leave a comment that doesn't get flagged, I think you've succeeded. If you catch 100% of the spam over a week, but have 2 false positives, I think that's a failure. Then again, it really depends on the level of traffic and spam we're talking about here.
corenominal
Written May. 6, 2008
@Ozone42: A quote from the Wikipedia article you linked to:
I agree that this could become an unending activity, but I am not sure about the "pointless and unrewarding" part. Personally, I find this type of task thoroughly interesting and rewarding. I have really enjoyed coding up my spamsnake and I am looking forward to continued tinkering :)
Also, I wonder myself about some parts of the points system [discussed in the linked article.] Having read the entire post and the comments, I am pretty sure that some of the flags/rules used are compound, so while they may appear odd on their own, they probably work well in conjunction with other rules.
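For illustration, a points-based filter of the kind discussed in this thread could be sketched roughly as follows. Everything specific here is an assumption made up for the example: the function names, the phrase list, the point values and the threshold are hypothetical, and they are not the rules from the linked article or from the spamsnake code. The sketch shows the stock-phrase penalty applied only when the phrase makes up most of the comment, plus one compound rule that fires only when two conditions hold together.

```python
import re

# Hypothetical list of stock phrases; not taken from the linked article.
STOCK_PHRASES = ("interesting", "nice site", "great post")


def score_comment(body: str) -> int:
    """Return a point total for a comment; lower means more spam-like."""
    text = body.strip().lower()
    links = len(re.findall(r"https?://", text))
    # Count words with URLs removed, so a link does not pad the word count.
    words = re.findall(r"[a-z']+", re.sub(r"https?://\S+", " ", text))
    score = 0

    # Simple rule: no links is a mild plus, many links is a minus.
    if links == 0:
        score += 1
    elif links > 2:
        score -= links

    # Stock-phrase rule, applied only when the phrase is at least
    # half of the comment, so a long comment that merely contains
    # the word "interesting" is not penalised.
    for phrase in STOCK_PHRASES:
        if phrase in text and len(phrase.split()) * 2 >= len(words):
            score -= 10
            break

    # Compound rule: a very short comment is fine on its own and a link
    # is fine on its own, but the two together look suspicious.
    if len(words) < 5 and links > 0:
        score -= 5

    return score


def is_spam(body: str, threshold: int = -5) -> bool:
    """Flag a comment when its score is at or below the (assumed) threshold."""
    return score_comment(body) <= threshold
```

With these made-up values, a bare "Interesting." is flagged, while a full sentence that happens to contain the word is not. Keeping the threshold well below zero is one way to favour real commenters over catching every last piece of spam.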