3 April 2009

How to stop spam comments on blogs and forums

Blog spam first became a problem for me when one of my articles got picked up by Slashdot last year. For a few heady days I was receiving more than 10,000 daily visitors and won a large number of back-links to this site. All good stuff, but the downside was that I came under attack from a small army of spammers who proceeded to flood my comments queue with link-encrusted nonsense.

Spam is happening on a truly grand scale. According to the Akismet anti-spam service, at least 80% of all comments posted to blogs are spam. This represents a huge amount of content, most of it being generated by automated “spam bots”.

Why do they bother? Blame search engines, particularly Google. By making links the basic currency of search engine value, search engines have encouraged an explosion of blog and forum spam by the less scrupulous “black hat” search engine optimisers. Inserting a link into a comment or post can earn a spammer valuable search engine page rank for their site – if they do this on a grand scale then they can push their sites up the search results.

Taking a look around “black hat” SEO forums can be instructive. There are a lot of people out there running server farms that do little more than try to insert hyperlinks into blogs and forums. The tenacity and imagination that these guys use in pursuit of spam is pretty staggering – if they applied this to more legitimate jobs then they’d make a fortune.

How can you block spam? This is something of an arms race – as people raise walls against spammers so more techniques are developed to try and get around the restrictions. That said, there are a number of techniques that can help. They broadly fall into two categories: catching spam before it hits and discouraging spammers from your site.

Catching and blocking spam

Validation – testing for humans

Validation questions are often used to ensure that content is being submitted by a real person rather than an automated spam tool. The most common technique is the Captcha, where a visually obscured combination of letters is displayed and the user has to type them in, the idea being that an automated tool cannot reliably read them.

This technique does have a number of usability drawbacks. Regular users can find Captchas a nuisance while older Captcha techniques cannot be used by assistive technologies such as screen readers, although more modern Captcha implementations do allow for audio equivalents.

Disallowing multiple submissions

Many spam attacks involve “flooding” a site with multiple posts. The obvious solution here is to block consecutive posts from the same IP address. This is only a partial defence: although it is worth treating multiple posts from a single IP address as suspicious, spammers generally hide or “fake” their IP address by using a distributed proxy service, which can appear to give them a different IP address for each posting. It is also possible, though rare, for two different legitimate posters on the same ISP to post from the same IP address.
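As a rough illustration of treating repeat posts from one address as suspicious, here is a minimal Python sketch. The class name, the limit of three posts and the 60-second window are all assumptions made up for this example rather than recommended values, and it keeps its state in memory, so it would only suit a single-process site.

```python
import time
from collections import defaultdict, deque


class PostThrottle:
    """Flag IP addresses that submit too many comments in a short window."""

    def __init__(self, max_posts=3, window_seconds=60.0):
        # Both limits are illustrative assumptions; tune them for your site.
        self.max_posts = max_posts
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # ip -> timestamps of recent posts

    def is_suspicious(self, ip, now=None):
        """Record a post from `ip` and report whether it exceeds the limit."""
        now = time.time() if now is None else now
        stamps = self._recent[ip]
        # Forget posts that have fallen outside the time window.
        while stamps and now - stamps[0] > self.window_seconds:
            stamps.popleft()
        stamps.append(now)
        return len(stamps) > self.max_posts
```

Given the caveats above about proxies and shared addresses, a result of True is probably better used to send a comment to a moderation queue than to reject it outright.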

Response tokens

Many automated spammers work by sending an HTTP POST request straight to your website after working out what fields are expected by your comments form. Including a hidden session token or encoded value in a comments form can help to guard against this kind of automated attack.

A token can include information such as the poster’s IP address and the time at which the form was requested by the user. Any posts that do not come with an appropriate token or are returned a little too quickly can be regarded as suspicious.
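One way such a token might be put together is sketched below in Python, using an HMAC signature over the poster's IP address and the time the form was issued. The secret key, the five-second minimum and the one-hour maximum are assumptions chosen for illustration, as are the function names.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"replace-with-a-long-random-secret"  # assumption: kept server-side only
MIN_FILL_SECONDS = 5      # assumption: faster than this looks automated
MAX_AGE_SECONDS = 3600    # assumption: forms older than an hour are stale


def _sign(ip, issued_at):
    """Return a hex signature binding the token to an IP and issue time."""
    message = f"{ip}|{issued_at}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def issue_token(ip, now=None):
    """Build the value to place in a hidden field when the form is served."""
    issued_at = int(time.time() if now is None else now)
    return f"{issued_at}:{_sign(ip, issued_at)}"


def check_token(token, ip, now=None):
    """Return True only for an untampered token returned at a plausible speed."""
    now = time.time() if now is None else now
    try:
        issued_text, signature = token.split(":", 1)
        issued_at = int(issued_text)
    except (AttributeError, ValueError):
        return False  # missing or malformed token
    expected = _sign(ip, issued_at)
    # Constant-time comparison; bytes avoid errors on non-ASCII input.
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return False
    age = now - issued_at
    return MIN_FILL_SECONDS <= age <= MAX_AGE_SECONDS
```

The form handler would call issue_token when rendering the comments form and check_token when the post comes back. A failed check is a reason for suspicion rather than proof of spam, since a legitimate poster's IP address can change between the two requests.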

Distributed approaches

Spammers have learnt to cover their tracks by varying their IP addresses and tweaking the nature of postings so that individual spam filters find it very difficult to detect spammers. However, across thousands of sites it is easier to establish a pattern and detect automated spam that will be hitting a large number of sites at the same time.

Akismet is one of the more popular services that offer distributed spam protection, mainly because it is available as a plug-in for WordPress and so has built up a very wide user base. However, it does have some commercial use restrictions that may limit its long term adoption in the corporate sphere. LinkSleeve is an alternative that does not include licenses or API keys.

Discouraging spam

You can help to put spammers off by not making it worth their while. If they aren’t going to get any backlinks from your site they may be less likely to bother with you.

Using “nofollow”

Adding the rel="nofollow" attribute to a hyperlink tells search engines not to give the link any ranking credit, so the link will not deliver any page rank value. This is a standard that was adopted by Google in 2005 and other major search engines have followed this lead. Most blog software now adds this attribute to links in posted comments by default, the rationale being that spammers will be less motivated to post spam if the links in a comment do not deliver any page rank.

However, it’s important to bear in mind that spammers are playing a numbers game. They continue to spam everybody in the hope of hitting a site that has not implemented “nofollow” and regard the chance of picking up a small amount of traffic through normal click-throughs as worth the effort.
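If your comment system accepts plain text and builds the links itself, adding the attribute is straightforward. The Python sketch below is one possible approach, not how any particular blog package does it: the function name and the URL pattern are assumptions, and the pattern is deliberately simple (it will, for instance, swallow a full stop that directly follows a URL).

```python
import html
import re

# Assumption: a deliberately simple pattern for http and https URLs.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def render_comment(text):
    """Escape a plain-text comment and turn its URLs into nofollow links."""
    parts = []
    last = 0
    for match in URL_PATTERN.finditer(text):
        # Escape the ordinary text before this URL so no markup gets through.
        parts.append(html.escape(text[last:match.start()]))
        url = html.escape(match.group(0), quote=True)
        parts.append(f'<a href="{url}" rel="nofollow">{url}</a>')
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

Because every link is generated on the server, a poster has no way to supply a link without the attribute.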

Keyword filtering

Blocking well-known keywords that spammers use can be effective, but a lot of spammers have grown wise to this and steer clear of the more obvious terms. Some spammers may be discouraged by the fact that they cannot get the most relevant keyword terms into their links, but many are happy enough with an irrelevant link and will continue to spam away on phrases that beat keyword filters.
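A basic keyword check might look something like the Python sketch below. The list of blocked terms is a made-up example and the function name is an assumption; a real list would need regular maintenance for exactly the reasons given above.

```python
import re

# Assumption: an illustrative list only, not a recommended blocklist.
BLOCKED_TERMS = {"viagra", "casino", "payday loan"}


def find_blocked_terms(comment, blocked_terms=BLOCKED_TERMS):
    """Return the set of blocked terms that appear as whole words in a comment."""
    # Lower-case and collapse whitespace so "Payday   LOAN" still matches.
    normalised = re.sub(r"\s+", " ", comment.lower())
    hits = set()
    for term in blocked_terms:
        if re.search(r"\b" + re.escape(term) + r"\b", normalised):
            hits.add(term)
    return hits
```

Matching on whole words avoids flagging innocent words that merely contain a blocked term, but it also means that simple misspellings or inserted punctuation will slip through, which is how spammers beat this kind of filter.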

Don’t let people embed links into their posts

This is a bit of a zero option. Again, this won’t discourage automated spammers who will probably attack you anyway “just in case”, but it will help to put off the more hands-on SEO spammers who review the sites that they attack for backlink potential. Given that so many spammers use automated HTTP POST requests that bypass the comments form altogether, this technique is only effective if links are removed at the server side rather than through client-side JavaScript.
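A server-side link stripper could be sketched in Python along these lines. The function name, the two patterns and the "[link removed]" placeholder are assumptions for illustration, and regular expressions are not a full HTML sanitiser, so treat this as a starting point rather than a complete defence.

```python
import re

# Assumption: simple patterns that cover anchor tags and bare web addresses.
ANCHOR_TAG = re.compile(r"</?a\b[^>]*>", re.IGNORECASE)
BARE_URL = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def strip_links(comment):
    """Remove anchor tags (keeping their text) and replace bare URLs."""
    # Dropping the tags also drops the href, leaving only the link text.
    without_anchors = ANCHOR_TAG.sub("", comment)
    return BARE_URL.sub("[link removed]", without_anchors)
```

For example, strip_links('Nice post <a href="http://spam.example">cheap pills</a>!') leaves just the text 'Nice post cheap pills!', so the comment survives but carries no link.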

The downside to this technique is that legitimate posters will not be able to post links to genuinely related information. In a very real sense this acts against the spirit of online communities and blogs by reducing their potential as a means of sharing information.

Many bloggers have reported that their comment rate goes down when they block embedded links on their blogs – this does make me wonder what motivates people to leave comments in the first place: are they genuinely trying to comment on an article or just notch up another back-link to their own blog?

Filed under UI Development.