Fighting spam in Wikka


As it may have dawned on you by now, spam is becoming a problem in wikis - both the kind that also plagues many blogs in the form of comment spam (except that in a wiki it can also affect page content), and referrer spam. And then there are spambots intent on gathering email addresses.

Wikka sites are no exception any more (and other WakkaWiki forks seem to be having problems, too).

This page is intended to gather ideas for how to fight spam (of all types) in Wikka, so we can coordinate our efforts and get a spammer-hardened Wikka out there. You can also find some general information about (fighting) wiki spam and what Wikka has already implemented as defense measures.


Spam in Wikka pages

How to discourage spammers from posting links on your pages in the first place, and what to do when your pages have already been spammed.

Blocking Agents

Bad Behavior is a set of PHP scripts which prevents spambots from accessing your site by analyzing their actual HTTP requests and comparing them to profiles from known spambots. (quote from the homepage)

Two Suggestions


Content Filter
Wacko wiki has implemented a content filter based on a word/phrase list. I'm not sure how sophisticated it is (it's not a Bayesian filter), but it uses a list updated from chongqed.org. Read more about it here. I thought this might contribute to our conversations about spamfighting. --GmBowen

Preliminary list of links to (apparent) content blocking systems in wikis (more as I find them):
--JavaWoman

Bayesian Filter: Focus on the content
Many of these suggestions will stop a certain amount of spam, but spammers can easily get around measures such as adding random tokens (modern spam bots can already scan a page for form elements and submit all of them). Therefore, I suggest analyzing the content itself for what might constitute spam (text frequency, link frequency, blacklist, Bayesian filter) and assigning a score to the post. If the post has, say, over a 50% chance of being spam, then email validation, post approval, or a captcha could be used to further validate the user.

I'm particularly supportive of the Bayesian filter. Many spam-fighting programs today use one (e.g. Thunderbird). The Bayesian algorithm is adaptive and learns over time, and works best when used in conjunction with other standard filters. The process might look like this:
  1. The standard filters (e.g. the blacklist) catch a suspicious post. The post is marked for approval.
  1. The admins review the post in the moderation panel. If the post is "ham", the Bayesian filter automatically adapts to let future posts that resemble the approved post through. If the post is "spam", it adapts to block future posts containing those keywords.

A Bayesian filter therefore cannot be implemented on its own; it requires admin intervention (to help the filter learn) and other standard filters alongside it.

Bayesian filters have been extremely successful in eliminating over 98% of common spam after a few weeks of adaptation.
--MikeXstudios
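
To make the scoring idea above a little more concrete, here is a minimal sketch of a naive-Bayes score in PHP. It is only an illustration: the function name, the token-count arrays and the 0.5 threshold are assumptions, not existing Wikka code, and the counts would have to be maintained by the admin ham/spam decisions described above.

<?php
// Minimal naive-Bayes sketch (illustrative only, not part of Wikka).
// $spamCounts / $hamCounts map tokens to how often they appeared in posts
// the admins marked as spam / ham; $nSpam / $nHam are the post totals.
function spam_probability($text, $spamCounts, $hamCounts, $nSpam, $nHam)
{
    // crude tokenizer: lower-case words of three or more letters
    preg_match_all('/[a-z]{3,}/', strtolower($text), $m);
    $logOdds = log(max($nSpam, 1)) - log(max($nHam, 1)); // prior odds
    foreach (array_unique($m[0]) as $token) {
        $s = isset($spamCounts[$token]) ? $spamCounts[$token] : 0;
        $h = isset($hamCounts[$token]) ? $hamCounts[$token] : 0;
        // Laplace smoothing so unseen tokens never produce log(0)
        $logOdds += log(($s + 1) / ($nSpam + 2)) - log(($h + 1) / ($nHam + 2));
    }
    return 1 / (1 + exp(-$logOdds)); // back to a probability between 0 and 1
}

// e.g. hold the post for moderation when the score exceeds 0.5:
// if (spam_probability($body, $spamCounts, $hamCounts, $nSpam, $nHam) > 0.5) { ... }
?>

A post scoring above the threshold would then be routed to email validation, moderation or a captcha rather than being rejected outright, as suggested above.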

Adding Random Tokens for Form Submissions?
Ticket:154
Based on this post, I wonder whether adding randomised session tokens to form submissions might provide one more hurdle for spambots. It's very simple to implement:

wikka.php:
function FormOpen($method = "", $tag = "", $formMethod = "post")
{
    // generate one random token per session and remember it
    if (!isset($_SESSION['token'])) {
        $token = md5(uniqid(rand(), true));
        $_SESSION['token'] = $token;
    }
    $result = "<form action=\"".$this->Href($method, $tag)."\" method=\"".$formMethod."\"><p>\n";
    // embed the token in every form so the handlers can verify it on submission
    $result .= "<input type=\"hidden\" name=\"token\" value=\"".$_SESSION['token']."\" />";
    if (!$this->config["rewrite_mode"]) $result .= "<input type=\"hidden\" name=\"wakka\" value=\"".$this->MiniHref($method, $tag)."\" />\n";
    return $result;
}


and then just wrap edit.php and addcomment.php sections using:
if ($_POST['token'] == $_SESSION['token']) { // form spoof protection
    // ...the existing save/comment code goes here...
}


I'm definitely no expert on security, and I can see how it could be bypassed, but it does require one more step and adds complexity for spambots trying to spoof the wiki forms, at no cost to us... --IanAndolina

One issue with the Google redirection and the newer rel="nofollow" is that good sites get hit by this procedure too. As we can't really tag links on a "trusted user" basis, we have to do it on a "trusted server" basis instead. I use a whitelist in config.php with a list of "good servers":

"serverre" => "/(nontroppo.org|goodsite.com|etc)/",


And my Link routine in the main wakka.php (wikka.php) is modified to make use of it:

$follow = ""; // only set for untrusted links
if (preg_match($this->GetConfigValue("serverre"), $tag))
{
    $url = $tag; // trusted web site, so no need for a redirect
    $urlclass = "ext";
}
else
{
    $tag = rawurlencode($tag);
    $url = "http://www.google.com/url?q=".$tag;
    $urlclass = "ext";
    $follow = " rel=\"nofollow\" ";
}
return $url ? "<a ".$follow." class=\"".$urlclass."\" href=\"".$url."\">$text</a>" : $text;


This way, trusted sites get full and unadulterated links, but anything else has BOTH google redirection and rel="nofollow" added. The CSS can then contain ways to visually tag those different URLs, so the user can see if a link is trusted or not (I use advanced generated content - not supported in IE):

a.ext:after, a[rel="nofollow"]:after {content:"\00220A";
    text-decoration: none !important;
    font-size: 0.9em;
    color: #888;
    position: relative;
    bottom: 1ex;}

a[rel="nofollow"]:after {content:"\002209";}
-- IanAndolina

Spam Block for Saving pages
As I was getting a lot of repeat spam for the same domains over and over, I implemented a "link blacklist" on my Wiki for comments and edits:

add to edit.php & addcomment.php:
preg_match_all($this->GetConfigValue("spamre"),$body,$out); //keyword spam block
if (count($out[0])>=1)
{
    $this->SetMessage("Go spam somewhere else.  You links will never get spidered here anyway.");
    $this->redirect($this->href());
    return;
}


config.php
"spamre" => "/(voip99|zhiliaotuofa|mycv|princeofprussia|imobissimo|valeofglamorganconservatives|68l|8cx|online-deals99).(net|cn|com|org)|(phentermine)/m",


Now, what I wanted to do was have an admin-only wiki page where the contents of the spamre regexp could be edited, instead of being hardwired in config.php - but I never got round to it. That would be the better way to do it: have a function that finds a wiki page and builds a regexp from the keywords added to that page by admins (not all of whom may have access to config.php); a sketch of such a function follows below. It is a fairly basic method - but with a couple of vigilant admins it can reduce repeat attacks from spam bots considerably. -- IanAndolina
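
A minimal sketch of what such a function might look like, assuming a page called "SpamKeywords" with one keyword or domain per line; the page name and the function itself are hypothetical, only LoadPage() is an existing Wikka method:

<?php
// Hedged sketch: build a spam regexp from an admin-maintained wiki page.
function BuildSpamRegex($wiki, $pagename = "SpamKeywords")
{
    $page = $wiki->LoadPage($pagename);
    if (!$page) return null;
    $keywords = array();
    foreach (preg_split('/\r?\n/', $page["body"]) as $line) {
        $line = trim($line);
        if ($line === "" || $line[0] == "#") continue; // skip blanks and comments
        $keywords[] = preg_quote($line, "/");          // escape regexp metacharacters
    }
    return $keywords ? "/(".implode("|", $keywords).")/im" : null;
}

// in edit.php / addcomment.php, roughly:
// $spamre = BuildSpamRegex($this);
// if ($spamre && preg_match($spamre, $body)) { /* reject the save as above */ }
?>

Access to the "SpamKeywords" page itself would of course have to be restricted to admins via its ACL.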

User Validation
I like the ASCII-based user validation scheme (captcha) here:

http://www.moztips.com/wiki/index.pcgi?action=edit&page=SandBox

I don't know how to do that in PHP (it is a PHP-based wiki, I believe) - though the more complex image-based solutions are available. For me this is far preferable to locking pages against writing using ACLs - which IMO destroys the very purpose of a wiki. --IanAndolina


[Copied from SuggestionBox] "There's also code around that uses GD & that could be built onto Nils' code that generates a "registration password" automatically and outputs it as a distorted graphic image.....the code is intended to befuddle auto spam registers & thus stop open-registration sites from being hit by spam bots that register themselves as users. Ultimately, as the bots become more sophisticated I think we'll have to use something like that or else sites like this one (with open registration) will be victimized. Here and here are examples of what I mean (I like the simplicity of the first version in the second example). -- GmBowen" .... I think at least we need a system like one of these (or like the one Ian suggests) on the user registration page. mb
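
As a rough illustration of the GD approach mentioned above, here is a minimal sketch of a registration-captcha image script. It is a hedged example, not existing Wikka code: the session key, image size and character set are arbitrary choices, and it assumes the GD extension is available.

<?php
// captcha.php - hedged sketch of a GD-generated registration code image.
session_start();

// pick 5 characters, avoiding easily-confused ones like 0/O and 1/I
$code = substr(str_shuffle("ABCDEFGHJKLMNPQRSTUVWXYZ23456789"), 0, 5);
$_SESSION["captcha_code"] = $code;

$img   = imagecreatetruecolor(120, 40);
$bg    = imagecolorallocate($img, 255, 255, 255);
$fg    = imagecolorallocate($img, 40, 40, 40);
$noise = imagecolorallocate($img, 180, 180, 180);
imagefilledrectangle($img, 0, 0, 119, 39, $bg);

// a few random lines to make automated OCR a little harder
for ($i = 0; $i < 5; $i++) {
    imageline($img, rand(0, 119), rand(0, 39), rand(0, 119), rand(0, 39), $noise);
}
imagestring($img, 5, 20, 12, $code, $fg);

header("Content-Type: image/png");
imagepng($img);
imagedestroy($img);
?>

The registration form would embed this image and compare the value the user types against $_SESSION["captcha_code"] before creating the account.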


Spam repair and defense
See also DeleteSpamAction !
1/22/05 - Spam help! I have apparently been attacked by an army of spam bots. Has this happened to anyone else? For now, I am asking for your help with:

Whatever script they used (on multiple machines, no less) could certainly be used against any Wakka-like site with minimal modifications, so something has to be done...I will do what I can to help you guys combat future attacks as well as implement the new HTML attribute you've probably all heard about. --RichardBerg

-- lock down ACLs: restrict commenting and writing to registered users ("+") instead of everyone ("*")
UPDATE acls SET comment_acl="+" WHERE comment_acl="*";
UPDATE acls SET write_acl="+" WHERE write_acl="*";



Banning users
Just so it doesn't get "lost", I'm copying a few comments from another page here. --JW

So, what to do? Banning by IP is indeed fraught with the risk of banning innocent users, since with a lot of large ISPs the IP addresses are assigned round-robin and may differ even between subsequent requests (e.g. the requests for embedded images may each come from a different IP address, and from a different address than that of the page itself).

Possibly creating and storing a kind of "signature" consisting not of the IP address but of other request header elements - the user agent string but also the accept headers - perhaps combined with a whole IP block rather than a single address, might give us some sort of handle. But you'd need to actually store that and watch it for a while before you can tell how reliable (or not) it might be. --JavaWoman
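
A hedged sketch of how such a signature could be computed - the choice of headers and the /24 block granularity are assumptions, not an existing Wikka feature; the resulting hash would have to be stored and watched over time, as noted above:

<?php
// Illustrative only: combine a few request headers with the IP's /24 block.
function RequestSignature()
{
    $ua     = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : "";
    $accept = isset($_SERVER["HTTP_ACCEPT"]) ? $_SERVER["HTTP_ACCEPT"] : "";
    $lang   = isset($_SERVER["HTTP_ACCEPT_LANGUAGE"]) ? $_SERVER["HTTP_ACCEPT_LANGUAGE"] : "";
    // keep only the first three octets so a whole /24 block shares one signature
    $block  = implode(".", array_slice(explode(".", $_SERVER["REMOTE_ADDR"]), 0, 3));
    return md5($ua."|".$accept."|".$lang."|".$block);
}
?>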


Stopping Spammers getting Google Juice
There is a technique to stop spammers from gaining any advantage from spamming, which is to redirect external links so they don't contribute to the spammed sites' PageRank. Great for defeating the whole purpose of spamming, but it has the disadvantage that good sites lose their Google juice too. Check out the comments on that page for more cons. Since I enabled this on the Opera 7 wiki I've noticed that spam volume has slowly dropped off, but I'm not entirely happy with the price paid. Had you thought about this? Maybe have it as an option during config? -- IanAndolina



Referrer spam

Spammers sometimes visit wikis and blogs with a tool that sends "bogus" Referer headers containing the sites they want to generate incoming links for - this works on many wikis and blogs since such sites often have a page listing referrers (wikis) or list the referrers to a particular post (blogs). If a search engine indexes such a page, it finds a link to the spammed site, resulting in a higher "score" for that spammed page.
The general solution is to cause such links not to be followed by search engines. The technique outlined below under "Don't let old pages get indexed" already takes care of this for the referrer listings Wikka uses.

I am trying the following technique to detect which referrers are spam and which aren't:
1) Add a field named remote_addr to the referrers table.
2) Modify the LogReferrer() method so that it also stores $_SERVER['REMOTE_ADDR'] in remote_addr.
3) Change LoadReferrers() to return only records where the remote_addr field is blank.
4) Add this code to the header action:
<link rel="stylesheet" href="<?php echo $this->Href("wikka.css", "pseudodir"); ?>" type="text/css" />

5) Create a file named wikka.css.php and put it in ./handler/page. This file will update the referrers table and set remote_addr to blank if remote_addr is the same as $_SERVER['REMOTE_ADDR'] and the time is greater than "now() plus six minutes". The script will return a CSS file with no-cache, must-revalidate and expired headers and a random ETag, so that it is requested every time a page is requested. --DotMG

Explanation: the difference between a spambot and a real user is that a spambot just loads a page and doesn't analyse its content, so with a spambot the CSS files linked within the document won't be loaded.
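
A rough sketch of what that handler might look like. Only the referrers table and the remote_addr column come from the description above; the query details and the stylesheet path are assumptions, and the six-minute time window is left out for brevity:

<?php
// ./handler/page/wikka.css.php - hedged sketch, not an official handler.
// Serve the stylesheet with no-cache headers so every page view triggers a
// fresh request, and mark referrer records from this IP as verified by
// blanking remote_addr.
header("Content-Type: text/css");
header("Cache-Control: no-cache, must-revalidate");
header("Expires: Thu, 01 Jan 1970 00:00:00 GMT");
header("ETag: \"".md5(uniqid(rand(), true))."\"");

$ip = mysql_real_escape_string($_SERVER["REMOTE_ADDR"]);
$this->Query("UPDATE ".$this->config["table_prefix"]."referrers".
    " SET remote_addr = ''".
    " WHERE remote_addr = '".$ip."'");

// finally output the real stylesheet
readfile("css/wikka.css");
?>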


Email-gathering spambots

Spambots spider websites looking for email addresses to add to their lists (to use, or to sell as a targeted list). A general defense that works well (though not 100%) is to "obfuscate" email addresses so that such spambots don't recognize them.
I would like to use an offensive attack against spammers. My Wikka would generate some fake email addresses and present them as
<a href="mailto:unexistingemail@thesnake.us" class="email">unexistingemail@thesnake.us</a>
, and somewhere in the CSS files you would find
 .email {display: none;}
so that the fake email addresses won't be visible to human visitors of the site. The domain of the fake email address would be either the domain name of a spam/porn site (to tax their bandwidth), or a non-existent domain. --DotMG
Obfuscating addresses automatically
Wikka 1.1.6.0 comes with a small action to create an obfuscated email "contact" link for the site administrator. Meanwhile, the formatter will simply turn every email address it recognizes into an email link (with the address also used for the link text) - providing nice fodder for spambots.

What we should have is a function that can turn a given email address into an obfuscated link - this could then be used by both the {{contact}} action and the formatter. It would also enable us to change the obfuscation algorithm inside the function without affecting either the formatter or the contact action, and others could use it in their own extensions as well. --JavaWoman
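
A minimal sketch of such a shared function - the name is hypothetical and the entity-encoding approach is just one possible obfuscation algorithm (the point being that it could later be swapped out in a single place):

<?php
// Hedged sketch: turn an email address into an entity-encoded mailto link.
function ObfuscatedEmailLink($email, $text = "")
{
    $encoded = "";
    for ($i = 0; $i < strlen($email); $i++) {
        $encoded .= "&#".ord($email[$i]).";"; // encode every character as a numeric entity
    }
    if ($text === "") $text = $encoded;       // reuse the encoded address as link text
    return "<a href=\"mailto:".$encoded."\" class=\"email\">".$text."</a>";
}

// both the {{contact}} action and the formatter could then call:
// echo ObfuscatedEmailLink("admin@example.com");
?>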


Resolved Suggestions

Spam-defense measures that are already implemented in Wikka.
Don't let old pages get indexed
Extended method implemented as of Wikka 1.1.6.0 (Both the "noarchive" addition and applying it to the Sandbox)

To make absolutely sure old pages don't get archived (irrespective of your robots.txt) - essential to stop WikiSpam from still getting juice from archived pages - why not add meta directives to those pages with something like:
<?php if ($this->GetMethod() != 'show' || $this->page["latest"] == "N") echo "<meta name=\"robots\" content=\"noindex, nofollow, noarchive\" />\n<meta name=\"googlebot\" content=\"noarchive, noindex, nofollow\" />\n";?>

to header.php. This stops pages served with handlers other than show, and non-current page versions, from being archived or cached in any way.




Further references

Where to read more about Wiki spam.






CategoryWikka CategoryDevelopmentArchitecture
Comments
Comment by ns2.alstom.ch
2005-03-15 11:23:36
The 'Content Filter' idea should be easy to implement. Many other wiki engines have done this already, and it has proven to be reasonably effective at reducing spam.

It's true that it becomes more effective if you automatically keep your blacklist up to date with a listing like the one on chongqed.org. This can mean spammers get blocked from your wiki before they even visit it! But the automatic updating can be done with cron jobs. It doesn't necessarily need to be a built-in WikkaWiki feature (though that would be nice, I guess).

...but anyway a basic content blocking feature should be included ASAP, and then you could create a new page here called [[AntiSpamFeatures]] describing how to use it. See also http://wiki.chongqed.org//AntiSpamRecommendations -- Halz - 15th Mar 2005
Comment by DarTar
2005-05-26 09:59:27
FYI, I just removed a page of spam by a newly registered user (SomaCarisoprodol). Shall we make a list of spamming users and include them in the default ACL settings?
Comment by JavaWoman
2005-05-27 22:15:07
I noted that page as a new one in the RecentChanges RSS feed and thought the name looked suspiciously like some sort of pharmaceutical and thus most likely spam. When I got to look at it, it had disappeared, so it seems I guessed right. :)

I'm not sure adding such "spam users" to the default ACLs would be effective - if they do come back and find they no longer have access with that username, they could just create a new account. But spammers rarely keep coming back in person (and we haven't seen scripting spammers yet - for which we'd need a quite different approach). Our previous signed-up spammers were discouraged soon enough when they found their "contributions" were removed as fast as they were created. We've had two so far (before this one) if I remember correctly, and both gave up after the second attempt.
I think locking down or removing their user page will be effective enough to chase away these primitive spammers.
(Also, letting default ACLs grow into a long list of banned spammers will likely become quite inefficient.)

We do need to think about how to defend against scripting spammers though. Content blocking (see above) should certainly help - and be preventive rather than reactive.
Comment by JavaWoman
2005-05-28 07:00:32
Gosh, the stupid F*@k did come back - another locked page now - but I tried a Google search for [SomaCarisoprodol wiki].... hmm....

While doing the latter I stumbled upon an apparent content filter system in a wiki; I'll be adding links to it above.
Comment by c-24-21-81-115.hsd1.or.comcast.net
2005-09-22 07:33:53
I'm a little confused by the above. If I just wanted the one seemingly most useful anti-spam technique for Wikka, what would that be?

- timm
Comment by DarTar
2005-09-22 07:40:04
The most straightforward measure is to modify the default ACL. You may also want to check http://wikka.jsnx.com/SecurityModules.
Comment by SimonFinch
2007-01-03 09:08:50
Implementing Bad Behavior - http://wikkawiki.org/BadBehavior - is by far the most effective way of combatting comment spam - it takes about 5 minutes to install and set up. (Be sure to use version 1.2.4 of BB, rather than the most recent.)
Comment by WyRd
2007-10-11 13:44:41
I rather like the use of captchas for user validation. What I don't like about them, though, is that the little images that seem to make up most of them are damned difficult to read - and my eyes are very good at discerning color and contrast.

From my perspective, the point of a captcha is to create a challenge that only a human can answer. Usually that's "What letters/numbers are contained in that freaking .gif?!" The data field, one might say, is very complex while the question is simple. This seems to be the method of every captcha system I've seen. But what about the other way around? Make the data field simple but the question complex. Let's say the following "data" block is given for authorization:

sdahjksd ahkjhjksdah ahjkhsjkah 8y789hajsh ajhhja g a789789 sa hgjhjkhas aghjkghjkh asajk ahjkh sajkhk

And then the user is presented with two questions:

Please enter the first block of letters where the first letter is repeated in another block.
Please enter the first block with fewer than four characters.

Dozens, if not hundreds, of such questions can be generated. Just reversing a few terms can grow the questions as well. (The last block, the second block, the block with more than four characters, etc). The server knows which blocks are correct and adds filler blocks that don't meet the criteria.

The point of the questions is to keep the logic simple, but to also keep the questions from directly referring to something in the data to be parsed, thus maximizing the human interaction required. Further, it should be possible to localize the questions to other languages. The blocks to be parsed can remain text, thus keeping server resources low. To make things even more cryptic, the blocks can be displayed in different fonts (which gives the option of asking questions about the fonts being displayed...). (Of course, that requires a browser that can display the fonts.) Larger and smaller blocks can also be used, including multiple lines, etc.

A good idea... or a new level of insanity? :)
Comment by DarTar
2007-10-12 08:03:19
WyRd, thanks for your suggestion - this (simple logical or linguistic tests that a human can pass but a computer in principle cannot) is precisely what we are considering as a possible antispam measure. This will be included, among others, in the forthcoming antispam release (1.1.6.4).
Comment by WillyPs
2007-10-15 22:57:03
I think anyone with English as a second language, or otherwise not well educated in English or logic, may have trouble with that.
Comment by WyRd
2007-10-16 00:30:01
Which is where localization comes in. :)