Kuro5hin.org: technology and culture, from the trenches

Protecting Your Users From Spam Crawlers

By theboz in Internet
Wed Dec 17, 2003 at 04:15:12 PM EST
Tags: Security (all tags)

According to a study by the CDT, a large share of the spam that people receive results from placing their email address on the web. People who are aware of this take precautions when displaying their address publicly. The common advice is to include words like NOSPAM or to swap out the @ and . symbols. However, as crawlers get more sophisticated, such simple defences are easily decoded. There are other ways to protect your email address, particularly on your own site. This is a brief overview of some techniques to block bad crawlers and protect email addresses from spammers.

Protecting The Address
The first step that most people take is to find ways to protect the email address itself. As mentioned above, a simple addition to your email address such as NOSPAM can help, but it isn't foolproof and it needlessly forces the reader to interpret the address. This is fine for your friends, but your grandmother may not realize that she has to remove the NOSPAM from your email address. This is the most basic and least desirable method of protecting your email address.

A more sophisticated method is to use JavaScript to encode your email address in such a way that a crawler that doesn't run JavaScript can't parse your real email address. Some people have disabled JavaScript or use a cell phone or a PDA that may not run JavaScript, making this method problematic. Furthermore, there are most likely crawlers out there that can process JavaScript.
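As a rough illustration of the JavaScript approach, the sketch below stores the address as a list of character codes and rebuilds it only when a script runs. The function names encodeAddress and decodeAddress and the address me@example.com are made up for this example; this is one possible encoding, not a specific published script.

```javascript
// Hypothetical helpers for illustration only.

// Turn an address into an array of character codes, so the literal
// "user@domain" string never appears in the page source.
function encodeAddress(address) {
  return Array.from(address, (ch) => ch.charCodeAt(0));
}

// Rebuild the address from the character codes at run time.
function decodeAddress(codes) {
  return String.fromCharCode(...codes);
}

const encoded = encodeAddress("me@example.com");
const decoded = decodeAddress(encoded);
```

In a page, only the array of numbers would be written into the markup, and decodeAddress would be called from a script when the address is needed. A crawler that executes JavaScript would still recover it.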

Crawlers can be developed in such a way that they can decode the patterns people use to encode their emails. Any technology available to end users can be automated as well. It is possible to build an email harvester with VBA, Internet Explorer, and Outlook that dumps the list of email addresses on a site directly to your Outlook Address Book. The small amount of effort required to harvest email addresses means that any method of disguising your email address can eventually be broken.
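To give a sense of how little effort decoding takes, here is a minimal sketch of the kind of normalization a harvester could apply to a single candidate string. The deobfuscate function and the patterns it handles are assumptions chosen for illustration, not taken from any real crawler.

```javascript
// Hypothetical normalizer: undoes a few common hand-made disguises.
function deobfuscate(text) {
  return text
    // Drop the NOSPAM marker wherever it appears.
    .replace(/NOSPAM/gi, "")
    // "(at)", "[at]" or a standalone " at " becomes "@".
    .replace(/\s*(\(at\)|\[at\]|\bat\b)\s*/gi, "@")
    // "(dot)", "[dot]" or a standalone " dot " becomes ".".
    .replace(/\s*(\(dot\)|\[dot\]|\bdot\b)\s*/gi, ".");
}

const plain = deobfuscate("userNOSPAM at example dot com");
```

Three regular expressions are enough to defeat the most common disguises, which is why such tricks only help against the least capable crawlers.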

This means that the best way to hide your email address is simply not to expose it at all. Many scripts exist that let users send you an email via an HTML form. These scripts have problems of their own, since they allow anonymous emails. You can secure them further by allowing only one message per IP address within a given period, recording the sender's IP address, and filtering out HTML.
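The form safeguards mentioned above can be sketched roughly as follows. The names createLimiter and stripHtml, the in-memory Map, and the sample addresses are all assumptions for illustration; a real form handler would persist this state and would use a proper HTML sanitizer rather than a single regular expression.

```javascript
// Hypothetical per-IP limiter: one accepted message per IP per time window.
function createLimiter(windowMs) {
  const lastSent = new Map(); // IP address -> time of last accepted message
  return function allow(ip, now) {
    const previous = lastSent.get(ip);
    if (previous !== undefined && now - previous < windowMs) {
      return false; // still inside the window, reject
    }
    lastSent.set(ip, now);
    return true;
  };
}

// Naive tag removal: deletes anything that looks like <...>.
function stripHtml(text) {
  return text.replace(/<[^>]*>/g, "");
}

const allowMessage = createLimiter(60 * 1000);
const firstAttempt = allowMessage("192.0.2.1", 0);
const secondAttempt = allowMessage("192.0.2.1", 30 * 1000);
```

Passing the current time in as an argument keeps the limiter easy to test; the IP address is still worth logging alongside each message so that abuse can be traced afterwards.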

In addition to your own web pages, WHOIS databases provide abundant valid email addresses for spammers. You should consider using a throwaway address or one of the companies that will hold your domain for you in order to protect yourself.

Fighting Web Crawlers
While there are many good crawlers, such as Googlebot, that provide a service by placing your site on a search engine, there are many more that are bad. The most common technique for fighting off the bad crawlers is to hide lists of fake email addresses on a page to poison the spammer's list. This doesn't always work, because more intelligent crawlers can do DNS lookups to make sure that an address's domain exists. One problem with a list of fake addresses at a valid domain is that people who happen to hold those accounts could be bombarded with bounced emails and angry responses.

One thing that you can do to make such a honeypot more effective is to capture the domain of the crawler on the fly. That way, when spammers visit your site and take email addresses, they are more likely to collect fake addresses at their own ISP and anger the system administrator there. A slightly malicious variation is to perform a WHOIS lookup on their ISP and add the email addresses listed there to the honeypot. Adding commonly used addresses like "abuse", "root", and "administrator" at their domain would also work. It isn't really a good idea to do this, because you would be punishing innocent employees of that ISP. An alternative would be to send an email to these people warning them that a spam harvester is using their network.
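Generating honeypot addresses at the crawler's own domain might look like the sketch below. The function name fakeAddresses, the injectable random source, and the domain isp.example are assumptions made for this example; how the crawler's domain is obtained is left out.

```javascript
// Hypothetical generator: builds `count` random ten-letter addresses at `domain`.
// `random` defaults to Math.random and can be replaced for testing.
function fakeAddresses(domain, count, random = Math.random) {
  const addresses = [];
  for (let i = 0; i < count; i++) {
    let local = "";
    for (let j = 0; j < 10; j++) {
      // Pick a lowercase letter: character codes 97 ('a') to 122 ('z').
      local += String.fromCharCode(97 + Math.floor(random() * 26));
    }
    addresses.push(local + "@" + domain);
  }
  return addresses;
}

const poison = fakeAddresses("isp.example", 3);
```

Because the local parts are random, they are unlikely to match a real mailbox, which avoids the problem of bouncing mail onto innocent account holders.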

Even more effective than a honeypot would be to block the bad crawlers altogether. If you use .htaccess files on Apache with mod_rewrite, you can easily block the crawlers that you don't want to have access to your site. You can block them by user-agent string, IP address, or hostname, and you can use wildcards. Most people do this by going through their logs by hand and adding .htaccess entries manually. It can be done programmatically, though, in a way that should stop most of the bad crawlers that hit your site.
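Generating such entries programmatically might look like the following sketch. The helper names and the sample addresses (taken from documentation ranges) are assumptions; the directives themselves (RewriteEngine, RewriteCond, RewriteRule with the F and L flags, and the OR flag to chain conditions) are standard mod_rewrite.

```javascript
// Escape characters that have a special meaning in a regular expression,
// so that "192.0.2.10" only matches that exact address.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build a mod_rewrite block that returns 403 Forbidden
// to every address in `ips`.
function blockRules(ips) {
  if (ips.length === 0) {
    return ""; // never emit a bare RewriteRule, it would block everyone
  }
  const lines = ["RewriteEngine On"];
  ips.forEach((ip, index) => {
    // All conditions except the last are chained with [OR].
    const flag = index < ips.length - 1 ? " [OR]" : "";
    lines.push("RewriteCond %{REMOTE_ADDR} ^" + escapeRegex(ip) + "$" + flag);
  });
  lines.push("RewriteRule .* - [F,L]");
  return lines.join("\n") + "\n";
}

const rules = blockRules(["192.0.2.10", "198.51.100.7"]);
```

Anchoring the pattern with ^ and $ and escaping the dots matters: an unescaped 1.2.3.4 is a regular expression that would also match other addresses.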

The first step is to set up a robots.txt file in your main directory. Many bad crawlers either ignore robots.txt or even use it as a guide of where to look. You can use this to your advantage and put an entry for a script in it that will be both your honeypot and your means of blocking them.

The script should generate a fake list of email addresses in order to poison their list, then add an entry to your .htaccess file to block them from your site. You should reference this script both in your robots.txt (as an excluded path) and as a link on the first page that loads on your site. The link should be set up in such a way that an end user won't click on it, but a crawler that simply parses the data will not be able to tell that it's unclickable. This can easily be done with CSS by placing the link in a portion of your page (preferably towards the top) inside a DIV that is covered up by something else. This prevents a real user from clicking the link, which would cause you to block the wrong person by accident. Another amusing touch is to set up a custom 403 error page in your .htaccess file. You can redirect blocked crawlers to an entirely different server, or you can stick with the default page.

The details of the script can vary depending on what you want to do. The basics, however, are to draw their ISP's attention to them and to block them from your site. The following is a PHP script that generates a list of fake emails showing exactly who the spammer was, and blocks their IP address in your .htaccess file.

<?php
// Variables that need to be set.
// Note: $HTTP_USER_AGENT, $REMOTE_ADDR and $REMOTE_HOST are only populated
// when PHP runs with register_globals enabled.
$htfile = '.htaccess'; // Put the path to the .htaccess file
$currtime = date("F j, Y, g:i:s a"); // The current time for the log
$spamtime = date("n-d-Y_H-i-s"); // The time for the fake email address

// This section will ban the IP address in the .htaccess file
$htofile = fopen($htfile, 'a');
$logvar = "\n#Timestamp: " . $currtime . "\tUser Agent: " . $HTTP_USER_AGENT . "\tIP Address: " . $REMOTE_ADDR . "\tRemote Host: " . $REMOTE_HOST . "\n";
$blockentry = "RewriteCond %{REMOTE_ADDR} " . $REMOTE_ADDR . "\n" . "RewriteRule .* - [F,L]\n";
fputs($htofile, $logvar . $blockentry);
fclose($htofile);

// This will get the TLD of the domain to recreate the domain name
$tld = strrev(substr(strrev($REMOTE_HOST),0,strpos(strrev($REMOTE_HOST),".")));
if (strlen($tld) == 3) {
    $domain = strrev(substr(strrev($REMOTE_HOST),0,strpos(strrev($REMOTE_HOST),".",4)));
} elseif (strlen($tld) == 2) {
    $domain = strrev(substr(strrev($REMOTE_HOST),0,strpos(strrev($REMOTE_HOST),".",3)));
} else {
    // No usable hostname: fall back to the IP address
    $domain = $REMOTE_ADDR;
}

// Create and list 1000 fake email addresses
for ($i = 0; $i < 1000; $i++) {
    $email = "";
    // Ten random lowercase letters for the local part
    for ($j = 0; $j < 10; $j++) {
        $email = $email . chr(rand(97, 122));
    }
    $email = $email . ".TIME" . $spamtime . ".IP" . $REMOTE_ADDR . "@";
    $email = $email . $domain;
    echo("\n<a href=\"mailto:$email\">$email</a><br>");
}
?>


Planned improvements include replacing the kludgy hostname parsing with DNS lookups, and possibly alerting the abuse address at the crawler's ISP the moment it starts gathering email addresses. Obscurity also matters here, so if you save this as maillist.php, you may want to rename it after a while to send_mail.php in order to keep the people who develop the crawlers guessing. Although not perfect, this script should be a good start for blocking spam crawlers.
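Short of a DNS lookup, the hostname parsing could at least be made easier to read. The sketch below is not the DNS-based improvement mentioned above; it is a label-based heuristic with a small, hypothetical list of two-level suffixes (a real implementation would need a complete suffix list). The names guessDomain and TWO_LEVEL_SUFFIXES and the sample hostnames are assumptions.

```javascript
// Hypothetical, deliberately incomplete list of suffixes that span two labels.
const TWO_LEVEL_SUFFIXES = new Set(["co.uk", "com.au", "com.br", "co.jp"]);

// Guess the crawler's domain from its reverse-resolved hostname.
// Falls back to the IP address when no usable hostname is available.
function guessDomain(remoteHost, remoteAddr) {
  // Empty hostname, or a hostname that is just digits and dots (an IP).
  if (!remoteHost || /^[\d.]+$/.test(remoteHost)) {
    return remoteAddr;
  }
  const labels = remoteHost.toLowerCase().split(".");
  if (labels.length < 2) {
    return remoteAddr;
  }
  const lastTwo = labels.slice(-2).join(".");
  // For suffixes like "co.uk", the domain is the last three labels.
  if (TWO_LEVEL_SUFFIXES.has(lastTwo) && labels.length >= 3) {
    return labels.slice(-3).join(".");
  }
  return lastTwo;
}

const guessed = guessDomain("dialup-12.isp.example.com", "192.0.2.1");
```

Splitting on dots and counting labels avoids the nested reverse-and-search calls, at the cost of maintaining the suffix list by hand.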



Protecting Your Users From Spam Crawlers | 82 comments (57 topical, 25 editorial, 1 hidden)
Also, you could keep your email address in a GIF. (2.85 / 7) (#2)
by MyDyingDuck on Tue Dec 16, 2003 at 12:16:52 PM EST

They're seeding the clouds today.
Watch nothing's gonna go your way.

Does this still work ? (none / 2) (#11)
by bugmaster on Tue Dec 16, 2003 at 01:48:16 PM EST

That's what I'd do (if I felt the need to publish my address). However, the annoyance factor of not being able to click the link, or at least copy/paste the address, is large enough to deter some users. Furthermore, I am not convinced that there isn't a spambot out there that simply OCRs every GIF it sees, just in case.
[ Parent ]
Someone else mentioned it (none / 2) (#14)
by theboz on Tue Dec 16, 2003 at 03:25:25 PM EST

This comment mentions a site that has a few different ways to distort text so that it won't be OCRable. Of course, some of it doesn't appear to be too human readable either.

[ Parent ]

you're fucking joking (none / 2) (#16)
by gjp on Tue Dec 16, 2003 at 03:37:59 PM EST

do you realize how long it would take to OCR every gif on the internet?

this article is pointless anyway. most spambots are fooled by "at" instead of @.

Proof that Google is EVIL!.
[ Parent ]

No kidding (3.00 / 4) (#25)
by bugmaster on Tue Dec 16, 2003 at 05:47:47 PM EST

do you realize how long it would take to OCR every gif on the internet
Yes, but so what ? I think you underestimate the spammers' tenacity; plus, the smarter ones can surely come up with several heuristics to simplify the search; I can think of a couple just off the top of my head.
most spambots are fooled by "at" instead of @
The keyword you're looking for is "...for now". Spammers and anti-spammers are in an arms race, and Moore's law is on the side of the spammers, at least initially. The recent crop of MT comment-based spam is a good example: regular spam lost its effectiveness, so the scum figured out how to gimmick the Google PageRank. This is already a pretty sophisticated attack; open-relay scanning bots are another. Regexping "at" to @ is cake compared to that.
[ Parent ]
Regardless... (none / 1) (#40)
by skim123 on Wed Dec 17, 2003 at 12:25:51 AM EST

Spammers have programs that will guess email address - pick common words, names, and combinations thereof, and add things like @hotmail.com, @yahoo.com, @aol.com, etc.

Spam sucks. The only way I've found to keep it at bay is to use a challenge/response anti-spam system. Granted, this might piss off some folks who don't want to have to go to the trouble of registering with my system, but it has reduced my daily spam from 100+ to about 5.

Money is in some respects like fire; it is a very excellent servant but a terrible master.
PT Barnum

[ Parent ]
My Solution (none / 1) (#42)
by bugmaster on Wed Dec 17, 2003 at 05:48:42 AM EST

I use a passive spam filter -- PopFile. It took a while to train, but now it stops 99.99% of all the spam I get, with no false positives. I was thinking of challenge/response, but some of my friends (and users) are not tech-savvy; such a system would only confuse them. Sad but true.
[ Parent ]
get a fucking grip (none / 2) (#44)
by reklaw on Wed Dec 17, 2003 at 09:47:39 AM EST

Are you spam-paranoid or what? Yeah, they're going to OCR images and spend hours writing difficult scripts just to get your email address, because you're so important!

A clue: spammers aren't looking for people who obfuscate their addresses, because such people are likely to be troublemakers and aren't going to buy anything. They're looking for Joe AOL/Hotmail/Yahoo User with an email address in plain sight who's going to send off for some penis enlargement pills or whatever.

The spammers aren't out to get you.
[ Parent ]

Paranoia saves lives (none / 1) (#51)
by bugmaster on Wed Dec 17, 2003 at 12:34:36 PM EST

No, the spammers are not out to get me personally. They are out to get everyone. If everyone starts obfuscating emails, then spammers will write better de-obfuscators. Build a better mousetrap, someone inevitably builds a better mouse.
[ Parent ]
Moore's Law? (none / 1) (#46)
by tps12 on Wed Dec 17, 2003 at 11:02:01 AM EST


[ Parent ]
Am I missing something? (none / 2) (#33)
by readpunk on Tue Dec 16, 2003 at 08:00:47 PM EST

"However, the annoyance factor of not being able to click the link," this is confusing to me.

<a href="mailto:hipster@indierockcentral.com" target="_blank"><img src="http://indierockcentral.com/pictures/EMAILADDRESS.GIF" border="0"></a>

Are you saying the html I just illustrated is impossible?

[ Parent ]

Just a Little (none / 2) (#35)
by AndrewW on Tue Dec 16, 2003 at 08:30:33 PM EST

The HTML code you provided is definitely possible, but I don't think you got the entire point. If you're going to the length of encoding your e-mail address as an image so spammers can't harvest it, it's a waste of time if you store it directly in the HTML code with mailto:your-address. Spammers will pick that up and won't even bother with the image.

[ Parent ]
Yep. (none / 1) (#37)
by readpunk on Tue Dec 16, 2003 at 10:21:34 PM EST

You are right.

[ Parent ]
I hate to say this... (none / 0) (#71)
by curunir on Thu Dec 18, 2003 at 02:39:57 PM EST

...since I hate flash, but it could be used to display the email and include the mailto link. How many email harvesters are going to go through the trouble of retrieving all flash animations and fully parsing them for addresses?

[ Parent ]
Section 508 (none / 0) (#77)
by pin0cchio on Sat Dec 20, 2003 at 12:31:45 PM EST

If your web site is required by law to comply with Section 508 of the Rehabilitation Act or corresponding accessibility legislation in other countries, placing the address in an image won't work. That's why I use a feedback form on my personal site. Even though I don't do business with the U.S. government (excluding me from Section 508 for now), I still try to practice so that if I do get a paying job in web development, I'll be well practiced in accessibility.

[ Parent ]
A simple enhancement of the javascript trick. (3.00 / 5) (#5)
by i on Tue Dec 16, 2003 at 12:46:11 PM EST

Place something like this in a separate frame:


function replaceme(lhs, rhs) {
  document.write('<A HREF="mailto:');
  document.write(lhs + "@" + rhs);
  document.write('">mail me</A>');
}


<a href="haha" onmouseover='replaceme("me", "example.com")'>mail me</a>

A sophisticated harvester might execute initial javascript, but it won't execute event handlers!

(Javascript gurus: how to avoid the separate frame requirement?)

and we have a contradicton according to our assumptions and the factor theorem

off the top of my head... (none / 1) (#38)
by sal5ero on Tue Dec 16, 2003 at 11:23:55 PM EST

something like:

<script language="JavaScript">
function replaceme(obj, lhs, rhs) {
  obj.outerHTML = '<a href="mailto:' + lhs + '@' + rhs + '">mail me</a>';
}
</script>

<a href="haha" onmouseover='replaceme(this, "me", "example.com")'>mail me</a>

not sure how compatible this would be across browsers, though...

[ Parent ]

OK, here's a patched version... (none / 2) (#58)
by gusnz on Wed Dec 17, 2003 at 06:38:05 PM EST

I have to say that's a good idea. I've had a stab at rewriting it, mine's similar to sal5ero's version but using 'setAttribute()' instead of 'outerHTML' as that's a mostly-IE-only property. So try this:


<script type="text/javascript">

function replaceme(obj, user, domain) {
 var addr = user + '@' + domain;
 if (obj.setAttribute) obj.setAttribute('href', 'mailto:' + addr);
 else prompt('My email address is:', addr);
}

</script>


<a href="#" onmouseover="replaceme(this, 'user', 'domain.com')"
 onfocus="replaceme(this, 'user', 'domain.com')">Email me</a>
<noscript>You can email me using the account 'user' at this domain.</noscript>


This should be quite cross-browser compatible and accessible (insert large rant about web standards here :). Newer (v5+) browsers that support the DOM should just transparently replace the link target when the link is hovered/focused and the user won't notice anything (you can just use onfocus if you want instead of both event handlers).

Older browsers will display a popup box with the email address as selectable text in it. And anyone with a JS-disabled browser setup will get a nice mixed-up address to parse the old fashioned way.

(Feel free to use and redistribute this, I'm putting the script in the public domain of course. Just thought I'd better specify that.)

[ JavaScript / DHTML menu, popup tooltip, scrollbar scripts... ]

[ Parent ]

I simply have an email in France (none / 1) (#17)
by xutopia on Tue Dec 16, 2003 at 03:59:04 PM EST

where laws were passed to make it illegal to spam people! :)

hotmail is in france? [nt] (none / 0) (#18)
by the77x42 on Tue Dec 16, 2003 at 04:00:47 PM EST

"We're not here to educate. We're here to point and laugh." - creature
"You have some pretty stupid ideas." - indubitable

[ Parent ]
The U.S. too (none / 0) (#22)
by theboz on Tue Dec 16, 2003 at 04:24:43 PM EST

Bush just signed a bill today that is supposed to stop spam, although it will never work and it weakens preexisting spam legislation many states have. The problem is that if your email address is picked up by a spammer in Brazil or Russia, there's not much you can do.

[ Parent ]

You wish! (none / 0) (#64)
by Entendre Entendre on Wed Dec 17, 2003 at 11:02:39 PM EST

The new law isn't supposed to stop spam, it's supposed to regulate it... and the loopholes are enormous.

You'll still get boatloads of spam every day. But now it will be 'clearly and conspicuously identified' as spam - though identified in a way that anti-spam software can actually use.

Our lawmakers are idiots.

Reduce firearm violence: aim carefully.
[ Parent ]

If spam bugs you, this is a good use of your time. (1.00 / 31) (#26)
by elenchos on Tue Dec 16, 2003 at 05:56:55 PM EST

My first reaction is amazement that anyone really cares this much about spam that they would put this much time and effort into doing something about it. When you think about the hundreds of teeny tiny little things in life that irritate us in minor ways, how could one or two unexpected emails rate this kind of obsessive attention? But who am I to judge that?

What does demand my interest is the ongoing movement by a few anti-spam obsessive-compulsives to pass legislation to attack this so-called "problem". The damage to our economy through the lost business to these vilified "spammers" is disturbing enough, but the permanent damage to free speech for all of us is truly frightening. Anti-spam nuts don't care, but if you're a normal person, doesn't it worry you that President Bush, a known enemy of free speech as we know it, is so pleased to sign into law new limits on free speech in the guise of "Canning Spam"?

Talk about catering to a special interest! If the President wanted to help the anti-spam compulsives, he would fund therapy to help them let go of this fixation. But if inflating the importance of this little distraction can justify limiting the free exchange of ideas, well, then the motives behind it become clear.

But to return to the topic at hand, I have to applaud "the Boz" for redirecting the considerable nervous energy of the anti-spam crowd. When you think about all the many ways you could try to programmatically protect email addresses from these mythic "spambot" beasts, you realize that the pursuit of this quest could absorb all of a sufficiently-motivated person's free time. And the more time they spend tilting at that windmill, the less they can devote rewriting the rules that govern the lives of ordinary folks like you and me.

A psychiatrist could really have a field day with the mytho-sexual complex of fears and needs represented by these shadowy "spammers" and their polymorphic "spambots". Like the Red Menace or the malevolent demons of the Dark Ages, these ghostly creatures can, in the imagination, take on any characteristic required to justify the most elaborate precautions.

Are they susceptible to garlic? Well, no one knows, but perhaps some day a spam chimera will appear that is, so we must devise a garlic defense against it now. Do spammer witches have weapons of mass destruction (WMD)? No, but they could! Given enough time, they just might. Better do something preemptively about them now!

See, we aren't just talking about the fight against what exists, but everything a totally fixated person envisions might exist someday. Future historians will compare the Spam Panic of 2003 with the Salem Witch Trials, and like past hysterical overreactions, the silly laws we have burdened ourselves with will eventually be sheepishly repealed.

In the mean time, if you are bothered that much by spam, please take "The Boz's" advice, and direct your wrath into thinking up more and more diabolical "spambot" tricks-that-could-exist and appropriately labor-intensive responses.

It will make you feel better about your spam obsession, and reduce the thousands of hours of lost productivity by normal people, wasted placating those who worry about spam above all else.


Artless troll. (none / 3) (#27)
by caek on Tue Dec 16, 2003 at 06:04:29 PM EST

[ Parent ]
Freedom of speech and spam... (none / 3) (#29)
by Saad on Tue Dec 16, 2003 at 07:00:02 PM EST

are totally different things. In the days of the founding fathers, freedom of speech was the freedom to speak your mind, to believe what you wanted, and not be prosecuted for what you believe is working for YOU.

So freedom as far as your hands can reach, but only until it interferes with the freedom of other people.

Spam has nothing to do with freedom of speech.

It's rather a freedom for any kind of aggressive marketing. I wonder if you would be so willing to treat it as freedom of speech with a cell phone, and a normal phone, ringing all the time on a Sunday evening, when you just want to have a good time with your wife while awaiting a call from a friend about their new baby. You get 16 calls about the new vacuum cleaner and mortgage options, and the 17th call is from your friend: he has a boy. This is insane, and has nothing to do with freedom of speech.

Firstly, because spammers do not speak their minds but speak their wallets.

Secondly, because freedom of speech does not mean I can come up to your house and shout all night that my penis is bigger than yours.

In many countries it is not allowed to lie (in commercials). Does that imply there is no freedom of speech? Of course not.
[ Parent ]

you trust the gov. too much (none / 0) (#66)
by auraslip on Thu Dec 18, 2003 at 02:39:20 AM EST

[ Parent ]
Sigh (2.00 / 9) (#34)
by driptray on Tue Dec 16, 2003 at 08:20:16 PM EST

This style of trolling seems so old-fashioned now. It's sadly out of place, a relic from a better time. elenchos, you're gonna have to update your act, or at least take it some place where it's still understood.

I hope I'm wrong.
We brought the disasters. The alcohol. We committed the murders. - Paul Keating
[ Parent ]

Acronym (none / 1) (#41)
by kitten on Wed Dec 17, 2003 at 12:49:12 AM EST

Do spammer witches have weapons of mass destruction (WMD)?

This is a phenomenon seen primarily in government communiques - a phrase, followed by an acronym, which is never referred to again within the same document.

If you intended to re-use the acronym later in the text, it would make sense to attach it to the phrase, but since you use it only that one time, it is rather silly and pointless (SAP).
mirrorshades radio - darkwave, synthpop, industrial, futurepop.
[ Parent ]
heh (none / 1) (#47)
by tps12 on Wed Dec 17, 2003 at 11:11:43 AM EST

Yeah, I read an [otherwise very interesting] essay last week that not only followed "Take Our Daughters to Work Day (TODTWD)" with the full phrase in the next sentence, but then seemed to either spell it out or use the abbreviation at random for the rest of the piece. Very distracting. There should be a way of telling one's document editor when you've used an "abbreviable" term, and it should be smart enough to handle each repetition appropriately.

[ Parent ]
i have another myth for you (none / 1) (#60)
by zzzeek on Wed Dec 17, 2003 at 08:46:23 PM EST

its called a "web server logfile". if you take an intelligent look at yours you might find quite a bit of these "mythical" crawlers violating your robots.txt rules.

[ Parent ]
The fearful result being what? A spam? (none / 1) (#61)
by elenchos on Wed Dec 17, 2003 at 09:21:18 PM EST

Having an unwanted email show up a couple times a year doesn't annoy me any more than someone taking up two parking spaces or leaving a cigarette butt on the sidewalk. Like nearly everyone else, I get over it and move on.

This is probably why none of my server maintenance staff have ever come running up the basement stairs shouting that there are "crawlers violating your robots.txt rules" and asking me to alert the FBI. What difference do these little things make to grown ups?

But the cool thing is that -- unlike people who park their cars poorly -- you can do something about these spammers that aggravate you so. You can go down into your basement and twitter away at little programs that swat these pests. I suppose defeating them has a certain pathetic satisfaction, like getting one of those bug zappers for your porch.

Fun, I guess. Knock yourself out, ok? Go fight evil, and whatnot. But the real world will go on turning happily, hardly being aware of this spam menace until someone comes along to screech about it.

[ Parent ]

which internet are you using ?? (none / 0) (#72)
by zzzeek on Thu Dec 18, 2003 at 02:53:17 PM EST

i had an all new address on an all new site for about a week, that address, which i dont use otherwise, gets 10-20 spams a day, strictly due to that one week exposure. older addresses which i had sitting on sites for months i had to retire, as they would be receiving upwards of 100-200 spams a day. im not really sure which internet you are using that you only get one or two spams a couple of times a year, with an email address that youve had posted on a public webpage. i have a few other addresses that have never been posted on any webpages ever, nor used in any online forms of any kind, and even they are somehow getting a few spams every day. maybe you have an extremely hardworking spam-blocking effort happening at your ISP without your knowledge or something, but most people certainly dont.

[ Parent ]
No one will take you seriously now. (none / 0) (#73)
by elenchos on Thu Dec 18, 2003 at 04:56:58 PM EST

This is like all the wild claims about the beauty and perfection of Lunix, the unassailable security of Apache, or the stories of unprecedented disaster that would accompany the Year 2000 (Y2K).

It works at first -- you can toss around the most absurd exaggerations and many people will just trust you. But sooner or later people are going to ask "have I or anyone I know ever had 10-20 spams a day!?" I mean, you don't think the average American is going to go on believing that forever without ever seeing it himself?

But then you go on to claim "100-200 spams a day"! Just imagine how much network traffic that would represent!

This is obviously some kind of joke, and in case you don't already know, it only hurts your cause in the long run. Whatever genuine reason for complaint that you once had is destroyed once everyone thinks of you as a liar.

[ Parent ]

100 spams per day? Luxury! (none / 0) (#75)
by grinder on Fri Dec 19, 2003 at 09:01:12 AM EST

But then you go on to claim "100-200 spams a day"! Just imagine how much network traffic that would represent!

Yes, imagine that!

I just checked my mail server logs. Since the first of December, it has rejected as spam 9267 messages addressed to me. That's over 400 per day!

Imagine what would happen to my bandwidth if I let that in? Actually no, don't bother. 400 messages at an average of 4k per message come to less than 2Mb. That would saturate my pipe for, oh, about 15 seconds per day.

But 400 spams in my inbox would really ruin my day.

[ Parent ]
-1 spam (1.00 / 16) (#28)
by Ronald Reagan3 on Tue Dec 16, 2003 at 06:13:33 PM EST

Find Spam Hosts (2.00 / 4) (#31)
by cronian on Tue Dec 16, 2003 at 07:20:39 PM EST

Can't you just get some email address, and post it all over the internet. Then, when the spam-bots find it they will send email to that address, and you can have it record where the SPAM came from, and automatically block those servers.

We perfect it; Congress kills it; They make it; We Import it; It must be anti-Americanism
Robots != Spammers (none / 0) (#74)
by babazaroni on Fri Dec 19, 2003 at 01:53:36 AM EST

Robots collect the addresses. Spammers buy them. Their IP addresses are not the same or related.

[ Parent ]
At.. (none / 1) (#49)
by yicky yacky on Wed Dec 17, 2003 at 11:59:18 AM EST

..theboz's request:

It's worth pointing out that your script will only run as printed if php is running with 'register globals' enabled.

Many (most?) ISPs and people who administer their own servers run PHP with 'register globals' turned off for security reasons; it has been off by default since PHP 4.2.0.

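The variables concerned are still accessible, indexed by the same names as strings in the $_SERVER 'superglobal' array (from PHP 4.1 onwards). So $REMOTE_ADDR becomes $_SERVER['REMOTE_ADDR'], $REMOTE_HOST becomes $_SERVER['REMOTE_HOST'], and $HTTP_USER_AGENT becomes $_SERVER['HTTP_USER_AGENT'].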


Additionally, the use of a 'choke threshold', or daily count, is a trivial and easy way of preventing 'contact form abuse' when used on a server / web page where the expected contact traffic will be fairly low. This can save monkeying about with the can-of-worms that can result from restricting IP-address-based usage of the contact form, but allows you to cross-check the daily logs to see if there's been any abuse and then act accordingly.

The downside is that, in addition to the normal (and, let's face it, fucking tedious) form validation, you have to catch a count overflow and provide a polite, and potentially annoying, "Sorry..." page instead of the form if the count has been exceeded. Also, it's polite to regurgitate a genuine user's text back to them in the highly improbable instance when they were actually writing at the moment the threshold was exceeded, to be used later if needed. You also have to code a count-resetting mechanism. Not hard, but tedious the first time you build it.

On one of our servers we use such a system, and it's proven to be less hassle, and more convenient to administer, than many other 'superior' methods.
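A daily count with a reset could be sketched as follows. This is only an illustration of the idea, not the system described above: the name createDailyCounter, the in-memory state, and the use of the UTC date as the reset boundary are all assumptions.

```javascript
// Hypothetical daily choke threshold: accept at most `limit` submissions
// per calendar day (UTC), then refuse until the date changes.
function createDailyCounter(limit) {
  let currentDay = null;
  let count = 0;
  return function tryAccept(now) {
    const today = now.toISOString().slice(0, 10); // "YYYY-MM-DD" in UTC
    if (today !== currentDay) {
      // First submission of a new day: reset the count.
      currentDay = today;
      count = 0;
    }
    if (count >= limit) {
      return false; // caller should show the "Sorry..." page
    }
    count += 1;
    return true;
  };
}

const acceptSubmission = createDailyCounter(2);
const firstOfDay = acceptSubmission(new Date("2003-12-17T10:00:00Z"));
```

Resetting lazily on the first submission of each new day avoids the need for a scheduled job; when tryAccept returns false, the form handler would show the "Sorry..." page and echo the user's text back.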

Good article, though.

yicky yacky
'The actual reasonable Britons are correct, you're being a cock.' - Hide The Hamster.
Couple of nice ones I've seen (none / 2) (#53)
by arvindn on Wed Dec 17, 2003 at 01:05:06 PM EST

* Display your email as an image (someone already mentioned this)

* "NAME@yahoo.com where NAME is steve234" (even grandmother can get this).

So you think your vocabulary's good?

I haven't found harvesting a problem (none / 1) (#55)
by TheophileEscargot on Wed Dec 17, 2003 at 02:48:54 PM EST

I've had the same unobscured e-mail address on all my K5 comments for about two years now. It's slowly increased to about one spam per day, most of which Yahoo filters to the Bulk Mail folder.

That may be an overestimate, since I use the same account to register for most websites.

Ironically I get about three or four per day to my "real" account in spite of trying to keep it private. That may be because of things being forwarded, or possibly because I have a fairly common real name.

I wouldn't put my real address in there though, in case someone decides to crapflood it. This way I can always throw away the account.

I suspect the biggest cause is just spammers putting likely-looking combinations of names into domains, and keeping the ones that don't bounce.
Support the nascent Mad Open Science movement... when we talk about "hundreds of eyeballs," we really mean it. Lagged2Death

The problems I've run into (none / 1) (#56)
by theboz on Wed Dec 17, 2003 at 04:53:29 PM EST

I put my "real" email address on my site, but only on my resume to make it easier for recruiters. However, I was getting quite a bit of spam before that, mostly due to the fact that I've used that address on all sorts of sites.

One thing that I've noticed is that job websites seem to be a newfound haven for email harvesting. A spammer only needs to pay a small fee to sign up as an employer and they can get the email addresses of all the people that are looking for jobs. I mostly discovered that because I was getting job search related spam, and eventually it went to the more traditional means. I've since used unique identifiers and my own domain name to try and find out which sites cause me to get the most spam, but so far I haven't had anything really coming from it. For example, on my hotjobs account I would have hotjobs@xxxxxx.xxx. That should also make it easier to block stuff later on if one address gets too much spam.
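The per-site address trick boils down to something like the sketch below. The function names and the example.com domain are placeholders, not anything from a real mail setup, and the actual delivery and blocking would happen in whatever mail server handles the domain.

```python
def address_for(site, domain="example.com"):
    """Build the unique address handed to one website, e.g. hotjobs@example.com."""
    local = "".join(ch for ch in site.lower() if ch.isalnum())
    if not local:
        raise ValueError("site name has no usable characters")
    return local + "@" + domain


def leaked_by(recipient, blocked=frozenset()):
    """Return (site, is_blocked) for the address a message was sent to."""
    site = recipient.split("@", 1)[0].lower()
    return site, site in blocked
```

Once one address attracts too much spam, its site name goes into the blocked set and everything sent to it can be dropped.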

[ Parent ]

That might well be it (none / 1) (#57)
by TheophileEscargot on Wed Dec 17, 2003 at 04:58:09 PM EST

I tend to give employers my real address while job-hunting.
Support the nascent Mad Open Science movement... when we talk about "hundreds of eyeballs," we really mean it. Lagged2Death
[ Parent ]
job hunt spam (none / 1) (#70)
by protogeek on Thu Dec 18, 2003 at 01:51:34 PM EST

Similar experience here. Before my last round of resume-posting, I hardly ever got spam at my main address. Shortly after putting the resume out, I started getting an ever-growing stack of sales newsletters (hint: I'm not in sales) and the like. Now it's degenerated to the point that 80+% of my mailbox is full of offers to enlarge various body parts, some of which I haven't got.

I sort of miss that rare, early spam. For some reason most of it was in German, and I amused myself trying to remember my college language classes enough to decipher it.

[ Parent ]

My personal experience (none / 0) (#68)
by Cameleon on Thu Dec 18, 2003 at 08:39:52 AM EST

I've done a few experiments using sneakemail. If I posted an email address on k5, it started getting spam within a week. The same with some other high traffic websites. Posting it on usenet results in even more spam than posting it on the web, and it starts faster. Posting it on my (no traffic) personal website doesn't do anything. Giving it away when signing up for web accounts and such didn't yield any spam yet either, and strangely, using an unspammed email address to unsubscribe from spam didn't do anything either. Putting 'NOSPAM' in the address stops all spam, except on usenet after a few months.

[ Parent ]
Interestingly enough (none / 0) (#59)
by Big Sexxy Joe on Wed Dec 17, 2003 at 07:36:46 PM EST

I find that many of the email addresses on this site are so scrambled and garbled with NOSPAM and misspelled domain names like yahho that I can't actually figure out what that person's actual email address is.

I'm like Jesus, only better.
Democracy Now! - your daily, uncensored, corporate-free grassroots news hour
That is by design. (none / 1) (#63)
by Entendre Entendre on Wed Dec 17, 2003 at 10:51:36 PM EST

It's not a bug, it's a feature.

Reduce firearm violence: aim carefully.
[ Parent ]

email via html form -> spam via html form (none / 2) (#62)
by Entendre Entendre on Wed Dec 17, 2003 at 10:50:44 PM EST

I started using the html-form method on my web site long ago, and it works pretty well. But I have reason to believe that these forms are going to be overrun by spammers soon.

I've received a half-dozen garbled messages via my web form in the last few months. There's little other than random crap in the messages. Googling a fragment of the crap indicates that the bot producing this spew is hitting web forms all over the world.

My guess is that this is testing the waters for future spamming. The return on the investment the spammers made in this bot will be a list of web sites that they can spam via HTML form submission.

Since email-via-html-form doesn't result in ads being placed on the web, an optimist might hope that such forms get ignored by spammers. But, spammers are by definition willing to work with low rates of return. I figure they'll soon be spamming every HTML form they can find.

Reduce firearm violence: aim carefully.

There's an easier way. (none / 0) (#65)
by tbc on Thu Dec 18, 2003 at 01:08:50 AM EST

Hasn't been as big a problem for me. I have a simpler way of dealing with it. See plussed users and spammer e-mail harvesting.

Or use sneakemail.

Sneakemail (none / 0) (#67)
by Cameleon on Thu Dec 18, 2003 at 08:32:50 AM EST

I've been using sneakemail for a while now, and it works like a charm. Most addresses don't get spam, and once one does, I just delete it.

And I just found out that my university (where I have my email address) supports those 'plussed' addresses. Thanks for that tip!

[ Parent ]

Anti-spambot script (none / 0) (#69)
by Loki The Younger on Thu Dec 18, 2003 at 12:15:20 PM EST

If you're going the javascript route, here's a nice perl script that will read an HTML file and automatically encode mailto links for you.  It's a real time saver if you're converting existing plain pages to spam-safe ones.

N.B.: I'm not the author of this script, just a satisfied user.
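To give an idea of what such a converter does, here is a small sketch in Python. It is not the Perl script mentioned above, and the regular expression and function names are made up for illustration. It only handles double-quoted href attributes and simply wraps each whole mailto link in a document.write call whose argument is written entirely as \uXXXX escapes, so neither the address nor the word mailto appears in the page source.

```python
import re

# Matches a whole <a ... href="mailto:...">...</a> element (double-quoted href only).
MAILTO_LINK = re.compile(r'<a\s[^>]*href="mailto:[^"]*"[^>]*>.*?</a>',
                         re.IGNORECASE | re.DOTALL)


def js_escaped(text):
    """Return text as a JavaScript string literal built only from \\uXXXX escapes."""
    data = text.encode("utf-16-be")
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return '"' + "".join("\\u%04x" % unit for unit in units) + '"'


def encode_mailto_links(html):
    """Replace every mailto link with a script that writes it out at load time."""
    def replace(match):
        return ('<script type="text/javascript">document.write(%s);</script>'
                % js_escaped(match.group(0)))
    return MAILTO_LINK.sub(replace, html)
```

As others in the thread point out, visitors without JavaScript will not see the link at all, and a harvester that runs scripts will still get the address.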

Poisoned Tarpit (none / 1) (#76)
by hengist on Fri Dec 19, 2003 at 09:25:25 PM EST

Here is an idea I've been kicking around in my idle moments; I'd appreciate some comments on it.

If, as has been repeated in the comments attached to this story, the spam bots ignore the robots.txt file, set up a little trap for them.

- Add a Disallow entry that specifies something really juicy-sounding, like private_email_list.html

- Embedded in that file is a script (I like PHP, but to each their own ;) ) The script randomly generates e-mail addresses pointing to popular domains (for added irony, randomly serve up the addresses of your favourite Nigerian spammers)

- script pauses random amount of time after spitting out each address

- loop ~million times

If an innocent browser somehow finds the page, they will just get sick of waiting and leave. If a bot accesses the page, it gets stuck there for a while, and gets a bunch of useless addresses in the process. So, it's a tar pit, and it poisons their data at the same time.

Extensions to this idea could include logging the IP address of the bot. Whenever this address tries to access any other page on your site, throw them in the tar pit. You could also do a reverse-DNS lookup on it, and automatically send e-mail to abuse,admin@their domain.
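A bare-bones sketch of the idea, in Python rather than PHP, might look like this. The names are invented for illustration, the delay limit is arbitrary, and the addresses point at the reserved example.* domains instead of popular ones so that nothing generated here can land in a real mailbox. Hooking it up to a web server is left out.

```python
import random
import string
import time

# Reserved documentation domains, swapped in for the "popular domains" of the idea.
DOMAINS = ["example.com", "example.net", "example.org"]

trapped_ips = set()  # in-memory only; a real site would persist this


def fake_address(rng):
    """Make one random, useless address."""
    length = rng.randint(5, 10)
    name = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return name + "@" + rng.choice(DOMAINS)


def tarpit(count=1_000_000, max_delay=5.0, rng=None, sleep=time.sleep):
    """Yield bogus addresses one at a time, pausing randomly after each one."""
    rng = rng or random.Random()
    for _ in range(count):
        yield fake_address(rng) + "<br>\n"
        sleep(rng.uniform(0, max_delay))  # runs when the next line is requested


def should_tarpit(ip, path, trap_path="/private_email_list.html"):
    """Remember any IP that requests the trap page; report whether it is trapped."""
    if path == trap_path:
        trapped_ips.add(ip)
    return ip in trapped_ips
```

Streaming the generator out as the response body is what keeps a bot waiting; should_tarpit covers the extension of sending a known offender back into the pit from any other page.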
There can be no Pax Americana

thoughts about address harvesting (none / 0) (#78)
by izogi on Sun Dec 21, 2003 at 05:26:47 AM EST

I guess I'm a bit disturbed by the ongoing vigilante war between spammers and anti-spammers, which for better or worse is a natural outcome of people being dissatisfied with the way that authorities are handling spam. I myself hate getting spam, and I find it even more insulting that I'm forced to waste my time, bandwidth and money to receive spam. If there's a way to cut it off at the source that I think seems reasonable, then I'm all for it.

On the other hand, it seems that any measures like this are only going to work for a minority. The successful professional spammers have money and resources, and as soon as it's worth their while they'll circumvent things here, in one way or another.

It's just one hack after another, with each hack further degrading the underlying infrastructure. E.g. the DIV hack that you suggest does away with the content and presentation separation that HTML and XHTML are supposed to be all about, and (again) would likely create more problems for everyone who's not a visual user of the web.

It's not that I think much can be done about it, although it's frustrating that the stupid spam/anti-spam war creates all these extra problems. It's even more of an insult to me that I feel like I should have to throw away more time and resources to continue fighting with them.

I've spent the better part of the last three or four years going to great lengths to avoid having my email address published. It still hasn't worked. The first spam sneaked through a couple of years ago, 18 months ago I could still count the spam my address had received on my fingers, but since then it's just gone up and up.

The irony is that the addresses that I have published on the web aren't the ones getting spammed the most, or at all. Possibly 50% of my spam goes to my raw ISP address that I never actively hand out, but which I collect my mail from. (Other mail is redirected to it.) I presume that this address was discovered using a dictionary attack or permutations appended to the ISP's domain. The rest of my spam goes to the addresses that I use day-to-day and have only really given out to friends.

The problem is that no matter how much you protect your address, it's likely to get out eventually anyway unless you take absolute draconian measures. If you don't give it away accidentally, someone else will.

You can read all of the terms and conditions you like about what companies will and won't do with your personal information. It's extremely difficult to require that every single other person do the same when they're giving out your personal information. Get mad at them if you like, but the damage is already done.

One of my addresses was published in an online README file (without my permission) after I contributed a bug-fix to an open source project. There was no malicious intent, the guy simply didn't realise that I cared about my address getting out.

As well as that, friends and family are always going to submit my address to those web-based greeting card companies, no matter what I tell them. Keeping email addresses private just isn't a concept that a lot of people comprehend, because for most people email addresses simply aren't that important.

Even if there was a new email protocol that was supposed to be spam-resistant, it wouldn't surprise me if advertisers managed to find some way to abuse it and start up the vigilante wars all over again. Spam isn't caused by technical or legal allowances, it's caused by having a feasible market. Once the market's there, business people will pull the strings and arrange for the money to go to the right people until it's possible for them to access that market.

In some strange way, the best way to get rid of spam is to get rid of the people, or otherwise make it so that there's no feasible market.

- izogi

Maybe an inside job? (none / 0) (#82)
by neomonkey on Wed Dec 31, 2003 at 06:02:10 AM EST

The irony is that the addresses that I have published on the web aren't the ones getting spammed the most, or at all. Possibly 50% of my spam goes to my raw ISP address that I never actively hand out, but which I collect my mail from. (Other mail is redirected to it.) I presume that this address was discovered using a dictionary attack or permutations appended to the ISP's domain. The rest of my spam goes to the addresses that I use day-to-day and have only really given out to friends.

I recently got a sub account email, and right away I started receiving spam on it, even without using it.  I'm thinking there's someone at qwest that's selling new email addresses to spammers, nothing else makes sense.

Meanwhile, another address I have had for several months has never received any spam.  It's a third party site based in England.  Meanwhile, my long standing address was spam free for two years, until I made the mistake of responding to an iPod giveaway come on - d'oh!
"Was man God's greatest mistake, or was God man's?" -Nietzsche
[ Parent ]

Graham on secret addresses (none / 0) (#79)
by kmself on Wed Dec 24, 2003 at 02:15:43 AM EST

The problem with this scheme is that it's a variant of the "secret address" concept. And it ultimately doesn't work (and fails badly when it fails).

Graham's comment:

Good: Easy.
Bad: Doesn't work.
Role: Facile recommendation for brief news articles.

Some recommend that you keep your address secret in order to avoid spam. But it's hard to keep your address secret, because other people have to know it to send you email. All it takes is one naive friend to enter your address in a web site to send you an electronic greeting card, and it's all over.

Even if no one discloses your address, spammers can still get it through dictionary attacks. In a dictionary attack, spammers try sending a test mail to millions of possible addresses. Any that don't bounce are probably valid. My mother gets spam as a result of a dictionary attack on AOL, even though she only sends email to a handful of people and never uses the Web.

In addition to "friends" posting your address to "greeting card" sites, or sharing it with other people, there is ye olde virus / MS Outlook harvest: many spamming viruses will retrieve addresses out of address books and either generate spam to them, or (conceivably) report them to a spam harvester.

My recommended solution: SMTP-time mail defenses, including filtering, teergrubing, and deny-at-SMTP-time blocking of suspected spam. There are many working configurations, and the approach fails gracefully (spammers generally don't care, your friends will at least know they didn't get through, and innocent third parties aren't joe-jobbed).

Enough of hiding from the bad guys. My own status? Public address, dialup access, strong filters, semi-automated reporting of spam. If I controlled my SMTP, I'd do what I preach, but my clueless ISP has useless filtering, and I've got what I've got.

Karsten M. Self
SCO -- backgrounder on Caldera/SCO vs IBM
Support the EFF!!
There is no K5 cabal.

Robots.txt (none / 0) (#80)
by hoybofe on Wed Dec 24, 2003 at 04:41:49 PM EST

http://www.webmasterworld.com/robots.txt has a long list of active robots you might want to block. Some of these (and many others) ignore robots.txt, and are forcibly blocked in .htaccess. Also, anyone take a peek at http://www.whitehouse.gov/robots.txt ? Interesting :)
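For reference, this is roughly how a well-behaved crawler reads such a file, sketched with Python's standard urllib.robotparser. The robots.txt content and the BadBot name are invented for the example. Nothing here stops a crawler that never asks, which is why the stubborn ones have to be blocked at the server.

```python
from urllib import robotparser

# Hypothetical robots.txt: one named bot is shut out entirely,
# everyone else is only asked to stay away from a single page.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private_email_list.html
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Ask the parsed rules whether this user agent may fetch this path."""
    return parser.can_fetch(agent, path)
```

A polite crawler calls something like allowed() before every request and skips what it is told to skip; a harvester just doesn't bother.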

Alternate methods (a.k.a blatant gratuitous plug) (none / 0) (#81)
by Mindblock on Thu Dec 25, 2003 at 05:51:43 PM EST

A couple of people have mentioned the fact that some spammers will discard an email address, considering it unreachable, if an email sent to that address bounces.

The email service provided by fastmail.fm can do that, if you get some spam, click 'bounce', and the email is returned to the sender as if it bounced. Only good for when spammers don't use throwaway email addresses tho :/

Secondly, subdomains... I use different subdomains for different sites, so that if I do get spam going to a specific address, I know where it came from (not much I can do with that knowledge though, but it satisfies curiosity), and I can just dump that email address and make up a new one...

fastmail.fm has plenty of other decent features, but I've done enough plugging for now, check it out if you're interested, ignore this post if you're not...

-Common sense isn't all that common-

My alternate method (none / 1) (#83)
by grahamsz on Fri Jan 02, 2004 at 11:34:11 PM EST

Well if they can spider me, then why can't I spider them?

You see i stumbled across Overture.com, did a search for bulk email, and there's a function there where you can see how much an advertiser will pay when you click on a sponsored link.

At the time - the #1 listing for bulk email was about $7 a click (it's under $4 now though).

So I got perl LWP out and wrote a script that performs a search for bulk email, and spiders a couple of pages off the first 20 links. By my reckoning that'd cost the bulk email industry around $100 a run, or since i ran it every 30 minutes from cron, almost $5k a day.

I routed the requests through my ISP's cache cluster so they'd show up from a few different IPs as well. Before I moved out of a cable area i'd probably spidered the best part of a million dollars worth of links.

Naturally this takes money out of spam related companies and puts it into the slightly less despicable overture... but i felt like i was doing my bit.
Sell your digital photos - I've made enough to buy a taco today

Protecting Your Users From Spam Crawlers | 82 comments (57 topical, 25 editorial, 1 hidden)