Kuro5hin.org: technology and culture, from the trenches
create account | help/FAQ | contact | links | search | IRC | site news
[ Everything | Diaries | Technology | Science | Culture | Politics | Media | News | Internet | Op-Ed | Fiction | Meta | MLP ]
We need your support: buy an ad | premium membership

Spam Filtering with gzip

By KWillets in Technology
Sun Jan 26, 2003 at 07:03:35 AM EST
Tags: Technology (all tags)

While many people see gzip as a compression tool, it also makes a credible spam filter. Here's how.

I was reading through a bioinformatics book the other day, and was reminded of a useful shortcut for comparing a text against various corpora. A number of researchers have simply fed DNA sequence data into the popular Ziv-Lempel compression algorithm, to see how much redundancy it contains.

Loosely speaking, the LZ (Zip) and the related gzip compression algorithms look for repeated strings within a text, and replace each repeat with a reference to the first occurrence. The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text.

A related technique allows us to measure how much a given, "test" text has in common with a corpus of possibly similar documents. If we concatenate the corpus and the test text, and gzip them together, the test text will get a better compression ratio if it has more fragments, words, or phrases in common with the corpus, and a worse ratio if it is dissimilar. Since the LZ algorithm scans the entire input for repetitions, it tends to map pieces of the test text to previous occurrences in the corpus, thereby achieving a high "appended compression ratio" if the test text is similar to what it's appended to.

In this case, we wish to compare an incoming email message against two possible corpora: spam and non-spam (ham). If we maintain archives of both, we can compare the appended compression ratios relative to each, to judge how similar a new message is to spam or ham.

As a simple test, I downloaded some sample spam and ham from the Spamassassin archive. I removed headers from the messages (to focus on message text only), and created spam and ham "training sets" 1-2 megabytes in size. I then tested spam and ham messages not in the training sets for for their compressed sizes when appended.

Compression was measured as follows:

$ cat spam.txt new-message-body.txt |gzip - |wc -c
$ cat ham.txt new-message-body.txt |gzip - |wc -c

The file sizes output were compared to the compressed sizes of spam.txt and ham.txt without new-message-body.txt appended, to see how many bytes were consumed by the new-message-body.

The results for "ham" messages were the most dramatic. The average compressed size of a ham message appended to spam was 38% higher than when appended to other ham. For spam messages, the same comparison yielded a compressed size 6% smaller when appended to spam vs. ham, so in both cases, compressing a message with others of its kind yielded a smaller file, on average.

Individual results were also quite clear: while some spam messages compressed slightly better when mixed with ham, ham messages still maintained a margin of 15% or more between the most spamlike ham, and the most hamlike spam. I would put the threshold somewhere around 110%; if a message's size when gzipped with spam is less than 110% of its size when compressed with ham, it's probably spam.

In conclusion, gzip is a fairly blunt instrument for spam detection, but the effectiveness of its relatively blind repetition-finding is worth noting. The current fad among spam filters is word-counting, with various statistical heuristics applied to the results. Algorithms like LZ and gzip go beyond word matching, finding entire phrases and paragraphs of repetition, but do not attempt to measure their statistical significance. More sophisticated approaches, which combine phrase matching with statistical analysis, may be more effective.


Voxel dot net
o Managed Hosting
o VoxCAST Content Delivery
o Raw Infrastructure


Related Links
o bioinforma tics book
o gzip
o Spamassass in archive
o Also by KWillets

Display: Sort:
Spam Filtering with gzip | 115 comments (102 topical, 13 editorial, 1 hidden)
Slashdot (4.00 / 7) (#1)
by TheOnlyCoolTim on Sat Jan 25, 2003 at 10:47:34 PM EST

Doesn't slashdot implement this?

"We are trapped in the belly of this horrible machine, and the machine is bleeding to death."

Haven't heard of it (nt) (none / 0) (#2)
by KWillets on Sat Jan 25, 2003 at 10:52:29 PM EST

[ Parent ]
Wait, that rings a bell (4.00 / 1) (#6)
by KWillets on Sat Jan 25, 2003 at 11:04:47 PM EST

They do some kind of repetition-finding within each comment, I think, to foil the crapflooders. Not sure if they do any sort of inter-message analysis.

[ Parent ]
Yes (5.00 / 4) (#35)
by damiam on Sun Jan 26, 2003 at 09:27:01 AM EST

They have a "postercomment compression filter" which gzips every comment posted. If it compresses too well (or too badly, I'm not sure), you get a lameness error.

[ Parent ]
Is that why I haven't been able to post any K code (none / 0) (#51)
by jjayson on Sun Jan 26, 2003 at 04:32:26 PM EST

Smile =)
* bt krav magas kitten THE FUCK UP
<bt> Eat Kung Jew, bitch.

[ Parent ]
I suspect it would try to unzip that (nt) (5.00 / 2) (#61)
by KWillets on Sun Jan 26, 2003 at 05:51:47 PM EST

[ Parent ]
Look at compressOK in Access.pm (5.00 / 1) (#84)
by Seth Finkelstein on Mon Jan 27, 2003 at 09:58:30 AM EST

Yes - take a look at (2.2.5) the "compressOK" function
It's quite elaborate, e.g

while ($content) {
# Slice off a hunk from the front and check it.
my $content_slice = substr($content, 0, $slice_size);
substr($content, 0, $slice_size) = "";

# too short to bother?
my $length = length($content_slice);
next if $length <10;<br>
# Runs of whitespace get increased in size for purposes of the
# compression check, since they are commonly used in ascii art
# (runs of anything will be compressed more).
$content_slice =~ s{(\s{$spacerun_min,})}
{" " x (length($1) ** $spacerun_exp)}ge;
# Whitespace _tags_ also get converted to runs of spaces.
$content_slice =~ s/<\/?(BR|P|DIV)>/$breaktag_space/gi;
# Whitespace entities get similar special treatment.
$content_slice =~ s/\&(nbsp|#160|#xa0);/$nbsp_space/gi;
# Other entities just get decoded before the compress check.
$content_slice = decode_entities($content_slice);

# The length we compare against for ratios is the length of the
# modified slice of the text.
$length = length($content_slice);
next if $length < 10;

-- Seth Finkelstein
[ Parent ]

yeah, but it's iffy. (none / 0) (#93)
by ethereal on Mon Jan 27, 2003 at 02:09:55 PM EST

Some ASCII art makes it through just fine, and some cut-and-pasted text (for example, legitimately quoting from an article on another site) doesn't make it through.


Stand up for your right to not believe: Americans United for Separation of Church and State
[ Parent ]

Question on performance (3.66 / 3) (#5)
by arvindn on Sat Jan 25, 2003 at 11:03:51 PM EST

I understand that this is a proof-of-concept type writeup, but still: Testing many messages would involve repeatedly compressing the several-megabyte ham and spam archives. Have you considered hacking gzip so that the work done on compressing the archives can be resued for each new message? Do you think this is possible/easy to do?

So you think your vocabulary's good?
According to the man page (5.00 / 1) (#7)
by KWillets on Sat Jan 25, 2003 at 11:14:34 PM EST

You know, I thought there was an option to append to a compressed archive, but the man page says it's better to cat the uncompressed files together and compress the result.  It would make sense to read the compressed file for repeat candidates, but there may be some decompression involved.
There are other data structures which make this more tractable :-).

[ Parent ]
I don't know the compression algorithm, but (3.00 / 2) (#10)
by jacob on Sat Jan 25, 2003 at 11:37:31 PM EST

I suspect you'd get much better performance if rather than running through the whole gzip each time you just found the intermediate representation for your test corpus and saved it, and then you could just use that as your starting point for compressing the concatenated text. In fact, since you know you're only interested in the resultant size, you might be able to get away with not even compressing the text.

Guess I'll have to dig up my old data compression book ...

"it's not rocket science" right right insofar as rocket science is boring


[ Parent ]

gzip append (none / 0) (#105)
by pyro9 on Tue Jan 28, 2003 at 12:04:59 PM EST

Appending to a gzip works because each gzip starts with a header and a reset code. Gzip is coded to be able to skip headers found in the middle of a gz file. The resulting file will be no smaller than the sum of the two seperate gzips. In fact, it is valid to just cat 2 .gz files together.

If you cat the uncompressed files and then compress, you don't take the hit of resetting the dictionary unnecessarily, so the final result may be smaller than the sum of the two seperate gzips.

What is being suggested here is to retain only the final dictionary from compressing the corpus (the actual compressed corpus goes to /dev/null) and then start with that, only compressing the mail to be checked.

The future isn't what it used to be
[ Parent ]
could be done (4.00 / 3) (#18)
by tichy on Sun Jan 26, 2003 at 12:21:43 AM EST

I hacked a LZW compressor once cause I was bored, but IANALZW so to speak. As I understand the algorithm, what you'd need is to both save the compressed ham/spam and the state of the dictionary. The compressor keeps a dictionary that matches strings in the file to strings of bits, which it builds as it compresses.

So, if you merely keep a pre-compressed version of the ham/spam, the compression done on the mails will depend only in the amount of redundancy of the mails themselves, not that of the mails + ham/spam, because gzip will have to build a new dictionary using an already compressed ham/spam and the mails. This will lead to skewed (less compressed) results for the mails.

At least, that's what I think; someone who knows LZW better correct me if I'm wrong.

[ Parent ]

Bayesian classification (4.57 / 7) (#9)
by Cluster on Sat Jan 25, 2003 at 11:35:06 PM EST

Now that Bayesian classification came to life in Mozilla's mail client a few months ago, I don't see any reason to implement such a relatively primitive method.  The Bayesian classification is not patented, since Mozilla is using it, and features tremendous accuracy and "intelligence".

worthwhile for its own sake (5.00 / 2) (#37)
by The Shrubber on Sun Jan 26, 2003 at 11:09:45 AM EST

Maybe not, but doing the research seems to be worthwhile for its own sake.  At the very least it provided the author a chance to be somewhat creative.  Would a Bayesian spam filter and a compression based one give similar results (because of something fundamentally similar in their souls, for example, their corpus-based natures)? If not, then it'd be interesting to compare them or run them side by side, eh?

I think we should credit the guy for giving a few moments of his time to 1) think things through 2) try them out, rather than saying "nah, don't need it"

[ Parent ]

Bayesian versus MDL (5.00 / 1) (#64)
by martingale on Sun Jan 26, 2003 at 07:28:49 PM EST

I expect that both Bayesian and MDL methods would give similar results, but basically incidentally. All statistical methods require two things (I'm simplifying):

1) a model chosen before we look at the data, which takes the form of a probability distribution or two, so that we can calculate the probability of obtaining any dataset.

2) a set of decision rules chosen before we look at the data, which can be applied to the combination of data and probabilities to select the model (if we have several) which best associates with the data.

MDL takes in principle the set of all possibly imaginable models to start with, and compares how well they compress the data. So it implicitly defines a utility function which is proportional to the achieved compression.

Bayesian methods do not specify the utility (it's up to you and you can take the MDL one if you like), but also needs a set of prior probabilities (ie if you haven't seen the data yet, what do you think is the best model going to be?)

None of those methods tell you what the exact form of the models you should use will take (in principle, MDL does but this is impossible to use in practice, so you just choose a model family ad-hoc), and it so happens that the form of the models is typically very strongly going to influence your results.

So, to make a short answer: people tend to pick the same models regardless of whether they use classical, Bayesian or MDL statistics to choose among them, which is why the results are often similar. Models are often picked for expediency/computability/tractability, which strictly speaking is ad-hoc.

[ Parent ]

But what does that *mean*? (5.00 / 2) (#45)
by pla on Sun Jan 26, 2003 at 02:43:14 PM EST

Now that Bayesian classification came to life in Mozilla's mail client a few months ago, I don't see any reason to implement such a relatively primitive method.

I've seen a lot of people going on about how wonderful this "new" idea of Bayesian spam identification seems.

I also noticed that it took *two weeks*, after the first so-called "Bayesian" spam filter appeared on Slashdot, for anyone to notice that it didn't really use any Bayesian techniques.

Although indeed a boon for certain reasons, I don't think most people have any idea what it means to say "my mail program uses a Bayesian spam filter". It refers to nothing more impressive than using posterior probibilities, rather than classic statistics (by which I mean any of the tried-and-true maximum entropy methods), to arrive at a similar result (classical and Bayesian techniques arrive at VERY similar answers in almost all cases, just by a different path).

And, the one situation where Bayesian analysis *does* give significantly better results (basically, where you already know the answer), doesn't really apply to spam for more than a week at a time (in which case, we may as well use a simple checksum of the message, discarding all obvious randomizing strings).

So, while I do like the word (Bayesian, Bayesian, Bayesian... Bay...Bay...Bayyyyyysian, fun to say) as much as anyone else, its use does *NOT* mean we should stop looking for better ways to get rid of spam.

Of course, when I read the title of this FP, I thought it would have a completely different idea - Basically, something along the lines of public-key crypto, except we don't actually need to bother with the "crypto" part of that. Each person sending us mail would query our account for a key, which it uses to modify the message in some way (a simple XOR, perhaps). It would then GZIP the message and send it. For normal email senders, this would not seem like a problem and would result in at worst a second or two delay in sending large attachments. For spammers, it would mean they'd need an enormous CPU farm just to handle the compression of each outgoing message (as well as at least a temporarily valid return path by which to request our "key".)

[ Parent ]
a few corrections (5.00 / 3) (#63)
by martingale on Sun Jan 26, 2003 at 07:14:39 PM EST

Sorry, I have to correct you on a couple of points ;-)

Maximum entropy is a quintessential Bayesian idea, going back to ET Jaynes, and whose roots can be traced back to Laplace in the 17th century even.

Here's a (very) quick summary of classical versus Bayesian.

In classical statistics, people come up with a model "explaining" a dataset, much like physicists look for models explaining phenomena. In classical statistics, the observed data is used to compare plausible generative models, by comparing the likelihood of the dataset under each model and selecting the model with the maximum likelihood.

In Bayesian statistics, people also come up with models somehow, but they need not "explain" the dataset, only be compatible with it (it's a philosophical, rather than practical distinction). Then, Bayes' theorem is used to convert a set of prior beliefs (written as probabilities) via the observed data into another set of posterior probabilities for each of the compared models. Unlike classical statistics, there is no "correct" model (like there might be "correct" laws of physics), so a choice is usually made to take the model whose probability is greatest. This is called the Maximum A Posteriori estimate, and happens to often be identical with the maximum likelihood estimate, unless say the problem is really tricky and the max likelihood est does not exist at all.

Maximum Entropy is an earlier step in the Bayesian framework, which is used to construct the prior beliefs, by choosing them in such a way as to maximize the amount of uncertainty inherent in those beliefs, consistent with any hard constraints imposed at the outset. It has no classical equivalent.

Oh yeah, let's get back on topic ;-) Most of the "Bayesian" spam filters you can download on sourceforge aren't at all, as you pointed out. Moreover, most of those filters aren't even statistical... People think that Bayesian means writing down a probabilistic model - not at all, Bayesian means using an existing model to calculate a decision by combining the data with prior belief probabilities, and then optionally performing a utility maximization. The latter is called decision theory, and goes back to Wald in the fifties, who was a classical statistician but got the shock of his life when he proved that the optimal decision must have the Bayesian mathematical form...

[ Parent ]

Many thanks... (none / 0) (#69)
by pla on Sun Jan 26, 2003 at 09:16:47 PM EST

Sorry, I have to correct you on a couple of points ;-)

Good... Thank you. I wish more people would correct me when I err. You have indeed clarified a few points for me.

Incidentally, as many errors as I did make (which of course I only notice reading back over what I wrote, sigh), I really did mean "likelihood" rather than "entropy". Dumb dumb dumb mistake.

You clearly have a better grasp of this subject than I do, however. ;-)

[ Parent ]
intelligence (none / 0) (#94)
by drivers on Mon Jan 27, 2003 at 02:45:12 PM EST

I wish more people would correct me when I err.

I found this statement unintentionally humorous. I read your post and you went right over my head, and I consider myself reasonably intelligent. My reaction was, "ok wise guy, if you're so smart where's YOUR spam filter?" When you say "I wish more people would correct me" most likely there are very few people who could even understand what you are talking about in the first place, let alone correct you. :-)

[ Parent ]

Would it work better with headers included? (4.75 / 4) (#12)
by GGardner on Sat Jan 25, 2003 at 11:59:06 PM EST

When Paul Graham compared his Bayesian spam filter and its great results to previous, not-so-good results, he found several differences between his work and the older work. He felt the most important was that he considered headers when trying to classify spam, and that helped tremendously.

Humans can classify messages into spam/not-spam just by looking at the Subject header. How would this algorithm work if you just used the Subject header? If nothing else, I bet it would run much faster. The From header, though, probably negatively correlates.

subject insufficient (4.00 / 1) (#14)
by martingale on Sun Jan 26, 2003 at 12:09:19 AM EST

Humans can classify messages into spam/not-spam just by looking at the Subject header.
This is, strictly, incorrect. For example, the I love you, I'm writing to get your advice, etc. subject lines aren't recognized as spam or viruses by many humans the first few times. So even humans need to look at the body of the message.

However, you're right of course that humans are a hell of a lot better than machines at filtering (unless they're tired from looking at a few hundred messages already). What that shows is that a well trained machine ought be be able to distinguish spam from as few clues as a well trained human, in principle.

[ Parent ]

You are correct (3.00 / 2) (#17)
by Cluster on Sun Jan 26, 2003 at 12:18:09 AM EST

This reasoning is exactly how Bayesian filters operate: initially the ruleset is "dumb" and has to be trained by the operator.  With each e-mail one receives, one marks it valid or spam, and the ruleset begins slowly growing.  Given sufficient clues, the mechanism begins making its own semi-intelligent decisions.   Intelligence increases with more and more "shaping" by the human.

The reason the Bayesian system is so interesting is that no spammer can successfully bypass it simply because every person's ruleset is different: valid e-mails of a computer programmer may be very different from valid e-mails of a receptionist.

Try Mozilla Mail if you want to try the filters for yourself.

[ Parent ]

Not just Mozilla Mail... (none / 0) (#47)
by gidds on Sun Jan 26, 2003 at 04:08:43 PM EST

Mac OS X's inbuilt Mail app also uses Bayesian filters.  IME they're very effective -- it's worth taking your time to train them, but once trained, they can be uncannily accurate.

[ Parent ]
Just looking at the Subject (5.00 / 3) (#19)
by GGardner on Sun Jan 26, 2003 at 12:37:45 AM EST

I think that I can detect substantially all my spam just by looking at the Subject field. And maybe the From: field too. Occasionally, I get messages with empty subject lines, so I guess there's a problem with the degenerate case. Obviously, no filter, human or machine-written is 100% correct

If you could effectively filter just based on the Subject line (or other headers), there are a couple of nice benefits. Subjects aren't HTML or otherwise encoded. (At least I think so -- is there any MUA so damaged as to display a Subject field that's got HTML in it?) This gets around the problem of trying to filter spam that contains one big embedded image with all the text in it. If Bayesian filters take off, I fear the spammers will switch to this.

Sadly, I get lots of legitimate e-mail that is HTML encoded (thanks, Mom!), so I can't just dump all HTML mail (though I'd like to).

[ Parent ]

images are a real complication (4.00 / 2) (#21)
by martingale on Sun Jan 26, 2003 at 12:59:52 AM EST

It's unclear I think what will happen in the long run with sending images. If they're simply referenced in the email body, then there's a weakness in that the SRC field points to the spammer's machine serving the image, so they're effectively tagged with some text which gives the filter a clue about the contents of the image, even if it's only the server address. This could get ugly, but if the email program can display the image, then a filter can find the location and cross reference with previous images to decide if it's spam.

The trickier problem occurs if the image is sent as an attachment. The practical answer to that is simply to refuse to download messages which are bigger than a size limit of a few k, say. However, if the spammers can craft a message that gets through nonetheless, the next step is to build a probabilistic model of the image contents, just as the Bayesian filters already do with text. Statistics doesn't care if it deals with words or pixels (but images have a different geometry from text, which must be taken into account).

[ Parent ]

Whitelists as a solution to the image problem (4.00 / 1) (#23)
by GGardner on Sun Jan 26, 2003 at 01:35:05 AM EST

I see more and more spam sent as images-only, or images-mostly, so I think this is going to be a bigger and bigger problem. I'm thinking about dumping any e-mail that contains inline images, unless the sender is on my whitelist. The only good reason that I can think of for me to get an unexpected inline image is the stupid stationery that some MUAs generate. Other than that, it seems like common courtesy to have a textual conversation before sending a big image -- e.g. I maintain a couple of open source utilities, but don't just send me a screen shot of how they are broken, ask me first if that's ok.

[ Parent ]
I take a hard line here. (none / 0) (#75)
by abulafia on Mon Jan 27, 2003 at 01:34:02 AM EST

My mail client is mutt. It shows me text. If there isn't text, I don't see it. If there's a reason to look at an image, I can easily (v)iew it, but loading inline images doesn't happen. Email is a text based medium. If you don't send me text, odds are it won't get read. My clients understand the concept, and so I figure if people who want to pay me can get it, anyone else can, too. Not really all that relevent, but Outlook and Eudora are not the only options.

[ Parent ]
Image tags (none / 0) (#24)
by KWillets on Sun Jan 26, 2003 at 01:36:53 AM EST

I've found a few image servers amongst the common substrings in spams that I've analyzed. One of the advantages of doing common substrings instead of words is that the common parts of the url will show up, even if the query or file portion contains different parameters. In fact the entire preceding anchor and image tags will appear along with the url fragment.

[ Parent ]
Here's an example (none / 0) (#26)
by KWillets on Sun Jan 26, 2003 at 02:06:28 AM EST

I was thinking of another article on "common string fragments in spam", but I have some debugging to do before that one comes along. Here's a good "smoking gun" substring (minus the leading left angle bracket):

img src="http://admanmail.com/server/t.asp?ad_key=

This appears seven times out of about 125 messages in my spam corpus. I've noticed that these types of urls often have no additional text, just the image, so it's important to recognize these. Here are a few more:


Here's another fairly indicative string :-):

We don't want anyone to receive our mailings who does not wish to

[ Parent ]
Common strings (none / 0) (#39)
by GGardner on Sun Jan 26, 2003 at 01:06:28 PM EST

I bet these common strings are only common to a particular spammer, who probably re-uses a template for several different spams. After a while, the template will change, and these strings will be less valuable. For example, last year, many spams had some text about how they complied with some non-existant senate bill, and thus they weren't really spam. Nowadays, I don't see this text at all.

[ Parent ]
Statistical reality (none / 0) (#49)
by KWillets on Sun Jan 26, 2003 at 04:21:49 PM EST

Just about every statistical technique relies on new occurrences being similar to past occurrences.  I'm looking for ways to find the most indicative pieces and exploit them quickly.  

For instance, only one or two instances of the above image tags, which have essentially zero chance of being non-spam, should be enough to flag any new message with the same string as spam.  If the message body consists only of such a url, the task might actually be easier, since there's nothing to dilute the near-certainty that the string is spam.  

[ Parent ]

character sets (none / 0) (#62)
by scruffyMark on Sun Jan 26, 2003 at 06:30:36 PM EST

One thing you can do is specify the character set of a subject line (unicode encoding, I guess?)

I sure wish there were an easy way of telling my mail client that any message that's in Chinese is spam. I have no idea why, but I get a tremendous amount of spam in Chinese - I don't speak a word of it, much less read any... Filtering on character set, even of the subject line, would probably hit 80% or better.

[ Parent ]

Spamassassin will do this for you. (none / 0) (#74)
by abulafia on Mon Jan 27, 2003 at 01:27:29 AM EST

Don't know what you use, but if you can use spamassassin, you're set.

From the man page:

ok_languages xx [ yy zz ... ] (default: all) Which languages are considered OK to receive mail from. Mail using character sets used by these languages will not be marked as possibly being spam in an undesired language. [...]

Spamassassin is part of my war on spam.

It is more important that people do a lot of different things - which is why I think it is a Good Thing(tm) that I don't disclose the measures I take. More and more varying approaches means it is harder to optimize against any one method.

Let a thousand flowers bloom.

[ Parent ]

I don't have good headers (none / 0) (#16)
by KWillets on Sun Jan 26, 2003 at 12:11:17 AM EST

I think the Spamassassin archive has some artifacts in the headers that make it less scientific to use them, like getting spam under a different address from the non-spam.  I may be wrong, but that was my impression when reading them.

A broader test might be to try both, or headers-only.  It was at least more challenging to filter bodies-only.

[ Parent ]

Open relays? (none / 0) (#77)
by srichman on Mon Jan 27, 2003 at 06:58:29 AM EST

Humans can classify messages into spam/not-spam just by looking at the Subject header.
And the other headers might be helpful to computers. For instance, spam often passes through open relays, so you'll find patterns in the IP addresses of the SMTP hops in your spam corpus.

[ Parent ]
Variants on Bayes (none / 0) (#106)
by PigleT on Tue Jan 28, 2003 at 12:46:15 PM EST

I wrote a simple statistical analysis script. Starting from the principle "we want to compare tokens in a mail against a category to find the best matching category", my routine is to run a set of functions across the mail isolating classes of token - such as body-words, subject, sender (a combination of From and Sender and anything else likely), and so on. I can also weight-up the amount any one class of tokens contributes to the overall score against a category.

Taking 2000 mails at random from the sample used to educate the system, I found scaling up the Subject 5x and the Sender 10x made an increase of about 1% in effectiveness, maybe 98-99% or so, compared to viewing everything equally.

You could say that this is effectively implementing a sort of a grey-list - if someone sends me 5 spams and 95 hams, that exact ratio will be stored in the sender field across categories as you'd expect.

So define your functions-of-a-mail, scale them up or down in importance (for example, if X-RBL-Warning is 60% indicative of spam in your universe, that will be determined by the learning phase, and can be scored down slightly for later use), and off you go.

ObPlug, if it helps anyone: it lives at http://spodzone.org.uk/pigmail/ .

Now getting back on topic, a bit: is running gzip or even bzip2 once per incoming mail likely to be an efficient flock()-friendly way to check for spam? Especially given that the existing databases must be re-compressed each time around, that effectively restricts the maximum size of corpus you can use - in addition to any buffer-size constraints as well.
~Tim -- We stood in the moonlight and the river flowed
[ Parent ]

it's called MDL modelling (4.78 / 14) (#13)
by martingale on Sun Jan 26, 2003 at 12:03:44 AM EST

Minimum Description Length modelling is a relatively recent statistical idea (Rissanen) based on the assumption that if you understand a dataset really well, you'll be able to devise a really good shorthand language (compression method) to represent it.

There's a slight (ok, major) problem with this technique as follows: achieving the best compression is a function of both the compressibility of the dataset, and the performance of the compression algorithm on datasets of that type. In other words, you might get conflicting predictions if you try the method with gzip, and then try the same method with, I don't know, arj.

Theoretically, the best possible compression exists and has a name: Kolmogorov complexity. Unfortunately, this is not theoretically computable, so we'll always have the problem outlined above, that any results you come up with may be contradicted by another compression algorithm someone else tries.

For example, I bet your results depend strongly on wether you append or prepend the sample message to the corpus (assuming a large corpus of a few megabytes).

BTW, instead of using gzip, write a small C program which calls the gzip library API, then you'll be able to save the compression dictionary built from the reference corpus. That way, you only need to load the dictionary and compress the sample message. (Note that for gzip, this is equivalent to appending the sample to the corpus).

Methods (none / 0) (#103)
by pyro9 on Tue Jan 28, 2003 at 11:51:05 AM EST

I would guess that the methods specialized for compressing executables would tend to do worse. However, what this method is really dependant on is that regardless of the compresser, the same compresser will do a better job on more compressable data where compressable is reletive to the method (text is not very compressable for JPEG for example).

That compressability will be a function of the repetition in the case of LZW like methods. Spam tends to be highly repeditive. To increase accuracy, it would be a good idea to strip out quotes in replys. To keep spammers from exploiting that, immediatly trash messages containing only quoted lines.

Other useful pre-processing would include simple rules like auto allowing any mail that is a reply to sent mail and authorized mailing lists. Anything in the addressbook is probably good mail. The basic observation that compression and pattern recognition are closely tied is both insightful and useful.

Saving only the dictionary is a nice improvement.

The future isn't what it used to be
[ Parent ]
OT: You have the Gusfield book? (3.66 / 3) (#22)
by jjayson on Sun Jan 26, 2003 at 01:30:47 AM EST

I have a couple books on pattern searching and this is by far my favorite (there is a Shasha, et al. that I also like). I think some of the chapters were not written by Gusfield. I remember seeing a chapter online from the book by somebody else (also by Dennis Shasha, I think). What else have you read that is any good in the field?
Smile =)
* bt krav magas kitten THE FUCK UP
<bt> Eat Kung Jew, bitch.

Yes I do (5.00 / 1) (#29)
by KWillets on Sun Jan 26, 2003 at 02:53:16 AM EST

I'm not sure who wrote it, but it looks like Gusfield to me :-).  I could ask him for you.

It's a good book, although the explanations get a bit strung out sometimes.  Short, concise descriptions are easier for me to digest.  

I haven't bought any other books on the topic, but I've been supplementing it with web searches and researchindex.

[ Parent ]

Stats table (4.80 / 5) (#28)
by KWillets on Sun Jan 26, 2003 at 02:42:28 AM EST

Here's an addendum that I should have put in the original article: raw data. I added a few data points to the ham message section, so the average goes down slightly to 36% better compression for ham/ham vs. spam/ham.

Again, this isn't very scientific, but possibly interesting for someone who wants to do more thorough tests.

Antoher Way To Stop Spam (4.00 / 7) (#32)
by jopf on Sun Jan 26, 2003 at 08:19:57 AM EST

Compute some mathematical function of both the mail contents and the recipient's email address, the output of this function would be similar to a digital signature. Each email and it's associated recipient will then have a unique signature. An email client would then only accept emails that are signed using this technique. The mathematical function should be designed such that it takes a non trivial amount of computation to compute. The reason this will work is because it will make spam uneconomical as computation to compute the signature function will become prohibitively expensive when sending many messages. A half a second computation may not be a problem when sending a single mail, but sending a million messages will quickly become a problem. The signature function could obviously be scaled as computing power increases.

Problem (4.66 / 6) (#33)
by greenrd on Sun Jan 26, 2003 at 08:33:55 AM EST

The problem with this is that it makes very large legitimate mailing lists, such as the ones run by MichaelMoore.com or Zmag.org, prohibitively expensive and/or slow to run.

"Capitalism is the absurd belief that the worst of men, for the worst of reasons, will somehow work for the benefit of us all." -- John Maynard Keynes
[ Parent ]

True, But... (4.00 / 3) (#34)
by jopf on Sun Jan 26, 2003 at 08:41:15 AM EST

...you could just set up a filter to allow these sort of mails through unsigned.

[ Parent ]
Right (2.50 / 2) (#67)
by greenrd on Sun Jan 26, 2003 at 08:29:28 PM EST

Instant loophole for spammers.

"Capitalism is the absurd belief that the worst of men, for the worst of reasons, will somehow work for the benefit of us all." -- John Maynard Keynes
[ Parent ]

not really (3.00 / 3) (#68)
by twistedfirestarter on Sun Jan 26, 2003 at 09:12:44 PM EST

if you bother to sign up to a mailing list you can bother to remove it from your filter

[ Parent ]
I thought this for a second, too. (none / 0) (#100)
by br14n on Tue Jan 28, 2003 at 01:17:01 AM EST

But the spammers have to guess which mailing lists I'm on or barrage me with the same spam using the thousand most common list From:s or something.

[ Parent ]
Yeah, and what about the mailing lists run by (1.66 / 3) (#41)
by untrusteduser on Sun Jan 26, 2003 at 01:40:13 PM EST

Jerry Fawell, Pat Buchanan, and the Ku Klux Klan?

[ Parent ]
Legitimate? Michael Moore? (1.00 / 1) (#97)
by Purple Walrus on Mon Jan 27, 2003 at 08:18:26 PM EST

You wish, sonny;)

[ Parent ]
This technique has a name: hashcash. (4.50 / 2) (#50)
by br14n on Sun Jan 26, 2003 at 04:23:14 PM EST

Unless I've mistaken your thing for a similar technique.

[ Parent ]
Spam Wars (4.00 / 1) (#56)
by jefu on Sun Jan 26, 2003 at 05:12:55 PM EST

A couple of times recently I've heard proposals about using PK encryption technology with some kind of certificate authority to verify the identity of a host over the network. These proposals were not just about spam, but also about network security in general and even network attacks as a part of information warfare.

If this were added to SMTP (and other mail protocols), a host could refuse mail from another host that was improperly identified (or not identified). If the PK encryption is used to verify host identity, but information is transferred using a nonce key and a faster symmetric encryption algorithm, the nonce key could be cached for some time to alleviate the major part of the computational cost. I'd have to go find the description of the proposed protocol, but it came from someone who knows something about crypto and sounded reasonable at first listening.

Since improperly identified hosts would be screened, and since hosts could be blacklisted for hosting spammers, this would cut down on open relay hosts and hosts owned by spammers. Probably cut spam by 99 percent.

If this capability were added to network algorithms and protocols more uniformly, it could also cut down on fun things like the Microsoft SQL Server DDOS we just observed.

Of course, if every user had a suitable public key and signed all outgoing messagbes and insisted on incoming messages being signed as well, lots of stuff would end up being put in the "probably spam" bucket just for not being signed.

[ Parent ]

Tiny dataset sizes (3.00 / 2) (#36)
by asjo on Sun Jan 26, 2003 at 10:28:52 AM EST

> created spam and ham "training sets" 1-2 megabytes
> in size.

Those sizes are way to small to conclude anything, aren't they?

  Best regards,


Not too small. (4.00 / 1) (#38)
by mcgroarty on Sun Jan 26, 2003 at 01:00:14 PM EST

Those sizes are way to small to conclude anything, aren't they?

gzip relies on looking backward in a stream to find similar strings to copy forward. The distance it will look backward (the size of the "lookback buffer") is small enough that more data wouldn't be "seen" anyway.

[ Parent ]

Taking advantage (none / 0) (#101)
by asjo on Tue Jan 28, 2003 at 01:28:40 AM EST

> The distance it will look backward (the size of
> the "lookback buffer") is small enough that more
> data wouldn't be "seen" anyway.

So basically the filter won't take advantage of the possibility to use knowledge of the entire corpus of spam, but rather it'd just move a smallish window over it.

Sounds like a good idea.

[ Parent ]

Damn intereresting (3.83 / 6) (#42)
by askey on Sun Jan 26, 2003 at 01:44:22 PM EST

I must say, this is the most interesting article that I have read on here in ages. :) Novel idea, don't remember seeing anything like this before.

Language phylogeny (5.00 / 2) (#70)
by Repton on Sun Jan 26, 2003 at 10:21:37 PM EST

This sort of thing has been done before. About a year ago, I guess, was when I saw it, the the area of language trees [google cache of PDF document]. Here's the Ars story where I first saw it.

They say that only an experienced wizard can do the tengu shuffle..
[ Parent ]

even bayesian filtering is insufficient (4.25 / 4) (#43)
by beanhead on Sun Jan 26, 2003 at 02:02:13 PM EST

Over at Citeseer there's an interesting paper on bayesian filtering, tested on a sizeable corpus (2412 messages taken from a moderated mailing list, and 481 spam messages received by the author) which argues that further safety nets beyond the naive bayesian approach are needed for an effective filter.
This alone should be sufficient justification for looking into different methods.

There are a number of others (5.00 / 1) (#54)
by KWillets on Sun Jan 26, 2003 at 04:56:32 PM EST

Bayesian reasoning is hardly new, but there have been a few papers on applications to spam filtering:

Bayesian Spam Filtering. Also follow links to related papers.

Gary Robinson's notes on statistical filtering.

Most people just seem to be scratching the surface. For instance, spam falls into different categories, so a message that mentions both "teen cheerleaders" and "Laurent Kabila" is very likely not spam, unless it's a hybrid nigerian-porn spam.

Likewise, term-based approaches fail to find highly spammish phrases like "in the subject line" or "white farmers" (from Nigerian spam), which consist of fairly innocuous words.

[ Parent ]

Why the Bayesian algo in this paper didn't work (5.00 / 7) (#55)
by GGardner on Sun Jan 26, 2003 at 05:01:54 PM EST

Although the cited paper, which was published two years ago, has lots of nicely typeset mathematical formulae, it still misses the target. This paper is not cited by Paul Graham in his latest "Plan for Spam", but these authors make the same mistakes that Paul criticizes similar papers for.

For instance, they remove all HTML from the corpus. Why? Well, they just assumed that all HTML would be the useless keywords, and not add any distinguishing features to the noise. But they didn't bother to test this theory. In practice, they are completely wrong -- certain HTML codes are very good predictors of spam. For example, the single most spam-specific token for many spam corpi is the HTML code for red, #FF0000. In a similar vein, they tossed out common words, a priori, and stemmed all tokens. Did they measure the impact of these decisions? Nope. The paper doesn't mention if they used headers in their analysis, but it sounds like the didn't. Again, in the Graham paper, he claims that including headers is the biggest positive differentiator of his work.

Of course, the proof is in the pudding. And the Paul Graham style Bayesian filters seem to be getting very good results for a large number of people, across several different implementations.

[ Parent ]

misconception about bayesian (none / 0) (#65)
by martingale on Sun Jan 26, 2003 at 07:46:33 PM EST

Bayesian refers to a dataprocessing theory which takes probabilities as input, together with evidence in the form of raw data, and produces probabilities as output. It can be shown mathematically that the Bayesian method is the *only* logically consistent way of converting probabilities.

Where there is room for improvement is in the nature of the probabilities injected into the Bayesian theory only. But this is the crux of it. The naive Bayesian models are very simple unrealistic models chosen because you can actually solve the equations analytically. It's kind of like the physicists who solve quantum field equations for the zero particle case...

There is no doubt that if progress is made towards the construction of more sophisticated models (things like full sentence models etc.), then the results will improve further. What Graham is suggesting (and he's not alone) is that the better models must take into account more features of the emails. The trick nobody can solve well is to build tractable models based on complex feature sets.

But just to recap on a common misconception, it's been known mathematically since 1945 (Cox) that Bayesian methods are the only logically consistent way of reasoning with probabilities. Every other method is fallacious.

[ Parent ]

If you want a good spam filter... (1.60 / 10) (#46)
by MessiahWWKD on Sun Jan 26, 2003 at 03:09:53 PM EST

Then you should use MailWasher instead. I'm sorry to say this, but GZIP just doesn't compare.
Sent from my iPad
No, bad program (4.00 / 1) (#87)
by arafel on Mon Jan 27, 2003 at 11:32:00 AM EST

Regardless of the merits of the rest of it, the fact that it will "analyse the headers and then bounce the mail back" is enough to make me go argh.

Speaking as someone who's had his domain forged by a spammer before now - must have pissed him off - all that bouncing an email accomplishes is irritating the poor schmuck who's had the misfortune to have his email address/domain faked.

Spammers do not use addresses you can use for that purpose, and header analysis isn't reliable enough for automated bouncing. It's a horrendous idea, and the author should take it out.

[ Parent ]
Mail Washer (4.00 / 1) (#89)
by VampireD on Mon Jan 27, 2003 at 01:37:20 PM EST

This software is useless, absolutely useless. I have an account that is very useful for testing effectiveness of spam programs. I have a client who has a website that NEVER checks their mail, so currently they have 700 mail in their account, which is completely spam. Using mailwasher, it found about 40 or so as "Probably spam" and another 100-150 as Possible Spam. The rest (600ish) came up as Normal. Very poor effectiveness I must say, I am going to play around maybe seeing how bzip2 and gzip do against such a large list of junk.

[ Parent ]
gzip history buffer is too small for this (5.00 / 17) (#52)
by jloup on Sun Jan 26, 2003 at 04:55:04 PM EST

As the author of gzip, I'm glad to see it used in unexpected applications!

Unfortunately the gzip history buffer is only 32 KB. So you effectively match only against the last 32K of your spam archive. However detecting redundant strings within the new message itself, without any history, is still a useful spamness indicator. Don't forget to do any necessary base64 decoding first.

Spam has become much harder to recognize automatically in the last few months. Many spammers give fewer clues and make their messages look more ordinary. My own procmail filter, which used to catch 99.5% of the spam I receive (several hundred messages per day), is now only about 98% effective. I'm afraid we will need something much more sophisticated than a simple compression ratio to effectively detect spam.

Oops (3.50 / 4) (#59)
by KWillets on Sun Jan 26, 2003 at 05:27:33 PM EST

Yes, it's more of a blunt instrument than I thought.  Thanks for the update.

I ran a quick check to see if the window size was skewing my results, since I picked spam and ham differently (ham test messages were at the end of the ham training set, while spams were interleaved near the beginning of the spam training set).  So far it looks about the same if I test spam just past the end of the spam set.  

Spams (and hams) seem to run in groups, so I thought the window might favor messages taken in time order, but it looks like it's not much of a problem.

It's not a bug; it's a "feature" :-).

[ Parent ]

dictionary dumping in API [o/t] (5.00 / 2) (#66)
by martingale on Sun Jan 26, 2003 at 08:15:10 PM EST

About six months ago, I was looking at using libz for a project which required compressing a set of small (20 chars on average) strings. The strings had a lot of common prefixes, and I wanted to compress them before storing them. I also wanted to test if a new string had already been seen by compressing it, and comparing the representation with already compressed and stored strings. There was only one catch: I had about 600 MB worth of these strings.

I ended up reimplementing a basic arithmetic coder based on a single global static dictionary of a few megs, constructed from statistics gleaned via a previous pass through the full dataset.

What made libz unsuitable (this is no criticism, just though you might find it interesting) was precisely that the dictionary gets reset periodically. It would have been nice to have, in the API, a way of saving the intermediate dictionaries so as to have access to all the dictionaries used in the past. It was bizarre to realize that here I had an application where the common wisdom of "resetting the dictionary to improve compression performance" was working against me, because what I needed was a single dictionary for six hundred megs of data.

[ Parent ]

we're not worthy, we're not worthy \nt (4.00 / 3) (#79)
by noogie on Mon Jan 27, 2003 at 08:09:54 AM EST

[ Parent ]
compression window size (4.50 / 4) (#53)
by count fosco on Sun Jan 26, 2003 at 04:55:43 PM EST

As I understand gzip, it keeps a sliding window of the last 32k (or was that 64k) which it uses to compress the piece of text that its currently reading. The implication being that most of your 1-2 meg training sets are ignored.

Have you tried bzip2 which uses 900k windows ?(though they don't slide in bzip2, so its probably safer to prepend new-message-body.txt rather than append.)

I was looking at bzip2 (none / 0) (#57)
by KWillets on Sun Jan 26, 2003 at 05:18:16 PM EST

But unfortunately I picked up on its 900kb window without noticing gzip's 32k one.

[ Parent ]
SA corpus should be fine BTW (4.00 / 1) (#58)
by jmason on Sun Jan 26, 2003 at 05:19:14 PM EST


interesting technique! I'm amazed the gzip 32k window lets this work. It'd be interesting to see if it can be worked on so that it keeps up with an ongoing feed of mail -- since I can imagine the validity of the "training corpus" decays over time as the spam techniques change; or if the "partially-trained" dict can be saved in an intermediate stage (ie. after the training corpus has been scanned) to speed it up.

BTW the SA publiccorpus archive is pretty representative of my spam/ham feed, so don't worry about the headers being unscientific in some way. I was careful to avoid any of the munging that made other archives useless for many systems. (Of course, it's not typical *for some users*, but it's typical of *my mail* ;)

Going to try with headers (none / 0) (#92)
by KWillets on Mon Jan 27, 2003 at 02:08:47 PM EST

Thanks for the comments; I took a look at your archive description and noted the point about the headers being valid after all.  

I'm trying to write some code to use zlib or something similar, to automate the testing and try a few variations. I'll see if I can do the headers too.

[ Parent ]

Drug Store Truck Drivin' Man (1.14 / 41) (#60)
by medham on Sun Jan 26, 2003 at 05:30:56 PM EST

       A                        E
He's a drug store truck drivin' man

E                            A
He's the head of the Ku Klux Klan

A                  D
When summer rolls around

D        E                    A
He'll be lucky if he's not in town

           A                      E
Well, he's got him a house on the hill

E                             D               A
He plays country records till you've had your fill

A                               D
He's a fireman's friend he's an all night DJ

D      E                            D          A
But he sure don't think much of the records he plays


Well, he don't like the young folks I know
He told me one night on his radio show
He's got him a medal he won in the War
It weighs five-hundred pounds and it sleeps on his floor


He's been like a father to me
He's the only DJ you can hear after three
I'm an all night musician in a rock and roll band
And why he don't like me I can't understand


He'll be lucky if he's not in town

The real 'medham' has userid 6831.

compression windows, (semi-OT) (3.66 / 3) (#71)
by BCoates on Sun Jan 26, 2003 at 11:08:38 PM EST

Since the issue of compression window sizes came up, is there a compression algorithm designed to find and compress long duplicate sequences far apart, as opposed to short duplicate sequences nearby?  

This would actually require a new algorithm, since cranking up the window size for gzip or bzip2 significantly beyond their current defaults would perform horribly.

I've run into a few situations where this would be a handy tool (mostly thinking of compressing p2p traffic, which tends to be redundant in that manner)

Benjamin Coates

That may be another article (3.50 / 2) (#73)
by KWillets on Mon Jan 27, 2003 at 12:37:16 AM EST

That book by Gusfield has the basics on how to index and find common substrings in linear time, although linear in the sense of O(100n) rather O(n), at least memorywise.  

In fact he sketches out a version of LZ that finds common substrings over an arbitrarily long string; I made the mistake of thinking that was what LZ and gzip do normally.

[ Parent ]

I'll explain tomorrow. (none / 0) (#99)
by it certainly is on Mon Jan 27, 2003 at 11:13:24 PM EST

I'm a compression geek. I've got a half finished intro to compression.

However, I'll answer your question now. Dictionary compression (LZ) algorithms ALWAYS look for the longest match. Increasing the search window simply improves your chances of finding a longer match.

PS: If you want better compression, don't use dictionary compressors at all, use a PPM compressor, they consistently beat all other lossless compressors

kur0shin.org -- it certainly is

Godwin's law [...] is impossible to violate except with an infinitely long thread that doesn't mention nazis.
[ Parent ]

Expanding window size with bzip2 (5.00 / 7) (#72)
by KWillets on Mon Jan 27, 2003 at 12:05:24 AM EST

Since a number of people graciously pointed out that my tests were limited by gzip's 32k window size, I ran a few tests with bzip2, which handles data in blocks of up to 900k.

I took the last 500000 bytes from each corpus, and compressed them with spam and ham messages as I did with gzip.

The raw data is here, with ten messages of each kind tested (not a huge sample, but illustrative).

With bzip2, spam seemed to respond to its own kind better than it did with gzip. spam/spam compressed to 67% of its size with ham/spam, while spam/ham compressed at 107% of ham/ham size. That's still a pretty good spread, although individual results seemed a bit less consistent.

Repeating the text? (4.00 / 1) (#88)
by sniels on Mon Jan 27, 2003 at 12:43:19 PM EST

What would happen if you repeated the message, spaced so that gzip only saw one copy at a time, but compared it to more of the sample?

[ Parent ]
Thinking about that (4.00 / 1) (#91)
by KWillets on Mon Jan 27, 2003 at 01:56:33 PM EST

I have a page on zlib open right now to see how to do that with indvidual messages in the corpus.  

The idea is that we can find the closest corpus message in terms of corpus-message/new-message compressed size, and do more of a nearest-neighbor calculation.

[ Parent ]

This looks familiar (5.00 / 4) (#76)
by srichman on Mon Jan 27, 2003 at 06:40:53 AM EST

From the Spam Conference abstracts:
Adaptive Spam Filtering

Jason Rennie, MIT AI Lab

Spam is a rampant problem with an annoying characteristic. As quickly as heuristics are developed to spot spam, the spammers change their tactics. Current systems all have one hole that spammers love to sneak through: they can't adapt. Hand-crafted rule-based classifiers have static rules. Bayesian approaches use static pre-processing that ignores "!!!!" and/or Japanese (for example). We need a new approach---a way to dynamically learn patterns that can identify spam. I describe one such approach: spam filtering as a compression problem. Given a set of e-mails and their labels (spam/non-spam), the objective is to encode a program for identifying spam in less space than it takes to encode the labels. In conjunction with a classification algorithm, this framework provides a natural way to score patterns. As part of a spam filtering system, it can be used to adapt the set of features used for labeling e-mail as spam. I describe the rationale for this approach and give examples of its performance on real data.

Work done in conjunction with Tommi Jaakkola.

Clarification (5.00 / 1) (#80)
by srichman on Mon Jan 27, 2003 at 08:56:57 AM EST

Rereading my above post, I realized that the subject "This looks familiar" could be misconstrued as an allegation of plagiarism. That wasn't what I meant. I just meant "this looks similar and relevant."

[ Parent ]
No problem (none / 0) (#90)
by KWillets on Mon Jan 27, 2003 at 01:50:58 PM EST

I certainly didn't invent this technique.  My main intention was to simplify it, try it,  and make it accessible for anyone else who wants to give it a go.

The links to other research have been very informative.  I had a loose idea of how this should work, from skimming a textbook, but I hadn't dug up the original papers, or followed current research.

[ Parent ]

Link (none / 0) (#82)
by srichman on Mon Jan 27, 2003 at 09:12:29 AM EST

Here's a nice link to the work, stolen from a slashdot comment that was much better than mine.

[ Parent ]
Confusing (none / 0) (#102)
by KWillets on Tue Jan 28, 2003 at 02:31:53 AM EST

I was actually thinking of something along the same lines, of finding maximal common substrings ("features")  and counting them, comparing spam/nonspam, etc.  This task can be accomplished in linear time.  In fact the Burrows-Wheeler Transform uses a variant (Suffix Array) to group common substrings (that's apparently what "block sorting" means; I had to look it up).  

He seems to be describing a pruning technique for the various substrings, or possibly a minimum description length for the rules used, not the compressed text.  I'm confused.

[ Parent ]

Same story on Slashdot 1 hour later (1.42 / 7) (#81)
by Stavr0 on Mon Jan 27, 2003 at 09:08:24 AM EST

Your Rights Online: Using gzip As A Spam Filter
- - -
All your posse are belong to Andre the Giant
And if you bother to look at that story... (4.40 / 5) (#83)
by wiredog on Mon Jan 27, 2003 at 09:48:31 AM EST

you see it's an mlp to this one.

Wilford Brimley scares my chickens.
Phil the Canuck

[ Parent ]
Sequitur... further food for thought... (5.00 / 1) (#85)
by dragoni on Mon Jan 27, 2003 at 10:01:16 AM EST

Sequitur might be an interesting basis for follow-on experiments in this vein... http://sequence.rutgers.edu/sequitur/ From the homepage: SEQUITUR is a method for inferring compositional hierarchies from strings. It detects repetition and factors it out of the string by forming rules in a grammar. The rules can be composed of non-terminals, giving rise to a hierarchy. It is useful for recognizing lexical structure in strings, and excels at very long sequences. Darryl.

Well... (5.00 / 1) (#86)
by awgsilyari on Mon Jan 27, 2003 at 10:33:56 AM EST

I've thought of using SEQUITUR for this purpose before. It could be valuable, but you can't use it directly in its current form. SEQUITUR produces a grammar that generates precisely one string -- the input string.

The inferred grammar produced by SEQUITUR is valuable in that it automatically isolates high-level grammar features. In a way this is very general because it captures the basic structure of the grammar being analyzed. But it is also, in a way, too specific because 1) it cannot represent words or patterns not seen in the input and 2) it can be structurally redundant due to things like ungrammatical sentences in the input among others. SEQUITUR is a greedy algorithm, and although it can produce very good grammars, many times a human could have done significantly better (at least for subsets of the grammar, or subgrammars).

I suppose one way to start would be to use SEQUITUR to generate a number of "prototypical" grammars, and somehow estimate the information gain by using an augmented prototype grammar to derive a new sentence (by augmented I mean augmented with rules that guarantee the target sentence can be derived). In other words, compute the grammatical "delta" between a known spam grammar and the input grammar. But having tried to actually implement this... Well, let's just say your work is cut out for you.

Ok, I don't want to go on about it too much since I could talk about information theory all day. But KWillets has definitely had a cool idea in the spirit of information theory.

Please direct SPAM to john@neuralnw.com
[ Parent ]

Further Explorations (5.00 / 2) (#95)
by KWillets on Mon Jan 27, 2003 at 05:57:14 PM EST

Following some of the very useful comments and research links that people posted, I started working on individual document matching, using the same principle.

The idea is that, for example, a Nigerian spam will have the best relative compression when another Nigerian spam is used for the gzip dictionary, and other types of spam will not appear as related.

I created a short zlib-based program that compares a test file against a list of corpus documents, and outputs the compressed size when each corpus document is used as the deflation dictionary. This process is similar to gzip'ing the two documents concatenated together, where the first document sets up a dictionary of common strings, and the second document's compression level reflects the number and size of its hits in that dictionary.

The output I've looked at so far seems to show that similar documents and duplicates are easy to pick out due to their much better relative compression, and I haven't seen any ham/spam combinations that come close to each other. Typically a ham or spam will have a number of messages in its respective corpus that are close, and other messages, ham or spam, will be far away.

I also tried some messages with headers, the effect seems a bit less pronounced, but still significant.

With a large number of document similarities like this, it's probably worthwhile to try a K-nearest-neighbor approach. I'll work on that tonight, possibly.

pointless (2.00 / 2) (#96)
by autopr0n on Mon Jan 27, 2003 at 06:05:38 PM EST

If this caught on, it would be so easy to defeat it wouldn't even be funny.

[autopr0n] got pr0n?
autopr0n.com is a categorically searchable database of porn links, updated every day (or so). no popups!
how does that make it pointless? (none / 0) (#98)
by cicero on Mon Jan 27, 2003 at 09:13:52 PM EST

the fact is that right now it's not being defeated.

besides, the article isn't suggesting that anyone replace their spamsassin installations with gzip, but it's merely being offered as a, "wow, look at this nifty new use of a very old tool"

I am sorry Cisco, for Microsoft has found a new RPC flaw - tonight your e0 shall be stretched wide like goatse.
[ Parent ]

PS: Old news (5.00 / 1) (#104)
by it certainly is on Tue Jan 28, 2003 at 11:59:06 AM EST

Got a story here from last February, titled Zip programs can identify language of any document, by using the same technique outlined here to identify spam, or cake recipes, or whatever. Indeed, "the researchers note that their analysis should work equally well for any information string, whether it records DNA sequences, geological processes, medical data or stock market fluctuations."

I was going to leave this in my story on compression, but I've decided it's better off as a comment here.

kur0shin.org -- it certainly is

Godwin's law [...] is impossible to violate except with an infinitely long thread that doesn't mention nazis.

Here's a good link (none / 0) (#109)
by KWillets on Tue Jan 28, 2003 at 03:50:44 PM EST

Text categorization using compression models.

With regard to DNA sequences, Gusfield has citations going back to 1990 (Allison and Yee, Milosavjlevic) about mutual information and minimum message length comparisons.

[ Parent ]

why not ppm? (none / 0) (#107)
by VasyaPoup on Tue Jan 28, 2003 at 01:33:06 PM EST

First a side comment. I know you know this, but bzip2 isn't just has a larger than gzip's 32k window, it's entirely different algorithm. :) So, the next logical step would be just using LZ-type encoder with larger window. You can use rar, which has a window size up to 4M (in fact, the stupid thing use 4M as a <it>default</it>! :) Now, concerning redundancy and LZ. From what I know of compression theory, the relationship between redundancy and the ability to compress with dictionary-based coding (such as LZ family) is quite a nontrivial one. Just imagine how many ways there are to find a repetitive strings. That's why there so many LZ based compressors and they are so different in compression and speed. On the other hand, the PPM based schemes are much more straightforward from the theoretical point of view, as they directly address the statistical properties of the text. First you build a model, and then, given the model (a set of probabilities) you encode a given string. Today's PPM encoders, given enough memory (they love memory, yes :) outbest gzip compression up to two times. Now, if you want to play with one of those, find, e.g. ppmd which is open source, compile it (or ask me for a linux executable) and here you go. Heh, I forgot. This comment doesn't mean I believe in filtering spam this way... :))

Better the devil that I know (none / 0) (#108)
by KWillets on Tue Jan 28, 2003 at 02:49:00 PM EST

I was reading about PPM last night, and BWT, etc.  

My original interest was in building a filter that does dictionary-type substring lookups, and scores hits based on incidence in the spam/ham corpus, length, etc.  While dictionary compressors do the former, they don't score based on probability; a string occurrence compresses the same amount whether it occurs once or many times in the corpus.  There are some other issues with proper string matching as well; algorithms like longest-common-substring might match better than left-right scanning, but that's still just conjecture on my part.

I haven't fully understood PPM yet, but it's possible that it does something more like what I was planning.  

The main need right now seems to be a storable dictionary for the text.  BWT offers some possibilities (Mark Nelson wrote an article on it in Dr. Dobb's that has several intermediate stages in the transform, and I'm already familiar with suffix arrays).  gzip and zlib don't have any intermediate format; even the "set dictionary" call uses text to build the dictionary.  

gzip is doing pretty well, but I'd be interested in trying PPM.  I'll see if I can get something working with that.

[ Parent ]

saving dictionary, model, etc (none / 0) (#111)
by VasyaPoup on Wed Jan 29, 2003 at 09:38:33 AM EST

> The main need right now seems to be a storable dictionary for the text.

Well, afair, LZ77, which is gzip's based upon, doesn't have a dictionary per se, just pointers to previosly parsed text. LZ78 has one, and it should be straightforward how to save it. I'm sure code does already exist, may be it's worth to ask in comp.compression...

As to PPM, and saving a model, this should also be rather easy. Look in the ppmd source ftp://ftp.sac.sk/pub/pc/pack/ppmdi1.rar especially how the -r2 option works.

After ppmd little redundacy's left, so people had fun building a model on a real text and then generating pseudotext which looks like real, only doesn't have any sense. You can play and generate pseudospam and pseudoham this way, just use -r2 option, cut the encoded file after the model has been freezed and fill with random bits.

[ Parent ]

Linux source? (none / 0) (#112)
by KWillets on Wed Jan 29, 2003 at 01:44:54 PM EST

I found a debian package, but I'm not sure how to apply the diff file, which converts from Windows to Linux, apparently.

[ Parent ]
little tuning (none / 0) (#113)
by VasyaPoup on Wed Jan 29, 2003 at 03:05:53 PM EST

I'm sending you the source, I had been able to compile on my RH6, just a little tuning in the Makefile and #define _UNKNOWN_ENVIRONMENT_ instead of #define _WIN32_ENVIRONMENT_ in the PPMdType.h

[ Parent ]
Any chance of getting your modified code please? (none / 0) (#115)
by ewildgoose on Fri Feb 21, 2003 at 10:41:36 AM EST

Any chance of getting your modified PPMD code to run on linux? I removed the item which was erroring from the Makefile, however, the thing coredumps when trying to run the binary... Please email to "ppmd@" with domain "wildgooses.com" Thanks

[ Parent ]
DNA? (none / 0) (#110)
by Dyolf Knip on Tue Jan 28, 2003 at 08:42:36 PM EST

So what were the results of feeding DNA sequences into gzip?

If you can't learn to do something well, learn to enjoy doing it poorly.

Dyolf Knip

How I filter spam (none / 0) (#114)
by DJBongHit on Thu Jan 30, 2003 at 02:04:36 PM EST

I set up a system which makes it so that absolutely zero spam ever gets into my Inbox. Now obviously this won't work for everybody (for example, it's no good for people who often get mail from people they don't know), but it fits in well with my email habits.

First of all, any mail which is not specifically addressed to me (i.e. my email address is in the To: or Cc: fields) or addressed to the send address of a mailing list I'm subscribed to is immediately dumped. That's a sure-fire indicator of spam. Anything that gets through that stage is either filtered into a folder for that particular mailing list, or if it's personal mail, my address book is searched for the sender. If the sender is there, it's clean, and it goes into my inbox. If the sender is not there, it's not immediately assumed to be spam, but is filtered into an "Unknown Senders" folder, which I scan through every couple of days. Any legit mail I find in there, I add the sender to my address book.

Like I said, for a lot of people this isn't a reasonable solution, but it works great for me.


GNU GPL: Free as in herpes.

Spam Filtering with gzip | 115 comments (102 topical, 13 editorial, 1 hidden)
Display: Sort:


All trademarks and copyrights on this page are owned by their respective companies. The Rest 2000 - Present Kuro5hin.org Inc.
See our legalese page for copyright policies. Please also read our Privacy Policy.
Kuro5hin.org is powered by Free Software, including Apache, Perl, and Linux, The Scoop Engine that runs this site is freely available, under the terms of the GPL.
Need some help? Email help@kuro5hin.org.
My heart's the long stairs.

Powered by Scoop create account | help/FAQ | mission | links | search | IRC | YOU choose the stories!