I was reading through a bioinformatics book the other day,
and was reminded of a useful shortcut for comparing a text against various
corpora. A number of researchers have simply fed DNA sequence data into the
popular Lempel-Ziv compression algorithm to see how much redundancy it contains.
Loosely speaking, the LZ (used in Zip) and the related gzip compression algorithms look for
repeated strings within a text and replace each repeat with a reference to an
earlier occurrence. The compression ratio achieved therefore measures how many
repeated fragments, words, or phrases occur in the text.
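To make this concrete, here is a minimal sketch in Python (the helper name and
sample strings are mine, for illustration); the zlib module implements the same
DEFLATE scheme gzip uses:

import zlib

def compression_ratio(text: str) -> float:
    """Compressed size over original size; lower means more redundancy."""
    data = text.encode("utf-8")
    return len(zlib.compress(data, 9)) / len(data)

# Repetitive text compresses far better than varied text.
print(compression_ratio("the cat sat on the mat " * 200))  # well under 0.1
print(compression_ratio("The quick brown fox jumps over the lazy dog."))  # about 1 or above: no repeats to exploit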
A related technique allows us to measure how much a given "test" text has in
common with a corpus of possibly similar documents. If we concatenate the
corpus and the test text, and gzip them together, the test text will get a
better compression ratio if it has more fragments, words, or phrases in common
with the corpus, and a worse ratio if it is dissimilar. Since the LZ algorithm
looks back through earlier input for repetitions, it tends to map pieces of the test text
to previous occurrences in the corpus, thereby achieving a high "appended
compression ratio" if the test text is similar to what it's appended to.
In this case, we wish to compare an incoming email message against two possible
corpora: spam and non-spam (ham). If we maintain archives of both, we can
compare the appended compression ratios relative to each, to judge how similar a
new message is to spam or ham.
As a simple test, I downloaded some sample spam and ham from the SpamAssassin archive. I removed
headers from the messages (to focus on the message text only) and created spam and
ham "training sets" 1-2 megabytes in size. I then tested spam and ham messages
not in the training sets for their compressed sizes when appended.
Compression was measured as follows:
$ cat spam.txt new-message-body.txt | gzip - | wc -c
$ cat ham.txt new-message-body.txt | gzip - | wc -c
The sizes output by these commands were compared to the compressed sizes of spam.txt and
ham.txt without new-message-body.txt appended, to see how many bytes the new
message body consumed.
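An equivalent measurement in Python might look like this (a sketch reusing the
file names above):

import gzip
from pathlib import Path

def compressed_size(data: bytes) -> int:
    return len(gzip.compress(data))

corpus = Path("spam.txt").read_bytes()
msg = Path("new-message-body.txt").read_bytes()
# bytes the new message adds to the corpus's compressed size
print(compressed_size(corpus + msg) - compressed_size(corpus))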
The results for "ham" messages were the most dramatic. The average compressed
size of a ham message appended to spam was 38% higher than when appended to
other ham. For spam messages, the same comparison yielded a compressed size 6%
smaller when appended to spam vs. ham, so in both cases, compressing a message
with others of its kind yielded a smaller file, on average.
Individual results were also quite clear: while some spam messages compressed
slightly better when mixed with ham, ham messages still maintained a margin of
15% or more between the most spamlike ham and the most hamlike spam. I would
put the threshold somewhere around 110%: if a message's size when gzipped with
spam is less than 110% of its size when compressed with ham, it's probably spam.
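That rule is straightforward to express in code (a sketch; the function names
are mine, and the 1.10 threshold mirrors the 110% figure above):

import gzip

def marginal_bytes(corpus: bytes, message: bytes) -> int:
    # bytes the message adds to the corpus's compressed size
    return len(gzip.compress(corpus + message)) - len(gzip.compress(corpus))

def looks_like_spam(message: bytes, spam_corpus: bytes, ham_corpus: bytes,
                    threshold: float = 1.10) -> bool:
    # spam if compressing with the spam corpus costs less than 110%
    # of what compressing with the ham corpus costs
    return marginal_bytes(spam_corpus, message) < threshold * marginal_bytes(ham_corpus, message)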
In conclusion, gzip is a fairly blunt instrument for spam detection, but the
effectiveness of its relatively blind repetition-finding is worth noting. The
current fad among spam filters is word-counting, with various statistical
heuristics applied to the results. Algorithms like LZ and gzip go beyond word
matching, finding entire phrases and paragraphs of repetition, but do not
attempt to measure their statistical significance. More sophisticated
approaches, which combine phrase matching with statistical analysis, may be more
effective still.