As a result of the torrent of spam I've been receiving from the Sobig.F virus, my tolerance for spam is at an all-time low. Like most people I get my share of 'medical' spam, offering products to increase, decrease or otherwise modify various parts of my anatomy. In the past most of these have gone to an email address I have kept for web use and were therefore easy to catch, but I'm now starting to get them on my primary email address as well. I therefore decided to whip up a procmail recipe to deal with them, using a list of keywords and procmail scoring. However, as I soon learned, the spammers have tried to prevent you doing this by obfusticating the contents of the spam. They do this by sending out HTML-format emails, and obfusticating the HTML so that a simple keyword match won't work. However, with a small perl script and a little bit of procmail magic, this was easily circumvented. I've written this up because I think it shows some useful and underused features of both perl and procmail. If you are interested, read on.
My first attempt was to scan the potential spams for a list of common keywords, and if enough matches were found, classify the message as spam. To do this I used procmail scoring (see the procmailsc(5) manpage for details of exactly how this works). According to procmailsc(5), a recipe that uses scoring matches only if its final score is positive. The procmail rule below initialises the score to -10 and increments it by 1 each time a keyword in the MEDICAL list matches, so strictly a message needs more than 10 keyword matches before the score rises above zero.
# Detect 'medical' spam.
MEDICAL="doctor|physician|prescri(be|ption)|physical exam(ination)?"
MEDICAL="${MEDICAL}|FDA approved|health|relief"
MEDICAL="${MEDICAL}|viagra|diazepam|valium|xanax|xenical|ambien|zyban"
MEDICAL="${MEDICAL}|pain|penis|erection|impotence|allergy|migraine"
MEDICAL="(${MEDICAL})"
# Need more than 10 keyword matches to qualify as spam.
:0 HB :.Spam.lock
* -10^0
* $ 1^1 ${MEDICAL}
Spam
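The arithmetic behind the scoring recipe can be reproduced in a few lines of perl. This is only a sketch: medical_score is a hypothetical helper, the keyword list is a shortened subset of MEDICAL, and it says nothing about how procmail itself walks the message. It simply starts from the base score and adds 1 per keyword match.

```perl
use strict;
use warnings;

# Hypothetical helper: start from $base and add 1 for every keyword match,
# mirroring the "-10^0" and "1^1" conditions in the recipe.
sub medical_score {
    my ($text, $base) = @_;
    # Shortened, illustrative subset of the MEDICAL keyword list.
    my $keywords = qr/doctor|physician|prescri(?:be|ption)|viagra|valium|xanax/i;
    my $matches = () = $text =~ /$keywords/g;   # count all matches
    return $base + $matches;
}

my $ten = join ' ', ('viagra') x 10;
printf "10 matches: score %d\n", medical_score($ten, -10);
printf "11 matches: score %d\n", medical_score("$ten valium", -10);
```

Ten matches bring the score up to 0 and the eleventh takes it to 1; procmail then decides from the final score whether the recipe matches.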
However, that didn't work very well, and I quickly discovered it was due to the obfustication techniques that were being used by the spammers. Let's look at an example of how they do this:
<table width=3D100% bgcolor=3Dblack cellpadding=3D3><tr><td colspan=3D3 bg=
color=3Daqua align=3Dcenter><font face=3DVerdana size=3D4><b>Onlin<!-- =
huxley -->e Ph<!-- coronado -->armacy<br><font color=3Dred>No Pr=
<!-- toxin -->ior Prescr<!-- extremal -->iption Nee<!-- =
usurer -->ded!<br><font color=3Ddeeppink>No Ph<!-- monotreme -->y=
sical Ex<!-- substantiate -->am Need<!-- steak -->ed!</td></tr>
<tr><td width=3D100% bgcolor=3Dblueviolet colspan=3D3><p align=3Dcenter><f=
ont face=3DVerdana color=3Dwhite><big><big><b><marquee border=3D1 scrollam=
ount=3D5 scrolldelay=3D1>Va<!-- trash -->lium ... Xa<!-- =
ambush -->nax ... Diazepa<!-- interject -->m ... Amb<!-- =
extraneous -->ien ... Xeni<!-- hera -->cal ... Via<!-- =
bigelow -->gra ... And Many Mo<!-- destruct -->re</marquee></td><=
/tr>
Yuck. There are a few tricks that they are using here:
- Embedding HTML comments (<!-- ... -->) inside words.
- Using a trailing = to escape newlines (a quoted-printable soft line break).
- Using =3D instead of just a plain = (the quoted-printable encoding of =).
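To see why these tricks defeat a plain keyword match, here is a small sketch using a fragment modelled on the sample above. The helper names naive_hit and undo_tricks are made up for illustration, and only two of the tricks are undone by hand.

```perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only.

# A plain keyword test, much as a keyword-counting rule would apply it.
sub naive_hit {
    my ($text) = @_;
    return $text =~ /viagra/i ? 1 : 0;
}

# Undo two of the tricks by hand: drop comments, turn =3D back into =.
sub undo_tricks {
    my ($text) = @_;
    $text =~ s/<!--.*?-->//g;
    $text =~ s/=3D/=/gi;
    return $text;
}

my $fragment = 'Via<!-- bigelow -->gra <font color=3Dred>';
print "before: ", naive_hit($fragment), "\n";
print "after:  ", naive_hit(undo_tricks($fragment)), "\n";
print undo_tricks($fragment), "\n";
```

The keyword is invisible until the comment has been stripped out of the middle of the word.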
They could also have used HTML character encodings, e.g. using &#61; instead of just a plain =. Obviously we need some way of undoing this before procmail runs our spam detection rules on the message. This is actually quite simple to do. First we need a small perl script to deobfusticate the email:
#!/bin/perl -w
#
# Author: Alan Burlison, 02/09/2003
# This script undoes some of the obfustication used by spammers to try to
# hide the real content of their mails from mail filters.
#
use strict;
#
# Get the next line, ignoring any '='-escaped newlines.
#
sub nextline
{
    my $line = <>;
    while (defined($line) && $line =~ s/=[\n\r]+$//) {
        last unless (defined($_ = <>));
        $line .= $_;
    }
    return ($line);
}
#
# Main.
#
while (defined(my $line = nextline())) {
    # Decode encoded characters.
    $line =~ s/&#(\d+);/pack('C', $1)/eg;
    $line =~ s/&#x([\da-f]+);/pack('C', hex($1))/egi;
    $line =~ s/=3d/=/gi;
    # Remove HTML comments, even if split across lines.
    $line =~ s/<!--.*?-->//g;
    while ($line =~ /<!--.*(?!-->)/) {
        last unless (defined($_ = nextline()));
        $line .= $_;
        $line =~ s/<!--.*?-->//gs;
    }
    print($line);
}
exit(0);
The first point of interest here is the first two lines of the "Decode encoded characters" block. This uses the e regexp modifier to execute the replacement part of the substitution, rather than just using it as the replacement text. For each HTML encoded character perl calls the necessary block of code to return the corresponding character value, which is then used to replace the matched text.
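A short sketch of the difference the e modifier makes, using the same two substitutions as the script. The helper names decode_entities and without_e are hypothetical and exist only for this comparison.

```perl
use strict;
use warnings;

# With /e the replacement is run as perl code, so pack() is actually called.
sub decode_entities {
    my ($text) = @_;
    $text =~ s/&#(\d+);/pack('C', $1)/eg;
    $text =~ s/&#x([\da-f]+);/pack('C', hex($1))/egi;
    return $text;
}

# Without /e the replacement is just an interpolated string, so the
# text "pack('C', 105)" is inserted literally.
sub without_e {
    my ($text) = @_;
    $text =~ s/&#(\d+);/pack('C', $1)/g;
    return $text;
}

print decode_entities('V&#105;agra &#x3D; v&#x49;agra'), "\n";
print without_e('V&#105;agra'), "\n";
```

The first line printed is the decoded text; the second shows the call to pack pasted into the output as plain text, which is what happens when the e is left off.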
The second point of interest is the use of the non-greedy quantifier and the negative-lookahead assertion in the "Remove HTML comments" block. If I had just used s/<!--.*-->//g to remove comments from a line it would not have worked correctly on lines that contained two comments - the .* would have matched as much as possible before matching the trailing -->, i.e. given an input line of
Buy <!-- foo --> Viagra <!-- foo -->here!
the resulting line after substitution would be
Buy here!
and not
Buy Viagra here!
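The two behaviours are easy to compare side by side. In this sketch strip_greedy and strip_lazy are hypothetical names for the two variants of the substitution.

```perl
use strict;
use warnings;

# Greedy: .* runs on to the last --> on the line.
sub strip_greedy {
    my ($text) = @_;
    $text =~ s/<!--.*-->//g;
    return $text;
}

# Non-greedy: .*? stops at the first --> after each <!--.
sub strip_lazy {
    my ($text) = @_;
    $text =~ s/<!--.*?-->//g;
    return $text;
}

my $input = 'Buy <!-- foo --> Viagra <!-- foo -->here!';
print "greedy: '", strip_greedy($input), "'\n";
print "lazy:   '", strip_lazy($input), "'\n";
```

The non-greedy version keeps the keyword. It also leaves the spaces that surrounded the first comment, so its output has two spaces after "Buy", which makes no difference to a keyword match.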
Some of the comments are split over multiple lines, so we need to keep reading in lines until we see the closing --> of a comment block. Because HTML comments can't be nested, we can deal with this by first removing any whole comments, and then, whilst the line contains a comment open with no corresponding close, we keep appending lines and removing whole comments. We check for an unclosed comment by matching the line against the comment open (<!--), followed by a run of characters (.*) ending at a point that is not immediately followed by a comment close ((?!-->)). The (?! ... ) construct is a perl negative lookahead assertion - see the perlre(1) manpage for details. Note that because whole comments have already been removed when this test runs, any <!-- still left on the line should be unclosed, and the pattern as written succeeds whenever a <!-- remains, so the lookahead acts as a statement of intent rather than an extra filter. If we pass the block of obfusticated HTML shown above through the script, we get the following (line breaks added for clarity):
<table width=100% bgcolor=black cellpadding=3>
<tr><td colspan=3 bgcolor=aqua align=center>
<font face=Verdana size=4>
<b>Online Pharmacy<br>
<font color=red>No Prior Prescription Needed!<br>
<font color=deeppink>No Physical Exam Needed!</td>
</tr><tr>
<td width=100% bgcolor=blueviolet colspan=3>
<p align=center>
<font face=Verdana color=white>
<big><big><b>
<marquee border=1 scrollamount=5 scrolldelay=1>
Valium ... Xanax ... Diazepam ... Ambien ... Xenical ... Viagra ... And Many More
</marquee>
</td></tr>
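The join-and-strip loop that handles comments split across lines can also be tried in isolation. The sketch below works on an in-memory list of lines rather than on standard input; strip_comments is a hypothetical helper, and it tests for a bare <!-- instead of using the lookahead.

```perl
use strict;
use warnings;

# Hypothetical helper: strip HTML comments from a list of lines, joining
# lines whenever a comment is opened but not closed on the current line.
sub strip_comments {
    my @lines = @_;
    my @out;
    while (defined(my $line = shift @lines)) {
        $line =~ s/<!--.*?-->//gs;
        # Any "<!--" still present has no matching "-->" yet.
        while ($line =~ /<!--/ && @lines) {
            $line .= shift @lines;
            $line =~ s/<!--.*?-->//gs;   # /s lets . match the joined newline
        }
        push @out, $line;
    }
    return @out;
}

print strip_comments("Xa<!-- \n", "ambush -->nax\n", "Amb<!-- x -->ien\n");
```

The first two input lines are merged into one output line because the comment spans them.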
The last part of the jigsaw is to plug this into procmail, so that the mail is deobfusticated before applying the keyword counting rule. Procmail has a neat filter feature that allows you to specify rules that take the mail being processed and pass it through an external filter before processing it further. The final procmail magic required is:
# Detect 'medical' spam.
MEDICAL="doctor|physician|prescri(be|ption)|physical exam(ination)?"
MEDICAL="${MEDICAL}|FDA approved|health|relief"
MEDICAL="${MEDICAL}|viagra|diazepam|valium|xanax|xenical|ambien|zyban"
MEDICAL="${MEDICAL}|pain|penis|erection|impotence|allergy|migraine"
MEDICAL="(${MEDICAL})"
# Deobfusticate HTML emails.
:0 fBbw
* (<!--|-->|&#x?[0-9a-f]+;|(=$))
| deobfusticate
# Need more than 10 keyword matches to qualify as spam.
:0 HB :.Spam.lock
* -10^0
* $ 1^1 ${MEDICAL}
Spam
The fBbw flags to procmail tell it the rule is a filter (f), to match against the body of the mail (B), to pass the body of the mail to the filter (b), and to wait until the filter has finished before processing any of the rules that follow (w). The rule matches against a subset of the obfustication tricks used by the spammers, to cut down on unnecessary executions of the deobfusticate script. Running the script unnecessarily won't do any harm, but it is obviously more efficient to only deobfusticate if really necessary.
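If you want to check which message bodies would trigger the filter without involving procmail, the condition can be approximated in perl. This is a rough rendering only: needs_deobfuscation is a hypothetical helper, and procmail's regular expression dialect differs from perl's in some details.

```perl
use strict;
use warnings;

# Hypothetical helper: an approximate perl version of the filter condition.
# /m lets $ match at the end of every line; /i mirrors procmail's default
# case-insensitive matching.
sub needs_deobfuscation {
    my ($body) = @_;
    return $body =~ m{(?:<!--|-->|&#x?[0-9a-f]+;|=$)}mi ? 1 : 0;
}

for my $body ("plain text\n", "Via<!-- x -->gra\n", "No Pr=\nior\n", "a = b\n") {
    print needs_deobfuscation($body), "\n";
}
```

Only the bodies containing a comment marker, a character reference or a line ending in = are flagged; ordinary text with an = in the middle of a line is left alone.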
One caveat: the filter rule modifies the body of the mail, so any rules that follow will see the modified version, and the modified mail will eventually be stored somewhere, so this rule should go somewhere near the end of your procmail rc file. However, the modifications performed by the deobfusticate script will be benign, unless the mail pulls tricks like hiding JavaScript inside comments (and JavaScript inside a mail is a bad idea anyway - right? ;-)