December 19, 2009

How to Use and Configure SpamAssassin at CCIS

This document covers the following topics:
• Introduction
    • What is SpamAssassin?
    • How accurate is SpamAssassin?
    • What isn't SpamAssassin?
• Using SpamAssassin at CCIS
    • How do I turn SpamAssassin on and off?
    • How do I filter possible spam automatically on the server?
    • How do I filter possible spam automatically on my PC or Mac?
• Bayesian Classification
    • What is Bayesian classification?
    • How can I improve the accuracy of Bayesian classification?
• Customizing SpamAssassin
    • Adjusting the score required to mark a message as spam
    • Preventing SpamAssassin from marking mail from certain senders
    • Adjusting the scores of particular tests
    • For more information (about customization)

Introduction

SpamAssassin is spam-detection software that attempts to detect spam as mail is delivered to your mailbox, and flags messages that it thinks are likely to be spam. By itself, SpamAssassin does not filter spam, but it can be used in conjunction with other facilities to filter your mail. It is highly configurable.

At CCIS, all mail delivered to our users' mailboxes is automatically run through SpamAssassin. (You can turn it off if you don't want your mail checked, as described below.)

What is SpamAssassin?

SpamAssassin (http://www.spamassassin.org/) is a tool that attempts to automatically determine whether incoming mail is spam (unsolicited commercial email) or not. It does this by applying a number of tests to the mail, and assigning each test a score, which can be positive (likely to appear in spam) or, less frequently, negative (unlikely to appear in spam). For any given message, the scores of all the tests are added up, and compared with a threshold value.

If the message scores below the threshold, it is not marked as spam (but a summary of any tests that matched is added in a header field). If it scores at or above the threshold, it is marked as spam: SpamAssassin adds an 'X-Spam-Flag: YES' header in addition to the summary of tests, puts a detailed report at the top of the message, and includes the original message as an attachment.

The report has a preview of the content of the message (e.g. the first few lines of it, converted from HTML or decoded if necessary), the total number of points SpamAssassin assigned towards its spam score, and a list of the rules that matched. The list of rules has three columns: the number of "points" each rule contributed towards the total score, the name of the rule (which can be used to customize how SpamAssassin scores rules, as described on the SpamAssassin web site), and a human-readable explanation of the rule. Rules can produce positive or negative numbers of points: a positive number means that a match makes the message likelier to be spam, and a negative number means that a match makes it likelier not to be spam. (For instance, messages in HTML with JavaScript are often spam, but PGP-signed messages are rarely spam.)
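The scoring arithmetic described above can be sketched in a few lines of Python. This is only an illustration, not SpamAssassin's own code: the function name, the rule names, and the point values are all made up for the example, and the threshold of 5 is the default mentioned later in this document.

```python
# Illustrative sketch only: rule names and point values are invented.
def score_message(matched_rules, threshold=5.0):
    """Add up the points of the rules that matched and compare with the threshold."""
    total = round(sum(matched_rules.values()), 3)
    return total, total >= threshold


# A message that matched two spam-like rules and one non-spam-like rule.
matched = {
    "HYPOTHETICAL_HTML_JAVASCRIPT": 2.5,  # positive: likelier to be spam
    "HYPOTHETICAL_SUBJECT_CHARS": 3.1,    # positive: likelier to be spam
    "HYPOTHETICAL_PGP_SIGNED": -1.0,      # negative: likelier not to be spam
}
total, is_spam = score_message(matched)
print(total, is_spam)
```

With these invented numbers the negative rule is enough to keep the message under the threshold; remove it and the same message would be marked as spam.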

Note: Previous versions also added *****SPAM***** to the subject line. The current version doesn't do that by default, but you can tell it to as described below under Customizing SpamAssassin.

Some of the tests involve the body or the headers of the message itself. Others involve consulting various network databases. The current version also supports Bayesian classification of messages, in which it uses characteristics of messages it's already seen to try to improve its accuracy at classifying future messages.

How accurate is SpamAssassin?

For the most part, SpamAssassin relies on probabilities - identifying words, phrases, or headers that are more likely to appear in spam than in legitimate email, or mail from sites that are more likely than average to send spam. Few of SpamAssassin's tests are sure-fire guarantees that a message is or isn't spam. Therefore, if you get a lot of mail, it's almost certain that SpamAssassin will occasionally misidentify legitimate mail as spam ('false positives'), and often fail to mark spam as such ('false negatives'). At CCIS, I've seen SpamAssassin mark forwarded calls-for-papers as spam, and at home, I've seen it mark legitimate bills from my ISP as spam. For this reason it's very important not to automatically discard mail SpamAssassin marks as spam without looking at it.

Since SpamAssassin's threshold for marking a message as spam is configurable (see below), you have some control over the ratio of false positives and negatives.

Recent versions of SpamAssassin support Bayesian classification, described below. When used in the default, automatic mode, this improves accuracy a little bit over time, as the Bayesian logic learns new features of messages that SpamAssassin has decided based on other tests to consider as spam or non-spam. However, it can also be trained by hand, using messages that you've decided manually are spam or non-spam. When trained with a large enough number of hand-classified messages, both spam and non-spam, this can improve SpamAssassin's accuracy considerably, not least because it's customized to the actual types of mail that you tend to get.

What isn't SpamAssassin?

SpamAssassin itself doesn't do any actual mail filtering (in the sense of blocking mail, or filing it in a special folder). It just inspects the mail and marks it up to indicate whether it meets the probable-spam threshold or not. However, the markup that SpamAssassin adds makes it easy to filter your mail with other tools, such as procmail (http://www.procmail.org/, described in one of our HOWTO documents, How to use procmail for mail filtering at CCIS). As mentioned above, you shouldn't just delete mail that's flagged as spam, but you might want to filter it into a separate folder that you can skim through periodically.

Using SpamAssassin at CCIS

How do I turn SpamAssassin on and off?

You don't need to do anything to turn SpamAssassin on; at CCIS it's run by default on all locally-delivered mail. If you don't want SpamAssassin to be run on your mail, however, you can turn it off by creating a file .spamassassin/disable in your home directory with the commands:

cd
mkdir -p .spamassassin
touch .spamassassin/disable

The '-p' option keeps 'mkdir' from complaining if the .spamassassin directory already exists, which it probably does.

(Incidentally, this .spamassassin/disable file is a mechanism we've instituted at CCIS, not something built in to SpamAssassin.)

How do I filter possible spam automatically on the server?

You might want to sort mail which SpamAssassin marks as spam into a separate folder for future perusal. That way you can easily find legitimate mail in your main mailbox, and regularly skim through the mail in your potential-spam folder for any misidentified legitimate mail before clearing it out.

If you read your mail on CCIS Unix systems (e.g. with Pine), the easiest way to do this is with "procmail", which is configured by creating and editing a file called .procmailrc in your home directory. We have a HOWTO document, How to use procmail for mail filtering at CCIS, which you should read in order to learn how to do this. The easiest thing to filter on in your .procmailrc is the 'X-Spam-Flag: YES' header SpamAssassin adds to messages it thinks might be spam. If you want finer-grained control (e.g. if you want to filter possible spam into different folders depending on which specific SpamAssassin tests it matches), you might instead want to look at the contents of the 'X-Spam-Status:' header, which lists the total hits and the tests that matched.
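If you want to experiment with the 'X-Spam-Status:' header before writing filter rules, the following Python sketch pulls it apart. It rests on an assumption about the layout of the header: a Yes/No verdict, a comma, and then name=value fields such as 'hits=' (or 'score=' in other versions), 'required=' and 'tests='. Check a real message in your own mailbox to confirm the layout. The function name and the sample header value are hypothetical.

```python
import re


def parse_spam_status(value):
    """Split the value of an X-Spam-Status header into its parts.

    Assumed layout (check a real message):
        Yes, hits=7.2 required=5.0 tests=RULE_A,RULE_B version=...
    """
    # Undo header folding, including line breaks inside the tests list.
    flat = re.sub(r",\s+", ",", " ".join(value.split()))
    verdict, _, rest = flat.partition(",")
    fields = dict(re.findall(r"(\w+)=(\S*)", rest))
    score_text = fields.get("score", fields.get("hits"))
    tests = [t for t in fields.get("tests", "").split(",") if t and t != "none"]
    return {
        "is_spam": verdict.strip().lower() == "yes",
        "score": float(score_text) if score_text is not None else None,
        "required": float(fields["required"]) if "required" in fields else None,
        "tests": tests,
    }


# Made-up header value, folded over two lines as mail headers often are.
sample = "Yes, hits=7.2 required=5.0 tests=HYPOTHETICAL_RULE_A,\n\tHYPOTHETICAL_RULE_B version=0.0"
print(parse_spam_status(sample))
```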

Here is a sample .procmailrc. It assumes you use Pine, and it stores possible spam in standard Unix mailbox format (a/k/a 'mbox' format) in the file mail/possiblespam (accessible in Pine as a folder called 'possiblespam').

IMPORTANT: You really should read the How to use procmail for mail filtering at CCIS document, and understand what "procmail" is doing, the risks of using it, and how to minimize them, rather than just copying this into your .procmailrc and hoping it works.

# $HOME/mail *should already exist* - run "mkdir ~/mail" first!
MAILDIR=$HOME/mail

# not setting DEFAULT or ORGMAIL, so mail that doesn't match will
# be left in system mailbox

LOGFILE=$HOME/procmail-log.txt

# stuff SpamAssassin thinks might be spam:

:0:
* ^X-Spam-Flag: YES
possiblespam
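
To make the condition in the recipe above concrete, here is a small Python sketch of the test it performs. By default procmail checks a condition against the message header only and ignores case, which is what the sketch imitates. The function name and the sample messages are hypothetical, and details such as unfolding continued header lines are left out.

```python
import re


def recipe_matches(raw_message):
    """Return True if a header line starts with 'X-Spam-Flag: YES'."""
    # The header ends at the first blank line; the body is not examined.
    header, _, _body = raw_message.partition("\n\n")
    pattern = re.compile(r"^X-Spam-Flag: YES", re.IGNORECASE | re.MULTILINE)
    return pattern.search(header) is not None


flagged = "From: someone@example.com\nX-Spam-Flag: YES\nSubject: hello\n\nbody text\n"
quoted = "From: someone@example.com\nSubject: hello\n\nX-Spam-Flag: YES\n"
print(recipe_matches(flagged), recipe_matches(quoted))
```

The second sample shows why matching on the header matters: a message that merely quotes the flag in its body is not filed as possible spam.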

How do I filter possible spam automatically on my PC or Mac?

If you read your mail with a PC or Mac client such as Eudora, Outlook, or Outlook Express, you may be able to do filtering from your mail client based on the markup that SpamAssassin adds. (If you use IMAP to read your mail, you may also be able to use "procmail" as described above, but getting that to work smoothly is a little tricky.) The easiest way is based on the 'X-Spam-Flag: YES' header that SpamAssassin adds to the message. Here are some links describing how to do that for certain mail readers:

• Email Client Configuration for SpamAssassin [real-time.com] lists procedures for many mail clients for Windows and Unix/Linux. Some are probably applicable to Macintosh clients as well.
• Setting up a SpamAssassin Filter in Windows [oregonstate.edu] is a detailed description of configuring several versions of Outlook (not Outlook Express) for SpamAssassin.
• This page about SpamAssassin on MacOS X [aplawrence.com] is mostly irrelevant, but about a third of the way down there's a description of how to automatically sort mail in Mail.app based on SpamAssassin's X-Spam-Flag: header.
• Generic (non-SpamAssassin specific) information on filtering in Eudora:
    • How to Use Filters (for Windows) [eudora.com]
    • How to Use Filters (for Macintosh) [eudora.com]
• Generic (non-SpamAssassin specific) information on filtering in Microsoft Outlook (not Outlook Express):
    • Rules Wizards and Assistants [slipstick.com] (The Usage section has lots of useful links.)
    • Microsoft Knowledgebase: OL2000: How to Use the Rules Wizard in Outlook 2000 (Q196212) [microsoft.com]

(If your mail client can't filter based on arbitrary headers, you can tell SpamAssassin to add '*****SPAM*****' to the Subject: header as well, as described below under Customizing SpamAssassin.)

Bayesian Classification

Bayesian classification works by taking pieces of mail that have already been sorted into spam and non-spam, and trying to discover characteristics that distinguish them. Its implementation in SpamAssassin has two modes. In auto-learning mode, which is turned on by default, SpamAssassin will use messages that it's already pretty sure (based on other tests it does) are spam, or already pretty sure are not spam, and examine them to determine characteristics that might be useful in detecting future spam or non-spam messages. If the rest of SpamAssassin's tests are sufficiently accurate, this can gradually improve SpamAssassin's accuracy over time -- as the kind of mail you get evolves, SpamAssassin learns new characteristics to distinguish spam and non-spam.

In the other mode, SpamAssassin can be taught by hand to recognize spam and non-spam messages based on a sufficiently large number of messages you've sorted by hand. You do this by running the 'sa-learn' command on those messages. It's important to run 'sa-learn' on enough messages, both of spam and non-spam, and to run it on spam and non-spam with otherwise similar characteristics. (For instance, if you run it on a recent collection of non-spam and an old collection of spam, SpamAssassin will learn that older messages are likelier to be spam and newer messages are likelier to be non-spam, which will give inappropriate results.) SpamAssassin needs to have examined a couple hundred messages before it will use Bayesian classification; good results are reported with a couple thousand.

It is, however, still useful to run 'sa-learn' on individual messages, especially messages that SpamAssassin has misclassified. In combination with SpamAssassin's auto-learning, that can help prevent future false positives and false negatives.
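
To give a feel for what "discovering characteristics" means, here is a toy word-based Bayesian classifier in Python. It is a textbook-style sketch and not SpamAssassin's actual implementation, which is considerably more elaborate. The class name, the smoothing constants, and the tiny training messages are all invented for the example.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy classifier: counts in how many spam and ham messages each word appears."""

    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def learn(self, text, label):
        # Count each distinct word once per message.
        self.token_counts[label].update(set(text.lower().split()))
        self.message_counts[label] += 1

    def spam_probability(self, text):
        n_spam = self.message_counts["spam"]
        n_ham = self.message_counts["ham"]
        # Start from the (smoothed) ratio of spam to ham messages seen so far.
        log_odds = math.log((n_spam + 1) / (n_ham + 1))
        for token in set(text.lower().split()):
            # Add-one smoothing keeps unseen words from producing zero.
            p_spam = (self.token_counts["spam"][token] + 1) / (n_spam + 2)
            p_ham = (self.token_counts["ham"][token] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


classifier = TinyBayes()
classifier.learn("cheap pills buy now", "spam")
classifier.learn("buy cheap watches now", "spam")
classifier.learn("call for papers deadline", "ham")
classifier.learn("meeting notes for monday", "ham")
print(classifier.spam_probability("buy cheap pills"))
print(classifier.spam_probability("papers for monday"))
```

Even in this toy version you can see why balanced training matters: any word that happens to appear only in one of the two collections, for whatever reason, pushes the result in that direction.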

The 'sa-learn' command has a manual page, so you can type 'man sa-learn' for more information, but here are some typical invocations. (SpamAssassin refers to non-spam as ham.)
• sa-learn --ham --mbox ~/mail/saved
Tell SpamAssassin that all of the mail in the Unix mbox-format mail folder ~/mail/saved is ham (non-spam).
• sa-learn --spam --mbox ~/mail/spam
Tell SpamAssassin that all of the mail in the Unix mbox-format mail folder ~/mail/spam is spam. It won't be very helpful to do this on a folder you're automatically populating with messages SpamAssassin marks as spam, since SpamAssassin has already seen all those messages (and there might be false positives in there, so you'd be reinforcing SpamAssassin's errors). But it will help prevent future false negatives if you run it on a folder of spam you've sorted by hand.
• sa-learn --spam
This command would tell SpamAssassin to read a message from standard input and treat it as spam. For instance, from a mail reader like Pine or Elm or Mutt you could pipe a single message that was misclassified as non-spam into this sa-learn command.
• sa-learn --ham
This command would tell SpamAssassin to read a message from standard input and treat it as non-spam (ham). For instance, from a mail reader like Pine or Elm or Mutt you could pipe a single message that was misclassified as spam into this sa-learn command.
• sa-learn --spam `mhpath cur`
This example is only for users of the MH or nmh mail system. MH keeps a notion of the current message, which is stored as an individual file, and this command would tell SpamAssassin that MH's current message in the current mail folder is spam. You could replace '--spam' with '--ham' to tell it the message was non-spam. (The backquotes make the shell run the MH command mhpath cur, which produces the full Unix pathname to the current message, so this command gives sa-learn the pathname to the file with the current message. This is much better than using 'show | sa-learn --spam', because the show command reformats the headers and may not print all of them.)

The sa-learn manual page has a lot of additional information, both on syntax and usage of the command itself, and on how to use it effectively to produce good results.

If you accidentally learn a message as spam (or as ham), you can correct that just by re-learning the same message with the correct option ('--ham' or '--spam'); the previous characterization of the message will be forgotten. Alternatively, you can use the '--forget' flag, which forgets the previous characterization of the message, but does not re-learn it.
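
If you find yourself running these commands often, they are easy to script. The Python sketch below only assembles and runs the command lines shown above, using the '--spam', '--ham', '--forget' and '--mbox' options already described; the function names are hypothetical, and it assumes 'sa-learn' is on your PATH.

```python
import os
import subprocess


def sa_learn_command(kind, mbox_path=None):
    """Build the argument list for one of the sa-learn invocations above."""
    if kind not in ("spam", "ham", "forget"):
        raise ValueError("kind must be 'spam', 'ham' or 'forget'")
    command = ["sa-learn", "--" + kind]
    if mbox_path is not None:
        # The shell normally expands '~'; do it by hand here.
        command += ["--mbox", os.path.expanduser(mbox_path)]
    return command


def learn_message(kind, raw_message):
    """Pipe one raw message (as bytes) to sa-learn on standard input."""
    return subprocess.run(sa_learn_command(kind), input=raw_message, check=True)


print(sa_learn_command("ham", "/tmp/saved"))
```

For example, learn_message("spam", data) corresponds to piping a single misclassified message into 'sa-learn --spam' from your mail reader.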

Customizing SpamAssassin

When it first processes a message (i.e. the first time you get mail after turning it on), SpamAssassin creates a .spamassassin/user_prefs file in your home directory. You can customize SpamAssassin by editing that file with a text editor. (If it doesn't already exist, you should turn SpamAssassin on and send yourself a piece of mail to make SpamAssassin create it - that's better than just creating a new empty .spamassassin/user_prefs file because SpamAssassin puts explanatory comments in the file.)

There are a few main things you're likely to want to set in that file:
• The total score needed to mark a message as spam.
• Whether to add '*****SPAM*****' to the Subject: header.
• Whether to use Bayesian classification, and if so whether to auto-learn based on presumed spam or non-spam.
• Email addresses from which SpamAssassin should not mark a message as spam.
• The score contributed by particular tests SpamAssassin applies.

Each is described below.

Adjusting the score required to mark a message as spam

By default, SpamAssassin tags a message as likely spam if the point-values of the tests the message matches add up to five or more. You can change this value by putting a line in your .spamassassin/user_prefs file that looks like

required_hits 6

A value greater than 5 will make mail less likely to be marked as spam, decreasing the number of false positives and increasing the number of false negatives. A value less than 5 will make mail more likely to be marked as spam, decreasing the number of false negatives and increasing the number of false positives.
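
The trade-off is easy to see with a few numbers. In the Python sketch below, the scores and the spam/non-spam labels are invented purely for illustration, and the function name is hypothetical; it simply counts how many messages would be misclassified at a given threshold.

```python
def count_errors(scored_messages, threshold):
    """Return (false_positives, false_negatives) for a list of (score, is_spam) pairs."""
    false_positives = sum(
        1 for score, is_spam in scored_messages if score >= threshold and not is_spam
    )
    false_negatives = sum(
        1 for score, is_spam in scored_messages if score < threshold and is_spam
    )
    return false_positives, false_negatives


# Invented sample: (score SpamAssassin assigned, whether it really was spam).
sample = [
    (1.2, False), (4.4, False), (5.3, False),
    (4.1, True), (5.6, True), (9.8, True),
]
for threshold in (4, 5, 6):
    print(threshold, count_errors(sample, threshold))
```

On this made-up sample, raising the threshold from 4 to 6 trades false positives for false negatives one for one; your own mail will of course behave differently.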

Turning on Subject: header rewriting

Previous versions of SpamAssassin added '*****SPAM*****' to the front of the Subject: header when a message was tagged as likely spam, in addition to adding the 'X-Spam-Flag: YES' header. If you want to restore the old behaviour, you can add the line

rewrite_header Subject *****SPAM*****

to your .spamassassin/user_prefs file. You can of course change the string that SpamAssassin adds to the Subject: line.

Controlling the Bayesian classifier

If you want to disable Bayesian classification entirely, you can add the line

use_bayes 0

to your .spamassassin/user_prefs file. If you want to keep using the Bayesian database and train it by hand, but don't want SpamAssassin to learn spam/ham distinctions automatically from the mail it classifies, you can add the line

auto_learn 0

If you turn off Bayesian classification altogether, you may want to delete the three files bayes_journal, bayes_seen, and bayes_toks in your .spamassassin directory to save space.

Preventing SpamAssassin from marking mail from certain senders or domains as spam

SpamAssassin lets you list addresses (or domains) whose mail should not be marked as spam. This is referred to as 'whitelisting' senders. You might want to do this, for instance, for addresses for which it's very important that you see their mail quickly (such as billing@yourisp.net), or for senders or domains whose legitimate mail is likely to get incorrectly tagged as spam. For instance, if you have a correspondent with an account at an ISP that also hosts spammers, and she uses a mail program that forces her to send her mail as HTML, or adds phrases to the bottom of her message that sound like spam to SpamAssassin, you could add her address to the whitelist. Also, SpamAssassin has a hard time distinguishing between (unsolicited) spam and legitimate (solicited) advertising and marketing material, so if you've signed up for marketing newsletters, you might want to add the addresses they come from to the whitelist.

You do this by adding lines like

whitelist_from billing@myisp.net
whitelist_from example3881@hotmail.com
whitelist_from specials@example.com


to your .spamassassin/user_prefs file. As you can see, you can have multiple lines; each address needs to be on a separate line.

You can use the asterisk ('*') as a wildcard character to match parts of addresses, as you can when matching Unix filenames. So to avoid marking any messages that say (accurately or incorrectly) they're from myisp.net, you could add

whitelist_from *@myisp.net

and if you have a correspondent whose mail sometimes comes from julia@example.edu, sometimes from julia@www.example.edu, and sometimes from julia@mail.example.edu, you could add a line

whitelist_from julia@*example.edu

(Actually, addresses in the whitelist just get a very large negative number added to their score, so it's just conceivable that a message could be in the whitelist but still be marked as spam, if it had enough other spamlike characteristics.)
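
If you're unsure what a wildcard pattern will match, you can try it out before putting it in your user_prefs file. The Python sketch below imitates the matching under two assumptions: that '*' stands for any run of characters (including none), and that case is ignored. The function names are hypothetical and this is not SpamAssassin's own matching code.

```python
import re


def whitelist_pattern_to_regex(pattern):
    """Translate a whitelist pattern into a regex; only '*' is treated as special."""
    pieces = [".*" if ch == "*" else re.escape(ch) for ch in pattern]
    return re.compile("".join(pieces), re.IGNORECASE)


def is_whitelisted(address, patterns):
    """Return True if the whole address matches any of the patterns."""
    return any(whitelist_pattern_to_regex(p).fullmatch(address) for p in patterns)


patterns = ["*@myisp.net", "julia@*example.edu"]
for address in (
    "billing@myisp.net",
    "julia@mail.example.edu",
    "julia@notexample.edu",
    "bob@example.edu",
):
    print(address, is_whitelisted(address, patterns))
```

Note that under this reading julia@*example.edu would also match an address such as julia@notexample.edu, since '*' is not limited to whole domain components. A narrower alternative is to list julia@example.edu and julia@*.example.edu on separate lines.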

There's also a blacklist_from keyword you can use to cause mail from certain addresses to be flagged as spam even if it doesn't match any other tests.

Adjusting the scores of particular tests

You can also adjust the score of a particular test. For instance, one of the tests is whether the subject of the message has lots of 8-bit characters. That violates the Internet email specifications (such characters are supposed to be encoded in a seven-bit format), and is very common with certain spam mail, but it also sometimes happens with legitimate non-English email sent with non-standards-compliant software. If you had a correspondent in Russia or Israel using a buggy mail program, that person's mail might be regularly flagged as spam partly due to that test. SpamAssassin calls its test for 8-bit characters 'SUBJ_ILLEGAL_CHARS', and you could turn it off by adding a line saying

score SUBJ_ILLEGAL_CHARS 0

to your .spamassassin/user_prefs file (or just lower its score from the default of 3.136). Alternatively, you can increase the scores for particular tests. You can find out the names of tests by reading the report of a message that matched a particular test, or by consulting the list at the link below.
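
To see the effect of a 'score' line, here is a Python sketch that reads such lines from the text of a user_prefs file and applies them on top of a table of default scores. The function names are hypothetical, the second rule name and its score are invented, and the sketch reads only the first number on each 'score' line.

```python
def parse_score_overrides(user_prefs_text):
    """Collect 'score RULE_NAME value' lines, ignoring comments and other settings."""
    overrides = {}
    for line in user_prefs_text.splitlines():
        words = line.split("#", 1)[0].split()
        if len(words) >= 3 and words[0] == "score":
            overrides[words[1]] = float(words[2])
    return overrides


def effective_scores(default_scores, user_prefs_text):
    """Return the default scores with the user's overrides applied."""
    scores = dict(default_scores)
    scores.update(parse_score_overrides(user_prefs_text))
    return scores


defaults = {
    "SUBJ_ILLEGAL_CHARS": 3.136,     # default quoted in this document
    "HYPOTHETICAL_OTHER_TEST": 1.5,  # invented for the example
}
prefs = "# my settings\nrequired_hits 6\nscore SUBJ_ILLEGAL_CHARS 0\n"
print(effective_scores(defaults, prefs))
```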

For more information
• The tests performed by the current version of SpamAssassin (which may be slightly different from the ones performed by the version we have installed) are documented at http://spamassassin.org/tests.html.
• Documentation on the configuration file is online (for the latest version) at http://spamassassin.org/doc/Mail_SpamAssassin_Conf.html, and you can see the same document for the version we have installed by running the command 'perldoc Mail::SpamAssassin::Conf' on a CCIS Unix machine.

(At CCIS, we're using an efficient version of SpamAssassin called "spamd" that runs as a separate server process on our mail server. For security reasons that version doesn't allow you to define your own tests in your .spamassassin/user_prefs file as described in that documentation, but you can adjust the weighting of existing rules.)
• More SpamAssassin documentation is at http://spamassassin.org/doc.html
• A good general site about spam is http://www.mail-abuse.com/
