How to set up SpamAssassin FuzzyOcr on Gentoo

After seeing an increase in image spam, I decided to add the FuzzyOcr plugin to SpamAssassin. Basically, it reads the text in an image, checks it for words or phrases that are listed as spam, and adds a score to the message when it finds them. I was surprised that I couldn't find any how-tos for Gentoo, and I ran into multiple issues setting this up, so here we go.

We need spamassassin-fuzzyocr-3.5.1-r1 to get things working. Gentoo currently has 2.3b as the stable version, so make sure you use the newer one. I added the following to /etc/portage/package.keywords:

=mail-filter/spamassassin-fuzzyocr-3.5.1-r1 ~x86
dev-perl/MLDBM-Sync ~x86
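
If you script this step, it helps to avoid adding the same entry twice when the script is re-run. A minimal sketch, assuming a POSIX shell; add_line_once is a made-up helper name, not a Portage tool:

```shell
#!/bin/sh
# add_line_once FILE LINE
# Hypothetical helper: append LINE to FILE only if FILE does not
# already contain exactly that line.
add_line_once() {
    file=$1
    line=$2
    # -q quiet, -x whole-line match, -F fixed string (no regex)
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Example, as root on the real system:
# add_line_once /etc/portage/package.keywords '=mail-filter/spamassassin-fuzzyocr-3.5.1-r1 ~x86'
# add_line_once /etc/portage/package.keywords 'dev-perl/MLDBM-Sync ~x86'
```

The same helper would work for the package.use entries, since it only compares whole lines.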

I added the following USE flags to /etc/portage/package.use:

media-libs/netpbm       jpeg jpeg2k png tiff xml zlib -jbig -rle -svga
mail-filter/spamassassin-fuzzyocr       amavis dbm gocr logrotate mysql ocrad tesseract
app-text/tesseract      tiff

There are some issues with tesseract that I had to correct later, so we can make an adjustment now to avoid them. Add the following to /etc/make.conf:

LINGUAS="en"

Then emerge the package. The -p flag only shows what would be built, so drop it once the list looks right:

emerge -pv spamassassin-fuzzyocr

If you emerge tesseract without setting LINGUAS, it creates a zero-length eng.unicharset file, and you'll see something like this in the logs:

SA warn: FuzzyOcr: Return code: 256, Error: Unable to load unicharset file /usr/share/tessdata/eng.unicharset
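
A quick way to check for this before any mail hits the scanner is to test whether the file exists and is non-empty. A small sketch; check_nonempty is a hypothetical helper name, and the path is the one from the log line above:

```shell
#!/bin/sh
# check_nonempty FILE
# Hypothetical helper: succeed only if FILE exists and has a size
# greater than zero ([ -s ] tests exactly that).
check_nonempty() {
    if [ -s "$1" ]; then
        echo "ok: $1"
    else
        echo "missing or empty: $1"
        return 1
    fi
}

# Example:
# check_nonempty /usr/share/tessdata/eng.unicharset
```

If it reports the file as empty, re-emerging tesseract with LINGUAS set is the fix described above.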

The next issues hit the ocrad scansets and their pamthreshold preprocessor. Both come down to a temp directory problem. If you see errors like these, you have it too:

Apr 23 18:30:22 comp amavis[20440]: (20440-09-2) (!)SA error: FuzzyOcr: Unable to read output from "/var/amavis/tmp/.spamassassin20440b4y50jtmp/scanset.tesseract.out.txt" for scanset tesseract
Apr 23 18:30:23 comp amavis[20440]: (20440-09-2) (!)SA error: FuzzyOcr: Error running preprocessor(pamthreshold): /usr/bin/pamthreshold -simple -threshold 0.5
Apr 23 18:30:23 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Errors in Scanset "ocrad-decolorize-invert"
Apr 23 18:30:23 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Return code: 256, Error: pamthreshold: bad magic number 0x8d8c - not a PAM, PPM, PGM, or PBM file
Apr 23 18:30:23 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Skipping scanset because of errors, trying next...
Apr 23 18:30:23 comp amavis[20440]: (20440-09-2) (!)SA error: FuzzyOcr: Error running preprocessor(pamthreshold): /usr/bin/pamthreshold -simple -threshold 0.5
Apr 23 18:30:23 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Errors in Scanset "ocrad-decolorize"
Apr 23 18:30:23 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Return code: 256, Error: pamthreshold: bad magic number 0x8d8c - not a PAM, PPM, PGM, or PBM file
Apr 23 18:30:24 comp amavis[20440]: (20440-09-2) (!)SA error: FuzzyOcr: Error running preprocessor(pamthreshold): /usr/bin/pamthreshold -simple -threshold 0.5
Apr 23 18:30:24 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Errors in Scanset "ocrad-decolorize-invert"
Apr 23 18:30:24 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Return code: 256, Error: pamthreshold: bad magic number 0x8181 - not a PAM, PPM, PGM, or PBM file
Apr 23 18:30:24 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Skipping scanset because of errors, trying next...
Apr 23 18:30:24 comp amavis[20440]: (20440-09-2) (!)SA error: FuzzyOcr: Error running preprocessor(pamthreshold): /usr/bin/pamthreshold -simple -threshold 0.5
Apr 23 18:30:24 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Errors in Scanset "ocrad-decolorize"
Apr 23 18:30:24 comp amavis[20440]: (20440-09-2) SA warn: FuzzyOcr: Return code: 256, Error: pamthreshold: bad magic number 0x8181 - not a PAM, PPM, PGM, or PBM file
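
With a busy mail log it can be easier to see which scansets are failing by counting the "Errors in Scanset" lines per name. A rough sketch, assuming the log format shown above and scanset names without spaces; failed_scansets is a hypothetical helper name:

```shell
#!/bin/sh
# failed_scansets LOGFILE
# Hypothetical helper: print "name: count" for every scanset that
# appears in a 'FuzzyOcr: Errors in Scanset "name"' log line.
failed_scansets() {
    # sed keeps only the quoted scanset name from matching lines;
    # sort | uniq -c counts them; awk reorders to "name: count".
    sed -n 's/.*FuzzyOcr: Errors in Scanset "\([^"]*\)".*/\1/p' "$1" |
        sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# Example (the log path depends on your syslog setup):
# failed_scansets /var/log/mail.log
```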

To correct this you need to apply some patches to the FuzzyOcr Perl modules:
http://bugs.gentoo.org/attachment.cgi?id=175916&action=view
http://bugs.gentoo.org/attachment.cgi?id=175917&action=view
http://bugs.gentoo.org/show_bug.cgi?id=251687

diff -ur FuzzyOcr.orig/Deanimate.pm FuzzyOcr/Deanimate.pm
--- FuzzyOcr.orig/Deanimate.pm   Sun Jan  7 19:05:18 2007
+++ FuzzyOcr/Deanimate.pm   Thu Nov 15 13:19:00 2007
@@ -8,13 +8,14 @@
use FuzzyOcr::Config qw(get_config set_config get_tmpdir);
use FuzzyOcr::Misc qw(save_execute);
use FuzzyOcr::Logging qw(errorlog warnlog infolog);
+use File::Basename qw(dirname);

# Provide functions to deanimate gifs

sub deanimate {
     my $conf = get_config();
-    my $imgdir = get_tmpdir();
     my $tfile = shift;
+    my $imgdir = dirname($tfile);
     my $efile = $tfile . ".err";
     my $tfile2 = $tfile;
     my $tfile3 = $tfile;
@@ -58,8 +59,8 @@

sub gif_info {
     my $conf = get_config();
-    my $imgdir = get_tmpdir();
     my $giffile = $_[0];
+    my $imgdir = dirname($giffile);
    
     my $fd = new IO::Handle;
    
diff -ur FuzzyOcr.orig/Preprocessor.pm FuzzyOcr/Preprocessor.pm
--- FuzzyOcr.orig/Preprocessor.pm   Sun Jan  7 19:05:18 2007
+++ FuzzyOcr/Preprocessor.pm   Thu Nov 15 12:31:05 2007
@@ -1,5 +1,7 @@
package FuzzyOcr::Preprocessor;

+use File::Basename qw(dirname);
+
sub new {
     my ($class, $label, $command, $args) = @_;

@@ -12,7 +14,7 @@

sub run {
     my ($self, $input) = @_;
-    my $tmpdir = FuzzyOcr::Config::get_tmpdir();
+    my $tmpdir = dirname($input);
     my $label = $self->{label};
     my $output = "$tmpdir/prep.$label.out";
     my $stderr = ">$tmpdir/prep.$label.err";
diff -ur FuzzyOcr.orig/Scanset.pm FuzzyOcr/Scanset.pm
--- FuzzyOcr.orig/Scanset.pm   Sun Jan  7 19:05:18 2007
+++ FuzzyOcr/Scanset.pm   Thu Nov 15 13:20:39 2007
@@ -2,6 +2,7 @@

use lib qw(..);
use FuzzyOcr::Logging qw(errorlog);
+use File::Basename qw(dirname);

sub new {
     my ($class, $label, $preprocessors, $command, $args, $output_in) = @_;
@@ -19,7 +20,7 @@
sub run {
     my ($self, $input) = @_;
     my $conf = FuzzyOcr::Config::get_config();
-    my $tmpdir = FuzzyOcr::Config::get_tmpdir();
+    my $tmpdir = dirname($input);
     my $label = $self->{label};
     my $output = "$tmpdir/scanset.$label.out";
     my $stderr = ">$tmpdir/scanset.$label.err";

and

diff -u -r FuzzyOcr-3.5.1-orig/FuzzyOcr.pm FuzzyOcr-3.5.1/FuzzyOcr.pm
--- FuzzyOcr-3.5.1-orig/FuzzyOcr.pm   2007-01-07 04:05:08.000000000 -0800
+++ FuzzyOcr-3.5.1/FuzzyOcr.pm   2007-04-17 14:21:25.000000000 -0700
@@ -146,7 +146,7 @@
             ){
             $fname = join('',@{$p->{'headers'}->{'content-id'}});
             $fname =~ s/[<>]//g;
-            $fname =~ tr/\@\$\%\&/_/s;
+            $fname =~ tr/\@\$\%\&\./_/s;
         }

         my $filename = $fname; $filename =~ tr{a-zA-Z0-9\-.}{_}cs;
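
One way to apply patches like these is to do a dry run first, so a hunk that does not match leaves the installed modules untouched. This is only a sketch: it assumes a patch program that supports --dry-run (GNU patch does), that the diffs were saved to files, and that you know where the FuzzyOcr modules live on your system. apply_patch_checked is a hypothetical helper name. Downloading the attachments from the bug links is probably safer than copying the diffs from this page, since some context lines here appear to have lost their leading space.

```shell
#!/bin/sh
# apply_patch_checked DIR PATCHFILE [STRIP]
# Hypothetical helper: apply PATCHFILE inside DIR only if a dry run
# shows that every hunk applies. STRIP is the -p level (default 1).
apply_patch_checked() {
    dir=$1
    patchfile=$2
    strip=${3:-1}
    # -N skips patches that look already applied instead of prompting.
    # The dry run changes nothing on disk.
    if patch --dry-run -N -d "$dir" -p"$strip" < "$patchfile" >/dev/null; then
        patch -N -d "$dir" -p"$strip" < "$patchfile"
    else
        echo "patch does not apply cleanly in $dir: $patchfile" >&2
        return 1
    fi
}

# Example (both paths are placeholders for your own):
# apply_patch_checked /path/to/FuzzyOcr /root/fuzzyocr-tmpdir.patch 1
```

Back up the module directory first, so you can return to the unpatched files if SpamAssassin stops linting.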

After applying these patches, I thought I had corrected all the errors, but one was still popping up. It looked like this:

Apr 24 08:11:49 comp[16615]: (16615-16) (!)SA error: FuzzyOcr: Unable to read output from "/var/amavis/tmp/.spamassassin16615WjYr4Utmp/scanset.tesseract.out.txt" for scanset tesseract
Apr 24 08:11:49 comp[16615]: (16615-16) SA warn: FuzzyOcr: Errors in Scanset "tesseract"
Apr 24 08:11:49 comp[16615]: (16615-16) SA warn: FuzzyOcr: Return code: 7936, Error: Tesseract Open Source OCR Engine
Apr 24 08:11:49 comp[16615]: (16615-16) SA warn: FuzzyOcr: name_to_image_type:Error:Unrecognized image type:/var/amavis/tmp/.spamassassin16615WjYr4Utmp/prep.maketiff.out
Apr 24 08:11:49 comp[16615]: (16615-16) SA warn: FuzzyOcr: IMAGE::read_header:Error:Can't read this image type:/var/amavis/tmp/.spamassassin16615WjYr4Utmp/prep.maketiff.out
Apr 24 08:11:49 comp[16615]: (16615-16) SA warn: FuzzyOcr: /usr/bin/tesseract:Error:Read of file failed:/var/amavis/tmp/.spamassassin16615WjYr4Utmp/prep.maketiff.out
Apr 24 08:11:49 comp[16615]: (16615-16) SA warn: FuzzyOcr: Signal_exit 31 ABORT. LocCode: 3 AbortCode: 3
Apr 24 08:11:49 comp[16615]: (16615-16) SA warn: FuzzyOcr: Skipping scanset because of errors, trying next..

I found another user who had this problem and applied the following patch to get around it:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=481383

--- Preprocessor.pm.ORIG   2008-05-15 18:24:22.000000000 +0200
+++ Preprocessor.pm   2008-05-15 18:51:03.000000000 +0200
@@ -15,6 +15,9 @@ sub run {
     my $tmpdir = FuzzyOcr::Config::get_tmpdir();
     my $label = $self->{label};
     my $output = "$tmpdir/prep.$label.out";
+    if ($label =~ /maketiff/) {
+        $output = "$tmpdir/prep.$label.tif";
+    }
     my $stderr = ">$tmpdir/prep.$label.err";

     my $stdin = undef;
--- Scanset.pm.ORIG   2008-05-15 18:56:11.000000000 +0200
+++ Scanset.pm   2008-05-15 19:03:26.000000000 +0200
@@ -63,7 +63,12 @@ sub run {
                 return ($retcode,@result);
             }
             # Input of next processor is output of last
-            $input = "$tmpdir/prep.$plabel.out";
+            # Output name of maketiff is special!
+            if ($plabel =~ /maketiff/) {
+                $input = "$tmpdir/prep.$plabel.tif";
+            } else {
+                $input = "$tmpdir/prep.$plabel.out";
+            }
         }
     }
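
The patch only changes the output file name from .out to .tif, which suggests the image data was fine and tesseract was rejecting the name. If you manage to grab a copy of a prep.maketiff output file, you can check that it really is a TIFF by looking at its first four bytes. A small sketch, assuming head and od are available; is_tiff is a hypothetical helper name:

```shell
#!/bin/sh
# is_tiff FILE
# Hypothetical helper: succeed if FILE starts with a TIFF signature,
# either "II*\0" (little-endian) or "MM\0*" (big-endian).
is_tiff() {
    # Dump the first 4 bytes as hex and strip spaces and newlines.
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    case $magic in
        49492a00|4d4d002a) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (the path is a placeholder):
# is_tiff /tmp/prep.maketiff.out && echo "TIFF data, only the name is wrong"
```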

You'll want to create a log file at /var/log/FuzzyOcr.log:

touch /var/log/FuzzyOcr.log; chmod 666 /var/log/FuzzyOcr.log

After all of this is done, lint SpamAssassin to make sure everything is happy:

spamassassin --lint

Then restart amavis, spamd, or whatever runs SpamAssassin for you. You should be able to tail FuzzyOcr.log and see mails and scan information. Keep an eye on mail.log for any SA errors or warnings.