I recently switched to Dspam for spam filtering, as SpamAssassin just wasn't doing the job. After training, Dspam's detection rates appeared good, although one type of spam, where the body is a GIF image, was missed time and time again.
Today, Dspam recognised one of these spams for the first time. Unfortunately, I do not know why or how this came about, so I cannot pass on any tips.
In discussions on the Dspam mailing lists, it was mooted that Optical Character Recognition (OCR) might be one way to address the problem. With this in mind, I ran an experiment to see whether the text could be extracted from one of these images using readily available Open Source software. The steps were:
- Copy and paste the base64-encoded image from the Dspam quarantine into a file; in this example, the file is called suspectgif.
- Decode the base64 to create a GIF stream (base64 command)
- Convert the GIF stream to a PNM stream (giftopnm – part of netpbm)
- Perform OCR on the PNM stream (gocr)
Assuming the base64-encoded data has already been written to our file, this one-line command performs the remaining steps:
cat suspectgif | base64 -d | giftopnm | gocr -
In case WordPress wraps the line above: gocr is followed by a space and then a dash, which tells gocr to read from standard input.
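The one-liner trusts that the pasted data really is a base64-encoded GIF. As a minimal sketch, the same pipeline can be wrapped in a small guard that checks the decoded data for a GIF signature (GIF87a or GIF89a) before handing it to giftopnm and gocr. The function names is_gif_base64 and ocr_suspect are my own inventions for illustration, and I am assuming a head that supports -c.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the base64 file decodes to data
# that starts with a GIF signature.
is_gif_base64() {
    # Decode and keep only the first 6 bytes, where the signature lives.
    sig=$(base64 -d < "$1" 2>/dev/null | head -c 6)
    case "$sig" in
        GIF87a|GIF89a) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper around the pipeline from the post:
# base64 -> GIF stream -> PNM stream -> OCR text.
ocr_suspect() {
    if is_gif_base64 "$1"; then
        # The trailing dash makes gocr read the PNM stream from stdin.
        base64 -d < "$1" | giftopnm | gocr -
    else
        echo "not a base64-encoded GIF: $1" >&2
        return 1
    fi
}
```

With these defined, `ocr_suspect suspectgif` should behave like the one-liner for a genuine GIF and fail early, without invoking giftopnm or gocr, for anything else.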
The result of running the command on the suspect image is shown below. Note the first two lines of the output, which refer to “garbage” placed at the end of the GIF file so that it does not give a recognisable signature. It would be interesting to know whether this output would be sufficient to train a Bayesian filter.
giftopnm: Extraneous data at end of image. Skipped to end of image
giftopnm: bogus character 0x00, ignoring
''' ATTENTION ALL INVESTORS AND DAY TRADERS '''
''' THIS COULD BE THE BIG ONE OF THE SUMMER '''
WATCH SWNM TRADE MONDAY AUG 7, 2006
wATcH swNM L :KE A HAwK sTART:NG Now:
Com_any Name_. SOUTHWESTERN MEDICAL, INC,
Stock Symbol_. SWNM
Thursday Close_. $O, lU5 (Up 400Xo)
Market Ca__. $42,OOO,OOO,OO (Approx)
SWNM , PK RELEASES 8REAKINC NEWS I
TUES AUG l, 2006
Southwestern MedIcal RetaIns the ServIces of
New World Regulatory SolutIons hr CLIA ( FDA Approvals
TAMPA, FL __ Southwestern MedIcal SolutIons, Inc, (Other OTC_,SWNM,PK)
Is pleased to announce that It has recently retaIned the pro_ssIonal servIces
of New World Regulatory SolutIons to handle _nctIons of the CLIA waIver
and related Issues concernIng the Labguard 8,,q Systems and related products
wIth the FDA,
New World Regulatory Solutions Is an InternatIonal organI_atIon that
specIalI_es In obtaInIng FDA clewances hr both domestIc and global
concerns In the In vItro dIagnostIc Industry, , ,
__ LOG ON TO FAVORITE FINANCIAL SITE FOR MORE INFO
.. THIs couLD 8E THE 8Ic oNE oF THE suMMERI
.. THIS oNE coULD Co UP UP AND AWAY I
.. ADD SWNM To YoUR RADAR NoW I
RemovaI_, _b! h DIsclaImer_, ___
Note that most of the underscores on blank lines were added by me to keep the code fragment from being mangled by WordPress.
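On the question of Bayesian training: the OCR output is noisy, but some of the noise is only letter case. A rough sketch of how the text might be reduced to tokens is below. This is not Dspam's own tokeniser; the function name ocr_tokens and the three-character minimum are assumptions of mine, chosen purely for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reduce OCR text on stdin to a sorted list of
# unique lower-case tokens of three or more characters.
ocr_tokens() {
    tr '[:upper:]' '[:lower:]' |   # fold case: "wATcH" becomes "watch"
    tr -cs 'a-z0-9' '\n' |         # split on anything that is not a letter or digit
    grep -E '.{3,}' |              # drop very short fragments such as "l" or "ke"
    LC_ALL=C sort -u               # one line per distinct token
}
```

For example, feeding it the two lines "WATCH SWNM TRADE MONDAY" and "wATcH swNM L :KE A HAwK" yields the tokens hawk, monday, swnm, trade and watch, so the mixed-case line collapses onto the same watch and swnm tokens as the clean one. Whether that is enough for a filter to learn from is still an open question.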