Monthly Archives: August 2006

The First Post Since The Last Post

Well, I’m back, and with my new blogging software. Far from complete, but reasonably workable. (Even if I have to make new posts in pure SQL…)

Features will come on stream in the near future.

Missing functionality soon to be added:

  • Comments
  • RSS Feeds
  • Trackbacks and Pings

The Last Post!

WordPress has released yet another security update. After the fun and games of recent upgrades, this is the end. Good bye cruel world, Smiffy's Blog is closing for renovations. This is the last post that will be made through the WordPress software.

The “user side” of this web log will be available for viewing until the new software is up and running – hopefully quite soon.

Having mapped most of the WordPress database schema, I am about to export all the old posts into my new test system. The first module to come on line will be the public side – then I can kill off WordPress once and for all. At this point I will be making new entries by writing SQL queries and feeding them directly into the database. (This is not as bad as it sounds, if one spends as much time writing SQL as I do.)

Spams with images: an experiment

Recently, I moved to using Dspam for spam filtering, as SpamAssassin just wasn't doing the job. Following training, Dspam detection rates appeared good, although a certain type of spam where the body was a GIF image was missed time and time again.

Today, for the first time, Dspam recognised one of these spams. Unfortunately, I do not know why or how this came about, so I cannot pass on any tips in this respect.

In discussions on the Dspam mailing lists, it was mooted that Optical Character Recognition (OCR) might be one way to address the problem. With this in mind, I performed an experiment to see whether it was possible to extract the text from one of these images, using readily available Open Source software.

  1. Copy and paste the base64-encoded image from the Dspam quarantine into a file; in this example, the file is called suspectgif.
  2. Decode the base64 to create a GIF stream (base64 command).
  3. Convert the GIF stream to a PNM stream (giftopnm – part of netpbm).
  4. Perform OCR on the PNM stream (gocr).

Assuming that we have written the base64 encoded data to our file, here is the code to do the above:

cat suspectgif | base64 -d | giftopnm | gocr -

If WordPress should wrap the above line, after gocr, there is a space then a dash to indicate that gocr should read from standard input.
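For anyone wanting to check each stage separately, the same chain can be split into small shell functions. This is only a sketch: the function names are invented for the example, and it assumes GNU base64, giftopnm and gocr are all on the PATH.

```shell
#!/bin/sh
# Sketch only: the function names are made up for this example.

# Step 2: decode the base64 text saved from quarantine into raw GIF bytes.
decode_image() {
    base64 -d < "$1"
}

# Steps 3 and 4: convert a GIF on standard input to PNM, then OCR it.
# The trailing dash tells gocr to read from standard input.
ocr_gif() {
    giftopnm | gocr -
}

# All steps chained together; equivalent to the one-liner above.
ocr_quarantined_image() {
    decode_image "$1" | ocr_gif
}
```

With these defined, `ocr_quarantined_image suspectgif` should behave like the one-liner, while `decode_image suspectgif > suspect.gif` keeps a copy of the decoded image for inspection in a viewer.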

The result of this is shown below. Note the first two lines of the output, which come from giftopnm and refer to “garbage” placed at the end of the GIF file so that it does not give a recognisable signature. It would be interesting to know if this output would be sufficient to train a Bayesian filter.


giftopnm: Extraneous data at end of image. Skipped to end of image
giftopnm: bogus character 0x00, ignoring
_
''' ATTENTION ALL INVESTORS AND DAY TRADERS '''
''' THIS COULD BE THE BIG ONE OF THE SUMMER '''
_
WATCH SWNM TRADE MONDAY AUG 7, 2006
wATcH swNM L :KE A HAwK sTART:NG Now:
_
_
Com_any Name_. SOUTHWESTERN MEDICAL, INC,
Stock Symbol_. SWNM
Thursday Close_. $O, lU5 (Up 400Xo)
Market Ca__. $42,OOO,OOO,OO (Approx)
_
SWNM , PK RELEASES 8REAKINC NEWS I
TUES AUG l, 2006
Southwestern MedIcal RetaIns the ServIces of
New World Regulatory SolutIons hr CLIA ( FDA Approvals
_
TAMPA, FL __ Southwestern MedIcal SolutIons, Inc, (Other OTC_,SWNM,PK)
Is pleased to announce that It has recently retaIned the pro_ssIonal servIces
of New World Regulatory SolutIons to handle _nctIons of the CLIA waIver
and related Issues concernIng the Labguard 8,,q Systems and related products
wIth the FDA,
_
New World Regulatory Solutions Is an InternatIonal organI_atIon that
specIalI_es In obtaInIng FDA clewances hr both domestIc and global
concerns In the In vItro dIagnostIc Industry, , ,
_
__ LOG ON TO FAVORITE FINANCIAL SITE FOR MORE INFO
.. THIs couLD 8E THE 8Ic oNE oF THE suMMERI
.. THIS oNE coULD Co UP UP AND AWAY I
.. ADD SWNM To YoUR RADAR NoW I
_
_
RemovaI_, _b! h DIsclaImer_, ___

Note that most of the underscores on blank lines were added by me to keep the code fragment from being mangled by WordPress.
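On the question of whether this output could train a Bayesian filter, one rough way to get a feel for it is to reduce the OCR text to a word-frequency list and see which tokens survive the recognition errors. The sketch below is an assumption about what such a preprocessing step might look like, not how Dspam tokenises mail; the function name ocr_tokens and the three-character minimum are arbitrary choices for illustration.

```shell
#!/bin/sh
# Sketch only: ocr_tokens is a made-up name, not part of Dspam or gocr.
#
# Reads OCR text on standard input and prints "count token" lines,
# most frequent first. Stages, in order:
#   1. fold everything to lower case (gocr mixes "WATCH" and "wATcH")
#   2. turn every run of non-alphanumeric characters into a newline,
#      which also discards the underscore padding lines
#   3. drop fragments shorter than three characters
#   4. count identical tokens and sort by frequency
ocr_tokens() {
    tr '[:upper:]' '[:lower:]' |
    tr -cs 'a-z0-9' '\n' |
    grep -E '.{3,}' |
    sort | uniq -c | sort -rn
}
```

Piping the gocr output into this function, for example by appending `| ocr_tokens` to the command line used earlier, gives a quick view of which words come through often enough to be useful. Whether that is enough for a filter to learn from remains an open question.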

Update 2006-08-07

I have received a new image spam today, this one looking like a money-laundering scheme.
Using the same technique as described above, this spam image yields this text.