Category Archives: Spam & Other Internet Pests

Project Honeypot Is Back – Sort Of

As reported recently, the Project Honeypot public web servers had been taken out of commission due to problems with insufficient infrastructure.

I have started to receive reports of visits to honeypots again, so thought I would go have a look. I was rather surprised by the statistics on a harvester caught with the aid of one of my MX records:

4 visit(s) to 0 honey pot(s)
0 message(s) resulting from harvests
525% of harvests result in messages
seen with 1 user-agent(s)

Really? I am afraid that my confidence in this project is dropping all the more when the site coders seem to lack a grasp of simple maths. (Hey, I'm not exactly Mr Numeracy, but even I can see that is not right.)
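
For the record, the sum is not a hard one. Here is a minimal Python sketch, assuming the figure is meant to be messages divided by harvest visits (that is a guess at the intended definition, and the function name is hypothetical):

```python
def harvest_message_rate(messages: int, visits: int) -> float:
    """Return the percentage of harvest visits that led to a message.

    Assumed definition: 100 * messages / visits. With no visits there is
    nothing to divide by, so the rate is reported as 0.0.
    """
    if visits == 0:
        return 0.0
    return 100.0 * messages / visits


# The figures from the harvester page: 0 messages from 4 visits.
print(harvest_message_rate(0, 4))
```

However the statistic is defined, zero messages cannot produce a rate above zero.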

I will continue to monitor the situation; to me, the project is still a great idea, management and QA issues notwithstanding.

Project Honeypot: Closed for Maintenance

I paid Project Honeypot a visit this morning to get instructions for the http:BL facility and found the site closed for maintenance.  The explanation is that there are insufficient resources available to run the honeypot system and the public site (at least that is my interpretation).  Did I, as a member, receive a mail saying “help, we’re running at our limits, can anyone mirror the site”?  No.  Is there any indication of the URIs that we need to find the documentation in the Google cache?  No.

I am thoroughly unimpressed as I was, this very morning, about to implement http:BL in an application that I am writing – it allows user-submitted URIs to be checked against a blacklist of “spamvertised” and phishing sites and the like and return a risk level that can be used to determine what to do with the submitted data.  Very useful tool – but hard to implement without the instructions and my access key.  (Even the members-only area of the site is out of commission).

Whilst I appreciate that the project may have resource issues, this smells to me like a case of management incompetence.


— Disgusted, Tunbridge Wells.

No Comment Spam – Yet

After a fortnight, it may still be early days; however, since writing the code to allow Smiffy’s Place to take comments, I have yet to deal with any Comment Spam. There are a few possible reasons for this:

  • E-mail verification is required for the first comment posted, per e-mail address; this might be too difficult to code for a robot (although I can see how to do it with about 100 lines of Perl), and too time-consuming for a cottage-industry spammer doing it by hand.
  • E-mail verification allows me to pinpoint both the IP address from which the request came and, if confirmed, the IP address of the mail host. Would spammers really want to leave such obvious tracks?
  • Smiffy’s Place is driven by unique software – writing a robot to target WordPress may be worth the time and effort, but who is going to bother for a single, small blog? Anything that does get through is far more likely to have come from someone trying to spam manually.
  • I am not about to disclose how in a public forum, but I will say that the comments programme will not just accept a POST request from anywhere.
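
The e-mail verification step in the first point can be sketched in a few lines. This is not the code that runs Smiffy's Place; it is a generic Python illustration, and the secret key, function names and token scheme are all assumptions made for the example. The idea is that the site mails out a token derived from the address, and the first comment is only published once that token comes back.

```python
import hashlib
import hmac
import secrets

# Hypothetical per-site secret; a real site would store this, not regenerate it.
SECRET_KEY = secrets.token_bytes(32)


def make_token(email: str, key: bytes = SECRET_KEY) -> str:
    """Derive a verification token from a normalised e-mail address."""
    normalised = email.strip().lower().encode("utf-8")
    return hmac.new(key, normalised, hashlib.sha256).hexdigest()


def verify_token(email: str, token: str, key: bytes = SECRET_KEY) -> bool:
    """Check a returned token against the address it was issued for."""
    expected = make_token(email, key).encode("ascii")
    # compare_digest avoids leaking how many characters matched.
    return hmac.compare_digest(expected, token.encode("utf-8"))


token = make_token("reader@example.com")
print(verify_token("reader@example.com", token))
print(verify_token("someone.else@example.com", token))
```

Because the token depends on a secret held by the server, a robot cannot work it out from the address alone; it has to be able to read the mail.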

It comes to something when the worst nuisance robot activity is actually coming from a large search engine. For the meantime, I will delay rolling-out my Phase II anti-comment-spam measures.

MSNbot – Ban the Bot, Or Not?

I am quite happy to have search engine robots visiting Smiffy’s Place, following the links and indexing the content. The majority of my robot visitors seem to do just that, and are no trouble at all.

Today, I had reason to look at the server error log for Smiffy’s Place and found some activity coming from msnbot with which I was less than pleased.

  1. It had tried to access my main images directory, despite the fact that this is banned in robots.txt
  2. Although I cannot make out the full request, my blog software had thrown an error on a URI which contained the following string: .get_permalink($post->ID). This – I believe – is actually WordPress code. Just why was a search engine robot trying to force scripting into my software?
  3. It had followed a concealed and commented out link to a honeypot for spam harvesters.

When I started documenting this, I was wondering whether msnbot is really welcome at Smiffy’s Place. Having now got to the end of the post, I am in no doubt that it is not. My robots.txt now includes the following lines:

User-agent: msnbot/1.0
Disallow: /
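
A record like this can be checked before relying on it. The sketch below uses Python's standard urllib.robotparser, which expects each line in "Field: value" form, colon included. It uses the bare token msnbot rather than msnbot/1.0; that is an assumption on my part, as parsers differ on whether a versioned token matches. The is_allowed helper is hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, path="/"):
    """Return True if user_agent may fetch path under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, path)


# Assumed form of the record: bare product token, colon after each field.
rules = [
    "User-agent: msnbot",
    "Disallow: /",
]

print(is_allowed(rules, "msnbot/1.0"))     # the robot being banned
print(is_allowed(rules, "Googlebot/2.1"))  # any other robot
```

Of course, this only says what a well-behaved robot should do with the file; it says nothing about whether msnbot will actually honour it.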

I will now watch with interest to see whether or not visits from msnbot actually cease. If not, the netblock will be finding its way into an iptables rule to keep it out of my servers once and for all.

I would love to hear the experiences of others in relation to this robot – please leave a comment against this article, or drop an e-mail to the address at the bottom of the page.

Oh, Smiffy’s Place now also has a filter to return a rude message on receipt of anything that looks like an XSS attempt. Try adding a dollar sign to the end of the URI of this post and you will see.

Unwelcome Visitors

Although it has no publicly advertised addresses, my backup Web server gets a steady stream of hits – none of them good. For those who are interested, I have created an annotated text file, detailing a couple of days of traffic.

I have been watching this closely on both of my public servers for about a week; I think I have nearly enough data to automate parsing the logs, doing whois and ptr record searches, etc.
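
As a rough idea of what the automated parsing might look like, here is a Python sketch. It assumes the logs are in the Apache common or combined format, the sample lines are invented (using documentation-only IP addresses), and the function names are hypothetical. The PTR lookup needs network access, so it is defined but not called here.

```python
import re
import socket
from collections import Counter

# Leading fields of an Apache common/combined log line (assumed format).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


def hits_per_ip(lines):
    """Count requests per client address, skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record:
            counts[record["ip"]] += 1
    return counts


def ptr_lookup(ip):
    """Return the PTR name for an address, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        return None


sample = [
    '192.0.2.10 - - [01/Aug/2006:10:00:00 +0930] "GET /phpmyadmin/ HTTP/1.0" 404 208',
    '192.0.2.10 - - [01/Aug/2006:10:00:02 +0930] "GET /admin/ HTTP/1.0" 404 202',
    '198.51.100.7 - - [01/Aug/2006:11:15:40 +0930] "CONNECT mail.example.com:25 HTTP/1.0" 405 300',
]
for ip, count in hits_per_ip(sample).most_common():
    print(ip, count)
```

The whois step is left out, since it would mean either shelling out to a whois client or speaking the protocol directly.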

Other than putting it into a database, I don’t know quite what I will do with this data at present. Still, my philosophy is "what you don’t record, you can’t analyse."

Another Address Lost

Spammers have finally started sending to [address removed]. Good luck to them, because that address is now history. Although I get a large amount of spam through my business address (which I will not post here for obvious reasons), my "sacrificial address" concept seems to work fairly well.

Regular visitors to Smiffy’s Place may notice a new contact address at the bottom of the page. At the moment, I have to set this address in the database that drives this site. My plan is to improve this process by having a single programme that changes the address in the database, removes the old address from the Postfix mail server virtual users table and inserts the new one. I will document the technique used, as I cannot see how disclosure will benefit spammers.
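
Until the real thing is written up, here is a Python sketch of the Postfix half of that plan. It assumes the virtual table is a plain-text map of "address destination" lines, which may not match the actual setup, and the function name and addresses are hypothetical. The database update is left out.

```python
def swap_virtual_address(table_text: str, old: str, new: str) -> str:
    """Return the table with the old address replaced by the new one.

    Comment lines and other entries are left untouched. Raises KeyError
    if the old address is not present, so a typo does not pass silently.
    """
    output = []
    replaced = False
    for line in table_text.splitlines():
        fields = line.split()
        is_comment = line.lstrip().startswith("#")
        if fields and not is_comment and fields[0].lower() == old.lower():
            # Keep the destination(s), change only the lookup key.
            output.append(" ".join([new] + fields[1:]))
            replaced = True
        else:
            output.append(line)
    if not replaced:
        raise KeyError(old)
    return "\n".join(output) + "\n"


table = "# contact addresses\nold@example.com smiffy\nother@example.com someone\n"
print(swap_virtual_address(table, "old@example.com", "new@example.com"))
```

With a hash-type map, the table would still need rebuilding with postmap after the file is rewritten.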

Spams with images: an experiment

Recently, I moved to using Dspam for spam filtering, as SpamAssassin just wasn't doing the job. Following training, Dspam detection rates appeared good, although a certain type of spam where the body was a GIF image was missed time and time again.

Today, Dspam recognised one of these spams, and for the first time. Unfortunately, I do not know why or how this came about, so cannot pass on any tips in this respect.

In discussions on the Dspam mailing lists, it was mooted that Optical Character Recognition (OCR) may be one way to address the problem. With this in mind, I performed an experiment to see whether it was possible to extract the text from one of these images, using readily available Open Source software.

  1. Copy and paste the base64-encoded image from the Dspam quarantine into a file; in this example, the file is called suspectgif.
  2. Decode the base64 to create a GIF stream (base64 command)
  3. Convert the GIF stream to a PNM stream (giftopnm – part of netpbm)
  4. Perform OCR on the PNM stream (gocr)

Assuming that we have written the base64 encoded data to our file, here is the code to do the above:

cat suspectgif | base64 -d | giftopnm | gocr -

If WordPress should wrap the above line, after gocr, there is a space then a dash to indicate that gocr should read from standard input.

The result of this is shown below. Note the first two lines of the output which refer to “garbage” placed at the end of the GIF file so that it does not give a recognisable signature. It would be interesting to know if this output would be sufficient to train a Bayesian filter.

giftopnm: Extraneous data at end of image. Skipped to end of image
giftopnm: bogus character 0x00, ignoring
Stock Symbol_. SWNM
Thursday Close_. $O, lU5 (Up 400Xo)
Market Ca__. $42,OOO,OOO,OO (Approx)
TUES AUG l, 2006
Southwestern MedIcal RetaIns the ServIces of
New World Regulatory SolutIons hr CLIA ( FDA Approvals
TAMPA, FL __ Southwestern MedIcal SolutIons, Inc, (Other OTC_,SWNM,PK)
Is pleased to announce that It has recently retaIned the pro_ssIonal servIces
of New World Regulatory SolutIons to handle _nctIons of the CLIA waIver
and related Issues concernIng the Labguard 8,,q Systems and related products
wIth the FDA,
New World Regulatory Solutions Is an InternatIonal organI_atIon that
specIalI_es In obtaInIng FDA clewances hr both domestIc and global
concerns In the In vItro dIagnostIc Industry, , ,
.. THIs couLD 8E THE 8Ic oNE oF THE suMMERI
RemovaI_, _b! h DIsclaImer_, ___

Note that most of the underscores on blank lines were added by me to keep the code fragment from being mangled by WordPress.
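
The same pipeline could be driven from a script, which would be the first step towards feeding the text to a filter automatically. The following Python sketch assumes giftopnm and gocr are installed and on the PATH; the function names are hypothetical, and the base64 decoding is done in Python rather than with the base64 command.

```python
import base64
import subprocess


def decode_image(b64_text: str) -> bytes:
    """Decode pasted base64 text; line breaks in the paste are ignored."""
    return base64.b64decode(b64_text)


def ocr_gif(gif_bytes: bytes) -> str:
    """Convert a GIF to PNM with giftopnm, then run gocr on the result."""
    pnm = subprocess.run(
        ["giftopnm"], input=gif_bytes, stdout=subprocess.PIPE, check=True
    ).stdout
    # The trailing "-" tells gocr to read the image from standard input.
    text = subprocess.run(
        ["gocr", "-"], input=pnm, stdout=subprocess.PIPE, check=True
    ).stdout
    return text.decode("utf-8", errors="replace")


def ocr_file(path: str) -> str:
    """Run the whole pipeline on a file of base64 text such as suspectgif."""
    with open(path, encoding="ascii") as handle:
        return ocr_gif(decode_image(handle.read()))
```

Calling ocr_file("suspectgif") should then return the recognised text as a string, with any giftopnm warnings still going to standard error.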

Update 2006-08-07

I have received a new image spam today, this one looking like a money-laundering scheme.
Using the same technique as described above, this spam image yields this text.

Another PHP Comment-despamming technique

Another method for de-spamming web log comments is illustrated in wordinbox.php. In this instance, a word is selected from a list; the user is asked to include the word in their post, the word being stripped automatically, when the form is submitted. It is assumed here that the word list used contains words with absolutely no relevance to the context in which they are used. The same could be done, incorporating relevant words into the text, obviating the need for stripping them later.
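
The logic is simple enough to sketch. The version below is in Python rather than PHP and is not the code from wordinbox.php; the word list and function names are made up for illustration.

```python
import random
import re

# Hypothetical list of words unlikely to turn up in a genuine comment.
WORDS = ["aardvark", "trombone", "quagmire", "zeppelin"]


def pick_word(rng=random):
    """Choose the word the visitor will be asked to include."""
    return rng.choice(WORDS)


def check_and_strip(comment: str, word: str):
    """Return the comment with the word removed, or None if it is absent."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    if not pattern.search(comment):
        return None  # word missing: treat the submission as spam
    stripped = pattern.sub("", comment)
    # Tidy up the double space left where the word used to be.
    return re.sub(r"[ \t]{2,}", " ", stripped).strip()


print(check_and_strip("Nice post trombone indeed", "trombone"))
print(check_and_strip("Buy cheap pills", "trombone"))
```

In a real form, the chosen word would have to be remembered on the server between showing the form and receiving the post, for example in the session.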

PHP Text Captcha

In the post A More Accessible Alternative to Graphical Capchas, I discussed a method of using a question/answer system for determining whether one is dealing with a human visitor or a naughty robot. The example was given in Perl, but I said that it would be easy to adapt for PHP. This wasn't entirely correct, as PHP has a different approach to arrays from Perl, but I have now done it.

I have posted an example implementation on this site, which also provides links to download the PHP source files.

Notes on the routine that does all the work are interspersed with the code in the file textcaptcha.php
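
For readers who just want the gist without downloading anything, the question/answer idea boils down to something like the following. This is a Python sketch, not a translation of textcaptcha.php; the questions, accepted answers and function names are invented for the example.

```python
import random

# Hypothetical questions, each with a set of accepted lower-case answers.
QUESTIONS = {
    "What colour is grass?": {"green"},
    "How many legs does a cat have?": {"4", "four"},
    "What is the opposite of hot?": {"cold"},
}


def pick_question(rng=random):
    """Choose a question to show alongside the comment form."""
    return rng.choice(list(QUESTIONS))


def is_human(question: str, answer: str) -> bool:
    """Accept the answer if it matches, ignoring case and stray spaces."""
    accepted = QUESTIONS.get(question, set())
    return answer.strip().lower() in accepted


print(is_human("How many legs does a cat have?", " Four "))
print(is_human("What colour is grass?", "blue"))
```

Keeping several accepted answers per question is what makes this kinder to humans than a graphical test, as "4" and "four" are both perfectly reasonable replies.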

Spam, Spam, Spam, Spam!

I am so glad that I have started to use my 'blog again; I had forgotten the joy that deleting all the 'spam' comments can bring!

Checking the WHOIS information, most seem to be coming from the Ukraine at present; Russian Mafia at work?

The balance of products being pushed seems to have changed; the Generic Vagra is still very much in evidence, but there seem to be a lot of mobile phone ring tones on offer as well.

This brings me to something that I cannot understand – why anyone would not only go to the trouble of downloading a ring tone for a mobile phone, but actually pay for that dubious privilege. It's a telephone – you need to know that someone is calling you, not have an uplifting aesthetic experience (a mobile phone ring tune could only be so for someone who has just spent the last decade in a sensory deprivation tank).

Please note – if you should want to comment on this post, the word 'ringtone' is taboo.

See also: De-Spamming WordPress Comments elsewhere on this web log.