The sorry state of i18n


From: spc (Sean 'Captain Napalm' Conner)
Subject: The sorry state of i18n
Date: 19:15 on 20 Mar 2006

  This stems from spam, so right off the bat it's hateful.  But other than
the spam issue, I'm not sure where to place the rest of my hate since it
crosses multiple programs across multiple platforms.

  Now, you may be asking yourself, ``Self, what does i18n have to do with
spam and software hate?''  Glad you asked (even if you didn't).

  I work at a small web hosting company and even though we're small, we get
an insane amount of spam through our network (then again, who doesn't?).  We
have a dedicated platform (read:  commercial, proprietary and expensive)
that does nothing but filter for spam---it is, in effect, a Spam Firewall. 
You point the MX record to this device and it'll scrub the incoming
email---blocking from known spammers, letting through the rest but marking
on the subject line emails that *may* be spam (and so far, it's never been
wrong when it marks an email as spam).  So, a message that comes into the
Spam Firewall as:

	Subject: Play longer!  Increase your mortgate by 3 inches!

if not outright blocked, will be slightly modified to read:

	Subject: [SPAM] Play longer!  Increase your mortgate by 3 inches!

  I'm the system administrator for said small web hosting company, and as
such, I have root's mail from each of our servers headed to my account. 
Which means I get a ton of email---log summaries, mail bounces, problem
notifications, what have you.  In order to keep from being inundated I've
set up procmail to filter and file all my incoming email.  So, it was easy
enough to set up the following rule in procmail:

	:0:
		* ^Subject: .*SPAM.*
		in-SPAM

  Never mind the obscure syntax and the difficulty in actually scanning for
a literal '['---this works well enough to send all spam-marked emails to the bit
bucket.

  But I noticed that not all marked spam was being caught.  There I am, in
mutt, and what do I see in my inbox?

	Subject: [SPAM] Play longer!  Increase your mortgate by 3 inches!

  That shouldn't be there.  Let me test something---I sent from my personal
account an email to my work account with "[SPAM]" in the subject line, and
lo, it ended up in 'in-SPAM' just like I told procmail to do.  Yet I still
get 

	Subject: [SPAM] Play longer!  Increase your mortgate by 3 inches!

  What's going on?  

  Suspecting that somehow procmail wasn't seeing the actual subject line, I
checked the incoming mail spool file directly and what do I see?

	Subject: =?ISO-8859-1?B?W1NQQU1dIA==?= =?ISO-8859-1?B?UGxheSBsb25nZXIhICBJbmNyZWFzZSB5b3VyIG1vcnRnYXRlIGJ5IDMgaW5jaGVz?=

  Aha! [1]  MIME crap! [2]  I18n crap! [3]  With varying degrees of support (or
non-support in the case of procmail).  
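
  To make the mismatch concrete, here is a minimal sketch of the decoding
step that mutt performs and procmail does not.  It assumes Python's standard
email.header module, which is not part of the setup described in this
message; the helper name decode_subject is made up for illustration.

```python
from email.header import decode_header, make_header

# The raw Subject value as it appears in the mail spool above.
RAW_SUBJECT = (
    "=?ISO-8859-1?B?W1NQQU1dIA==?= "
    "=?ISO-8859-1?B?UGxheSBsb25nZXIhICBJbmNyZWFzZSB5b3VyIG1vcnRnYXRlIGJ5IDMgaW5jaGVz?="
)


def decode_subject(value: str) -> str:
    """Decode RFC 2047 encoded-words into the text a MIME-aware reader shows.

    decode_subject is a hypothetical helper name, not anything from procmail
    or mutt.
    """
    return str(make_header(decode_header(value)))


if __name__ == "__main__":
    print(decode_subject(RAW_SUBJECT))
```

  The decoded text begins with "[SPAM] ", which is why the message looks
tagged in mutt even though the literal string "SPAM" never appears in the
raw header that procmail scans.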

  Okay, so where's the hate?

  Let's see ... the Spam Firewall?  Okay, it's nice that it can decode
encoded header lines, but *why* oh *why* does it encode "[SPAM]" if the
subject line is encoded?  Obviously you can have portions of a header encoded
and not all of it.  I'm guessing the Spam Firewall vendor can't (or probably
won't) fix this because the actual bit that does the rewriting of the
subject line is probably some third party i18n library that the Spam
Firewall uses and it's not cost effective to "fix" this particular problem,
since for most people it's not a "problem" at all.  
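
  A rough sketch of the "portions of a header" point, again assuming Python's
standard email.header module.  The "mixed" header is hypothetical (it is what
the firewall could have sent, not what it did send), and Python's re module
only stands in for procmail's own regex matching.

```python
import re
from email.header import decode_header, make_header

PAYLOAD = "UGxheSBsb25nZXIhICBJbmNyZWFzZSB5b3VyIG1vcnRnYXRlIGJ5IDMgaW5jaGVz"

# Hypothetical: the tag left as plain ASCII, only the original subject encoded.
mixed = f"Subject: [SPAM] =?ISO-8859-1?B?{PAYLOAD}?="
# What actually arrived: the tag encoded as well.
full = f"Subject: =?ISO-8859-1?B?W1NQQU1dIA==?= =?ISO-8859-1?B?{PAYLOAD}?="

# Stand-in for the original procmail condition "* ^Subject: .*SPAM.*".
PLAIN_RULE = re.compile(r"^Subject: .*SPAM.*")


def displayed(line: str) -> str:
    """Return the subject text a MIME-aware mail reader would display."""
    _name, _sep, value = line.partition(": ")
    return str(make_header(decode_header(value)))


if __name__ == "__main__":
    for line in (mixed, full):
        # First column: does the plain-text rule match the raw header?
        print(bool(PLAIN_RULE.search(line)), displayed(line))
```

  Both headers display identically, but only the mixed one is matched by the
plain-text rule, so leaving the tag unencoded would cost the reader nothing.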

  Stupid.

  Procmail?  For not supporting i18n at all?  Are there any regex engines
out there that can deal with i18n?  Does procmail need to be updated to
support MIME?

  Hate.

  Mutt?  Well ... it supports MIME and i18n, but it masked this particular
problem for a few days.  It's tempting to rip out MIME support from mutt
(since I can't stand MIME but that's an issue I have to deal with) but it
does make it difficult to deal with the occasional attachment.  Perhaps a
toggle to flip MIME support on and off ... 

  Aggravation.

  Spam?  Well, that's pure hate incarnate.

  So I dutifully add:

	# '?' is a regex quantifier, so the literal ones need a backslash
	:0:
		* ^Subject: =\?ISO-8859-1\?B\?W1NQQU1dIA==\?=.*
		in-SPAM

to .procmailrc and get on with my life, until I start seeing

	Subject: [SPAM] Play longer!  Increase your mortgate by 3 inches!

in the inbox yet again.  What now?

	Subject: =?UTF-8?B?W1NQQU1dIA==?= =?UTF-8?B?UGxheSBsb25nZXIhICBJbmNyZWFzZSB5b3VyIG1vcnRnYXRlIGJ5IDMgaW5jaGVz?=

  Sigh.
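
  One way out of the one-rule-per-charset treadmill is to decode first and
match second.  The following is only a sketch under that assumption, using
Python's standard email.header module; is_marked_spam is a made-up name, and
how such a check would be wired into procmail is left open here.

```python
from email.header import decode_header, make_header

PAYLOAD = "UGxheSBsb25nZXIhICBJbmNyZWFzZSB5b3VyIG1vcnRnYXRlIGJ5IDMgaW5jaGVz"


def is_marked_spam(subject_value: str) -> bool:
    """True if the decoded subject starts with the firewall's [SPAM] tag.

    Hypothetical helper: decodes any RFC 2047 encoded-words before looking
    for the tag, so the declared charset of the tag stops mattering.
    """
    decoded = str(make_header(decode_header(subject_value)))
    return decoded.startswith("[SPAM]")


if __name__ == "__main__":
    for charset in ("ISO-8859-1", "UTF-8"):
        subject = f"=?{charset}?B?W1NQQU1dIA==?= =?{charset}?B?{PAYLOAD}?="
        print(charset, is_marked_spam(subject))
    # An ordinary, unencoded subject for comparison.
    print("plain", is_marked_spam("Re: lunch"))
```

  The same check covers both the ISO-8859-1 and the UTF-8 variants shown
above, since the tag is compared only after decoding.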

  -spc (Actually, I think it was originally encoded in WINDOWS-1251 which is
	a whole other form of hate ... )

[1]	It's actually encoded in UTF-8 in this example---I don't have a full
	example in ISO-8859-1 but it's close enough to serve for an example.

[2]	Mostly hateful, but I can see a use for it.

[3]	Not crap at all, but I'm ranting here.
