Antispam: Improving detection of Japanese Emails

Recently, our partners in Japan reported some false positives (legitimate emails marked as spam) and false negatives (spam emails that went undetected). It turned out that our Antispam engine did not cope well with some messages written in Japanese. Fortunately, the problems were minor and easy to fix.

A large share of the messages had a Message-Id header that had been rewritten by an intermediary mail server, which made the antispam engine conclude that they were forged (pretending to have been sent by Microsoft Outlook Express). In theory, the Message-Id header should uniquely identify an email message worldwide. To make it unique, Outlook Express generates Message-Id values that follow a certain pattern. So when the engine met a rewritten Message-Id that did not look at all like one generated by Outlook Express, it treated the message as forged, and thus as spam.
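
The check described above can be sketched roughly as follows. This is only an illustration, not the engine's actual rule: the regular expression is a hypothetical placeholder for "a Message-Id that looks like Outlook Express", and reading the claimed mail client from the X-Mailer header is an assumption made for this example.

```python
import re
from email import message_from_string

# HYPOTHETICAL placeholder pattern (an assumption for this sketch, not the
# engine's rule): three '$'-separated hex groups, then '@' and a host name.
HYPOTHETICAL_OE_MESSAGE_ID = re.compile(
    r"^<[0-9a-f]+\$[0-9a-f]+\$[0-9a-f]+@[^@\s>]+>$", re.IGNORECASE
)


def claims_outlook_express(msg) -> bool:
    """True if the message says it was sent by Outlook Express.

    Assumption: the claim is read from the X-Mailer header.
    """
    return "outlook express" in (msg.get("X-Mailer") or "").lower()


def message_id_mismatch(msg) -> bool:
    """True if the message claims Outlook Express but its Message-Id
    does not match the placeholder pattern."""
    if not claims_outlook_express(msg):
        return False
    message_id = (msg.get("Message-ID") or "").strip()
    return HYPOTHETICAL_OE_MESSAGE_ID.match(message_id) is None


# A Message-Id as an intermediary server might rewrite it (made-up value).
rewritten = message_from_string(
    "X-Mailer: Microsoft Outlook Express 6.00\n"
    "Message-ID: <20060101.12345@relay.example.jp>\n"
    "\n"
    "body\n"
)
print(message_id_mismatch(rewritten))
```

Because intermediary servers can legitimately rewrite the header, a mismatch like this is better treated as one weak signal among others than as proof of forgery.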

A small percentage of the messages had a Subject header that seemed strange to us, because it used the same encoding several times. It looked like Japanese(…) – Japanese(….) – Japanese(…) instead of simply Japanese(……….). Another variant double-encodes the subject, like this: Japanese(… Japanese(….) …). After a closer review of many legitimate messages written in foreign languages, we concluded that this behavior is not that abnormal, even though the pattern is often seen in spam.
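
To make the two subject patterns concrete, here is a minimal sketch using Python's standard email.header module. It assumes that the Japanese(…) notation above stands for RFC 2047 encoded-words of the form =?charset?B?payload?=. The helper names (encoded_word, count_encoded_words, decode_subject) are made up for this example and are not part of our engine.

```python
import base64
import re
from email.header import decode_header, make_header

# RFC 2047 encoded-word: =?charset?B-or-Q?payload?=
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")


def encoded_word(text: str, charset: str = "iso-2022-jp") -> str:
    """Build one base64 encoded-word (helper for the examples below)."""
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return f"=?{charset}?B?{payload}?="


def count_encoded_words(raw_subject: str) -> int:
    """Number of encoded-words found in a raw Subject value."""
    return len(ENCODED_WORD.findall(raw_subject))


def decode_subject(raw_subject: str, max_rounds: int = 3) -> str:
    """Decode a Subject, repeating while encoded-words remain so that
    double-encoded subjects are unwrapped too (bounded by max_rounds)."""
    text = raw_subject
    for _ in range(max_rounds):
        if not ENCODED_WORD.search(text):
            break
        text = str(make_header(decode_header(text)))
    return text


# Same encoding repeated: two encoded-words instead of one.
split_subject = encoded_word("日本") + " " + encoded_word("語")
# Double encoding: an encoded-word wrapped inside another encoded-word.
nested_subject = encoded_word(encoded_word("日本語"))

print(count_encoded_words(split_subject), decode_subject(split_subject))
print(decode_subject(nested_subject))
```

Both variants decode to the same text, which is why the repetition alone says little about whether a message is spam.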

Another problem involved messages that were sent base64-encoded without the content type being specified in the header. In some cases the content could have been represented with 7-bit characters, so it did not need any encoding at all. Spammers often use this pattern to hide the message from antispam filters that cannot handle base64: instead of simply writing VIAGRA, they encode it in base64, which gives VklBR1JB. Messages in foreign languages normally do need base64, because their contents cannot be represented with ASCII characters and most foreign-language encodings need 8-bit data. The Japanese messages did not, because they use a special encoding, iso-2022-jp, which handles both ASCII and Japanese characters by means of escape sequences that switch modes while staying within 7 bits. Apparently the senders did not know that, so they encoded the messages in base64 anyway.
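
The points above are easy to verify with a few lines of Python. This is a small illustration only: the helper names is_7bit and base64_was_unnecessary are invented for this sketch, and Shift_JIS is used merely as an example of a Japanese encoding that does need 8-bit data.

```python
import base64


def is_7bit(data: bytes) -> bool:
    """True if every byte fits in 7 bits (value below 0x80)."""
    return all(byte < 0x80 for byte in data)


def base64_was_unnecessary(encoded_body: str) -> bool:
    """True if the base64 body decodes to data that was already 7-bit,
    so it could have been sent without base64."""
    return is_7bit(base64.b64decode(encoded_body))


# The spammer trick: plain ASCII hidden behind base64.
hidden = base64.b64encode(b"VIAGRA").decode("ascii")
print(hidden)  # VklBR1JB

# iso-2022-jp switches between ASCII and Japanese with escape sequences,
# so Japanese text in this encoding stays within 7 bits.
jis_bytes = "日本語".encode("iso-2022-jp")
print(is_7bit(jis_bytes))

# Shift_JIS, by contrast, uses bytes above 0x7F for the same text.
sjis_bytes = "日本語".encode("shift_jis")
print(is_7bit(sjis_bytes))

# A legitimate iso-2022-jp body that was base64-encoded anyway.
print(base64_was_unnecessary(base64.b64encode(jis_bytes).decode("ascii")))
```

A filter that decodes the base64 body before scanning sees the same bytes either way, so unnecessary base64 on its own does not separate these legitimate messages from spam.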

An interesting fact about spam emails written in Japanese is that they tend to be plain text (with charset="iso-2022-jp") and yet offer rich content. These emails contain text formatted as paragraphs, bulleted lists and ASCII art, as can be seen in the picture below.

Fig. 1: Japanese spam mails.

Extrapolating from the spam we received, it seems that more than half of the spam received by Japanese users is written in Japanese. The rest is in English.

Vlad Dinulescu
Software Engineer

Sorin Mustaca
Manager International Software Development