July 28, 2006

[SpamAssassin-JP 320] Re: [Mirai] Japanese Spamassassin Improvements

** SpamAssassin mailing list **
** Note: replies to this mail go to SpamAssassin-jp **
Dear Mr. Warren,

CCing the spamassassin-jp mailing list.

I heard from Mr. Nokubi that you are interested in enhancing
SpamAssassin's rule sets and also in improving its handling of
Japanese.

I wrote a first draft of a patch for SA that handles Japanese e-mail
well. Technically speaking, this patch is based on Myers's "charset
normalization patch" (SA Bug 4636). That patch converts various
charsets into UTF-8, but it does not support the Bayes engine (because
Myers does not use the Bayes technique). I slightly modified his patch
and added tokenization with Kakasi, a widely used dictionary-based
Japanese tokenization program. I also changed the Bayes tokenization
routine so that it handles tokenized UTF-8 text correctly.

My first patch improved (1) the productivity of writing "body" rules,
and (2) Bayes calculation for non-spam (ham). Without the patch, Bayes
tends to assign high probabilities to both spam and ham; in other words,
it cannot discriminate between the two well. With the patch, Bayes
can store "meaningful" tokens, and therefore its scores become "meaningful".
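The Bayes problem described above can be illustrated with a minimal sketch. The segmentation shown is hand-done for illustration only, not real Kakasi or MeCab output, and the tokenizer below is a toy stand-in for SA's Bayes tokenizer:

```python
# Sketch: why whitespace tokenization fails for Japanese Bayes filtering.
# Japanese text has no spaces between words, so a Western-style split
# yields one giant, unreusable "token". Pre-segmenting the text (as the
# Kakasi/MeCab step in the patch does) restores meaningful tokens that
# Bayes can accumulate statistics on.

def naive_tokens(text):
    """Split on whitespace, as a Bayes tokenizer for English would."""
    return text.split()

japanese = "未承諾広告です"       # an unsegmented Japanese phrase
segmented = "未承諾 広告 です"    # the same phrase after word segmentation

print(naive_tokens(japanese))     # a single blob token
print(naive_tokens(segmented))    # three reusable word tokens
```

With the unsegmented input, every message produces near-unique blob tokens, so Bayes never sees a token twice and its probabilities carry no information; segmentation is what makes the scores "meaningful".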

However, this patch has some problems to be resolved: Kakasi is GPLed
while SA is under the Apache license; it is not compatible with
existing "body" rules; and UTF-8 processing and tokenization cost some
performance.

Honestly, I am not an expert Perl programmer, and as the CEO of a small
company I do not have enough time or skill to improve the patch further.
So I decided to find another good programmer. This is a major reason
why I proposed the "Japan SpamAssassin User's Group."

Mr. Takizawa raised his hand, and he is now writing and maintaining a
new patch. Some of the problems mentioned above have been discussed on
the spamassassin-jp list and are almost resolved. A new test category
called "nbody" (an abbreviation of normalized-body) has been added, so
existing "body" tests remain compatible. He also added a
"normalized_charset" config option; setting it to 0 (or omitting it)
disables charset normalization and Japanese-specific tokenization
entirely. He also changed the primary tokenization program from Kakasi
to MeCab, which is LGPLed software. The newest patch, issued yesterday,
works well with SA 3.1.4.
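Based only on the option and test names described in this mail (not verified against the released patch), a local.cf fragment using the patch might look something like this; the rule name and pattern are purely illustrative:

```
# Hypothetical local.cf sketch -- option and test names as described
# in this mail, not checked against the actual patch.

normalized_charset 1        # 1 enables charset normalization and
                            # Japanese tokenization; 0 or absent disables

# An "nbody" rule matches against the normalized (UTF-8) body,
# leaving existing "body" rules untouched:
nbody  JP_EXAMPLE_RULE  /未承諾広告/
score  JP_EXAMPLE_RULE  1.0
```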


Many active members and I think that handling Japanese spam need not
be specific to users in Japan. We do not want to maintain Takizawa's
patch independently of the original SA code. In the near future, once
the patch is well tested, Takizawa and I will submit it to SA's
development community.


As for rules, Mr. Matsuda has been maintaining rule sets for Japanese
spam for years. Although his rules use the "older" body tests, I think
they are a good starting point. I have been maintaining a new ruleset
conforming to Takizawa's patch (using "nbody" tests), and it is not bad.

My personal plan is to establish a maintenance policy first, then to
modify Matsuda's rules and mine based on that policy and add new tests.
Then we will be able to publish "Japanese rulesets".


It would be my pleasure if we could discuss SA coding and ruleset
issues more closely to advance these activities.

If you have any questions or comments, please do not hesitate to ask me.

NOKUBI Takatsugu wrote:
> CCed to Motoharu Kubo, a member of SpamAssassin.JP. He has a plan to
> merge their Japanese SA patch into the upstream.
>
> I only told him the summary of Warren's first mail and the URLs, so
> he wants to hear more details about the Japanese SA improvements.
>
> Could you describe the detail to him please?

--
----------------------------------------------------------------------
Motoharu Kubo member of SpamAssassin.JP
mkubo@xxxxx CEO of ThirdWare Inc.

--
SpamAssassin mailing list
http://mm.apache.jp/mailman/listinfo/spamassassin-jp

Posted by xml-rpc : July 28, 2006 17:02