Since Total Recall is a hot topic for me right now, I'm writing this up as a memo in case it benefits someone else.
Basically, the Japanese language uses three different types of characters in writing: hiragana, katakana, and kanji.
Katakana are generally used for foreign loanwords.
Only for the purpose of extracting "relevant" segments from a TR database, it might be effective to (though there are many exceptions):
Then, a Japanese sentence like:
would be treated as:
アップル 新製品 発表 予定
Such a skeleton (which may look like the variable parts of a segment pattern put together) could be a good key for getting more relevant results (I'm not 100% sure, though; some testing would be needed).
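For anyone who wants to try this, here is a minimal sketch in Python. It assumes the rule is simply "keep runs of katakana and kanji, drop everything else (hiragana, punctuation)", which is what the skeleton above looks like. The function name `skeleton` and the sample sentence are my own illustrations, not anything taken from TR, and real text will hit the many exceptions mentioned earlier.

```python
import re

# Runs of katakana (U+30A0-U+30FF, which includes the long-vowel mark "ー")
# or runs of kanji (CJK Unified Ideographs plus the iteration mark "々").
# Assumption: hiragana and punctuation carry little weight for matching.
KEEP = re.compile(r"[\u30A0-\u30FF]+|[\u4E00-\u9FFF\u3005]+")


def skeleton(sentence: str) -> str:
    """Return the katakana/kanji runs joined by spaces; the rest is dropped."""
    return " ".join(KEEP.findall(sentence))


if __name__ == "__main__":
    # Hypothetical sample sentence, chosen only for illustration.
    print(skeleton("アップルは新製品を発表する予定です"))
    # prints: アップル 新製品 発表 予定
```

Katakana and kanji are matched as separate alternatives on purpose, so a katakana word directly followed by a kanji word still comes out as two tokens.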
Of course, the created TM should be set to "fuzzy without word separator" when applied.
A further memo.
Kuromoji looks great. And it's written in Java. Thanks! I'll do some tests.