Analysis of Japanese language

Since Total Recall is a hot topic for me right now, I'm writing this up as a memo in case it benefits someone else.

Basically, the Japanese language uses three different types of characters in writing:


  1. kanji (logographic characters of Chinese origin)
  2. hiragana (phonetic) characters
  3. katakana (phonetic) characters (a second syllabary that parallels hiragana)


Katakana characters are generally used for foreign loanwords.

Purely for the purpose of extracting "relevant" segments from a TR database, it might be effective to do the following (though there are many exceptions):

  1. Treat each hiragana character, and each run of them, as a word separator (i.e., ignore them).
  2. Treat all Japanese (double-width) punctuation marks and brackets as word separators.
  3. Treat each run of kanji characters, and each run of katakana characters, as a distinct word. Or, alternatively, treat as a word any run of characters that contains neither hiragana nor Japanese punctuation marks/brackets.

 

Then, a Japanese sentence like:


アップルは、まもなく「新製品」を発表する予定だ。


would be treated as:


アップル       新製品  発表  予定


Such a skeleton (which may look like the set of variables pulled out of a segment pattern) could be a good key for getting more relevant results (I'm not 100% sure, though; some testing would be needed).
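
For anyone who wants to try the idea, here is a minimal Python sketch of the three rules above. The function name `skeleton` and the exact set of separator characters are my own assumptions (the hiragana block, the CJK punctuation block, and a handful of fullwidth marks); a real test would probably need a longer list.

```python
import re

# Assumed separator set: whitespace, the hiragana block (U+3041-U+309F),
# CJK symbols and punctuation (U+3000-U+303F, minus U+3005 "々", which
# behaves like a kanji), and a few fullwidth punctuation marks and brackets.
SEPARATORS = re.compile(
    r"[\s\u3041-\u309F\u3000-\u3004\u3006-\u303F"
    r"\uFF01\uFF08\uFF09\uFF0C\uFF0E\uFF1A\uFF1B\uFF1F\uFF3B\uFF3D\uFF5B\uFF5D]+"
)


def skeleton(segment: str) -> list[str]:
    """Return the runs of non-separator characters (kanji, katakana, etc.)."""
    return [word for word in SEPARATORS.split(segment) if word]


print(skeleton("アップルは、まもなく「新製品」を発表する予定だ。"))
```

For the example sentence this gives the same four words as the skeleton above. Note that this follows the second reading of rule 3, so a kanji run directly followed by a katakana run stays together as one word.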


Of course, the created TM should be set to "fuzzy without word separator" when applied.



A further memo.


  • A string of two kanji characters (appearing between non-kanji characters) can almost always be treated as a single word.
  • A string of three kanji characters (ditto) can almost always be treated as a single word.
  • A string of four kanji characters (ditto) can almost always be treated as a combination of two 2-character words.
  • A string of five kanji characters (ditto) can almost always be treated as a combination of one 2-character word plus one 3-character word, or the reverse.
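
The four rules of thumb above could be sketched like this in Python. The name `candidate_splits` is hypothetical, and the function only encodes the heuristic as stated; for five characters it returns both possibilities rather than guessing, and runs of other lengths are returned unsplit because the memo says nothing about them.

```python
def candidate_splits(run: str) -> list[list[str]]:
    """Return the plausible word splits for a run of kanji characters."""
    n = len(run)
    if n == 4:
        # Two 2-character words.
        return [[run[:2], run[2:]]]
    if n == 5:
        # Either 2 + 3 or 3 + 2; the heuristic cannot tell which.
        return [[run[:2], run[2:]], [run[:3], run[3:]]]
    # Two or three characters: a single word. Other lengths: left as-is.
    return [[run]]


print(candidate_splits("東京大学"))
print(candidate_splits("新製品発表"))
```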

What a language this is!

A much simpler approach:

Insert a white space between different types of characters.
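
A rough Python sketch of this, assuming four character types (kanji, hiragana, katakana, and "other" for everything else, including punctuation). The Unicode ranges and the names `char_type` and `space_by_type` are my own choices for illustration, not anything built into the tool.

```python
from itertools import groupby


def char_type(ch: str) -> str:
    """Classify a character by script, using assumed Unicode ranges."""
    if "\u3041" <= ch <= "\u309F":
        return "hiragana"
    if "\u30A0" <= ch <= "\u30FF":
        return "katakana"
    # CJK Unified Ideographs, Extension A, and the iteration mark "々".
    if "\u4E00" <= ch <= "\u9FFF" or "\u3400" <= ch <= "\u4DBF" or ch == "\u3005":
        return "kanji"
    return "other"


def space_by_type(segment: str) -> str:
    """Insert a space wherever the character type changes."""
    chunks = ("".join(group) for _, group in groupby(segment, key=char_type))
    return " ".join(chunk for chunk in chunks if chunk.strip())


print(space_by_type("アップルは、まもなく「新製品」を発表する予定だ。"))
```

With the earlier example sentence the result is `アップル は 、 まもなく 「 新製品 」 を 発表 する 予定 だ 。`, so the usual space-based word handling can take over from there.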

 

@M: Insert a white space between different types of characters.

Hi M, maybe you could do some tests with Kuromoji and see if you get better matches in CT?
http://www.atilika.org/

Hi, Alain


Kuromoji looks great. And it's written in Java. Thanks! I'll do some tests.

Hi, Alain

I just started a topic in Tips and Tricks reporting some test results (not using Kuromoji).


Where? :-)

 

Aaargh... found it in Tips and Tricks. Sorry.

 
