In the past (not in the future!) there has been some mention of adding the Lucene tokenizers to CafeTran.
I'm currently comparing*) the matching results of the Lucene tokenizer for German as a source language and CafeTran's current Hunspell-based stemming solution, which is activated via:
The differences are interesting, hence my question: how about Lucene tokenizers?
*) For this, I've extracted a list of nearly one thousand technical terms from my background glossary, all containing the same stem. I've loaded this list as a Word document into both OmegaT and CafeTran and then created a one-word glossary containing only the word stem. When I cycle from segment to segment, I can see where the word stem is recognised in the extracted technical terms. Like I said: very interesting.
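Roughly, that experiment amounts to running every extracted term through a stemming function and seeing which ones reduce to the glossary stem. A minimal sketch of such a harness (the toy stemmer below is only a stand-in for the real Hunspell or Lucene backend; all names are my own, and the suffix rules are deliberately crude):

```python
# Sketch of the comparison: which extracted terms does a given stemmer
# map to the glossary stem? `stemmer` stands in for either backend
# (Hunspell-based or Lucene); this toy version only lowercases and
# strips a trailing German-ish "-en"/"-e".

def toy_german_stem(word: str) -> str:
    word = word.lower()
    for suffix in ("en", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def recognised_terms(terms, stem, stemmer):
    """Return the terms whose stemmed form equals the glossary stem."""
    return [t for t in terms if stemmer(t) == stem]

terms = ["Schrauben", "Schraube", "Schraubenschlüssel"]
print(recognised_terms(terms, "schraub", toy_german_stem))
# → ['Schrauben', 'Schraube']
```

Note that the compound "Schraubenschlüssel" is not recognised by plain stem equality, which is exactly the kind of difference that shows up when cycling through the segments.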
If you'd like to learn more about the Lucene tokenizers, you could watch this video:
A search on my hard disk revealed that the Lucene tokenizers are also used in Fluency Now and Wordfast Pro 4.
OmegaT uses them too.
I don't know anything about this tokenizer, but if it can improve the accuracy of "matches," I would be very willing to request it as well.
As I mentioned somewhere else before, CT currently recognizes "information" as a variation (hence, a match) of "form," for example. That is, when the source segment contains "information," the translations of "form" are displayed as matches, though this is obviously inaccurate from the translator's point of view. I suspect this may be due to the structure (rules) of the Hunspell file.
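A naive way to reproduce this behaviour is a matcher that fires whenever the glossary stem merely occurs *inside* a source word. The following is only my guess at the failure mode, not CafeTran's actual code (all function names are made up); a stricter check that compares the *stemmed* word against the stem avoids the false positive:

```python
# Hypothetical illustration of the "information" vs. "form" false match.

def naive_stem_match(glossary_stem: str, source_word: str) -> bool:
    """Loose matching: the stem only has to appear as a substring."""
    return glossary_stem.lower() in source_word.lower()

def strict_stem_match(glossary_stem: str, source_word: str, stem) -> bool:
    """Stricter matching: the *stemmed* source word must equal the stem."""
    return stem(source_word) == glossary_stem.lower()

# A toy stemmer that strips a few common English suffixes; real stemmers
# (Hunspell affix rules, Lucene's Snowball filters) are far smarter.
def toy_stem(word: str) -> str:
    word = word.lower()
    for suffix in ("ations", "ation", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

print(naive_stem_match("form", "information"))             # True  (false positive)
print(strict_stem_match("form", "information", toy_stem))  # False ("inform" != "form")
print(strict_stem_match("form", "forms", toy_stem))        # True
```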
If this issue can be solved simply by switching over to this tokenizer, it certainly would benefit all users.
Masato: ...but if it can improve the accuracy of "matches,"
It may, but:
The "O" is big. Very big. Like an omega, not an omicron.
Anyway, the quality of fuzzy matches remains a concern to me.
Masato: ... it seems that the accuracy of fuzzy matches can be improved by switching over to Lucene
Why? I think the Lucene tokeniser in OmegaT uses the Hunspell dictionary, and I wouldn't be surprised if the Hunspell spelling checker in turn uses the Lucene tokeniser.
I know nothing about Lucene, so I added: provided that it is better in this respect. If it's no better than Hunspell, this topic is of no interest to me for now. Thank you.
I think that the Lucene tokenizer and the Hunspell tokenizer are two different implementations of the tokenizer concept.
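That is my understanding too: one interface, different implementations behind it. In code terms (purely illustrative; neither class reflects real Lucene or Hunspell internals):

```python
# Illustrative only: two stemmer backends behind one interface, so an
# application such as a CAT tool could swap them per language.

class SuffixStripStemmer:
    """Strips a fixed suffix list, Hunspell-affix-style (very simplified)."""
    def __init__(self, suffixes):
        self.suffixes = suffixes

    def stem(self, word: str) -> str:
        w = word.lower()
        for s in self.suffixes:
            if w.endswith(s):
                return w[: -len(s)]
        return w

class TruncatingStemmer:
    """Crudely truncates to a fixed length, as some light stemmers do."""
    def __init__(self, length: int):
        self.length = length

    def stem(self, word: str) -> str:
        return word.lower()[: self.length]

# The same word can stem differently per backend, which is exactly why
# one backend may suit language X better and the other language Y.
a = SuffixStripStemmer(["ung", "en"]).stem("Zahlungen")  # → "zahlung"
b = TruncatingStemmer(4).stem("Zahlungen")               # → "zahl"
print(a, b)
```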
I wouldn't be surprised if L works better than H for language X and the other way around for language omicron (do not mix this up with omega or you will get a lolly).