
Lucene tokenizers

In the past (not in the future!) there has been some mention of adding the Lucene tokenizers to CafeTran.

I'm currently comparing*) matching results for both the Lucene tokenizer for German as a source language and CafeTran's current Hunspell-based stemming solution, which is activated via:

The differences are interesting, hence my question: how about Lucene tokenizers?

*) For this, I've extracted a list of nearly one thousand technical terms from my background glossary, all containing the same stem. I've loaded this list as a Word document both in OmegaT and in CafeTran and then created a one-word glossary, containing only the word stem. When I cycle from segment to segment, I can see where the word stem is being recognised in the extracted technical terms. Like I said: very interesting.
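The test described above could be approximated with a short script. This is a minimal sketch: the stem, the term list, and the matching rule (case-insensitive substring) are illustrative assumptions, not CafeTran's or Lucene's actual matching logic.

```python
# Toy version of the stem-recognition test: given a one-word glossary
# (just the stem), report which extracted terms contain that stem.
# The substring rule below is an assumption, not the real algorithm.

def stem_matches(stem, term):
    """Return True if the glossary stem occurs inside the term."""
    return stem.lower() in term.lower()

# Hypothetical sample of technical terms sharing one stem.
terms = ["Schraubendreher", "Schraubenschlüssel", "Verschraubung", "Mutter"]
stem = "schraub"

hits = [t for t in terms if stem_matches(stem, t)]
# "Mutter" does not contain the stem, so it is not reported as a hit.
```

Cycling through segments in the CAT tool then simply shows, per term, whether the stemmer's notion of a match agrees with a list like `hits`.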

If you'd like to learn more about the Lucene tokenizers, you could watch this video:

A search on my hard disk revealed that the Lucene tokenizers are also used in Fluency Now and Wordfast Pro 4.

OmegaT uses them too.


I don't know anything about this tokenizer, but if it can improve the accuracy of "matches," I would also be very much willing to request it.

As I mentioned somewhere else before, CT currently recognizes "information" as a variation (hence, a match) of "form," for example. That is, when the source segment contains "information," the translations of "form" are displayed as matches, though this is obviously inaccurate by a translator's standards. I suspect this may be due to the structure (rules) of the Hunspell file.
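The "information"/"form" false positive can be illustrated with a toy comparison. Both rules below are assumptions for illustration: a naive substring test fires on "information", while comparing whole stems (as a morphological analyzer would produce) correctly rejects it.

```python
# Why "information" can be reported as a match for "form":
# a substring test fires, whereas comparing full stems does not.
# The stem table is a toy stand-in for a Hunspell/Lucene analysis step.

def substring_match(entry, word):
    """Naive rule: the glossary entry appears anywhere in the word."""
    return entry.lower() in word.lower()

def stem_match(entry, word, stems):
    """Stricter rule: both words must reduce to the same stem."""
    return stems.get(entry.lower()) == stems.get(word.lower())

stems = {"form": "form", "forms": "form", "information": "inform"}

substring_match("form", "information")    # True  -> the false positive
stem_match("form", "information", stems)  # False -> correctly rejected
stem_match("form", "forms", stems)        # True  -> real inflection kept
```

If CT's stem look-up behaves like the first rule (or the Hunspell rules effectively collapse to it), that would explain the inaccurate match.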

If this issue can be solved simply by switching over to this tokenizer, it certainly would benefit all users.

I read that work has been done to improve the Lucene tokenizer for Japanese as a source. Why not run a quick test in oT too? It shouldn't cost you more than half an hour. After that, you'll know if the tokenizer is worthwhile for your languages. Step two could indeed be to request implementation.
JapaneseAnalyzer (Lucene + Kuromoji [Japanese morphological analyzer]) is interesting.



Masato: ...but if it can improve the accuracy of "matches,"

It may, but:

  • The tokeniser may interfere with Auto-Assembling*
  • It will slow down the Automatic Workflow considerably*
  • There doesn't seem to be a solution that works for languages that use no spaces between words
  • The TM for Fragments already boasts some kind of stemming/recognition of parts of words
[* according to Our Beloved Leader]

We've been here before. I think it was me who put it up. In fact, it was me. Igor rejected the idea, but promised to look into it. He undoubtedly did, with no results to date. I'm fine with that.

H. oT

The "o" is big. Very big. Like an omega, not an omikron.


>> if it can improve the accuracy of "matches"

I should have said "the accuracy/quality of fuzzy matches." According to the developer's note at the time of release of this fuzzy matching feature, when there is an exact term match, CT uses it without searching for fuzzy matches, and in the absence of exact matches, it performs the stem look-up.

So, it's not that the quality of the AA result may be compromised as a result of fuzzy matches being adopted instead of exact matches. The program is very skilfully designed in this respect, I believe.
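The lookup order described above (exact term match first, stem look-up only as a fallback) can be sketched as follows. The data structures and the crude stand-in stemmer are assumptions for illustration, not CT's actual implementation.

```python
# Sketch of the described lookup order: exact hits win outright;
# the stem look-up runs only when no exact match exists.

def lookup(source_term, glossary, stem):
    """Return glossary entries matching source_term, exact-first."""
    exact = [g for g in glossary if g == source_term]
    if exact:
        return exact  # exact hits win; no fuzzy/stem search is run
    return [g for g in glossary if stem(g) == stem(source_term)]

# Crude stand-in stemmer: strip a plural "s". Purely illustrative.
naive_stem = lambda w: w[:-1] if w.endswith("s") else w

glossary = ["form", "forms", "format"]
lookup("forms", glossary, naive_stem)    # -> ["forms"]  (exact hit)
lookup("formats", glossary, naive_stem)  # -> ["format"] (stem fallback)
```

Under this design, improving the stemmer only affects the fallback path, which matches the point that exact matches are never displaced by fuzzy ones.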


Anyway, the quality of fuzzy matches remains a concern to me.

I don't know how the Hunspell file is (internally) used by CT, but if CT relies on it for the definition of "fuzziness" or coverage of inflections, it seems that the accuracy of fuzzy matches can be improved by switching over to Lucene provided that it is better in this respect.


Masato: ... it seems that the accuracy of fuzzy matches can be improved by switching over to Lucene

Why? I think the Lucene tokeniser uses the Hunspell dictionary in OmegaT, whereas I wouldn't be surprised if the Hunspell spelling checker uses the Lucene tokeniser.


I know nothing about Lucene, so I added: provided that it is better in this respect. If it's no better than Hunspell, this topic is of no interest to me now. Thank you.

I think that the Lucene tokenizer and the Hunspell tokenizer are two different implementations of the tokenizer concept.

I wouldn't be surprised if L works better than H for language X and that it's the other way around for language omikron (do not mix this up with omega or you will get a lolly).
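The point that Lucene and Hunspell are two implementations of the same tokenizer/stemmer concept, which can disagree per language, can be sketched like this. Both classes are toy stand-ins, not Lucene or Hunspell code.

```python
# Two interchangeable stemmer implementations behind one interface.
# Both are illustrative toys, not the real Lucene/Hunspell analyzers.

class SuffixStripper:
    """Toy rule-based stemmer: strips a fixed suffix list."""
    SUFFIXES = ("ation", "ion", "s")

    def stem(self, word):
        for suf in self.SUFFIXES:
            if word.endswith(suf) and len(word) > len(suf) + 2:
                return word[: -len(suf)]
        return word

class TableLookup:
    """Toy dictionary-driven stemmer: looks the word up in a table."""
    TABLE = {"forms": "form", "information": "information"}

    def stem(self, word):
        return self.TABLE.get(word, word)

# The two implementations disagree on "information":
SuffixStripper().stem("information")  # -> "inform"
TableLookup().stem("information")     # -> "information"
```

Which behaviour is "better" depends on the language and the word, which is exactly why a per-language test run is worthwhile.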

More on stemmers here.

