
Q & A: Using Huspell file for fuzzy term recognition possible?

I think CT is a very nice tool: a feature-rich CAT tool that's fun to use, no doubt.


But, as I don't know about regexes, I'm still wondering how to structure my glossaries: whether or not to add every possible inflection as an alternative entry. Obviously, doing so would take a lifetime.


I know that a fragments memory can be used to get fuzzy term matches. But, for various reasons, I sometimes prefer text-based glossaries.


So, my question is: is it possible to implement a new feature using the Hunspell file in one way or another to enable fuzzy term recognition (based on prefix matches)?


I don't know its structure, but it must contain all frequently used words: singulars, plurals, and more.


Fuzzy term recognition, if possible without extra work on the user's side, would greatly reduce the user's burden.


Cheers,



Masato: is it possible to implement a new feature using the Hunspell file in one way or another to enable fuzzy term recognition (based on prefix matches)?


What you are asking for is called tokenisation. OmegaT uses it (as the only CAT tool?). CT can't use it, as it would conflict with AA, says Igor. Given the choice, I'd go for AA anytime. Tokenisation usually works on word level only. However, it may be possible to use the *.aff part of the Hunspell dictionary to arrive at all declensions/conjugations of a word, and add them to your resource. Too lazy to even try, I'm afraid.
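
For anyone less lazy, here's a rough sketch of how that could work (hypothetical file names; it leans on the unmunch tool that ships with Hunspell and on the pyhunspell bindings, not on anything in CT):

    # Sketch: expand each glossary source term to all surface forms the
    # Hunspell dictionary knows, writing one tab-delimited line per form.
    # Assumes you have first generated a full word list with unmunch:
    #     unmunch en_US.dic en_US.aff > all_forms.txt
    import hunspell                      # pyhunspell bindings
    from collections import defaultdict

    h = hunspell.HunSpell('en_US.dic', 'en_US.aff')

    # Group every unmunched surface form under its stem(s).
    forms_by_stem = defaultdict(set)
    with open('all_forms.txt', encoding='utf-8') as f:
        for line in f:
            form = line.strip()
            for stem in h.stem(form):    # may return bytes, depending on version
                key = stem.decode('utf-8') if isinstance(stem, bytes) else stem
                forms_by_stem[key].add(form)

    # Add one glossary line per inflected form of each source term.
    with open('glossary.txt', encoding='utf-8') as src, \
         open('glossary_expanded.txt', 'w', encoding='utf-8') as dst:
        for line in src:
            term, target = line.rstrip('\n').split('\t', 1)
            dst.write(f'{term}\t{target}\n')
            for form in sorted(forms_by_stem.get(term, ())):
                if form != term:
                    dst.write(f'{form}\t{target}\n')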


H.

>> CT can't use it, as it would conflict with AA, says Igor.

It makes sense. Thank you.

 

Masato: It makes sense


But so does tokenisation. I never understood why I had to enter both "house" and "houses" in my resources.


You could try "prefix matching" or "stemming". I tried it long ago and didn't like it a bit.
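
For the record, prefix matching is simple enough to sketch (a toy, not CT's implementation): a term counts as a hit when a segment word starts with it and the leftover tail is short.

    # Toy prefix matcher: a glossary term is a hit when a segment word
    # begins with it and the remaining tail is short (a crude stand-in
    # for an inflectional ending).
    def prefix_hits(segment, glossary, max_tail=3):
        words = segment.lower().split()
        hits = []
        for term, target in glossary:
            for w in words:
                if w.startswith(term.lower()) and len(w) - len(term) <= max_tail:
                    hits.append((term, target, w))
        return hits

    print(prefix_hits('Two houses were sold.', [('house', 'huis')]))
    # [('house', 'huis', 'houses')]

And it's easy to see why it disappoints: "housed" would match just as happily.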


H.

Hey Masato,


I'm not sure whether this will be of any use for your language combination, but nevertheless, here's some info.


Feel free to teach me some Japanese, it would be interesting to see how you handle this.

Thank you both for your advice.

It seems I can't find the best solution for me right away.

Cheers,

 

Hi Masato,


I will explore the possibility of optionally adding a tokenizer to tab-delimited glossaries, which would enable fuzzy matching for them.


Igor

>Q & A: Using Huspell file for fuzzy term recognition possible?


I think Hunspell would have caught that one :).

>I will explore the possibility of optionally adding a tokenizer to tab-delimited glossaries, which would enable fuzzy matching for them.


It could be very useful to have some optional tokenisation available.


Software strings in automation texts often use a limited character set to avoid 'special' characters like umlauts. Currently, I have to enter all terms twice (once for normal purposes, once for software translations). I just pasted a screenshot in this forum and noticed some fine examples of this double-entry necessity. You can deal with it via source-side alternatives, if you want:


From the screenshot, these are:


Behälterlücke;Behaelterluecke

unterfüllte;unterfuellte

Behälter;Behaelter

Heißwasserspülung;Heisswasserspuelung

Füllhöhenkontrolle;Fuellhoehenkontrolle

Füllfehler;Fuellfehler


But a tokeniser that can be activated at will would be the superior solution here.
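
Until then, the ASCII variants could at least be generated mechanically rather than typed twice; a sketch (plain Python, producing the semicolon-separated source-side alternatives shown above):

    # Sketch: derive the ASCII spelling variant of each German source
    # term, yielding 'Behälter;Behaelter' style alternatives.
    TRANS = str.maketrans({'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'ß': 'ss',
                           'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue'})

    def with_ascii_variant(term):
        ascii_form = term.translate(TRANS)
        return term if ascii_form == term else f'{term};{ascii_form}'

    for term in ['Behälterlücke', 'Füllhöhenkontrolle', 'Behälter']:
        print(with_ascii_variant(term))
    # Behälterlücke;Behaelterluecke
    # Füllhöhenkontrolle;Fuellhoehenkontrolle
    # Behälter;Behaelter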

Me: However, it may be possible to use the *.aff part of the Hunspell dictionary to arrive at all declensions/conjugations of a word, and add them to your resource.


My idea of using the Hunspell dictionary is almost the opposite of what Igor implemented in the latest build:

  • Filter your TM for fragments on single words
  • [Somehow] Use the *.aff file to arrive at all declinations/conjugations of those single words
  • Add them to the TM for fragments
  • Set the TM for fragments (now single words) to newer dups only

If you have EN house = NL huis in your original TM, this would add houses = huis in the new one. Next time you come across houses, you add the term with the correct Dutch plural huizen to the TM. Although it may be possible to use the *.aff file of the TL in combination with the one for the SL to arrive at the correct plural, it'll probably cost too much aspirin.
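
In sketch form (the forms_by_stem table is assumed to come from the unmunched *.aff expansion above):

    # Sketch of the scheme above: add one pair per source-language
    # surface form, keeping the original target until a human supplies
    # the correct one (huizen).
    def expand_pairs(pairs, forms_by_stem):
        out = dict(pairs)                    # originals win: 'newer dups only'
        for src, tgt in pairs:
            for form in forms_by_stem.get(src, ()):
                out.setdefault(form, tgt)    # houses -> huis, corrected later
        return out

    forms_by_stem = {'house': ['house', 'houses']}   # from the *.aff expansion
    print(expand_pairs([('house', 'huis')], forms_by_stem))
    # {'house': 'huis', 'houses': 'huis'}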

H.

Possibly related small problem: compounds. There must be a way to split compounds into their meaningful parts, and again, the Hunspell *.aff file can be useful, especially if there's a list of words that can form a compound, like there is for German. Like my concept above, I don't think it needs to be integrated in CT, but it would help even more than the above.
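
A naive splitter is easy to sketch, given such a word list (toy lexicon here; a real one could come from unmunch):

    # Toy German compound splitter: peel a known word off the front,
    # allowing the linking elements -s- and -es- (Fugenelemente).
    WORDS = {'konformität', 'bewertung'}     # stand-in for a real lexicon
    LINKERS = ('', 's', 'es')

    def split_compound(token):
        t = token.lower()
        if t in WORDS:
            return [t]
        for i in range(len(t) - 1, 2, -1):   # try the longest head first
            if t[:i] not in WORDS:
                continue
            for link in LINKERS:
                if t[i:i + len(link)] == link:
                    tail = split_compound(t[i + len(link):])
                    if tail:
                        return [t[:i]] + tail
        return None

    print(split_compound('Konformitätsbewertung'))
    # ['konformität', 'bewertung']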


H.

Me: ... and again, the Hunspell *.aff file can be useful


I added Konformität and Bewertung to the TM for fragments.


As you can see, CT recognises Konformität, but not Bewertung. I have no idea if it's feasible, but I can imagine CT using Hunspell to recognise the "s" of Konformitätsbewertung as belonging to Konformität, even though it isn't followed by a word boundary, so that CT can continue searching for Bewertung.
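
One way to picture it (a toy, nothing CT actually does): after a term matches at the start of a longer token, skip an optional linking "s" and keep matching from there.

    # Toy matcher for glossary terms inside a compound, skipping a
    # Fugen-s between parts (Konformität-s-bewertung).
    def terms_in_compound(token, terms):
        t, pos, found = token.lower(), 0, []
        while pos < len(t):
            for term in terms:
                if t.startswith(term, pos):
                    found.append(term)
                    pos += len(term)
                    if t.startswith('s', pos):   # optional linking element
                        pos += 1
                    break
            else:
                return []                        # unmatched remainder: give up
        return found

    print(terms_in_compound('Konformitätsbewertung', ['konformität', 'bewertung']))
    # ['konformität', 'bewertung']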


Too simplistic? Probably; there are lots of things that can go wrong. On the other hand, the same goes for "virtual matches" - I had another crazy example yesterday - but most of the time, they are useful.


H.

More on compound splitting algorithms:

http://www.aclweb.org/anthology/R11-1058


H.
