
REQ: Better TB settings

After my last rant some weeks ago I was notified that I had a wrong (let's say legacy) setting.


Now I wanted CT to recognize a term that is prone to spelling variation: Maître d'œuvre. I need to add a whole bunch of entries to cover the possible (and common) spelling errors:


Maître d'œuvre (Standard)

Maître d’œuvre ("wrong" apostrophe, very common in FR)

Maitre d'œuvre (missing circumflex)

Maître d'oeuvre (the œ)


In French there are dozens of quite typical errors.


So wouldn't it be nice to have CT ignore the difference between, e.g., different apostrophes, vowels with and without accents, or any other similar problem?


The only 2 workarounds are:

- making big TB entries (in my example above the mistakes can be combined, so the variants multiply)

- fiddling around with RegEx (after earning your personal RegEx master's degree).


Any other idea?




Did you test a tmx for terms to see if it catches the variants?

I assumed the TMX format for glossaries was no longer supported. 


And even if: How do I proceed?


Sorry for my ignorance, it is Friday ...

The current solution for catching the numerous variants of the phrase is a regular expression in your glossary. For example:


|Ma.tre d.(œ|oe)uvre


The pipe at the start marks the glossary entry as a regex.


The dot matches any single character.


(œ|oe) = œ or oe


The above regex could still be optimized, but it should work fine.
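
To see what this pattern accepts, here is a minimal sketch using Python's re module. It is only an illustration: the glossary's own regex engine may differ in details, the helper name matches is made up for this example, and the leading pipe is left out because it is the glossary marker, not part of the pattern.

```python
import re
import unicodedata

# The pattern from the glossary entry, without the leading "|" marker.
PATTERN = re.compile(r"Ma.tre d.(œ|oe)uvre")


def matches(text: str) -> bool:
    """Return True if the whole text matches the pattern.

    The text is normalized to NFC first (an assumption of this sketch),
    so that an accented letter such as "î" is one character and can be
    matched by a single dot.
    """
    return PATTERN.fullmatch(unicodedata.normalize("NFC", text)) is not None


variants = [
    "Maître d'œuvre",   # standard
    "Maître d’œuvre",   # typographic apostrophe
    "Maitre d'œuvre",   # missing circumflex
    "Maître d'oeuvre",  # "oe" instead of the ligature
]

for variant in variants:
    print(variant, matches(variant))

# The dot is permissive: it also accepts characters nobody intended.
print("Mantre d-oeuvre", matches("Mantre d-oeuvre"))
```

All four spellings from the first post match, but so does a string like "Mantre d-oeuvre", because the dot stands for any single character.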

This is great, thanks. I assumed that it would be much more complicated.

An optimized solution that lists all the possible apostrophes and accents would be much more complicated. I took a shortcut using a dot (which matches any single character).
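
For comparison, a stricter pattern for this one term could list the allowed characters explicitly instead of using a dot. The sketch below uses Python's re module purely for illustration; whether the glossary's regex engine accepts the same character-class syntax is an assumption, and the helper name matches_strict is hypothetical. The pattern stays short here only because the term has few problem spots; it grows with every accented letter.

```python
import re
import unicodedata

# Character classes replace the dots: "î" or "i", then either apostrophe.
# As a glossary entry this would presumably be written with a leading "|".
STRICT = re.compile(r"Ma[îi]tre d['’](?:œ|oe)uvre")


def matches_strict(text: str) -> bool:
    """Return True if the whole text is one of the listed spellings."""
    return STRICT.fullmatch(unicodedata.normalize("NFC", text)) is not None


for candidate in [
    "Maître d'œuvre",
    "Maitre d’oeuvre",   # several errors combined
    "Mantre d-oeuvre",   # accepted by the dot version, rejected here
]:
    print(candidate, matches_strict(candidate))
```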

Personally, I'd prefer legible source terms at all times. So I'd go for adding all possible (well, most of them) variants at the source side:


Maître d'œuvre;Maître d’œuvre;Maitre d'œuvre;Maître d'oeuvre TAB Project Manager;Projektleiter


And then I'd create a script or macro to generate the likely variants of the source term and paste them at the end. This may sound complicated, but it doesn't have to be.


With the macro, I'd:


  • Select the source term
  • Select the already typed target term
  • Press the CTRL key and click the Add new term icon to open the New term dialogue box.
  • With the cursor positioned in the source term field, I'd press the keyboard shortcut assigned to the fuzzicate macro to generate the likely variants.
Macro:
  • Copy the source term to the clipboard.
  • Run some Find and Replace actions (to replace ' with ’, î with i, é and è and ê with e, etc.) to create source term', source term'', source term''' etc.
  • Add a semicolon after the source term and paste source term';source term'';source term''' etc.
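
The macro steps above can be sketched in a few lines of Python. This is not the actual macro, only an illustration of the idea: the function names fuzzicate and glossary_line and the replacement table are assumptions, and each variant applies just one kind of replacement (combined errors are not generated).

```python
# Hypothetical find-and-replace table: correct spelling -> typical error.
REPLACEMENTS = {
    "'": "’",    # straight apostrophe -> typographic apostrophe
    "î": "i",    # missing circumflex
    "é": "e",
    "è": "e",
    "ê": "e",
    "œ": "oe",   # ligature typed as two letters
}


def fuzzicate(term: str) -> list[str]:
    """Return the correct term first, followed by one variant per replacement."""
    variants = [term]
    for old, new in REPLACEMENTS.items():
        variant = term.replace(old, new)
        if variant not in variants:  # skip replacements that changed nothing
            variants.append(variant)
    return variants


def glossary_line(source: str, targets: list[str]) -> str:
    """Build one tab-delimited glossary line with semicolon-separated terms."""
    return ";".join(fuzzicate(source)) + "\t" + ";".join(targets)


print(glossary_line("Maître d'œuvre", ["Project Manager", "Projektleiter"]))
```

For the example term this produces the same four source variants as the hand-typed entry above, with the correctly spelled term in first position.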
Besides the legibility and exactness this would have the huge advantage that you keep the correctly spelled source term at the first position, so you can always:
  • Delete the wrong source term', source term'', source term''' etc. to create a dictionary-style glossary with only correctly spelled entries.
  • Swap the glossary/use it in the opposite direction.
BTW: Did you already check what the Hunspell for French does with the source terms, when you type them in a DE > FR project? Does it recognise source term', source term'', source term''' etc.?

If so, come back with your findings and we'll talk further.

I'm happy to help you with a macro.

Additional advantage: Once you've created and optimised the macro (you'll be adding extra F/R actions in time), you can fully concentrate on translating and don't have to get distracted by the need to create complicated regular expressions during your work.

The TB is shown as it appears in the text, no matter what RegEx has been entered. But there are three good arguments for Hans:

  • Swap the glossary/use it in the opposite direction.
  • don't have to get distracted by the need to create complicated regular expressions during your work
  • (this is mine) better exchangeability with colleagues whose TB does not accept RegEx

This is what the French Hunspell (the LibreOffice extension; I assume they correspond) returns:


blob1478271141595.png


It accepts maitre (there is no such spelling in FR) and the "wrong" apostrophe, but not oeuvre.

>It accepts maitre (there is no such spelling in FR) and the "wrong" apostrophe, but not oeuvre.


This is what I was expecting. And probably Hunspell can be taught to accept oeuvre for œuvre too (by modifying the human readable Hunspell files–not as complicated as one thinks).


Okay, this is why I asked:


You could ask Igor to have Hunspell run in the source pane too, in the French variant. I'm not sure whether Hunspell can run two instances at the same time. Otherwise, you could perhaps settle for running it for the source language during the translation phase and for the target language during the reviewing phase.


Once this is operational, the next step would be to allow stemming via Hunspell for the source language too.


Et voila, Bob es ton oncle.

I've been talking nonsense here: this stemming is already present in the source box :). Must have been a bloedpropje.
