
Term recognition (reloaded)

Simple detail question:

How can I make CT recognize this term?

[screenshot]


Please note that there are different occurrences

  • l’Amérique du Sud
  • l'Amérique du Sud
  • l`Amérique du Sud

Note the difference in the apostrophes (the third case is quite rare). Which apostrophe is used depends on the program and on some other factors.
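For reference, the three variants above are three different Unicode characters. A minimal Python sketch that prints their code points and official names (the helper name `describe` is only illustrative):

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# The three apostrophe-like characters from the list above.
for ch in ["\u2019", "\u0027", "\u0060"]:
    print(describe(ch))
```

The first is the typographic (curly) apostrophe, the second the straight one, and the third is strictly speaking a grave accent that is sometimes typed in place of an apostrophe.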


If I understand correctly, I have the following options:

  • Prefix matching: this produces many false matches, depending on the glossary. As a worst case this would be acceptable, but as far as I can see it does not work for multi-word terms such as "Amérique du Sud".
  • A pipe character at the start of the word. Really? For every French word starting with a vowel?
  • Entering the term with the "l'" (in two or three apostrophe flavors). Seriously?


Perhaps I overlooked something?


The handling of terms with apostrophes concerns users translating from French, Italian, Catalan and many more languages (which I may not be aware of). I do not think this is an exotic problem.



> this argument is rather ugly


Yes, it is, but I don't really care, as such arguments are the least convincing and the first to be ignored completely.


Anyway, I think an option in Preferences to ignore (not match) user-defined morphemes might be added in the near future to cover all similar cases. Instead of hard-coding them, the user of a specific language will be able to define those few prefixes. 



Yeah, I have been getting that too: entries in my glossary are not being highlighted if they are touching a comma, apostrophe, and a number of other characters, which is a pain in the ass.


Michael



… and the same goes for Europe => Europa after an apostrophe (just in case Amérique is claimed to be too exotic or not included in Hunspell).



A few tips to match phrases with apostrophes:


1. Perform the manual search. You should find what you are looking for.

2. Keep apostrophes in your multiword glossary phrases (e.g. l'Amérique du Sud - Südamerika).

3. Load your glossary via the Memory interface (Memory > Open memory). It provides full fuzzy matching for longer phrases.


Curly apostrophes in the source segments can easily be replaced with straight ones via Edit > Find > Replace all if they really hinder automatic matching for you. The apostrophe issue could also be solved by applying fuzzier algorithms. However, users would then start complaining about having to scroll too much to locate their match among dozens of similar results. The current algorithm is tuned to strike a balance between the number of results and limited fuzziness.
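The replace-all idea can also be pictured outside the tool. The following Python sketch is only an illustration, not CafeTran code: the helpers `normalize_apostrophes` and `contains_term` are hypothetical, and the mapping assumes that only the curly apostrophe and the backtick need to be unified with the straight apostrophe.

```python
# Hypothetical helpers for illustration; not part of CafeTran.
# Map the curly apostrophe (U+2019) and the backtick (U+0060)
# to the straight apostrophe (U+0027).
APOSTROPHE_MAP = str.maketrans({"\u2019": "'", "\u0060": "'"})

def normalize_apostrophes(text: str) -> str:
    """Replace apostrophe variants with the straight apostrophe."""
    return text.translate(APOSTROPHE_MAP)

def contains_term(segment: str, term: str) -> bool:
    """Exact substring match after normalizing both sides."""
    return normalize_apostrophes(term) in normalize_apostrophes(segment)

segment = "Il a parcouru l\u2019Amérique du Sud."
print(contains_term(segment, "l'Amérique du Sud"))
```

With this kind of normalization, one glossary entry per phrase would cover all apostrophe flavors, at the cost of altering the source text or a copy of it.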



Tre: This would mean getting a bunch of segments where terms may or may not be recognized. Let's take a segment of 50 words with 6 terms (plus one with an apostrophe). I will see the six terms, but not the seventh.


In my test, both "l'Amérique du Sud" and "l’Amérique du Sud" were recognized in this QA check, so I think it should work for all exact term occurrences.


A language-specific user's dictionary can be edited in Edit > Edit user's spelling dictionary, if you load a project that uses that target language. You can install the French Hunspell dictionary even if you don't translate into that language.

> In my test, both "l'Amérique du Sud" and "l’Amérique du Sud" were recognized in this QA check, so I think it should work for all exact term occurrences.

Sure, but I was aiming at a practical case with many terms (without apostrophes), many hits and so on. And even after recognition I need to check the glossary by hand, as the hits are not displayed. Would this still be advisable?

> A language-specific user's dictionary can be edited in Edit > Edit user's spelling dictionary, if you load a project that uses that target language. You can install the French Hunspell dictionary even if you don't translate into that language.

That's all clear. I am already unconvinced by the approach that only single words from Hunspell are recognized after an apostrophe in CT. So now I should create an extra project (FR Hunspell is installed anyway) to include terms or term variants that are not in Hunspell? Seriously? Only to recognize a term behind an apostrophe? This might be okay for those of us who love fiddling around with files and playing with text editors, but not for most users. And up to now none of this is even documented (perhaps the process of documenting it would reveal how cumbersome this is).


... while even the simplistic tool OmegaT offers this feature (see above, sorry, this argument is rather ugly).

 

Okay, these are my next results:

a) I added „Amérique“ as a one-word glossary entry. This indeed works like a charm: "Amérique" is recognised. Obviously the "Look up word stems" glossary feature (provided that you have the Hunspell dictionary) only works with one-word terms (I assume Hunspell has only one-word entries in the case of French).
b) I added „U+0041“ (this is the apostrophe concerned) to Additional space characters, with a comma, as prescribed (the only other entry is the non-breaking space). But now neither Amérique nor Amérique du Sud is recognised (NB: I did not restart CT, but only moved up and down to get this result). Can we call this Pandora's box?

And now?
No chance to recognise two- or more-word glossary entries with apostrophes?

 

I forgot the last point:

c) After inserting U+0041 (the apostrophe concerned) into Additional space characters, CT does not even recognise "en Amérique du Sud". After deleting it from this field again, it is of course recognised. Does that make sense?
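A note on the code point used in b) and c): in Unicode, U+0041 is the Latin capital letter A, while the apostrophes are U+0027 (straight) and U+2019 (curly). If the field takes the code point literally, the result in c) would be expected, because the "A" of "Amérique" would then be treated as a space. The Python sketch below only illustrates this; `split_on` is a hypothetical helper, and the assumption that extra space characters simply split the text into words is not confirmed for CafeTran.

```python
import re
import unicodedata

def split_on(segment: str, extra_separators: str) -> list[str]:
    """Split on spaces plus the given extra characters (illustrative only)."""
    pattern = "[" + re.escape(" " + extra_separators) + "]+"
    return [token for token in re.split(pattern, segment) if token]

print(unicodedata.name("\u0041"))  # name of the character entered in b) and c)

# Treating U+0041 as a space breaks "Amérique" itself.
print(split_on("en Amérique du Sud", "\u0041"))

# Treating the real apostrophes as spaces separates "l" from "Amérique".
print(split_on("l'Amérique du Sud", "\u0027\u2019"))
```

Under this assumption, entering U+0027 and U+2019 instead of U+0041 would be the comparable test.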

 

Why do you do b) if a) works "like a charm" for you? I wonder what you wish to achieve by making CT treat your apostrophe as a space.


> No chance to recognise two- or more-word glossary entries with apostrophes?


It's not clear to me either. If you have a two- or three-word phrase in the glossary, then you should expect exact matching for such multiword glossary phrases. If you expect fuzzy matching, use translation memories instead.

Next round perhaps?
I inserted U+002D to play around with the small hyphen.
This is what it recognised:

[screenshot]

This had not been recognized before, and I would not even have expected it (this tag is bad luck, IMHO).

At least in this case "France-Amérique du Sud" (without tag) is recognised, which was my intention.

 

@Igor: I was doing b) because a) only recognizes one-word terms, but not three-word terms.

 

> It's not clear to me either. If you have two or three word phrase in the glossary, then you should expect exact matching for such multiword glossary phrases

 

But this is exactly the point. They are not recognized when they follow an apostrophe (as said above, the flavor of the apostrophe might differ, depending on several factors).

 

If you wish any character to be skipped during matching, add it to the "Do not match" list in Preferences. That's a fast way to remove any characters that might interfere with the matching of words or fragments.

Hmm, I checked, and the comma is in that list. I will try to see if it happens today and send in some screenshots. Here is my current list:


,.。:;!¡?¿[]"«»‘’“”„'’()


and written differently:

,

.

:

;

!

¡

?

¿

[

]

"

«

»

'

(

)

This might help in the case of commas, but it does not help with the apostrophe. Not even after a restart (for the record: in the cases above, a restart was eventually performed as well).
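One possible reading of this difference, purely as a guess: if the characters in the list are simply deleted before matching (an assumption, not documented CafeTran behaviour), a trailing comma disappears cleanly, whereas a deleted apostrophe glues the article to the noun. The Python sketch below illustrates that guess; `strip_ignored` and `words` are hypothetical helpers, and `DO_NOT_MATCH` is copied from the list posted above.

```python
# The "Do not match" list as posted above.
DO_NOT_MATCH = ",.。:;!¡?¿[]\"«»‘’“”„'’()"

def strip_ignored(text: str) -> str:
    """Delete every listed character (assumed behaviour, for illustration)."""
    return "".join(ch for ch in text if ch not in DO_NOT_MATCH)

def words(text: str) -> list[str]:
    """Split the stripped text on whitespace."""
    return strip_ignored(text).split()

# A comma after the term vanishes, so "Sud" is still a clean word.
print(words("en Amérique du Sud, où"))

# A deleted apostrophe fuses "l" and "Amérique" into one word.
print(words("l'Amérique du Sud"))
```

If the characters were replaced by a space instead of being deleted, the apostrophe case would behave like the comma case, so the observed behaviour may hint at which of the two happens.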

 

So any terms with the comma should be recognized and displayed in the Matchboard. They should also be highlighted in the source segment. There is only one current limitation to the highlighting. CafeTran does not highlight the phrase in the source segment if such a character is in the middle of the matched phrase. The Matchboard shows such a phrase just fine.   
