Term recognition (reloaded)

Simple detail question:

How can I make CT recognize this term?

image


Please note that the term occurs in different forms:

  • l’Amérique du Sud
  • l'Amérique du Sud
  • l`Amérique du Sud

Note the difference in the apostrophes (the 3rd case is quite rare). Which apostrophe is used depends on the program that produced the file and on some other factors.
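To make the difference visible, here is a small Python sketch that prints the code point and the Unicode name of the apostrophe in each of the three variants above. The helper name apostrophe_info is only illustrative, and it simply assumes the apostrophe is the second character of the string.

```python
import unicodedata

# The three variants listed above, written with escapes so the
# apostrophe flavour is unambiguous.
variants = [
    "l\u2019Amérique du Sud",  # curly apostrophe
    "l\u0027Amérique du Sud",  # straight apostrophe
    "l\u0060Amérique du Sud",  # backtick used as apostrophe
]

def apostrophe_info(text):
    """Return (code point, Unicode name) of the character after the leading 'l'."""
    ch = text[1]
    return f"U+{ord(ch):04X}", unicodedata.name(ch)

for v in variants:
    print(v, apostrophe_info(v))
```

The three strings look almost the same on screen, but to any exact string comparison they are three different terms.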


If I understand correctly, I have the following options:

  • Prefix matching: this produces many false matches, depending on the glossary. As a worst case that would be acceptable, but as far as I can see it does not work for multi-word terms such as "Amérique du Sud".
  • Pipe character at the start of the word. Really? For every French word starting with a vowel?
  • Enter the term with the "l'" (in the two or three apostrophe flavors). Seriously?


Perhaps I overlooked something?


The handling of terms with apostrophes concerns users translating from French, Italian, Catalan and many more languages (that I may not be aware of). I do not think this is an exotic problem.



… and BTW, Amérique => Amerika isn't recognised here either after an apostrophe.

 

… and the same goes for Europe => Europa after an apostrophe (just in case Amérique is claimed to be too exotic or not included in Hunspell).



So what was recognized in your "temporary success" above?

Only Amérique as a one-word term (with a term entry of its own, of course). That was here (the posting starting with "Okay, these are my next results:"). The settings should have been the same, though I now see that inserting the apostrophe into the DNM field does not change anything. Perhaps that was without a restart (can CT then be in some kind of interim state?). Porca madonna, as the Italians would say.

Maybe it would help to see if and how Amérique or Europe is recognized somehow after apostrophes.

It is really confusing. Once you claim that "Only Amérique as one-word term" is recognized, a few posts later you say it isn't.

> It is really confusing. Once you claim that "Only Amérique as one-word term" is recognized, a few posts later you say it isn't.

I can only agree (the display stopped after I re-opened the project one day later: same machine, same project, same glossary, same TM).

From a pragmatic point of view: The issue is that now it is not recognized (any more), and we both (or at least I) do not know why. Surely no kind of Marian apparition.

Perhaps Jean Dimitriadis (idlm) has an idea (if he works from French to another language).

Between your trials you seem to have played with various glossary options for matching as well as with the "Do not match" list. Perhaps restoring them to the point where it worked for you might help.

There were not that many „various glossary options for matching“ after this success:
  • Look up word stems (kept activated after the first success)
  • insert U+002D into "additional space characters" (or delete it from there)
  • insert the apostrophe (this one: ’) into DNM (or delete it from there)
  • Prefix matching can be excluded here, as it gave many unwanted results

I will test this thoroughly over the next few days.

Some more information:
With straight apostrophes (depending on the file type: often in text files, but rather rarely in Word docs), it works at least for one-word terms. Same settings as above: with "Look up word stems" it works, without it it does not.
Summing it up after investing some time:
Settings:
  • as recommended, "Look up word stems" activated and FR Hunspell installed
  • in the glossary: Amérique;Amerique - Amerika, Amérique du Sud - Südamerika, Amerique Latine - Lateinamerika, autres - andere
Straight apostrophes:
  • CT recognizes any one-word term after an apostrophe as long as it is in Hunspell
  • CT does not recognize one-word terms after an apostrophe that are not in Hunspell (here: Amerique – without accent – as a variant in the glossary)
  • CT does not recognize any term of two or more words after an apostrophe

image


Curly apostrophes:
  • CT recognizes neither one-word terms nor terms of two or more words after an apostrophe

image

Notes:
  • by accident, the first apostrophe in "C'est" in the 2nd screenshot is not a curly one
  • my TM-Town account was still open, which might explain the additional marks in the source text
  • Amerique is not correct in French, but IMHO it is good practice to include it in the glossary to catch source text mistakes and – more often – the term in capital letters, where accents are not always used


If I understand correctly, this is the current and best possible state of the glossary function in CafeTran. This means that CT becomes unusable for checking files against a client's glossary, at least with FR or any other language that uses many apostrophes (and multi-word terms are quite common in French texts).
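To make concrete what kind of recognition is being asked for here, a minimal Python sketch follows. It is not how CafeTran works internally; the prefix list ELIDED, the functions tokenize and find_terms and the small glossary dict are hypothetical names for illustration only. It also leaves out Hunspell stemming entirely and compares lower-cased words exactly.

```python
import re

# Apostrophe flavours mentioned in this thread: U+0027, U+2019, U+0060, U+02BC.
APOSTROPHES = "'\u2019\u0060\u02BC"

# Hypothetical, incomplete list of French elided prefixes (not taken from CafeTran).
ELIDED = ("l", "d", "j", "m", "t", "s", "n", "c", "qu")

def tokenize(segment):
    """Split into lower-cased words, dropping an elided prefix such as l' or d'."""
    tokens = []
    for raw in re.findall(rf"[\w{APOSTROPHES}]+", segment):
        parts = re.split(f"[{APOSTROPHES}]", raw, maxsplit=1)
        if len(parts) == 2 and parts[0].lower() in ELIDED and parts[1]:
            raw = parts[1]  # keep only the word after the apostrophe
        tokens.append(raw.lower())
    return tokens

def find_terms(segment, glossary):
    """Return (source, target) pairs whose words occur consecutively in the segment."""
    tokens = tokenize(segment)
    hits = []
    for source, target in glossary.items():
        words = tokenize(source)
        n = len(words)
        if any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1)):
            hits.append((source, target))
    return hits

glossary = {
    "Amérique": "Amerika",
    "Amerique": "Amerika",
    "Amérique du Sud": "Südamerika",
    "autres": "andere",
}

print(find_terms("L\u2019Amérique du Sud et d'autres régions", glossary))
print(find_terms("L\u2019AMERIQUE", glossary))
```

In this sketch the apostrophe flavour makes no difference, multi-word terms are found after an apostrophe, and the unaccented glossary variant catches the capitalised spelling.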

This is my test file.

 

txt
(241 Bytes)

A few tips to match phrases with apostrophes:


1. Perform the manual search. You should find what you are looking for.

2. Keep apostrophes in your multiword glossary phrases (e.g. l'Amérique du Sud - Südamerika).

3. Load your glossary via the Memory interface (Memory > Open memory). It provides full fuzzy matching for longer phrases.


Curly apostrophes in the source segments can easily be replaced with straight ones via Edit > Find > Replace all if they really hinder automatic matching for you. The apostrophe issue could also be solved by applying fuzzier algorithms. However, users would then start complaining about having to scroll through dozens of similar results to locate their match. The current algorithm is tuned to balance the number of results against a limited degree of fuzziness.
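For anyone doing that replacement outside CafeTran, the idea can be sketched in Python as below. This is only an illustration on plain strings, not CafeTran's Find/Replace; the names CURLY_BETWEEN_LETTERS and straighten are made up for this example. It replaces a typographic apostrophe only when it sits between two word characters, so a closing single quotation mark (which is also U+2019) is left untouched.

```python
import re

# Typographic apostrophes seen in this thread: U+2019, U+0060 and U+02BC.
# The lookbehind/lookahead require a word character on both sides.
CURLY_BETWEEN_LETTERS = re.compile(r"(?<=\w)[\u2019\u0060\u02BC](?=\w)")

def straighten(segment):
    """Replace in-word typographic apostrophes with the straight U+0027."""
    return CURLY_BETWEEN_LETTERS.sub("'", segment)

print(straighten("l\u2019Amérique du Sud"))
print(straighten("le mot \u2018Europe\u2019 est cité"))
```

The first call straightens the apostrophe; the second leaves the quotation marks as they are, because neither has a word character on both sides.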



I only occasionally translate from French into Greek, as I generally work on the EN/EL>FR language pairs.


I could not make extensive tests, but in my quick run:


I confirm the behavior with multiple and single word glossary/fragment entries when there is a curly apostrophe:

HTML

 

If the straight apostrophe ' is used, single word glossary/fragment entries are recognized and highlighted.


Depending on the fuzzy matching settings, this could indeed be caught by the TM for longer phrases.


With Preferences > Workflow > "Automatic selection of whole words" option enabled, I have noticed that a word is selected along with the text characters before the straight apostrophe: For example, "l'Amérique" is selected (and recognized) as one word. This does not happen with the curly apostrophe or the other more exotic one used in tre's example.


The curly apostrophe is very frequent in French, and I think in several other languages as well. So apart from the very valid suggestions above (especially the one about replacing the source apostrophes), it could be useful, at least for one-word matches, to make CafeTran recognize "l’Amérique" (with the curly apostrophe) as one word as well. Does this depend on a user-defined setting or happen at the application level?


Jean

Hm, as a mere workaround for not so important jobs, maybe, but in the long run:
  • „Manual search“ with a mandatory client glossary that contains several hundred or several thousand terms is a no-go (e.g. the case for Daimler or Volkswagen, and I assume many more companies). Sometimes I get a new, reworked version of a glossary every 2 months.
  • Apostrophes are not limited to one letter (see here, "acheter" again) and not to one symbol (there are U+0027, U+2019 and U+02BC, see also here). So how many entries have to be made for one single word? And – see above – again for a new, reworked version of a glossary every 2 months?
  • „Curly apostrophes in the source segments can be easily replaced“. Indeed they can, but they should not be in external file formats (Studio, memoQ), as it can and most probably will prevent the re-import.
  • „The apostrophe thing can also be solved by applying more fuzzy algorithms“ - but not with the current release (maybe I did not understand this correctly)?
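To put a number on the "how many entries" question, here is a rough Python sketch. The prefix list is a hypothetical and incomplete selection of French elisions, and the function name elided_variants is made up for this example; the three apostrophes are the code points named in the list above.

```python
from itertools import product

# U+0027, U+2019 and U+02BC, as named above.
APOSTROPHES = ["\u0027", "\u2019", "\u02BC"]

# Hypothetical, incomplete list of elided prefixes (l', d', j', qu', ...).
PREFIXES = ["l", "d", "j", "m", "t", "s", "n", "c", "qu"]

def elided_variants(term):
    """Build every prefix + apostrophe + term combination for one glossary term."""
    return [f"{p}{a}{term}" for p, a in product(PREFIXES, APOSTROPHES)]

variants = elided_variants("Amérique du Sud")
print(len(variants))
print(variants[:3])
```

With these 9 prefixes and 3 apostrophes that is 27 extra entries per term. Not every prefix makes sense before every word, but even l' and d' alone give 6 entries per term, to be redone with each new glossary version.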

What I do not understand: CT is able to detect many, many terms in different contexts correctly, with and within quotation marks, parentheses, tags and so on; it even finds as in jack-ass with the corresponding setting. So why not after apostrophes (which are not something peculiar to a Sub-Saharan local dialect)? Why would this lead to many false matches?


And then there is still the glitch in the Frequent words feature here.

@idlm

> it could be useful to make CafeTran recognize "l’Amérique" (with the curly apostrophe) as one word as well.
Even then, and even with a straight apostrophe, it won't detect "L’AMERIQUE", as AMERIQUE/Amerique is not in Hunspell (although it is in your glossary).

> Does this depend on a user-defined setting or happen at the application level?

CafeTran has tons of settings, but actually – if I understand correctly – this kind of recognition fails at the application level. Please correct me if I am wrong.


If you sometimes proofread against a mandatory client glossary, this could affect you too.
