
More apostrophes (more French fries, indeed)

In another (still open) thread I mentioned the annoyance with apostrophes that prevent term recognition. There is indeed another consequence of this apostrophe issue, as can be seen in the screenshot below.


[screenshot]


As you can see (or rather guess) from the screenshot, I can filter for "c'est" in the project, but without result, because CafeTran (Java) puts a straight apostrophe while the text (AFAIR from Word) contains a curly apostrophe. The "mark and search" approach should not be the only way (and it is not always viable).


There should be a kind of tolerance for this. I would not ask this for quotation marks; that is a different matter and much rarer.


I do not dare to say "other tools can do this", but it is indeed the case.


> CafeTran (Java) puts a straight apostrophe


That's not true. Neither CafeTran nor Java changes any characters. Even in your screenshot, the apostrophe is not the straight one. I really don't know what you wish to achieve. If you have that word or phrase with its apostrophe in your glossary, it will be recognized as it appears there.

You misunderstood the issue. The "c'est" in the quick search bar was entered by hand (as I wrote above, marking a piece of text and searching for it is not always viable) and it has a straight apostrophe, while the segment has a curly one.

> it is not always viable


This is always viable. Or if you prefer typing it by hand, just type it as you see it in your segments. I can't see any issue whatsoever here.

Let's see it from another perspective:
Unicode "knows" three apostrophes:
  • U+0027 ' : the typewriter apostrophe, used e.g. by Java (when entering text by hand) and by most text editors
  • U+2019 ’ : the right single quotation mark, mostly used by word processors
  • U+02BC ʼ : the so-called modifier letter apostrophe

Should this in the end mean that a user must enter a search term three times to get all variants of a perfectly typed term (I would agree that the 3rd variant might be rather rare)?
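
Just to illustrate the kind of tolerance I mean, here is a minimal sketch in Java (purely my own illustration, not how CafeTran actually works) where the curly and modifier-letter apostrophes are mapped to the typewriter one before comparing:

```java
// Minimal sketch, not CafeTran's actual implementation: map the curly and
// modifier-letter apostrophes to U+0027 before comparing search term and
// segment text, so a hand-typed "c'est" also finds "c’est".
public final class ApostropheNormalizer {

    private static String normalizeApostrophes(String text) {
        return text
                .replace('\u2019', '\'')   // U+2019 right single quotation mark
                .replace('\u02BC', '\''); // U+02BC modifier letter apostrophe
    }

    public static boolean containsIgnoringApostrophes(String segment, String searchTerm) {
        return normalizeApostrophes(segment).contains(normalizeApostrophes(searchTerm));
    }

    public static void main(String[] args) {
        String segment = "C\u2019est la vie.";  // curly apostrophe, as Word inserts it
        System.out.println(containsIgnoringApostrophes(segment, "C'est"));  // true
    }
}
```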


The user can enter it (or just select it in the source segment) exactly as it appears in your Word document. What's the point of choosing the other versions if they are not there in your project, that is, searching for something non-existent?

> What's the point of choosing the other versions if they are not there in your project, that is, searching for something non-existent?

The point is that I do not know whether it is non-existent. Users may copy text from other documents, from the internet, from text files and so on. Source texts are not perfect (and it is not my task to make them perfect, though I do point out mistakes).

And in the end it would also mean – but only if you think that entering a word or a couple of words with the apostrophe is the only viable way to get it recognized – that to enter one simple term you would need to enter it e.g.
  • once with "l" and U+0027, U+2019 and U+02BC
  • once with "d" and U+0027, U+2019 and U+02BC
  • once with "qu" and U+0027, U+2019 and U+02BC (in some cases)
This would mean 9 terms on the source side, without counting variants.
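
Instead of nine separate entries, a single pattern with a character class for the three apostrophes would cover them all. A rough sketch (my own illustration, not an existing CafeTran feature):

```java
import java.util.regex.Pattern;

public class ApostropheClassDemo {
    // One character class covering U+0027, U+2019 and U+02BC, so a single
    // pattern matches l'acheter, l’acheter and lʼacheter alike.
    private static final String APO = "[\\u0027\\u2019\\u02BC]";

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(l|d|qu)" + APO + "acheter");
        for (String candidate : new String[] {"l'acheter", "l\u2019acheter", "qu\u02BCacheter"}) {
            System.out.println(candidate + " -> " + p.matcher(candidate).matches());
        }
    }
}
```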

Admittedly this is a case that does not occur in some languages, but in some important languages it does.

You may call it "fuzziness". I will run some tests during the holidays to get these apostrophized terms caught, as prefix matching actually does not help.

You might extend this topic to some letters and quotation marks (e.g. allow a fuzzy search similar to what Google does): just imagine someone looking for "Łódź" in a document who only enters "Lodz". But this might go too far.
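
For the accent case, Java's own Normalizer could do most of the folding. A rough sketch of such an optional fuzziness (again my own illustration, not something CafeTran offers as far as I know):

```java
import java.text.Normalizer;

public class AccentFoldingDemo {
    // Illustration only: decompose accented letters (NFD) and drop the
    // combining marks. Note that Ł/ł does not decompose this way and
    // needs an explicit mapping of its own.
    static String fold(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "")  // strip combining diacritical marks
                .replace('\u0141', 'L')     // Ł
                .replace('\u0142', 'l');    // ł
    }

    public static void main(String[] args) {
        System.out.println(fold("Łódź"));  // prints Lodz
    }
}
```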

 

I guess the type of fuzziness algorithm you have in mind would produce hundreds of similar results (just like Google does) for each word in a segment during the automatic search, which in turn is rather counterproductive. I think translators value a shorter list of results with limited fuzziness more than scrolling through and checking every possible variant of a word, as you propose. Try a Google search and you will see that it provides results not only for the word you are looking for but also for dozens of similar ones. For general search purposes, a Google-like search is naturally great, but I have my doubts about the compromise in speed and conciseness in the translation context. Moreover, translators' resources are more or less uniform in style, with a few exceptions.

> I guess the type of fuzziness algorithm what you have in mind would produce hundreds of similar results

Not necessarily: e.g. find "Lodz" and "Łódź" in one go. Though I see the problems, even if it were implemented only as an option:
  • accents can be important
  • misspellings could be harder to find

But in the end, what about the apostrophes then?

Just BTW, the list of apostrophe prefixes above can be extended, even with only one word:
  • d'acheter
  • l'acheter
  • m'acheter
  • t'acheter
  • qu'acheter
  • s'acheter
That's difficult French, indeed.
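
Just to put a number on it, here is a tiny sketch (purely illustrative) that generates every prefix/apostrophe combination for this one word:

```java
public class VariantCountDemo {
    public static void main(String[] args) {
        String[] prefixes = {"d", "l", "m", "t", "qu", "s"};
        char[] apostrophes = {'\'', '\u2019', '\u02BC'};  // U+0027, U+2019, U+02BC
        int count = 0;
        for (String prefix : prefixes) {
            for (char apo : apostrophes) {
                System.out.println(prefix + apo + "acheter");
                count++;
            }
        }
        // 6 prefixes x 3 apostrophes = 18 source-side entries for one word
        System.out.println(count + " variants");
    }
}
```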

> Just BTW, the list of apostrophe prefixes above can be extended, even with only one word:


In the other thread, you said that the "Look up word stems" option, working with the Hunspell French spellchecker dictionary, solved the above for you.

This was only a temporary success (see here, just above the issue with the Frequent words feature – apostrophes again, indeed). The Preferences are at the basic settings (also in Do not match); the FR Hunspell dictionary is installed, and only "Look up word stems" is checked.

I just opened it again. No, it does not work (assuming that Amérique is in the Hunspell dictionary – and yes, it really had worked before). And "prefix matching" does not help either, with all the necessary settings. Even worse: with prefix matching, the term "Amérique du Sud" in "en Amérique du Sud" is not recognized, but when it is deactivated, it is recognized there (without an apostrophe).

And the "Amérique" thing above was only for the first word of the term, not for the term itself.

Okay, I will continue in the other thread.
