
Finetuning match values

A few days ago I made the following discovery:


image


Left side: my TM

Right side: Trados TM (customer)

Both with the same priority.


Shouldn't the hit from my TM have a slightly or much higher value? In general terms, 58 % is fair, since one word (the number) is different, but it is irritating. Most other CAT tools would offer a much higher match value.



One more:

image

Only 58 %? Why? Only one word is different, and most of its letters are the same.
I know that it would be difficult to simply ignore accents on the source side (they can make a huge difference) or to compare letter by letter, but this is a bit strange. Is there any way to fine-tune this?

 

For Latin-based languages, CafeTran compares whole words in a segment (not single characters within words, which is much slower). This is a compromise between speed, efficiency, and accuracy and, most important of all, the overall speed of dynamic matching features such as finding hits/fragments and auto-assembling them. Switching to character-based calculation would come at the cost of the overall responsiveness of the matching results.
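To illustrate the trade-off described above, here is a minimal Python sketch that compares the same pair of segments once word by word and once character by character. It uses the standard library's `difflib.SequenceMatcher` as a stand-in; the function names are hypothetical, and this is not CafeTran's actual formula, so it does not reproduce the 58 % shown in the screenshots.

```python
from difflib import SequenceMatcher


def word_similarity(a: str, b: str) -> float:
    """Similarity computed over whole words (tokens split on whitespace)."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def char_similarity(a: str, b: str) -> float:
    """Similarity computed over single characters (more comparisons per segment)."""
    return SequenceMatcher(None, a, b).ratio()


source = "1200 Chimney"
tm_hit = "1600 Chimney"

# Word level: one of two words differs, so the score is low.
print(f"word-based: {word_similarity(source, tm_hit):.0%}")
# Character level: only one digit differs, so the score is much higher.
print(f"char-based: {char_similarity(source, tm_hit):.0%}")
```

The word-based comparison looks at 2 tokens per segment here, the character-based one at 12 characters per segment, which hints at why the latter costs more on long segments.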

I understand (and I suppose that ignoring accents would cost too much RAM and time, as many characters would be affected), but it is especially the first question that intrigues me: why, in the first example, does "Chimney" have the same fuzzy match rate for "1200 Chimney" as "1600 Chimney"? Is the penalty for a missing word the same as for a different word – a number – even with the same number of digits (this would be the most obvious explanation)?

On the other hand, this was a propagated segment. Shouldn't such propagated segments with different numbers have a much higher fuzzy match rate? I very often look at this filter when starting a project, as it gives me a good first insight into homogeneity. But while some of these propagated segments work like pure magic, others do not even show up as a fuzzy match (my threshold is 50 %), in most cases rather short segments. As this filter groups the segments, clicking on the content of the last checked segment in the grid does the job for me, but shouldn't this be easier?

Is the penalty for a missing word the same as for a different word – a number – even with the same number of digits (this would be the most obvious explanation)?


Yes, the penalty is the same, as the different number is treated like a different word.


Shouldn't such propagated segments with different numbers have a much higher fuzzy match rate?


You may be right, and halving the penalty for different numbers is something to consider. My only worry is keeping it fast. Each word in a segment would need to be checked to see whether it is a number. For short segments that would be okay, but imagine the never-ending sentences in legal translations.
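A rough Python sketch of the idea discussed here: a word-based score in which a number replaced by another number costs only half a word. Everything in it (the function names, the `number_penalty` parameter, the number pattern, the scoring formula) is an assumption made for illustration, not CafeTran's implementation.

```python
import re
from difflib import SequenceMatcher

# Hypothetical definition of "a number": digits, optionally with one , or . group.
NUMBER = re.compile(r"\d+(?:[.,]\d+)?")


def is_number(token: str) -> bool:
    return NUMBER.fullmatch(token) is not None


def similarity(a: str, b: str, number_penalty: float = 0.5) -> float:
    """Word-based score: 1 minus the summed penalties over the longer word count."""
    words_a, words_b = a.split(), b.split()
    if not words_a and not words_b:
        return 1.0
    penalty = 0.0
    matcher = SequenceMatcher(None, words_a, words_b)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        left, right = words_a[i1:i2], words_b[j1:j2]
        # Words replaced one-for-one: reduced penalty if both are numbers.
        for x, y in zip(left, right):
            penalty += number_penalty if is_number(x) and is_number(y) else 1.0
        # Words missing on either side always cost a full word.
        penalty += abs(len(left) - len(right))
    return 1.0 - penalty / max(len(words_a), len(words_b))


print(similarity("1200 Chimney", "1600 Chimney", number_penalty=1.0))  # full penalty
print(similarity("1200 Chimney", "1600 Chimney"))                      # halved penalty
print(similarity("Chimney", "1200 Chimney"))                           # missing word
```

In this sketch only the words that differ are tested for being numbers, so identical words add no extra work; whether that would be fast enough inside CafeTran's matching is an open question.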

Sure, I see and I understand.

Assuming that the filter "Repeated and propagated segments" is prepared "somewhere" in the background and not spontaneous (then it would be amazingly fast), couldn't this comparison be limited to that filter?

Or, for example, only for segments of up to 5 words, because from, let's say, 6 words upwards the difference is not that big (theoretically about 16 %). Would that be feasible?
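The figure of about 16 % can be checked with a few lines of Python, assuming (as a simplification, not a statement about CafeTran's formula) that one differing word costs 1/n of the score in a segment of n words. The helper name `penalty_share` is hypothetical.

```python
def penalty_share(word_count: int, weight: float = 1.0) -> float:
    """Share of the score lost when exactly one word differs."""
    return weight / word_count


print("words  full penalty  halved penalty")
for n in range(1, 9):
    full = penalty_share(n)
    halved = penalty_share(n, weight=0.5)
    print(f"{n:>5}  {full:>12.1%}  {halved:>14.1%}")
```

Under this assumption a single different word costs about 16.7 % in a 6-word segment, and the cost keeps shrinking as segments get longer, so a reduced penalty for numbers matters mostly for short segments.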

 

See the following screenshots with longer and short sentences:

image


image


image


image


image


The example with the screw shows that, in the case of a letter-number combination (mostly product names or, here, screw sizes), the higher match rate might not be welcome (the Volvo segment is for our Dutch superuser).


The test I did was with a simple Word document, so there were no tags inside the text.


Before coming up with a tuned-up solution for numbers, why not simply lower the fuzzy match display threshold for such short technical segments (e.g. to 33 %)?

Good point. To be honest, all of the simple examples in my last post were automatically propagated, so I should be glad and not worry about the fuzzy match rate.

The next time I come up with a propagated segment that does not propagate, I will test it and come back to this (or open a new thread).

 

were automatically propagated.


Yes, the propagation is not related to the fuzzy percentage accuracy in any way.  

Anyway, you need a fuzzy match insert threshold of 50 % for that. This should be noted in the Help, as some colleagues might set this value higher (I do not know what the default value is).

Indeed, even an advanced mode for people with fast computers would not make sense.
Just another example:

image

Even with a fuzzy rate of 33 %, this propagated segment does not show up. Yes, I understand that propagation and fuzziness are not related, and the propagate feature works like a charm in many cases (in some others it does not, e.g. when the order of the numbers has been changed).

The pro: it is helpful to see this segment here, as it has the same structure as the one above. The con: it is frustrating, especially in a case like this where more rework has to be done (and other tools are able to auto-propagate this).

 

How about ignoring all the numbers from 0 to 9?
I guess you can do it by entering them in Edit > Preferences > Memory > Do not match. But I don't know how this setting will affect auto-propagation of numbers (though I assume that auto-propagation bypasses the TMs, i.e. does not use them).

 

By defining more complex numbers as non-translatables, CT should be able to propagate such segments when the project is reloaded. A non-translatable regular expression:


|\d+[,.]\d+


catches the numbers in your example.
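To see what the core of that expression matches, here is a small Python check. Two assumptions: the leading `|` is taken to be the separator that appends the pattern to an already existing non-translatable expression, so it is left out here, and the sample segments are invented, since the screenshot is not available. Python's `re` syntax may also differ in details from the regex engine CafeTran uses.

```python
import re

# Core of the expression from the post, without the leading "|".
DECIMAL_NUMBER = re.compile(r"\d+[,.]\d+")

# Hypothetical sample segments, not taken from the screenshot.
samples = [
    "Torque 12,5 Nm at 1.200 rpm",
    "Chimney 1200",
]

for segment in samples:
    print(segment, "->", DECIMAL_NUMBER.findall(segment))
```

Numbers containing a comma or a full stop are caught, while a plain integer such as 1200 is not, which fits the description of "more complex numbers".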

 
