
Tags and fuzziness

Sometimes documents contain a table of contents and then the same segment again as a heading (in some cases with a number at the end). If in one of these cases there is no space next to the tag (e.g. a tab), CT runs into trouble with MT (okay, this is less important) and with fuzziness.


Only 56 %? This is not serious, is it? The only differences are a space and the number at the end, and that the TOC entry has no space between 5 and R. Couldn't this tolerance be adjusted (or even customized, e.g. for accents, which can make a difference, but not always)?

Other tools show a much better match rate here (= quicker perception and processing by the translator), and I think CT should do so, too.

In some cases the fuzzy match rate even falls below the threshold at which matches are displayed, e.g. if there was "a" instead of "à" or the wrong apostrophe (the apostrophe is wrong here, BTW; another problem in CT is that this is not tolerated when recognizing terms).

Hi Torsten,

The current fuzzy matching implementation for segments calculates the difference at the word level (not at the character level as you propose). Therefore, in the above case, these are actually two different words. There are two reasons for this. First, processing speed: it is just much faster to analyze and compare words than individual characters. Second, the actual work you have to do to correct the segment, which in many cases involves replacing the wrong word with the correct one. However, I see your point that sometimes only one or two characters need to be changed, as in the above example. For CafeTran, it is still a one-word difference.
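To illustrate the difference between the two levels, here is a rough sketch in Python. It is not CafeTran's actual code: it uses the standard `difflib.SequenceMatcher` as a stand-in similarity measure, and the segment pair is made up to resemble a TOC entry without a space versus the heading with one.

```python
from difflib import SequenceMatcher


def word_similarity(a: str, b: str) -> float:
    """Similarity ratio computed over whitespace-separated words."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def char_similarity(a: str, b: str) -> float:
    """Similarity ratio computed over individual characters."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical segments: the TOC entry lacks the space that the heading has.
toc_entry = "9.5Results"
heading = "9.5 Results"

# Word level: "9.5Results" is one word and matches neither "9.5" nor "Results".
print(word_similarity(toc_entry, heading))            # 0.0
# Character level: only the single space differs.
print(round(char_similarity(toc_entry, heading), 2))  # 0.95
```

With this measure, a missing space costs the whole match at the word level, while at the character level the two segments are nearly identical.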


But there is a tag between 2 words (here between „5“ and „R“). Does this mean that „9.5<tag>Introduction“ and „9.5 Introduction“ have 0 % fuzziness?

What would help here is to treat two strings separated by a tag as two words instead of one (for fuzziness and for MT, as is the case in other tools).

In my humble opinion, the case of e.g. „9.5<tag>Introduction“ (in abstract terms: tags such as tabs with no space before or after them) is much more frequent than „terr<tag>able tagged doc<tag>ument“, where the tags should be ignored (but I assume they aren't).

In this particular case, the fix is to correct the filter for this document type so that it indicates that the tag does separate two words (as opposed to "doc<tag>ument"-type tags or unknown tags). You might submit a ticket and attach a source document so that it can be looked into.


I do not think that a ticket is necessary here.

I would be glad to have the filter (not the SRX rules) changed so that

- a sequence of <number><tag><letter or at least 2-word-string> is seen as two entities

- any sequence with a tag inside a word is split into two words if the tag contains a kind of non-breaking space, a tab or a hyphen (obviously CT does not recognize all of them „as is“).
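The two rules above could be sketched roughly like this in Python. This is only an illustration of the requested behaviour, not how the CT filter works: the tag names (`tab`, `nbsp`, `hyphen`), the simple `<name>` placeholder syntax and the function `normalize` are all assumptions made for the example.

```python
import re

# Hypothetical placeholder syntax: <name>, e.g. <tab> or <b>.
TAG = re.compile(r"<(\w+)>")

# Hypothetical names for tags that stand for a separator character.
SEPARATOR_TAGS = {"tab", "nbsp", "hyphen"}


def normalize(segment: str) -> str:
    """Replace tags by a space where they separate words, else drop them."""

    def replace(match: re.Match) -> str:
        before = segment[:match.start()]
        after = segment[match.end():]
        # Rule 2: a tag standing for a tab, non-breaking space or hyphen
        # always separates two words.
        if match.group(1) in SEPARATOR_TAGS:
            return " "
        # Rule 1: <number><tag><letter> is seen as two entities.
        if before[-1:].isdigit() and after[:1].isalpha():
            return " "
        # Any other tag inside a word is ignored, so the word stays whole.
        return ""

    # Collapse runs of whitespace left over after the replacement.
    return " ".join(TAG.sub(replace, segment).split())


print(normalize("9.5<tab>Introduction"))            # 9.5 Introduction
print(normalize("terr<b>able tagged doc<i>ument"))  # terrable tagged document
```

After such a normalization, „9.5<tab>Introduction“ and „9.5 Introduction“ yield the same word list, so a word-level comparison would no longer count them as different.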
