
Subsegment to Auto threshold and Subsegment to Virtual threshold

I'm playing with the CT settings again. After reading the wiki, I still do not understand the following.


How do they relate to each other, in terms of correct guessing and AA?

If I set Subsegment to Auto threshold to 0, does this mean that CT will never use subsegment in AA?

How does increasing Subsegment to Virtual threshold increase the probability of CT guessing correctly?

More importantly, how can I delete CT's wrong guess of subsegment translation from its memory? There are a lot of them now.

I'm currently using the default setting.


Thank you in advance!

Kwang




Dear Igor,

The hits feature had long been the most difficult part of CT for me to understand.

Now, it's not, and I'm very happy.

Thanks for your clear explanation.

 Cheers

Masato

Hi Masato,


1. An example with minimal length difference set at 80%


ABCDEFGHIJ (10-character source word)


CafeTran accepts the following target length possibilities:


ABCDEFGH = 80% (minimal length difference)

ABCDEFGHI = 90%

ABCDEFGHIJ = 100%

ABCDEFGHIJK = 90%

ABCDEFGHIJKL = 80% (minimal length difference)


2. Yes, your description is correct.
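
The length check in example 1 can be sketched in Python. This is only an illustration, not CafeTran's actual code: the names `length_similarity` and `accepted` are hypothetical, and the formula (100% minus the relative length difference) is an assumption inferred from the percentages listed above.

```python
def length_similarity(source: str, target: str) -> float:
    """Return 1 - |len(target) - len(source)| / len(source), floored at 0.

    Assumed formula: it reproduces the 80% / 90% / 100% figures in the
    example above, but it is not taken from CafeTran itself.
    """
    if not source:
        raise ValueError("source must not be empty")
    relative_diff = abs(len(target) - len(source)) / len(source)
    return max(0.0, 1.0 - relative_diff)


def accepted(source: str, target: str, minimal: float = 0.80) -> bool:
    """True if the target length is within the allowed range."""
    # A small epsilon guards against floating-point rounding at the boundary.
    return length_similarity(source, target) >= minimal - 1e-9


if __name__ == "__main__":
    source = "ABCDEFGHIJ"
    for target in ("ABCDEFG", "ABCDEFGH", "ABCDEFGHIJ", "ABCDEFGHIJKL", "ABCDEFGHIJKLM"):
        print(len(target), accepted(source, target))
```

With the setting at 80%, this sketch accepts targets of 8 to 12 characters for a 10-character source and rejects anything shorter or longer.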


Igor

Hi Igor,


Two more questions about this feature. I want to give my website visitors accurate information so that they know what this feature is really doing.


1. Subsegment minimal length difference (%)


For example,


ABCDE (5-character word)

WXYZ (4-character word)


In this case, "subsegment length difference" is 20% (?). If so, lowering the minimal length difference to zero, for example, means finding words of exactly the same length, too. Am I correct?



2. Statistical approach


Does CT make subsegment guesses like below?


S: CafeTran is a CAT tool.

T: ABCDDDKATtoorrr.


S: Trados is one of popular CAT tools.

T: FJEJDDDKATtrrr.


So, when there is a sufficient number of TUs containing "CAT" in the source and "KAT" in the target that can be compared against one another, CT strikes out differentials and picks up a common (frequently appearing) string of characters as a probable subsegment.


Am I correct?


Cheers

Masato

Hi Masato,


Yes, it is possible. One of the tuning options for the target hits is "Subsegment minimal length difference" in Edit > Options > Memory tab. As an English and Japanese term pair may differ significantly in length, try lowering this setting for your language pair.


Igor

Hi Igor,


I'm asking this question because in most (not all) cases, the whole target segment (Japanese) is shown as a "guess" as follows.



So, first I thought that the "hits" function is designed to display TM target segments whose source segments contain certain words appearing in the current source segment, rather than finding possible subsegment pairs that could ultimately be used for auto-assembling.


Is it possible, if there is a large enough TM, for CT to pick up a certain part of the target segment as a possible Japanese equivalent for, say, "the cost of"?


Thanks always,

Masato

Hi Masato,


CafeTran uses two steps to detect hits for source subsegments in TM:


1. By looking for exact fragments in the TM. It should always produce an exact match for a given source subsegment.

2. By analyzing the frequencies of hits on the source and target sides. This is a statistical approach, and the results get more accurate as the number of hits increases. The hits are shown in different colors that indicate their accuracy. The higher the number of source hits, the higher the probability that the target hit is accurate. Japanese hits are analyzed at the character level, while languages with a defined word separator are analyzed at the word level.
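
The statistical step (2) can be illustrated with a toy co-occurrence count at the character level. This is a rough sketch of the general idea only, not CafeTran's algorithm: `char_ngrams` and `guess_target` are hypothetical names, the n-gram limits are arbitrary, and the third TU in the usage example is invented for illustration.

```python
from __future__ import annotations

from collections import Counter


def char_ngrams(text: str, n_min: int = 2, n_max: int = 6) -> set[str]:
    """All distinct character n-grams of text with n_min <= n <= n_max."""
    return {
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    }


def guess_target(subsegment: str, tm: list[tuple[str, str]]) -> tuple[str, int] | None:
    """Guess a target fragment for `subsegment` from (source, target) TUs.

    Returns the fragment and the number of TUs that support it, or None
    when no TU source contains the subsegment.
    """
    targets = [target for source, target in tm if subsegment in source]
    if not targets:
        return None
    counts = Counter()
    for target in targets:
        counts.update(char_ngrams(target))  # a set, so each TU votes once
    # Prefer the fragment found in the most TUs; break ties by length.
    best = max(counts, key=lambda g: (counts[g], len(g), g))
    return best, counts[best]


if __name__ == "__main__":
    tm = [
        ("CafeTran is a CAT tool.", "ABCDDDKATtoorrr."),
        ("Trados is one of popular CAT tools.", "FJEJDDDKATtrrr."),
        ("A CAT tool helps.", "XYKATZ."),  # invented TU for illustration
    ]
    print(guess_target("CAT", tm))
```

With only the first two TUs, the guess still includes surrounding characters that happen to be shared by both targets. Once the third TU is added, "KAT" is the longest fragment common to all three, which matches the point that accuracy improves as the number of hits grows.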


Igor 

A question about hits.


I understand hits are about subsegments or parts of a source sentence, but I don't understand how CT can identify subsegments of TM target sentences that correspond to those source subsegments (especially when Japanese is involved).


Is the hits feature mainly designed for language pairs with a word separator (space)?


Peace,

Masato

Thank you!

BTW, why do I not have blue hits, but I have magenta, orange, and yellow hits?


Blue hits turn to yellow when you set the dark theme.


Igor

Subsegment to Virtual threshold tells the program that after that number of hits, it will treat the subsegment the same as an exact fragment. It then becomes a sort of 'virtual' or 'guessed' exact fragment.


How can I allow CT to show only a subsegment/to treat it as a hit only when it contains more than, let's say, 5 words?


See Edit > Options > Memory > Minimal subsegment length (in characters).


 > By enabling only Fuzzy, will CT still use subsegment matches in AA?


No.


Is it possible to not allow CT to use subsegment matches in AA?


Set the Subsegment to Auto threshold very high. Then CT will pick only 'sure' candidates for AA.
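
One way to picture the two thresholds is as minimum hit counts. The sketch below is an assumption-laden illustration, not CafeTran's implementation: `classify_hit` is a hypothetical name, and treating both thresholds as "greater than or equal to" comparisons on the hit count is a guess based on the descriptions in this thread.

```python
def classify_hit(hit_count: int, auto_threshold: int, virtual_threshold: int) -> dict:
    """Describe how a guessed subsegment might be treated.

    Assumption: each threshold is the minimum number of hits required.
    """
    return {
        # Candidate for auto-assembling (AA) once enough hits support it.
        "used_in_auto_assembling": hit_count >= auto_threshold,
        # Treated like an exact fragment ("virtual" fragment) after enough hits.
        "treated_as_exact_fragment": hit_count >= virtual_threshold,
    }


if __name__ == "__main__":
    # A very high Auto threshold keeps weakly supported guesses out of AA.
    print(classify_hit(hit_count=3, auto_threshold=1000, virtual_threshold=5))
    print(classify_hit(hit_count=12, auto_threshold=10, virtual_threshold=5))
```

Under this reading, raising the Auto threshold far above the hit counts that occur in practice keeps guessed subsegments out of AA.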


Igor

Hi Igor,


So it seems I have to increase Subsegment to Virtual threshold to increase the probability of accurate guesses.


Other questions now arise.

How can I allow CT to show only a subsegment/to treat it as a hit only when it contains more than, let's say, 5 words? 

By enabling only Fuzzy, will CT still use subsegment matches in AA?

Is it possible to not allow CT to use subsegment matches in AA?


BTW, why do I not have blue hits, but I have magenta, orange, and yellow hits?


Kwang

Hi Masato


They appear in the pane where your fuzzy matches/fragments are shown.

You have to use Matching Type: Fuzzy and Hits, though, for that TM.


Kwang

Sorry to cut in, but I haven't recognized these colors.


Where do they appear?


Masato

Hi Kwang,


Yes, the 'guessing' results may vary depending on the language pair. The accuracy of hits should increase as your TM grows, since CT has more data to analyze. The blue hits are low-accuracy "guesses", the orange ones are medium-accuracy, and the purple ones are the highest-accuracy hits. Hover the mouse over a hit number and you will see the full context of the 'guess'.


Igor

Hi Igor,


Thank you.

I've come across that. Unfortunately, I still don't understand how I can achieve what I want.

I want CT to only show the yellow hits (i.e. the ones with full target text) as most of the time the orange hits are wrong, and I don't want CT to use any guessed subsegment in AA.


Kwang
