
Subsegment to Auto threshold and Subsegment to Virtual threshold

I'm playing with the CT settings again. After reading the wiki, I still do not understand the following.


How do they relate to each other, in terms of correct guessing and AA?

If I set Subsegment to Auto threshold to 0, does this mean that CT will never use subsegment in AA?

How does increasing Subsegment to Virtual threshold increase the probability of CT guessing correctly?

More importantly, how can I delete CT's wrong guesses of subsegment translations from its memory? There are a lot of them now.

I'm currently using the default setting.


Thank you in advance!

Kwang




Hi Masato,


1. An example with minimal length difference set at 80%


ABCDEFGHIJ (10-character source word)


CafeTran accepts the following target length possibilities:


ABCDEFGH = 80% (minimal length difference)

ABCDEFGHI = 90%

ABCDEFGHIJ = 100%

ABCDEFGHIJK = 90%

ABCDEFGHIJKL = 80% (minimal length difference)


2. Yes, your description is correct.
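
The length example in point 1 can be reproduced with a small sketch. It assumes the percentage is computed as 100% minus the length difference relative to the source length, which is only inferred from the numbers in the example (11 characters = 90%, 12 characters = 80%) and is not confirmed as CafeTran's actual formula. The function name and parameters are hypothetical.

```python
def accepted_target_lengths(source_len: int, minimal_percent: int = 80) -> list[int]:
    """Return the target lengths that pass the minimal length difference check.

    Assumption (inferred from the forum example, not from CafeTran itself):
    similarity = 100 - 100 * |target_len - source_len| / source_len
    """
    if source_len <= 0:
        raise ValueError("source_len must be positive")
    accepted = []
    for target_len in range(1, 2 * source_len + 1):
        difference = abs(target_len - source_len)
        # Integer form of: similarity >= minimal_percent (avoids float rounding).
        if difference * 100 <= (100 - minimal_percent) * source_len:
            accepted.append(target_len)
    return accepted


# A 10-character source with the setting at 80%
lengths = accepted_target_lengths(10, 80)
print(lengths)
```

Under this assumption, lowering the setting widens the accepted range on both the shorter and the longer side.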


Igor

Hi Igor,


Thank you.

I've come across that. Unfortunately, I still don't understand how to achieve what I want.

I want CT to only show the yellow hits (i.e. the ones with full target text) as most of the time the orange hits are wrong, and I don't want CT to use any guessed subsegment in AA.


Kwang

Hi Kwang,


Yes, the 'guessing' results may vary depending on the language pair. The accuracy of hits should increase as your TM grows, since CT has more data to analyze. The blue hits are low-accuracy "guesses", the orange ones are medium-accuracy, and the purple ones are the highest-accuracy hits. Hover the mouse over a hit number and you will see the full context of the 'guess'.


Igor

Sorry to cut in, but I haven't noticed these colors.


Where do they appear?


Masato

Hi Masato


They appear where your fuzzy matches/fragments are shown.

You have to use Matching Type: Fuzzy and Hits, though, for that TM.


Kwang

Hi Igor,


So it seems I have to increase the Subsegment to Virtual threshold to increase the probability of accurate guesses.


Other questions now arise.

How can I make CT show a subsegment (or treat it as a hit) only when it contains more than, let's say, 5 words?

By enabling only Fuzzy, will CT still use subsegment matches in AA?

Is it possible to not allow CT to use subsegment matches in AA?


BTW, why do I not have blue hits, but I have magenta, orange, and yellow hits?


Kwang

Hi Masato,


CafeTran uses two steps to detect hits for source subsegments in TM:


1. By looking for exact fragments in the TM. It should always produce an exact match for a given source subsegment.

2. By analyzing the frequencies of hits on the source and target side. This is a statistical approach, and the results get more accurate as the number of hits increases. The hit colors indicate the estimated accuracy of each hit. The higher the number of source hits, the higher the probability that the target hit is accurate. Japanese hits are analyzed at the character level, while languages with a defined word separator are analyzed at the word level.
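
To make the two steps above more concrete, here is a toy sketch. It is not CafeTran's algorithm: the scoring rule (how often a target token occurs in matching TM units minus how often it occurs elsewhere), the function name, and the sample TM are all assumptions chosen only to illustrate a frequency-based guess. The `tokenize` parameter stands in for the word-level versus character-level analysis; passing `list` would split a Japanese target into characters.

```python
from collections import Counter

# Hypothetical mini TM of (source, target) pairs, for illustration only.
TM = [
    ("the cost of fuel", "le coût du carburant"),
    ("the cost of labor", "le coût du travail"),
    ("the price of fuel", "le prix du carburant"),
]


def guess_target(subsegment, tm, fragments=None, tokenize=str.split):
    """Return (target_guess, source_hits) for a source subsegment."""
    # Step 1: an exact fragment, if one exists, is used as is.
    if fragments and subsegment in fragments:
        return fragments[subsegment], None

    # Step 2: compare token frequencies in matching and non-matching units.
    inside, outside = Counter(), Counter()
    source_hits = 0
    for source, target in tm:
        tokens = set(tokenize(target))
        if subsegment in source:
            source_hits += 1
            inside.update(tokens)
        else:
            outside.update(tokens)
    if not inside:
        return None, 0

    # Highest score wins; ties are broken alphabetically for determinism.
    best = min(inside, key=lambda tok: (-(inside[tok] - outside[tok]), tok))
    return best, source_hits


print(guess_target("cost", TM))
```

In this toy TM, "coût" is the only token that appears in both units containing "cost" and in no other unit, so it scores highest. With more matching units the counts separate more clearly, which mirrors the point that accuracy grows with the number of source hits.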


Igor 

The Subsegment to Virtual threshold sets the number of hits after which the program treats a guessed subsegment the same as an exact fragment. It then becomes a sort of 'virtual' or 'guessed' exact fragment.


> How can I allow CT to show only a subsegment/to treat it as a hit only when it contains more than, let's say, 5 words?


See Edit > Options > Memory > Minimal subsegment length (in characters).


> By enabling only Fuzzy, will CT still use subsegment matches in AA?


No.


> Is it possible to not allow CT to use subsegment matches in AA?


Set the Subsegment to Auto threshold very high. Then CT will pick only 'sure' candidates for AA.
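
As a rough mental model of the two thresholds described above, the sketch below treats both as minimum hit counts. This is an assumption: whether CafeTran compares with "at least" or "more than", whether the two checks are independent, and what the default values are is not stated in this thread. The function name and the numbers are hypothetical.

```python
def classify_guess(source_hits: int, virtual_threshold: int, auto_threshold: int) -> dict:
    """Model how a guessed subsegment might be treated for a given hit count."""
    return {
        # Enough hits: treated like an exact ('virtual') fragment.
        "treated_as_exact_fragment": source_hits >= virtual_threshold,
        # Enough hits: considered a 'sure' candidate for auto-assembling (AA).
        "used_in_auto_assembling": source_hits >= auto_threshold,
    }


# A very high Auto threshold keeps ordinary guesses out of AA.
print(classify_guess(source_hits=12, virtual_threshold=50, auto_threshold=10_000))
```

Under this model, raising a threshold never makes a guess more likely to be used; it only raises the amount of evidence required.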


Igor

> BTW, why do I not have blue hits, but I have magenta, orange, and yellow hits?


Blue hits turn to yellow when you set the dark theme.


Igor

Thank you!

Hi Igor,


I'm asking this question because in most (not all) cases, the whole target segment (Japanese) is shown as a "guess" as follows.



So at first I thought that the "hits" function was designed to display TM target segments whose source segments contain certain words appearing in the current source segment, rather than to find possible subsegment pairs that could ultimately be used for auto-assembling.


Is it possible, if there is a large enough TM, for CT to pick up a certain part of the target segment as a possible Japanese equivalent for, say, "the cost of"?


Thanks always,

Masato

Dear Igor,

The hits feature had long been the most difficult part of CT for me to understand.

Now, it's not, and I'm very happy.

Thanks for your clear explanation.

 Cheers

Masato

I share this question!

A question about hits.


I understand hits are about subsegments or parts of a source sentence, but I don't understand how CT can identify subsegments of TM target sentences that correspond to those source subsegments (especially when Japanese is involved).


Is the hits feature mainly designed for language pairs with a word separator (space)?


Peace,

Masato

Hi Masato,


Yes, it is possible. One of the tuning options for the target hits is "Subsegment minimal length difference" in the Edit > Options > Memory tab. As an English and Japanese term pair may differ significantly in length, try lowering this setting for your language pair.


Igor
