Hi, I am a Tibetan-English translator and am currently doing a review of CAT platforms. I just tested your latest version of CafeTran and I am seeing a problem that is common across CAT platforms in regard to how your code reads the Tibetan script:
Let me explain first that in the Tibetan script there are no spaces between words; instead there are little dots called tseks. For example, ཐམས་ཅད་མཁྱེན་ཅིང་ཀུན་གཟིགས། is a phrase of six syllables with a tsek between each one. CafeTran is recognizing each of these dots as an ordinary character rather than a word delimiter, so it registers whole phrases as single words and, as a result, gets very poor leverage for both fuzzy matching and concordance search.
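To make the problem concrete, here is a minimal Python sketch. It is not CafeTran's code; the function names are made up for illustration, and it only assumes that the tsek is U+0F0B and the closing stroke (shad) is U+0F0D. It contrasts a whitespace-only split with a split on the tsek:

```python
TSEK = "\u0f0b"  # TIBETAN MARK INTERSYLLABIC TSHEG
SHAD = "\u0f0d"  # TIBETAN MARK SHAD (closing stroke)

phrase = "ཐམས་ཅད་མཁྱེན་ཅིང་ཀུན་གཟིགས།"


def split_on_whitespace(text):
    """Naive tokenizer: only whitespace separates tokens."""
    return text.split()


def split_on_tsek(text):
    """Illustrative tokenizer: drop trailing shads, then split on the tsek."""
    cleaned = text.rstrip(SHAD)
    return [unit for unit in cleaned.split(TSEK) if unit]


if __name__ == "__main__":
    # Whitespace splitting sees the whole phrase as one token.
    print(split_on_whitespace(phrase))
    # Splitting on the tsek exposes the individual units.
    print(split_on_tsek(phrase))
```

A tool that tokenizes the first way has nothing smaller than the whole phrase to compare, which is consistent with the poor fuzzy and concordance results described above.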
I'm just a Cafetranslator, not a moderator or the developer (Igor), and I'm not familiar with Tibetan (your explanation and test documents will certainly be helpful), but here are a few thoughts to start with. Also, please note you might not get an official answer before Monday:
1. In the TM options (right-click inside the tab of an open TM), there is an option to set the Matching type: Fuzzy & Hits, Fuzzy, or Fuzzy without word separator.
I suggest you first check how TM matches work with "Fuzzy without word separator" enabled.
With this option, CafeTran analyzes source segments on a character basis, which is suitable for languages without word boundaries (e.g. Chinese or Japanese).
If that does not work, please set the option back to Fuzzy or Fuzzy & Hits, and proceed with suggestion number 2.
2. In Edit > Preferences > Memory, you can try adding U+0F0B as "Additional space characters". Does this improve or alter TM matching for Tibetan in any way?
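For readers wondering why the two suggestions could behave differently, the following Python sketch compares character-based and token-based similarity. It is only an illustration under stated assumptions: `difflib.SequenceMatcher` stands in for a fuzzy-match score, and the `extra_separators` parameter is a hypothetical analogue of the "Additional space characters" setting. CafeTran's actual matching algorithm is not described in this thread.

```python
import re
from difflib import SequenceMatcher

TSEK = "\u0f0b"  # TIBETAN MARK INTERSYLLABIC TSHEG


def tokenize(text, extra_separators=""):
    """Split on whitespace plus any extra separator characters."""
    pattern = "[\\s" + re.escape(extra_separators) + "]+"
    return [tok for tok in re.split(pattern, text) if tok]


def char_similarity(a, b):
    """Character-based score, comparable in spirit to matching without word separators."""
    return SequenceMatcher(None, a, b).ratio()


def token_similarity(a, b, extra_separators=""):
    """Token-based score: whole tokens either match or they do not."""
    return SequenceMatcher(
        None, tokenize(a, extra_separators), tokenize(b, extra_separators)
    ).ratio()


if __name__ == "__main__":
    source = "ཐམས་ཅད་མཁྱེན་ཅིང་ཀུན་གཟིགས།"
    # Artificial variant for the demo: the first three units of the source.
    variant = "ཐམས་ཅད་མཁྱེན"
    print("character-based:      ", char_similarity(source, variant))
    print("tokens, whitespace:   ", token_similarity(source, variant))
    print("tokens, tsek as space:", token_similarity(source, variant, TSEK))
```

With whitespace-only tokenization each segment is a single token, so two segments that differ anywhere share nothing at the token level; treating the tsek as a separator lets the shared units count towards the score.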
Thanks Jean, I probably should have checked for more configurations before assuming it was a deeper coding issue. I will give your suggestions a try and see if they work.
Hey Celso, did you succeed? Would be interesting to read the experiences of a Tibetan translator ;).
Hi Jean and alwayslockyourbike,
I did try Jean's suggestions here. The first suggestion didn't work: whenever I set the TM to "Fuzzy without word separator", it reverted to "Fuzzy & Hits", even when I created a new project and TM from scratch. I'm not sure whether this is an issue with the language pairing or perhaps with my CafeTran version.
However, Jean's second suggestion, adding "U+0F0B" under Edit > Preferences > Memory as "Additional space characters", did indeed resolve the problem. Though my fuzzy matches were still a few percentage points below those in the Roman transliteration tests, the results were satisfactory.
I will add a post about this and some other tips for getting started in the forum for any other Tibetan-language translators that might come here.
Thank you both for your time and support with this issue!
I will attach my test files here in case anyone needs to reference this problem but, as I mentioned, this seems to be an adequate fix.
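For anyone reproducing the U+0F0B fix with their own files, it can help to confirm which punctuation code points a segment actually contains before entering them as additional space characters. The Python sketch below is a hypothetical helper, not part of CafeTran, and uses only the standard `unicodedata` module to list each punctuation mark with its code point, Unicode name, and count:

```python
import unicodedata


def list_punctuation(text):
    """Return (code point, Unicode name, count) for each punctuation mark in text."""
    counts = {}
    for ch in text:
        # Unicode general categories starting with "P" are punctuation.
        if unicodedata.category(ch).startswith("P"):
            counts[ch] = counts.get(ch, 0) + 1
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in counts.items()
    ]


if __name__ == "__main__":
    segment = "ཐམས་ཅད་མཁྱེན་ཅིང་ཀུན་གཟིགས།"
    for code_point, name, count in list_punctuation(segment):
        print(code_point, name, count)
```

If a file turns out to use other delimiter marks as well, those code points would show up in this listing; whether adding them to the same setting changes match scores is something to test rather than assume.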