
Problem with CafeTran code while searching Tibetan script

Hi, I am a Tibetan-English translator and am currently doing a review of CAT platforms. I just tested your latest version of CafeTran and I am seeing a problem that is common across CAT platforms in regard to how your code reads the Tibetan script:


Let me explain first that the Tibetan script has no spaces between words; instead there are little dots called tseks. For example, ཐམས་ཅད་མཁྱེན་ཅིང་ཀུན་གཟིགས། is a phrase with six words and a dot between each word. CafeTran is treating each of these dots as an ordinary character rather than a word delimiter, so it registers whole phrases as single words and, as a result, gets very poor leverage for both fuzzy matching and concordance search.


I was able to confirm this with a simple test: in the document called "Test C Tibetan Script" I created a TM for a simple 8 lines of verse (one segment per line of verse). I then ran this against a new source text in which I altered a word or two on each line, so that we would expect a fuzzy match for each line, though I left one line unchanged (a 100% match) as a control.

Then I ran the exact same test in another document called "Test C Tibetan Romanized Transliteration". This used the exact same TM and source, except that all the Tibetan was converted into the Wylie Roman transliteration, where all of the punctuation dots are replaced by spaces. The example phrase from above would thus be rendered "thams cad mkhyen cing kun gzigs/".

The results confirm the problem: the transliterated document found fuzzy matches as one would expect, but the Tibetan-Unicode document failed to find fuzzy matches with the corresponding Tibetan-Unicode .tmx file, except in the one case where it was a 100% match.

So basically, CafeTran just needs a very small update to recognize the Tibetan punctuation properly. It should treat all the tseks, the little punctuation dots (Unicode U+0F0B), as word delimiters equivalent to spaces (Unicode U+0020). This minor adjustment would make the platform as functional for Tibetan as it is for many other languages.
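To illustrate why the delimiter matters, here is a minimal Python sketch. It assumes a simple token-overlap score from the standard library's difflib, which is only a stand-in and not CafeTran's actual matching algorithm; the helper names `tokenize` and `similarity` are made up for this example. The "new" segment is built by swapping one syllable of the example phrase.

```python
import re
from difflib import SequenceMatcher

TSEK = "\u0f0b"  # Tibetan tsek (U+0F0B), the dot between syllables

def tokenize(segment, extra_delims=""):
    # Split on whitespace plus any extra delimiter characters.
    pattern = "[\\s" + re.escape(extra_delims) + "]+"
    return [tok for tok in re.split(pattern, segment) if tok]

def similarity(a, b, extra_delims=""):
    # Token-level ratio in [0, 1]; a stand-in for a real fuzzy-match score.
    return SequenceMatcher(
        None, tokenize(a, extra_delims), tokenize(b, extra_delims)
    ).ratio()

tm_segment = "ཐམས་ཅད་མཁྱེན་ཅིང་ཀུན་གཟིགས།"

# Build an altered segment: replace the fifth syllable with the second one.
parts = tm_segment.split(TSEK)
parts[4] = parts[1]
new_segment = TSEK.join(parts)

# Tsek treated as an ordinary character: each segment is one big "word".
print(similarity(tm_segment, new_segment))                     # 0.0

# Tsek treated as a word delimiter: five of six tokens still line up.
print(similarity(tm_segment, new_segment, extra_delims=TSEK))  # roughly 0.83
```

With the tsek counted as a delimiter, the altered line scores as a fuzzy match instead of a complete miss, which is the behaviour seen in the transliterated test.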

Would a moderator please contact the development department and let them know about this problem? Otherwise, I'm afraid your CAT platform is simply not compatible with the Tibetan language.

If you would like me to send in my test documents, namely the (1) Tibetan-Unicode source .docx, (2) Tibetan Roman Transliteration source .docx, (3) Tibetan-Unicode .tmx, and (4) Tibetan Roman Transliteration .tmx, feel free to email me at celso.wilkinson@gmail.com. I would be happy to send them to you and answer any of your questions.

Best wishes,
-Celso


Hello Celso,


I'm just a Cafetranslator, not a moderator or the developer (Igor), and I'm not familiar with Tibetan (your explanation and test documents will certainly be helpful), but here are a few thoughts to start with. Also, please note that you might not get an official answer before Monday:


1.

In the TM options (right click inside the tab of an open TM), there is an option to set the Matching type: "Fuzzy & Hits", "Fuzzy", or "Fuzzy without word separator".


I suggest you first check how TM matching works with "Fuzzy without word separator" enabled.


With this option, CafeTran analyzes source segments on a character basis, which is suitable for languages without a word boundary (e.g. Chinese or Japanese). 


Reference: https://github.com/idimitriadis0/TheCafeTranFiles/wiki/3-TM-options#tm-options


If that does not work, please set the option back to Fuzzy or Fuzzy & Hits, and proceed with suggestion number 2.
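As a rough illustration of what matching on a character basis means, here is a small Python sketch. It assumes a plain character-level ratio from difflib and does not reflect how CafeTran computes its scores; `char_similarity` is a hypothetical helper name. The altered segment swaps one syllable of Celso's example phrase.

```python
from difflib import SequenceMatcher

TSEK = "\u0f0b"  # Tibetan tsek (U+0F0B)

def char_similarity(a, b):
    # Compare the segments character by character; no word boundaries needed.
    return SequenceMatcher(None, a, b).ratio()

tm_segment = "ཐམས་ཅད་མཁྱེན་ཅིང་ཀུན་གཟིགས།"

# Build an altered segment: replace the fifth syllable with the second one.
parts = tm_segment.split(TSEK)
parts[4] = parts[1]
new_segment = TSEK.join(parts)

score = char_similarity(tm_segment, new_segment)
print(round(score, 2))  # below 1.0, but well above 0 for a one-syllable change
```

Because the comparison never needs to know where words begin or end, a partially changed line can still score as a fuzzy match even when the script has no spaces.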


2.

In Edit > Preferences > Memory, you can try adding U+0F0B as "Additional space characters". Does this improve or alter TM matching for Tibetan in any way?


Best wishes,


Jean


Thanks Jean, I probably should have checked for more configurations before assuming it was a deeper coding issue. I will give your suggestions a try and see if it works.

Hey Celso, did you succeed? Would be interesting to read the experiences of a Tibetan translator ;).




Hi Jean and alwayslockyourbike,


I did try Jean's suggestions here. The first suggestion didn't work. Whenever I tried setting the TM to "Fuzzy without word separator", the TM would go back to "Fuzzy & Hits", even when I created a new project and TM from scratch. I'm not sure if this is an issue with the language pairing or perhaps my CafeTran version.


However, Jean's second suggestion, adding "U+0F0B" under "Additional space characters" in Edit > Preferences > Memory, did indeed resolve the problem. Though my fuzzy matches were still a few percentage points under those from the Roman transliteration tests, the results were satisfactory.


I will add a post about this and some other tips for getting started in the forum for any other Tibetan-language translators that might come here.


Thank you both for your time and support with this issue!


I will attach my test files here in case anyone needs to reference this problem, but as I mentioned this seems to be an adequate fix.


Best wishes,

-Celso

tmx
docx
tmx
docx