I've just been playing around with segmentation rules in CafeTran.
It took me some time to work out how to get them to do what I want, and there doesn't appear to be any documentation for this feature, so I thought I'd share some tips I picked up here.
If you want to do anything useful with the segmentation rules, you first need to modify the preferences to use a rules.srx file:
Go into Preferences -> General and, under Segmentation, select 'Rules.srx'.
Once you've done this, you will generally need to add your language to the 'language map'. (The exception: a handful of languages are already set up, in which case they don't need to be added anew.)
To do so:
1. Click on Segmentation editor -> Language maps -> New language map
2. Enter the "language pattern". The language pattern looks complicated, but is actually simple (unless you want different patterns for different sublanguages). It depends on the two-letter language code: for De, for example, it is [Dd][Ee].* and for Fr it is [Ff][Rr].* , and so on.
3. Enter the language name.
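If the pattern syntax is unfamiliar, here is a small Python sketch of how such a pattern is put together and what it matches. The helper names are my own, and I'm assuming the pattern is matched against the whole language code (e.g. "de" or "de-AT"); I haven't checked how CafeTran applies it internally.

```python
import re


def language_pattern(code: str) -> str:
    """Build a pattern such as [Dd][Ee].* from a two-letter language code."""
    # One character class per letter, accepting upper and lower case,
    # followed by .* so that sublanguage suffixes are accepted too.
    return "".join(f"[{c.upper()}{c.lower()}]" for c in code) + ".*"


def matches(pattern: str, language_code: str) -> bool:
    # Assumption: the whole language code has to match the pattern.
    return re.fullmatch(pattern, language_code) is not None


if __name__ == "__main__":
    german = language_pattern("de")
    print(german)
    for code in ["de", "DE", "de-AT", "en-GB"]:
        print(code, matches(german, code))
```

The trailing .* is what lets one pattern cover all sublanguages; a narrower pattern would be needed to treat them differently.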
Now for the good bit - you can find a pretty comprehensive set of segmentation rules for most languages here: https://github.com/languagetool-org/languagetool/blob/master/languagetool-core/src/main/resources/org/languagetool/resource/segment.srx
To get this into your Rules.srx file (note: I am assuming you have some basic familiarity with XML):
1. Open the Rules.srx file using a text editor. (The Rules.srx file is located in Cafetran\rules\segmentation. Unfortunately you can't modify this location.)
2. Go to the link above. Find the line: <languagerule languagerulename="YourLanguage">.
3. Copy everything from the above line to the next </languagerule> line.
4. Paste it into your Rules.srx file (immediately after an appropriate </languagerule> line). If you already have rules set up for this language, you will need to either overwrite them or merge the two - I'll assume you can work out how to do this.
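For anyone who would rather script the copy-and-paste steps above, here is a rough Python sketch using only the standard library. The function names are mine, and I'm assuming both files share the same namespace (either the usual SRX 2.0 one or none at all). It takes the "overwrite" route: an existing rule set with the same name is replaced rather than merged.

```python
import xml.etree.ElementTree as ET

# Assumption: SRX 2.0 files declare this default namespace. Registering it
# stops ElementTree from adding an "ns0:" prefix to every tag on output.
ET.register_namespace("", "http://www.lisa.org/srx20")


def local_name(tag: str) -> str:
    """Return the tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def elements_named(root: ET.Element, name: str) -> list:
    """Find all elements (root included) whose local tag name is `name`."""
    return [el for el in root.iter() if local_name(el.tag) == name]


def copy_language_rule(source_xml: str, target_xml: str, language: str) -> str:
    """Copy one <languagerule> block from source_xml into target_xml."""
    source = ET.fromstring(source_xml)
    target = ET.fromstring(target_xml)

    wanted = next(
        (el for el in elements_named(source, "languagerule")
         if el.get("languagerulename") == language),
        None,
    )
    if wanted is None:
        raise ValueError(f"no languagerule named {language!r} in source")

    containers = elements_named(target, "languagerules")
    if not containers:
        raise ValueError("target has no <languagerules> element")
    container = containers[0]

    # Overwrite: drop any existing rule set with the same name first.
    for existing in list(container):
        if (local_name(existing.tag) == "languagerule"
                and existing.get("languagerulename") == language):
            container.remove(existing)

    container.append(wanted)
    return ET.tostring(target, encoding="unicode")


if __name__ == "__main__":
    # Tiny made-up samples standing in for segment.srx and Rules.srx.
    source = (
        '<srx><body><languagerules>'
        '<languagerule languagerulename="German">'
        r'<rule break="no"><beforebreak>\bz\.B\.</beforebreak>'
        r'<afterbreak>\s</afterbreak></rule>'
        '</languagerule></languagerules></body></srx>'
    )
    target = (
        '<srx><body><languagerules>'
        '<languagerule languagerulename="Default"/>'
        '</languagerules></body></srx>'
    )
    print(copy_language_rule(source, target, "German"))
```

To use it on real files you would read both files into strings first and write the result back; keep a backup of Rules.srx before overwriting it.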
Once you've done that, the easiest way to edit or add further rules appears to be using Ratel: http://www.opentag.com/okapi/wiki/index.php?title=ratel
Now, there is one additional good bit: there is no need to edit Rules.srx. CafeTran recognizes any srx file present in the "cafetran/rules/segmentation" folder, so in Preferences, Rules.srx can coexist with any other srx files you import or edit. Simply add the srx file to that folder, restart CafeTran, and you'll be able to choose it in Preferences.
Apart from the quite complex srx segmentation rules, you might also check out the much simpler Resources > Abbreviations, which CafeTran treats as exceptions to the default segmentation rules.
Wow, I downloaded the file at https://github.com/languagetool-org/languagetool/blob/master/languagetool-core/src/main/resources/org/languagetool/resource/segment.srx and it is AMAZING. It has tons of languages in it. You should ship CT with rules from this file!
Also, I think the best free seg rule editor is currently: http://www.maxprograms.com/products/srxeditor.html
Great, thanks for the link. I've added it to the suggestions for Segmentation in the CafeTran Espresso - Preferences reference file.
I have been using the one shipped with OmegaT (https://raw.githubusercontent.com/omegat-org/omegat/master/src/org/omegat/core/segmentation/defaultRules.srx), only adding a colon to the Default Break rules (from [\.\?\!]+ to [\.\?\!:]+ ) to achieve a Trados-like segmentation, but I will try to add more abbreviations from the LanguageTool srx to the file I use.
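To make the effect of that one-character change visible, here is a simplified Python sketch. It is not how CafeTran or OmegaT segment internally; it only imitates a single break rule. I'm assuming the rule's "after break" part is \s (whitespace), and the segment function is my own.

```python
import re


def segment(text: str, before_break: str, after_break: str = r"\s") -> list:
    """Split text wherever before_break is directly followed by after_break."""
    # The lookahead checks the after-break part without consuming it,
    # so the break lands right after the punctuation.
    pattern = re.compile(f"(?:{before_break})(?={after_break})")
    segments, start = [], 0
    for match in pattern.finditer(text):
        segments.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


if __name__ == "__main__":
    sample = "Note: this works. Does it? Yes!"
    print(segment(sample, r"[\.\?\!]+"))   # original break rule
    print(segment(sample, r"[\.\?\!:]+"))  # with the colon added
```

With the original pattern the sample stays in three segments; with the colon added, "Note:" becomes a segment of its own.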
Hi Jean, can you help me? Since I only translate sdlxliff files, I never needed to touch segmentation rules, but now I'm starting to work with docx files again and I'd like to use OmegaT's rules, with some additions:
break after every "," ";" ":" and "]".
You need to edit the OmegaTRules.srx file.
You need to find the break rules for your language. Have a look at the existing break rules, copy one and replace the existing punctuation character with your new one. Repeat as required.
"You need to find the break rules for your language."
That's your source language of course.
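If copying rules by hand feels error-prone, this Python sketch prints one break rule per punctuation character, ready to paste among the existing break rules for your source language. The rule/beforebreak/afterbreak layout is standard SRX; the \s in afterbreak (break only when whitespace follows) is my assumption, so compare it with the rules already in your file.

```python
import re
import xml.etree.ElementTree as ET


def make_break_rule(char: str, after_break: str = r"\s") -> ET.Element:
    """Build <rule break="yes"> that breaks after the given character."""
    rule = ET.Element("rule", {"break": "yes"})
    # re.escape adds a backslash where the character is special in a regex,
    # which matters for "]".
    ET.SubElement(rule, "beforebreak").text = re.escape(char)
    ET.SubElement(rule, "afterbreak").text = after_break
    return rule


if __name__ == "__main__":
    for char in [",", ";", ":", "]"]:
        print(ET.tostring(make_break_rule(char), encoding="unicode"))
    # The last line printed is:
    # <rule break="yes"><beforebreak>\]</beforebreak><afterbreak>\s</afterbreak></rule>
```

Another option, following the colon example earlier in the thread, would be to add the four characters to the existing character class instead, but I haven't tested that.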
Well, I have no idea about REGEX, it's like reading Arabic haha. Thanks anyway!