I've just been playing around with segmentation rules in CafeTran.
It took me some time to work out how to get them to do what I want, and there doesn't appear to be any documentation for this feature, so I thought I'd share some tips I picked up here.
If you want to do anything useful with the segmentation rules, you first need to modify the preferences to use a rules.srx file:
Go into Preferences -> General and, under Segmentation, select 'Rules.srx'.
Once you've done this, you will generally need to add your language to the 'language map'. (The exception: a handful of languages are already set up, in which case they don't need to be added anew.)
To do so:
1. Click on Segmentation editor -> Language maps -> New language map
2. Enter the "language pattern". The language pattern looks complicated, but is actually simple (unless you want different patterns for different sublanguages). It depends on the two-letter language code: for De, for example, it is [Dd][Ee].* and for Fr it is [Ff][Rr].* (and so on for other languages).
3. Enter the language name.
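To see what a language pattern like the ones in step 2 accepts, here is a minimal Python sketch. It assumes the pattern is matched against the whole language code, which I have not verified for CafeTran; the helper name matches_language and the sample codes are made up for illustration, and Python's regex engine is not necessarily the one CafeTran uses.

```python
import re

# Patterns as given in step 2 above.
GERMAN = r"[Dd][Ee].*"
FRENCH = r"[Ff][Rr].*"

def matches_language(pattern, code):
    """Return True if the whole language code matches the pattern.

    Assumption: the pattern has to cover the full code (hypothetical helper).
    """
    return re.fullmatch(pattern, code) is not None

print(matches_language(GERMAN, "de-DE"))  # True
print(matches_language(GERMAN, "DE"))     # True
print(matches_language(GERMAN, "fr-FR"))  # False
print(matches_language(FRENCH, "fr-CA"))  # True
```

The trailing .* is what lets one pattern cover every sublanguage (de-DE, de-AT, de-CH, ...). If you wanted a separate map for one sublanguage, you would need a narrower pattern for it.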
Now for the good bit - you can find a pretty comprehensive set of segmentation rules for most languages here: https://github.com/languagetool-org/languagetool/blob/master/languagetool-core/src/main/resources/org/languagetool/resource/segment.srx
To get this into your Rules.srx file (note: I am assuming you have some basic familiarity with XML):
1. Open the Rules.srx file using a text editor. (The Rules.srx file is located in Cafetran\rules\segmentation. Unfortunately, you can't modify this location.)
2. Go to the link above. Find the line: <languagerule languagerulename="YourLanguage">.
3. Copy everything from the above line to the next </languagerule> line.
4. Paste it into your Rules.srx file (immediately after an appropriate </languagerule> line). If you already have rules set up for this language, you will need to either overwrite them or merge the two - I'll assume you can work out how to do this.
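If you'd rather not copy and paste by hand, the four steps above can be roughly sketched in Python. This is a plain text-based sketch, not a proper XML merge: it assumes the opening tag looks exactly like <languagerule languagerulename="..."> and that the language is not already in the target file. The function names and the tiny sample strings are made up for illustration; with real files you would read both into strings first and keep a backup of Rules.srx.

```python
import re

def extract_language_rule(srx_text, name):
    """Return the text from <languagerule languagerulename="name">
    up to and including the next </languagerule> (steps 2 and 3)."""
    pattern = re.compile(
        r'<languagerule\s+languagerulename="' + re.escape(name) + r'"\s*>'
        r'.*?</languagerule>',
        re.DOTALL,
    )
    match = pattern.search(srx_text)
    if match is None:
        raise ValueError("no languagerule named %r" % name)
    return match.group(0)

def insert_language_rule(rules_text, block):
    """Insert block right after the last </languagerule> (step 4)."""
    closing = "</languagerule>"
    pos = rules_text.rfind(closing)
    if pos == -1:
        raise ValueError("no </languagerule> found in the target text")
    cut = pos + len(closing)
    return rules_text[:cut] + "\n" + block + rules_text[cut:]

# Tiny made-up samples standing in for segment.srx and Rules.srx.
source = (
    '<languagerules>\n'
    '<languagerule languagerulename="German">\n<rule break="no"/>\n</languagerule>\n'
    '<languagerule languagerulename="French">\n<rule break="yes"/>\n</languagerule>\n'
    '</languagerules>'
)
target = (
    '<languagerules>\n'
    '<languagerule languagerulename="Default">\n</languagerule>\n'
    '</languagerules>'
)

merged = insert_language_rule(target, extract_language_rule(source, "French"))
print(merged)
```

Remember that the new rules only take effect for a language once it also has an entry in the language map.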
Once you've done that, the easiest way to edit or add further rules appears to be using Ratel: http://www.opentag.com/okapi/wiki/index.php?title=ratel
Great, thanks for the link. I've added it to the suggestions for Segmentation in the CafeTran Espresso - Preferences reference file.
I have been using the one shipped with OmegaT ( https://raw.githubusercontent.com/omegat-org/omegat/master/src/org/omegat/core/segmentation/defaultRules.srx ), only adding a colon to the Default Break rules (from [\.\?\!]+ to [\.\?\!:]+ ) to achieve a Trados-like segmentation, but I will try to add more abbreviations from the LanguageTool srx to the file I use.
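To illustrate what adding the colon changes, here is a small Python sketch. It is a simplification: it assumes a break happens wherever the pattern is followed by whitespace, and it ignores exception rules (abbreviations etc.) entirely. The function split_after and the sample sentence are made up, and Python's regex engine stands in for whatever the tool actually uses.

```python
import re

def split_after(before_break, text):
    """Split text after each match of before_break that is followed by whitespace."""
    pattern = re.compile("(" + before_break + r")\s+")
    segments = []
    start = 0
    for m in pattern.finditer(text):
        # The segment ends right after the punctuation; the whitespace is dropped.
        segments.append(text[start:m.end(1)])
        start = m.end()
    if start < len(text):
        segments.append(text[start:])
    return segments

text = "Note: save first. Then restart!"

print(split_after(r"[\.\?\!]+", text))
# ['Note: save first.', 'Then restart!']

print(split_after(r"[\.\?\!:]+", text))
# ['Note:', 'save first.', 'Then restart!']
```

With the colon in the character class, "Note:" becomes a segment of its own, which is the Trados-like behaviour mentioned above.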
Wow, I downloaded the file at https://github.com/languagetool-org/languagetool/blob/master/languagetool-core/src/main/resources/org/languagetool/resource/segment.srx and it is AMAZING. It has tons of languages in it. You should ship CT with rules from this file!
Also, I think the best free seg rule editor is currently: http://www.maxprograms.com/products/srxeditor.html
Apart from the quite complex srx segmentation rules, you might also check out the much simpler Resources > Abbreviations, which CafeTran treats as exceptions to the default segmentation rules.
Now, there is one additional good bit: there is no need to edit Rules.srx, because CafeTran recognizes any srx file present in the "cafetran/rules/segmentation" folder. So in Preferences, Rules.srx can coexist with other srx files you import or edit. Simply add the srx file to that folder, restart CafeTran, and you'll be able to choose it in Preferences.