
Setting up segmentation rules

I've just been playing around with segmentation rules in CafeTran.

It took me some time to work out how to get them to do what I want, and there doesn't appear to be any documentation for this feature, so I thought I'd share some tips I picked up here.


If you want to do anything useful with the segmentation rules, you first need to modify the preferences to use a rules.srx file:

Go into Preferences -> General and, under Segmentation, select 'Rules.srx'


Once you've done this, you will generally need to add your language to the 'language map'. (The exception: a handful of languages are already set up, in which case they don't need to be added anew.)

To do so:

1. Click on Segmentation editor -> Language maps -> New language map

2. Enter the "language pattern". The language pattern looks complicated, but is actually simple (unless you want different patterns for different sublanguages). It is a regular expression built from the two-letter language code. For German (de), for example, it is [Dd][Ee].* ; for French (fr), it is [Ff][Rr].* ; and so on.

3. Enter the language name.
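
To see what such a pattern actually matches, here is a small Python sketch. The two patterns are the ones given above; the language codes being tested and the helper name language_for are just examples of mine, and I am assuming the pattern has to match the whole language code (I haven't checked how CafeTran applies it internally).

```python
import re

# Patterns as given in the post; the names are only labels for this demo.
patterns = {
    "German": r"[Dd][Ee].*",
    "French": r"[Ff][Rr].*",
}

def language_for(code):
    """Return the first language name whose pattern matches the whole code."""
    # Assumption: the pattern is matched against the full language code.
    for name, pattern in patterns.items():
        if re.fullmatch(pattern, code):
            return name
    return None

for code in ["de", "DE", "de-AT", "fr_CA", "en-GB"]:
    print(code, "->", language_for(code))
```

The trailing .* is what lets a single pattern cover all the regional variants (de-DE, de-AT, and so on). If you wanted separate rules for a sublanguage, you would presumably need a narrower pattern for it.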


Now for the good bit - you can find a pretty comprehensive set of segmentation rules for most languages here: https://github.com/languagetool-org/languagetool/blob/master/languagetool-core/src/main/resources/org/languagetool/resource/segment.srx

To get this into your Rules.srx file (note: I am assuming you have some basic familiarity with XML):

1. Open the Rules.srx file in a text editor. (The Rules.srx file is located in Cafetran\rules\segmentation. Unfortunately, you can't change this location.)

2. Go to the link above. Find the line: <languagerule languagerulename="YourLanguage">.

3. Copy everything from the above line to the next </languagerule> line.

4. Paste it into your Rules.srx file (immediately after an appropriate </languagerule> line). If you already have rules set up for this language, you will need to either overwrite them or merge the two - I'll assume you can work out how to do this.
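
If you would rather script steps 2 to 4 than copy and paste by hand, something along these lines should work. It is only a sketch: the two XML snippets are made-up miniature files (not copied from LanguageTool or CafeTran), the function names are my own, and it refuses to touch a language that already exists in the target rather than trying to merge. It also works purely in memory; for real files you would parse them from disk, write the result back, and check that both files use the same XML namespace.

```python
import copy
import xml.etree.ElementTree as ET

# Made-up miniature files for illustration; real SRX files are far longer
# and usually declare an XML namespace on the root element.
SOURCE = r"""<srx><body><languagerules>
<languagerule languagerulename="German">
<rule break="no"><beforebreak>\bz\.B\.</beforebreak><afterbreak>\s</afterbreak></rule>
</languagerule>
</languagerules></body></srx>"""

TARGET = r"""<srx><body><languagerules>
<languagerule languagerulename="Default">
<rule break="yes"><beforebreak>[\.\?\!]+</beforebreak><afterbreak>\s</afterbreak></rule>
</languagerule>
</languagerules></body></srx>"""

def local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def find_language_rule(root, name):
    """Return the <languagerule> with the given languagerulename, or None."""
    for el in root.iter():
        if local(el.tag) == "languagerule" and el.get("languagerulename") == name:
            return el
    return None

def rule_names(root):
    """List the languagerulename of every <languagerule> in document order."""
    return [el.get("languagerulename") for el in root.iter()
            if local(el.tag) == "languagerule"]

def copy_language_rule(source_root, target_root, name):
    """Append a copy of the named rule to the target's <languagerules>."""
    rule = find_language_rule(source_root, name)
    if rule is None:
        raise KeyError(name)
    if find_language_rule(target_root, name) is not None:
        # Merging or overwriting is left to the user, as in step 4.
        raise ValueError(f"{name} already exists in the target")
    for el in target_root.iter():
        if local(el.tag) == "languagerules":
            el.append(copy.deepcopy(rule))
            return
    raise ValueError("no <languagerules> element in the target")

source_root = ET.fromstring(SOURCE)
target_root = ET.fromstring(TARGET)
copy_language_rule(source_root, target_root, "German")
print(rule_names(target_root))
```

Keep a backup of your Rules.srx before trying anything like this on the real file.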


Once you've done that, the easiest way to edit or add further rules appears to be using Ratel: http://www.opentag.com/okapi/wiki/index.php?title=ratel




Nice! Thanks for sharing. Ratel from the Okapi Framework is indeed the way to go for SRX editing IMHO.

I didn't know about the LanguageTool SRX file. I'll have to compare it with what I already have in place; it seems quite different.

For consistency, since I also use OmegaT, I currently use the SRX file from OmegaT, downloadable here:
https://sourceforge.net/p/omegat/svn/HEAD/tree/trunk/src/org/omegat/core/segmentation/defaultRules.srx

FYI, I've just added a rule to also segment at colons (:), in order to more closely match Trados behavior.

Open the SRX in Ratel, choose Default in the rule list (no specific language; it applies to all languages unless a language-specific rule overrides it), and change [\.\?\!]+ to [\.\?\!:]+
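
To illustrate the effect of that change, here is a rough Python sketch. It is not how CafeTran or Ratel works internally: the segment function is my own simplification, and I am assuming the rule's "after break" part is plain whitespace. The two "before break" patterns are the ones quoted above.

```python
import re

def segment(text, before_break):
    """Break after each before_break match that is followed by whitespace."""
    # Assumption: the after-break part of the rule is simply \s+.
    pattern = re.compile(rf"({before_break})\s+")
    segments, start = [], 0
    for m in pattern.finditer(text):
        segments.append(text[start:m.end(1)])  # keep the punctuation
        start = m.end()                        # skip the whitespace
    if start < len(text):
        segments.append(text[start:])
    return segments

text = "Note: this is new. Is it useful? Yes!"
print(segment(text, r"[\.\?\!]+"))   # original Default rule
print(segment(text, r"[\.\?\!:]+"))  # with the colon added
```

With the original pattern the sample text gives three segments; with the colon added, "Note:" becomes a segment of its own. Under the whitespace assumption, a colon inside something like 10:30 would not trigger a break.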


Now, there is one more good bit: there is no need to edit Rules.srx. CafeTran recognizes any SRX file present in the "cafetran/rules/segmentation" folder, so Rules.srx can coexist with other SRX files you import or edit. Simply add the SRX file to that folder, restart CafeTran, and you'll be able to choose it in Preferences.

Many thanks for the link to the OmegaT files.
I looked at OmegaT when I was looking for SRX rules, but because the OmegaT documentation states that it is not possible to import or export SRX rules, I didn't look for long. It's a great tool, and I'm sure their SRX rules are very good.

Thanks too for the tip on using alternative srx files in the cafetran/rules/segmentation folder. I'm going to do just that with the OmegaT file right now.

 

Apart from the quite complex SRX segmentation rules, you might also check out the much simpler Resources > Abbreviations, which CafeTran treats as exceptions to the default segmentation rules.


Igor

Hi Igor,


Wow, I downloaded the file at https://github.com/languagetool-org/languagetool/blob/master/languagetool-core/src/main/resources/org/languagetool/resource/segment.srx and it is AMAZING. It has tons of languages in it. You should ship CT with rules from this file!





Also, I think the best free seg rule editor is currently: http://www.maxprograms.com/products/srxeditor.html


Michael



Great, thanks for the link, I've added it to the suggestions for Segmentation in CafeTran Espresso - Preferences reference file.


I have been using the one shipped with OmegaT: https://raw.githubusercontent.com/omegat-org/omegat/master/src/org/omegat/core/segmentation/defaultRules.srx
I only added a colon to the Default break rules (changing [\.\?\!]+ to [\.\?\!:]+ ) to achieve Trados-like segmentation, but I will try to add more abbreviations from the LanguageTool SRX to the file I use.
