How to force segmentation after the triangular bullet ►, possibly followed by a font switch?
I have just realised that for these (MS Word) documents, I can hide the ►.
Just for educational purposes (and to avoid the need to hide the ►): How can I define a segmentation rule to exclude the leading ► from the segments?
I don't think segmentation rules can be used for hiding any characters.
> Why doesn't CafeTran split after the bullet?
You seem to want a break rule that singles out the bullet as a segment of its own. There might exist a regular expression for such a break rule but I don't know it or if it is possible at all.
> There might exist a regular expression for such a break rule but I don't know it or if it is possible at all.
Since this task is far from trivial (other users might want to force segmentation after bullets too) and since the announced regex tagger for CafeTran isn't available yet, I've asked an expert on this subject and he answers:
The rule looks fine to me.
As long as the triangle is immediately followed by a word (like in your example) that should work.
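To make the "immediately followed by a word" condition concrete, here is a minimal Python sketch of what such a break rule does on a plain text string. It is not the segmenter's real code: the names BEFORE_BREAK, AFTER_BREAK and split_after_bullet are made up for illustration, and the rule (break after ► when \w follows) is only assumed from the discussion above.

```python
import re

# Assumed rule, modelled on the discussion: break after the bullet
# when a word character follows. Names are hypothetical.
BEFORE_BREAK = "►"
AFTER_BREAK = r"\w"


def split_after_bullet(text):
    """Split text at every position where the assumed rule applies."""
    rule = re.compile(BEFORE_BREAK + "(?=" + AFTER_BREAK + ")")
    segments = []
    start = 0
    for match in rule.finditer(text):
        # The break position is right after the bullet.
        segments.append(text[start:match.end()])
        start = match.end()
    segments.append(text[start:])
    return segments


print(split_after_bullet("►Reduce the maintenance costs"))
print(split_after_bullet("► Reduce the maintenance costs"))
```

In the first call the bullet ends up in a segment of its own. In the second call a space follows the bullet, so \w does not match and the string stays in one piece.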
I’ve tried that rule with Ratel and our segmenter and it works fine. See the screen shot below that shows the triangle is in its own segment.
The SRX file itself looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<header segmentsubflows="yes" cascade="no">
<formathandle type="start" include="no"></formathandle>
<formathandle type="end" include="yes"></formathandle>
<formathandle type="isolated" include="no"></formathandle>
<okpsrx:sample language="en" useMappedRules="yes">►Reduce the maintenance costs</okpsrx:sample>
<languagemap languagepattern=".*" languagerulename="default"></languagemap>
Nothing really different from your example.
Obviously this is a test on a text string, not on a Word file.
When looking at the XLIFF extraction of the Word file, I noticed that there are inline codes before and after the triangle, probably from a font change. Maybe this is what causes your non-breaking behavior? Perhaps the \w cannot match because there are one or more inline codes between the triangle and the next word?
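A small Python sketch may illustrate this suspicion. It assumes, purely for illustration, that inline codes appear in the text as placeholders such as <1>, </1> or <2/>; real tools represent inline codes in their own way, and whether the segmenter's regex even sees them depends on the tool. The names PLAIN_RULE, TOLERANT_RULE and breaks_after_bullet are hypothetical.

```python
import re

# Rule as discussed: bullet immediately followed by a word character.
PLAIN_RULE = re.compile(r"►(?=\w)")

# Hypothetical variant that skips any number of inline code placeholders
# (<1>, </1>, <2/>) between the bullet and the next word.
TOLERANT_RULE = re.compile(r"►(?=(?:</?\d+/?>)*\w)")


def breaks_after_bullet(rule, text):
    """Return True if the rule finds a break position after a bullet."""
    return rule.search(text) is not None


plain = "►Reduce the maintenance costs"
# Assumed shape of the Word content: a font switch wrapped around the bullet.
coded = "<1>►</1>Reduce the maintenance costs"

print(breaks_after_bullet(PLAIN_RULE, plain))     # bullet directly before a word
print(breaks_after_bullet(PLAIN_RULE, coded))     # closing code sits in between
print(breaks_after_bullet(TOLERANT_RULE, coded))  # codes are skipped
```

With the plain rule, the closing code between ► and "Reduce" is enough to stop \w from matching. The tolerant pattern finds the break, but on which side of the break the code itself ends up is still up to the segmenter.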
Yes, your rule should work fine.
But this probably depends on how the segmenter you are using deals with inline codes.
I’ve attached the Word file extracted to XLIFF and segmented with your rule, using Rainbow.
And you can see that the triangle is in its own segment there (in the <seg-source> elements).
I've opened the provided XLF file in CafeTran, and it looks fine:
So, Igor, perhaps with this info you can turn some screws to make CafeTran even more perfect than perfect (plus quam ultra perfect)?
Yep, the inline tags around the bullet prevent achieving what you wish. Ultra-perfection means leaving the rest of the world in the dust. :)
A bit more grateful please, Kmitowski. After showing the Woz how to build the Apple Computer, after showing you how to code CafeTran, after telling Peter how to develop Keyboard Maestro, and after teaching us how to use regexes, AppleScript, and more, much more, Lenting now tells you how to change segmentation rules.
Lenting: Please don't forget to mention that I translate from English now too.
Completely irrelevant in this topic, but I don't see why you wouldn't translate from English. Or from French. Another language?
Please hold your horses, Woorden, I was merely documenting some techniques here. If you don't want to use them: please feel free to ignore the posting.
Lenting: Please hold your horses
The minute you stop farting.