
Strange segmentation

I have the following Japanese sentence:


爽やかなプロセッコDOCビアンコに合わせて楽しんでくださいね。


The part in red ("DOC") consists of Latin characters; all the rest is Japanese. It's a single sentence, so CTE should not segment it. But this is what I get:

[Screenshot: CTE segmentation result]


CTE has dropped the letters "D" and "C" and split the sentence into three pieces.


I'm currently using the default OmegaT segmentation file, which I've tweaked for my needs. It works more or less OK with the various Japanese punctuation marks, but there are no rules in the Japanese part of the file that would trigger a segmentation like the one in the picture.


How can this strange segmentation be avoided?


I should add that manually merging the three segments above restores the original sentence, so it's not really a problem. Still, I wonder why this kind of segmentation occurs.

Can anybody please suggest how to tell CTE not to segment words like "HOME" (Latin characters) at each letter when they are included in a Japanese text? Currently I get this:


H

O

M

E


And some of these characters are omitted in the source editor. Only by manually joining all the segments can I get the whole sentence in a single segment.



Could one solution be to instruct the segmentation rules file not to break the segment at any Latin letter not followed by a punctuation mark or hard return?
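To make the idea concrete, here is a minimal Python sketch of that rule logic: break only after Japanese sentence-ending punctuation or at a hard return, and never just because the text switches to Latin letters. This is only an illustration, not how CTE or OmegaT actually segment; the function name `segment`, the punctuation set, and the second test sentence are my own assumptions.

```python
import re

# Assumed break points: Japanese sentence-ending punctuation, optionally
# followed by closing quotes/brackets, or one or more hard returns.
# A change of script (Japanese to Latin or back) is never a break point.
SENTENCE_END = re.compile(r"[。！？]+[」』）]*|\n+")


def segment(text: str) -> list[str]:
    """Split text into segments at the assumed break points only."""
    segments = []
    start = 0
    for match in SENTENCE_END.finditer(text):
        # Keep the punctuation with the segment it closes.
        piece = text[start:match.end()].strip()
        if piece:
            segments.append(piece)
        start = match.end()
    # Whatever follows the last break point is a segment of its own.
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


if __name__ == "__main__":
    for seg in segment("爽やかなプロセッコDOCビアンコに合わせて楽しんでくださいね。"):
        print(seg)
```

With rules of this shape, the sentence above stays in one piece and "DOC" or "HOME" is treated like any other run of characters. How to express the same thing in the actual rules file is a separate matter that this sketch does not cover.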

You might try to switch back to CafeTran's default segmentation rules for this specific project.

Thanks. This is what I actually did (selecting "Segmentation"), and I confirm that it works much better. It seems that the standard OmegaT segmentation rules are not good enough for Japanese.


One question: are the default segmentation rules hard-coded into CTE? If not, where exactly is the related file? If possible, I would need to add new rules to take into account the many unusual symbols and sentence constructions that are common in Japanese.
