I wonder why, even though segmentation is set to sentence, the program treats even small white spaces as a reason to start a new segment. When a document contains many undesirable spaces after conversion, it is really tiresome to join segments that often consist of separate words.
Is there any way to avoid such segmentation?
It shouldn't. What file format? Perhaps a screenshot?
The default "Sentence" segmentation does segment on multiple white spaces. However, please try the alternative segmentation based on Rules.srx, which you can select in the Edit > Preferences > Segmentation drop-down list when you create a new project.
I ran into the same problem. Lots of double spaces in an HTML file. Also, since some tags are seen as a space, CT segments at some tags (even though I'm using "Sentence" without having ticked "segment at all tags").
What are the differences between the "Sentence" segmentation rules and the actual Rules.srx?
You might try to use the segmentation based on Rules.srx. It does not segment on multiple white spaces while the Sentence segmentation does so.
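To make the difference concrete, here is a rough Python sketch of the two behaviours described above. The patterns and the `segment` function are my own assumptions for illustration only; they are not the actual rules in CT's "Sentence" option or in Rules.srx.

```python
import re

# Hypothetical patterns for illustration; not CafeTran's real rules.
SENTENCE_END = r"(?<=[.!?])\s+"  # break after sentence-final punctuation
MULTI_SPACE = r"\s{2,}"          # break on a run of two or more white spaces


def segment(text, break_on_multiple_spaces):
    """Split text into segments, optionally also breaking on multiple spaces."""
    pattern = SENTENCE_END
    if break_on_multiple_spaces:
        pattern = f"{SENTENCE_END}|{MULTI_SPACE}"
    # Drop empty strings that can appear at the edges of the text.
    return [s for s in re.split(pattern, text) if s]


text = "The pump  is located at  site X. Start it."
print(segment(text, break_on_multiple_spaces=True))
print(segment(text, break_on_multiple_spaces=False))
```

With the extra break on multiple spaces, the stray double spaces cut the first sentence into fragments; without it, only the sentence boundary produces a new segment.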
Ok, thanks. Any other differences I should worry about for html files in the Rules.srx file?
Ok, the Rules.srx file took care of the double spaces, but why does CT still segment at tags? See images.
CT view (the segments before and after tag 1 were joined manually). I have not yet joined the third segment of the sentence (...at location X,...).
The HTML filter also segments on HTML tags, so the user needs to join segments at tags that form part of a segment, like the strong tag in your example.
Ok, I guess if you didn't segment at all tags, then the <li>, <br> tags, etc. would not be segmented either. Fair enough.
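For anyone who wants to see the trade-off in this thread spelled out, here is a small Python sketch that compares breaking at every tag with keeping inline tags inside the segment. The `INLINE_TAGS` set and the `split_at_tags` function are hypothetical and only illustrate the idea; this is not how CT's HTML filter is implemented or configured.

```python
import re

# Hypothetical set of inline tags; an assumption for illustration only.
INLINE_TAGS = {"strong", "em", "b", "i", "span", "a"}

TAG = re.compile(r"<(/?)(\w+)[^>]*>")


def split_at_tags(html, keep_inline_together):
    """Split HTML text at tags; optionally keep inline tags inside the segment."""
    segments, current, pos = [], "", 0
    for m in TAG.finditer(html):
        current += html[pos:m.start()]
        pos = m.end()
        if keep_inline_together and m.group(2).lower() in INLINE_TAGS:
            current += m.group(0)  # inline tag stays inside the running segment
        else:
            if current.strip():
                segments.append(current.strip())
            current = ""  # any other tag closes the running segment
    current += html[pos:]
    if current.strip():
        segments.append(current.strip())
    return segments


html = "<li>Install the unit <strong>at location X</strong>, then test it.</li>"
print(split_at_tags(html, keep_inline_together=False))
print(split_at_tags(html, keep_inline_together=True))
```

Breaking at every tag gives three fragments that have to be joined by hand, which matches the screenshots described above; treating strong as inline keeps the sentence in one segment while <li> still acts as a boundary.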