
Segmentation error with tags

Hi Igor,


I've finally twigged that my files don't always segment correctly at colons because of tags.


Where a colon is followed by a change of font (e.g.

Interviewer: Well Mr. Jones, ... )

CT fails to segment at the colon (presumably because of the font tag). This is not correct behaviour. Segmentation should ignore tags in this context.


Please fix.


Thanks,
Jeremy



Hey Igor, has this been fixed?

With the current implementation of the segmentation function, the formatting tags cannot be included in the segmentation rules. In this case, you might just specify the colon as the break (beforebreak rule) in the .srx rules. Then, you should achieve the same effect.

Hi Igor,


I think you've misunderstood – I'm not wanting or trying to include tags in the segmentation rules and the colon is specified as the break. The issue is that the segmentation fails when the colon is followed by a tag.


My rule looks like this:

<rule break="yes">
  <beforebreak>[\.\?\!\:]+</beforebreak>
  <afterbreak>\s</afterbreak>
</rule>


Now while I concede that the <afterbreak> bit could probably, as you suggest, be deleted, it seems to me that segmentation should nonetheless break at a colon followed by a space even where the format changes after the colon. In other words, tags should ideally be virtually stripped out before segmentation points are identified. I can see that that might be tricky to code, however.
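To make the "virtually stripped out" idea concrete, here is a rough Python sketch. It is not how CT works internally. It assumes that the font tag sits between the colon and the space, that inline tags look like `<f1>` (a made-up placeholder syntax), and that the rule is applied as a plain regex. The helper names are hypothetical. The sketch matches the rule against a tag-free copy of the text and maps each break position back to the original text.

```python
import re

# The rule from this thread, written as two regex fragments.
BEFORE_BREAK = r"[\.\?\!\:]+"
AFTER_BREAK = r"\s"

# Assumption: inline tags look like <f1>; the real tag syntax may differ.
TAG = re.compile(r"<[^>]+>")


def strip_tags_with_map(text):
    """Return the text without tags, plus the original index of each kept character."""
    plain_chars = []
    offsets = []
    pos = 0
    for tag in TAG.finditer(text):
        for i in range(pos, tag.start()):
            plain_chars.append(text[i])
            offsets.append(i)
        pos = tag.end()  # skip over the tag itself
    for i in range(pos, len(text)):
        plain_chars.append(text[i])
        offsets.append(i)
    return "".join(plain_chars), offsets


def break_points(text):
    """Break positions when the rule is matched directly against `text`."""
    rule = re.compile(f"(?:{BEFORE_BREAK})(?={AFTER_BREAK})")
    return [m.end() for m in rule.finditer(text)]


def break_points_ignoring_tags(text):
    """Break positions in the original text, found on a tag-free copy."""
    plain, offsets = strip_tags_with_map(text)
    # A break at plain index p falls right after the original position
    # of the last character matched by the beforebreak pattern.
    return [offsets[p - 1] + 1 for p in break_points(plain)]


sample = "Interviewer:<f1> Well, good morning"
print(break_points(sample))                # [] : the tag hides the space
print(break_points_ignoring_tags(sample))  # [12] : break right after the colon
```

In this sketch the break lands immediately after the colon, so the tag travels with the following segment. Placing the break after the tag instead would be an equally reasonable choice.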


Jeremy

