wrong tags in source txt

Hello, 

I'm still exploring the software and I ran into this issue:
[screenshot]


Why is the source text segmented like this? The dots do not even exist in the source text, and I don't understand why they are treated as segments in the first place.

Could the dots represent non-breaking spaces? You can check via the pilcrow symbol (¶). Furthermore: what is the source of the document? Web? Scan? You could check whether the MS Word OCR gives a better import.
By segmenting we mean how text is chopped into pieces. Your screenshot seems to show a correctly segmented sentence.

Hi!

The source is a Word file. I will try the OCR solution.


But how is the segmentation correct?

segment 1: the white house said it believes Iran is

segment 2: .

segment 3: planning to supply Russia

segment 4: .

segment 5: "with several hundred" ..........

What you see in the box is one segment. The dot represents a fixed space.
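
If you want to confirm this outside the editor, here is a minimal Python sketch that reports where fixed (non-breaking) spaces occur in a piece of text. It assumes you have pasted the sentence into a plain string; the function name `find_fixed_spaces` and the sample sentence are only illustrative and not part of the software.

```python
# Characters commonly used as fixed (non-breaking) spaces.
NBSP_CHARS = {
    "\u00a0": "NO-BREAK SPACE",
    "\u202f": "NARROW NO-BREAK SPACE",
}

def find_fixed_spaces(text):
    """Return a list of (index, name) for each non-breaking space in text."""
    return [(i, NBSP_CHARS[ch]) for i, ch in enumerate(text) if ch in NBSP_CHARS]

# Hypothetical sample: the sentence from the screenshot with fixed spaces
# where the editor shows dots.
sample = "Iran is\u00a0planning to supply Russia\u00a0with several hundred"
hits = find_fixed_spaces(sample)
for index, name in hits:
    print(f"position {index}: {name}")
```

Any position it reports is a character that looks like an ordinary space in Word but is stored as a different code point.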

Thank you so much! 

Can the source text & its tags be edited?

You have to activate editing via Edit > Edit source segments


BUT!!!


  1. You can only edit the text (e.g. a typo)
  2. Never (ever) remove a tag or change the order of the tags

If you want to have segments without these non-breaking spaces (at least, that is what I guess they are):

Remove them in the source, as far as this is possible.
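
As a rough illustration of that cleanup, assuming you work on a plain-text copy of the source rather than on the Word file itself, a sketch like the following swaps fixed spaces for ordinary ones. The function name `replace_fixed_spaces` is made up for this example, and it does not touch any formatting or tags in the original document.

```python
def replace_fixed_spaces(text):
    """Replace non-breaking spaces (U+00A0, U+202F) with ordinary spaces."""
    return text.replace("\u00a0", " ").replace("\u202f", " ")

# Hypothetical sample text containing one fixed space.
cleaned = replace_fixed_spaces("Iran is\u00a0planning to supply Russia")
print(cleaned)
```

Do this on a copy, so the original document stays available if the result is not what you expected.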

Note: there must be some extra formatting around these spaces, since normally they aren't surrounded by tags.

Perhaps you can use another (cleaner) MS Word document for your learning?