
Re-segmenting a TMX file

  • From the Task menu choose Split TMX units:


image


  • Set the split characters for both the source and the target language:

image


  • From the Memory menu choose Browse memory:

image


ctp
(17.6 KB)

How about adding a checkbox [ ] Remove split characters?
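
For readers wondering what splitting a unit on a marker amounts to, here is a minimal Python sketch. It is not CafeTran's implementation: the function name `split_unit` and the fallback rule (leave the unit untouched when source and target contain a different number of parts) are assumptions made for illustration. The markers are dropped from the output, which is what the suggested checkbox would do.

```python
from typing import List, Tuple


def split_unit(source: str, target: str, split_char: str = "#") -> List[Tuple[str, str]]:
    """Split one translation unit into smaller pairs at split_char.

    Hypothetical helper, not part of CafeTran. The split character itself
    is removed from the resulting segments.
    """
    src_parts = [p.strip() for p in source.split(split_char) if p.strip()]
    tgt_parts = [p.strip() for p in target.split(split_char) if p.strip()]

    # Assumption: if the two sides do not split into the same number of
    # parts, pairing them up would misalign the TM, so keep the unit as is.
    if len(src_parts) != len(tgt_parts):
        return [(source, target)]

    return list(zip(src_parts, tgt_parts))


if __name__ == "__main__":
    for pair in split_unit("One. #Two.", "Een. #Twee."):
        print(pair)
```

A unit whose two sides split into different numbers of parts comes back unchanged, so it can be reviewed by hand.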

First attempt to create a regular expression to insert # between sentences in TUs that contain multiple sentences.


Regular expression for finding: (?<=[a-z][\.\!\?]) (?=([A-Z]))

Replacement string:  #

(note the space before the #)
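
To try the expression outside CafeTran, the following Python sketch applies the same find and replace strings to a made-up sample. CafeTran uses Java's regex engine, while this uses Python's `re` module; the pattern only relies on syntax both support (a fixed-width lookbehind and a lookahead), but treat the equivalence as an assumption and test in CafeTran itself. The function name `mark_sentence_breaks` and the sample text are illustrative only.

```python
import re

# The pattern quoted above: a single space that is preceded by a lowercase
# letter plus . ! or ? and followed by an uppercase letter.
SENTENCE_GAP = re.compile(r"(?<=[a-z][\.\!\?]) (?=([A-Z]))")


def mark_sentence_breaks(segment: str) -> str:
    """Replace each matched space with ' #' (space, then the marker)."""
    return SENTENCE_GAP.sub(" #", segment)


if __name__ == "__main__":
    sample = "This is one sentence. Here is another! And a third? Yes."
    print(mark_sentence_breaks(sample))
    # This is one sentence. #Here is another! #And a third? #Yes.
```

Because the lookbehind and lookahead consume nothing, only the space itself is replaced, and the sentence-final punctuation and the following capital stay in place.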


Finding:


image


Replacing:


image


Of course this regular expression has to be fine-tuned.


Note that you can perform these actions on the fly, during actual translation work. No import/export or saving in Word as HTML or other cumbersome tricks needed.

Lenting: Of course this regular expression has to be fine-tuned.


And even then it won't work.


H.

>Lenting: Of course this regular expression has to be fine-tuned.

 

Sorry to disagree, Mr. Hans van den Broek! Works as advertised here. Great feature, Igor. Many thanks.

And the video ...


Re-segmenting a TM - Part 1

https://youtu.be/h7xEoARMKB0


Re-segmenting a TM - Part 2

https://youtu.be/gEBbA4okhdk

And the files to replicate ...

tmx
(1.75 KB)
docx
(29.2 KB)

Lenting: Of course this regular expression has to be fine-tuned.


Yes.


image


You can argue it doesn't matter much (and I'd agree), but the screenshot above shows again how treacherous regexes are.

Thanks for the nice example! Examples like these help to improve the expression. It's all pretty straightforward, except for the fourth sentence, which starts with a lowercase letter. Does the normal sentence segmentation catch this one? Hey Igor, I know you're busy on the beach, but how about letting CafeTran use its segmentation rules here? (See what's happening here? You give me one finger, and I'll grab your hand. As long as it's your hand and not another part of your body ;)).
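
The kinds of failures being discussed can be reproduced with a short Python sketch. The sample sentences below are made up for illustration and are not the text from the screenshot; the pattern is the one posted earlier in the thread, run here with Python's `re` module rather than CafeTran's Java engine.

```python
import re

# Pattern from earlier in the thread, repeated so this snippet runs on its own.
SENTENCE_GAP = re.compile(r"(?<=[a-z][\.\!\?]) (?=([A-Z]))")


def mark_sentence_breaks(segment: str) -> str:
    return SENTENCE_GAP.sub(" #", segment)


if __name__ == "__main__":
    # False positive: an abbreviation followed by a name looks like a break.
    print(mark_sentence_breaks("Please ask Mr. Smith about it."))
    # Please ask Mr. #Smith about it.

    # Missed break: the next sentence starts with a lowercase letter.
    print(mark_sentence_breaks("It works. iOS is supported too."))
    # It works. iOS is supported too.

    # Missed break: the sentence ends in a digit or an uppercase letter.
    print(mark_sentence_breaks("It was 2019. Then we moved to TMX. It helped."))
    # It was 2019. Then we moved to TMX. It helped.
```

Widening the character classes fixes some of these cases and tends to create new false positives, which is why a rule-based segmenter with an abbreviation list usually does better than a single expression.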

Lenting: All pretty straightforward


I doubt it. And I'm quite sure there will be many other "exceptions." Before His Igorness came up with his solution, I tried to write a regex myself. I used a short, real-life text (an email from my baby sister, who will turn 55 next Monday). That already showed one of the above problems. I admit I wrote the text for the RegExr screenshot above myself.


...how about letting CafeTran use its segmentation rules here?


That may or may not be possible. Unless, of course, you use my "solution," the one I provided on ProZ and here. That would most certainly be possible, and aligning shouldn't be a problem, especially not in this case.


H.


I imported a .docx of the above (plus blabla) into CT. Perfect!


image


H.

docx
Van den Broek: I admit I wrote the text for Regexr screenshot above myself.


That's fine with me. Not a problem. BTW: does the tester have a setting for the Java flavour? Anyway, I prefer testing in CafeTran. And of course you can make up a new example that won't match an improved expression. That's my way of keeping you away from your work.

It looks like it's all much easier.

  1. Open the wretched TMX file in CT's Edit TMX mode.
  2. Export as HTML (other export formats may be possible).
  3. Open the HTML in Word and save as .docx.
  4. Drop the .docx on the Dashboard.
  5. Save as TMX.

No regexes, no aligning, no nothingness required.

H.
