
Re-segmenting a TMX file

  • From the Task menu choose Split TMX units:


image


  • Set the split characters both for source and target language:

image


  • From the Memory menu choose Browse memory:

image


ctp
(17.6 KB)

In this case, alignment problems can only occur if the punctuation of the TL differs from that of the SL. Unlikely, but it can happen, and you'll find out soon enough. Regexes can (and therefore will) wreak havoc on the TM, and they'll suffer from exactly the same problem.
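To make the punctuation point concrete, here is a minimal Python sketch of splitting one translation unit on split characters. It is not CafeTran's implementation: the names `SPLIT_CHARS`, `split_segment` and `split_unit` are made up for illustration, and the rule "split after a split character followed by whitespace" is an assumption. The unit is only split when source and target yield the same number of parts; otherwise it is left intact and flagged.

```python
import re

# Hypothetical split characters; in CafeTran you set these in the dialog.
SPLIT_CHARS = ".!?"


def split_segment(text, split_chars=SPLIT_CHARS):
    """Split after any split character that is followed by whitespace."""
    pattern = r"(?<=[" + re.escape(split_chars) + r"])\s+"
    return [part for part in re.split(pattern, text.strip()) if part]


def split_unit(source, target, split_chars=SPLIT_CHARS):
    """Return (pairs, ok).

    ok is False when source and target split into a different number
    of parts; in that case the original unit is returned unchanged.
    """
    src_parts = split_segment(source, split_chars)
    tgt_parts = split_segment(target, split_chars)
    if len(src_parts) != len(tgt_parts):
        return [(source, target)], False
    return list(zip(src_parts, tgt_parts)), True


# Same punctuation on both sides: the unit splits cleanly.
pairs, ok = split_unit("It works. Really?", "Het werkt. Echt?")

# The target uses a comma where the source has a full stop: no safe split.
kept, ok_mismatch = split_unit("It works. Really?", "Het werkt, echt?")
```

With equal punctuation, `pairs` holds two sub-units; with the comma in the target, the unit comes back whole and `ok_mismatch` is `False`, which is the case that would need a manual check.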


>Simply changing a segmentation setting in the TMX file.


In that case, our troubles are over. I doubt it works, though.


H.

BTW: The solution suggested for oT (in the oT forum) is nice too. Simply changing a segmentation setting in the TMX file.

>My problem at the moment is that I can't seem to avoid aligning.


That's what I wrote to you some days ago.


>It's not a real problem at this stage


It actually is a big problem. At least for me. Every alignment introduces new possible errors. I have the same feelings towards aligning as you have towards regular expressions :).

  1. Open the wretched TMX file in CT's Edit TMX mode
  2. Export as HTML (other export formats may be possible)
  3. Open the HTML in Word, save as .docx
  4. Select both Word columns one by one, and open them in CT one by one, sentence segmentation enabled
  5. Save
  6. Align (auto)
H.

Lenting: The iPhones aren't matched because of the ) after 'surprise'


I know, that's why I put the ) in.


I still think it's a good idea to let CT do the segmentation: You wouldn't need a very complex regex, and it would be the same segmentation as the source document shows, if it's a CT project. If not, you'll have AA insertable fragments. So forget about those regexes. My problem at the moment is that I can't seem to avoid aligning. It's not a real problem at this stage, but I want to reduce steps.


H.

image


Duh! A double escape is needed for the ):


(?<=[a-z%\d\\)][\.\!\?]) (?=([a-z]?[A-Z]))
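For anyone who wants to check the effect of adding the ) outside CafeTran, here is a small Python sketch. Assumptions: the sample sentence is made up (only "iPhones" and "surprise" come from the screenshot discussion), Python's `re` module stands in for the Java flavour CafeTran uses, the ) is escaped once instead of twice, and the group in the lookahead is made non-capturing so that `re.split` does not return it as an extra item.

```python
import re

# Expression without the closing parenthesis in the lookbehind class.
WITHOUT_PAREN = re.compile(r"(?<=[a-z%\d][\.\!\?]) (?=(?:[a-z]?[A-Z]))")

# Same expression with ) added to the class (single escape in Python).
WITH_PAREN = re.compile(r"(?<=[a-z%\d\)][\.\!\?]) (?=(?:[a-z]?[A-Z]))")

# Made-up sample text, not the text from the screenshot.
SAMPLE = (
    "Sales rose by 5%. Nobody expected it (what a surprise). "
    "iPhones sold well. Really? Yes!"
)

# Splits on the space between sentences; the punctuation stays
# with the preceding sentence because it sits in the lookbehind.
before = WITHOUT_PAREN.split(SAMPLE)
after = WITH_PAREN.split(SAMPLE)

for label, parts in (("without )", before), ("with )", after)):
    print(label)
    for part in parts:
        print("  |", part)
```

Without the ), the sentence ending in "surprise)." stays glued to the iPhones sentence; with it, the text falls apart into five segments. The optional `[a-z]?` in the lookahead is what lets a sentence start with "iPhones".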

I cannot get the closing parenthesis into the first matching expression.


Expression:


(?<=[a-z%\d][\.\!\?]) (?=([a-z]?[A-Z]))


Result:


image



The iPhones aren't matched because of the ) after 'surprise'. Not good.

Lenting: Next time, please test before you post


I did. Got mixed up with all the test files on my desktop.


>Okay, I'll concentrate on the regular expressions then


Don't. Too complicated. My original solution still should work, with CT doing the segmentation, but it would require aligning, and I think that wouldn't be necessary.


H.

Haha, I was just testing your procedure :). Next time, please test before you post. Think before you speak. Or even better: don't speak.


(Just joking here, like you always are. At least, I cannot imagine that you mean all the mean things that you write here and elsewhere. Okay, I'll concentrate on the regular expressions then.)


For what it's worth: here's the Word document made according to your instructions.

docx
(36.8 KB)

Forget the above. I still think it's possible (no regexes, no aligning, no nothingness required), but it seems I forgot a step or two.


H.

It looks like it's all much easier.

  1. Open the wretched TMX file in CT's Edit TMX mode
  2. Export as HTML (other export formats may be possible)
  3. Open the HTML in Word, save as .docx
  4. Drop the .docx on the Dashboard
  5. Save as TMX.

No regexes, no aligning, no nothingness required.

H.

Van den Broek: I admit I wrote the text for the Regexr screenshot above myself.


That's fine with me. Not a problem. BTW: does the tester have a setting for the Java flavour? Anyway, I prefer testing in CafeTran. And of course you can make up a new example that won't match an improved expression. That's my way to keep you away from your work.

I imported a .docx of the above (plus blabla) in CT. Perfect!


image


H.

docx

Lenting: All pretty straightforward


I doubt it. And I'm quite sure there will be many other "exceptions." Before His Igorness came up with his solution, I tried to write a regex myself. I used a short, real-life text (an email by my baby-sister who will turn 55 next Monday). That already showed one of the above problems. I admit I wrote the text for the Regexr screenshot above myself.
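Two such "exceptions" are easy to reproduce. The Python sketch below applies the lookbehind/lookahead expression discussed in this thread to two made-up sentences. Assumptions: Python's `re` stands in for the Java flavour, the lookahead group is non-capturing, and the helper name `segments` is invented for the example.

```python
import re

# The thread's expression, with ) in the lookbehind class.
SPLIT = re.compile(r"(?<=[a-z%\d\)][\.\!\?]) (?=(?:[a-z]?[A-Z]))")


def segments(text):
    """Split text on the spaces the expression accepts as boundaries."""
    return SPLIT.split(text)


# An abbreviation followed by a capitalised name causes a false split.
abbreviation = segments("Dr. Smith arrived. He left.")

# A sentence ending in a closing quote is not split at all, because
# the character right before the space is a quote, not . ! or ?
quoted = segments('She said "no." Then she left.')

print(abbreviation)
print(quoted)
```

The first case splits "Dr." off as a segment of its own; the second stays one segment although it contains two sentences.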


>...how about letting CafeTran use its segmentation rules here?


That may or may not be possible. Unless, of course, you use my "solution," the one I provided on ProZ and here. That would most certainly be possible, and aligning shouldn't be a problem, especially not in this case.


H.

