
Re-segmenting a TMX file

  • From the Task menu choose Split TMX units:


image


  • Set the split characters both for source and target language:

image


  • From the Memory menu choose Browse memory:

image


ctp
(17.6 KB)

Forget the above. I still think it's possible - No regexes, no aligning, no nothingness required - but it seems I forgot a step or two.


H.

Haha, I was just testing your procedure :). Next time, please test before you post. Think before you speak. Or even better: don't speak.


(Just joking here, like you always are. At least, I cannot imagine that you mean all the mean things that you write here and elsewhere. Okay, I'll concentrate on the regular expressions then.)


For what it's worth: here's the Word document I got by following your instructions.

docx
(36.8 KB)

>Lenting: Next time, please test before you post


I did. Got mixed up with all the test files on my desktop.


>Okay, I'll concentrate on the regular expressions then


Don't. Too complicated. My original solution should still work, with CT doing the segmentation, but it would require aligning, and I think that shouldn't be necessary.


H.

I cannot get the closing parenthesis into the first matching expression (the lookbehind).


Expression:


(?<=[a-z%\d][\.\!\?]) (?=([a-z]?[A-Z]))


Result:


image



The iPhones aren't matched because of the ) after 'surprise'. Not good.

Duh! A double escape is needed for the ):


(?<=[a-z%\d\\)][\.\!\?]) (?=([a-z]?[A-Z]))

image
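For anyone who wants to try these two expressions outside CT, here is a minimal sketch in Python. The sample sentence is made up, and I am assuming CT's regex engine treats lookbehind and lookahead the way Python's `re` module does. In Python at least, the `)` needs no escape inside a character class; the `\\)` form gives the same result here because it simply adds a literal backslash to the class next to the `)`.

```python
import re

# Lookahead group is written without capturing parentheses, because
# re.split would otherwise also return the captured text.
ORIGINAL = re.compile(r"(?<=[a-z%\d][.!?]) (?=[a-z]?[A-Z])")
WITH_PAREN = re.compile(r"(?<=[a-z%\d)][.!?]) (?=[a-z]?[A-Z])")
# Character class exactly as posted in the thread (backslash, then parenthesis).
AS_POSTED = re.compile(r"(?<=[a-z%\d\\)][\.\!\?]) (?=[a-z]?[A-Z])")

# Hypothetical test sentence, not taken from the TMX file in this thread.
text = "What a surprise (not really). iPhones sell well. Sales rose 5%. Next year too."

original_parts = ORIGINAL.split(text)
paren_parts = WITH_PAREN.split(text)
posted_parts = AS_POSTED.split(text)

for label, parts in (("original", original_parts), ("with paren", paren_parts)):
    print(label)
    for part in parts:
        print("   ", part)
```

With the first expression the iPhones sentence stays glued to the previous one; with the `)` in the class it is split off.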


>Lenting: The iPhones aren't matched because of the ) after 'surprise'


I know, that's why I put the ) in.


I still think it's a good idea to let CT do the segmentation: you wouldn't need a very complex regex, and the segmentation would be the same as the source document shows, if it's a CT project. If not, you'll have AA insertable fragments. So forget about those regexes. My problem at the moment is that I can't seem to avoid aligning. It's not a real problem at this stage, but I want to reduce the number of steps.


H.

  1. Open the wretched TMX file in CT's Edit TMX mode
  2. Export as HTML (other export formats may be possible)
  3. Open the HTML in Word, save as .docx
  4. Select the two Word columns one at a time and open each in CT, with sentence segmentation enabled
  5. Save
  6. Align (auto)
H.

>My problem at the moment is that I can't seem to avoid aligning.


That's what I wrote to you some days ago.


>It's not a real problem at this stage


It actually is a big problem. At least for me. Every alignment introduces possible new errors. I feel about aligning the way you feel about regular expressions :).

BTW: The solution suggested for oT (in the oT forum) is nice too. Simply changing a segmentation setting in the TMX file.

In this case, aligning problems can only occur if the punctuation of the TL differs from the punctuation of the SL. Unlikely, but it can happen, and you'll find out soon enough. Regexes can - and therefore will - wreak havoc on the TM, and they'll suffer from exactly the same problem.
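To make the punctuation point concrete, here is a rough Python sketch, not what CT does internally. The function name `split_unit` and the sample sentences are invented for illustration, and the split pattern is the one discussed earlier in this thread. It splits source and target with the same rule and only pairs the pieces when both sides yield the same number of sentences; otherwise it gives up on that unit.

```python
import re

# Split rule from earlier in the thread (lookahead written without a capturing group).
SPLIT = re.compile(r"(?<=[a-z%\d)][.!?]) (?=[a-z]?[A-Z])")

def split_unit(source, target, pattern=SPLIT):
    """Split one translation unit into sentence pairs.

    Returns a list of (source, target) pairs, or None when the two sides
    split into a different number of sentences and cannot be paired safely.
    """
    src_parts = pattern.split(source)
    tgt_parts = pattern.split(target)
    if len(src_parts) != len(tgt_parts):
        return None  # punctuation differs: leave this unit for manual review
    return list(zip(src_parts, tgt_parts))

# Same punctuation on both sides: the unit splits cleanly.
ok = split_unit("It works. Try it.", "Het werkt. Probeer het.")

# Target uses a semicolon where the source has a full stop: no safe pairing.
bad = split_unit("It works. Try it.", "Het werkt; probeer het.")

print(ok)
print(bad)
```

Equal counts are no guarantee that the pairs really belong together, so this only catches the obvious mismatches.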


>Simply changing a segmentation setting in the TMX file.


In that case, our troubles are over. I doubt if it works.


H.
