
Re-segmenting a TMX

In 2012 I posted a question about how to re-segment TMX files that contain several sentences in one Translation Unit (TU, the bilingual items of any TMX file):


http://www.proz.com/forum/cat_tools_technical_help/232142-re_segmentation_of_tmx_files_is_there_a_tool_for_this.html


Recently, Michael brought this intriguing topic back to my attention. And, as these things go, this morning I got the idea to use a workaround via glossaries, which, BTW, are amongst the best features that CafeTran offers!


I created the attached glossary for a quick test (note that I've removed the optional trailing punctuation marks and spaces at this stage). From the Memory menu, I selected Import from glossary, et voilà, we're already halfway there:


image


Any chance that we could get a special alignment mode here, one that produces this:


First sentence=Eerste zin

Second sentence=Tweede zin

Third sentence=Derde zin


I think that this could be really useful, like all things that can be done with glossaries (if you have an open mind).
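The one-to-one pairing asked for above could be sketched roughly as follows. This is only an illustration in Python, not something CafeTran does: the function name `pair_sentences` is hypothetical, and it assumes that the source and target sides of a TU have already been split into the same number of sentences.

```python
def pair_sentences(source_parts, target_parts):
    """Pair source and target sentences by position (hypothetical helper).

    Assumes both sides were already split into sentences. If the counts
    differ, positional pairing is unsafe, so the function refuses.
    """
    # Drop empty pieces and surrounding whitespace on both sides.
    src = [s.strip() for s in source_parts if s.strip()]
    tgt = [t.strip() for t in target_parts if t.strip()]
    if len(src) != len(tgt):
        raise ValueError(
            f"cannot pair {len(src)} source with {len(tgt)} target sentences"
        )
    # One glossary-style line per sentence pair.
    return [f"{s}={t}" for s, t in zip(src, tgt)]


if __name__ == "__main__":
    lines = pair_sentences(
        ["First sentence", "Second sentence", "Third sentence"],
        ["Eerste zin", "Tweede zin", "Derde zin"],
    )
    print("\n".join(lines))
```

TUs where the two sides contain a different number of sentences raise an error here and would still have to be checked by hand.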

txt
(81 Bytes)

Second test, with a longer glossary and some punctuation characters and trailing spaces (see the attached glossary if you want to replicate the test):


image


The idea is to use regular expressions (which I find really useful, BTW) to get from something like:


<tu tuid="1">
<tuv xml:lang="en-GB"><seg>Second sentence! Third sentence? Fourth sentence, </seg>
</tuv>

 To:


<tu tuid="1">
<tuv xml:lang="en-GB"><seg>Second sentence! ;Third sentence? ;Fourth sentence, </seg>
</tuv> 
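As a rough sketch of that replacement, here is one way it could be done in Python outside CafeTran. The pattern is my assumption, not a tested recipe: it treats `.`, `!` or `?` followed by a single space and more text as a sentence boundary, and only touches text between `<seg>` and `</seg>`. The function name `mark_sentence_breaks` is made up for this example.

```python
import re

# Text between <seg> and </seg>; assumes the seg element has no attributes.
SEG = re.compile(r"(<seg>)(.*?)(</seg>)", re.DOTALL)
# Sentence-final punctuation plus one space, only when more text follows.
BREAK = re.compile(r"([.!?] )(?=\S)")


def mark_sentence_breaks(tmx_text, marker=";"):
    """Insert `marker` before each new sentence inside <seg> elements."""

    def fix_segment(match):
        opening, body, closing = match.groups()
        marked = BREAK.sub(lambda b: b.group(1) + marker, body)
        return opening + marked + closing

    return SEG.sub(fix_segment, tmx_text)


if __name__ == "__main__":
    line = (
        '<tuv xml:lang="en-GB"><seg>Second sentence! '
        "Third sentence? Fourth sentence, </seg>"
    )
    print(mark_sentence_breaks(line))
```

Note that a pattern this simple will also split after abbreviations such as "e.g. ", so the result needs a check before importing.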

 DISCLAIMER: If you don't have the flexibility to use glossaries, please ignore this posting.


txt
(150 Bytes)

Tell Michael I will implement the automatic TMX segment split if he translates his every second document in CT. Deal? :) 



Pretty simple, at least in the case above.


  1. Open the TMX file in CT Edit TM
  2. Export it to a two-column HTML file
  3. Open the HTML in Word
  4. Select the SL column, and save it as a file. Repeat with the TL column
  5. Use a regex to replace all punctuation in both files with a hard return, or, if you want to keep the punctuation, use a regex that adds a hard return after the punctuation (this may result in empty "rows" which should be deleted, either before or after point 6)
  6. Align using CT's aligner (auto should do) or another aligner
  7. Import the aligned file into a TMX memory, because tab-delimited glossaries are pure and utter shit
You will lose any and all metadata present in the original TMX file.
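Step 5 (the variant that keeps the punctuation) could look something like this in Python. It is a minimal sketch under my own assumptions: sentences end in `.`, `!` or `?`, and `split_after_punctuation` is just a name chosen for the example. Empty rows are dropped in the same pass.

```python
import re


def split_after_punctuation(text):
    """Add a hard return after sentence-final punctuation, drop empty rows."""
    # Keep the punctuation (runs like "..." stay together), replace the
    # whitespace that follows it with a line break.
    with_breaks = re.sub(r"([.!?]+)\s*", r"\1\n", text)
    # Remove the empty rows that the replacement can leave behind.
    return [line.strip() for line in with_breaks.split("\n") if line.strip()]


if __name__ == "__main__":
    for row in split_after_punctuation(
        "Second sentence! Third sentence? Fourth sentence."
    ):
        print(row)
```

Run it once on the SL file and once on the TL file, then feed both results to the aligner in step 6.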

H.

>Align using CT's aligner (auto should do) or another aligner


No aligning please.
