
Adding two extra QA checks

It would be nice if two QA checks could be added:

  1. QA reversed translation consistency (AKA source segment consistency)
  2. QA 'tolerant' translation consistency, disregarding differences in (a) numbers, (b) punctuation marks, (c) leading and trailing white space and (d) tags. As far as I'm concerned, this check could also replace the current QA translation consistency.
In the meantime I could try to create a DIY solution, using the bilingual table. However, this isn't available for all file formats (e.g. not for SDLXLIFF). I could use the HTML export instead.

Nevertheless, a fully incorporated QA would be optimal :)


Of course, (a), (b), (c) and (d) should be ignored in check #1 too.
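To make the 'tolerant' comparison concrete, here is a minimal sketch in Python (not part of any existing QA feature). The function name `normalize`, the punctuation set and the rule that anything between `<` and `>` counts as a tag are all assumptions made for illustration.

```python
import re

TAG = re.compile(r"<[^>]+>")           # assumption: anything in angle brackets is a tag
DIGITS = re.compile(r"\d+")            # (a) numbers
PUNCT = re.compile(r"[!?,.:;“”’‘\"]")  # (b) assumed punctuation set
SPACES = re.compile(r"\s+")            # (c) white space

def normalize(segment: str) -> str:
    """Return a comparison key that ignores tags, numbers, punctuation and extra spaces."""
    text = TAG.sub("", segment)
    text = DIGITS.sub("", text)
    text = PUNCT.sub("", text)
    # Collapse runs of white space and drop leading/trailing white space
    return SPACES.sub(" ", text).strip()

# Two segments that differ only in tags, numbers, punctuation and spacing
print(normalize("  Press <b>OK</b> 3 times. ") == normalize("Press OK 12 times!"))
```

Two segments would then count as consistent when their keys are equal, whatever the original numbers, punctuation, spacing or tags were.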

I'm now experimenting with exporting the project to be checked as TMX. A few regex searches (3, possibly 2) turn it into a tab-delimited text file (see attached).


Next step: remove all spaces, numbers and punctuation marks. Problem: how to find the offending segments again later?
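One possible answer to that problem, sketched in Python: never overwrite the original text, but store it next to the cleaned-up comparison key. The names `key` and `index_by_key` are hypothetical, and the segment numbers are simply the order of the segments in the list, which may not match the numbering in the project.

```python
import re
from collections import defaultdict

def key(text: str) -> str:
    """Crude comparison key: drop digits and some punctuation, squeeze spaces."""
    text = re.sub(r"[\d!?,.:;]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def index_by_key(segments):
    """Map each comparison key to the (number, original text) pairs behind it."""
    index = defaultdict(list)
    for number, original in enumerate(segments, start=1):
        index[key(original)].append((number, original))
    return dict(index)

# Hypothetical sample segments
segments = ["Click Save.", "Click  Save!", "Close the file."]
index = index_by_key(segments)
for cleaned, originals in index.items():
    print(cleaned, originals)
```

Because each key keeps the untouched originals, any offending segment can still be looked up verbatim via the Find dialogue box.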

Attachments: txt (575 Bytes), tmx (1.96 KB)

After removing spaces and punctuation (grep), I'm left with this:


1.png


Sorting and deleting duplicates leaves 4 lines. I've colour-coded them:


2.png


The first three lines have the same source but different targets, which requires examination.


The fourth line has the same target as the first, and this requires further attention too.
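Both situations can be detected by grouping the cleaned-up lines twice, once by source and once by target. Below is a minimal Python sketch; the function name `find_inconsistencies` and the four sample lines are made up for illustration (they are not the contents of the screenshots), and the input is assumed to be already cleaned and tab-delimited.

```python
from collections import defaultdict

def find_inconsistencies(lines):
    """Return (sources with several targets, targets with several sources)."""
    by_source = defaultdict(set)
    by_target = defaultdict(set)
    for line in lines:
        source, _, target = line.partition("\t")
        by_source[source].add(target)
        by_target[target].add(source)
    same_source = {s: sorted(t) for s, t in by_source.items() if len(t) > 1}
    same_target = {t: sorted(s) for t, s in by_target.items() if len(s) > 1}
    return same_source, same_target

# Hypothetical cleaned lines: three share a source, the fourth shares a target
lines = [
    "Open the file\tOpen het bestand",
    "Open the file\tHet bestand openen",
    "Open the file\tBestand openen",
    "Open file\tOpen het bestand",
]
same_source, same_target = find_inconsistencies(lines)
print(same_source)
print(same_target)
```

Using sets means exact duplicate lines collapse automatically, so no separate "delete duplicates" pass is needed in this variant.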

Test document:


0.png


Creating a document with all source segments that have been translated differently:


1.png


Result:


2.png


Creating a document with target segments that refer to different source segments:


3.png


The result:


4.png


Regular expression to select the header of a TMX file:


Screen Shot 2016-10-15 at 18.29.27.png


The next step would be to write an AppleScript that automates these steps in e.g. TextWrangler. Starting from a TMX file exported from a translation project, it would create one file with all identical source segments that have different translations and another with all identical target segments that have different sources (in both cases disregarding punctuation marks etc.).


But I guess that Igor will provide a built-in solution faster than I can provide this AppleScript.


gnar gnar

The syntax to select segment start 'headers':


<tu.*?seg>
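To see what this lazy pattern does and why the order of the replacements matters, here is a small Python sketch. The sample translation unit is hypothetical and assumes that each `<tu>` sits on a single line, as the dot does not match line breaks.

```python
import re

# Hypothetical one-line translation unit
unit = ('<tu tuid="1"><tuv xml:lang="en"><seg>Hello</seg></tuv>'
        '<tuv xml:lang="nl"><seg>Hallo</seg></tuv></tu>')

# The lazy .*? stops at the first "seg>", so only the opening markup matches
first = re.search(r"<tu.*?seg>", unit).group(0)

# Same order as the grep steps: separator first, then closing markup, then start markup
step1 = re.sub(r"</seg.*?seg>", "\t", unit)
step2 = step1.replace("</seg></tuv></tu>", "")
row = re.sub(r"<tu.*?seg>", "", step2)

# Applied to the raw unit, the start pattern would also eat the second <tuv ...><seg>
too_early = re.sub(r"<tu.*?seg>", "", unit)

print(repr(first))
print(repr(row))
print(repr(too_early))
```

So `<tu.*?seg>` is only safe as a "segment start" pattern once the markup between source and target has been turned into a tab.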

For inspiration: http://macscripter.net/viewtopic.php?id=44894 Perhaps I'll start borrowing from there after my first caffeine shot. The problem with stuff like this is that it's quite addictive and thus time-consuming :).

Here's the preparation part:


tell application "TextWrangler"
  tell front text window
    -- Conversion from TMX to tab-del
    -- Insert tab characters between source and target segments
    replace "<\\/seg.*?seg>" using "\\t" options {search mode:grep, starting at top:true}
    -- Remove segment ending markup
    replace "</seg></tuv></tu>" using "" options {starting at top:true}
    -- Remove closing body markup
    replace "</body>" using "" options {starting at top:true}
    -- Remove closing TMX markup
    replace "</tmx>" using "" options {starting at top:true}
    -- Remove any TMX header
    replace "<\\?[\\w\\W]*<body>\\r" using "" options {search mode:grep, starting at top:true}
    -- Remove segment start markup
    replace "<tu.*?seg>" using "" options {search mode:grep, starting at top:true}

    -- Start cleaning up the tab-del
    -- Remove numbers
    replace "\\d+" using "" options {search mode:grep, starting at top:true}
    -- Remove punctuation characters
    replace "[\\!\\?,\\.:;“”’‘]" using "" options {search mode:grep, starting at top:true}
    -- Reduce space sequences to single spaces
    replace "[ ]{2,}" using " " options {search mode:grep, starting at top:true}
    -- Remove spaces at segment start
    replace "\\r[ ]" using "\\r" options {search mode:grep, starting at top:true}
    replace "\\t[ ]" using "\\t" options {search mode:grep, starting at top:true}
    -- Remove spaces at segment ending
    replace "[ ]\\r" using "\\r" options {search mode:grep, starting at top:true}
    replace "[ ]\\t" using "\\t" options {search mode:grep, starting at top:true}

    -- Still to do: remove duplicate lines
    -- Still to do: extract source segments with different translations
    -- Still to do: extract identical translations stemming from different sources
  end tell
end tell

Here's the finished script. You can test it with the attached TMX file.


Aim: check your translation project for both identical source segments with different translations and identical translations with different source segments, while ignoring differences in tags, numbers, punctuation marks and spaces.

Usage:

  • Export your current project as a TMX file.
  • Open it in TextWrangler.
  • Open the script in the Script Editor and run it.
  • Check the two files "Identical sources" and "Identical targets".
  • Make any necessary modifications to your project while navigating to the relevant segments via the Find dialogue box.

Have fun with this script!


tell application "TextWrangler"
  tell front text document
    -- Conversion from TMX to tab-del
    -- Insert tab characters between source and target segments
    replace "<\\/seg.*?seg>" using "\\t" options {search mode:grep, starting at top:true}
    -- Remove segment ending markup
    replace "</seg></tuv></tu>" using "" options {starting at top:true}
    -- Remove closing body markup
    replace "</body>" using "" options {starting at top:true}
    -- Remove closing TMX markup
    replace "</tmx>" using "" options {starting at top:true}
    -- Remove any TMX header
    replace "<\\?[\\w\\W]*<body>\\r" using "" options {search mode:grep, starting at top:true}
    -- Remove segment start markup
    replace "<tu.*?seg>" using "" options {search mode:grep, starting at top:true}

    -- Start cleaning up the tab-del
    -- Remove numbers
    replace "\\d+" using "" options {search mode:grep, starting at top:true}
    -- Remove punctuation characters
    replace "[!?,.:;“”’‘\"]" using "" options {search mode:grep, starting at top:true}
    -- Reduce space sequences to single spaces
    replace "[ ]{2,}" using " " options {search mode:grep, starting at top:true}
    -- Remove spaces at segment start
    replace "\\r[ ]" using "\\r" options {search mode:grep, starting at top:true}
    replace "\\t[ ]" using "\\t" options {search mode:grep, starting at top:true}
    -- Remove spaces at segment ending
    replace "[ ]\\r" using "\\r" options {search mode:grep, starting at top:true}
    replace "[ ]\\t" using "\\t" options {search mode:grep, starting at top:true}

    -- Remove duplicate lines
    process duplicate lines duplicates options {match mode:leaving_one} output options {deleting duplicates:true}

    -- Extract source segments with different translations
    -- (lines whose text up to the tab occurs more than once go to a new document)
    process duplicate lines duplicates options {match mode:matching_all, match pattern:"^.*?\\t", match subpattern key:entire_match} output options {duplicates to new document:true}
    save to ((path to desktop folder) as text) & "Identical sources"
    close

    -- Extract identical translations stemming from different sources
    -- (lines whose text after the tab occurs more than once go to a new document)
    process duplicate lines duplicates options {match mode:matching_all, match pattern:"\\t.*?$", match subpattern key:entire_match} output options {duplicates to new document:true}
    -- Sort on targets
    sort lines sorting options {match pattern:"\\t.*?$", sort subpattern key:entire_match} output options {replacing target:true}
    save to ((path to desktop folder) as text) & "Identical targets"
  end tell
end tell
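For TMX exports where a translation unit spans several lines, the grep approach would need adapting. As an alternative, here is a minimal sketch in Python that reads the source/target pairs with the standard XML parser instead. The helper names `plain_text` and `read_pairs` and the sample TMX are assumptions for illustration. The sketch takes the first two `<seg>` elements of each `<tu>` as source and target and drops everything inside inline elements such as `<bpt>` and `<ept>`, which would also drop text wrapped in `<hi>`.

```python
import xml.etree.ElementTree as ET

def plain_text(seg):
    """Text of a <seg> without the content of inline elements (bpt, ept, ph, ...)."""
    parts = [seg.text or ""]
    for child in seg:
        parts.append(child.tail or "")  # keep only the text following each inline element
    return "".join(parts)

def read_pairs(tmx_text):
    """Return a list of (source, target) pairs from a TMX string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = [plain_text(seg) for seg in tu.iter("seg")]
        if len(segs) >= 2:  # assumption: first seg is the source, second the target
            pairs.append((segs[0], segs[1]))
    return pairs

# Hypothetical sample with a unit spread over several lines
sample = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Press <bpt i="1">&lt;b&gt;</bpt>OK<ept i="1">&lt;/b&gt;</ept>.</seg></tuv>
      <tuv xml:lang="nl"><seg>Druk op OK.</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = read_pairs(sample)
print(pairs)
```

The resulting pairs could then go through the same cleaning and duplicate checks as the tab-delimited file.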

Attachment: tmx (1.45 KB)

And here's the script attached.

scpt

The latest version attached.

scpt