This would be a very dangerous feature, but a useful one nevertheless: a way to move tags that sit inside words to outside those words, in order to enable term recognition
(especially in sloppy Studio projects).
Perhaps it's already possible via regex?
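To make the regex idea concrete, here is a minimal sketch in Python. It is not a CT feature or setting; it only assumes that a "tag inside a word" is a numbered placeholder such as <1> with letters directly on both sides. The function name and the sample segment are made up for illustration.

```python
import re

# Assumption: an in-word tag is a numbered placeholder like <1>
# with word characters directly before and after it.
INWORD_TAG = re.compile(r"(\w+)(<\d+>)(\w+)")

def move_tag_before_word(segment: str) -> str:
    """Move a numbered tag from inside a word to just before that word.

    Single pass only: a word containing two such tags is not fully cleaned.
    """
    return INWORD_TAG.sub(r"\2\1\3", segment)

print(move_tag_before_word("This is a <b>test c<1>ase</b>."))
# This is a <b>test <1>case</b>.
```

Whether "just before the word" is a safe place for the tag depends on the surrounding tags, which is exactly why this could be dangerous.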
Until now I'd recommend CodeZapper (or maybe TransTools, which I have not tested yet) for any Word file, and don't forget the special filter for OCR'ed docs inside CT.
For any XML file I do not think this will work. Just imagine the following cases:
This is a <b>test c<1>ase</b>.
This is a <b>test case</b> n<1>o. <b>4</b>.
Here <b> sets the text in bold and <1> is a meaningless tag (a very simple case, indeed). Where should CT put the <1> tag? Moving it would be possible here (in the sense of "put the tag between the other tags, as in the source text"), but in many other cases it would produce a tag error, I assume.
If you are only speaking of term recognition, indeed, I agree, it would be nice to have these tags ignored.
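Ignoring tags for recognition only, without touching the segment itself, could look roughly like this. It is just a sketch of the idea, not how CT matches terms internally; `plain_text` and `contains_term` are hypothetical helpers, and the tag pattern assumes simple tags like <b>, </b> and <1>.

```python
import re

# Assumption: tags look like <b>, </b> or <1> (no attributes).
TAG = re.compile(r"</?\w+>")

def plain_text(segment: str) -> str:
    """Return the segment with all tags stripped, for matching only."""
    return TAG.sub("", segment)

def contains_term(segment: str, term: str) -> bool:
    """Case-insensitive term lookup on the tag-free text."""
    return term.lower() in plain_text(segment).lower()

print(contains_term("This is a <b>test c<1>ase</b>.", "test case"))
# True
```

The original segment, tags included, stays as it is, so no tag can end up in the wrong place.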
Torsten: For any XML file I do not think this will work.
Basically, a *.docx file is a (zipped) *.xml file.
Rogue codes like the one in your "This is a <b>test c<1>ase</b>." can be removed relatively easily because there's no closing tag (no </1> in your case). I was rather pissed off that there's no equivalent of CZ or TT for the Mac, until I found out that the only thing really missing was exactly the rogue code feature. I looked into the subject, managed to get rid of most (but not all) rogue codes in some typically nasty *.docx files, and then decided it wasn't worth the trouble because CT's filter is good enough.
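The "no closing tag" rule can be sketched in a few lines of Python. This is only an illustration on simplified segments like the ones quoted above, not what CZ, TT or CT actually do, and real *.docx markup is a good deal messier. The function name is made up.

```python
import re

# Assumption: opening tags look like <b> or <1> (no attributes).
OPEN_TAG = re.compile(r"<(\w+)>")

def remove_unpaired_tags(segment: str) -> str:
    """Drop every opening tag that has no matching closing tag in the segment."""
    def keep_or_drop(match: re.Match) -> str:
        closing = f"</{match.group(1)}>"
        # Keep the tag only if its closing partner exists somewhere in the segment.
        return match.group(0) if closing in segment else ""
    return OPEN_TAG.sub(keep_or_drop, segment)

print(remove_unpaired_tags("This is a <b>test c<1>ase</b>."))
# This is a <b>test case</b>.
```

Standalone tags that carry real meaning (a line break or a footnote marker, for example) would be deleted as well, which is probably where the "most, but not all" comes from.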
>If you are only speaking of term recognition, indeed, I agree, it would be nice to have these tags ignored.
I just had this nice example of a tag preventing term recognition that I'd like to share:
an OCR'd project, using the MS OCR filter (which actually works very well on this job).