
Feature to remove/shift in-word tags

This would be a very dangerous feature, but a useful one nevertheless: a way to move tags that sit inside words to outside those words, so that term recognition works.



(Especially in sloppy Studio projects)


Perhaps it's already possible via regex?


So far I'd recommend CodeZapper (or maybe TransTools, which I have not tested yet) for any Word file, not to forget CT's special filter for OCR'ed documents.


For any XML file I do not think this will work. Just imagine the following cases:


This is a <b>test c<1>ase</b>.

This is a <b>test case</b> n<1>o. <b>4</b>.


Here <b> makes the text bold and <1> is a meaningless tag (a very simple case, indeed). Where should CT put the <1> tag? In these two cases it would be possible (in the sense of "put the tag between the other tags, as in the source text"), but in many other cases it would produce a tag error, I assume.


If you are only speaking of term recognition, indeed, I agree, it would be nice to have these tags ignored.
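
To illustrate what "ignoring the tags" for term recognition could look like, here is a minimal Python sketch. It is not how CT works internally; the tag pattern (simple tags such as <b>, </b> and <1>, as in the two examples above) and the function names are assumptions made for this illustration only. The segment itself is left untouched, the tags are dropped only in the copy used for the lookup.

```python
import re

# Assumption: inline tags look like <b>, </b> or <1>, as in the examples above.
TAG = re.compile(r"</?\w+>")

def strip_tags_for_matching(segment: str) -> str:
    """Return a tag-free copy of the segment, used for term lookup only."""
    return TAG.sub("", segment)

def contains_term(segment: str, term: str) -> bool:
    """Check whether a glossary term occurs in the segment once tags are ignored."""
    plain = strip_tags_for_matching(segment)
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, plain, re.IGNORECASE) is not None

# The in-word <1> tag no longer hides the term "test case".
found = contains_term("This is a <b>test c<1>ase</b>.", "test case")
```

Since nothing is written back to the segment, this approach cannot cause a tag error; it only changes what the term lookup sees.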



Torsten: For any XML file I do not think this will work.


Basically, a *.docx file is an *.xml file.


Rogue codes like the one in your "This is a <b>test c<1>ase</b>." can be removed relatively easily because there's no closing tag (no </1> in your case). I was rather pissed off that there's no equivalent of CZ or TT for the Mac, until I found out that the only thing really missing was exactly the rogue code feature. I studied the subject, managed to get rid of most (but not all) rogue codes in some typical nasty *.docx files, and then decided it wasn't worth the trouble because CT's filter is good enough.
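
As a rough idea of the "no closing tag" rule, here is a small Python sketch that shifts an unpaired in-word tag to the end of its word instead of deleting it. It is only an illustration under assumptions: the tag syntax is the simplified one from Torsten's examples, the function name is made up, and real *.docx markup is far messier than this.

```python
import re

# Assumption: an opening tag such as <1> sitting directly between word characters.
IN_WORD_TAG = re.compile(r"(?<=\w)<(\w+)>(\w+)")

def shift_rogue_tags(segment: str) -> str:
    """Move unpaired in-word tags to the end of the word they interrupt."""
    def move(match: re.Match) -> str:
        name, rest_of_word = match.group(1), match.group(2)
        if f"</{name}>" in segment:
            # The tag has a closing partner, so it is not a rogue code: keep it.
            return match.group(0)
        return f"{rest_of_word}<{name}>"
    return IN_WORD_TAG.sub(move, segment)

cleaned = shift_rogue_tags("This is a <b>test c<1>ase</b>.")
# cleaned == "This is a <b>test case<1></b>."
```

Paired tags inside a word (for example a bold syllable) are deliberately left alone, because moving them is exactly where tag errors would start.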


H.



>If you are only speaking of term recognition, indeed, I agree, it would be nice to have these tags ignored.


Indeed

I just had this nice example of a tag preventing term recognition that I'd like to share:



OCR'd project, using the MS OCR filter (which actually works very well on this job)
