
Add filter for OpenOffice/LibreOffice OCR

The situation: most ODT files I work with contain many unnecessary tags marking each edit, even with spellchecking turned off. This makes such files very impractical to work on. Also, some versions of ABBYY FineReader can save OCRed documents directly to ODF formats.


Usually, I have to save the file as DOCX and either use CT's DOCX OCR filter or CodeZapper. However, there is no 100% compatibility between DOCX and ODT, and some documents get messed up if they are converted to DOCX.


Since CodeZapper and TransTools only work on DOCX documents, I can't find a convenient workaround for working on OpenDocument files.


Would it be feasible to add an ODF filter for OCR in CafeTran?


Would it make sense for other users?


1 person likes this idea

I can see that this is important for you and for that reason I clicked 'Do you like this idea!'.


Since I've never received any ODT files, I'd be interested to learn which kinds of clients send you these. I could imagine German local governments, since some of them migrated to Linux and OO. Perhaps research institutes too? I'd be amazed if companies used OO (instead of MS Office).


Just out of interest ...

Thank you for voting.


OpenDocument is an accepted, recommended, adopted or even required file format (sometimes along with Linux as you said, see https://en.wikipedia.org/wiki/List_of_Linux_adopters) by several governmental and other organizations across the world.


https://en.wikipedia.org/wiki/OpenDocument_adoption


For example, according to the French government's RGI (general interoperability framework), ODF is the "recommended format for office documents within French administrations".


This is to be expected because ODF is an Open Standard and not a closed format.


And being an Open Standard, I guess it may be easier to implement such a filter than it might have been for Word DOCX.


In my line of work, I have received ODT files from some French clients in the past, although I admit DOCX is certainly the majority in business environments.


However, I'm also using LibreOffice extensively for other translation related tasks: saving OCRed documents as ODT, working on literary translations, Free Software localization, CV and other business documents.


I like the fact that CafeTran accepts ODF files and supports the use of OO/LO in various ways. Unless I am mistaken, export as ODT for bilingual review is lacking, but that is hardly an issue because the resulting DOCX can be opened in LibreOffice. The issue is with documents that you can't save as DOCX without losing some formatting or layout, and they are more common than you might think.


At least my own use certainly defies the preponderance of MS Office, although I keep a copy handy for when it's absolutely necessary...



 

 some versions of Abbyy Finereader offer support for directly saving OCRed documents to ODF formats.


I was not aware of any OCR program that saves OCRed documents with the .odt extension. Please submit a support ticket and attach a sample OCRed LibreOffice file.


Igor

Hi Igor,

The current Pro and Corporate editions of Abbyy FineReader offer saving as ODT and sending to OpenOffice/LibreOffice:

http://www.abbyy.com/finereader/corporate/editions-comparison/

 

Attaching a related screenshot.


---


I've done some more tests with OCRed documents saved from FineReader to ODT, and it seems that the main culprit is the spellchecker: if I don't spellcheck the text, the document shows few tags in CT. As soon as there are some corrections or other edits (and an OCRed text is bound to have some), it becomes a tag soup. Running the document through Antidote (a paid correction tool that integrates with LibreOffice) does not produce tags when applying the corrections.

Also, when copying and pasting text from PDFs into LibreOffice (even without formatting), tags are inserted in paragraphs that have no apparent formatting.
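
For anyone who wants to check how badly a given file is affected before importing it, here is a minimal Python sketch. It is not part of CafeTran, and the function name `spans_per_paragraph` is made up for illustration. It relies on the fact that an ODT file is a ZIP archive whose text is stored in content.xml, and it counts the text:span elements in each paragraph. I am assuming that these spans are what end up as tags in CT, which I have not verified.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the OpenDocument "text" elements (text:p, text:span, ...).
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def spans_per_paragraph(odt_file):
    """Return (text preview, span count) for each paragraph of an ODT file.

    odt_file may be a file path or a binary file-like object.
    The function name is hypothetical, chosen for this example only.
    """
    with zipfile.ZipFile(odt_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    results = []
    for para in root.iter(f"{{{TEXT_NS}}}p"):
        # Count every text:span inside the paragraph, nested ones included.
        span_count = sum(1 for _ in para.iter(f"{{{TEXT_NS}}}span"))
        # First 40 characters of the paragraph text, for identification.
        preview = "".join(para.itertext())[:40]
        results.append((preview, span_count))
    return results
```

Calling `spans_per_paragraph("sample.odt")` (with any ODT file of your own) and sorting by the count should show which paragraphs are likely to turn into tag soup. A paragraph with no visible formatting but a high span count is a good candidate to attach to a support ticket.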

I'll go on and open a ticket so that I can send you some files for further review. Thanks for your interest.