
OCR for Linux

Hi guys, do you know any decent OCR for linux?


Hello Victor,


OCR engines available for Linux include Tesseract (originally developed by Hewlett-Packard, now open source and developed by Google), CuneiForm and GOCR.


Tesseract is quite accurate.


Tesseract itself is CLI-based (command line), but there are many programs that you can use as a graphical interface.
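
For reference, a command-line run can also be scripted. Here is a minimal Python sketch, assuming the `tesseract` binary is installed and on your PATH. The helper names `build_tesseract_command` and `run_ocr` are made up for illustration, not part of Tesseract.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run Tesseract on one image and return the recognized text.

    Tesseract appends ".txt" to the output base on its own.
    Assumes the language data for `lang` is installed.
    """
    command = build_tesseract_command(image_path, output_base, lang)
    subprocess.run(command, check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Calling `run_ocr("page.png", "page")` (with a placeholder file name) would write `page.txt` and return its content. For another language, pass its code, e.g. `lang="deu"` for German, provided the matching language data is installed.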


My favorite is gImageReader (although I've had some issues with the latest builds); it has nice features. Others include gscan2pdf, OCRFeeder, OCRopy, YAGF and VietOCR.


If you use XSane for scanning, there is also a program called xsane2tess that can be used to run OCR directly on scanned pages.


If you're looking for accurate text OCR, give it a try. If you need to work on documents with complex layouts, you will probably be disappointed.


BTW, there is a nice website that helps you find alternative programs depending on your platform:


http://alternativeto.net/



I'm sure you are aware of this one?

https://www.abbyy.com/en-eu/ocr-sdk-linux/technical-specifications/

Actually, I just need to extract the text and work on plain text, so it could do the job. However, I was looking more for apps that can tell what should be read (like text, images...) from what should not (like page numbers, headers...).


So I'll give it a try, thanks a lot!

I see! Well, you can check Tesseract's Wiki: https://github.com/tesseract-ocr/tesseract/wiki

especially https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
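
For the page numbers specifically, one possible workaround is to clean up the plain text after OCR. Below is a rough Python sketch, assuming each page number sits alone on its own line. The pattern and the function name `drop_page_numbers` are only an illustration, not a Tesseract feature, and a line holding nothing but a year such as "1999" would be dropped too.

```python
import re

# Matches lines holding only a page number, e.g. "12", "- 12 -" or "Page 12".
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?-?\s*\d+\s*-?\s*$", re.IGNORECASE)


def drop_page_numbers(text):
    """Return the text without lines that look like bare page numbers."""
    kept = [line for line in text.splitlines() if not PAGE_NUMBER.match(line)]
    return "\n".join(kept)
```

Running headers are harder to catch this way, since they vary from one document to the next.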


What I can recommend right away is to try ScanTailor, an interactive post-processing tool for scanned pages. It performs operations such as page splitting, deskewing, adding/removing borders, and others. You give it raw scans, and you get pages ready to be printed or assembled into a PDF or DjVu file. Scanning, optical character recognition, and assembling multi-page documents are outside the scope of this project.


http://scantailor.org/


It lets you prepare your files for OCR and improve the quality of the results.


https://github.com/scantailor/scantailor/wiki


If you want to OCR a PDF, you'll first need to extract its pages as images (TIFF, PNG...).


You can try PDF Redact Tools for that: https://firstlook.org/code/project/pdf-redact-tools/
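
Another option, if you prefer something scriptable, is `pdftoppm` from poppler-utils. That is my own assumption about what is installed on your system, and the helper names below are hypothetical. A minimal Python sketch:

```python
import subprocess


def build_pdftoppm_command(pdf_path, output_prefix, dpi=300):
    """Build the argument list for: pdftoppm -r DPI -png PDF PREFIX."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(output_prefix)]


def pdf_to_images(pdf_path, output_prefix, dpi=300):
    """Render every page of the PDF to a PNG file.

    pdftoppm names each file from the prefix plus the page number.
    Assumes the pdftoppm binary is on the PATH.
    """
    subprocess.run(build_pdftoppm_command(pdf_path, output_prefix, dpi), check=True)
```

300 dpi is a common starting point for OCR, but the best value depends on the document.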


Not automatic, but certainly extremely enjoyable pre-OCR work! And you can use the result with any OCR software you want (Tesseract, ABBYY FineReader and whatnot).




Cheers!

Hey idim,
many thanks for the tip on gImageReader. I'd given up on trying to find a decent OCR program for Linux, but gImageReader is excellent. It took a bit of googling to get the resources required for OCRing German docs, but now that I've got there, this is going to save me heaps of trouble.
Thanks!!
Jeremy

 

Hello Jeremy,


Happy you like gImageReader, I hope it will serve you well :-)
