Hello Victor,
OCR engines available for Linux include Tesseract (originally developed by Hewlett-Packard, now open source and developed by Google), CuneiForm and Gocr.
Tesseract is quite accurate.
Tesseract itself is a command-line (CLI) program, but there are many programs that you can use as a graphical interface to it.
My favorite is gImageReader, which has some nice features (although I've had some issues with the latest builds). Others include gscan2pdf, OCRFeeder, OCRopy, YAGF and VietOCR.
If you use XSane for scanning, there is also a program called xsane2tess that can be used to run OCR directly on scanned pages.
If you're looking for accurate OCR of plain text, give it a try. If you need to work on documents with complex layouts, you will probably be disappointed.
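Since Tesseract is driven from the command line, it is easy to script. Here is a minimal sketch in Python, assuming the `tesseract` binary is installed and on your PATH. The function names and the file name `page.png` are just placeholders for illustration; passing `stdout` as the output base makes Tesseract print the recognized text instead of writing a file.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build the argument list for a Tesseract call that prints text to stdout."""
    # Usage pattern: tesseract <image> <outputbase> -l <language>
    # "stdout" as the output base sends the text to standard output.
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise if Tesseract exits with an error
    )
    return result.stdout


if __name__ == "__main__":
    # "page.png" is a hypothetical input file.
    print(ocr_image("page.png"))
```

The language code (`eng`, `fra`, ...) has to match a language data file you have installed for Tesseract.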
BTW, there is a nice website that helps you find alternative programs depending on your platform:
I'm sure that you are aware of this one:
https://www.abbyy.com/en-eu/ocr-sdk-linux/technical-specifications/ ?
BTW, Victor, did you see this:
https://cafetran.freshdesk.com/support/discussions/topics/6000046862
?
Actually, I just need to extract the text and work on plain text, so it can do the job. However, I was looking more for apps that can tell what should be read (text, images...) from what should not (page numbers, headers...).
So I'll give it a try, thanks a lot!
I see! Well, you can check Tesseract's Wiki: https://github.com/tesseract-ocr/tesseract/wiki
especially https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
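If you end up with plain text anyway, page numbers and running headers can also be filtered out after OCR. This is only a rough sketch, not something Tesseract does for you: it assumes you have one string per OCR'd page, treats a line holding only a number (optionally preceded by "Page") as a page number, and treats any line that shows up on several pages as a header or footer. The function name and the `min_repeats` threshold are made up for illustration.

```python
import re
from collections import Counter

# A line consisting only of a number, optionally preceded by "Page".
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+\s*$", re.IGNORECASE)


def strip_page_furniture(pages, min_repeats=3):
    """Drop page numbers and lines repeated on at least min_repeats pages.

    pages: list of strings, one per OCR'd page.
    """
    # Count on how many pages each distinct non-empty line appears.
    page_counts = Counter()
    for page in pages:
        page_counts.update(
            {line.strip() for line in page.splitlines() if line.strip()}
        )

    cleaned = []
    for page in pages:
        kept = [
            line
            for line in page.splitlines()
            if not PAGE_NUMBER.match(line)
            and page_counts[line.strip()] < min_repeats
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```

On real scans the headers rarely come out identical on every page, so expect to loosen the matching (for example by ignoring digits or case) before this is useful.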
What I can recommend right away is to try ScanTailor, an interactive post-processing tool for scanned pages. It performs operations such as page splitting, deskewing, adding/removing borders, and others. You give it raw scans, and you get pages ready to be printed or assembled into a PDF or DJVU file. Scanning, optical character recognition, and assembling multi-page documents are out of scope of this project.
It lets you prepare your files for OCR and so improve the quality of the results.
https://github.com/scantailor/scantailor/wiki
If you want to run OCR on a PDF, you'll first need to get its pages out as images: TIFF, PNG...
You can try PDF Redact Tools for that: https://firstlook.org/code/project/pdf-redact-tools/
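As an alternative to the tool above (so this is my own suggestion, not how PDF Redact Tools works), the `pdftoppm` utility from poppler-utils can also turn PDF pages into images. A minimal sketch in Python, assuming `pdftoppm` is installed; the function names, `input.pdf` and the `page` prefix are placeholders.

```python
import shutil
import subprocess


def build_pdftoppm_command(pdf_path, output_prefix, dpi=300):
    """Build a pdftoppm call that writes one PNG per page of the PDF."""
    # -png selects PNG output, -r sets the resolution in DPI.
    # Output files are named after output_prefix plus a page number.
    return ["pdftoppm", "-png", "-r", str(dpi), pdf_path, output_prefix]


def pdf_to_images(pdf_path, output_prefix, dpi=300):
    """Render every page of pdf_path to a PNG image."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm was not found on PATH (poppler-utils)")
    subprocess.run(build_pdftoppm_command(pdf_path, output_prefix, dpi), check=True)


if __name__ == "__main__":
    # "input.pdf" and "page" are hypothetical names.
    pdf_to_images("input.pdf", "page")
```

300 DPI is a common starting point for OCR; the resulting images can then go through ScanTailor or straight into the OCR engine.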
Not automatic, but certainly extremely enjoyable pre-OCR work! And you can use the result with any OCR software you want (Tesseract, Abbyy FineReader and whatnot).
Here's the Wiki: https://github.com/scantailor/scantailor/wiki
Cheers!
Hello Jeremy,
Happy you like gImageReader, I hope it will serve you well :-)
victorparragarcia
Hi guys, do you know of any decent OCR software for Linux?