Originally Published: 2014-11-13
This is a short writeup of the working process I came up with for command-line OCR of a non-OCR’d PDF with searchable PDF output on OS X, after running into a thousand little gotchas. [1]
Software Installation
- Install Homebrew (if you haven’t already).
- Install ImageMagick (needs TIFF and Ghostscript support).
- Install Tesseract with all languages.
- Install pdftk server from the package installer.
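
Once everything is installed, it can help to confirm that the tools are actually on your PATH. The sketch below is an illustration only: missing_tools is a hypothetical helper name, and the brew command in the comment assumes current Homebrew formula names (imagemagick, ghostscript, tesseract, tesseract-lang), which may not match what your Homebrew version offers.

```shell
#!/bin/sh
# Print the name of every given command that is NOT found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# The workflow needs these four commands; gs is Ghostscript, which
# ImageMagick relies on to read PDF input.
# Assumption: on a recent Homebrew the first three typically come from
#   brew install imagemagick ghostscript tesseract tesseract-lang
# while pdftk comes from its own package installer.
missing_tools convert gs tesseract pdftk
```

No output means all four commands were found; any name printed is a tool that still needs installing.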
Processing Workflow
I’m going to assume you have a non-OCR’d PDF you want to convert into a searchable PDF.
- Split and convert the PDF with ImageMagick convert.
- OCR the pages with Tesseract. [2][3]
- Join your individual PDF files into a single, searchable PDF with pdftk. [4]
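
The three steps above can be sketched as shell functions. Treat this as one plausible set of invocations, not a record of the exact commands: the function names, the page_ filename prefix, the 300 dpi density and the other convert options are assumptions chosen to hand Tesseract a plain grayscale TIFF. The optional argument to ocr_pages is a hypothetical convenience that maps onto Tesseract's -l flag.

```shell
#!/bin/sh
# Step 1: render each PDF page to page_00000.tif, page_00001.tif, ...
# Density, grayscale, LZW and alpha removal are assumed settings.
pdf_to_tiffs() {
    convert -density 300 "$1" -type Grayscale -compress lzw \
        -background white -alpha remove page_%05d.tif
}

# Step 2: OCR every page TIFF. The trailing "pdf" config tells Tesseract
# to write page_NNNNN.pdf with a text layer. $1 is an optional language
# code and defaults to eng.
ocr_pages() {
    lang=${1:-eng}
    for tif in page_*.tif; do
        [ -e "$tif" ] || continue   # skip the literal pattern if nothing matched
        tesseract -l "$lang" "$tif" "${tif%.tif}" pdf
    done
}

# Step 3: concatenate the per-page PDFs into the file named by $1.
# Zero-padded names keep the glob in page order.
merge_pages() {
    pdftk page_*.pdf cat output "$1"
}
```

Run from an empty working directory, the whole pipeline is then pdf_to_tiffs input.pdf && ocr_pages && merge_pages merged.pdf, where input.pdf stands in for your own file.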
Now merged.pdf should contain your searchable, OCR’d PDF. I’ve wrapped this workflow up into a script, or alternatively you may want to see if the robust OCRmyPDF script works for your needs.
Footnotes
[1] A sampling of the various ways in which Tesseract/Leptonica is picky in its TIFF handling:
- Error in pixConvertRGBToGray: pixs not 32 bpp
- Error in pixReadFromTiffStream: spp not in set
- Error in pixReadStreamTiff: pix not read
- Error in pixReadTiff: pix not read
- Error in pixRead: pix not read
- Error in findTiffCompression: function not present
- Error in pixReadStream: Unknown format: no pix returned
- Error in pixReadStream: tiff: no pix returned
- Unsupported image type.

[2] If your document isn’t in English, pass the -l tla flag as the first argument to tesseract. See the LANGUAGES section of man tesseract. You can also install and use your own training data, for example, for Ancient Greek or Latin. On OS X, you’ll want to copy the lang.traineddata file to /usr/local/share/tessdata.

[3] If you have GNU Parallel installed (brew install parallel), you can parallelize this process.

[4] I initially tried to use the join.py Preview Automator script that comes bundled with OS X (at /System/Library/Automator/Combine PDF Pages.action/Contents/Resources/join.py), but this seems to mangle the actual OCR text into unsearchable whitespace for me (confusingly, this preserves selectable line/character bounding boxes, so it looks like there’s OCR’d text there but there’s not). I originally suggested using Ghostscript to combine the PDF files, but this mangles non-Latin scripts. If you would still like to use Ghostscript instead of pdftk, explicitly setting a more modern PDF compatibility level may give you good, relatively compressed results while preserving non-Latin scripts.
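
For reference, a Ghostscript merge with an explicit compatibility level might look like the following sketch. gs_merge is a hypothetical wrapper name, and the level 1.5 is an assumption, since the text above only says "more modern"; it is worth checking the text layer of the result on your own documents before relying on it.

```shell
#!/bin/sh
# Merge PDFs with Ghostscript's pdfwrite device.
# Usage: gs_merge OUTPUT.pdf INPUT.pdf...
# Assumption: 1.5 as the compatibility level; adjust to taste.
gs_merge() {
    out=$1
    shift   # the remaining arguments are the input PDFs, in order
    gs -q -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
        -dCompatibilityLevel=1.5 -sOutputFile="$out" "$@"
}
```

With per-page files named page_00000.pdf and so on, this would be called as gs_merge merged.pdf page_*.pdf.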

I realized at the end of writing this guide that you can also use convert to create a multipage TIFF (omit the _%05d format specifier in your output filename) and process/output that directly with Tesseract, but I like being able to parallelize the OCR, [3] and recombining with pdftk gives me better compression in my testing.
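
The parallel variant mentioned in the footnotes can be sketched with GNU Parallel's replacement strings, with the multipage TIFF route alongside it for comparison. Both function names are hypothetical, and the first assumes page TIFFs named page_00000.tif, page_00001.tif and so on in the current directory.

```shell
#!/bin/sh
# OCR all page TIFFs concurrently with GNU Parallel.
# {} is the input file and {.} is the same name without its extension,
# so page_00000.tif produces page_00000.pdf.
ocr_pages_parallel() {
    parallel tesseract {} {.} pdf ::: page_*.tif
}

# Single-file alternative: OCR one multipage TIFF straight to OUTBASE.pdf.
# Usage: ocr_multipage pages.tif merged   (writes merged.pdf)
ocr_multipage() {
    tesseract "$1" "$2" pdf
}
```

The parallel version still leaves one PDF per page, so the pdftk join step applies unchanged; the multipage version skips the join at the cost of running as a single Tesseract job.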