Free OCR on Mac OS X

tags: ocr, mac
Originally Published: 2014-11-13

This is a short writeup of the working process I came up with for command-line OCR of a non-OCR’d PDF with searchable PDF output on OS X, after running into a thousand little gotchas. ¹

Software Installation

Install Homebrew (if you haven’t already).
Install ImageMagick with TIFF and Ghostscript support.
Install Tesseract with all languages.
Install pdftk server from the package installer.
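
As a rough sketch, a check like the following can confirm that the tools ended up on your PATH after installing. The helper name missing_tools is hypothetical, and the Homebrew formula names in the comment are assumptions about current Homebrew, so verify them before relying on them.

```shell
#!/bin/sh
# Print the name of every listed command that is not on PATH.
# (missing_tools is a hypothetical helper, not part of any tool.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Commands the workflow relies on. "gs" (Ghostscript) is listed because
# ImageMagick delegates PDF reading to it.
missing=$(missing_tools convert gs tesseract pdftk)
if [ -n "$missing" ]; then
  echo "Not found on PATH:" $missing
  # Assumed Homebrew formula names; check each with `brew info <name>`:
  #   brew install imagemagick ghostscript tesseract tesseract-lang
else
  echo "All OCR tools found."
fi
```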

Processing Workflow

I’m going to assume you have a non-OCR’d PDF you want to convert into a searchable PDF.

Split the PDF into one TIFF per page with ImageMagick convert.
OCR the pages with Tesseract, writing one PDF per page. ²³
Join your individual PDF files into a single, searchable PDF with pdftk. ⁴
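
The three steps above can be sketched as a small shell function. This is an illustration under assumptions, not necessarily the exact commands I ran: the function names run and ocr_pdf, the page_ filename prefix, and the convert settings (300 dpi, grayscale, LZW compression, alpha channel off) are all my choices for the example, and you may need to tune them for your scans.

```shell
#!/bin/sh
# Echo the command instead of running it when DRY_RUN is set.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# ocr_pdf INPUT.pdf
# Writes page_NNNNN.tif, page_NNNNN.pdf and merged.pdf into the
# current directory. All option values here are assumptions.
ocr_pdf() {
  # 1. Rasterize each PDF page to its own grayscale TIFF.
  run convert -density 300 "$1" -type Grayscale -compress lzw \
    -background white -alpha off page_%05d.tif || return 1

  # 2. OCR each page; the trailing "pdf" makes tesseract write
  #    page_NNNNN.pdf next to page_NNNNN.tif.
  for tif in page_*.tif; do
    [ -e "$tif" ] || continue
    run tesseract "$tif" "${tif%.tif}" pdf || return 1
  done

  # 3. Concatenate the per-page PDFs in filename order.
  run pdftk page_*.pdf cat output merged.pdf
}
```

Calling `DRY_RUN=1 ocr_pdf input.pdf` prints the commands without running them, which is a cheap way to check the arguments before committing to a long OCR run.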

Now merged.pdf should contain your searchable, OCR’d PDF. I’ve wrapped this workflow up into a script. Alternatively, you may want to see if the robust OCRmyPDF script works for your needs.

Footnotes

A sampling of the various ways in which Tesseract/Leptonica is picky in its TIFF handling: Error in pixConvertRGBToGray: pixs not 32 bpp, Error in pixReadFromTiffStream: spp not in set, Error in pixReadStreamTiff: pix not read, Error in pixReadTiff: pix not read, Error in pixRead: pix not read, Error in findTiffCompression: function not present, Error in pixReadStream: Unknown format: no pix returned, Error in pixReadStream: tiff: no pix returned, Unsupported image type.↩
If your document isn’t in English, pass the -l tla flag as the first argument to tesseract, where tla stands for the language’s code (usually three letters, such as grc for Ancient Greek). See the LANGUAGES section of man tesseract for the available codes. You can also install and use your own training data, for example for Ancient Greek or Latin. On OS X, you’ll want to copy the lang.traineddata file to /usr/local/share/tessdata. ↩
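
For instance, a per-page command for an Ancient Greek document could be built like this. The helper name tesseract_cmd and the page filename are hypothetical, and the sketch assumes grc.traineddata is already in your tessdata directory. It only prints the command, so nothing is run until you paste or pipe it into a shell.

```shell
#!/bin/sh
# Print the tesseract command line for one page in a given language.
# Usage: tesseract_cmd LANG_CODE PAGE.tif
# Note: the printed line is only safe to re-run for filenames
# without spaces or shell metacharacters.
tesseract_cmd() {
  lang_code=$1
  tif=$2
  printf 'tesseract -l %s %s %s pdf\n' "$lang_code" "$tif" "${tif%.tif}"
}

tesseract_cmd grc page_00000.tif
# prints: tesseract -l grc page_00000.tif page_00000 pdf
```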
If you have GNU Parallel installed (brew install parallel), you can parallelize the OCR step so that Tesseract works on several pages at once.
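
A minimal sketch of that, assuming per-page TIFFs named page_*.tif (the wrapper name ocr_parallel is hypothetical). GNU Parallel substitutes {} with each input file and {.} with the same name minus its extension, so every page_NNNNN.tif yields a page_NNNNN.pdf. Call it as `ocr_parallel page_*.tif`.

```shell
#!/bin/sh
# OCR the given TIFFs concurrently with GNU Parallel.
# Returns 127 if the parallel command is not on PATH.
ocr_parallel() {
  if ! command -v parallel >/dev/null 2>&1; then
    echo "GNU Parallel not found; try: brew install parallel" >&2
    return 127
  fi
  # {}  = input file, {.} = input file without its extension
  parallel 'tesseract {} {.} pdf' ::: "$@"
}
```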
I initially tried to use the join.py Preview Automator script that comes bundled with OS X (at /System/Library/Automator/Combine PDF Pages.action/Contents/Resources/join.py), but for me it mangles the actual OCR text into unsearchable whitespace. Confusingly, it preserves the selectable line and character bounding boxes, so it looks like there is OCR’d text when there is not. I originally suggested combining the PDF files with Ghostscript, but the command I used mangles non-Latin scripts. If you would still like to use Ghostscript instead of pdftk, explicitly setting a more modern PDF compatibility level may give you good, relatively compressed results while preserving non-Latin scripts.
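
A Ghostscript merge along those lines might look like the sketch below. The function name gs_merge is hypothetical and the compatibility level of 1.5 is an assumed value chosen for illustration, so compare the output against pdftk on your own files. Call it as `gs_merge merged.pdf page_*.pdf`.

```shell
#!/bin/sh
# gs_merge OUTPUT.pdf INPUT.pdf...
# Merge PDFs with Ghostscript's pdfwrite device.
# With DRY_RUN set, print the command instead of running it.
gs_merge() {
  out=$1
  shift
  # -dCompatibilityLevel=1.5 is an assumed value, not a recommendation.
  set -- gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite \
    -dCompatibilityLevel=1.5 -sOutputFile="$out" "$@"
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```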
I realized at the end of writing this guide that you can also use convert to create a multipage TIFF (omit the _%05d format specifier in your output filename) and process/output that directly with Tesseract, but I like being able to parallelize the OCR,³ and recombining with pdftk gives me better compression in my testing. ↩