The landscape of paid OCR solutions is, well, expensive. And the unbiased folks from the tesseract homepage say, "Tesseract is probably the most accurate open source OCR engine available." On its own, Tesseract only OCRs images, but pypdfocr uses tesseract as an engine to convert whole PDFs. With a few crafty command-line utilities, we can create a watched folder that will automatically OCR any PDF copied into it and create a nice OCR'ed PDF which you can cut-and-paste text from or search happily.
We're going to install pypdfocr, which uses the tesseract open source OCR library.
First, install Homebrew. Homebrew is awesome: it's a package manager for OS X.
Next, use Homebrew to install Python. This can take a little while.
brew install python
Finally, download the requirements for pypdfocr.
brew install tesseract
brew install ghostscript
brew install poppler
When I ran it, I got this error:
WARNING: Could not execute identify to calculate DPI (try installing imagemagick?), so defaulting to 300dpi
so I also did:
brew install imagemagick
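Before moving on, it can be handy to confirm that the command-line tools from those packages are actually reachable. Here's a minimal sketch of a helper for that. The function name check_tools is my own invention, and the binary names in the usage comment are assumptions on my part: as far as I know, ghostscript installs gs, poppler installs pdftoppm, and imagemagick installs identify.

```shell
#!/bin/bash
# Report which of the given commands cannot be found on the current PATH.
# Prints one "missing: NAME" line per absent command and returns non-zero
# if anything is missing. (check_tools is a hypothetical helper name.)
check_tools() {
  local missing=0
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (binary names are assumptions, see the note above):
#   check_tools tesseract gs pdftoppm identify
```

If it prints nothing, everything was found. Any "missing:" line tells you which brew install to revisit.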
Once you have these, you can probably just do:
pip install pypdfocr
Try running pypdfocr, and if it doesn't work, install each of the dependencies below with pip. I first tried
pip install pil
but I got
No distributions at all found for pil
so instead we install pilkit:
pip install pilkit
pip install reportlab
pip install watchdog
pip install pypdf2
pip install pypdfocr
I recommend doing the following as well. As the dev notes: "...if you want Tesseract to recognize rotated documents (upside down, or rotated 90 degrees) then you need to find your tessdata directory and do the following:"
cd /usr/local/share/tessdata
cp eng.traineddata osd.traineddata
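If you'd rather not clobber an osd.traineddata that is already there, the same copy can be wrapped in a small guard. This is just a sketch: ensure_osd is a made-up helper name, and you pass it whatever your tessdata directory turns out to be.

```shell
#!/bin/bash
# Copy eng.traineddata to osd.traineddata inside the given tessdata
# directory, unless an osd.traineddata already exists there.
# (ensure_osd is a hypothetical helper name.)
ensure_osd() {
  local tessdata_dir="$1"
  if [ ! -f "$tessdata_dir/eng.traineddata" ]; then
    echo "no eng.traineddata in $tessdata_dir" >&2
    return 1
  fi
  if [ -f "$tessdata_dir/osd.traineddata" ]; then
    echo "osd.traineddata already present, leaving it alone"
    return 0
  fi
  cp "$tessdata_dir/eng.traineddata" "$tessdata_dir/osd.traineddata"
  echo "created $tessdata_dir/osd.traineddata"
}

# Usage:
#   ensure_osd /usr/local/share/tessdata
```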
So, it's all installed. Next, make a directory for pypdfocr to watch:
mkdir ~/ocr
And create a script, put it in your path (probably at ~/bin), and make it executable:
nano ~/bin/pypdfocr-daemon.sh # or use whatever text editor you prefer
and paste this in, replacing YOURUSER with, of course, your very own user name. It seems that scripts loaded through launchctl need full paths and don't pick up your shell's path variables:
#!/bin/bash
ulimit -n 9024
/usr/local/bin/pypdfocr -w /Users/YOURUSER/ocr
Note: we're using absolute paths because launchd doesn't execute in a shell context and doesn't expand ~/ to your home directory. This may not be necessary seeing as I set the $PATH below.
ulimit -n 9024 attempts to raise the number of files the process is allowed to have open at once. You can't set that number too high or it asks for root, which then sets the ulimit for a root shell, which is worthless. It appears launchd's environment is restricted, and thus concatenating the PDF, which opens nearly a thousand files, throws:
IOError: [Errno 24] Too many open files.
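To see what you're working with before picking a number, you can ask the shell for its current limits. The sketch below uses the standard ulimit builtin with -S (soft limit) and -H (hard limit). The two function names are my own, and capping the request at the hard limit is my guess at a sensible way to avoid the root prompt, not something pypdfocr requires.

```shell
#!/bin/bash
# Print the soft and hard limits on open file descriptors for this shell.
# (show_fd_limits and set_fd_limit are hypothetical helper names.)
show_fd_limits() {
  echo "soft: $(ulimit -Sn)"
  echo "hard: $(ulimit -Hn)"
}

# Set the soft open-files limit to WANTED, capped at the hard limit so
# that the request never needs elevated privileges.
set_fd_limit() {
  local wanted="$1"
  local hard
  hard="$(ulimit -Hn)"
  if [ "$hard" != "unlimited" ] && [ "$wanted" -gt "$hard" ]; then
    wanted="$hard"
  fi
  ulimit -Sn "$wanted"
}

# Usage:
#   show_fd_limits
#   set_fd_limit 9024
```

The limit only applies to the shell that sets it and the processes it starts, which is why the ulimit line lives inside the launcher script rather than in your terminal.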
Now make it executable:
chmod +x ~/bin/pypdfocr-daemon.sh
Create a plist file that points to this launcher:
nano ~/Library/LaunchAgents/com.apple.pypdfocr.daemon.plist
And paste this in (replacing YOURUSER with your very own user name):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>EnvironmentVariables</key>
    <dict>
        <key>PATH</key>
        <string>/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin</string>
    </dict>
    <key>Label</key>
    <string>ocr-pypdfocr-daemon</string>
    <key>KeepAlive</key>
    <true/>
    <key>ProgramArguments</key>
    <array>
        <string>/Users/YOURUSER/bin/pypdfocr-daemon.sh</string>
    </array>
</dict>
</plist>
Note: The PATH entry above is there because launchd operates with a system-wide path, and it either prefers /usr/bin, finding the built-in python rather than the homebrewed python, or /usr/local/bin is simply not in the $PATH variable. Thus, /usr/local/bin needs to come first so that it is searched for commands before the other paths. Before I'd set the PATH, I was getting the error:
Exited with code: 255.
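If you want to convince yourself that the order of entries in PATH decides which python gets picked, here's a small sketch. resolve_with_path is a name I made up. It runs command -v in a subshell with a PATH of your choosing, so your real environment is left alone.

```shell
#!/bin/bash
# Print the full path that NAME resolves to when SEARCH_PATH is used as
# the PATH. Runs in a subshell so the caller's PATH is not changed.
# (resolve_with_path is a hypothetical helper name.)
resolve_with_path() {
  local search_path="$1"
  local name="$2"
  (
    PATH="$search_path"
    command -v "$name"
  )
}

# Usage: compare the PATH from the plist against one that puts
# /usr/bin first, and see which python each would run.
#   resolve_with_path /usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin python
#   resolve_with_path /usr/bin:/usr/local/bin:/bin:/usr/sbin:/sbin python
```

If the two calls print different paths, the first directory in the list that contains a python is winning.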
Load it to test whether it works (it will load automatically the next time you log in):
launchctl load ~/Library/LaunchAgents/com.apple.pypdfocr.daemon.plist
See if it loaded:
launchctl list | grep pypdfocr
And you should see something like this:
- 2 com.apple.pypdfocr.daemon
And there you are. It will watch that folder and always OCR PDFs you copy into it. It can take a while to convert large PDFs, but once they're done, they'll be re-named filename_ocr.pdf. You can see that the daemon is working correctly because a minute or two after you copy the file into the folder there will be a profusion of .png and .html files as it processes the pdf.
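Since big PDFs take a while, it helps to be able to ask which files are still waiting. The sketch below relies only on the filename_ocr.pdf naming described above. pending_ocr is a hypothetical helper of mine, not part of pypdfocr.

```shell
#!/bin/bash
# List PDFs in DIR that do not yet have a matching NAME_ocr.pdf beside them.
# (pending_ocr is a hypothetical helper name.)
pending_ocr() {
  local dir="$1"
  local pdf base
  for pdf in "$dir"/*.pdf; do
    [ -e "$pdf" ] || continue        # the glob matched nothing
    base="${pdf%.pdf}"
    case "$base" in
      *_ocr) continue ;;             # this file is itself an OCR output
    esac
    [ -e "${base}_ocr.pdf" ] || echo "$pdf"
  done
  return 0
}

# Usage:
#   pending_ocr ~/ocr
```

An empty result means every PDF in the folder already has its OCR'ed twin.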
If it's not working, try
Because it's a daemon, it will attempt to reload every ten seconds.
If you don't want that, replace
Finally, if things aren't working properly, set StandardErrorPath so that you can see more detailed errors than just a numeric code.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>EnvironmentVariables</key>
    <dict>
        <key>PATH</key>
        <string>/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin</string>
    </dict>
    <key>Label</key>
    <string>ocr-pypdfocr-daemon</string>
    <key>KeepAlive</key>
    <true/>
    <key>StandardErrorPath</key>
    <string>/Users/YOURUSER/ocr-bak/log.txt</string>
    <key>StandardOutPath</key>
    <string>/Users/YOURUSER/ocr-bak/log-out.txt</string>
    <key>ProgramArguments</key>
    <array>
        <string>/Users/YOURUSER/bin/pypdfocr-daemon.sh</string>
    </array>
</dict>
</plist>
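Once the log paths are set, a quick way to peek at the most recent errors is to tail the file. A minimal sketch, with show_recent_errors being my own helper name and the log path being whatever you put in StandardErrorPath:

```shell
#!/bin/bash
# Print the last COUNT lines (default 20) of a log file, or a short note
# if the file is missing or empty.
# (show_recent_errors is a hypothetical helper name.)
show_recent_errors() {
  local log_file="$1"
  local count="${2:-20}"
  if [ ! -s "$log_file" ]; then
    echo "no errors logged in $log_file"
    return 0
  fi
  tail -n "$count" "$log_file"
}

# Usage:
#   show_recent_errors /Users/YOURUSER/ocr-bak/log.txt
```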