PDF to PDF OCR launchd Daemon: Setting a Watched Folder to Create OCR'ed, Searchable PDFs on Mac OSX

 

The landscape of paid OCR solutions is, well, expensive. And the unbiased folks from the tesseract homepage say, "Tesseract is probably the most accurate open source OCR engine available." Tesseract itself only OCRs images, but pypdfocr uses tesseract as its engine to convert whole PDFs. With a few crafty command-line utilities, we can create a watched folder that automatically OCRs any PDF copied into it and produces a nice OCR'ed PDF whose text you can search or cut and paste happily.

This tutorial is partially based on the one at the pypdfocr page, and I owe much to this launchd tutorial.

The Plan

We're going to install pypdfocr which uses the tesseract open source OCR library.

First, install Homebrew. Homebrew is awesome: it's a package manager for OS X.

Next, use Homebrew to install Python. This can take a little while.

brew install python 

Finally, download the requirements for pypdfocr.

brew install tesseract
brew install ghostscript
brew install poppler 

When I ran pypdfocr, I got this warning: "WARNING: Could not execute identify to calculate DPI (try installing imagemagick?), so defaulting to 300dpi". So I also did:

brew install imagemagick
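
Before moving on, it can help to confirm that the command-line tools are actually reachable. The following is only a sketch and not part of pypdfocr: check_deps is a hypothetical helper name, and the binary names passed to it are assumptions about what each Homebrew package installs (gs for ghostscript, pdftoppm for poppler, identify for imagemagick).

```shell
#!/bin/bash
# check_deps: hypothetical helper that reports which commands are missing
# from PATH. Prints one line per command and returns non-zero if any are missing.
check_deps() {
  local missing=0 cmd
  for cmd in "$@"; do
    if command -v "$cmd" >/dev/null 2>&1; then
      echo "ok:      $cmd"
    else
      echo "missing: $cmd"
      missing=1
    fi
  done
  return "$missing"
}

# The binary names below are assumptions about what each brew package provides.
check_deps tesseract gs pdftoppm identify || echo "Install the missing tools with brew before continuing."
```

Anything reported as missing probably needs another brew install, or /usr/local/bin is not on your PATH.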

Once you have these, you can probably just do:

pip install pypdfocr

Try running pypdfocr and, if it doesn't work, use pip to install each of the dependencies below.

The developer suggests brew install pil, but I got "No distributions at all found for pil", so instead we install pilkit:

pip install pilkit
pip install reportlab
pip install watchdog
pip install pypdf2
pip install pypdfocr

I recommend doing the following as well. As the dev notes: "...if you want Tesseract to recognize rotated documents (upside down, or rotated 90 degrees) then you need to find your tessdata directory and do the following:"

cd /usr/local/share/tessdata
cp eng.traineddata osd.traineddata

After Installation

So, it's all installed. Next, make a directory for pypdfocr to watch:

mkdir ~/ocr

And create a script and put it in your path (probably at ~/bin; run mkdir -p ~/bin first if that directory doesn't exist yet). We'll make it executable in a moment:

nano ~/bin/pypdfocr-daemon.sh # or use whatever text editor you prefer

and paste this in, replacing YOURUSER with, of course, your very own user name. It seems that scripts loaded through launchctl need full paths and don't inherit your shell's PATH:

#!/bin/bash
ulimit -n 9024
/usr/local/bin/pypdfocr -w /Users/YOURUSER/ocr

Note: we're using absolute paths because launchd doesn't run the script in your login shell and doesn't expand ~ to your home directory. This may not be strictly necessary, seeing as I set the $PATH in the plist below. ulimit -n 9024 raises the limit on the number of open files (file descriptors), not processes. You can't set that number too high or it asks for root, which then sets the ulimit for a root shell, which is worthless. It appears launchd's default environment is restrictive, so without this line, concatenating the PDF (which opens nearly a thousand files) throws: IOError: [Errno 24] Too many open files.

Now make it executable:
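
To see what ulimit -n is doing, you can inspect the limits from a terminal. This is a minimal sketch: set_nofile is a hypothetical helper name, and the numbers you see will depend on your system. Note that the limits inside launchd's environment may differ from the ones in your interactive shell.

```shell
#!/bin/bash
# Show the soft and hard limits on open file descriptors for this shell.
echo "soft limit: $(ulimit -Sn)"
echo "hard limit: $(ulimit -Hn)"

# set_nofile: hypothetical helper that tries to set the soft limit and
# keeps the current one if the request is refused (for example, because
# it exceeds the hard limit).
set_nofile() {
  local want=$1
  if ulimit -Sn "$want" 2>/dev/null; then
    echo "open-file limit is now $(ulimit -Sn)"
  else
    echo "could not set limit to $want; still $(ulimit -Sn)"
  fi
}

set_nofile 9024
```

The soft limit can be moved freely up to the hard limit without root, which is presumably why a moderate value like 9024 works while a very large one does not.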

chmod +x ~/bin/pypdfocr-daemon.sh
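
If you want to double-check the launcher before handing it to launchd, something like the following sketch can help. check_launcher is a hypothetical helper, not part of launchd or pypdfocr; it only tests that the file exists, is executable, and starts with a #! line.

```shell
#!/bin/bash
# check_launcher: hypothetical helper that verifies a script exists, is
# executable, and begins with a #! line, so it can be run directly.
check_launcher() {
  local script=$1 first_line=""
  [ -f "$script" ] || { echo "not found: $script"; return 1; }
  [ -x "$script" ] || { echo "not executable: $script"; return 1; }
  # Read only the first line; tolerate a file without a trailing newline.
  IFS= read -r first_line < "$script" || true
  case $first_line in
    '#!'*) echo "looks ok: $script" ;;
    *)     echo "no #! line: $script"; return 1 ;;
  esac
}

check_launcher "$HOME/bin/pypdfocr-daemon.sh" || echo "Fix the problem above before loading the agent."
```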

Create a plist file that points to this launcher:

nano ~/Library/LaunchAgents/com.apple.pypdfocr.daemon.plist

And paste this in (replacing YOURUSER with your very own user name):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
  <dict>
    <key>EnvironmentVariables</key>
      <dict>
        <key>PATH</key>
          <string>/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin</string>
      </dict>
    <key>Label</key>
      <string>ocr-pypdfocr-daemon</string>
    <key>KeepAlive</key>
      <true/>
    <key>ProgramArguments</key>
      <array>
        <string>/Users/YOURUSER/bin/pypdfocr-daemon.sh</string>
      </array>
  </dict>
</plist>

Note: The PATH entry above is there because launchd operates with a system-wide path. Either that path puts /usr/bin ahead of /usr/local/bin, so launchd finds the built-in python rather than the homebrewed python, or /usr/local/bin is simply not in it at all. Thus, /usr/local/bin needs to come first so that it is searched for commands before the other directories. Before I'd set the PATH, I was getting the error: Exited with code: 255.
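
To see how the order of directories in PATH changes which executable gets picked, you can resolve a command name under a specific PATH. This is a sketch under stated assumptions: resolve_with_path is a hypothetical helper, and it relies on env and /bin/sh being available, which should be the case on OS X.

```shell
#!/bin/bash
# resolve_with_path: hypothetical helper that prints which executable a
# command name resolves to under a given PATH, using an otherwise empty
# environment (roughly like a process that did not inherit your shell setup).
resolve_with_path() {
  local path_value=$1 name=$2
  env -i PATH="$path_value" /bin/sh -c 'command -v "$1"' sh "$name"
}

# The PATH from the plist above; /usr/local/bin is searched first.
resolve_with_path "/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin" python \
  || echo "python not found on that PATH"
```

Swapping the order of the directories in the first argument shows which python would win in each case.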

Testing

Load it to test if it works (as a LaunchAgent in your user Library, it will load automatically the next time you log in):

launchctl load ~/Library/LaunchAgents/com.apple.pypdfocr.daemon.plist

See if it loaded:

launchctl list | grep pypdfocr

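And you should see something like this. The columns are the PID (a - means the job is not running at that moment), the last exit status, and the job's label (whatever is set under Label in the plist):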

-   2   com.apple.pypdfocr.daemon

And there you are. It will watch that folder and always OCR PDFs you copy into it. It can take a while to convert large PDFs, but once they're done, they'll be re-named filename_ocr.pdf. You can see that the daemon is working correctly because a minute or two after you copy the file into the folder there will be a profusion of .png and .html files as it processes the pdf.
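
If you'd rather not eyeball the folder, a small script can list which PDFs are still waiting. This is a sketch, assuming the output lands in the same folder as the input under the filename_ocr.pdf naming described above; ocr_output_name and pending_pdfs are hypothetical helper names.

```shell
#!/bin/bash
# ocr_output_name: expected output name for an input PDF,
# e.g. scan.pdf -> scan_ocr.pdf (naming assumed from the description above).
ocr_output_name() {
  printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# pending_pdfs: list PDFs in a folder that have no _ocr.pdf counterpart yet.
pending_pdfs() {
  local dir=$1 f
  for f in "$dir"/*.pdf; do
    [ -e "$f" ] || continue                 # the glob matched nothing
    case $f in *_ocr.pdf) continue ;; esac  # skip files that are already output
    [ -e "$(ocr_output_name "$f")" ] || printf '%s\n' "$f"
  done
}

pending_pdfs "$HOME/ocr"
```

An empty result means every PDF in the folder already has its OCR'ed twin.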

If it's not working, try watching the system log:

tail -f /var/log/system.log

Because KeepAlive is set, launchd will relaunch the script whenever it exits, waiting about ten seconds between attempts by default.

If you don't want that, replace

<key>KeepAlive</key>

with

<key>RunAtLoad</key>

Finally, if things aren't working properly, set StandardOutPath and StandardErrorPath so that you can see more detailed errors than just a numeric code. Here is the full plist with both keys added:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
  <dict>
    <key>EnvironmentVariables</key>
      <dict>
        <key>PATH</key>
          <string>/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin</string>
      </dict>
    <key>Label</key>
      <string>ocr-pypdfocr-daemon</string>
    <key>KeepAlive</key>
      <true/>
    <key>StandardErrorPath</key> 
      <string>/Users/YOURUSER/ocr-bak/log.txt</string>
    <key>StandardOutPath</key>
      <string>/Users/YOURUSER/ocr-bak/log-out.txt</string>
    <key>ProgramArguments</key>
      <array>
        <string>/Users/YOURUSER/bin/pypdfocr-daemon.sh</string>
      </array>
  </dict>
</plist>

About the Author

Hi. My name is Jeremiah John. I'm a sf/f writer and activist.

I just completed a dystopian science fiction novel. I run a website which I created that connects farms with churches, mosques, and synagogues to buy fresh vegetables directly and distribute them on a sliding scale to those in need.

In 2003, I spent six months in prison for civil disobedience while working to close the School of the Americas, converting to Christianity, as one does, while I was in the clink.