Strange India



There’s nothing worse than opening a PDF and realizing you can’t use the search function or even highlight text. This typically happens when a PDF was created by scanning a paper document: The file is just a series of images. Most modern scanning software uses Optical Character Recognition (OCR) so that words are both searchable and selectable, but sometimes you’ll run into documents where this didn’t happen.

In those cases, the free and open source OCRmyPDF is perfect to have around. This is a command line application that converts a PDF file into a PDF/A file complete with optical character recognition, meaning you’ll be able to search and select the text.

Installing the application is best done using your package manager on Linux devices and using Homebrew on Mac. Windows users can technically install the application by installing Python and a few other dependencies—look into that if you’re willing to do some digging.
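As a rough sketch of what that looks like in practice: The commented lines below are the usual install commands, assuming the package is called ocrmypdf in your package manager (it is on Homebrew, Debian/Ubuntu, and Fedora at the time of writing). The check underneath only confirms that the command ended up on your PATH; the install_status variable is just an illustrative name.

```shell
# Typical install commands (run the one that matches your system):
#   macOS (Homebrew):   brew install ocrmypdf
#   Debian/Ubuntu:      sudo apt install ocrmypdf
#   Fedora:             sudo dnf install ocrmypdf

# Afterwards, confirm the command is available on your PATH.
if command -v ocrmypdf >/dev/null 2>&1; then
  install_status="installed"
  ocrmypdf --version          # prints the installed version number
else
  install_status="missing"
  echo "ocrmypdf was not found on your PATH"
fi
```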

Once the application is set up, you can use it by typing ocrmypdf followed by the name of the document you want to add OCR to, and then the name of the document you’d like to create. So, for example, ocrmypdf before.pdf after.pdf would take “before.pdf”, add character recognition, then create a new document called “after.pdf”.
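If you have a whole folder of scans, the same command can be wrapped in a loop. This is only a sketch: The ocr_folder function, the DRY_RUN switch, and the "scans" and "searchable" folder names are all made up for illustration. By default it prints the commands it would run; set DRY_RUN=0 to actually process the files.

```shell
# Hypothetical helper: OCR every PDF in one folder into another folder.
ocr_folder() {
  in_dir=$1
  out_dir=$2
  count=0
  for pdf in "$in_dir"/*.pdf; do
    [ -e "$pdf" ] || continue                 # the glob matched nothing
    target="$out_dir/$(basename "$pdf")"      # keep the original file name
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "would run: ocrmypdf $pdf $target" # dry run: just show the command
    else
      mkdir -p "$out_dir"
      ocrmypdf "$pdf" "$target"
    fi
    count=$((count + 1))
  done
  echo "handled $count file(s)"
}

# Assumed folder names; change them to match your own setup.
ocr_folder scans searchable
```

Writing to a separate folder keeps the original scans untouched, which is handy if the OCR results turn out poorly and you want to try again.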

The process will take a while, depending on the size of the document, and it might not be entirely accurate if the image quality is low. That said, I found it did a pretty good job with even the most ancient and poorly compressed PDFs I could dig up.

An image from an old history textbook shown here with copyable text.


Credit: Justin Pot

And there’s more you can do here: The Cookbook in the OCRmyPDF documentation outlines a bunch of options. You can compress the images in the PDF, for example, by adding --pdfa-image-compression jpeg to your command. You can automatically re-orient any pages with sideways text by adding --rotate-pages to the command. Or maybe the PDF you’re processing already has OCR that you think is poor quality. In that case, you can add --redo-ocr to the command; this will strip out existing OCR information and start over.
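Written out in full, those three options look something like this. The file names are the same placeholders used earlier, and the small run helper with its DRY_RUN switch is a hypothetical convenience, not part of OCRmyPDF: It prints each command by default, and runs it for real when DRY_RUN is set to 0.

```shell
# Hypothetical helper: show the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Re-orient pages whose text is sideways.
run ocrmypdf --rotate-pages before.pdf after.pdf

# Use JPEG compression for the images in the PDF/A output.
run ocrmypdf --pdfa-image-compression jpeg before.pdf after.pdf

# Throw away an existing, poor-quality OCR layer and recognize the text again.
run ocrmypdf --redo-ocr before.pdf after.pdf
```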

You get the idea: There’s a lot here. Check out the documentation for the full list of what this thing can do.






