Happy Birthday, PDF!

Published on 2023-06-15

On June 15, 1993, the PDF format was introduced. Happy Birthday, PDF! We are glad you exist, because you have changed the world of intelligent document processing (IDP) in recent years.

Before PDFs became common for business documents, IDP software focused on processing images, usually TIFF files coming from scanners. Everyone knows the issues that arise from combining bad scanners with mediocre OCR products: doing something useful with the documents can become tricky. OCR errors caused by engine deficiencies or poor-quality scans can be a nightmare to resolve, and fuzzy techniques and machine learning can only partially overcome them.

Enter PDF. Now you don’t need OCR anymore, right? PDFs have perfect text and those two problems with scans and OCR go away. Or do they?

What matters for IDP is the PDF Source

Two kinds of sources matter:

“Shady PDFs” – PDFs that are created from existing images, for example by scanners configured to output PDF, or from mobile camera images saved as PDF. These often don’t have a text layer, and if they do, it might be a bad one.
“Trustworthy PDFs” – PDFs that were printed directly from an application like Word or an ERP system. These are “trustworthy” in terms of providing perfect text because the text layer is created directly from the original file or data.

When PDFs come into your organization (invoices attached to emails, photos of documents taken with your customers’ mobile phones, etc.), you may or may not know where they came from by the time they reach the IDP product that processes them. It is important that you do know, though. If you really cannot know, treat them as images. Let’s look at these two kinds of PDF sources and explore why that is so.

“Shady PDFs”

Such PDFs can enter your IDP product’s input stream from many sources. Some examples:

A vendor sends a paper invoice to your mailbox. Your scan service provider scans it, saves it as PDF, and from there it goes to the IDP product. There are two cases:
- The PDF has a text layer. Where does it come from? Maybe the scan provider uses a scan tool that runs a built-in OCR engine.
- - In this case, the PDF’s text layer was created by a less-than-ideal OCR engine, not the high-end one in your IDP platform. Simply trusting that text layer isn’t a good idea. We recommend treating the PDF as if it had no text layer (like in the second case), which causes your IDP product to perform proper OCR.
- The PDF only has an image layer. This is what happens most of the time.
You are an insurance company and your customers send you all kinds of documents. They use a home scanner or a mobile phone to take a picture and save it as PDF. The home scanner runs cheap OCR on it and creates a bad text layer.
You process documents from public sources on the internet. In this case, you have no knowledge about the source. There may or may not be text, and the text may or may not be any good.

If your documents can come from these kinds of sources, we recommend ignoring the fact that the PDF has text and treating it as if it didn’t, just like any other image file.
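The recommendation above can be written down as a small routing rule. This is only a minimal sketch: the `Source` categories and the `choose_text_strategy` function are hypothetical names chosen for illustration, not part of any particular IDP product, and it assumes that your intake process can tell you where a PDF came from.

```python
from enum import Enum


class Source(Enum):
    # Hypothetical source categories, following the two kinds of PDFs described above.
    APPLICATION_EXPORT = "application_export"  # printed from Word, an ERP system, ...
    SCAN_OR_PHOTO = "scan_or_photo"            # scanner or mobile camera output
    UNKNOWN = "unknown"                        # e.g. documents from public sources


def choose_text_strategy(source: Source, has_text_layer: bool) -> str:
    """Return "text_layer" or "ocr" for a single incoming PDF."""
    if source is Source.APPLICATION_EXPORT and has_text_layer:
        return "text_layer"
    # Scans, photos and unknown sources are treated like plain image files,
    # even when a text layer is present.
    return "ocr"
```

With this rule, a scanned invoice that arrives with a text layer still goes to OCR, and only PDFs known to come straight from an application keep their text.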

“Trustworthy PDFs”

If an application saves a document directly as PDF, the text layer is created by that application and is expected to be 100% correct. A modern IDP product can extract the text from such PDFs and doesn’t have to run OCR.

The text itself in these PDFs is usually perfect, but we have seen cases where the relation between the image and the text is off. You can sometimes notice this when you open such a PDF in a PDF viewer and try to highlight text with the mouse: the highlighted text doesn’t align with the underlying image. This means the text layer was created with the correct words and characters, but their positions weren’t stored correctly.

This can still cause issues in IDP products:

The simplest problem is that the Human-In-The-Loop applications don’t show the data at the right place in the image when a human performs a review.
Even worse, data extraction can suffer. If certain distances are expected between keywords and values, those distances might differ from what is expected, and extraction accuracy may suffer despite the text being perfect. The IDP software may also rely on image features such as lines to segment the text. Those lines become useless if they are not in the proper location relative to the text.
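One way to catch such position problems before they reach extraction is to compare the word positions from the text layer with positions obtained independently, for example from OCR on the rendered page image. The sketch below is an illustration under assumptions, not how any specific product works: the `Word` type, the function names, and the tolerance of 5 page units are all made up for this example, and both word lists are assumed to use the same coordinate system.

```python
from collections import Counter
from statistics import median
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge, in page units
    y: float  # top edge, in page units


def median_offset(text_layer, reference):
    """Median (dx, dy) between words that occur exactly once in both lists."""
    counts_layer = Counter(w.text for w in text_layer)
    counts_ref = Counter(w.text for w in reference)
    ref_by_text = {w.text: w for w in reference if counts_ref[w.text] == 1}
    dx, dy = [], []
    for w in text_layer:
        if counts_layer[w.text] == 1 and w.text in ref_by_text:
            r = ref_by_text[w.text]
            dx.append(w.x - r.x)
            dy.append(w.y - r.y)
    if not dx:
        return None  # no unambiguous words to compare
    return median(dx), median(dy)


def is_misaligned(text_layer, reference, tolerance=5.0):
    """True if the text layer is shifted by more than `tolerance` in x or y."""
    offset = median_offset(text_layer, reference)
    if offset is None:
        return False
    return abs(offset[0]) > tolerance or abs(offset[1]) > tolerance
```

Only words that appear exactly once on the page are matched, which avoids pairing the wrong occurrences of a repeated word. The median keeps a few bad matches from dominating the result. A check like this only detects a uniform shift, not scaling or rotation.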

Examples of “bad” PDFs

Perfect text but ill-positioned relative to the image.
No text layer at all (that’s often better than having a bad text layer).
Text layer with bad OCR accuracy due to a consumer-grade OCR engine being used.
Text layer that is completely unusable due to ASCII/Unicode issues.
Hidden Text. That is an actual feature of the PDF format. Some IDP applications are configured to ignore the hidden text, so despite a perfect text layer being there, the IDP application ignores it.
Password-protected PDFs. These can cause hangs and crashes in the IDP product if it doesn’t support unlocking them.
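Some of these cases can be screened for cheaply once the text has been extracted. As an example, a text layer that is unusable due to ASCII/Unicode issues often shows up as a high share of private-use, unassigned, control, or replacement characters. The following is a rough heuristic sketch; the function names and the threshold of 0.2 are assumptions for illustration and would need tuning on real documents.

```python
import unicodedata


def suspicious_char_ratio(text: str) -> float:
    """Share of non-whitespace characters that are unlikely in real business text."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0  # an empty text layer is a separate case and is not flagged here
    bad = 0
    for c in chars:
        category = unicodedata.category(c)
        # Co = private use, Cn = unassigned, Cc = control;
        # U+FFFD is the Unicode replacement character.
        if c == "\ufffd" or category in ("Co", "Cn", "Cc"):
            bad += 1
    return bad / len(chars)


def text_layer_looks_unusable(text: str, threshold: float = 0.2) -> bool:
    """Flag text layers where too many characters look like encoding debris."""
    return suspicious_char_ratio(text) > threshold
```

A PDF flagged this way can then be treated like a PDF without a text layer and sent to OCR.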

We all love the PDF format, but it has a lot of pitfalls when it comes to automatically processing PDFs from various, unknown sources. We have seen IDP projects where PDFs are generally treated as images and the text is ignored because it does more harm than good.
