Automatic Document Separation for Invoices - possible! but recommended?

Published on 2022-09-01

Document separation comes into play when multiple documents are scanned as one pile of paper or when you receive large PDFs that contain several actual documents. If you want to automate processes based on individual documents, you need to separate them.

This is a crucial step in invoice automation. The mailroom opens all the incoming invoices and puts them on a stack which gets scanned and then processed. Or daily PDFs are emailed or come in via FTP, containing multiple invoices.

In order to process, approve and pay these invoices, they need to be individual documents. There are 3 typical ways to do this and we are going to look into the pros and cons of each.

Why Separation matters: The vanishing invoices

When separation fails, it can cause huge problems downstream, so the method you select should be reliable. A missed separation means that 2 invoices are treated as one. Let’s say you scan a 4-page invoice and a 5-page invoice, and the capture product that is supposed to detect where the 5-page one starts misses this document boundary. Then the 5 pages just become additional pages of the first invoice, and you end up with a 9-page document that is actually 2 invoices. But your operators, coders, and approvers may only look at the first few pages and never realize there is a second invoice.

Such a hidden second invoice might then not get paid, and the vendor may call weeks later to ask about their money. At that point it is very hard to find the invoice, because it is buried inside another document. Even if the operators do detect the error, the 2nd invoice has to be split off somehow and re-processed. That is cumbersome and time-consuming, and due dates and discounts may be missed. If a split is missed, an invoice can “vanish”. You don’t want that to happen!

Method 1: Separator sheets

Separator sheets are special pieces of paper that the mailroom operator inserts between 2 invoices. Open one envelope, put the content on the scanner stack, put a separator sheet on top, then repeat for the next envelope. Separator sheets usually have very large barcodes or patch codes on them that the capture product can easily and reliably detect. Most capture products can be configured to start a new document when a separator sheet is detected and delete the separator sheet.
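The splitting logic itself is simple, which is a large part of why this method is so reliable. Below is a minimal sketch in Python, assuming the scanned stack is already available as a list of pages and that some detector can tell whether a page is a separator sheet. The function name split_on_separators and the is_separator callback are hypothetical and not taken from any particular capture product.

```python
from typing import Callable, List, Sequence, TypeVar

Page = TypeVar("Page")


def split_on_separators(
    pages: Sequence[Page], is_separator: Callable[[Page], bool]
) -> List[List[Page]]:
    """Split a scanned page stream into documents at separator sheets.

    Separator sheets start a new document and are dropped from the output.
    """
    documents: List[List[Page]] = []
    current: List[Page] = []
    for page in pages:
        if is_separator(page):
            # Close the running document (if any) and discard the sheet.
            if current:
                documents.append(current)
            current = []
        else:
            current.append(page)
    if current:
        documents.append(current)
    return documents


# Hypothetical stack: three invoices with a separator sheet between them.
stack = ["A1", "A2", "SEP", "B1", "B2", "B3", "SEP", "C1"]
docs = split_on_separators(stack, lambda page: page == "SEP")
```

Here docs contains three documents with 2, 3, and 1 pages, and no separator sheets remain in the output.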

Pros

Most reliable approach

Cons

Waste of one page of paper per invoice
Manual step needed
Doesn’t work with multi-document PDFs

Method 2: Barcode stickers

Many accounting departments use barcode stickers when processing invoices. The barcodes are on a roll, and every time the mailroom clerk puts an invoice on the scanner stack, they rip a barcode off the roll and stick it onto the first page of the invoice. This indicates to the capture software that a new document begins here. Often, the barcode represents a unique number that can be used as the ID in the ERP system. When the document gets scanned and separated, the barcode value is sent to the ERP system together with the (as yet unprocessed) image, so the invoice is registered and known in the ERP system early on.
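To make the mechanics concrete, here is a minimal sketch in Python of sticker-based splitting. It assumes that a barcode reader has already decoded each page and reports either a value or nothing. The ScannedPage and Document classes, the split_on_barcodes function, and the sample barcode values are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ScannedPage:
    image_name: str
    barcode: Optional[str] = None  # value decoded from a sticker, if any


@dataclass
class Document:
    doc_id: Optional[str]  # barcode value, usable as the ID in the ERP system
    pages: List[ScannedPage] = field(default_factory=list)


def split_on_barcodes(pages: List[ScannedPage]) -> List[Document]:
    """Start a new document at every page that carries a barcode sticker."""
    documents: List[Document] = []
    for page in pages:
        # A sticker marks a first page. Pages before the first sticker are
        # collected in a document without an ID so they are not lost.
        if page.barcode is not None or not documents:
            documents.append(Document(doc_id=page.barcode))
        documents[-1].pages.append(page)
    return documents


scanned = [
    ScannedPage("page1.tif", barcode="INV-0001"),
    ScannedPage("page2.tif"),
    ScannedPage("page3.tif", barcode="INV-0002"),
]
documents = split_on_barcodes(scanned)
```

Unlike a separator sheet, the sticker page is kept, because it is also the first page of the invoice. The doc_id of each document is what would be handed to the ERP system along with the image.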

Pros

Very reliable approach
Saves paper
Provides a unique ID early on in the process

Cons

Manual step needed
If the barcode sticker is not placed perfectly in a white space of the invoice, it can prevent OCR from reading the text properly. It can also cause images to be misclassified if the barcode is too large and dominant.
Barcodes are slightly less reliable than separator sheets but still a very good choice
Doesn’t work with multi-document PDFs

Method 3: Intelligent capture techniques

Instead of using any barcode or patch code, many capture software vendors offer automatic document separation. Some of them use machine learning approaches and some of them, especially for invoice automation, use a complex set of hard-coded rules.

The machine learning approach works by showing the system examples of first pages of documents/invoices, so that over time it learns to split the page stream at such pages. For invoices, you would need a lot of samples and training, as every invoice looks different yet has similar content.

The rules approach often looks at things like page counters on the document, such as “page 2 of 10”, or it extracts data like the vendor and the invoice number from each page. When it finds that this data changes on a page, it assumes that the page belongs to the next invoice. All the rules together do work, but this approach is quite error-prone. If only 1% of the documents have an OCR error in a place that is critical to those rules, you may miss a split and cause up to 1% of the invoices to “vanish” as described earlier. 1% in a company that receives 100,000 invoices a year means 1,000 invoices that have to be painfully retrieved and up to 1,000 unhappy vendors.
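The following Python sketch shows how two such rules might be combined and how a single OCR error can make a split disappear. It is an illustration under stated assumptions, not how any specific capture product works: the two regular expressions, the split_by_rules function, and the sample page texts are all made up for this example.

```python
import re
from typing import List

# Hypothetical patterns for a page counter and an invoice number.
PAGE_COUNTER = re.compile(r"page\s+(\d+)\s+of\s+(\d+)", re.IGNORECASE)
INVOICE_NO = re.compile(
    r"invoice\s*(?:no\.?|number|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
)


def split_by_rules(page_texts: List[str]) -> List[List[int]]:
    """Group page indices into documents using two simple rules.

    Rule 1: a page counter reading "page 1 of N" starts a new document.
    Rule 2: without a counter, a changed invoice number starts a new document.
    """
    documents: List[List[int]] = []
    last_invoice_no = None
    for index, text in enumerate(page_texts):
        counter = PAGE_COUNTER.search(text)
        number = INVOICE_NO.search(text)
        invoice_no = number.group(1) if number else None

        if counter:
            is_first = int(counter.group(1)) == 1
        else:
            is_first = invoice_no is not None and invoice_no != last_invoice_no

        if is_first or not documents:
            documents.append([])
        documents[-1].append(index)
        if invoice_no is not None:
            last_invoice_no = invoice_no
    return documents


clean = [
    "Invoice No: A-100 Page 1 of 2",
    "Invoice No: A-100 Page 2 of 2",
    "Invoice No: B-200 Page 1 of 1",
]
# Same stack, but OCR misread the first page of the second invoice.
garbled = clean[:2] + ["lnvoice N0: B-200 Pa9e 1 of 1"]

clean_result = split_by_rules(clean)
garbled_result = split_by_rules(garbled)
```

With the clean text, the three pages are grouped as two documents. With the garbled text, neither rule fires on the third page, so all three pages end up in one document and the second invoice is hidden behind the first.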

Pros