Language plays an important role in document classification and data extraction. It also matters in downstream processing and whenever humans are in the loop. A document is written in one language, your users speak many languages, and your business deals with companies in various countries that use yet other languages.
Detecting the language of a document early on in the processing workflow can improve accuracy and throughput.
Let’s say you have an AP Automation system in place. It centrally processes invoices that your organization receives from many countries. An OCR engine does text recognition. Human operators review incorrectly extracted data in a browser-based UI. The simplest way of setting up the system with regard to language support is this:
In the OCR engine configuration, select all document languages the system can encounter.
Configure the machine-learning-based data extraction so that it builds an AI model for all languages at once.
Send invoices for review to a large pool of all operators so you can efficiently burn through the review tasks.
This approach keeps the configuration of the automation system simple, but each of the above “default settings” has disadvantages, which we will now look into.
OCR engines work roughly like this: they segment the image into characters, classify each character to determine which one it is, and then form words based on the spacing. Lastly, they use a language-specific dictionary to improve the raw recognition, because a reading is more likely to be correct if it appears in a dictionary. The language settings control which dictionaries are used.
This works well if you select only one or two languages, but many engines allow you to select dozens of languages at once. If you do that, the OCR engine applies all the selected dictionaries, and the chances increase that it reads, for example, a French word in an English document.
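The dictionary effect can be illustrated with a toy sketch. It is not how any particular OCR engine is implemented: the word lists, the confidence values, and the DICTIONARY_BONUS constant are all assumptions made up for this example. It only shows why a reading that exists in some selected dictionary can win once more dictionaries are switched on.

```python
# Toy word lists standing in for real OCR dictionaries (assumption).
DICTIONARIES = {
    "en": {"invoice", "date", "total", "amount"},
    "fr": {"facture", "daté", "total", "montant"},
}

# Hypothetical boost for readings that appear in a selected dictionary.
DICTIONARY_BONUS = 0.2


def pick_reading(candidates, languages):
    """Return the candidate reading with the highest adjusted score.

    candidates: list of (text, raw_confidence) pairs from character recognition.
    languages: codes of the languages whose dictionaries are switched on.
    """
    words = set().union(*(DICTIONARIES[lang] for lang in languages))

    def adjusted(candidate):
        text, confidence = candidate
        return confidence + (DICTIONARY_BONUS if text in words else 0.0)

    return max(candidates, key=adjusted)[0]


# A speck above the "e" makes the raw recognition slightly prefer "daté".
candidates = [("date", 0.48), ("daté", 0.52)]

print(pick_reading(candidates, ["en"]))        # date
print(pick_reading(candidates, ["en", "fr"]))  # daté
```

With only the English dictionary, the dictionary bonus rescues the correct reading. With French added, both readings get the bonus and the slightly wrong raw result wins.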
You can avoid this in many automation products by using language detection. Instead of processing all documents with OCR engine settings using all languages, you can first detect the language and then route the document to a processing step where the OCR engine settings use only one language. This means more configuration effort, but likely results in better recognition.
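As a rough sketch of this routing idea, the snippet below guesses the language from a quick first-pass text and then returns the OCR languages to use for the real recognition pass. The stopword lists, the function names, and the min_hits threshold are illustrative assumptions. A real product would use its own detection feature or a trained classifier.

```python
import re

# Tiny stopword lists for illustration only (assumption); real detectors
# use trained models and far more evidence.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "for", "invoice"},
    "fr": {"le", "la", "et", "de", "pour", "facture"},
    "de": {"der", "die", "und", "von", "für", "rechnung"},
}


def detect_language(text, min_hits=2):
    """Return the language with the most stopword hits, or None if unsure."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    hits = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    ranked = sorted(hits, key=hits.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    # Too little evidence or a tie: do not commit to a single language.
    if hits[best] < min_hits or hits[best] == hits[runner_up]:
        return None
    return best


def ocr_languages_for(first_pass_text):
    """Pick one OCR language if detection is confident, else fall back to all."""
    language = detect_language(first_pass_text)
    return [language] if language else sorted(STOPWORDS)


print(ocr_languages_for("Facture pour la livraison de marchandises"))  # ['fr']
print(ocr_languages_for("Invoice for the delivery of goods"))          # ['en']
print(ocr_languages_for("12345 ABC"))                                  # ['de', 'en', 'fr']
```

The fallback to all languages keeps undetectable documents flowing instead of failing them, at the cost of the lower recognition quality described above.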
Many capture products include machine learning tools for data extraction. In the old days, consultants came in and set up regular expressions and dozens of rules per field that needed extraction. That became unmanageable quickly. Machine learning to the rescue. Instead of writing rules, users train the system by clicking on the value of each field in several documents. The system builds a model from all these sample documents and figures out the good old rules by itself.
Now, this model’s performance also depends on the samples you use to train it. Training one model on invoices from all countries can cause effects similar to the OCR problem described above: the more languages the engine sees in the samples, the less consistent the keywords and labels for each field become.
Again, you can avoid this with some design effort. Most capture automation products include classification features. You can use language detection based on classification to route the document to a language-specific sub-class. In that sub-class, the fields are extracted with a language-specific AI model only, which likely provides better accuracy. Designing the capture project like this has other great side effects, too:
You can use the language information to verify the extracted vendor. For example, it is unlikely that a Spanish vendor sends a French invoice.
You can use the language to derive the country of the vendor in many cases. This allows you to fine-tune the local tax validations and other country-specific rules.
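The design above and its two side effects could look roughly like this. The model names and the country-to-language table are hypothetical placeholders, not part of any specific capture product. Note that a language can map to more than one country, which is why the lookup returns candidates rather than a single answer.

```python
# Hypothetical model names; a real platform references models in its own way.
LANGUAGE_MODELS = {
    "en": "invoice-fields-en",
    "fr": "invoice-fields-fr",
    "es": "invoice-fields-es",
}
FALLBACK_MODEL = "invoice-fields-multilingual"

# Assumed mapping from vendor country to the languages its invoices usually use.
EXPECTED_LANGUAGES = {
    "ES": {"es"},
    "FR": {"fr"},
    "CH": {"de", "fr", "it"},
}


def extraction_model_for(language):
    """Route to a language-specific model, or to a shared fallback model."""
    return LANGUAGE_MODELS.get(language, FALLBACK_MODEL)


def vendor_language_plausible(vendor_country, invoice_language):
    """Flag mismatches only for countries we have expectations for."""
    expected = EXPECTED_LANGUAGES.get(vendor_country)
    return expected is None or invoice_language in expected


def candidate_countries(invoice_language):
    """Countries whose vendors plausibly send invoices in this language."""
    return sorted(
        country
        for country, languages in EXPECTED_LANGUAGES.items()
        if invoice_language in languages
    )


print(extraction_model_for("fr"))             # invoice-fields-fr
print(vendor_language_plausible("ES", "fr"))  # False
print(candidate_countries("fr"))              # ['CH', 'FR']
```

A failed plausibility check is best treated as a reason to send the invoice to review, not as proof that the vendor was extracted incorrectly.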
When you automate the Accounts Payable process, the people reviewing the incorrectly extracted invoices or doing the downstream processing (approvals, GL-coding, etc.) are often experts. If you centralize the processing of all invoices, it makes sense to route them to users fluent in the invoice language. It doesn’t make much sense to have a French accountant review a Chinese invoice.
Language detection can solve that problem, too. You can configure many AP Automation systems to route tasks to individuals or groups based on custom criteria. Why not use the detected invoice language as a criterion to route the invoice to the person who speaks the invoice’s language?
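A minimal sketch of such a routing criterion follows. The Operator type, the sample names, and the fallback to the full pool are assumptions for illustration. In a real AP Automation system this would be a routing rule in the workflow configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    name: str
    languages: frozenset  # language codes the operator is fluent in


def reviewers_for(invoice_language, operators):
    """Return operators fluent in the invoice language.

    Falls back to the whole pool so that no invoice is left without a reviewer.
    """
    fluent = [op for op in operators if invoice_language in op.languages]
    return fluent or list(operators)


# Made-up sample pool.
pool = [
    Operator("Anne", frozenset({"fr", "en"})),
    Operator("Li", frozenset({"zh", "en"})),
    Operator("Jorge", frozenset({"es"})),
]

print([op.name for op in reviewers_for("zh", pool)])  # ['Li']
print([op.name for op in reviewers_for("en", pool)])  # ['Anne', 'Li']
print([op.name for op in reviewers_for("de", pool)])  # ['Anne', 'Li', 'Jorge']
```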
Language detection is usually done with classification technology. There are three kinds of tools for this:
Built into the OCR engine. You can configure some OCR engines to either use a manually selected set of languages or auto-detect the right one. The auto-detection is done internally in the engine. Often, it does a first pass in a language with Western characters, like English, and then classifies the text with an AI model trained on all languages. But for you as a user of the OCR engine, it is just a checkbox: “Auto-detect language”.
Built into the automation platform. Sometimes language detection is a platform feature that you can use to steer the workflow or to modify settings like OCR languages dynamically.
Build your own. Most capture automation platforms include classification. You can create your own classification model for languages. If you need samples to train such a language classifier, you can find them easily on Kaggle.com, or you can use existing, labeled samples from your system of record.
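To give a feel for the build-your-own option, here is a small sketch of a language classifier trained on labeled text samples. It uses character trigram counts and cosine similarity, a common baseline chosen here for brevity. It is not what any particular capture platform does internally, and the training sentences are made up. A real classifier would need far more samples per language.

```python
import math
from collections import Counter


def trigrams(text):
    """Count character trigrams of the lowercased, whitespace-normalized text."""
    padded = f" {' '.join(text.lower().split())} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def train(samples):
    """Build one trigram profile per language from (language, text) pairs."""
    profiles = {}
    for language, text in samples:
        profiles.setdefault(language, Counter()).update(trigrams(text))
    return profiles


def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def classify(text, profiles):
    """Return the language whose profile is most similar to the text."""
    grams = trigrams(text)
    return max(profiles, key=lambda language: cosine(grams, profiles[language]))


# Made-up training sentences; use your own labeled documents in practice.
SAMPLES = [
    ("en", "Please pay the invoice within thirty days"),
    ("en", "The total amount is due on receipt"),
    ("fr", "Veuillez payer la facture dans les trente jours"),
    ("fr", "Le montant total est dû à réception"),
    ("de", "Bitte zahlen Sie die Rechnung innerhalb von dreißig Tagen"),
    ("de", "Der Gesamtbetrag ist bei Erhalt fällig"),
]

profiles = train(SAMPLES)
print(classify("the invoice amount is due", profiles))          # en
print(classify("payer la facture dans trente jours", profiles)) # fr
print(classify("die Rechnung ist fällig", profiles))            # de
```

Character n-grams work without any word list, which makes the approach easy to extend: adding a language only requires adding labeled samples for it.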
Consider your business case. You might not need to worry about language detection if you process only documents in one language and if all your people working on them speak the same language. But in all other cases, consider language detection capabilities when selecting OCR engines or buying software that solves your document automation problems.