Before we dive into the details, let me explain what you can expect from this post. OCR is one of many core technologies for document-driven automation. It is about recognizing text from the pixels in images, not about end-to-end automation of your business processes. Most Intelligent Automation products come with an OCR engine, because recognizing text in images is a prerequisite for automating document-driven processes. However, if you already have a platform like RPA or BPM and want to extend it with OCR, or if you simply want to digitize your documents and make them searchable, this post will guide you through the available engines and options.
This post will walk you through the different kinds of OCR engines and services and their use cases. We will take a look at cloud vs. on-premise, standalone software vs. SDK and services, types of documents to process, as well as a list of industry-leading vendors.
The main question is whether you need OCR at all. We here at Data Fools believe that buying decisions should always start by looking at your needs and use cases, rather than evaluating various technologies and looking for possible applications later.
OCR turns the pixels on document images into words that higher-level technologies like Classification or Data Extraction can use as input. So the basic question is: Is your document automation problem a pixel problem? If you scan paper documents, it obviously is. A scanned image is just a bunch of pixels and you need OCR software to turn them into actionable digital words. If you process PDF documents, you may not need OCR.
Many, but not all, PDF documents have a text layer (you notice that when you are able to select text with the mouse in Acrobat Reader). In such a case, it is often, but not always, better to just use that text layer rather than performing OCR on the image layer of the PDF. Or do you process digital content like social media posts, websites, tweets, or emails? In that case, you don’t need OCR either. You already have the characters and words.
Indeed, the main application for OCR today is still converting paper documents into actionable digital data. But there are others. Here are a few examples:
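The decision described above can be written down in a few lines. The following is only a minimal sketch: the names `Document` and `needs_ocr`, the list of source kinds, and the `has_text_layer` flag are our own illustrative assumptions. Detecting a text layer in a real PDF would be done with a PDF library of your choice, which is not shown here.

```python
from dataclasses import dataclass

# Source kinds that consist of pixels only; OCR is needed to get words out of them.
PIXEL_SOURCES = {"scan", "photo", "screenshot"}

# Source kinds that already consist of characters and words.
DIGITAL_TEXT_SOURCES = {"email", "web_page", "social_media_post"}


@dataclass
class Document:
    kind: str                     # e.g. "scan", "pdf", "email" (illustrative labels)
    has_text_layer: bool = False  # only meaningful for PDFs


def needs_ocr(doc: Document) -> bool:
    """Return True if the document is a 'pixel problem'."""
    if doc.kind in PIXEL_SOURCES:
        return True
    if doc.kind in DIGITAL_TEXT_SOURCES:
        return False
    if doc.kind == "pdf":
        # Simplification: we always trust an existing text layer, although
        # in practice this is often, but not always, the better choice.
        return not doc.has_text_layer
    raise ValueError(f"unknown document kind: {doc.kind!r}")
```

A real routing step would probably also check the quality of an existing text layer before trusting it.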
You create apps on smart devices with cameras and you need to convert the text in photos taken with the device into digital words. There are lots of types of apps out there that perform OCR, like business card readers, document “scanning” apps, etc.
You want to remote control a virtualization system like Citrix. All you have to work with are basically “screenshots” of the remote machine. OCR can help detect buttons and text in those screens and help decide where to “click” automatically.
This post will focus on enterprise automation needs, but of course, there are also use cases for OCR desktop products. These are tools that just run on a client machine where you can recognize text in images and save them as PDF. They are suited for occasional, on-demand use by individuals, and we will not look at those in this post.
One of the first choices to make when looking for the right OCR software is this: cloud or on-premise?
There are 2 important factors here.
Do your documents contain sensitive content? Depending on the regulations in your company or those of your client or the country you do business in, you may not be able to send documents to a cloud service, no matter how secure it is. If your company mandate is that the documents remain behind your firewall, a cloud OCR vendor is usually off-limits.
Do you want to host it, maintain it, upgrade it, or do you want the software as a service? This question requires input from your IT department and you need to calculate the long-term cost of each option. A cloud service might be more expensive from a pure license perspective, but you have to factor in the on-premise hosting costs for hardware and staff as well.
Due to network traffic, cloud OCR is naturally slower if you look at the time needed to process a single page. This can of course be overcome by calling the cloud service in parallel, but you need to look at your performance needs, and those mainly depend on the number of documents you process per month.
All cloud OCR providers today require that you integrate their REST service in your on-premise system. This means you definitely need some kind of software development team to create an integration with whatever your systems are, for example, SharePoint, an ERP system, or a database. More on that later.
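To illustrate the idea of calling the service in parallel, here is a minimal sketch using a thread pool from the Python standard library. The function `ocr_page` is a hypothetical stand-in that only simulates the latency of one request; in a real integration it would contain the HTTP call to your provider.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def ocr_page(page_id: int) -> str:
    """Hypothetical stand-in for one cloud OCR request."""
    time.sleep(0.05)  # simulated network round trip plus processing time
    return f"text of page {page_id}"


def ocr_pages_in_parallel(page_ids, max_workers: int = 8) -> list:
    """Send several pages at once; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ocr_page, page_ids))


if __name__ == "__main__":
    start = time.perf_counter()
    texts = ocr_pages_in_parallel(range(16))
    elapsed = time.perf_counter() - start
    print(f"{len(texts)} pages in {elapsed:.2f}s")
```

Threads are a reasonable fit here because the time is spent waiting for the network rather than computing. Keep in mind that providers typically limit the number of requests you may send per second, so the number of workers cannot be raised indefinitely.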
The billing models can be different. This has nothing to do with cloud architecture per se, more with history. Modern cloud vendors mostly offer a pay-per-use model: you pay a certain amount of money per page processed. Canceling the service is easy, you simply stop using it. On-premise enterprise OCR software is usually sold as a perpetual license, where you can process X number of pages per year but pay only once upfront. This makes it much harder to change the provider because you have already paid for the current one. The alternative is a term model where you rent that license year after year.
Another difference is control. If you install software on-premise, you control when you update it and thus when the behavior changes. If you don’t like the new behavior, you can roll back. With cloud services, you have no control over that at all. Of course, the cloud vendors strive to make their software better with every update, but with OCR this can mean that a specific document that worked well before might not work as well after the upgrade. The risk of your overall accuracy going down when a cloud OCR vendor upgrades their software is low, though.
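The long-term cost comparison is simple arithmetic once you have the numbers. The sketch below shows one way to find the year in which a perpetual license becomes cheaper than pay-per-use. All prices in the test values are invented for illustration only and are not taken from any vendor; the function names are our own.

```python
def pay_per_use_cost(pages_per_year: int, price_per_page: float, years: int) -> float:
    """Total cost of a pay-per-page cloud service after a number of years."""
    return pages_per_year * price_per_page * years


def perpetual_cost(license_fee: float, yearly_running_cost: float, years: int) -> float:
    """One-time license fee plus yearly hosting, staff, and maintenance."""
    return license_fee + yearly_running_cost * years


def break_even_year(pages_per_year, price_per_page, license_fee,
                    yearly_running_cost, max_years=20):
    """First year in which the perpetual license is no more expensive, or None."""
    for year in range(1, max_years + 1):
        on_premise = perpetual_cost(license_fee, yearly_running_cost, year)
        cloud = pay_per_use_cost(pages_per_year, price_per_page, year)
        if on_premise <= cloud:
            return year
    return None
```

With made-up numbers of 500,000 pages per year at 0.01 per page, a license fee of 11,000, and 1,000 per year in running costs, `break_even_year(500_000, 0.01, 11_000, 1_000)` returns 3. At very low volumes the function returns `None`, meaning pay-per-use stays cheaper within the period considered.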
Some of the classic cloud providers may have an on-premise option nowadays that you need to search for. On-premise is not their primary business so that option may be hidden deeper on their websites. But for customers concerned about security, some of these vendors offer their cloud technology as a virtual container, e.g. Docker or Kubernetes.
Cloud service providers like AWS, Google, Tencent, or Microsoft are very sensitive to your security concerns. They publish detailed security and data usage statements, and some of them offer options for even more sensitive customers. For example, Google provides 2 kinds of REST endpoints for many of its services. The first type requires that you upload the documents to secure storage on Google’s servers, have them processed, and then download the results. This requires storage, even if only for a brief period of time, and may scare away some customers. The second type addresses those customers’ needs: the documents submitted via the REST service are processed immediately in memory and are never stored, not even briefly. This type of API is the most secure, but of course, you still have to send the document “to the internet”.
You also need to consider if the cloud service is available in your region. In some industries, or by some buyers, it is required that the documents get processed in-country, or at least in a certain region like the EU or North America. Most of the larger cloud providers have data centers all over the world, but not all of the services (including OCR) are supported in every data center. So you need to check if the service you like is available in your region, or at least if you can force the documents to be processed in a certain region. Some providers just route the document to the nearest or least loaded location and don’t give you control. But they document that transparently, so you can easily find out.
Basically, there are 3 types of OCR products:
REST service based, either in the cloud or on-premise
Classic SDKs for .NET, C++ or Java
Standalone software or desktop products
Standalone OCR servers that you install in your own IT environment typically monitor a directory and convert images in it to PDF or text files. These products come with a configuration tool for an admin and may have some basic workflow functionalities. Other than some scripting or integration to manage those files that get created, not much actual software development is required to use these products.
SDKs and REST services have in common that you need to integrate them in other software, or that you need to write your own standalone OCR server software that uses these interfaces for the actual OCR. The difference is in how you integrate with them. Most REST Service APIs return structured data for OCR results in a JSON format. Someone needs to write code to convert these results into the format required by the next system that is going to use them. Integrating SDKs into your own software is more involved and needs a real software development team.
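As an illustration of the kind of conversion code mentioned above, here is a minimal sketch that flattens an OCR result in JSON into plain text. The JSON layout (`pages`, `lines`, `text`, `confidence`) is a hypothetical shape we made up for this example. Every real service defines its own schema, so check your provider's documentation.

```python
import json

# Invented sample response; not the schema of any real service.
SAMPLE_RESPONSE = """
{
  "pages": [
    {"number": 1, "lines": [
      {"text": "Invoice 2021-001", "confidence": 0.99},
      {"text": "Total: 120.00 EUR", "confidence": 0.87}
    ]},
    {"number": 2, "lines": [
      {"text": "Thank you for your business", "confidence": 0.95}
    ]}
  ]
}
"""


def ocr_json_to_text(raw: str, page_separator: str = "\f") -> str:
    """Join the text lines of each page; separate pages with a form feed."""
    result = json.loads(raw)
    pages = []
    for page in result["pages"]:
        pages.append("\n".join(line["text"] for line in page["lines"]))
    return page_separator.join(pages)


def low_confidence_lines(raw: str, threshold: float = 0.9) -> list:
    """Collect lines a human might want to review before further processing."""
    result = json.loads(raw)
    return [
        line["text"]
        for page in result["pages"]
        for line in page["lines"]
        if line["confidence"] < threshold
    ]
```

The second function hints at why the structured result is worth keeping: besides the text, it usually carries information such as confidence values or positions that the next system can use.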
So unless you buy a desktop product or a standalone (but inflexible) server application, you will need significant software development efforts to integrate OCR with your systems or applications.
Let us take a look at your use case again. Because not all OCR products out there are great at processing all types of documents.
You need to check that the OCR provider/engine supports all the languages you need to process. For example, if you are a global company looking to process documents centrally, you need to make sure the OCR engine supports the languages of the documents from all the countries in your organization. Most of the larger vendors for machine-printed documents support more than a hundred languages, so the chances are high that one vendor is enough. But some organizations need to look into buying several OCR products to cover all languages.
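Checking language coverage is essentially a set comparison. The sketch below picks engines until all required languages are covered. The engine names and their language lists are invented placeholders, and the greedy strategy is just one simple way to do this; it does not always find the smallest possible combination.

```python
def pick_engines(required, catalog):
    """Greedily pick engines until every required language is covered.

    `catalog` maps an engine name to the set of language codes it supports.
    """
    missing = set(required)
    chosen = []
    while missing:
        # Pick the engine that covers the most languages still missing.
        best = max(catalog, key=lambda name: len(missing & catalog[name]), default=None)
        gained = missing & catalog[best] if best is not None else set()
        if not gained:
            raise ValueError(f"no engine supports: {sorted(missing)}")
        chosen.append(best)
        missing -= gained
    return chosen


# Hypothetical catalog, for illustration only.
CATALOG = {
    "engine_a": {"en", "de", "fr", "es"},
    "engine_b": {"ja", "zh"},
    "engine_c": {"en", "ja"},
}
```

For the required languages `{"en", "de", "fr", "ja", "zh"}` this returns `["engine_a", "engine_b"]`, which is the situation described above where a single product is not enough.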
Most of the SDK vendors historically recognize text on scanned business documents. Those documents are their strength. They expect the input to be a scanned piece of paper though, and that can be different from a photograph of a piece of paper. Here is an example:
The scanned document is cropped nicely and the aspect ratio of the image matches the one of the original document. The photographed image shows the document as part of the image, but you also see the background. Furthermore, the image sides are not rectangular, but skewed, due to the camera angle. Parts of the paper on the bottom are also missing due to the angle. There could be other issues, too. For example, the paper on the desk might not be perfectly flat, which could cause extra distortions.
The scanned image is black & white (though that is a scanner option) and the photographed image is in color.
In the above examples, the scanned image and the photograph have different resolutions. This can negatively impact OCR results and data extraction.
Some of the text areas are of challenging quality. In the scanned image, due to the original color, the email address is now hard to read. The photographed document has a shaded area on the left side. The scanner has removed that shading fairly well, making the black characters stand out better.
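To make the resolution point more concrete, the following sketch estimates the effective resolution of a page image from its pixel dimensions. It assumes that the image shows exactly one full A4 page and nothing else, which holds for a cropped scan but not for a photograph with background. The function name is our own, and the frequently recommended target of about 300 dpi should be checked against your engine's documentation.

```python
A4_INCHES = (8.27, 11.69)  # width and height of an A4 page in inches


def effective_dpi(width_px: int, height_px: int, paper_inches=A4_INCHES) -> float:
    """Estimate dots per inch, assuming the image is exactly one full page."""
    width_in, height_in = paper_inches
    # Use the smaller value so that a distorted aspect ratio is not flattering.
    return min(width_px / width_in, height_px / height_in)
```

An image of 2480 x 3508 pixels comes out at roughly 300 dpi under this assumption, while 1240 x 1754 pixels gives roughly 150 dpi, meaning each character is made of far fewer pixels.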
A classic enterprise OCR engine SDK was designed for scanned images and it may find some of the aspects of the photograph to be challenging. Some of these issues can be overcome easily, like using image perfection software that “rectangularizes” and crops the document from the photograph.
Most cloud vendors started the other way around. Their primary use case was to detect and recognize small amounts of text in larger photographs, e.g. street signs or car license plates. Only later did they apply their technology to business documents.
For example, Microsoft Azure Cognitive Services provides 2 APIs for these use cases. The OCR API is for small amounts of text in a larger image. The Read API is for text-dominant images, like photos or scans of business documents.
Even today, both types of vendors are best at what they originally designed their engines for.
One option besides buying software or licensing a service is to use an open-source product.
Tesseract is the most well-known open-source OCR software for western languages. It is an SDK available for several platforms and programming languages. Hewlett-Packard originally developed it, then open-sourced it in 2005, with ongoing development sponsorship from Google since 2006.
Tesseract was designed for and is still only strong with image-perfected business documents. It cannot be applied well to photographs.
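If you want to try Tesseract, one simple route is its command-line program. The sketch below builds and runs such a command from Python. It assumes that the `tesseract` executable and the requested language data are installed on the machine; the helper names are our own and the file name in the usage note is a placeholder.

```python
import shutil
import subprocess


def build_tesseract_command(image_path: str, language: str = "eng") -> list:
    """Command line that prints the recognized text to standard output."""
    # "stdout" as the output base makes tesseract write to standard output;
    # -l selects the language data to use (for example "eng" or "deu").
    return ["tesseract", image_path, "stdout", "-l", language]


def run_tesseract(image_path: str, language: str = "eng") -> str:
    """Run tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    completed = subprocess.run(
        build_tesseract_command(image_path, language),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract reports an error
    )
    return completed.stdout
```

Calling `run_tesseract("scan.png")` would return the recognized text as a string. In line with the limitation mentioned above, the input should be a clean, cropped scan rather than a raw photograph.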
There are dozens of OCR engines out there. Many of them are from the 1990s and haven’t seen much attention in recent years. The list below focuses on the most used modern OCR engines and services. Another overview can be found on Wikipedia (https://en.wikipedia.org/wiki/Comparison_of_optical_character_recognition_software)
AWS from Amazon
Microsoft Azure OCR
OCR is part of Cognitive Services: https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/#features
The most modern API is the Read API: https://docs.microsoft.com/en-us/rest/api/computervision/3.1/read/read
Older OCR APIs are being deprecated
Google Vision OCR
Focus on Asian languages
Tegaki engine for Japanese and Chinese: https://www.tegaki.ai/
Originally designed for cursive handprint: https://hyperscience.com
Vidado.ai (formerly Captricity)
Originally designed for cursive handprint: https://vidado.ai/
FineReader SDK: https://www.abbyy.com/ocr-sdk/
IRISOCR SDK: https://irisdatacapture.com/iris-ocr-sdk/
Capture Recognition Engine (formerly RecoStar): https://www.opentext.com/products-and-solutions/partners-and-alliances/opentext-oem-solutions/advanced-recognition-technologies
Kadmos, focus on handprint: https://rerecognition.com/products/
Focus on Chinese: https://www.wintone.com.cn/en/Products/detail118.aspx
Focus on handprint: https://planet-ai.de/applications/document-analysis/solutions/#textlayer
There is a lot to digest here and we recommend revisiting your use cases and then visiting the various websites to get an impression of the offering and pricing. You can also reach out to us if you need some more in-depth consulting.