Optical Character Recognition (OCR) is a technology that allows computers to interpret pixels in image files as characters and words. This process makes the pixel data useful for higher-level tasks such as data extraction, content-based classification, sentiment analysis, creation of searchable PDFs, and so on.
For these tasks, you need to have the text, but a scanned image is just an array of pixels. OCR is a set of technologies that deal with interpreting these pixels and providing more useful text data.
When dealing with business documents, OCR needs to handle a variety of writing and printing styles. The topmost distinction here is machine-printed versus handwritten. The technology applied to recognize machine print can differ vastly from the technology for handprint. However, that depends on what kind of handwriting you are dealing with. If the handwriting is constrained (which means all the characters are separated, not connected), the techniques aren’t so different from those for recognizing machine print. If it is unconstrained, also known as cursive, the game changes because you can no longer isolate individual characters.
Let’s look at each of these problems one at a time and see how OCR and related technology work. I will ignore the problem of grayscale and color images and assume all our business documents are black and white. In real life, you would either use image perfection software to convert color to black and white or, depending on the OCR software, let it handle that.
The classic approach to OCR for machine-printed text consists of two steps:
Segment the image into small snippets of individual characters
Classify each snippet to find the closest matching computer character
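To make step 1 concrete, here is a minimal sketch of segmentation by grouping black pixels that touch each other. It assumes a tiny black-and-white image given as rows of text ("#" for black, "." for white). The function name segment_characters and the sample page are made up for illustration and are not taken from any OCR product.

```python
from collections import deque


def segment_characters(image):
    """Return bounding boxes (top, left, bottom, right) of connected black regions.

    `image` is a list of equal-length strings: '#' is black, '.' is white.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != "#" or seen[r][c]:
                continue
            # Breadth-first flood fill over 8-connected black pixels.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == "#" and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    # Sort left to right so the boxes follow reading order on a single line.
    return sorted(boxes, key=lambda box: box[1])


# Toy page containing an "H" and an "I" separated by a white column.
page = [
    "..........",
    ".#..#.###.",
    ".#..#..#..",
    ".####..#..",
    ".#..#..#..",
    ".#..#.###.",
    "..........",
]
boxes = segment_characters(page)
print(boxes)
```

On this toy page the function finds two boxes, one per character. A character made of several separate parts, such as an “i” with its dot, would come back as several boxes, so this sketch is only the starting point of a real segmentation step.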
Step one works here because the text is constrained, which means the characters are not connected. This allows a developer to use imaging algorithms to find connected black pixels, which form characters, and each character is surrounded completely by white pixels. Once the algorithm isolates an “m” as a small image, step 2 sends that image to a neural network that was trained on thousands of different “m”s and all other characters, and the network returns a score for each possible computer character. The higher the score for “m”, the more likely our snippet is an actual “m”. You may already know that you need to train a neural network on thousands of examples for a single character. Why so many? Well, there is no one “m” when you scan a document. In fact, “m”s look very different depending on the font, but mainly depending on the little artifacts caused by scanning. If it always looked like this, life would be easy:
Here are some examples of real “m”s from scanned documents:
While the many appearances of a character are the main challenge for step 2, step 1’s challenge is connected characters. Didn’t I just say they are unconnected? Well, in theory. In practice, they are often connected when scanned. Here are some examples:
You can probably read most of these examples. But a computer has a really hard time separating connected characters. For example, is this just an “m” or is this a connected “r” and “n”?
Additional steps make the results better. There are imaging algorithms available to developers to help with the segmentation of characters. Another way to optimize the result is closer to how we do it as humans. How do you know if it is an “r” and an “n” or just an “m”? You cannot tell it from the isolated image but you can from context. If you see the entire word or sentence, you know if it is an “m” or not. So OCR engine vendors apply a similar approach. They do their best with steps 1 and 2 and then might end up with multiple possible words based on the individual character scores. But how likely is it that all these words are in the Oxford English Dictionary, assuming our documents are English? The answer is: very low. So if one of the possible recognized words is in a dictionary, then the engine prefers that.
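The dictionary idea can be sketched in a few lines. This is only an illustration under stated assumptions: the per-character candidates and their scores are invented, the word score is simply the product of the character scores, and the names candidate_words and best_word are hypothetical. Real engines are likely to use more refined language models.

```python
from itertools import product


def candidate_words(char_candidates):
    """Yield (word, score) for every combination of per-character candidates.

    The word score is the product of the character scores (an assumption
    made for this sketch).
    """
    for combo in product(*char_candidates):
        word = "".join(ch for ch, _ in combo)
        score = 1.0
        for _, char_score in combo:
            score *= char_score
        yield word, score


def best_word(char_candidates, dictionary):
    """Prefer the highest-scoring candidate that is a dictionary word."""
    ranked = sorted(candidate_words(char_candidates),
                    key=lambda pair: pair[1], reverse=True)
    for word, _ in ranked:
        if word.lower() in dictionary:
            return word
    # No dictionary hit: fall back to the raw best guess.
    return ranked[0][0]


# Invented classifier output for five segmented snippets.
snippets = [
    [("c", 0.9)],
    [("1", 0.5), ("l", 0.4)],
    [("e", 0.8), ("c", 0.15)],
    [("a", 0.9)],
    [("r", 0.7), ("n", 0.2)],
]
dictionary = {"clear", "clean"}

print(best_word(snippets, dictionary))
print(best_word(snippets, set()))
```

Judged by character scores alone, the best guess is “c1ear”, because the digit “1” scored slightly higher than the letter “l”. With the dictionary in place, the engine settles on “clear” instead.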
Another typical example is the number zero (“0”) and the capital character “O”. An OCR engine can easily confuse them. But if unambiguous digits surround the symbol in question, the engine may decide it is a zero, while if the context is alphabetic characters, the symbol is more likely an “O”.
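Here is a rough sketch of that context check. The window of two characters on each side and the tie-breaking rule are assumptions chosen for illustration, and resolve_zero_or_o is a hypothetical helper, not a function of any real OCR engine.

```python
def resolve_zero_or_o(text, index):
    """Decide whether the ambiguous symbol at `index` is '0' or 'O'.

    Looks at up to two neighbours on each side and counts digits versus
    letters. Ties go to the letter, which is an arbitrary choice.
    """
    neighbours = text[max(0, index - 2):index] + text[index + 1:index + 3]
    digits = sum(ch.isdigit() for ch in neighbours)
    letters = sum(ch.isalpha() for ch in neighbours)
    return "0" if digits > letters else "O"


print(resolve_zero_or_o("2O19", 1))    # neighbours are "2", "1", "9"
print(resolve_zero_or_o("BOSTON", 1))  # neighbours are "B", "S", "T"
```

The first call decides on a zero because all neighbours are digits. The second decides on the letter because all neighbours are letters.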
The classic approach for constrained handprint recognition is similar to machine print. However, this discipline is historically called ICR (Intelligent Character Recognition), as it was invented later than OCR. At the time, it took OCR to the next level, as it could now also recognize handwriting. Hence the name “Intelligent” Character Recognition. The approach is the same: segment the image into character snippets, then classify the characters with a neural network. Only this time, the neural network is trained on more and different samples because, as you can imagine, handwriting creates a lot more variants for each character, as we all write differently.
Unconstrained handwriting recognition (also known as cursive) is the ultimate OCR problem. This is where only a few companies really excel. As you can see from the previous section, step 1 of the approach for constrained text cannot be applied here: it is nearly impossible to segment the characters before you know which loop or line belongs to which character, so you are in a chicken-and-egg situation.
Enter IWR (Intelligent Word Recognition). This type of ICR works on entire words because, at least in Western languages, words are still somewhat separate. There are two approaches to recognizing unconstrained handwriting. Both usually use a neural network, but the network is trained differently in each.
A neural network is trained by showing it labeled examples. A labeled example is one that has been manually tagged with the desired outcome. In the first approach, the example is not the image of the word; instead, it is a bunch of handcrafted “features” (as the data scientists call them). Here are some example features of the word “the”: it has only one enclosed area (in the “e”), 2 horizontal bars (in the “e” and the “t”), and a bunch of vertical lines, upper partial loops, lower partial loops, etc. All these properties are the features of the word “the”.
This is the rough workflow used in this approach:
Take the word image snippet
Count/measure the features and put them in a vector (a list of the word’s properties, if you will)
Label the vector (“this list of features represents an example of ‘the’”)
Feed all the vectors into the neural network for training
Once the neural network model is trained, it can be applied to the feature vectors of new word images that you want to classify.
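The workflow above can be sketched end to end on toy data. Several things here are assumptions made for illustration: the images are tiny text grids ("#" for black, "." for white), single glyphs stand in for word snippets, only two crude features are measured, and a simple nearest-neighbour lookup stands in for the neural network so that the example stays short. All function names are hypothetical.

```python
def count_enclosed_areas(image):
    """Count white regions that are completely surrounded by black pixels."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    holes = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != "." or seen[r][c]:
                continue
            # Flood fill one white region (4-connected) and note whether it
            # reaches the image border; if it does not, it is enclosed.
            stack, touches_border = [(r, c)], False
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                if y in (0, rows - 1) or x in (0, cols - 1):
                    touches_border = True
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] == "." and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            if not touches_border:
                holes += 1
    return holes


def count_horizontal_bars(image, min_length=3):
    """Count rows that contain a run of at least `min_length` black pixels."""
    return sum("#" * min_length in row for row in image)


def feature_vector(image):
    """Measure the features and put them in a vector."""
    return (count_enclosed_areas(image), count_horizontal_bars(image))


def classify(vector, training_set):
    """Return the label of the closest training vector (nearest neighbour)."""
    def distance(example):
        features, _ = example
        return sum((a - b) ** 2 for a, b in zip(vector, features))
    return min(training_set, key=distance)[1]


# Labeled toy examples: (image, label).
labeled_images = [
    ([".###.", ".#.#.", ".###."], "o"),
    (["..#..", "..#..", "..#.."], "l"),
    ([".###.", ".#.#.", ".###.", ".#.#.", ".###."], "B"),
]
# "Training" here just means storing the labeled feature vectors.
training_set = [(feature_vector(img), label) for img, label in labeled_images]

# A new, slightly larger "o" that was not part of the training data.
new_image = [".####.", ".#..#.", ".#..#.", ".####."]
print(feature_vector(new_image), classify(feature_vector(new_image), training_set))
```

The new image is a different size from the training “o”, but it produces the same feature vector (one enclosed area, two horizontal bars), so it gets the same label. That is the point of handcrafted features: they describe the shape rather than the exact pixels.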
A variation of this approach is training the neural network with features of just individual characters, but that requires a more sophisticated segmentation at runtime.
There are two disadvantages of this approach:
The developers need to write a lot of image analysis code to derive the features from the word image. The better that works, the better the accuracy will be.
You need a lot of labeled samples, though not as many as for the next approach.
With modern neural network architectures like Convolutional Neural Networks (CNNs), it has become feasible to train directly on the image snippet for each word. No need to derive the features based on the lines and circles and so on. Basically no need to explain what the properties of the word are. Let the CNN figure that out on its own. Just show the CNN enough examples of every word and let it learn to tell which word it is.
This approach requires by far the largest set of labeled samples. I am talking about many millions, if not billions, of samples. Why so many? With OCR and ICR for constrained text, you only need to teach the AI about 1,000 samples per character (ballpark, of course; the more the better). That’s about 60,000 for all characters (upper case and lower case) plus another 50,000 for all special characters, but it is manageable. For this approach, though, you need a few hundred examples for every word of the language. English has about 500,000 words (add to that possible inflections and plurals), so you can see how you need many millions of samples. And someone has to label them!
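The back-of-the-envelope numbers can be written out as a short calculation. Two values are my assumptions rather than figures from the text: digits are counted along with the letters to land near 60,000, and “a few hundred” is taken to be 300.

```python
# Constrained text: roughly 1,000 samples per character.
SAMPLES_PER_CHARACTER = 1_000
letters_and_digits = 26 * 2 + 10   # upper case, lower case, digits (assumption)
character_samples = letters_and_digits * SAMPLES_PER_CHARACTER
special_character_samples = 50_000  # figure quoted in the text
constrained_total = character_samples + special_character_samples

# Whole-word training: a few hundred samples for every word.
ENGLISH_WORDS = 500_000
SAMPLES_PER_WORD = 300              # assumption for "a few hundred"
word_level_total = ENGLISH_WORDS * SAMPLES_PER_WORD

print(f"constrained text: {constrained_total:,} samples")
print(f"whole words:      {word_level_total:,} samples")
print(f"ratio:            about {word_level_total // constrained_total:,}x")
```

Under these assumptions, the character-level set needs about 112,000 labeled samples while the word-level set needs 150 million, before counting any inflected forms.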
This approach is therefore limited to companies that have a labeled sample set like that or that can afford to create or buy one. But it seems to get the highest levels of accuracy.