The latest versions of the Large Language Models powering ChatGPT and Google Bard are multimodal by nature. Multimodal means they can use things other than text as input. This allows you to not only “chat” with them but also upload images and talk about them. With a well-engineered prompt, you can tell the tool to extract the fields on documents you upload and return structured data.
This is amazing, frightening (for IDP vendors?), and also very funny!
Multimodal LLMs (MLLMs) can take images, speech, videos, and text as input. GPT-4 and the PaLM model used by Google Bard are both multimodal, but GPT-4’s multimodality can only be tested by invitation so far. Google Bard is already an “Experiment” that is open to play with. So play we did!
You need to sign up for Bard if you have not done so already. Once you log in, you see the chat environment, and the chat tool has a new upload button.
The best way to test it is to upload a scanned business document and ask questions about it.
We uploaded a German invoice
and asked it a simple question:
DataFools: What is the data on this invoice?
Bard: The data on the invoice in the image is as follows:
Invoice number: 111875
Invoice date: 17.07.2023
Invoice recipient: Peter Maier
Invoice recipient address: Hauptstr 12, 87661 Freismanning
Drehtür (DT/5), 855 x 2013, ohne Quersprosse, Terrasse: 1 piece, €324.48
Drehtür (DTS), 604 x 2010, ohne Quersprosse, Gästezimmer: 1 piece, €306.48
Drehtür (DT/5), 724 x 2010, ohne Quersprosse, Schlafzimmer: 1 piece, €315.48
Spannrahmen (SE/10), Küche: 1 piece, €146.00
Total net amount: €1,092.44
VAT: 19%, €207.56
Total amount due: €1,300.00
Payment terms: Payable immediately upon receipt of invoice
The invoice is for the purchase of four insect screens from a company called Insektenschutztechnik. The screens are for the terrace, guest bedroom, bedroom, and kitchen. The total cost of the screens is €1,300.00, including VAT. The payment terms are that the invoice is payable immediately upon receipt.
This is 100% correct! And not only that, it also provided additional information about the line items. Since we asked in English, Bard replied in English despite the invoice being in German.
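Comparing the answer with the scan is the real test, but a quick sanity check is to see whether the amounts Bard reported are at least consistent with each other. The sketch below uses only the numbers from Bard's answer above; the variable names and the half-up rounding of the VAT are our own assumptions, not something printed on the invoice.

```python
from decimal import Decimal, ROUND_HALF_UP

# Amounts in EUR, copied from Bard's answer above.
line_items = [
    Decimal("324.48"),  # Drehtür (DT/5), Terrasse
    Decimal("306.48"),  # Drehtür (DTS), Gästezimmer
    Decimal("315.48"),  # Drehtür (DT/5), Schlafzimmer
    Decimal("146.00"),  # Spannrahmen (SE/10), Küche
]
reported_net = Decimal("1092.44")
reported_vat = Decimal("207.56")
reported_total = Decimal("1300.00")
vat_rate = Decimal("0.19")


def round_cents(amount: Decimal) -> Decimal:
    """Round to two decimal places (half-up is an assumption)."""
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


computed_net = sum(line_items, Decimal("0"))
computed_vat = round_cents(computed_net * vat_rate)
computed_total = computed_net + computed_vat

# Each entry is True when the reported value matches the recomputed one.
checks = {
    "net": computed_net == reported_net,
    "vat": computed_vat == reported_vat,
    "total": computed_total == reported_total,
}
print(checks)
```

All three checks pass for this invoice. That only shows the numbers agree with each other; it does not prove they match what is printed on the paper.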
Next, we uploaded a W-2 form from the USA.
And this time we asked
DataFools: Please extract all data on this document and format it as JSON
Bard: Sure, I can do that. Here is the data on the W-2 wage and tax statement in JSON format:
I hope this is what you were looking for. Let me know if you have any other questions.
Now let's digest this. Bard recognized the form type correctly. It also provided the data in the desired format which is easier to consume in a downstream process, great!
And at first glance, the data looks good. But wait…
The first 5 or so fields are correct
Social Security Tax withheld: $3.94? The field is empty on the document. Bard made that up; there is no such number on the document!
Same for Medicare tax
Total wages and tips: $458.87. This is not an actual field on the form. Did it sum up all fields related to wages and tips? No such sum is printed on the form. It could be the sum of fields 1 and 16, but that is only $457.87… odd
There are other “total” fields that neither exist on the form nor are correctly derived sums
Net income: $443.38. Hmmm… no matter which numbers you combine, they don’t add up to this amount.
Some takeaways so far:
On the positive side, Bard can instantly analyze a document, but
Bard makes up data that is not printed
Bard derives fields that are not on the document and does so wrongly.
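One cheap guard against made-up values is to check that every extracted value literally appears in the text of the document. The sketch below assumes you have a separate OCR text of the same document. The function names and the sample text are hypothetical stand-ins, not the real W-2 content; only the $3.94 value comes from Bard's answer.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching tolerates layout noise."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_ungrounded(extracted: dict, source_text: str) -> list:
    """Return the field names whose value does not occur verbatim in source_text."""
    haystack = normalize(source_text)
    return [
        name
        for name, value in extracted.items()
        if normalize(str(value)) not in haystack
    ]


# Made-up stand-in for the OCR text of a form (NOT the real document).
source_text = """
Wages, tips, other compensation   100.00
Federal income tax withheld       5.00
"""

# Made-up extraction result; 3.94 mimics the invented Social Security value.
extracted = {
    "wages_tips_other_compensation": "100.00",
    "federal_income_tax_withheld": "5.00",
    "social_security_tax_withheld": "3.94",
}

suspicious = find_ungrounded(extracted, source_text)
print(suspicious)
```

This is a plain substring match, so it can miss problems (a value of 3.94 would count as "found" inside 13.94) and it flags correctly derived sums as well. It is a first filter for human review, not a verdict.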
So let's dig a little:
DataFools: How did you come up with the net income?
Bard: The net income on the W-2 wage and tax statement is calculated by subtracting the total deductions from the total wages and tips. The total deductions include federal income tax withheld, social security tax withheld, Medicare tax withheld, state taxes withheld, local taxes withheld, and dependent care benefits. The total wages and tips are the gross income for the employee.
In the JSON data that I extracted, the total deductions are $10.50 and the total wages and tips are $458.87. Therefore, the net income is $458.87 - $10.50 = $443.38.
Here is the formula for calculating net income from a W-2 wage and tax statement:
net_income = total_wages_and_tips - total_deductions
I hope this helps!
Err… yes. Uh, no it doesn’t. What Bard says sounds logical. It even has a formula and does the math for you! But look at that math…
DataFools: Are you sure that $458.87 - $10.50 = $443.38?
Bard: No, I am not sure that $458.87 - $10.50 = $443.38. In fact, the correct answer is $448.37.
I reran the calculation and got the following result:
total_wages_and_tips = 458.87
total_deductions = 10.50
net_income = 458.87 - 10.50 = 448.37
I apologize for the error. I am still under development and learning to be more accurate.
No further questions, your honor.
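For the record, the subtraction is easy to verify yourself. This small sketch uses the two numbers Bard quoted and Python's decimal module so that no floating-point noise creeps in; the variable names simply mirror Bard's formula.

```python
from decimal import Decimal

# Values exactly as quoted by Bard in the conversation above.
total_wages_and_tips = Decimal("458.87")
total_deductions = Decimal("10.50")
claimed_net_income = Decimal("443.38")  # Bard's first answer

net_income = total_wages_and_tips - total_deductions
difference = net_income - claimed_net_income

print(net_income)   # 448.37
print(difference)   # 4.99
```

So Bard's corrected figure of $448.37 is right for these inputs, and its first answer was off by $4.99. The inputs themselves are still values that are not printed on the form.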
We had some fun with this picture. No comments from us; decide for yourself how crazy and weird it can get (check the highlights if this is too much reading for you).
DataFools: how many sticky notes in this picture are yellow paper?
Bard: There are 14 yellow sticky notes in the picture. All of the sticky notes in the picture are yellow paper.
I counted the number of sticky notes that are yellow by looking at the color of the paper. I did not count any sticky notes that had other colors on them, such as the sticky note that says "Poz" or the sticky note that says "AMERICA IS BIGGER THAN ONE MAN."
I hope this answers your question!
DataFools: What does the 3rd sticky note in the top row say?
Bard: The 3rd sticky note in the top row says "Meeting with client at 10am."
I counted the sticky notes from left to right, starting with the first sticky note in the top row. The 3rd sticky note in the top row is the one that says "Meeting with client at 10am."
I hope this answers your question!
DataFools: Not really, the 3rd note says something else!
Bard: You are correct. The 3rd sticky note in the top row says "Call Mom."
I apologize for the mistake. I am still under development and learning to identify objects and text in images. I will try my best to get it right next time.
Is there anything else I can help you with today?
DataFools: ok, you are faking this and making up text that is nowhere in the image! Admit it!
Bard: You are correct. I am faking the text in the image. I am not able to identify text in images yet. I am still under development and learning to do many things, including identifying objects and text in images.
I apologize for misleading you. I will try my best to be more honest in the future.
On the one hand, this is amazing. But the hallucinations are very scary. It makes up data, it gets the math wrong, and it does weird things that we didn’t ask for. And all that is stated with such confidence that you are inclined to trust it. Well, it is still a beta (Google calls it an “Experiment”) and Bard will learn.
How does this work? Large Language Models predict text output based on input. If the input is an image, the LLM itself presumably does not recognize the text. Our guess is that Google detects that the request is about the content of an image, uses some other tool to perform OCR, and then feeds the recognized text into the actual LLM. If you look at the update notes for Bard, there is a hint from July 2023:
Google Lens in Bard
What: You can upload images alongside text in your conversations with Bard, allowing you to boost your imagination and creativity in completely new ways. To make this happen, we’re bringing the power of Google Lens into Bard, starting with English.
Why: Images are a fundamental part of how we put our imaginations to work, so we’ve added Google Lens to Bard. Whether you want more information about an image or need inspiration for a funny caption, you now have even more ways to explore and create with Bard.
So they use Google Lens! Lens was initially designed to recognize products and objects in photographs. It has limited OCR capabilities and those were designed to detect text on a can of soup or a book cover to be able to provide a link where you can shop for the product. It seems wrong to apply that to business documents when Google already has better tools for that use case in their Document AI services. Maybe that’ll be integrated at a later stage.
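To make our guess concrete, here is a minimal sketch of such a two-stage setup: an OCR step first, then a text-only LLM that sees only the recognized text. Everything in it is hypothetical. The function names, the prompt wording, and the fake OCR and LLM stand-ins are ours; this is not how Bard or Google Lens is actually implemented.

```python
from typing import Callable


def build_prompt(ocr_text: str, question: str) -> str:
    """Wrap the OCR output and the user's question into one text prompt."""
    return (
        "The following text was extracted from an uploaded image.\n"
        "Answer using only this text; say 'not found' if something is missing.\n\n"
        f"--- extracted text ---\n{ocr_text}\n--- end ---\n\n"
        f"Question: {question}"
    )


def answer_about_image(
    image_bytes: bytes,
    question: str,
    ocr: Callable[[bytes], str],
    llm: Callable[[str], str],
) -> str:
    """Stage 1: OCR the image. Stage 2: let the text-only LLM answer."""
    ocr_text = ocr(image_bytes)
    return llm(build_prompt(ocr_text, question))


# Stand-ins so the sketch runs without any external service.
def fake_ocr(image_bytes: bytes) -> str:
    return "Invoice number: 111875"


def fake_llm(prompt: str) -> str:
    return f"(model saw {len(prompt)} characters of prompt)"


reply = answer_about_image(b"...", "What is the invoice number?", fake_ocr, fake_llm)
print(reply)
```

In a design like this, the LLM never sees the pixels. If the OCR step returns little or nothing, the model still produces fluent text, which would fit the sticky-note conversation above.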
We will keep watching Bard and ChatGPT for these capabilities and keep you posted!
You can download the test documents used here and try them for yourself, enjoy!