Extracting data from receipts is essential for businesses, since millions of employees submit their work-related expenses as receipts.
With the latest developments in generative AI and large language models, data extraction has reached roughly human-level accuracy.
Benchmark results
We used Claude 3.5 Sonnet to measure how accurately LLMs extract data from receipts.
Dataset
We divided our dataset into two parts:
- High quality: Scanned, high-resolution receipts. These images are well aligned and have high contrast.
- Low quality: Photographed, low-quality receipts. These images are not properly aligned, and no pre-processing was applied to raise the contrast.
Our aim is to cover real-life cases as much as possible.
We asked for JSON output to make evaluation easier. Our prompt was: “Please output the text on the PDFs in a proper JSON format.”
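As a rough illustration of why JSON output simplifies evaluation, the sketch below turns a model reply into flat key-value pairs that can be compared one by one. The function names `flatten` and `parse_model_reply`, the dotted-path key scheme, and the sample reply are assumptions made for this example, not part of the benchmark itself.

```python
import json


def flatten(value, prefix=""):
    """Flatten nested JSON into {path: string value} pairs (hypothetical helper)."""
    pairs = {}
    if isinstance(value, dict):
        for key, child in value.items():
            pairs.update(flatten(child, f"{prefix}.{key}" if prefix else str(key)))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            pairs.update(flatten(child, f"{prefix}[{index}]"))
    else:
        # Leaf value: store it as text so it can be compared character by character.
        pairs[prefix] = str(value)
    return pairs


def parse_model_reply(reply):
    """Pull the outermost JSON object out of a reply that may contain extra prose."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    return flatten(json.loads(reply[start:end + 1]))


# Made-up reply text, for illustration only.
reply = 'Here is the JSON: {"vendor": "ACME", "items": [{"name": "Tea", "price": 1.5}]}'
print(parse_model_reply(reply))
```

Flattening nested objects and lists this way is only one possible convention; any scheme works as long as the ground truth is flattened the same way.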
Methodology
Results were evaluated at the key-value pair level:
- If a field includes the correct label and value, it is marked as correct.
- If the label or the value differs from the ground truth by even a single character, the pair is marked as incorrect.
Extraction accuracy: Number of correctly extracted key-value pairs divided by the total number of key-value pairs.
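The metric can be sketched in a few lines. This is a minimal version that assumes both the prediction and the ground truth are flat dictionaries of label to value, and that the denominator is the number of ground-truth pairs. The function name `extraction_accuracy` and the sample values are illustrative only.

```python
def extraction_accuracy(predicted, ground_truth):
    """Share of ground-truth key-value pairs that the prediction reproduces exactly."""
    if not ground_truth:
        raise ValueError("ground truth must contain at least one pair")
    correct = sum(
        1
        for label, value in ground_truth.items()
        # A pair counts only if the label exists and the value matches exactly.
        if label in predicted and predicted[label] == value
    )
    return correct / len(ground_truth)


# Made-up example: one character differs in "Total", so that pair is incorrect.
truth = {"Date": "01/08/2024", "Total": "12.40", "Vendor": "ACME"}
pred = {"Date": "01/08/2024", "Total": "12.49", "Vendor": "ACME"}
print(extraction_accuracy(pred, truth))
```

In this sketch, extra pairs that appear only in the prediction are ignored. A stricter variant could count them as errors.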
Next steps
We will add more LLMs (ChatGPT, etc.) to this benchmark to better examine their data extraction abilities.
What is receipt OCR?
Receipt OCR (Optical Character Recognition) is a technology that extracts data from scanned and digital receipts using artificial intelligence and machine learning algorithms. Receipt OCR parses the data, converts it to a structured format and captures details in the receipt, like date, items, and prices.
To increase the accuracy of the OCR, the images should be:
- In higher resolution
- Aligned well
- Free of printing errors
You should be aware of the following issues:
Most receipt OCR tools fail to match an item with its price when a note about the item appears on the next line with no price listed. In that case, tools commonly read the next item’s price as the note’s price.
For example, the OCR output may match “SpcyDlx +PJ” with the price 0.40, which is incorrect. This is especially likely when the image resolution and quality are low and the image is not aligned straight.
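One way to guard against this kind of mismatch in post-processing is to attach a price only when it appears on the same line as the item, and to treat lines without a price as notes on the previous item. The sketch below assumes the OCR text is already split into lines in reading order and that prices look like `5.99`. The function name, the regular expression, and the sample lines are hypothetical.

```python
import re

# Assumed layout: item text, whitespace, then a price with two decimals at line end.
PRICE_AT_END = re.compile(r"^(?P<name>.+?)\s+(?P<price>\d+\.\d{2})$")


def pair_items_with_prices(lines):
    """Group receipt lines into items; lines without a price become notes."""
    items = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        match = PRICE_AT_END.match(text)
        if match:
            items.append({"name": match["name"], "price": match["price"], "notes": []})
        elif items:
            # No price on this line: keep it as a note instead of borrowing
            # the next item's price. Notes before the first item are dropped.
            items[-1]["notes"].append(text)
    return items


# Made-up receipt lines, for illustration only.
for item in pair_items_with_prices(["Burger  5.99", "  no onions", "Extra cheese  0.40"]):
    print(item)
```

This only helps when line order survives OCR. On skewed or low-quality images the lines themselves may be merged or reordered, so the heuristic is a partial safeguard at best.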
We noticed that with low resolution or printing errors (for example, ink that does not fully cover a character), tools have trouble distinguishing similar-looking characters, such as “8” and “9” or “5” and “6”. Confusing “/” with “1” is also common, especially in dates.
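A cheap safeguard for the date case is to check that the extracted string parses as a date and to flag it for review when it does not. The sketch below assumes a `MM/DD/YYYY` layout, and the function name `parse_receipt_date` is hypothetical. Real receipts use many date formats.

```python
from datetime import datetime


def parse_receipt_date(text, fmt="%m/%d/%Y"):
    """Return a date if the text matches the expected format, otherwise None."""
    try:
        return datetime.strptime(text.strip(), fmt).date()
    except ValueError:
        # Typical cause: a "/" misread as "1", which breaks the expected layout.
        return None


print(parse_receipt_date("08/14/2024"))
print(parse_receipt_date("08114/2024"))  # "/" misread as "1"
```

A check like this cannot catch misreads that still produce a valid date, such as “8” read as “9” in the day.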
Receipt OCR typically captures the following fields:
- Receipt number
- Date
- Vendor name
- Subtotal amount
- Tax amount
- Total amount
- Purchased items
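The fields above map naturally onto a small schema. The following is a minimal sketch using Python dataclasses. The class names, attribute names, and the consistency check are assumptions for illustration, not an established format.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List


@dataclass
class LineItem:
    description: str
    price: Decimal


@dataclass
class Receipt:
    receipt_number: str
    date: str
    vendor_name: str
    subtotal: Decimal
    tax: Decimal
    total: Decimal
    items: List[LineItem] = field(default_factory=list)

    def totals_are_consistent(self):
        """Sanity check that assumes no tips, discounts, or rounding adjustments."""
        return self.subtotal + self.tax == self.total


# Made-up values, for illustration only.
receipt = Receipt(
    receipt_number="R-1",
    date="2024-08-14",
    vendor_name="ACME",
    subtotal=Decimal("12.00"),
    tax=Decimal("0.96"),
    total=Decimal("12.96"),
    items=[LineItem("Tea", Decimal("12.00"))],
)
print(receipt.totals_are_consistent())
```

Amounts are stored as `Decimal` rather than `float` so that the comparison of monetary values is exact.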
A receipt OCR workflow typically involves the following steps:
- Receipt scanning: Scan the receipt at high resolution. Scanning yields higher-quality images than photographing receipts.
- Receipt processing: Pre-processing may be needed to increase the contrast and readability of the input image.
- Receipt parsing: Parsing the receipt image is essential for analyzing and capturing data; it breaks the data down into more organized portions.
- Using structured data: Structured data can be used to automate data entry in existing systems such as accounting software. The relevant data serves many use cases, such as tracking transaction dates in financial records and expense management. Automatically extracting data from receipts with LLMs or receipt OCR APIs reduces errors and manual entry, and increases overall efficiency while keeping accuracy high.
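As a small illustration of the last step, extracted receipts can be written to CSV, a format many accounting tools can import. This sketch uses only the Python standard library. The column names in `FIELDS` and the sample row are assumptions, and a real integration would follow the target system's import format.

```python
import csv
import io

# Hypothetical column set; adjust to whatever the target system expects.
FIELDS = ["receipt_number", "date", "vendor_name", "subtotal", "tax", "total"]


def receipts_to_csv(receipts):
    """Serialize a list of receipt dictionaries to CSV text."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops keys (such as line items) that are not columns.
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(receipts)
    return buffer.getvalue()


# Made-up receipt, for illustration only.
rows = [{
    "receipt_number": "R-1", "date": "2024-08-14", "vendor_name": "ACME",
    "subtotal": "12.00", "tax": "0.96", "total": "12.96", "items": [],
}]
print(receipts_to_csv(rows))
```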
FAQ
What are the business benefits of OCR receipt scanning?
OCR technology helps with expense tracking and identifying spending patterns. Line items in the JSON response provide key information, and automatically extracting raw text from documents and invoices saves time. Businesses can fine-tune an OCR engine according to project needs. Business identifiers from different countries, such as the Australian Business Number (ABN) and VAT numbers, can also be extracted from receipts.