Why we use a human-in-the-loop data capture process for our API clients

December 16, 2022

Why choose a data capture API that uses humans as part of the data capture process? What do these human operators actually do? And which approach suits your use case?

In recent years there has been an explosion of demand for data capture services. There are various approaches to capturing data from bills and invoices:

  • The old way is to capture invoice data manually, either directly into the ERP software or by having an outsourced team capture it into spreadsheets for import.
  • You can also use an on-premise solution that you set up and maintain. This is classical OCR technology, where you set up templates for each of your suppliers. It is costly to set up, and your IT team has to maintain the templates and keep them up to date. Your staff then validates the OCR results.
  • You can now use cloud services to capture invoice data and integrate the capture results with your backend process through an API. At Datamolino we started offering an API for invoice data capture 7 years ago. In recent years many new players offering AI data capture have entered the market, which has sparked the interest of many companies. Many are exploring how to implement an API that will let them automate part of their Accounts Payable process.

It is quite hard to choose a data capture provider that fits your needs. Many people are looking to AI solutions for an easy fix. There are high expectations around speed of capture and accuracy.

It turns out that an AI with a generic engine gives quick results, but the error rate is still high enough that you may need to validate each data capture output manually. With a dedicated AI engine you can achieve high accuracy, but the engine needs to be trained on your suppliers. This means that you need to annotate between 5 and 15 invoice variations for each supplier before the training kicks in (annotating or validating means giving the AI engine the labels and results for each field, and sometimes also the position of the data on the invoice). For companies with a fixed number of suppliers and high invoice volume this may be a good fit.

The reality of AI is sometimes many shades of grey. At the core the AI is an algorithm that gives you predictions with varying confidence. This may mean that you still need to validate the data capture results that the AI predicts.
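
To make the idea of predictions with varying confidence concrete, here is a minimal Python sketch of how a client might decide which fields to check by hand. The `FieldPrediction` type, the field names, and the 0.9 threshold are all illustrative assumptions, not part of any particular vendor's API.

```python
from dataclasses import dataclass


@dataclass
class FieldPrediction:
    name: str          # field label, e.g. "total_amount"
    value: str         # value the engine extracted
    confidence: float  # engine's confidence, from 0.0 to 1.0


def fields_needing_review(predictions, threshold=0.9):
    """Return the names of fields whose confidence is below the threshold."""
    return [p.name for p in predictions if p.confidence < threshold]


# Hypothetical output of an AI engine for one invoice.
predictions = [
    FieldPrediction("invoice_number", "INV-1042", 0.99),
    FieldPrediction("total_amount", "1250.00", 0.72),
    FieldPrediction("due_date", "2023-01-15", 0.95),
]

print(fields_needing_review(predictions))  # prints ['total_amount']
```

The lower you set the threshold, the fewer fields go to manual review, but the more errors slip through unchecked. That trade-off is what the client ends up managing in a purely AI-based setup.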

When choosing an AI vendor, it is good to ask how the engine is trained, how often the engine you will be using is re-trained on fresh data, and what data is valid for re-training. Some AI approaches learn only from the combination of field name and result, while others also require the exact position of the correct result to be marked on the document so the AI can learn from that. This is of course a very simplified view, but you get the picture.

So how is Datamolino different from the AI competitors that have recently sprung up?

Datamolino uses a combination of tools and approaches to get near-100% accurate results for our customers. Our customers are mostly accounting firms or software providers (typically with a SaaS product) that need high-quality invoice data capture across a wide variety of invoice issuers.

The typical approach of a purely AI-based API is that your users validate the results, and over time the results improve or become automated. This means that at the beginning you may need to validate every single document until the AI has seen enough repetitions to be re-trained and the capture results improve.

Our approach combines automation with a human-in-the-loop service to ensure high data quality. The human service part is a differentiator compared to “pure AI”, where the client is expected to validate the results “so the system can learn” or provide “position data” in order to improve the accuracy over time. At Datamolino this is part of our service. When automation fails, our humans fill in the blanks to give you the results that you expect. This means that high automation rates are in our best interest. And the best part is that our users get the data capture results without the manual validation required to train an AI model.

To give you a bit more detail, the data capture process on our side works as follows:

  • Incoming files are matched with “fingerprints”. If there is a match, the document is processed and the data is returned via the API.
  • On multi-page scans the fingerprint technology can also automatically detect “known” pages and split the scan into individual transactions with the correct number of pages (say, a 50-page scan split into 23 invoices). This is aided by human operators. In the case of “unknown” pages, the quality assurance team creates fingerprints for new layouts or manually amends docs where necessary.
  • If a new supplier enters our processing, a fingerprint is created on the fly. We do not require prior notification. Automation of new layouts is baked into the processing.
  • If a document cannot be automated (due to input quality), basic data is captured manually by Datamolino operators.
  • It is in our interest to automate as much as possible, because that drives our profitability. Also, the more we are able to automate, the better the resulting user experience.
  • For automated docs, the turnaround time depends on current server load. We can prioritise customer queues if required.
  • For the manual process (fingerprint training for automation, and manual retyping where required), the turnaround time can be adjusted based on project needs.
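
The routing described in the list above can be sketched in a few lines of Python. This is only an illustration of the decision flow, not how Datamolino implements it: `Document`, `Pipeline`, the `layout_key` string standing in for a fingerprint, and the `readable` flag standing in for input quality are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    layout_key: str        # hypothetical stand-in for whatever identifies a layout
    readable: bool = True  # False models input quality too poor to automate


@dataclass
class Pipeline:
    known_layouts: set = field(default_factory=set)

    def route(self, doc):
        """Decide how a document is handled, following the steps above."""
        if not doc.readable:
            # Cannot be automated: operators capture basic data by hand.
            return "manual_capture"
        if doc.layout_key in self.known_layouts:
            # Fingerprint match: processed automatically.
            return "automated"
        # New layout: a fingerprint is created on the fly,
        # so the next document with this layout will match.
        self.known_layouts.add(doc.layout_key)
        return "fingerprint_created"


pipeline = Pipeline(known_layouts={"supplier-a"})
print(pipeline.route(Document("supplier-a")))                  # prints automated
print(pipeline.route(Document("supplier-b")))                  # prints fingerprint_created
print(pipeline.route(Document("supplier-b")))                  # prints automated
print(pipeline.route(Document("supplier-c", readable=False)))  # prints manual_capture
```

The point of the sketch is the second call: the first document from a new supplier costs human effort, and every later document with the same layout takes the automated path without the client doing anything.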

Our customers usually choose Datamolino to simplify their back office operations. They keep a core team that oversees the quality of the process and makes sure the captured data is correctly merged with master data. Where double-matching or triple-matching checks on their side fail, they notify us so that we can improve the fingerprints for the affected supplier. The benefits they quote are:

  • No need to create or manage templates.
  • No need to validate results or capture position data in order to train an AI engine.
  • No need to maintain an on-premise solution.
  • No need to manage a large team or to hire, train and oversee the quality of human labor. You only keep a core team that is highly specialised and keeps the know-how inside the company.
  • Due to the combination of automation and human approach, clients consider Datamolino to be a co-worker that is never sick.

The Datamolino API can be consumed as a pure API or as a combination of our web UI and API access. The Datamolino API supports webhooks, so fetching the results is easy and fast. In the simplest form of integration, you can send documents by email, and the data capture results are delivered to your webhook.
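
As a rough idea of what the receiving side of a webhook can look like, here is a minimal sketch using only the Python standard library. The payload shape (`document_id`, `fields`) is a made-up assumption for illustration and is not taken from the Datamolino API documentation, so check the actual docs for the real format and for how requests should be authenticated.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_capture_result(body: bytes) -> dict:
    """Extract the parts we care about from a (hypothetical) webhook payload."""
    payload = json.loads(body)
    return {
        "document_id": payload["document_id"],  # assumed field name
        "fields": payload.get("fields", {}),    # assumed field name
    }


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            result = parse_capture_result(self.rfile.read(length))
        except (ValueError, KeyError, TypeError):
            # Malformed JSON or missing document_id: reject the request.
            self.send_response(400)
            self.end_headers()
            return
        print("captured:", result)  # replace with your own processing
        self.send_response(204)     # acknowledge with no response body
        self.end_headers()


def run(port: int = 8000) -> None:
    """Start a blocking HTTP server that receives webhook calls."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Calling `run()` starts the listener. In production you would put this behind HTTPS and verify that each request really comes from your data capture provider before trusting its contents.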