Why we use a human-in-the-loop data capture process for our API clients

Why choose a data capture API that uses humans as part of the process? What do those humans actually do? And which approach fits your use case?

Demand for data capture services has exploded over the past few years, and there are now several ways to extract data from bills and invoices.

  • The old way: capture invoice data manually, either directly into your ERP or through an outsourced team that retypes into spreadsheets and imports.
  • The on-premise way: classical OCR with a template per supplier. Costly to set up, your IT team has to maintain the templates, and your staff still validates the OCR results.
  • The cloud way: use a service that captures invoice data and exposes an API to your backend. We have offered an API for invoice data capture for over seven years. The recent wave of AI-only entrants has sparked interest from companies looking to automate part of their accounts payable process.

Choosing the right provider is harder than it looks. Expectations around speed and accuracy are high, and AI marketing has not made the trade-offs easier to see.

What you actually get with AI-only capture

A generic AI engine returns results quickly, but with an error rate that means you typically need to validate every output manually. A dedicated AI engine can hit high accuracy, but only after it has been trained on your suppliers, usually 5 to 15 invoice variations per supplier, with someone labelling each field and sometimes the position of the data on the document. If you have a fixed supplier base and high invoice volume, that can work.

The reality of AI is shades of grey. Underneath the marketing, the engine is an algorithm producing predictions with varying confidence, which means you may still be validating the output. When you talk to an AI vendor, it is worth asking how the engine is trained, how often it is re-trained on fresh data, and what data is actually usable for re-training. Some approaches learn from field-name and result pairs, others need the exact position of the correct value on the page. The picture is more complicated than the pitch.
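To make the "predictions with varying confidence" point concrete, here is a minimal sketch in Python of how a team consuming AI-only output might decide which fields still need a human look. The field names, the confidence scores, and the 0.95 threshold are illustrative assumptions, not values from any particular engine.

```python
from dataclasses import dataclass


@dataclass
class FieldPrediction:
    name: str          # e.g. "invoice_number" (hypothetical field name)
    value: str
    confidence: float  # 0.0 to 1.0, as reported by a hypothetical engine


def fields_needing_review(predictions, threshold=0.95):
    """Return the names of fields whose confidence is below the threshold."""
    return [p.name for p in predictions if p.confidence < threshold]


# Made-up predictions for a single invoice.
predictions = [
    FieldPrediction("invoice_number", "2024-0017", 0.99),
    FieldPrediction("total_amount", "1,250.00", 0.91),
    FieldPrediction("due_date", "2024-03-31", 0.72),
]

to_review = fields_needing_review(predictions)
print(to_review)
```

In this made-up case two of the three fields fall below the threshold and go back to a person, which is the validation work described above. Where the threshold sits is a trade-off you would have to tune yourself.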

How Datamolino is different

We combine automation with a human-in-the-loop service to deliver results that are close to 100% accurate. Most of our customers are accounting firms or software providers who need consistent capture across a wide variety of invoice issuers, the kind of long-tail supplier base where a pure AI approach struggles.

With pure AI, your users validate results so the model can improve over time, and at the start that means validating every document. With Datamolino, our team handles validation as part of the service. When automation falls short, our operators fill in the gaps so you get the result you expect, without having to train anything yourself. High automation is in our interest, because it drives our margins, and it is what makes the experience reliable for you.

What happens when a document arrives

  • Incoming files are matched against fingerprints. If there is a match, the document is processed and the data is returned via API.
  • On multi-page scans, fingerprints detect known pages and split the scan into individual transactions automatically. A 50-page scan can come back as 23 separate invoices. Operators step in for unknown pages, creating fingerprints for new layouts or amending existing ones where needed.
  • When a new supplier shows up, a fingerprint is created on the fly. We do not need prior notice. Onboarding new layouts is part of the standard processing flow.
  • If a document cannot be automated because of input quality, the basic data is captured manually by our operators.
  • Turnaround on automated documents depends on current load, and we can prioritise customer queues when the situation calls for it. For manual processing, including fingerprint training and retyping, turnaround is set based on project needs.
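The routing described in the steps above can be sketched roughly as follows. This is a simplified illustration, not Datamolino's implementation: the `Page` type, the `layout_key` lookup standing in for a fingerprint match, the `readable` flag standing in for input quality, and the route names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Page:
    layout_key: str        # stand-in for whatever a real fingerprint matches on
    readable: bool = True  # False models poor input quality


# Hypothetical fingerprint store: layout key -> supplier name.
FINGERPRINTS = {"acme-v1": "Acme Ltd", "globex-v2": "Globex"}


def route_page(page, fingerprints):
    """Decide which path a single page takes."""
    if not page.readable:
        return "manual_capture"      # operators capture the basic data
    if page.layout_key in fingerprints:
        return "automated"           # known layout, data returned via API
    return "create_fingerprint"      # an operator adds the new layout


def process_scan(pages, fingerprints):
    """Route every page of a scan and group page indexes by path."""
    routed = {"automated": [], "create_fingerprint": [], "manual_capture": []}
    for index, page in enumerate(pages):
        routed[route_page(page, fingerprints)].append(index)
    return routed


scan = [
    Page("acme-v1"),
    Page("newco-v1"),                  # supplier not seen before
    Page("globex-v2"),
    Page("acme-v1", readable=False),   # known supplier, unusable scan
]
result = process_scan(scan, FINGERPRINTS)
print(result)

# Once an operator has created the fingerprint, the same layout is
# handled automatically the next time it arrives.
learned = dict(FINGERPRINTS, **{"newco-v1": "NewCo"})
print(route_page(Page("newco-v1"), learned))
```

The point of the sketch is the last two lines: the unknown layout needs an operator only once, and no action is needed from the customer for that to happen.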

Why customers move to this model

The teams that pick Datamolino usually want to simplify their back office. They keep a small core team that oversees quality and makes sure the captured data merges correctly with their master data. When their two-way or three-way matching checks fail, they flag it to us and we improve the fingerprint for that supplier. The benefits they tell us about:

  • No templates to create or maintain.
  • No need to validate results or capture position data to train an AI engine.
  • No on-premise solution to keep alive.
  • No large team to hire, train, and supervise. A small specialist team is enough to keep the know-how inside the company.
  • Automation combined with human input: clients tend to describe Datamolino as a co-worker that is never sick.

How the API works in practice

You can use Datamolino as a pure API or as a combination of our web UI and API access. The API supports webhooks, so fetching results is fast. In the simplest setup, you send documents through email and the capture results land on your webhook.
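As an illustration of the receiving side of that setup, here is a minimal webhook receiver using only the Python standard library. The payload keys (`document_id`, `supplier`, `total`), the port, and the response codes are assumptions made for this sketch; the actual payload structure is defined by the API documentation, not by this example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_capture_result(body):
    """Parse a webhook body and pull out the fields this sketch cares about.

    The keys are hypothetical; check the real API documentation for the
    actual payload structure.
    """
    payload = json.loads(body)
    return {
        "document_id": payload["document_id"],
        "supplier": payload.get("supplier"),
        "total": payload.get("total"),
    }


class CaptureWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            result = parse_capture_result(self.rfile.read(length))
        except (ValueError, KeyError, TypeError):
            self.send_response(400)   # malformed or unexpected payload
            self.end_headers()
            return
        print("captured:", result)    # hand off to your backend here
        self.send_response(204)       # acknowledge with no response body
        self.end_headers()


def serve(port=8000):
    """Run the receiver; call this from your own entry point."""
    HTTPServer(("", port), CaptureWebhookHandler).serve_forever()
```

Keeping the parsing in its own function means it can be tested without starting a server. In production you would normally put this behind your usual web framework and confirm that the request really comes from the provider.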

If you want to dig deeper into how the API plugs in, see the Invoice OCR API page. For line-level extraction across goods received notes, statements, and similar documents, the line item data extraction overview walks through the detail. The feature overview covers the rest of the platform.

Try Datamolino free and process 100 documents at no cost.

Frequently asked questions 

What is human-in-the-loop data capture?

Human-in-the-loop data capture is a process where automation handles what it can, and trained operators step in for everything else. The aim is consistent output, regardless of how messy or new the document is. The customer does not have to validate or label results to keep accuracy up. That responsibility sits with the provider.

How is Datamolino different from pure AI capture?

Pure AI vendors return predictions and rely on your team to validate them so the model can improve. Until the model has seen enough volume, you are validating every document. Datamolino delivers the validated result up front, because our operators close the gaps that automation leaves. You receive captured data, not raw predictions to review.

Do I need to validate results coming out of Datamolino?

No. Most clients keep a small quality oversight team that runs their own matching checks against master data. If a supplier shows a recurring issue, they let us know and we adjust the fingerprint for that supplier. The day-to-day validation of capture output is not something you take on.

What is the turnaround time?

Turnaround for automated documents depends on current server load and is fast in most cases. We can prioritise customer queues when needed. For documents that need manual processing, including fingerprint training for new layouts or retyping where input quality is poor, turnaround is agreed up front based on project requirements.