Leading Financial Services Provider

Intelligent document processing

Automated data extraction from complex financial documents: OCR, layout-aware models, and human review queues feeding downstream systems.

2024 · 3 months · 5 engineers · Completed

85% faster processing

99.2% accuracy rate

1M+ documents processed

The challenge

A financial services provider had more than 200 employees doing manual data entry on contracts, invoices, statements, and regulatory filings. Manual extraction ran at roughly a 5% error rate, which in a regulated business is not a quality problem; it is a compliance problem. Turnaround on critical documents was 48 to 72 hours, and the process could not absorb quarter-end peaks: volume spiked, backlogs grew, and the only lever was hiring more people to type faster.

The brief was direct. Take documents in whatever form they arrive, turn them into structured data the downstream financial systems can trust, and do it faster and more accurately than the manual operation.

What we built

An end-to-end pipeline, built in Python on FastAPI, PostgreSQL, MinIO for document storage, and RabbitMQ as the queue backbone between stages. Each stage in the diagram above is an independent service consuming from a queue, so a slow OCR job never blocks ingestion and the system degrades gracefully under load.
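To make the decoupling concrete, here is a minimal sketch of two stages connected by queues. It is not the production code: Python's standard-library `queue.Queue` stands in for RabbitMQ, and the function names (`ingest`, `ocr_worker`, `run_pipeline`) are hypothetical. The point is only that ingestion enqueues and returns, while a slow OCR stage drains its queue at its own pace.

```python
import queue
import threading
import time


def ingest(doc_ids, ocr_queue):
    """Ingestion only enqueues work; it never waits for OCR to finish."""
    for doc_id in doc_ids:
        ocr_queue.put(doc_id)
    ocr_queue.put(None)  # sentinel: no more documents


def ocr_worker(ocr_queue, extract_queue, delay_s):
    """A deliberately slow stage that consumes from its own queue."""
    while True:
        doc_id = ocr_queue.get()
        if doc_id is None:
            extract_queue.put(None)  # pass the sentinel downstream
            return
        time.sleep(delay_s)  # simulate a slow OCR job
        extract_queue.put(f"{doc_id}:text")


def run_pipeline(doc_ids, ocr_delay_s=0.01):
    # Assumption: in-process queues stand in for the broker between stages.
    ocr_queue = queue.Queue()
    extract_queue = queue.Queue()
    worker = threading.Thread(
        target=ocr_worker, args=(ocr_queue, extract_queue, ocr_delay_s)
    )
    worker.start()
    ingest(doc_ids, ocr_queue)  # returns as soon as everything is queued
    worker.join()

    results = []
    while (item := extract_queue.get()) is not None:
        results.append(item)
    return results
```

With a real broker the same shape holds, except that each stage runs as its own process and the backlog survives restarts.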

Ingestion and classification

Documents arrive through three channels: email, direct upload, and API. Everything converges on a classification model that sorts incoming documents into 30+ types and routes them accordingly. We put classification first deliberately. Knowing the document type up front lets the extraction stage apply the right expectations about fields and layout, and it lets us route unsupported documents to a holding queue instead of producing garbage downstream.
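The routing decision can be sketched in a few lines. The type names, the queue naming scheme, and the confidence cut-off below are all hypothetical, and sending low-confidence classifications to the holding queue is our assumption for the sketch rather than a documented rule of the system.

```python
from dataclasses import dataclass

# Hypothetical subset of the 30+ supported document types.
SUPPORTED_TYPES = {"invoice", "contract", "statement", "regulatory_filing"}
# Hypothetical cut-off; an assumption for illustration only.
MIN_CLASSIFIER_CONFIDENCE = 0.80


@dataclass
class Classification:
    doc_type: str
    confidence: float


def route(classification):
    """Return the name of the queue the document should go to next."""
    if classification.doc_type not in SUPPORTED_TYPES:
        return "holding"  # unsupported type: do not attempt extraction
    if classification.confidence < MIN_CLASSIFIER_CONFIDENCE:
        return "holding"  # assumption: unsure classifications also wait
    return f"extract.{classification.doc_type}"
```

For example, `route(Classification("invoice", 0.97))` yields `"extract.invoice"`, while an unrecognized type goes to `"holding"` regardless of confidence.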

OCR and extraction, including tables

The extraction stage combines OCR with NLP models (spaCy and BERT-based) for context-aware field extraction, plus dedicated table-structure understanding, because in financial documents the data you actually want usually lives in tables. We made extraction template-free: the models learn document structure rather than matching against predefined layouts. The reason is maintenance cost. With 30+ document types and vendors changing their invoice formats whenever they like, a template library becomes a treadmill. Template-free extraction means a new layout degrades to lower confidence rather than failing outright.

Validation and the human review loop

Nothing reaches downstream systems on model output alone. Extracted data is cross-referenced against existing databases and checked against business rules, and every field carries a confidence score. High-confidence documents go straight through to the financial systems: ERP, general ledger, and compliance reporting. Low-confidence documents land in a human review queue for exception handling. The reviewers’ corrections feed back into model retraining, so the system improves on exactly the documents it currently handles worst. The confidence threshold is the safety valve: when in doubt, a human decides.
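A minimal sketch of that gate, under stated assumptions: the threshold value, the field names, the single business rule (net plus tax must equal total), and the vendor lookup are all hypothetical stand-ins for the client's actual rules and reference databases.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation

REVIEW_THRESHOLD = 0.95  # hypothetical value, for illustration only


@dataclass
class Field:
    name: str
    value: str
    confidence: float


def check_invoice_rules(fields):
    """Hypothetical business rule: net + tax must equal total."""
    by_name = {f.name: f.value for f in fields}
    try:
        net, tax, total = (Decimal(by_name[k]) for k in ("net", "tax", "total"))
    except (KeyError, InvalidOperation):
        return ["missing or non-numeric amount"]
    return [] if net + tax == total else ["net + tax != total"]


def route_document(fields, known_vendor_ids):
    """Straight-through only if every check passes; otherwise human review."""
    reasons = [
        f"low confidence: {f.name}"
        for f in fields
        if f.confidence < REVIEW_THRESHOLD
    ]
    reasons += check_invoice_rules(fields)
    # Stand-in for cross-referencing an existing database.
    vendor = next((f.value for f in fields if f.name == "vendor_id"), None)
    if vendor not in known_vendor_ids:
        reasons.append("unknown vendor")
    return ("human_review", reasons) if reasons else ("straight_through", [])
```

Returning the reasons alongside the decision matters in practice: the reviewer sees why a document was held, and the same list can be written to the audit trail.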

How it was delivered

The engagement ran 12 weeks with a team of five.

Weeks 1 to 2 were discovery and design. We analyzed more than 100,000 sample documents from the client’s archive, identified the 30+ document types and their extraction requirements, and designed the validation rules and workflows with the operations team.

Weeks 3 to 8 were core development: the OCR pipeline with preprocessing, the classification models, and extraction tuned per document type.

Weeks 9 to 12 covered integration with the client’s existing document management systems, security and compliance work, and the part we consider non-negotiable for this kind of system: a parallel run. The pipeline processed live documents alongside the manual operation, and we compared outputs until accuracy was demonstrated on real production traffic, not a test set. Only then did the client start retiring manual steps.
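The comparison at the heart of a parallel run can be sketched as a field-level agreement check. The data shapes (dictionaries keyed by document ID, then field name), the function names, and the normalization are assumptions for illustration, not the tooling used on the engagement.

```python
def normalize(value):
    """Ignore case and whitespace differences; keep missing values as None."""
    if value is None:
        return None
    return " ".join(str(value).split()).casefold()


def compare_parallel_run(pipeline_docs, manual_docs):
    """Compare field values for documents processed by both paths.

    Returns (agreement_rate, mismatches), where each mismatch is a
    (doc_id, field, pipeline_value, manual_value) tuple.
    """
    compared = agreed = 0
    mismatches = []
    for doc_id in sorted(pipeline_docs.keys() & manual_docs.keys()):
        auto, manual = pipeline_docs[doc_id], manual_docs[doc_id]
        for name in sorted(auto.keys() | manual.keys()):
            compared += 1
            if normalize(auto.get(name)) == normalize(manual.get(name)):
                agreed += 1
            else:
                mismatches.append((doc_id, name, auto.get(name), manual.get(name)))
    rate = agreed / compared if compared else 0.0
    return rate, mismatches
```

A mismatch is not automatically a pipeline error. With manual entry running at roughly a 5% error rate, each disagreement has to be adjudicated against the source document before it counts for or against either side.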

What shipped

Processes 50,000+ documents daily across 30+ document types
Reduced manual data entry by 90%
Template-free extraction that adapts to new document layouts without per-template engineering
Confidence scoring on every extracted field, with automatic routing between straight-through processing and human review
Validation layer that cross-references existing databases and enforces business rules before data reaches the ERP, general ledger, or compliance systems
A correction-to-retraining loop that turns reviewer fixes into model improvements
API-first design with a full audit trail of every processing step

The system has been in production since 2024 and has processed more than a million documents. The review queue and retraining loop are why it stuck: instead of accuracy decaying as document formats drift, the operations team’s everyday corrections keep the models current, and the audit trail gives compliance a complete record that the old manual process never had.
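As a rough illustration of how a reviewer correction can serve both purposes at once, the sketch below appends an audit event and emits training examples for only the fields the reviewer changed. The record shapes and names (`AuditEvent`, `record_review`) are hypothetical, and a real system would persist both to durable storage rather than an in-memory list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditEvent:
    doc_id: str
    stage: str
    detail: dict
    # UTC timestamp recorded when the event is created.
    at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


def record_review(doc_id, extracted, corrected, audit_log):
    """Log the review and return training examples for changed fields only."""
    changes = {
        name: reviewer_value
        for name, reviewer_value in corrected.items()
        if extracted.get(name) != reviewer_value
    }
    audit_log.append(
        AuditEvent(doc_id, "human_review", {"changed_fields": sorted(changes)})
    )
    return [
        {"doc_id": doc_id, "field": name, "label": label}
        for name, label in sorted(changes.items())
    ]
```

Fields the reviewer left untouched produce no training example here; whether confirmed fields should also be kept as positive examples is a design choice this sketch leaves open.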

Python, OCR, Computer Vision, spaCy, BERT, FastAPI, PostgreSQL, MinIO, RabbitMQ
