Leading Financial Services Provider
Intelligent document processing
Automated data extraction from complex financial documents: OCR, layout-aware models, and human review queues feeding downstream systems.
The challenge
A financial services provider had more than 200 employees doing manual data entry on contracts, invoices, statements, and regulatory filings. Manual extraction ran at roughly a 5% error rate, which in a regulated business is not a quality problem; it is a compliance problem. Turnaround on critical documents was 48 to 72 hours, and the process could not absorb quarter-end peaks: volume spiked, backlogs grew, and the only lever was hiring more people to type faster.
The brief was direct. Take documents in whatever form they arrive, turn them into structured data the downstream financial systems can trust, and do it faster and more accurately than the manual operation.
What we built
An end-to-end pipeline built in Python, with FastAPI, PostgreSQL, MinIO for document storage, and RabbitMQ as the queue backbone between stages. Each stage in the diagram above is an independent service consuming from a queue, so a slow OCR job never blocks ingestion and the system degrades gracefully under load.
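To make the decoupling concrete, here is a minimal sketch of two stages joined by a queue. It is an illustration under stated assumptions, not the production code: the standard-library `queue.Queue` stands in for a RabbitMQ queue, and the names `ingest`, `ocr_worker`, and `run_demo` are hypothetical.

```python
import queue
import threading
import time


def ingest(doc_ids, ocr_queue):
    """Ingestion only enqueues work; it never waits for OCR to finish."""
    for doc_id in doc_ids:
        ocr_queue.put(doc_id)


def ocr_worker(ocr_queue, extract_queue, delay=0.01):
    """Hypothetical OCR stage: consume until the None sentinel arrives."""
    while True:
        doc_id = ocr_queue.get()
        if doc_id is None:
            extract_queue.put(None)  # pass the shutdown signal downstream
            break
        time.sleep(delay)  # simulate a slow OCR job
        extract_queue.put((doc_id, f"text of {doc_id}"))


def run_demo(doc_ids):
    # Assumption: in-process queues stand in for broker queues between services.
    ocr_queue = queue.Queue()
    extract_queue = queue.Queue()
    worker = threading.Thread(target=ocr_worker, args=(ocr_queue, extract_queue))
    worker.start()

    ingest(doc_ids, ocr_queue)  # returns as soon as everything is enqueued
    ocr_queue.put(None)         # sentinel: no more documents
    worker.join()

    results = []
    while True:
        item = extract_queue.get()
        if item is None:
            break
        results.append(item)
    return results


if __name__ == "__main__":
    print(run_demo(["doc-1", "doc-2"]))
```

The point of the shape is that `ingest` finishes regardless of how long `ocr_worker` takes. Under load the queue gets longer, but nothing upstream stalls.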
Ingestion and classification
Documents arrive through three channels: email, direct upload, and API. Everything converges on a classification model that sorts incoming documents into 30+ types and routes them accordingly. We put classification first deliberately. Knowing the document type up front lets the extraction stage apply the right expectations about fields and layout, and it lets us route unsupported documents to a holding queue instead of producing garbage downstream.
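The routing decision can be sketched in a few lines. Everything named here is illustrative: the queue names, the four-item `SUPPORTED_TYPES` subset, and the `min_confidence` cut-off are assumptions for the example, and sending low-confidence classifications to the holding queue is a design choice of this sketch rather than something the project description specifies.

```python
from dataclasses import dataclass

# Illustrative subset; the real system handles 30+ document types.
SUPPORTED_TYPES = {"invoice", "contract", "statement", "regulatory_filing"}

HOLDING_QUEUE = "holding"  # hypothetical queue name


@dataclass(frozen=True)
class Classification:
    doc_type: str
    confidence: float


def route(classification, min_confidence=0.8):
    """Pick the next queue for a classified document.

    Unsupported types go to the holding queue so they never reach
    extraction. Assumption: uncertain classifications are held as well.
    """
    if classification.doc_type not in SUPPORTED_TYPES:
        return HOLDING_QUEUE
    if classification.confidence < min_confidence:
        return HOLDING_QUEUE
    return f"extract.{classification.doc_type}"


if __name__ == "__main__":
    print(route(Classification("invoice", 0.95)))
    print(route(Classification("boarding_pass", 0.99)))
```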
OCR and extraction, including tables
The extraction stage combines OCR with NLP models (spaCy and BERT-based) for context-aware field extraction, plus dedicated table-structure understanding, because in financial documents the data you actually want usually lives in tables. We made extraction template-free: the models learn document structure rather than matching against predefined layouts. The reason is maintenance cost. With 30+ document types and vendors changing their invoice formats whenever they like, a template library becomes a treadmill. Template-free extraction means a new layout degrades to lower confidence rather than failing outright.
Validation and the human review loop
Nothing reaches downstream systems on model output alone. Extracted data is cross-referenced against existing databases and checked against business rules, and every field carries a confidence score. High-confidence documents go straight through to the financial systems: ERP, general ledger, and compliance reporting. Low-confidence documents land in a human review queue for exception handling. The reviewers’ corrections feed back into model retraining, so the system improves on exactly the documents it currently handles worst. The confidence threshold is the safety valve: when in doubt, a human decides.
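A minimal sketch of that decision, combining per-field confidence with one business rule. The threshold value, the field names, and the `totals_match` rule are hypothetical examples chosen for illustration; the actual rules were designed with the client's operations team.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation

REVIEW_THRESHOLD = 0.90  # hypothetical value; tuned per deployment


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: str
    confidence: float


def totals_match(fields):
    """Hypothetical business rule: subtotal + tax must equal total."""
    try:
        subtotal = Decimal(fields["subtotal"].value)
        tax = Decimal(fields["tax"].value)
        total = Decimal(fields["total"].value)
    except (KeyError, InvalidOperation):
        return "missing or non-numeric amount"
    return None if subtotal + tax == total else "subtotal + tax != total"


def decide(fields, rules=(totals_match,), threshold=REVIEW_THRESHOLD):
    """Return (destination, reasons) for one document's extracted fields."""
    by_name = {f.name: f for f in fields}
    reasons = [f"low confidence: {f.name}" for f in fields if f.confidence < threshold]
    for rule in rules:
        problem = rule(by_name)
        if problem:
            reasons.append(problem)
    # Any doubt at all sends the document to a human.
    if reasons:
        return "human_review", reasons
    return "straight_through", []


if __name__ == "__main__":
    doc = [
        ExtractedField("subtotal", "100.00", 0.99),
        ExtractedField("tax", "20.00", 0.98),
        ExtractedField("total", "120.00", 0.62),
    ]
    print(decide(doc))
```

Returning the reasons alongside the destination gives reviewers the specific fields to look at, and gives the audit trail a record of why a document was held.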
How it was delivered
The engagement ran 12 weeks with a team of five.
Weeks 1 to 2 were discovery and design. We analyzed more than 100,000 sample documents from the client’s archive, identified the 30+ document types and their extraction requirements, and designed the validation rules and workflows with the operations team.
Weeks 3 to 8 were core development: the OCR pipeline with preprocessing, the classification models, and extraction tuned per document type.
Weeks 9 to 12 covered integration with the client’s existing document management systems, security and compliance work, and the part we consider non-negotiable for this kind of system: a parallel run. The pipeline processed live documents alongside the manual operation, and we compared outputs until accuracy was demonstrated on real production traffic, not a test set. Only then did the client start retiring manual steps.
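The comparison at the heart of a parallel run can be sketched as a field-level agreement check. The data shapes and the name `field_agreement` are assumptions for illustration; they are not taken from the project.

```python
def field_agreement(pipeline_out, manual_out):
    """Compare pipeline output with manual entry, field by field.

    Both arguments map doc_id -> {field_name: value}. Returns the share of
    manually entered fields the pipeline reproduced exactly, plus the list
    of disagreements as (doc_id, field, pipeline_value, manual_value).
    """
    compared = 0
    mismatches = []
    for doc_id, manual_fields in manual_out.items():
        auto_fields = pipeline_out.get(doc_id, {})
        for name, manual_value in manual_fields.items():
            compared += 1
            auto_value = auto_fields.get(name)  # None if the pipeline missed it
            if auto_value != manual_value:
                mismatches.append((doc_id, name, auto_value, manual_value))
    rate = 1.0 if compared == 0 else (compared - len(mismatches)) / compared
    return rate, mismatches


if __name__ == "__main__":
    pipeline = {"d1": {"total": "120.00", "date": "2024-01-31"}}
    manual = {"d1": {"total": "120.00", "date": "2024-01-13"}}
    print(field_agreement(pipeline, manual))
```

A disagreement is not automatically a pipeline error. Manual entry has its own error rate, so each mismatch needs a check against the source document before it counts for or against either side.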
What shipped
- Processes 50,000+ documents daily across 30+ document types
- Reduced manual data entry by 90%
- Template-free extraction that adapts to new document layouts without per-template engineering
- Confidence scoring on every extracted field, with automatic routing between straight-through processing and human review
- Validation layer that cross-references existing databases and enforces business rules before data reaches the ERP, general ledger, or compliance systems
- A correction-to-retraining loop that turns reviewer fixes into model improvements
- API-first design with a full audit trail of every processing step
The system has been in production since 2024 and has processed more than a million documents. The review queue and retraining loop are why it stuck: instead of accuracy decaying as document formats drift, the operations team’s everyday corrections keep the models current, and the audit trail gives compliance a complete record that the old manual process never had.
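One way the correction-to-retraining step could look, as a sketch under assumptions: the `Correction` record, the output dictionary shape, and the choice to keep only the latest real change per field are all hypothetical, not a description of the deployed loop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Correction:
    doc_id: str
    field: str
    predicted: str   # what the model extracted
    corrected: str   # what the reviewer entered


def to_training_examples(corrections):
    """Turn reviewer corrections into labelled examples for retraining.

    Assumptions of this sketch: the last correction for a given
    (document, field) wins, and reviews that left the value unchanged
    are skipped.
    """
    latest = {}
    for c in corrections:
        latest[(c.doc_id, c.field)] = c
    return [
        {"doc_id": c.doc_id, "field": c.field, "label": c.corrected}
        for c in latest.values()
        if c.corrected != c.predicted
    ]


if __name__ == "__main__":
    fixes = [
        Correction("d1", "total", "12O.00", "120.00"),
        Correction("d1", "date", "2024-01-31", "2024-01-31"),
    ]
    print(to_training_examples(fixes))
```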