I've been working on this for about a year and a half, and decided to finally open source it.
I wanted a self-hosted alternative to the intelligent document processing SaaS products (Document AI, Form Recognizer, the various PDF-to-JSON tools), something you could run on your own hardware.
The interesting bits:
- Three-tier extraction: PyMuPDF for digital PDFs (~50ms), Docling layout-only for scanned-but-readable, Docling+OCR for the rough stuff. It falls back to the next tier automatically when the extracted character count comes back too low.
- Smart templates use vector similarity (Qdrant) to classify docs, then LLM extraction for fields — no regex, so layout drift doesn't break templates.
- Local Ollama or Azure OpenAI, switchable per-user.
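
A minimal sketch of the character-count fallback idea, not the project's actual code. The function name `extract_with_fallback`, the `MIN_CHARS` threshold, and the shape of the tier list are all assumptions made for illustration; the real extractors (PyMuPDF, Docling) would be plugged in as the callables.

```python
from typing import Callable, Sequence

# Hypothetical threshold: the post does not say how many characters count as "enough".
MIN_CHARS = 200

# An extractor takes raw PDF bytes and returns whatever text it could recover.
Extractor = Callable[[bytes], str]


def extract_with_fallback(
    pdf: bytes,
    tiers: Sequence[tuple[str, Extractor]],
    min_chars: int = MIN_CHARS,
) -> tuple[str, str]:
    """Try tiers in order, cheapest first.

    Returns (tier_name, text) from the first tier that yields at least
    min_chars non-whitespace characters. If no tier clears the bar,
    returns the longest result seen.
    """
    best_name, best_text = "", ""
    for name, extract in tiers:
        text = extract(pdf)
        # Count only non-whitespace characters so blank pages do not pass.
        if len("".join(text.split())) >= min_chars:
            return name, text
        if len(text) > len(best_text):
            best_name, best_text = name, text
    return best_name, best_text
```

Ordering the tiers from cheapest to most expensive means a clean digital PDF never pays for layout analysis or OCR; only documents that come back nearly empty move on to the slower tiers.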
Built on top of Cole Medin's local-ai-packaged. Apache 2.0.
https://github.com/nickyeager/fetchtext