
I've been working on this for about a year and a half, and decided to finally open source it.

I wanted the kind of intelligent document processing you get from SaaS products (Document AI, Form Recognizer, the various PDF-to-JSON tools), but in a form you could run on your own hardware.

The interesting bits:

- Three-tier extraction: PyMuPDF for digital PDFs (~50ms), Docling layout-only for scanned-but-readable, Docling+OCR for the rough stuff. Auto-fallback based on extracted character count.

- Smart templates use vector similarity (Qdrant) to classify docs, then LLM extraction for fields. There's no regex, so layout drift doesn't break templates.

- Local Ollama or Azure OpenAI, switchable per-user.
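
To make the auto-fallback idea concrete, here is a minimal sketch of how a tiered extractor could pick a result based on character count. It is not the code from the repo: the function name `extract_with_fallback`, the `MIN_CHARS` threshold of 200, and the idea of passing each tier in as a plain callable are all assumptions for illustration. In a real setup the callables would wrap PyMuPDF, Docling layout-only, and Docling+OCR.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical threshold: below this many non-whitespace characters we
# assume the tier failed to read the document and try the next one.
MIN_CHARS = 200

# Each tier is a callable that takes raw PDF bytes and returns text.
Extractor = Callable[[bytes], str]


@dataclass
class ExtractionResult:
    tier: str  # name of the tier that produced the text
    text: str  # extracted text


def extract_with_fallback(
    pdf_bytes: bytes,
    tiers: Sequence[tuple[str, Extractor]],
    min_chars: int = MIN_CHARS,
) -> ExtractionResult:
    """Try tiers in order (cheapest first) and return the first result
    with at least `min_chars` non-whitespace characters. If no tier
    reaches the threshold, return the result with the most characters."""
    if not tiers:
        raise ValueError("at least one extraction tier is required")

    best = None
    best_count = -1
    for name, extractor in tiers:
        text = extractor(pdf_bytes)
        count = sum(1 for ch in text if not ch.isspace())
        if count >= min_chars:
            return ExtractionResult(tier=name, text=text)
        if count > best_count:
            best, best_count = ExtractionResult(tier=name, text=text), count
    return best


if __name__ == "__main__":
    # Stand-in extractors; real ones would call PyMuPDF / Docling.
    fast = lambda data: ""           # e.g. a scanned PDF with no text layer
    layout = lambda data: "Invoice"  # too little text, keep falling back
    ocr = lambda data: "Invoice 1234 " * 50

    result = extract_with_fallback(
        b"%PDF-", [("pymupdf", fast), ("layout", layout), ("ocr", ocr)]
    )
    print(result.tier)
```

Ordering the tiers from cheapest to most expensive means a digital PDF never pays the OCR cost, and only documents with little or no usable text layer fall through to the slower tiers.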

Built on top of Cole Medin's local-ai-packaged. Apache 2.0.

https://github.com/nickyeager/fetchtext


