Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I've done only one pipeline trying parse actual PDF structure and the least surprising part of it is that some documents have top-to-bottom layout and others have bottom-to-top, flipped, with text flipped again to be readable. It only goes worse from there. Absurd is correct.


That means you have to put the text (each infividual letter) into its correct place by rendering pdf, but doesnt justify actual OCR which goes one step further and back by rendering and backguessing the glyphs. But thats just text, tables and structure are also somewhere there to be recovered.




Consider applying for YC's Summer 2026 batch! Applications are open till May 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: