One of the things that Textract purports to do is also detect structured data, e.g. if the scanned image has tables, like a spreadsheet.
I tested it out when it became available to the general public, and included the API call and JSON response for their default and relatively simple example: https://gist.github.com/dannguyen/1a71c8fb98ddd1df6abdc08bc7...
I tried it against what I consider one of the harder real-world types of scanned tabular data, a Senate personal financial disclosure form [0], and it didn't do great. In fact, I found that it did substantially worse than ABBYY FineReader. Or rather, ABBYY did much better than I would've expected for any software, even managing to read the vertically-oriented column headers accurately [1].
[0] https://gist.github.com/dannguyen/1a71c8fb98ddd1df6abdc08bc7...
[1] https://github.com/dannguyen/abbyy-finereader-ocr-senate#les...
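Roughly what such a call looks like through boto3 (a minimal sketch assuming a local scan and default AWS credentials; not the exact code from the gist above):

```python
import boto3

# Assumes credentials are already configured; "disclosure.png" is a hypothetical scanned page.
client = boto3.client("textract")

with open("disclosure.png", "rb") as f:
    response = client.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["TABLES"],  # ask for table structure, not just raw text
    )

# The response is a flat list of Block objects (PAGE, TABLE, CELL, WORD, ...)
# that have to be stitched back together via their Relationships.
tables = [b for b in response["Blocks"] if b["BlockType"] == "TABLE"]
print(f"Detected {len(tables)} table(s)")
```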
I was hoping that they had taken the time to test tables, nested tables, and irregular table rows with headers every so often.
A good example of what I’m talking about is invoices. In my line of work we extract data from thousands of different types of invoices that are the only place to find key data from your service providers (waste & recycling bills, soon other parts of a building’s expenses), and normalize that information.
There is a long tail of industries that are years away from having an API to transact information… but that information is available on their monthly statements in some sort of crazy table format!
In my experience Amazon Textract has been the best in terms of processing speed, ease of use, and table extraction accuracy. However, post-processing is almost always needed with any OCR implementation.
Edit: It’s important to note that Microsoft and Google don’t even support table extraction in the APIs listed in this article!
Yep! It's great, but is maybe 60% there, so I'm looking for something that can extract much more structure from a document. I doubt what I'm looking for will exist for another 10 years, though.
Is it feasible to create loose templates for where the data is and extract that way? I have a mothballed project that did pretty well; it was able to discern different templates from a mass of documents.
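In case it helps to make that concrete, a rough sketch of the idea (assuming pytesseract/Pillow, with made-up field names and coordinates for one template):

```python
from PIL import Image
import pytesseract

# Hypothetical "loose template": named regions with approximate pixel boxes
# (left, top, right, bottom) for one known document layout.
INVOICE_TEMPLATE = {
    "invoice_number": (1050, 120, 1500, 180),
    "total_due": (1050, 1800, 1500, 1880),
}

def extract_fields(image_path, template):
    page = Image.open(image_path)
    fields = {}
    for name, box in template.items():
        region = page.crop(box)  # cut out just the area of interest
        fields[name] = pytesseract.image_to_string(region).strip()
    return fields

print(extract_fields("invoice_scan.png", INVOICE_TEMPLATE))
```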
I found a relatively recent (Feb. 2019) comparison of OCR software and services [1]. It may be interesting to you, especially the "Texas Campaign Filing" comparison, which seems to have (nested) tables.
That's a good point not explained in the article: there are a huge number of use cases for OCR. In this case, the use is extracting words that can be used in full-text search, so structural extraction isn't a key criterion.
Edit: and now it's hopefully clarified in the article itself. :)
"In this case, the use is extracting words that can be used in full-text search, so structural extraction isn't a key criterion."
In case someone wants to know more, the former is known as "full page OCR" and the latter as "data capture"/"document processing" (or IDP, intelligent document processing).
Full page OCR for machine printed text is considered a solved problem (but not for handwritten text).
Data capture is hard to do and involves extracting specific fields from documents.
The first big cloud company going into data capture territory was Amazon with AWS Textract (calling it OCR++). There's also Document Understanding AI (Google) and Azure Form Recognizer in Beta, as mentioned by others in this thread.
The big 3 RPA companies (UiPath, Automation Anywhere, Blue Prism) have also gone into data capture (calling it cognitive or intelligent RPA).
ABBYY (with FlexiCapture) and Kofax (who recently acquired Nuance's imaging division, the 2nd most popular OCR engine after ABBYY's) are the traditional IDP players.
Just on the note of tables, have you tried the table extraction and OCR in Microsoft Excel on iPhone? It may also be on Android; I just tried it on iPhone, and man, it works great!
What kind of post-processing? I'm not an expert, but usually we have done a lot of preprocessing (like binarization, etc.) before we call cloud OCR services.
Tesseract[0] is the classic example. There's a bunch of advice for improving your accuracy with it, like making your images larger (literally just scaling them up 2x or 4x).
It would be interesting to see the benchmark from the article repeated with different scaling options (or other preprocessing, depending on platform).
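As a rough illustration of that kind of preprocessing (a sketch assuming pytesseract and Pillow; the 2x factor and threshold are just starting points, not tuned values):

```python
from PIL import Image
import pytesseract

img = Image.open("scan.png")

# Upscale 2x -- Tesseract tends to do better when the glyphs are larger.
img = img.resize((img.width * 2, img.height * 2), Image.LANCZOS)

# Simple global binarization: grayscale, then threshold to pure black/white.
img = img.convert("L").point(lambda p: 255 if p > 160 else 0)

print(pytesseract.image_to_string(img))
```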
Running tesseract (4.0.0 using the LSTM engine) on the same images leaves a lot to be desired for handwriting, but does well on the (non-handwriting) website image (the source images are linked in the "OCR Image Processing Results" section).
Tesseract works really well for literal OCR, but I haven't had much luck before with more common work (like documents with tables and such). Has anything changed as of late?
Since October there is a new version, v4: “Added a new OCR engine that uses neural network system based on LSTMs, with major accuracy gains. [..] Added trained data that includes LSTM models to 123 languages.”
No idea how well it does for structured content like tables.
There seems to be a recent v2 of a javascript port (i.e. Tesseract v4 compiled to wasm), if anyone wants to do OCR in their browser:
Typically OCR accuracy is measured in two ways, CER (Character error rate) and WER (Word error rate). If just one number is provided, it's typically CER.
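A minimal sketch of how both are usually computed (plain edit distance over characters vs. words; real benchmarks often normalize case and whitespace first):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

def wer(reference, hypothesis):
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)

print(cer("hello world", "helo world"), wer("hello world", "helo world"))
```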
"Finding words in images" is a bit ambiguous. It can mean "word spotting" (intended for retrieval rather than transcription) but also "text segmentation" (part of preprocessing step before OCR).
I wonder if some implementations just use nearest-neighbour words to increase accuracy in the common case of normal text, decreasing performance on random strings considerably.
The tested corpus only contains relatively common words, so this aspect is not tested.
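If a service (or a post-processing step) did that, it might look roughly like this toy dictionary correction, which would indeed hurt on random strings (difflib is in the standard library; the vocabulary here is made up):

```python
import difflib

# Toy vocabulary standing in for a real dictionary or language model.
VOCAB = ["minimalist", "editor", "invoice", "recycling", "statement"]

def correct_word(word):
    # Snap each OCR'd token to its nearest known word, if one is close enough.
    matches = difflib.get_close_matches(word.lower(), VOCAB, n=1, cutoff=0.8)
    return matches[0] if matches else word

print([correct_word(w) for w in "minimalizt editpr xq7f3".split()])
# -> ['minimalist', 'editor', 'xq7f3'] (the random string is left untouched)
```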
Have a specific image you'd be interested in seeing tested? The article only contains a few examples that could be freely used, but images with sparse random text (e.g. [1]) do tend to have good results across all the services.
I tried running the source images through FineReader Online, but the images with handwriting resulted in "was not processed: the recognized document contains errors". The website image worked, but was missing a few elements, like the other headings on the line with "Minimalist editor".
All OCR services, BTW, have the same threading problem. They work really well for sequential text, but as soon as you start getting into more complicated "marketing" formats, they don't work at all.
The best use of the online OCR services I found was figuring out published dates for news articles that typically don't show up in anything other than images. Even with Azure, AWS, and Google, I still needed a post-capture regex to figure the stuff out.
https://drive.google.com/a/bolcer.org/file/d/16vujemgD91Ebuu...
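That post-capture step can be as simple as a date regex over the OCR output (a sketch; this pattern only covers "Month D, YYYY"-style dates):

```python
import re

# Hypothetical cleanup: pull a publication date out of noisy OCR text.
DATE_RE = re.compile(
    r"\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?\s+\d{1,2},\s+\d{4}\b",
    re.IGNORECASE,
)

def find_published_date(ocr_text):
    match = DATE_RE.search(ocr_text)
    return match.group(0) if match else None

print(find_published_date("Tech Desk | Published March 4, 2019 - 8:03am"))  # March 4, 2019
```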
Once a year when I do my taxes I go through every single bank transaction and classify it into a proper column in a spreadsheet. The problem is that my bank provides CSV transaction data for the past 6 months only. But it does provide years of bank statements in PDF format.
I decided I could save a lot of time by using Amazon Textract to extract the tables from the PDFs and convert them to CSV files.
The problem is that while Textract works really well for well defined tabular data it does not work for tables where the rows and columns are implied with white space, instead of lines. When I reached out to AWS they confirmed this problem and suggested that I draw the table lines into the PDF and then run Textract again on this modified PDF. This felt like a dirty hack, so I did not proceed with this suggestion.
Textract is a great tool when it works well, but unfortunately when it doesn't, there is no way to make adjustments to improve the results. In the end I managed to complete my project and get much better results by using the excellent Camelot Python table extraction library.
"The problem is that while Textract works really well for well defined tabular data it does not work for tables where the rows and columns are implied with white space, instead of lines."
This is what Tabula and Camelot call "Stream" and "Lattice" parsing methods.
Whitespace between cells is Stream, demarcated lines is Lattice.
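A rough sketch of what that looks like with Camelot (assuming a text-based PDF named statement.pdf; flavor="stream" is the whitespace-based mode, "lattice" the ruled-lines mode):

```python
import camelot

# "stream" infers columns from whitespace; use flavor="lattice" when the
# table cells are drawn with actual ruling lines.
tables = camelot.read_pdf("statement.pdf", pages="all", flavor="stream")

for i, table in enumerate(tables):
    print(table.parsing_report)             # per-table accuracy / whitespace stats
    table.to_csv(f"statement_table_{i}.csv")
```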
Vicarious has text recognition technology that is invariant to, e.g., letter spacing, but they haven't released it due to fear of CAPTCHA fraud. See, e.g., https://i.imgur.com/lN4AzmE.png
I wonder if they could modify it to release a near 100% accurate OCR service?
Letter spacing seems like an arbitrary distinction. If you design a neural net that contains operators that make certain things invariant (e.g. size), then of course it would perform better than a neural net without them.
Unless I am having a hard time reading the text on the site, they are all $1500 for 1M images, but Amazon Textract, which has the second lowest average result, is "cheapest" compared to Azure Cognitive Services, which has the highest average result... Did I miss something? How are 4 items, all the same price, ranked arseways?
I’m new to the OCR space. Given the successes in self-driving (granted, no systems are production-ready, but Waymo and Tesla have MVPs), what makes reading text so hard that cloud providers struggle to reach human-level accuracy?
Same problems as with self-driving cars and speech recognition: it's very very hard (if not impossible) for software to parse the universe of scenarios. Said another way, it's never accurate enough to "work" when it's trained on the outside world. Usable output requires the kind of accuracy you only get when the process is trained in controlled environments (limited "vocabulary": fonts, layouts, languages, words, accents, geography, etc.).
This is one of those things that badly needs an open source solution, and for which the technology to solve it really well already exists, but nobody wants to do it because it's a ton of really boring, high-maintenance work.
That couldn't do OCR at all, right? It had no camera. It could do handwriting recognition on stylus input (where it has full perfect stroke data, not just pixelated images of strokes), which is a very different and easier problem.
The Apple Newton was capable of online OCR ("online" here has nothing to do with internet connectivity).
As you mention, online OCR is when you input the strokes directly on the device vs offline OCR where the input is an image.
Some trivia:
The first version of handwriting recognition engine of the Newton was developed by ParaGraph International (founded by the founder of Evernote).
Another version (Print Recognizer) was later developed by Apple.