One of the things that Textract purports to do is also detect structured data, e.g. if the scanned image has tables, like a spreadsheet.
I tested it out when it became available to the general public, and included the API call and JSON response for their default and relatively simple example: https://gist.github.com/dannguyen/1a71c8fb98ddd1df6abdc08bc7...
I tried it against what I consider one of the harder real-world types of scanned tabular data, a Senate personal financial disclosure form [0], and it didn't do great. In fact, I found that it did substantially worse than ABBYY FineReader. Or rather, ABBYY did much better than I would've expected for any software, even managing to read the vertically-oriented column headers accurately [1].
[0] https://gist.github.com/dannguyen/1a71c8fb98ddd1df6abdc08bc7...
[1] https://github.com/dannguyen/abbyy-finereader-ocr-senate#les...
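Roughly what such a call looks like through boto3 (a minimal sketch assuming a local scan and default AWS credentials; not the exact code from the gist above):

```python
import boto3

# Assumes credentials are already configured; "disclosure.png" is a hypothetical scanned page.
client = boto3.client("textract")

with open("disclosure.png", "rb") as f:
    response = client.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["TABLES"],  # ask for table structure, not just raw text
    )

# The response is a flat list of Block objects (PAGE, TABLE, CELL, WORD, ...)
# that have to be stitched back together via their Relationships.
tables = [b for b in response["Blocks"] if b["BlockType"] == "TABLE"]
print(f"Detected {len(tables)} table(s)")
```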
I was hoping that they had taken the time to test tables, nested tables, and irregular table rows with headers every so often.
A good example of what I’m talking about is invoices. In my line of work we extract data from thousands of different types of invoices that are the only place to find key data from your service providers (waste & recycling bills, soon other parts of a building’s expenses), and normalize that information.
There is a long tail of industries that are years away from having an API to transact information… but that information is available on their monthly statements in some sort of crazy table format!
In my experience Amazon Textract has been the best in terms of processing speed, ease of use, and table extraction accuracy. However, post-processing is almost always needed with any OCR implementation.
Edit: It’s important to note that Microsoft and Google don’t even support table extraction in the APIs listed in this article!
Yep! It's great, but is maybe 60% there, so I'm looking for something that can extract much more structure from a document. I doubt what I'm looking for will exist for another 10 years, though.
Is it feasible to create loose templates for where the data is and extract that way? I have a mothballed project that did pretty well; it was able to discern different templates from a mass of documents.
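In case it helps to make that concrete, a rough sketch of the idea (assuming pytesseract/Pillow, with made-up field names and coordinates for one template):

```python
from PIL import Image
import pytesseract

# Hypothetical "loose template": named regions with approximate pixel boxes
# (left, top, right, bottom) for one known document layout.
INVOICE_TEMPLATE = {
    "invoice_number": (1050, 120, 1500, 180),
    "total_due": (1050, 1800, 1500, 1880),
}

def extract_fields(image_path, template):
    page = Image.open(image_path)
    fields = {}
    for name, box in template.items():
        region = page.crop(box)  # cut out just the area of interest
        fields[name] = pytesseract.image_to_string(region).strip()
    return fields

print(extract_fields("invoice_scan.png", INVOICE_TEMPLATE))
```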
I found a relatively recent (Feb. 2019) comparison of OCR software and services [1]. It may be interesting to you, especially the "Texas Campaign Filing" comparison, which seems to have (nested) tables.
That's a good point not explained in the article: there are a huge number of use cases for OCR. In this case, the use is extracting words that can be used in full-text search, so structural extraction isn't a key criterion.
Edit: and now it's hopefully clarified in the article itself. :)
"In this case, the use is extracting words that can be used in full-text search, so structural extraction isn't a key criterion."
In case someone wants to know more, the former is known as "full page OCR" and the latter as "data capture"/"document processing" (or IDP, intelligent document processing).
Full page OCR for machine printed text is considered a solved problem (but not for handwritten text).
Data capture is hard to do and involves extracting specific fields from documents.
The first big cloud company going into data capture territory was Amazon with AWS Textract (calling it OCR++). There's also Document Understanding AI (Google) and Azure Form Recognizer in Beta, as mentioned by others in this thread.
The big 3 RPA companies (UiPath, Automation Anywhere, Blue Prism) have also gone into data capture (calling it cognitive or intelligent RPA).
ABBYY (with FlexiCapture) and Kofax (who recently acquired Nuance's imaging division, the 2nd most popular OCR engine after ABBYY's) are the traditional IDP players.
Just on the note of tables, have you tried the table extraction and OCR in Microsoft Excel on iPhone? It may also be on Android; I just tried it on iPhone, and man, it works great!
What kind of post-processing? I'm not an expert, but usually we have done a lot of preprocessing (like binarization, etc.) before we call cloud OCR services.
Tesseract[0] is the classic example. There's a bunch of advice for improving your accuracy with it, like making your images larger (literally just scaling them up 2x or 4x).
It would be interesting to see the benchmark from the article repeated with different scaling options (or other preprocessing, depending on platform).
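As a rough illustration of that kind of preprocessing (a sketch assuming pytesseract and Pillow; the 2x factor and threshold are just starting points, not tuned values):

```python
from PIL import Image
import pytesseract

img = Image.open("scan.png")

# Upscale 2x -- Tesseract tends to do better when the glyphs are larger.
img = img.resize((img.width * 2, img.height * 2), Image.LANCZOS)

# Simple global binarization: grayscale, then threshold to pure black/white.
img = img.convert("L").point(lambda p: 255 if p > 160 else 0)

print(pytesseract.image_to_string(img))
```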
Running tesseract (4.0.0 using the LSTM engine) on the same images leaves a lot to be desired for handwriting, but does well on the (non-handwriting) website image (the source images are linked in the "OCR Image Processing Results" section).
Tesseract works really well for literal OCR, but I haven't had much luck before with more common work (like documents with tables and such). Has anything changed as of late?
Since October there is a new version, v4: “Added a new OCR engine that uses neural network system based on LSTMs, with major accuracy gains. [..] Added trained data that includes LSTM models to 123 languages.”
No idea how well it does for structured content like tables.
There seems to be a recent v2 of a javascript port (i.e. Tesseract v4 compiled to wasm), if anyone wants to do OCR in their browser:
Typically OCR accuracy is measured in two ways, CER (Character error rate) and WER (Word error rate). If just one number is provided, it's typically CER.
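A minimal sketch of how both are usually computed (plain edit distance over characters vs. words; real benchmarks often normalize case and whitespace first):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

def wer(reference, hypothesis):
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)

print(cer("hello world", "helo world"), wer("hello world", "helo world"))
```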
"Finding words in images" is a bit ambiguous. It can mean "word spotting" (intended for retrieval rather than transcription) but also "text segmentation" (part of preprocessing step before OCR).
I wonder if some implementations just use nearest-neighbour words to increase accuracy in the common case of normal text, decreasing performance on random strings considerably.
The tested corpus only contains relatively common words, so this aspect is not tested.
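If a service (or a post-processing step) did that, it might look roughly like this toy dictionary correction, which would indeed hurt on random strings (difflib is in the standard library; the vocabulary here is made up):

```python
import difflib

# Toy vocabulary standing in for a real dictionary or language model.
VOCAB = ["minimalist", "editor", "invoice", "recycling", "statement"]

def correct_word(word):
    # Snap each OCR'd token to its nearest known word, if one is close enough.
    matches = difflib.get_close_matches(word.lower(), VOCAB, n=1, cutoff=0.8)
    return matches[0] if matches else word

print([correct_word(w) for w in "minimalizt editpr xq7f3".split()])
# -> ['minimalist', 'editor', 'xq7f3'] (the random string is left untouched)
```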
Have a specific image you'd be interested in seeing tested? The article only contains a few examples that could be freely used, but images with sparse random text (e.g. [1]) do tend to have good results across all the services.
I tried running the source images through FineReader Online, but the images with handwriting resulted in "was not processed: the recognized document contains errors". The website image worked, but was missing a few elements, like the other headings on the line with "Minimalist editor".
All OCR services, BTW, have the same threading problem. They work really well for sequential text, but as soon as you start getting into more complicated "marketing" formats, they don't work at all.
The best use of the online OCR services I found was figuring out published dates for news articles that typically don't show up in anything other than images. Even with Azure, AWS, and Google, I still needed a post-capture regex to figure the stuff out.
https://drive.google.com/a/bolcer.org/file/d/16vujemgD91Ebuu...
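That post-capture step can be as simple as a date regex over the OCR output (a sketch; this pattern only covers "Month D, YYYY"-style dates):

```python
import re

# Hypothetical cleanup: pull a publication date out of noisy OCR text.
DATE_RE = re.compile(
    r"\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?\s+\d{1,2},\s+\d{4}\b",
    re.IGNORECASE,
)

def find_published_date(ocr_text):
    match = DATE_RE.search(ocr_text)
    return match.group(0) if match else None

print(find_published_date("Tech Desk | Published March 4, 2019 - 8:03am"))  # March 4, 2019
```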
Once a year when I do my taxes I go through every single bank transaction and classify it into a proper column in a spreadsheet. The problem is that my bank provides CSV transaction data for the past 6 months only. But it does provide years of bank statements in PDF format.
I decided I could save a lot of time by using Amazon Textract to extract the tables from the PDFs and convert them to CSV files.
The problem is that while Textract works really well for well defined tabular data it does not work for tables where the rows and columns are implied with white space, instead of lines. When I reached out to AWS they confirmed this problem and suggested that I draw the table lines into the PDF and then run Textract again on this modified PDF. This felt like a dirty hack, so I did not proceed with this suggestion.
Textract is a great tool when it works well, but unfortunately when it doesn't, there is no way to make adjustments to improve the results. In the end I managed to complete my project and get much better results by using the excellent Camelot Python table extraction library.
"The problem is that while Textract works really well for well defined tabular data it does not work for tables where the rows and columns are implied with white space, instead of lines."
This is what Tabula and Camelot call "Stream" and "Lattice" parsing methods.
Whitespace between cells is Stream, demarcated lines is Lattice.
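A rough sketch of what that looks like with Camelot (assuming a text-based PDF named statement.pdf; flavor="stream" is the whitespace-based mode, "lattice" the ruled-lines mode):

```python
import camelot

# "stream" infers columns from whitespace; use flavor="lattice" when the
# table cells are drawn with actual ruling lines.
tables = camelot.read_pdf("statement.pdf", pages="all", flavor="stream")

for i, table in enumerate(tables):
    print(table.parsing_report)             # per-table accuracy / whitespace stats
    table.to_csv(f"statement_table_{i}.csv")
```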
Vicarious has text recognition technology that is invariant to, e.g., letter spacing, but they haven't released it due to fear of CAPTCHA fraud. See, e.g., https://i.imgur.com/lN4AzmE.png
I wonder if they could modify it to release a near 100% accurate OCR service?
Letter spacing seems like an arbitrary distinction. If you design a neural net that contains operators that make certain things invariant (e.g. size), then of course it would perform better than a neural net without them.
Unless I am having a hard time reading the text on the site, they are all $1500 for 1M images, but Amazon Textract, which has the second lowest average result, is "cheapest" compared to Azure Cognitive Services, which has the highest average result... Did I miss something? How are 4 items, all the same price, ranked arseways?
I’m new to the OCR space. Given the successes in self-driving (granted, no systems are production-ready, but Waymo and Tesla have MVPs), what makes reading text so hard that cloud providers struggle to reach human-level accuracy?
Same problems as with self-driving cars and speech recognition: it's very very hard (if not impossible) for software to parse the universe of scenarios. Said another way, it's never accurate enough to "work" when it's trained on the outside world. Usable output requires the kind of accuracy you only get when the process is trained in controlled environments (limited "vocabulary": fonts, layouts, languages, words, accents, geography, etc.).
This is one of those things that badly needs an open source solution, and for which the technology to solve it really well already exists, but nobody wants to do it because it's a ton of really boring, high-maintenance work.
That couldn't do OCR at all, right? It had no camera. It could do handwriting recognition on stylus input (where it has full perfect stroke data, not just pixelated images of strokes), which is a very different and easier problem.
The Apple Newton was capable of online OCR ("online" here has nothing to do with internet connectivity).
As you mention, online OCR is when you input the strokes directly on the device vs offline OCR where the input is an image.
Some trivia:
The first version of handwriting recognition engine of the Newton was developed by ParaGraph International (founded by the founder of Evernote).
Another version (Print Recognizer) was later developed by Apple.