Google Cloud Platform is previewing Document AI, a new service for automatic extraction of data from documents, such as key/value pairs in forms, with a choice of parsers and an option to use a custom model.
The thinking behind Document AI is that businesses handle thousands of documents, many of which contain structured or semi-structured data but in different formats, and that there is benefit in extracting structured data from them so it can be processed and analysed.
This includes converting data into a standard form, so that synonyms like “last name”, “surname” and “family name” are treated as representing the same thing. “We take your unstructured documents across a variety of formats and turn them into cleanly structured data,” is the pitch.
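To illustrate the idea of key normalisation, here is a minimal sketch in Python. It is not how Document AI works internally; the synonym table and the function names are our own invention, and a real service would presumably rely on a trained model rather than a lookup.

```python
# Hypothetical synonym table for illustration only; not Document AI's mapping.
CANONICAL_KEYS = {
    "last name": "family_name",
    "surname": "family_name",
    "family name": "family_name",
}


def normalise_key(raw_key: str) -> str:
    """Map a raw form label to a canonical key, else a snake_case fallback."""
    # Trim, drop a trailing colon, lower-case, and collapse runs of whitespace.
    cleaned = " ".join(raw_key.strip().rstrip(":").lower().split())
    return CANONICAL_KEYS.get(cleaned, cleaned.replace(" ", "_"))


def normalise_fields(fields: dict[str, str]) -> dict[str, str]:
    """Apply normalise_key to every label in an extracted key/value mapping."""
    return {normalise_key(key): value for key, value in fields.items()}
```

With this in place, forms that label the same field differently end up with the same key, which is what makes downstream processing possible.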
Parsing an invoice in Document AI: what happened to the invoice line called Package? The service is not so good with this kind of elaborate layout
Parts of this have already been previewed, but the new piece – introduced yesterday – is a unified API that enables developers to process documents using a variety of different parsers or “processors”. The API is REST or gRPC, wrapped by client libraries for Java, Node.js or Python.
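For a flavour of the REST side, the sketch below builds (but does not send) a request to process a PDF. It assumes the URL pattern and JSON field names of the later v1 API, which may differ from the preview version described here; the function name and the project, location, processor ID and access token arguments are placeholders of our own.

```python
import base64
import json
import urllib.request


def build_process_request(project, location, processor_id, pdf_bytes, access_token):
    """Build an HTTP POST request for a processor; nothing is sent here."""
    # Assumed v1-style endpoint: a regional host plus the processor's resource name.
    url = (
        f"https://{location}-documentai.googleapis.com/v1/"
        f"projects/{project}/locations/{location}/processors/{processor_id}:process"
    )
    # The document travels inline as base64 text alongside its MIME type.
    body = {
        "rawDocument": {
            "content": base64.b64encode(pdf_bytes).decode("ascii"),
            "mimeType": "application/pdf",
        }
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to urllib.request.urlopen would send it, given a valid token. In practice most developers would use one of the client libraries, which handle authentication and response parsing.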
Available processors cover OCR (optical character recognition), generic forms, generic tables, and a document splitter, which “uses machine learning to separate documents on logical boundaries,” such as breaking a collection of scanned documents into logical paragraphs and pages. There are also special parsers for invoices, receipts, lending documents, a US Uniform Residential Loan application, and several US federal tax forms.
Developers and data scientists can also build their own custom AutoML Natural Language model to parse and analyse documents. This approach is the most powerful, enabling such things as identifying entities and assessing sentiment within documents.
AutoML Natural Language models are not new, but the integration into Document AI is the part that is now in beta. That said, both table parsing and AutoML are marked with a no entry sign in the docs, indicating a closed beta, and some other features, including document splitter, are “limited access”, which means pre-approval by Google is required.
How well does it work?
The design of the document makes a big difference. If the layout is plain and straightforward, the chances of accurate parsing are good, but a more elaborate layout can confuse the parser. We tried the invoice parser on an old invoice with fair results, but for some reason it ignored one of the invoice lines. It also decided that a couple of lines of marketing on the invoice were structured data to be extracted, producing a key called “Cheaper calls to friends,” for example.
The service did better on a plainer invoice, where it successfully read a line called VAT Registration No and parsed it as supplier_tax_id, exactly the kind of intelligent parsing that is useful.
Document AI could save a huge amount of manual effort, but some human checking also looks necessary. This is especially true for scanned documents which might be smudged or dirty.
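One common way to organise that human checking is to route low-confidence fields to a reviewer. The sketch below assumes the parser supplies a confidence score between 0 and 1 for each extracted field, which the article did not test; the class, the function and the 0.9 threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    key: str
    value: str
    confidence: float  # 0.0 to 1.0, assumed to be supplied by the parser


def split_for_review(fields, threshold=0.9):
    """Return (accepted, needs_review) lists split on a confidence threshold."""
    accepted = [field for field in fields if field.confidence >= threshold]
    needs_review = [field for field in fields if field.confidence < threshold]
    return accepted, needs_review
```

The threshold is a business decision: set it higher for scanned documents that may be smudged, and lower where an occasional error is cheap to correct later.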
Google is sensitive to worries about confidentiality and said that after processing on its service “the stored document is typically deleted right after the processing is done.”
The company added that “Google also temporarily logs some metadata about your Document AI API requests …to improve our service and combat abuse.” The data, said Google, is not used to train and improve the Document AI machine learning model.
Prices vary, but the form parser, for example, costs $65 per 1,000 pages, and the Document OCR processor costs $1.50 per 1,000 pages.
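As a back-of-envelope check on what that means for a bill, the sketch below assumes both rates are in US dollars and that charges scale linearly per page, with no volume tiers or free quota; Google's actual billing rules may differ, and the names are our own.

```python
from decimal import Decimal

# Rates per 1,000 pages as quoted above; assumed to be in US dollars.
RATE_PER_THOUSAND = {
    "form_parser": Decimal("65"),
    "document_ocr": Decimal("1.50"),
}


def estimate_cost(processor: str, pages: int) -> Decimal:
    """Estimate the charge for a page count, assuming simple pro-rata pricing."""
    return RATE_PER_THOUSAND[processor] * pages / 1000
```

On these assumptions, 10,000 pages would cost $650 through the form parser against $15 for plain OCR, so it pays to send only genuine forms to the pricier processor.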
Finally, note that AWS has a similar service called Textract, while Microsoft Azure has an “AI-powered document extraction service” called Form Recognizer, so it looks like there is an element of catch-up in Google Cloud’s latest move.