Too frequently in the technology world we assume that everyone already knows all the lingo for whatever topic we happen to be talking about. As much as I try to avoid speaking in jargon, even I am guilty of not correctly establishing the basic concepts and vocabulary.
In the Content Management world, the difference between the terms Capture and Advanced Capture is seldom discussed. Hopefully this post will clear things up.
History of Capture
When you hear the word Capture, vendors are usually talking about scanning software that allows you to convert paper documents to electronic formats, usually PDF or TIFF, and apply tags or metadata so that you can find the documents at a later date.
Capture’s Modest Beginning
In the beginning, Capture consisted of scanning paper and storing the electronic documents on a file share. The only metadata was typically the directory structure. You might have a top level folder for Accounting, with sub-folders for Invoices, Purchase Orders, and Expense Reports. Under those folders, you might have a folder for the Year and Month.
This structure might work well if you have very few clients and vendors, but it quickly becomes unwieldy if you need to find all the documents for one vendor, because they are scattered across every year and month folder. Needs quickly evolved as different business users required better ways to find documents based on their specific use cases.
Capture Classification and Indexing
The next obvious evolution in capture was to have the scan user classify each document (what type of document is this?) and apply metadata to it (what is important about this document?) so that business users can find the documents when they need them.
This methodology is called “Key from Image” and is still very popular in low-volume environments. Once the documents have been classified and the metadata applied, the documents are usually stored in an Enterprise Content Management system such as OnBase, Alfresco, FileNet, Documentum, SharePoint, or Nuxeo instead of on a file server. These systems provide search functionality, document workflow automation, and document life-cycle management.
As document volumes continued to grow, it became expensive to maintain teams of scanning and indexing users. Advanced Capture solutions such as Kofax and Ephesoft were introduced to automatically classify the documents as well as extract the meaningful information from them.
These systems are usually set up so that humans still review the document classifications and extracted metadata for any documents that the system is not completely confident about.
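The review step described above can be sketched as a simple confidence check. This is only an illustration: the 0.85 threshold, the ExtractedField structure, and the function names are assumptions made for the example, not the behavior of Kofax, Ephesoft, or any other product.

```python
from dataclasses import dataclass

# Hypothetical cutoff; real products expose their own tuning settings.
REVIEW_THRESHOLD = 0.85


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0, as reported by the recognition engine


def low_confidence_fields(fields, threshold=REVIEW_THRESHOLD):
    """Return the names of fields the engine was not confident about."""
    return [f.name for f in fields if f.confidence < threshold]


def needs_review(doc_type_confidence, fields, threshold=REVIEW_THRESHOLD):
    """Send a document to a person if the classification or any field is doubtful."""
    if doc_type_confidence < threshold:
        return True
    return len(low_confidence_fields(fields, threshold)) > 0
```

A document whose classification and fields all score above the threshold would flow straight through, while anything else would land in a review queue with the doubtful fields highlighted.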
Common Advanced Capture Features
There are a variety of Advanced Capture solutions on the market, many focusing on a specific niche or optimizing themselves for specific applications. Most Advanced Capture solutions provide some level of the following features.
Document Classification answers the question, “What is it?” Is this an Invoice, an I-9, a W-2, a Purchase Order, or a Resume? Most systems require some level of training, whereby someone defines the document types and provides the system with samples of those documents.
Various solutions may use text, images, barcodes or a combination of these to determine what type of document they are looking at.
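To make the idea of training on samples concrete, here is a deliberately tiny text-only sketch. It assumes the page text is already available as a string, and the word-overlap score is a stand-in chosen for brevity; commercial products use far more sophisticated models, and none of these function names come from a real product.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train(samples):
    """Build a word-frequency profile per document type.

    samples maps a document type name to a list of sample texts.
    """
    profiles = {}
    for doc_type, texts in samples.items():
        counts = Counter()
        for text in texts:
            counts.update(tokenize(text))
        profiles[doc_type] = counts
    return profiles


def classify(text, profiles):
    """Return (best_type, score), where score is the fraction of the
    document's words that also appeared in that type's samples."""
    words = tokenize(text)
    if not words:
        return None, 0.0
    best_type, best_score = None, 0.0
    for doc_type, profile in profiles.items():
        score = sum(1 for w in words if w in profile) / len(words)
        if score > best_score:
            best_type, best_score = doc_type, score
    return best_type, best_score
```

The score doubles as a rough confidence value, which is what lets a system decide whether a person needs to look at the result.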
Once the type of document is determined, the system will extract metadata using one or more of the following methods.
OCR (Optical Character Recognition)
OCR is used to extract machine-printed characters. The words you are reading now are examples of machine-printed characters that could be OCR’d. Advanced Capture solutions can use position, relative position, or data structure to extract the data.
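The three extraction approaches mentioned above (position, relative position, and data structure) can be sketched as follows. The example assumes the recognition engine has already returned a list of words with pixel coordinates; the Word structure and the function names are hypothetical and exist only for this illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: int  # left edge in pixels
    y: int  # top edge in pixels


def by_position(words, left, top, right, bottom):
    """Fixed zone: join the words whose origin falls inside a rectangle."""
    hits = [w for w in words if left <= w.x < right and top <= w.y < bottom]
    return " ".join(w.text for w in sorted(hits, key=lambda w: (w.y, w.x)))


def by_relative_position(words, label, max_dy=5):
    """Anchor on a label and take the nearest word to its right on the same line."""
    for anchor in words:
        if anchor.text == label:
            right = [w for w in words
                     if abs(w.y - anchor.y) <= max_dy and w.x > anchor.x]
            if right:
                return min(right, key=lambda w: w.x).text
    return None


def by_data_structure(words, pattern):
    """Pattern match: return the first word that fully matches a regex."""
    for w in words:
        if re.fullmatch(pattern, w.text):
            return w.text
    return None
```

Fixed zones suit forms that never change layout, label anchoring tolerates fields that shift around the page, and pattern matching works when the value has a recognizable shape such as a date.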
ICR (Intelligent Character Recognition)
ICR is used to extract hand-printed characters. If you fill out a form by hand, ICR can be used to determine what you wrote. As with OCR, Advanced Capture solutions can use position, relative position, or data structure to extract the data.
OMR (Optical Mark Recognition)
OMR is used for checked or filled boxes and circles. The standardized tests you took in school were graded using OMR technology.
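One common way to think about mark detection is to measure how much of a box or bubble is dark. The sketch below assumes the region has already been cropped and reduced to rows of 0 (white) and 1 (dark) pixels; the 0.4 threshold is an arbitrary value picked for the example, not a standard.

```python
def fill_ratio(region):
    """Fraction of dark pixels in a region given as rows of 0 (white) or 1 (dark)."""
    total = sum(len(row) for row in region)
    if total == 0:
        return 0.0
    dark = sum(sum(row) for row in region)
    return dark / total


def is_marked(region, threshold=0.4):
    """Treat the box or bubble as marked when enough of it is dark."""
    return fill_ratio(region) >= threshold
```

In practice the threshold has to leave room for the printed outline of the box and for stray marks, which is one reason borderline regions are often sent to a person for review.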
Database-Assisted Extraction
Sometimes Advanced Capture solutions will use information they extract from a form and add additional metadata using a database query. For example, if the system reads an account number off of a form, it may be able to match that to a specific customer or vendor using a database lookup.
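A database lookup of this kind might look like the sketch below, which uses Python's built-in sqlite3 module with an in-memory database. The vendors table, its columns, and the sample rows are all invented for the illustration; a real deployment would query whatever accounting or ERP system holds the master data.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical vendors table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE vendors ("
        "account_number TEXT PRIMARY KEY, vendor_name TEXT, city TEXT)"
    )
    conn.executemany(
        "INSERT INTO vendors VALUES (?, ?, ?)",
        [("100234", "Acme Supply", "Denver"),
         ("100877", "Globex Paper", "Austin")],
    )
    conn.commit()
    return conn


def enrich(metadata, conn):
    """Return a copy of the metadata with vendor fields added on a match."""
    enriched = dict(metadata)
    row = conn.execute(
        "SELECT vendor_name, city FROM vendors WHERE account_number = ?",
        (metadata.get("account_number"),),
    ).fetchone()
    if row is not None:
        enriched["vendor_name"], enriched["vendor_city"] = row
    return enriched
```

When no row matches, the metadata comes back unchanged, which is itself a useful signal that the account number may have been misread.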
Most Advanced Capture solutions have some form of automated validation. Validation can answer questions like “Is the Employer ID in a valid format?” and “Did they select too few or too many check boxes?”
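The two validation questions above could be expressed as small rule functions. This sketch assumes that the Employer ID is a US Employer Identification Number written as two digits, a hyphen, and seven digits; the function names and the minimum and maximum parameters are made up for the example.

```python
import re

# Assumed format: two digits, a hyphen, seven digits (for example 12-3456789).
EIN_PATTERN = re.compile(r"\d{2}-\d{7}")


def valid_employer_id(value):
    """Check only the format of the value, not whether the ID actually exists."""
    return EIN_PATTERN.fullmatch(value.strip()) is not None


def valid_selection_count(checked, minimum=1, maximum=1):
    """Check that the number of ticked boxes falls within the allowed range."""
    count = sum(1 for box in checked if box)
    return minimum <= count <= maximum
```

Rules like these only confirm that a value is plausible. Confirming that it is correct usually takes a database lookup or a human reviewer.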
Although it isn’t typically possible to read cursive handwriting, it is possible to analyze a signature area to see if a document has been signed.
Most Advanced Capture solutions are capable of reading 1D barcodes, like the one on a box of cereal, as well as 2D barcodes, like the ones you see on a FedEx or UPS package. The barcode data may be used for document classification or for metadata.