OCR Zones best practice

 

-Remember that during day-to-day scanning, paper documents will rarely be presented to the scanner in exactly the same position as when you scanned the document template. Zone OCR is therefore much more reliable if you allow for positioning error when drawing your zone box, by leaving white space around the reference you are trying to read.

 

-If your document has pre-printed boxes on it, try to exclude vertical lines from the zone to be read by OCR, as these can be interpreted as letters or numbers.

 

-The default Scan2x OCR functionality becomes more reliable when it is presented with a block of text to recognize, and not just a handful of letters. For example, to OCR the invoice number in the document at right, draw the box as shown. This gives the OCR engine enough characters to work with, while keeping the box small enough to avoid slowing down the entire operation. See the next section for details of how to isolate the invoice number from the rest of the text.

 

 

 

 

Once a metadata field has been linked to a zone for OCR, it is possible to qualify the results that you expect from the OCR operation by selecting one of the following options in the dropdown highlighted in the red box in the above screenshot:

 

Option

Description

Generic (Default)

The Generic setting accepts the entire text generated by the OCR process.

 

Text Only

Specifies that text is expected. The function therefore substitutes letters for digits where applicable, to improve accuracy. These substitutions include lowercase “L” where 1 is read, “b” where 6 is read, “B” where 8 is read and “P” where 9 is read.

 

Numbers Only

Specifies that only digits are expected. The function therefore substitutes digits for letters where applicable, to improve accuracy. These substitutions include 1 where lowercase “L” is read, 6 where “b” is read, 8 where “B” is read and 9 where “P” is read.

Decimals

Tells Scan2x to keep the period (“.”) and comma (“,”) characters in the captured text instead of removing them.

Date (Default)

Instructs Scan2x to expect text in the form of a date. The system will attempt to change the date it finds into a standardised format.

Date (US format)

As Date (Default), but expects dates in MM-DD-YY format.

Date (UK format)

As Date (Default), but expects dates in DD-MM-YY format.

 

Handwriting

When using the iDRS OCR Engine, you can select this option to enable handwriting recognition. Please read the additional guidelines below before attempting to recognize handwriting.

 

Handwriting (Boxed)

When using the iDRS OCR Engine, you can select this option to enable handwriting recognition of boxed letters. This is text that has been printed with individual letters in pre-printed boxes on a form. Please read the additional guidelines below before attempting to recognize handwriting.

Barcode

Draw a zone around a barcode to have Scan2x read that barcode.

QR Code

Draw a zone around a QR Code to have Scan2x read that QR Code.

 

Barcode or QR Code

Use this feature when documents could include either a barcode or a QR Code in the same position. This has been used in a project that scanned past documents (on which barcodes were printed) together with current documents, on which QR Codes are now printed.

Barcode Type

Use this feature when documents include a barcode and you would like to know which specific type of barcode is used in the document.

Aztec Code

Aztec Codes are a form of 2D barcode.

Data Matrix

Data Matrix codes are a form of 2D barcode.

PDF417

PDF417 codes are a form of 2D barcode, commonly used by the airline industry on some boarding card formats.

ID Card MRZ

Instructs Scan2x to attempt to locate structured data within a Machine Readable Zone (MRZ), performing all the standard checksum validations.

Using the MRZ selection, it is possible to instruct Scan2x to search an entire zone for the presence of an MRZ. Scan2x will automatically detect this data format, isolate it from the rest of the content of the document and process it into its component parts.

Cheque MICR

Enables Scan2x to read the Magnetic Ink Character Recognition (MICR) data on a cheque.

Table

Scan2x can process data presented on a document in tabular form and split it into rows and columns. This data is saved by Scan2x in XML format for further processing by downstream systems. An example of the use of this functionality is the line item extraction of data from invoices.

 

Table zones can be set to adjust their position and size dynamically on the document to span content that is typically different on each document submitted. For example, a batch of invoices may vary between those containing one line item and those containing multiple line items spanning many pages. When the Anchor functionality is used to detect the start and end of the table on each document, Scan2x can accurately capture documents of any length automatically.

Table (with headers)

As Table, but selecting this option instructs Scan2x to treat the first row of the OCR'd table as a header row.

Table (flipped)

Scan2x will process a table like the example below

   

and from the above will create a conventional table like the one below.

 

Table (with headers, flipped)

Scan2x will process a table like the example below

   

and from the above will create a conventional table like the one below.

 

RAW OCR Data

This option outputs the raw OCR Engine data for the text analysed in this OCR Zone. Data about each character recognised together with its position found in the zone is included in the output. Normally used in advanced integration projects.

 

When you select this option, the field does not return the string value of the OCR zone you specify, but rather an XML data structure listing every character found by OCR together with its position on the page in pixels.
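The Text Only and Numbers Only options in the list above describe a simple character substitution. The following Python sketch illustrates the idea using only the four digit/letter pairs named in the descriptions. The guide says the substitutions "include" these pairs, so the real list may be longer, and the function names here are hypothetical, not part of Scan2x.

```python
# Digit/letter pairs named in the Text Only and Numbers Only descriptions.
# Only these four pairs come from the guide; Scan2x may apply others.
DIGIT_TO_LETTER = {"1": "l", "6": "b", "8": "B", "9": "P"}
LETTER_TO_DIGIT = {letter: digit for digit, letter in DIGIT_TO_LETTER.items()}


def as_text_only(ocr_text: str) -> str:
    """Hypothetical helper: replace digits with the letters they resemble."""
    return ocr_text.translate(str.maketrans(DIGIT_TO_LETTER))


def as_numbers_only(ocr_text: str) -> str:
    """Hypothetical helper: replace letters with the digits they resemble."""
    return ocr_text.translate(str.maketrans(LETTER_TO_DIGIT))


if __name__ == "__main__":
    print(as_text_only("He11o"))    # digits 1 become lowercase L
    print(as_numbers_only("l23B"))  # l becomes 1, B becomes 8
```

Because the mapping is applied to every matching character, choose Text Only or Numbers Only only for zones that are known to contain purely alphabetic or purely numeric data.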
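The ID Card MRZ option above mentions checksum validation without describing it. As an illustration only, the following Python sketch implements the check digit calculation defined for Machine Readable Zones in ICAO Doc 9303 (weights 7, 3, 1 repeated, result taken modulo 10). It is an assumption that Scan2x applies this calculation; the function names are hypothetical.

```python
def mrz_check_digit(field: str) -> int:
    """Compute the ICAO Doc 9303 check digit for one MRZ field.

    Digits count as their own value, A-Z count as 10-35, and the
    filler character '<' counts as 0.
    """
    weights = (7, 3, 1)
    total = 0
    for index, char in enumerate(field):
        if "0" <= char <= "9":
            value = int(char)
        elif "A" <= char <= "Z":
            value = ord(char) - ord("A") + 10
        elif char == "<":
            value = 0
        else:
            raise ValueError(f"invalid MRZ character: {char!r}")
        total += value * weights[index % 3]
    return total % 10


def mrz_field_is_valid(field: str, check_digit: str) -> bool:
    """Hypothetical helper: compare a field against its printed check digit."""
    return check_digit.isdigit() and mrz_check_digit(field) == int(check_digit)


if __name__ == "__main__":
    # Date field in YYMMDD form followed by its check digit.
    print(mrz_field_is_valid("740812", "2"))
```

A field that fails this comparison usually indicates a misread character, which is why MRZ data is more dependable than free text captured from the same document.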

 

The OCR Engine dropdown allows the administrator to select which OCR Engine to use for a specific metadata field. If the option is left as 'Field Default', Scan2x uses the OCR Engine that is specified in the Scan Settings tab.

Once an OCR Engine is selected, the OCR Language option becomes available. This option allows administrators to choose the language of the text to be recognized.

 

Once the expected result of the OCR process has been defined, it is also possible to instruct Scan2x to keep the entire result or only a portion of it. 

Selecting First Line Only in the drop box shown on the right in the screenshot below instructs Scan2x to populate the metadata field with only the first line returned by the OCR process and discard the rest. Last Line Only keeps only the last line and discards everything before it. All in 1 Line tells Scan2x to put all captured text into one large string; this is commonly used to capture addresses from documents.

 

 

 

Finally, checking the Remove All Blank Spaces option instructs Scan2x to delete all <space> characters from the OCR result. For example, the result “INV 1234” will be converted to “INV1234”.
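The line selection and space removal options described above can be pictured as a small post-processing step on the OCR result. The Python sketch below is illustrative only: the function and mode names are hypothetical, and it assumes that blank lines are ignored and that All in 1 Line joins lines with a single space, neither of which is stated in this guide.

```python
def select_lines(ocr_text: str, mode: str = "entire") -> str:
    """Hypothetical helper mirroring the line selection drop box."""
    if mode == "entire":
        return ocr_text
    # Assumption: empty lines are skipped and surrounding whitespace trimmed.
    lines = [line.strip() for line in ocr_text.splitlines() if line.strip()]
    if not lines:
        return ""
    if mode == "first_line_only":
        return lines[0]
    if mode == "last_line_only":
        return lines[-1]
    if mode == "all_in_1_line":
        # Assumption: lines are joined with one space.
        return " ".join(lines)
    raise ValueError(f"unknown mode: {mode!r}")


def remove_all_blank_spaces(text: str) -> str:
    """Delete every space character, as in the INV 1234 example."""
    return text.replace(" ", "")


if __name__ == "__main__":
    result = "INV 1234\n12 High Street\nLondon"
    first = select_lines(result, "first_line_only")
    print(remove_all_blank_spaces(first))
```

Applying line selection before space removal, as in the usage lines, keeps a reference number such as an invoice number free of text from neighbouring lines.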

 

The OCR Rotation option allows Scan2x to OCR text that is printed vertically up or down the page. This option is only available in the full version of Scan2x.

The OCR the whole page option puts all text on the page into a metadata field. This option is disabled if an anchor has been defined for the chosen metadata field.

 

 

 

Copyright © 2022 Avantech Software