Capturing consistent, normalized tabular data from different documents
When a metadata field is defined as a table, you can define its columns and give each one your own column name. This is especially useful when capturing tabular data from different document formats.
Let’s take invoices as an example. Different companies’ invoices will be laid out differently and will have differing types of tables for the line items on the invoice.
Even columns containing the same information (e.g., quantity, or item description) will have differing column headers.
If the metadata table is left without a definition, the table headings output as metadata when the document is saved are the column names as they appear on the source document.
To ensure consistent column headers, Scan2x lets you define your own column names and then map the columns on the invoice to them.
If you are using the metadata captured from the invoices to import into a downstream application, having your tabular data normalized into one standard table format is essential.
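The normalization idea described above can be pictured with a short Python sketch. This is only an illustration and not how Scan2x works internally: the header names, the HEADER_MAP dictionary, and the normalize_rows function are all hypothetical.

```python
# Hypothetical mapping from headers found on different invoices
# to one standard set of column names.
HEADER_MAP = {
    "Qty": "Quantity",
    "Quantity Ordered": "Quantity",
    "Item": "Description",
    "Item Description": "Description",
}


def normalize_rows(rows, header_map):
    """Rename the keys of each row; headers not in the map pass through unchanged."""
    return [
        {header_map.get(header, header): value for header, value in row.items()}
        for row in rows
    ]


# Two invoices with different column headers for the same information.
invoice_a = [{"Qty": "2", "Item": "Widget"}]
invoice_b = [{"Quantity Ordered": "5", "Item Description": "Gadget"}]

# After normalization, both invoices share the same column names.
normalized = normalize_rows(invoice_a, HEADER_MAP) + normalize_rows(invoice_b, HEADER_MAP)
```

A downstream import then only has to handle the Quantity and Description columns, whichever layout the source invoice used.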
Click on the Define Table button in the Jobs Manager metadata tab – this appears when you select a metadata field defined as a table. See the following screenshot.
When you click the Define Table button, a grid appears like the one in the screenshot below. The screenshot shows the table definition for the Delivery Notes job included with the Scan2x demo.
Add columns to your column definitions by either:
•Clicking on the Add all columns to definitions button as shown in the first screenshot below. Since there are 5 columns in the grid below, once this button is clicked, 5 column fields will automatically be added to the Column Definitions tab.
•Entering each one in the Column Definitions tab, as shown in the second screenshot below.
Sorting. You can use this value to sort your table columns as required.
Field. Enter the name of your table column.
Type. Choose between “Custom” and “Expression (VBScript)”. Custom is the default: it takes the OCR’ed data exactly as read from the document. An expression lets you use VBScript to enhance or enrich the OCR’ed data.
Map to OCR Column. Use this value to indicate to Scan2x which document column you would like to map to your table.
Line Mode. You can specify how much of the data read from the document is to be inserted into your column.
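As a rough mental model of the Sorting, Field, and Map to OCR Column values, the Python sketch below treats each column definition as a small record and uses the definitions to build one output row. It is an assumption-laden illustration, not Scan2x's implementation: the ColumnDefinition class, the apply_definitions function, and the sample headers are hypothetical, and the Type and Line Mode values are left out.

```python
from dataclasses import dataclass


@dataclass
class ColumnDefinition:
    """Hypothetical stand-in for one row of the Column Definitions tab."""
    sorting: int             # position of the column in the output table
    field: str               # your own column name
    map_to_ocr_column: str   # header of the document column to read from


def apply_definitions(ocr_row, definitions):
    """Build one output row: columns ordered by 'sorting', values taken from the mapped OCR column."""
    ordered = sorted(definitions, key=lambda d: d.sorting)
    # A mapped column missing from the OCR row yields an empty string.
    return {d.field: ocr_row.get(d.map_to_ocr_column, "") for d in ordered}


definitions = [
    ColumnDefinition(sorting=2, field="Quantity", map_to_ocr_column="Qty"),
    ColumnDefinition(sorting=1, field="Description", map_to_ocr_column="Item"),
]

ocr_row = {"Qty": "3", "Item": "Widget", "Notes": "fragile"}
output_row = apply_definitions(ocr_row, definitions)
```

In this sketch, OCR columns that have no definition (Notes here) are simply not carried into the output row.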
As shown in the screenshot above, in the OCR Table section of the Table Definition Builder, an administrator can control how the table data is displayed using two options: Rows per record and Separate columns of merged rows.
•Rows per record
This option allows the administrator to show the tabular data in a specified number of rows.
For example, the first screenshot below shows each record in its own row. In the second screenshot, with Rows per record set to '2', the records are shown two to a row.
•Separate columns of merged rows
This option allows administrators to separate columns of any merged rows.
For example, this is useful for invoices where, because of the way the invoice is structured, the OCR results contain 2 rows per line item (i.e., 1 line item of data spread across 2 rows). With this option, administrators can specify that, for example, every 2 rows are actually 1 row of data.
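The idea that every 2 OCR rows are really 1 row of data can be sketched in Python as follows. This only illustrates the concept and does not reproduce Scan2x's behaviour: the merge_rows function is hypothetical, and joining the cells of each column with a space is an assumption made for the example.

```python
def merge_rows(ocr_rows, rows_per_record):
    """Combine every 'rows_per_record' consecutive OCR rows into one record.

    Assumes all rows have the same number of cells. Cells in the same
    column are joined with a space, and empty cells are skipped.
    """
    if rows_per_record < 1:
        raise ValueError("rows_per_record must be at least 1")
    records = []
    for start in range(0, len(ocr_rows), rows_per_record):
        group = ocr_rows[start:start + rows_per_record]
        # zip(*group) walks the group column by column.
        merged = [" ".join(cell for cell in cells if cell) for cells in zip(*group)]
        records.append(merged)
    return records


# Each line item is spread across 2 OCR rows: the second row only
# continues the description column.
ocr_rows = [
    ["1", "Widget", "2"],
    ["", "blue, large", ""],
    ["2", "Gadget", "5"],
    ["", "red", ""],
]

records = merge_rows(ocr_rows, rows_per_record=2)
```

With rows_per_record set to 1, the rows come back unchanged, which corresponds to the case where each line item already sits on a single row.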