
Familiarize the Document

Last Updated: Jan 21, 2021


When the Doc Reader node is executed, the document is processed and the Doc Reader extracts its fields.
A user interface is provided for reviewing and verifying the machine interpretations.
You can point to and correct both the labels and the data, which trains the Doc Reader to extract the correct fields from the document. The resulting rule-based model is applied to all documents of a similar format.

On the Datasets listing page, click the name of the Document Table selected in the Properties of the Doc Reader node.
Click the icon next to the processed row. The Document familiarization window opens.

The PDF is displayed in the left-hand side panel. The fields predicted by the Doc Reader are auto-populated in the right-hand side panel.

If any field predicted by the Doc Reader needs correction, or if a field was not predicted by the Doc Reader, do the following:

Click the field in the right-hand side panel that needs to be corrected and select the required Label from the PDF in the left panel.
Correct the value of the fields by selecting the required value from the PDF in the left-hand side panel.

If the Invoice Date field value is not captured correctly, click the Invoice Date field value in the right panel and select the required value for Invoice Date in the PDF from the left panel. The correct Invoice Date is now extracted.

Data Capture Rule

Additional extraction rules are needed in scenarios where the field occurs:

on multiple pages, and you need to extract the first or last occurrence of the field.
at multiple places on a page, and you need to provide a reference position to extract the field.

You can specify these additional rules using the Data Capture Rule option.

Click the icon next to the field in the right panel for which you want to add additional rules. The Data Capture Rule window opens for that field.

Data Capture Rule for Invoice Date

Either of the following Data Capture Rules can be applied.

Occurrence of the Data

You can choose the occurrence of the data from the drop-down. It can be either First Occurrence or Last Occurrence.

Consider a PDF with ten pages that has a Total Amount on every page. If Last Occurrence is selected in the drop-down, the Total Amount on the last page is extracted.
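The occurrence rule can be pictured as a selection over the per-page matches for a field. The following is a minimal sketch and not the Doc Reader's actual implementation; the `pick_occurrence` function, the `(page, value)` data shape, and the sample amounts are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical data shape: each match is (page_number, extracted_value).
Match = Tuple[int, str]


def pick_occurrence(matches: List[Match], rule: str) -> Optional[str]:
    """Return the value from the first or last page on which the field was found."""
    if not matches:
        return None
    ordered = sorted(matches, key=lambda m: m[0])  # order by page number
    if rule == "First Occurrence":
        return ordered[0][1]
    if rule == "Last Occurrence":
        return ordered[-1][1]
    raise ValueError(f"Unknown rule: {rule}")


# Ten-page PDF with a Total Amount on every page (made-up values).
totals = [(page, f"{page * 100}.00") for page in range(1, 11)]
print(pick_occurrence(totals, "Last Occurrence"))  # 1000.00
```

With First Occurrence, the same call would return the value found on page 1 instead.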

Relative Reference

If a document has multiple occurrences of labels that you are extracting, use this option to identify the one to be extracted.

Consider a PDF where the GST Number occurs at multiple places. Use this option to extract the desired GST Number, in this case the one displayed below the Address.
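One way to think about a relative reference is as a nearest-neighbor choice among the candidate occurrences, constrained to a direction from a reference label. The sketch below is an illustration under stated assumptions, not the Doc Reader's algorithm: the `Word` class, the `pick_below` function, the coordinate convention (y grows downward), and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    x: float  # left edge of the text
    y: float  # top edge; larger y means lower on the page (assumed)


def pick_below(candidates: List[Word], reference: Word) -> Optional[Word]:
    """Return the candidate nearest to the reference among those below it."""
    below = [c for c in candidates if c.y > reference.y]
    if not below:
        return None
    # Prefer the smallest vertical gap, then the smallest horizontal gap.
    return min(below, key=lambda c: (c.y - reference.y, abs(c.x - reference.x)))


address = Word("Address", x=50, y=200)
gst_numbers = [
    Word("GST-HEADER-001", x=400, y=40),
    Word("GST-BELOW-ADDRESS-002", x=50, y=260),
    Word("GST-FOOTER-003", x=50, y=780),
]
chosen = pick_below(gst_numbers, address)
print(chosen.text)  # GST-BELOW-ADDRESS-002
```

The header occurrence is ignored because it sits above the Address, and the footer occurrence loses to the closer one.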

Manual Review

Use Manual Review when the document being processed is not in the desired format and you want to exclude it from the automatic category identification algorithm.

Click the Manual Review button in the document familiarization window to skip the document training process. All the fields become non-editable.

The Doc Reader is not trained on that document, and the status is updated to MANUAL_INTERVENTION_FOR_REVIEW.

Familiarizing Inline table

If the document has tables, Doc Reader automatically identifies the tabular structures and extracts the contents as tables in the right-hand side panel under Line-Item details.

Verify that the predicted columns are correct. If they are not, familiarize the correct columns from the left panel.
Click the icon and select the End of the table option. Select the immediate text below the table from the left panel.
Click the Update Data button. Table data is extracted from the PDF and populated in the table in the right panel.

Row and Column Definition

If the rows and columns in the table of the document are not aligned properly, Doc Reader cannot identify the rows of the table correctly.

You can use Row and Column Definition to identify the rows and columns for extraction. Based on the parameters provided, the rows are marked, and Doc Reader identifies the row to be extracted.

Click the icon and select the Row and Column Definition option. The Row and Column Definition window is displayed in the right panel.
1. Key Column: Any column that is properly aligned can be selected as the reference or row marker.
2. Alignment: You can select Top or Bottom from the drop-down.
  - Top: Row marker starts from the top of the text in the Key Column record and extends to the top of the text in the next record.
  - Bottom: Row marker starts from the bottom of the text in the Key Column record and extends to the bottom of the text in the previous record.
3. Column Starts After: Defines the offset for the row marker. This is used when the data in the columns is misaligned with the Key Column data.
  Depending on the position of data in the table, the row lines are automatically captured. If the row lines are not separating the rows correctly, you can use this option to define the exact location of the row separator.
Provide the values for the required parameters (Key Column, Alignment, and Column Starts After) based on the alignment of data in the table.
Click the Update Data button. The rows are identified, and the data is updated in the table.

In the PDF below, the Quantity column is selected in the Key Column field, and Top is selected in the Alignment field.
Row marker starts from the top of each text in the Quantity column and extends to the top of the text in the next record.
The rows are correctly identified, and the Description column is also correctly displayed.

Change the Alignment field to Bottom. Row marker starts from the bottom of each text in the Quantity column and extends to the bottom of the text in the previous record. The rows are identified, but the Description column is misaligned. So, for this PDF, Top must be selected in the Alignment field.
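The difference between the two alignments can be sketched in code. This is a simplified model under stated assumptions, not the Doc Reader's implementation: `row_bands` and `assign_rows` are hypothetical helpers, y-coordinates are assumed to grow downward, the coordinates are made up, and the Column Starts After offset is not modeled.

```python
from typing import List, Optional, Tuple

Band = Tuple[float, float]  # (start_y, end_y) of one table row


def row_bands(key_cells: List[Tuple[float, float]], alignment: str,
              table_top: float, table_bottom: float) -> List[Band]:
    """Build one vertical band per Key Column cell.

    key_cells holds (top, bottom) y-coordinates sorted from top to bottom.
    """
    bands = []
    for i, (top, bottom) in enumerate(key_cells):
        if alignment == "Top":
            # From the top of this record to the top of the next record.
            start = top
            end = key_cells[i + 1][0] if i + 1 < len(key_cells) else table_bottom
        elif alignment == "Bottom":
            # From the bottom of the previous record to the bottom of this record.
            start = key_cells[i - 1][1] if i > 0 else table_top
            end = bottom
        else:
            raise ValueError(f"Unknown alignment: {alignment}")
        bands.append((start, end))
    return bands


def assign_rows(line_tops: List[float], bands: List[Band]) -> List[Optional[int]]:
    """Map each text line (by its top y) to the index of the band containing it."""
    return [
        next((i for i, (start, end) in enumerate(bands) if start <= y < end), None)
        for y in line_tops
    ]


# Quantity cells (Key Column) for three records, as (top, bottom).
quantity = [(100, 110), (150, 160), (200, 210)]
# Description lines: record 1 wraps onto three lines, record 2 onto two.
description_tops = [100, 112, 124, 150, 162, 200]

top_rows = assign_rows(description_tops, row_bands(quantity, "Top", 90, 230))
bottom_rows = assign_rows(description_tops, row_bands(quantity, "Bottom", 90, 230))
print(top_rows)     # [0, 0, 0, 1, 1, 2]
print(bottom_rows)  # [0, 1, 1, 1, 2, 2]
```

In this made-up layout the wrapped Description lines hang below the first line of each record, so Top keeps them with their own record while Bottom pushes them into the next one, which mirrors the misalignment described above.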

SAVE AND APPROVE

After all the required fields are familiarized and verified, click the SAVE AND APPROVE button to save the changes and approve the category of the processed document.

The extracted fields are populated into the Document Table and the status is updated to EXTRACTED_SUCCESSFULLY.

Click the view button to view the data in the Inline Table.

Behavior when Similar Template Document is Processed

When a document with a similar template is processed, it is assigned the same category as the already approved document. The existing rule-based model for the category is applied, and the status is automatically set to EXTRACTED_SUCCESSFULLY. All the required data from the document is extracted and auto-populated in the Document Table.

Base Document

The previously approved PDF for the category, referred to as the Base Document, is displayed on the Document Familiarization page along with the currently processed PDF.

The currently processed PDF document is auto-approved, and the Approved Date is displayed.

Auto Processing

You can enable the Auto Processing option for Document Tables with predefined schemas. The values for the predefined columns are predicted automatically by the prebuilt ML models. Using the predicted data, the documents are auto-approved without any manual intervention. The status is updated to EXTRACTED_SUCCESSFULLY automatically.