Last Updated: Jan 21, 2021
When the Doc Reader node is executed, the document is processed and the Doc Reader extracts its fields. A user interface is provided for confirming and verifying the machine interpretations. You can point to a value and correct it, for both labels and data, which trains the Doc Reader to extract the correct fields from the document. The resulting rule-based model is applied to all documents of a similar format.
The PDF is displayed in the left-hand side panel. The fields predicted by the Doc Reader are auto-populated in the right-hand side panel.
You can correct the fields predicted by the Doc Reader and capture any fields that it did not predict.
If the Invoice Date field value is not captured correctly, click the Invoice Date field value in the right panel and select the required value for Invoice Date in the PDF from the left panel. The correct Invoice Date is now extracted.
In scenarios where the field occurs:
These additional rules can be specified for extraction, using the Data Capture Rule option.
Click the icon next to the field in the right panel for which you want to add additional rules. The Data Capture Rule window opens for that field.
Data Capture Rule for Invoice Date
Either of the following Data Capture Rules can be applied.
You can choose the occurrence of the data from the drop-down. It can be either First Occurrence or Last Occurrence.
Consider a ten-page PDF with a Total Amount on every page. If Last Occurrence is selected in the drop-down, the Total Amount on the last page is extracted.
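The occurrence rule can be pictured with a short sketch. This is not the Doc Reader's implementation; the `pick_occurrence` function, the `(page, value)` tuples, and the amounts are hypothetical and exist only to illustrate how First Occurrence and Last Occurrence differ.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: (page_number, extracted_value) pairs in reading order.
Match = Tuple[int, str]


def pick_occurrence(matches: List[Match], rule: str) -> Optional[Match]:
    """Return the first or last match, mirroring the drop-down choice."""
    if not matches:
        return None
    if rule == "First Occurrence":
        return matches[0]
    if rule == "Last Occurrence":
        return matches[-1]
    raise ValueError(f"Unknown rule: {rule}")


# A ten-page PDF with a Total Amount on every page (made-up values).
totals = [(page, f"{page * 100}.00") for page in range(1, 11)]

print(pick_occurrence(totals, "First Occurrence"))  # (1, '100.00')
print(pick_occurrence(totals, "Last Occurrence"))   # (10, '1000.00')
```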
If a document has multiple occurrences of labels that you are extracting, use this option to identify the one to be extracted.
Consider a PDF where the GST Number occurs in multiple places. Use this option to extract the desired GST Number, the one displayed below the Address.
Use Manual Review when the document being processed is not of the desired format and you want to exclude it from the auto category identification algorithm.
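One way to think about choosing among multiple occurrences is to anchor on a nearby label. The sketch below is an assumption about how such a rule could work, not the product's actual logic; the `Word` class, its coordinate fields, and the sample GST values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    page: int
    top: float   # distance from the top of the page; larger means lower on the page
    left: float


def nearest_below(anchor: Word, candidates: List[Word]) -> Optional[Word]:
    """Pick the candidate on the anchor's page that sits closest below it."""
    below = [c for c in candidates if c.page == anchor.page and c.top > anchor.top]
    return min(below, key=lambda c: c.top - anchor.top, default=None)


# Made-up layout: three GST Numbers on one page, only one directly below the Address.
address = Word("Address", page=1, top=200.0, left=50.0)
gst_numbers = [
    Word("GST-HEADER-001", page=1, top=40.0, left=400.0),
    Word("GST-BELOW-002", page=1, top=260.0, left=50.0),
    Word("GST-FOOTER-003", page=1, top=760.0, left=50.0),
]

chosen = nearest_below(address, gst_numbers)
print(chosen.text)  # GST-BELOW-002
```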
Click the Manual Review button in the document familiarization window to skip the document training process. All the fields become non-editable.
The Doc Reader is not trained on that document, and the status is updated to MANUAL_INTERVENTION_FOR_REVIEW.
If the document has tables, Doc Reader automatically identifies the tabular structures and extracts the contents as tables in the right-hand side panel under Line-Item details.
If the rows and columns in the table of the document are not aligned properly, Doc Reader cannot identify the rows of the table correctly.
You can use Row and Column Definition to identify the rows and columns for extraction. Based on the parameters provided, the rows are marked, and Doc Reader identifies the row to be extracted.
Click the icon and select the Row and Column Definition option. The Row and Column Definition window is displayed in the right panel.
Depending on the position of data in the table, the row lines are automatically captured. If the row lines are not separating the rows correctly, you can use this option to define the exact location of the row separator.
Provide values for the required parameters (Key Column, Alignment, and Column Starts After) based on the alignment of data in the table.
Click the Update Data button. The rows are identified, and the data is updated in the table.
In the PDF below, the Quantity column is selected as the Key Column field and Top is selected in the Alignment field. The row marker starts from the top of each text in the Quantity column and extends to the top of the text in the next record. The rows are correctly identified, and the Description column is also correctly displayed.
Change the Alignment field to Bottom. The row marker now starts from the bottom of each text in the Quantity column and extends to the bottom of the text in the previous record. The rows are identified, but the Description column is misaligned. So, for the PDF below, Alignment must be set to Top.
After all the required fields are familiarized and verified, click the SAVE AND APPROVE button to save the changes and approve the category of the document processed.
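The effect of the Alignment setting can be reproduced with a small sketch. It is only an illustration under assumed conditions: the `Cell` class, the coordinates, and the midpoint test used to assign text lines to rows are hypothetical and are not taken from the Doc Reader itself.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Cell:
    text: str
    top: float     # y of the upper edge, measured from the top of the page
    bottom: float  # y of the lower edge


def row_bands(key_cells: List[Cell], alignment: str,
              table_top: float, table_bottom: float) -> List[Tuple[float, float]]:
    """Build one (start, end) band per record from the Key Column cells."""
    cells = sorted(key_cells, key=lambda c: c.top)
    if alignment == "Top":
        # Each band runs from the top of a key cell to the top of the next one.
        starts = [c.top for c in cells]
        ends = starts[1:] + [table_bottom]
    elif alignment == "Bottom":
        # Each band runs from the bottom of the previous key cell to the bottom of this one.
        ends = [c.bottom for c in cells]
        starts = [table_top] + ends[:-1]
    else:
        raise ValueError(f"Unknown alignment: {alignment}")
    return list(zip(starts, ends))


def assign_rows(lines: List[Cell], bands: List[Tuple[float, float]]) -> List[List[str]]:
    """Place each text line in the band containing its vertical midpoint."""
    rows: List[List[str]] = [[] for _ in bands]
    for line in sorted(lines, key=lambda c: c.top):
        mid = (line.top + line.bottom) / 2
        for i, (start, end) in enumerate(bands):
            if start <= mid < end:
                rows[i].append(line.text)
                break  # lines outside every band are left unassigned
    return rows


# Made-up table: Quantity is the Key Column; each Description wraps onto a second line.
quantity = [Cell("2", 100, 110), Cell("5", 130, 140)]
description = [
    Cell("Widget", 100, 110), Cell("blue, large", 112, 122),
    Cell("Gadget", 130, 140), Cell("red", 142, 152),
]

print(assign_rows(description, row_bands(quantity, "Top", 95, 160)))
# [['Widget', 'blue, large'], ['Gadget', 'red']]
print(assign_rows(description, row_bands(quantity, "Bottom", 95, 160)))
# [['Widget'], ['blue, large', 'Gadget']]
```

In this made-up layout the wrapped description lines sit below their Quantity value, so Top keeps each description with its record while Bottom pushes the wrapped line into the following record.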
The extracted fields are populated into the Document Table and the status is updated to EXTRACTED_SUCCESSFULLY.
Click the view button to view the Data in the Inline Table.
When a document with a similar template is processed, it is assigned the same category as the already approved document. The existing rule-based model for the category is applied, and the status is automatically set to EXTRACTED_SUCCESSFULLY. All the required data from the document is extracted and auto-populated in the Document Table.
The earlier approved PDF for the category, referred to as the Base Document, is displayed on the Document Familiarization page along with the currently processed PDF.
The currently processed PDF document is auto-approved, and the Approved Date is displayed.
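The status handling described so far can be summarized in a short sketch. The function name, its parameters, and the idea of a set of approved categories are hypothetical; only the two status strings come from this page, and returning `None` stands in for "still needs familiarization" because no status name for that state is given here.

```python
from typing import Optional, Set


def resolve_status(category: Optional[str],
                   approved_categories: Set[str],
                   manual_review: bool = False) -> Optional[str]:
    """Illustrative status decision for a processed document."""
    if manual_review:
        # The user clicked Manual Review, so the document is skipped from training.
        return "MANUAL_INTERVENTION_FOR_REVIEW"
    if category is not None and category in approved_categories:
        # A Base Document exists for this category, so its rules are reused.
        return "EXTRACTED_SUCCESSFULLY"
    # No approved category yet: the document still has to be familiarized.
    return None


approved = {"vendor_a_invoice"}  # made-up category name
print(resolve_status("vendor_a_invoice", approved))        # EXTRACTED_SUCCESSFULLY
print(resolve_status("vendor_b_invoice", approved))        # None
print(resolve_status("vendor_b_invoice", approved, True))  # MANUAL_INTERVENTION_FOR_REVIEW
```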
You can enable the Auto Processing option for Document Tables with predefined schemas. The values for the predefined columns are predicted automatically by the prebuilt ML models. Using the predicted data, the documents are auto-approved without manual intervention. The status is updated to EXTRACTED_SUCCESSFULLY automatically.