Keyword Classification in Ephesoft Transact
By Ian Sprod
This article assumes you have some familiarity with Ephesoft Transact.
Ephesoft Transact has a number of different classification methods including search, barcode and image classification, as well as a more recent addition called keyword classification. It is also possible to write your own custom classification algorithm in Java using the document assembler scripting plugin. It is usually more efficient to use the out-of-the-box classification methods than to write, debug, and test your own custom code. Recently, I came across a case that I thought required custom classification, but it turned out I could use keyword classification instead.
Keyword classification is set up in two steps:
- Create key-value (KV) extraction rules that run during page processing (these are like regular KV extraction rules that just run in an earlier module)
- Specify classification rules that use these extracted values
The business case had scanned audit documents with distinct cover sheets that were followed by pages that could contain any text. The cover sheet contained a unique key field with the label “AB Id:” and an integer value like “1234567”. In some cases, however, the cover sheet had the label “AB Id:” but no value. In these cases, the cover sheet should be ignored.
Because of the special “do not split if the AB Id key has no value” requirement, I initially thought custom code would be required. Upon investigation, it turned out it was possible to use KV extraction with no custom script required.
My batch class already contained a document type called AuditDocument, so I first added the KeyValue Page Processing plugin to the Page Processing module.
Then I edited the Document Assembly plugin in the Document Assembly module to change the DA Classification Type to “KeywordClassification”.
Once I had the plugins added to the workflow, I needed to create the rules.
- I set up a KV rule to extract a variable called “ABID” by looking for the key text “AB Id” followed by the value of a seven-digit integer (regular expression for this is \d{7}).
- Using the classification rules section, which provides a UI widget for building rules, I then created a rule to classify the document as an AuditDocument if the ABID value was greater than zero.
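To make the KV rule concrete, here is a minimal Java sketch of the same idea outside of Transact. It is not Ephesoft's implementation or API: the class name `AbIdKvSketch`, the method `extractAbId`, and the way the key and value are joined into one pattern (optional colon, optional whitespace) are assumptions made for illustration. Only the key text “AB Id” and the value pattern \d{7} come from the rule described above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: mimics the effect of the KV rule on plain page text.
// Transact locates keys and values on the page itself; this sketch just
// works on a flat string (an assumption made for brevity).
final class AbIdKvSketch {
    // Key text "AB Id", an optional colon and whitespace, then seven digits.
    private static final Pattern AB_ID =
            Pattern.compile("AB Id:?\\s*(\\d{7})");

    // Returns the extracted ABID, or empty when the label has no value.
    static Optional<String> extractAbId(String pageText) {
        Matcher m = AB_ID.matcher(pageText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Cover sheet with a value: prints Optional[1234567]
        System.out.println(extractAbId("Audit cover sheet\nAB Id: 1234567"));
        // Cover sheet with the label but no value: prints Optional.empty
        System.out.println(extractAbId("Audit cover sheet\nAB Id:"));
    }
}
```

The second call shows the special case from the business requirement: the label is found, but because no seven-digit value follows it, nothing is extracted.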
You can create additional rules if necessary. Setting the rule as “First Page” means the page is classified as the first page of a document if the rule is true.
The “multipage” check box indicates that the document can have multiple pages. Because the rule uses “is exists”, my special case where “ABID” has no value is handled: the expression evaluates to false, and hence the cover sheet is ignored. You can create as many KV rules as you want; I only needed one in this case. The classification confidence is calculated from how many of the classification rules are true, so in my case the confidence is either 0% or 100%.
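The behaviour described above can be sketched in a few lines of Java. This is a simplified model rather than Transact's actual assembly logic: the class `KeywordClassificationSketch`, the representation of a page as a map of extracted KV values, and the decision to attach leading pages to the first group are all assumptions for illustration. It shows an “exists” rule, the confidence as the percentage of rules that are true, and a new document starting at each page where every rule holds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical model: each page is the map of KV values extracted from it.
final class KeywordClassificationSketch {

    // Rule that is true only when the field was extracted with a non-blank value.
    static Predicate<Map<String, String>> exists(String field) {
        return values -> {
            String v = values.get(field);
            return v != null && !v.isBlank();
        };
    }

    // Confidence as the percentage of rules that evaluate to true for a page.
    static double confidence(Map<String, String> page,
                             List<Predicate<Map<String, String>>> rules) {
        if (rules.isEmpty()) {
            return 0.0;
        }
        long matched = rules.stream().filter(r -> r.test(page)).count();
        return 100.0 * matched / rules.size();
    }

    // Groups page indexes into documents. A page for which all rules hold
    // is treated as a first page and starts a new document; any other page
    // is appended to the current document (assumption: pages before the
    // first match form their own leading group).
    static List<List<Integer>> assemble(List<Map<String, String>> pages,
                                        List<Predicate<Map<String, String>>> rules) {
        List<List<Integer>> documents = new ArrayList<>();
        for (int i = 0; i < pages.size(); i++) {
            Map<String, String> page = pages.get(i);
            boolean firstPage = !rules.isEmpty()
                    && rules.stream().allMatch(r -> r.test(page));
            if (firstPage || documents.isEmpty()) {
                documents.add(new ArrayList<>());
            }
            documents.get(documents.size() - 1).add(i);
        }
        return documents;
    }
}
```

With a single `exists("ABID")` rule, a cover sheet whose ABID is blank scores 0% and does not start a new document, so it stays with the pages of the preceding document, which is the “do not split” behaviour the business case required.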
It is always nice to use out-of-the-box features when possible, so I was very glad to find this. These occurrences remind me to always read the release notes for useful new features.
To learn more, or for assistance, contact us today.