You may have run into some problems with Ephesoft classification. This post outlines six best practices for classification training. We find that we get the best results when we follow these rules, and we hope that you find value in them as well.
1. Start With the Blank Form
The best strategy is to start with a blank form. For standard forms, there are often sources where you can download high-quality originals (e.g., the IRS website for tax forms). Training on the blank form means you train for the words that are common to all of the populated documents being processed, rather than for the values someone filled in. A clean original also provides the fidelity needed for the best OCR results.
2. When You Don’t Have Access to Blank Documents
When you cannot get a blank form, I recommend finding the example with the best fidelity and redacting the populated information from it. Pay attention to the OCR results of the example you choose and make sure your sample is as accurate as possible. While this may be time consuming, it produces the best results for your classification. Not redacting the samples could pose several issues. First, a populated sample may have indeterminate fidelity, and the personally identifiable information (PII) that ends up in the index and on the file system can sway some documents toward an incorrect classification. Second, it may pose a challenge to getting support. Specifically, Zia and Ephesoft may only be able to work on the issue after it is replicated. Often, replication requires getting a backup of the batch class configuration. That configuration archive could contain samples with PII, posing a security risk.
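One way to double-check a redacted sample is to scan its OCR text for anything that still looks like PII before adding it to the training set. The following is a minimal sketch in Python, not an Ephesoft feature: the patterns and the `find_pii` helper are hypothetical, cover only a few US-style formats, and assume you have already exported the OCR text of the sample as a string.

```python
import re

# Hypothetical patterns for a few common US-style PII formats.
# They are illustrative only; extend them for your own documents.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def find_pii(ocr_text):
    """Return a sorted list of (label, matched_text) pairs found in the OCR text."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(ocr_text):
            hits.append((label, match.group()))
    return sorted(hits)


# A sample that was not fully redacted still triggers hits.
sample_text = "Employee SSN 123-45-6789, contact jane.doe@example.com"
for label, value in find_pii(sample_text):
    print(f"possible {label}: {value}")
```

An empty result does not prove the sample is clean, since names and addresses will not match simple patterns like these, so a visual review of the redacted image is still worthwhile.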
3. Less is More
While many customers use a large number of samples to make Ephesoft Transact work well, it is not required. It is human nature to use different documents to train, especially when you are working with different vendors or forms from different years. However, the content is what really matters. In general, you can proceed without adding another sample if the content on the first page is not drastically different. Remember, large training sets will lead to bulky archives, slow imports, and sometimes bad classification results.
4. Don’t Use Operator Assisted Machine Learning
I know that this is a controversial position in an age of artificial intelligence, machine learning, and cognitive capture. That is why it is important to explain how this works in the current implementation. When operators have permission to train, they can select a document that was not automatically classified and specify that it be used for training purposes. This moves the sample document into the training set and updates the index used for classification. When this happens, you are faced with the same issues that come from not having a blank document. In addition, the classification training in production diverges from the training in lower environments (e.g., development, staging, UAT, etc.). While you can compare the samples in the different environments, keeping those environments in sync becomes a challenge.
Please note the information in this post relates to version 2020.1.05. The machine learning for Ephesoft is constantly being improved and my position will change with better functionality.
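If you do need to compare the samples in different environments, a file-level comparison is one place to start. The sketch below assumes you have copied each environment's training samples into a local folder; the folder layout and the function names are hypothetical and are not part of Ephesoft Transact.

```python
import hashlib
from pathlib import Path


def fingerprint_samples(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_training_sets(lower_env_dir, production_dir):
    """Report samples that exist in only one environment or differ in content."""
    lower = fingerprint_samples(lower_env_dir)
    prod = fingerprint_samples(production_dir)
    shared = lower.keys() & prod.keys()
    return {
        "only_in_lower": sorted(lower.keys() - prod.keys()),
        "only_in_production": sorted(prod.keys() - lower.keys()),
        "changed": sorted(name for name in shared if lower[name] != prod[name]),
    }
```

Samples that operators added in production would show up under `only_in_production`. The comparison only tells you which files differ; deciding which ones belong in every environment is still a manual review.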
5. Sometimes You Have to Change the Training
When you import a new document into Ephesoft Transact for classification training, it separates the pages into first, middle, and last pages. While this may work for many document types, it does not work well for those with a lot of variation. Take, for instance, a document that sometimes includes an addendum. If you train the document type without the addendum, and then again with the addendum, it will most likely separate the document incorrectly. It creates the addendum as a new document because Ephesoft ends the document when it sees a last page. When there is variation like this, I suggest that you move the last page samples to the middle pages and retrain the index. This may lower the confidence of the document, but it assembles the documents more accurately. You can always adjust the confidence thresholds for document types that you configure this way.
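The addendum example above can be illustrated with a toy model. This is a simplified sketch of the behavior described here, not Ephesoft's actual Document Assembler logic: the `assemble` function and its rules (a first page or a change of type opens a document, a last page closes it) are assumptions made for illustration.

```python
def assemble(pages):
    """Group (doc_type, position) page labels into documents.

    Toy rules: a FIRST page or a change of type opens a new document,
    and a LAST page closes the current one.
    """
    documents = []
    current = None
    for doc_type, position in pages:
        if current is None or position == "FIRST" or current["type"] != doc_type:
            current = {"type": doc_type, "pages": []}
            documents.append(current)
        current["pages"].append(position)
        if position == "LAST":
            current = None  # the next page will start a new document
    return documents


# Trained with and without the addendum: the signature page and the
# addendum page both look like last pages.
before = [("Contract", "FIRST"), ("Contract", "MIDDLE"),
          ("Contract", "LAST"), ("Contract", "LAST")]

# After moving the last page samples to the middle pages and retraining.
after = [("Contract", "FIRST"), ("Contract", "MIDDLE"),
         ("Contract", "MIDDLE"), ("Contract", "MIDDLE")]

print(len(assemble(before)))  # 2: the addendum is split into its own document
print(len(assemble(after)))   # 1: all four pages stay together
```

In this model the document now ends only when the next first page arrives, which is why the change trades some confidence for more accurate assembly.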
6. Use Keyword Classification When Possible
There are cases when there is a very specific term, or set of terms, on a document type. When that occurs, don’t be afraid to use keyword classification. It is quick and accurate. We often combine Searchable or MultiDimensional classification with Keywords in our implementations. You can use multiple classification methods by selecting Automatic for the DA Classification Type in the Document Assembler plugin. That setting uses the most confident classification from the multiple classification plugins.
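Conceptually, picking the most confident classification looks like the sketch below. It only illustrates the idea; the `Result` type, the plugin labels, and the confidence values are hypothetical and do not reflect Ephesoft's internal data structures.

```python
from typing import NamedTuple, Optional, Sequence


class Result(NamedTuple):
    plugin: str        # hypothetical label for the classification method
    doc_type: str      # document type the method proposed
    confidence: float  # confidence score reported by that method


def most_confident(results: Sequence[Result]) -> Optional[Result]:
    """Return the result with the highest confidence, or None if there are none."""
    return max(results, key=lambda r: r.confidence, default=None)


candidates = [
    Result("search", "Invoice", 62.0),
    Result("keyword", "W-2", 95.0),
]
print(most_confident(candidates).doc_type)  # W-2
```

A strong keyword match can win over a weaker index-based match, which is why a very specific term on a document type is worth configuring.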
We believe these are some of the best practices of Ephesoft classification training. That said, we invite you to share what has worked best for you. Of course, feel free to reach out to us anytime if you have additional thoughts or questions.