2023.1 Migration Guide
New OCR and Office Conversion Engines
Our previous OCR engine was based on the Tesseract open source project, and we used LibreOffice as the core of our Office conversion tools. This setup produced quality results, but both dependencies imposed limits we couldn't work around. The main issue with the OCR engine was performance, which was acceptable at best. For Office conversion, our main pain point was that we were unable to effectively improve the conversion quality itself.
Both new engines process documents more quickly and more accurately than their predecessors. The OCR performance gain is especially considerable: we measured speedups of up to 7× compared to the previous engine, with the same or sometimes even better accuracy.
Usage of these new engines requires a license key update. If your license already includes OCR or Office conversion components, you qualify to get access to the updated engines for free. In that case, to enable them, retrieve the updated license key from the customer portal and update it in Processor’s configuration.
If you encounter any changes in behavior or regressions that break your workflow, you can revert to the old engines with the following Processor configuration options:
OCR_ENGINE: The OCR engine defaults to the new engine. Set this option to core to revert to the old engine.
CONVERSION_ENGINE: The Office conversion engine defaults to the new engine. Set this option to libreoffice to revert to the old LibreOffice-based Office conversion engine.
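As a rough illustration of how these two options might be applied, the sketch below builds a docker run command line that passes them as environment variables. It assumes Processor runs as a container and reads its configuration from the environment, which this guide doesn't state. The image name and both helper functions are hypothetical. Only the option names and the values core and libreoffice come from the list above.

```python
import shlex

# Option names and values from this guide: they select the old (deprecated) engines.
LEGACY_ENGINES = {
    "OCR_ENGINE": "core",
    "CONVERSION_ENGINE": "libreoffice",
}


def legacy_engine_env(revert_ocr=True, revert_conversion=True):
    """Return the configuration overrides that switch back to the old engines."""
    env = {}
    if revert_ocr:
        env["OCR_ENGINE"] = LEGACY_ENGINES["OCR_ENGINE"]
    if revert_conversion:
        env["CONVERSION_ENGINE"] = LEGACY_ENGINES["CONVERSION_ENGINE"]
    return env


def docker_run_command(image, env):
    """Build a `docker run` command line that passes each option with -e.

    Assumption: Processor is configured through environment variables.
    """
    args = ["docker", "run", "--rm"]
    for name, value in sorted(env.items()):
        args += ["-e", f"{name}={value}"]
    args.append(image)
    return shlex.join(args)


if __name__ == "__main__":
    # "example/processor:2023.1" is a placeholder image name, not a real one.
    print(docker_run_command("example/processor:2023.1", legacy_engine_env()))
```

To revert only one engine, pass revert_ocr=False or revert_conversion=False. The other engine then stays on its new default.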
Usage of the old OCR and Office conversion engines is deprecated, and we’ll drop support for them in a future version. Please submit any issues you encounter with the new engines to Support.
For a complete list of changes in this release, see the Processor changelog.