Background

Over the last decade, more and more data has originated in digital form. However, there are still many cases where physical documents need to be used or preserved. The healthcare and financial industries in particular routinely scan or fax physical documents into TIFF or PDF formats. Analyzing unstructured content is already a challenge, and these image-based formats are an even harder nut to crack.

What is Optical Character Recognition?

Optical character recognition (OCR) extracts text content from images of physical documents so that it’s in an editable format. This can apply to pages of a book, scanned PDF files, and even handwritten content (though handwriting support is more limited). Compliance Guardian’s OCR implementation is possible in large part due to Google’s Tesseract library.

Considerations

Aside from the exciting new capability to see text in images, there are also some considerations to keep in mind.

Performance

OCR requires a lot of computation and places a significant load on the CPU. As a result, documents will be scanned considerably more slowly than before. For instance, it may take 5 seconds to process a single 300-dpi scanned page (depending on CPU power).
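
To get a feel for what that means for a large backlog, here is a back-of-the-envelope sketch. The function name and the `workers` parameter are hypothetical, and the sketch assumes throughput scales linearly with the number of parallel workers, which real hardware may not deliver.

```python
def estimate_ocr_hours(pages: int, seconds_per_page: float = 5.0, workers: int = 1) -> float:
    """Rough wall-clock hours needed to OCR `pages` pages.

    seconds_per_page defaults to the 5-second example figure for a 300-dpi page.
    workers is a hypothetical number of pages processed in parallel; linear
    scaling is an assumption, not a measured result.
    """
    if pages < 0 or seconds_per_page <= 0 or workers < 1:
        raise ValueError("pages must be >= 0, seconds_per_page > 0, workers >= 1")
    return pages * seconds_per_page / workers / 3600


# Example: 7,200 scanned pages on a single worker at 5 seconds per page.
print(estimate_ocr_hours(7200))
```

Plugging in your own page counts and a measured per-page time gives a quick sanity check before scheduling a large scan.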

Accuracy

Although OCR technology has advanced a lot in recent years, it’s still far from perfect. It’s rare for OCR to yield 100% accurate results. The clearer the original image, the more accurate the result will be.

In the following section, we will expand on some common factors that can affect accuracy. Pre-processing is very important for improving accuracy. Common approaches include converting an image to grayscale, increasing contrast, and reducing noise. In special cases, more complex pre-processing may be needed (e.g. computer vision, contour detection, rotating, cropping, or using anchors).
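
As an illustration of the two simplest steps, here is a minimal pure-Python sketch of grayscale conversion followed by a min-max contrast stretch. It is not Compliance Guardian's implementation. The function names are hypothetical, the image is assumed to be a list of rows of (r, g, b) tuples, and the luma weights are the common ITU-R BT.601 values.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]


def to_grayscale(rgb_rows: List[List[RGB]]) -> List[List[int]]:
    """Convert rows of (r, g, b) pixels to 0-255 gray levels using BT.601 luma weights."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_rows
    ]


def stretch_contrast(gray_rows: List[List[int]]) -> List[List[int]]:
    """Linearly rescale gray levels so the darkest pixel becomes 0 and the brightest 255."""
    lo = min(min(row) for row in gray_rows)
    hi = max(max(row) for row in gray_rows)
    if hi == lo:
        # A flat image has no contrast to stretch; return an unchanged copy.
        return [row[:] for row in gray_rows]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in gray_rows]


# A faded 2x2 patch whose gray levels only span 100..200.
faded = [[100, 120], [160, 200]]
print(stretch_contrast(faded))
```

In practice an imaging library would do this far faster; the point here is only to show how little each basic step involves.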

Common Factors That Impact OCR Accuracy

Typically, OCR works better for documents that are:

  • Scanned with flatbed scanners
  • Scanned with good resolution and lighting conditions
  • Scanned with high contrast
  • Text-centric
  • Using common fonts
  • Well aligned

Documents produced by a dedicated scanner or fax machine can meet most of these conditions, but not all documents can.

The following sections describe how common factors can impact OCR results.

Nature of the Image

Scanned documents yield much better accuracy than photos because photos typically have less contrast, more noise, blurriness (e.g. out-of-focus edges or camera shake), distortion (the page is not flat), poor alignment, and so on.

The same principle applies to images within scanned documents. Text-centric content will be recognized much better than scanned pictures of driver’s licenses and ID cards, for instance.

Resolution

In our testing, images with a resolution of 300 dpi typically produced better results. If the image resolution is too low (less than 100 dpi), pre-processing to enlarge the image can improve accuracy a bit, but the results will still not be as good as those from higher-resolution images. On the other hand, images with very high resolution will take longer to process.
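
For illustration, here is a minimal sketch of enlarging a low-resolution grayscale image toward a 300 dpi target using nearest-neighbor scaling. The function name and the list-of-rows image layout are assumptions made for this example. Nearest-neighbor is simply the easiest method to show, and a real pipeline would more likely use a smoother interpolation from an imaging library.

```python
from typing import List


def upscale_nearest(gray_rows: List[List[int]], source_dpi: float,
                    target_dpi: float = 300.0) -> List[List[int]]:
    """Enlarge a grayscale image so its pixel density matches target_dpi.

    Each output pixel copies the nearest source pixel, so no new detail is
    created. Images already at or above target_dpi are returned as a copy.
    """
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    if not gray_rows or not gray_rows[0]:
        return []
    scale = target_dpi / source_dpi
    if scale <= 1:
        return [row[:] for row in gray_rows]
    height, width = len(gray_rows), len(gray_rows[0])
    new_height, new_width = round(height * scale), round(width * scale)
    return [
        [
            gray_rows[min(height - 1, int(y / scale))][min(width - 1, int(x / scale))]
            for x in range(new_width)
        ]
        for y in range(new_height)
    ]


# A 2x2 image scanned at 150 dpi becomes 4x4 at the 300 dpi target.
for row in upscale_nearest([[0, 255], [255, 0]], source_dpi=150):
    print(row)
```

Because enlargement only repeats or blends existing pixels, it cannot recover detail the scanner never captured, which is why very low-resolution sources still lag behind.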

Font

Font size interacts with resolution. A larger font size could be fine at a low dpi, but a smaller font size will require a higher resolution to be recognized. For example, a 10.5pt font could work fine in 300 dpi images, but in images at 200 dpi, a font size smaller than 12pt may not work well.
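
The relationship is easy to see by converting point sizes into pixels. One typographic point is 1/72 of an inch, so the nominal height of a font in pixels is the point size times the dpi divided by 72. The helper below is a hypothetical illustration; the point size describes the nominal font height, and individual letters are smaller than that.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def font_height_px(font_size_pt: float, dpi: float) -> float:
    """Nominal font height in pixels for a given point size and scan resolution."""
    return font_size_pt * dpi / POINTS_PER_INCH


# The combinations mentioned above, plus 10.5pt at 200 dpi for comparison.
for pt, dpi in [(10.5, 300), (12, 200), (10.5, 200)]:
    print(f"{pt}pt at {dpi} dpi -> {font_height_px(pt, dpi):.1f} px")
```

A 10.5pt font at 300 dpi is about 44 pixels tall, while the same font at 200 dpi drops to roughly 29 pixels, so the OCR engine has noticeably fewer pixels per character to work with.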

Font type is another factor. Google’s Tesseract library is pre-trained with the most common font types. If the font used in the document is not common, the accuracy will be lower.

Handwriting results are typically poor for the same reason. It’s also important to note that handwriting input and handwriting OCR are a bit different: handwriting input tracks movement as the text is written, while handwriting OCR only has the final image to work with.

Contrast

The OCR engine works best on high-contrast images. Most well-scanned, text-centric documents can satisfy this. For less-than-ideal situations, pre-processing may be used to increase the contrast.
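
One widely used way to maximize contrast is to binarize the image, turning every pixel either black or white. The sketch below uses Otsu's method to pick the threshold automatically. This is a common general-purpose technique chosen for illustration, not a description of what Compliance Guardian does, and the function names and list-of-rows image layout are assumptions.

```python
from typing import List


def otsu_threshold(gray_rows: List[List[int]]) -> int:
    """Return the gray level that best separates dark and light pixels (Otsu's method)."""
    hist = [0] * 256
    for row in gray_rows:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    dark_count, dark_sum = 0, 0
    best_level, best_variance = 0, -1.0
    for level in range(256):
        dark_count += hist[level]
        if dark_count == 0:
            continue  # no pixels at or below this level yet
        light_count = total - dark_count
        if light_count == 0:
            break  # every pixel is already in the dark class
        dark_sum += level * hist[level]
        dark_mean = dark_sum / dark_count
        light_mean = (sum_all - dark_sum) / light_count
        # Between-class variance (up to a constant factor of total squared).
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarize(gray_rows: List[List[int]], threshold: int) -> List[List[int]]:
    """Map pixels at or below the threshold to black (0) and the rest to white (255)."""
    return [[0 if p <= threshold else 255 for p in row] for row in gray_rows]


# Dark "ink" around level 50 on a light "paper" background around level 200.
page = [[40, 50, 200], [60, 190, 210]]
print(binarize(page, otsu_threshold(page)))
```

A single global threshold like this suits evenly lit scans; photos with shadows or uneven lighting usually need a locally adaptive threshold instead.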

Alignment

Google’s Tesseract library has minimal tolerance when it comes to the alignment of scanned images. Based on our testing, the accuracy of a scan will drop if its alignment is more than 5 degrees off. Again, several complex pre-processing techniques could help overcome this.
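
To make the idea of skew correction concrete, here is a minimal sketch of one common way to estimate skew: the projection-profile method. It tries a range of candidate angles and keeps the one at which the dark pixels collapse into the fewest, sharpest horizontal rows. This is an illustrative technique under stated assumptions, not the method used by Tesseract or Compliance Guardian. The function name, the (x, y) coordinate input, and the search range are all hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def estimate_skew_degrees(dark_pixels: Iterable[Tuple[int, int]],
                          max_angle: float = 10.0, step: float = 0.5) -> float:
    """Estimate page skew in degrees from the (x, y) coordinates of dark pixels.

    The angle is positive when text lines drift toward larger y as x increases.
    """
    points = list(dark_pixels)
    if not points:
        return 0.0
    best_angle, best_score = 0.0, -1
    candidate_count = int(round(2 * max_angle / step))
    for i in range(candidate_count + 1):
        angle = -max_angle + i * step
        theta = math.radians(angle)
        sin_t, cos_t = math.sin(theta), math.cos(theta)
        # Undo the candidate rotation and count how many pixels land in each row.
        rows = Counter(round(y * cos_t - x * sin_t) for x, y in points)
        # Sharply peaked row counts (well-aligned text lines) give a high score.
        score = sum(count * count for count in rows.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


# Synthetic page: five 200-pixel-wide "text lines" skewed by 3 degrees.
slope = math.tan(math.radians(3.0))
synthetic = [(x, round(20 * k + x * slope)) for k in range(1, 6) for x in range(200)]
print(estimate_skew_degrees(synthetic))
```

An estimate like this could be compared against the 5-degree figure above to decide whether a page should be rotated back before it is passed to the OCR engine.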


Other Factors

Several other factors can degrade OCR accuracy, such as blurriness, pages not lying flat when scanned, and blemishes on the image.

How AvePoint Can Help

AvePoint’s Compliance Guardian product already has an extensive framework of technologies to help customers with deep content analysis. In our newest update to version 4.4, optical character recognition (OCR) for scanned documents will further expand our technology stack in a major way.

With the help of OCR, Compliance Guardian will allow users to analyze physical documents much more efficiently. It also applies several image enhancement techniques that will significantly improve OCR results.

Compliance Guardian’s out-of-the-box OCR functionality targets the more common scanned text document scenarios. We’re excited that users will finally be able to get text content from images and scan for compliance violations directly on the platform.

That said, there are still challenges in optimizing OCR to work seamlessly for all use cases. As of now, accuracy is still being evaluated on a case-by-case basis. We’ll continue to work hard on accuracy and optimization improvements, so please stay tuned!

