OCR PRE-PROCESSING

Image Preprocessing for OCR and Data Extraction Environments
Neurogy Corp.

CONTENTS

Introduction
Line Removal
Noise Removal
Dot Matrix Enhancement
Text Line Separation
Conclusion

Introduction

OCR in a commercial domain such as free-form invoice processing faces significant challenges in image quality. Since every operation performed on an image subsequent to its acquisition hinges on the quality of the raw data, it is important to maximize the semantic content at the blob level, even before OCR takes place. We have found that data extraction operations have error rates equal to about twice the OCR error rates in normal domains.

Neurogy's OCR pre-process technology enhances class 1 semantic content immediately before images are sent to an OCR engine. Semantic content in a class 1 sense refers to the correct classification of a blob of pixels as a specific character (or part of such), a line, a part of an image, or noise. By implementing several powerful and proprietary noise removal and character enhancement algorithms, Neurogy's pre-processors achieve a considerable improvement in OCR accuracy, and hence better results in the data extraction phase (class 3).

This module performs four main types of semantic enhancement: line removal and tagging, noise removal, dot matrix enhancement (otherwise known as blob aggregation), and text line separation.

Line Removal

In invoice processing, as in some other applications, graphical lines that interfere with text present a significant problem. Even a small misalignment against pre-printed forms can leave a majority of the text on a page partially obscured by horizontal or vertical lines. Our enhancement technology removes lines using a three-stage approach: initial detection, line removal, and obscured text enhancement. We remove lines before noise because lines in documents are often broken into pieces; if noise removal ran first, it could delete the smaller pieces and reduce the chance that the line is detected in this phase.

The line detection phase is actually folded into the image loading phase to speed processing. As the image is loaded and straightened, a horizontal and vertical profile of consecutive pixel counts is maintained. Wherever that count exceeds a certain value, the likelihood of a line is very high. The detection algorithm starts at those points and works along the axis of the line candidate in an attempt to find the true extent of the line. A certain threshold is allowed for gaps, and line similarity (to the initially detected segment) is enforced. Once the endpoints of the line have been determined, the line is submitted for removal and tagging if certain length and relative position criteria are met.
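The run-counting and gap-tolerant extension described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Neurogy's implementation: the image is assumed to be a list of rows of 0/1 pixels (1 = black), only horizontal lines are handled, the scan runs over an already loaded image rather than during loading, and the similarity check against the seed segment is omitted. The names longest_run, extend_run, and detect_horizontal_lines and the parameters min_run, max_gap, and min_length are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed representation: 1 = black pixel, 0 = white


def longest_run(row: List[int]) -> Tuple[int, int]:
    """Return (start, length) of the longest run of consecutive black pixels."""
    best_start, best_len = 0, 0
    start = None
    for x, value in enumerate(row + [0]):  # trailing 0 closes a run at the edge
        if value and start is None:
            start = x
        elif not value and start is not None:
            if x - start > best_len:
                best_start, best_len = start, x - start
            start = None
    return best_start, best_len


def extend_run(row: List[int], start: int, length: int, max_gap: int) -> Tuple[int, int]:
    """Grow a seed run both ways, bridging gaps of at most max_gap white pixels."""
    left, right = start, start + length - 1
    x, gap = right + 1, 0
    while x < len(row) and gap <= max_gap:
        if row[x]:
            right, gap = x, 0
        else:
            gap += 1
        x += 1
    x, gap = left - 1, 0
    while x >= 0 and gap <= max_gap:
        if row[x]:
            left, gap = x, 0
        else:
            gap += 1
        x -= 1
    return left, right


def detect_horizontal_lines(img: Image, min_run: int, max_gap: int,
                            min_length: int) -> List[Tuple[int, int, int]]:
    """Return (row, left, right) for each row holding a plausible line."""
    lines = []
    for y, row in enumerate(img):
        start, length = longest_run(row)
        if length >= min_run:  # the run count exceeds the seed threshold
            left, right = extend_run(row, start, length, max_gap)
            if right - left + 1 >= min_length:  # length criterion
                lines.append((y, left, right))
    return lines
```

A vertical detector would run the same logic over columns. A thick line would be reported once per row here, so a fuller version would merge adjacent rows into one line candidate.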

One of the hardest problems facing line removal algorithms is how to remove all the pixels in a line without removing semantically significant blobs that are morphologically connected. We take a statistical approach to this problem. By measuring average line thickness and allowing for smooth local variance, we can avoid removing the spike-like protrusions that attached letters cause. By taking a centroid and constraining the eraser to follow the straight line from start to finish, with allowance for some curvature, we avoid the problems associated with localized following approaches. By constraining removal in areas of maximum uncertainty, we leave as much data as possible for obscured letter enhancement while not leaving too much for the noise removal to clean up.
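A minimal sketch of thickness-constrained erasing, under several assumptions: the line is horizontal and perfectly straight (no curvature allowance), the image is a list of rows of 0/1 pixels, the median stands in for the "average" thickness, and a fixed tolerance stands in for the smooth local variance. The function names and the tolerance parameter are hypothetical. Columns whose black run is close to the typical thickness are erased completely; in thicker columns, where a character probably crosses the line, only the straight band is erased so that the protruding strokes survive.

```python
from statistics import median_low
from typing import List, Optional, Tuple

Image = List[List[int]]  # assumed representation: 1 = black pixel, 0 = white


def vertical_run(img: Image, x: int, y: int) -> Optional[Tuple[int, int]]:
    """Return (top, bottom) of the black run in column x through row y, if any."""
    if not img[y][x]:
        return None
    top = y
    while top > 0 and img[top - 1][x]:
        top -= 1
    bottom = y
    while bottom < len(img) - 1 and img[bottom + 1][x]:
        bottom += 1
    return top, bottom


def erase_line(img: Image, y: int, left: int, right: int, tolerance: int = 1) -> Image:
    """Erase a horizontal line on row y between columns left and right (inclusive)."""
    runs = {}
    for x in range(left, right + 1):
        run = vertical_run(img, x, y)
        if run is not None:
            runs[x] = run
    out = [row[:] for row in img]
    if not runs:
        return out
    # Typical thickness of the line, measured over every column it occupies.
    typical = median_low([b - t + 1 for t, b in runs.values()])
    thin = [(t, b) for t, b in runs.values() if b - t + 1 <= typical + tolerance]
    # The straight band the eraser is constrained to follow.
    band_top = median_low([t for t, _ in thin])
    band_bottom = median_low([b for _, b in thin])
    for x, (t, b) in runs.items():
        if b - t + 1 <= typical + tolerance:
            rows = range(t, b + 1)  # ordinary line column: erase the whole run
        else:
            rows = range(band_top, band_bottom + 1)  # protrusion: erase the band only
        for yy in rows:
            out[yy][x] = 0
    return out
```

Erasing only the band in thick columns still cuts the crossing stroke in two, which is what the obscured text enhancement stage is then expected to repair.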

The hardest problem facing line removal algorithms is how to pick up the pieces, that is, how to reconnect the broken-up letters that were obscured by the line in the first place. Note that it is not necessary to restore the character flawlessly, but only to restore its morphology well enough that the OCR algorithm will produce a correct result. By using a combination of thickness and spike adjacency analysis, we have created an effective solution to this problem, as shown in the image below.

[ Image: lines ]

[ Image: lines cleaned ]

Noise Removal

Noise removal greatly aids the OCR process, as many engines try to interpret every half-height blob on a document as some sort of low-confidence character. Neurogy's noise removal algorithms take advantage of known characteristics of the domain in question, and can be altered to match the types of noise likely to be encountered. Noise comes in three flavors: patterned noise, associated noise, and random noise. Patterned noise comes from graphical patterns, especially halftone shaded areas, that appear on many scanned forms. Associated noise occurs when a scanned document is incorrectly thresholded and artifacts appear in the image around valid blobs. Random noise comes from a bad global threshold or a dirty source document or scanner.

All types of noise are removed using a combination of global and local statistical analysis of blob sizes and shapes. Some documents are crisp and high quality, and any small blob not associated closely with a line should be removed. Others are dot matrix prints or of poor quality, and noise should be removed only where there is no question that the blob under consideration is actually noise. This automatically determined statistical threshold allows for a very flexible noise removal system. Patterned noise is removed by performing a pyramidal statistical segmentation of the image. Since pattern fields are generally of a certain minimum size, the segmentation can be performed at a relatively high level (low resolution) of the image pyramid without adversely impacting time performance.
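As a rough illustration of a statistically determined size threshold, the sketch below labels 8-connected blobs and removes those whose area falls well below the page's median blob area. It is a simplification under stated assumptions: it uses only a global statistic (no local analysis, no shape test, no pyramidal segmentation for patterned noise), the image is a list of rows of 0/1 pixels, and find_blobs, remove_small_blobs, and the fraction parameter are hypothetical names.

```python
from collections import deque
from statistics import median
from typing import List, Tuple

Image = List[List[int]]  # assumed representation: 1 = black pixel, 0 = white


def find_blobs(img: Image) -> List[List[Tuple[int, int]]]:
    """Return every 8-connected blob as a list of (row, column) pixels."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for y in range(h):
        for x in range(w):
            if img[y][x] and not seen[y][x]:
                seen[y][x] = True
                queue = deque([(y, x)])
                blob = []
                while queue:  # breadth-first flood fill
                    cy, cx = queue.popleft()
                    blob.append((cy, cx))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = cy + dy, cx + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                blobs.append(blob)
    return blobs


def remove_small_blobs(img: Image, fraction: float = 0.25) -> Image:
    """Erase blobs whose area is below fraction * median blob area."""
    out = [row[:] for row in img]
    blobs = find_blobs(img)
    if not blobs:
        return out
    # The cutoff is derived from the page itself instead of being fixed.
    cutoff = fraction * median(len(blob) for blob in blobs)
    for blob in blobs:
        if len(blob) < cutoff:
            for y, x in blob:
                out[y][x] = 0
    return out
```

Because the cutoff scales with the median, a dot matrix page whose typical blob is a single small dot yields a very low cutoff and little is removed, while a crisp page with large character blobs loses its specks.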

The following image shows an example of noise removal in a patterned area.

[ Image: noise removal in a patterned area ]

Dot Matrix Enhancement (Blob Aggregation)

Once noise and lines have been removed, any blob that is separated from another blob by less than the average horizontal character separation distance is likely to be a good candidate for aggregation. However, it is less desirable for non-dot matrix documents to be aggregated than it is for dot matrix documents to suffer a performance penalty. Therefore, a statistical analysis is performed on all text lines over a certain height to determine whether the form has been filled out with a dot matrix printer. If it has not, then the aggregation routine is bypassed, because the proportional fonts used in non-dot matrix printers suffer greatly under this algorithm.

The actual blob aggregation is a simple dilate with gravity, so that adjacent blobs are most prone to expand in the direction that they are adjacent. This prevents loss of interior hole information while dramatically increasing OCR accuracy for engines that are looking for a generic character instead of a series of dots.
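One way to approximate a dilation that only grows blobs toward their neighbors is to fill short white gaps that separate two different blobs along a row or column, and to leave gaps inside a single blob alone. The sketch below does that. It is an assumption-laden stand-in for the dilate-with-gravity step, not the actual routine: the image is a list of rows of 0/1 pixels, max_gap plays the role of the separation distance, and label_blobs and bridge_gaps are hypothetical names.

```python
from collections import deque
from typing import List

Image = List[List[int]]  # assumed representation: 1 = black pixel, 0 = white


def label_blobs(img: Image) -> List[List[int]]:
    """Label 8-connected blobs; returns a grid of labels (0 = background)."""
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    next_label = 0
    for y in range(h):
        for x in range(w):
            if img[y][x] and not labels[y][x]:
                next_label += 1
                labels[y][x] = next_label
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = cy + dy, cx + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not labels[ny][nx]):
                                labels[ny][nx] = next_label
                                queue.append((ny, nx))
    return labels


def bridge_gaps(img: Image, max_gap: int = 1) -> Image:
    """Fill white gaps of at most max_gap pixels between two different blobs."""
    labels = label_blobs(img)
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            here = labels[y][x]
            if not here:
                continue
            for dy, dx in ((0, 1), (1, 0)):  # look rightwards and downwards
                for step in range(1, max_gap + 2):
                    ny, nx = y + dy * step, x + dx * step
                    if not (0 <= ny < h and 0 <= nx < w):
                        break
                    there = labels[ny][nx]
                    if there:
                        if there != here:  # a different blob: grow toward it
                            for k in range(1, step):
                                out[y + dy * k][x + dx * k] = 1
                        break  # same blob: an interior hole, leave it alone
    return out
```

Because a gap is filled only when the pixels on either side belong to different blobs, the hole inside a character such as "o" has the same blob on both sides and stays open.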

Text Line Separation

OCR engines perform best when they are passed a block of text that is all in the same font. To that end, we pass zones of the document into the OCR engine one at a time. Each zone consists of a single line of text, identified through blob analysis as shown below.

Once again, we take advantage of an assumption: that all lines of text on a page are horizontal, and that it is all right to cut a true line of text into multiple passed lines as long as two vertically adjacent lines are not joined. (A second pass identifies vertical lines of text in low confidence separation areas.) To extend a line of text, a blob must exist within a certain height range, and within a certain multiple of the line's height from the end of the line. Once a blob is acquired for the line, every other blob that could be connected to it (i.e., within the height parameters of the line) is added to the text line. The height and descent of the line are adjusted quickly at first, then more and more slowly as the length of the line increases.
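The grouping rule above can be sketched as a greedy left-to-right pass over blob bounding boxes. This is a minimal sketch under stated assumptions: blobs are given as (left, top, right, bottom) boxes with inclusive pixel coordinates, the "height range" test is reduced to checking that the blob's vertical center falls inside the line, and the slowing adjustment is modelled as a running mean. TextLine, group_into_lines, and the reach parameter are hypothetical names, and the second pass for vertical text is not shown.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # assumed layout: (left, top, right, bottom)


@dataclass
class TextLine:
    top: float
    bottom: float
    right: int
    boxes: List[Box] = field(default_factory=list)


def group_into_lines(boxes: List[Box], reach: float = 2.0) -> List[TextLine]:
    """Assign each blob to the first text line it can extend, else start a new one."""
    lines: List[TextLine] = []
    for box in sorted(boxes):  # tuples sort by left edge first
        left, top, right, bottom = box
        centre = (top + bottom) / 2
        for line in lines:
            height = line.bottom - line.top + 1
            within_height = line.top <= centre <= line.bottom
            within_reach = left - line.right <= reach * height
            if within_height and within_reach:
                line.boxes.append(box)
                # Running mean: each new blob moves the estimate by 1/n, so the
                # line adapts quickly at first and more slowly as it grows.
                rate = 1.0 / len(line.boxes)
                line.top += rate * (top - line.top)
                line.bottom += rate * (bottom - line.bottom)
                line.right = max(line.right, right)
                break
        else:
            lines.append(TextLine(top, bottom, right, [box]))
    return lines
```

A blob that is too far from the end of a line starts a new line even when it sits at the same height. That matches the stated tolerance for cutting a true line into several pieces, since the rule only has to avoid joining vertically adjacent lines.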

[ Image: text lines identified through blob analysis ]

Conclusion

Neurogy's OCR preprocessing and image enhancement greatly increase the accuracy of OCR, even on lower quality images. Since every one-point decline in OCR accuracy causes roughly a two-point decline in data extraction accuracy, this part of the process is critical to the productivity enhancements realized when using Neurogy's complete system for data extraction.

[ This page was last updated Jun-7-2004 ]