
Neurosemantics™ in Class 3 Data Extraction
Neurogy Corp.
CONTENTS
Introduction
Preprocessing
Building a Semantic Model for a Domain
- Example: European-Style Date
Data Extraction
Performance Issues
Conclusion
Introduction
The extraction of data from paper and archived invoices is a significant problem facing
today's businesses. Often, data vital to understanding the historical and present operation
of a business is only available in graphical or paper form. The sheer variety of the data formats
precludes the use of conventional OCR software with form definitions. In the past, the only recourse for a business
that needed this data was to employ human data entry operators. We replace this costly and time-consuming
methodology with a new technology, neurosemantic data extraction.
Neurogy's Class 3 neurosemantic technology allows users to specify the data of interest as semantic
relationships, rather than positioned elements on the form. The underlying engine allows for OCR errors
and badly designed forms. Form vocabulary and positional relationships are easily taught. The default output
format is XML, although other formats can be supplied as necessary.
There are three main stages to extracting data from a document: scanning and OCR (preprocessing), domain specification,
and data extraction. This paper covers each of those stages in detail.
Preprocessing
Before entry into the actual system, images of documents must be converted to TIFF format at a
resolution of at least 200 DPI. Higher resolution gives better OCR results, and hence better extraction results.
Images that have been previously scanned or stored electronically can usually be converted to the
required format. Actual paper documents must be scanned using a high-quality scanner such as the Kodak
3500. Our workflow software is capable of controlling this and other similar scanners. Once the image is
acquired in the proper format, it is inserted into the working database and marked as ready for segmentation and OCR.
Today's OCR applications function best when they are presented with input areas that contain a uniform font, at uniform
height, and laid out in regular non-overlapping lines. Few invoices supply this across the entire invoice. Also,
many invoices and other layout-type data sources have lines, graphics, and other non-text items that confuse
OCR engines. Finally, many invoices have shaded regions which present themselves in a bi-tonal image as areas of
noise. Optimal results can only be achieved if all of these artifacts are removed, and if individual lines of text
are located and presented in isolation, one at a time, to the OCR system.
Neurogy's preprocessing system removes lines, even when they are drawn through text, without disturbing the
surrounding (or underlying) text. The locations of these lines are stored for later use in the semantic portion
of the process, since lines and the boxes they form often impute attributes to the data they contain without
specifically naming those attributes in text.
Nothing confuses an OCR system more than small patches of noise surrounding legitimate letters. Neurogy's
preprocessor adaptively determines the characteristics of the letters of text present in each
document, then removes non-conforming patches of noise in each locale. The system also identifies blobs that
are too large, or wrongly shaped, and classifies them as images.
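The adaptive filtering described above can be sketched roughly as follows. This is a minimal illustration, not Neurogy's implementation. It assumes the connected components (blobs) have already been found and are supplied as bounding boxes, it estimates the typical letter height once per document rather than per locale, and it ignores blob shape. The `Blob` structure and the `min_ratio` and `max_ratio` thresholds are hypothetical.

```python
from statistics import median
from typing import NamedTuple


class Blob(NamedTuple):
    """Bounding box of one connected component (hypothetical structure)."""
    x: int
    y: int
    width: int
    height: int


def classify_blobs(blobs, min_ratio=0.3, max_ratio=3.0):
    """Label each blob as 'text', 'noise', or 'image'.

    The typical letter height is estimated from the blobs themselves
    (the median height), so the thresholds adapt to each document.
    min_ratio and max_ratio are illustrative values, not tuned ones.
    """
    if not blobs:
        return []
    typical = median(b.height for b in blobs)
    labels = []
    for b in blobs:
        size = max(b.width, b.height)
        if size < min_ratio * typical:
            labels.append("noise")   # far smaller than a letter
        elif size > max_ratio * typical:
            labels.append("image")   # far larger than a letter
        else:
            labels.append("text")
    return labels


blobs = [
    Blob(10, 10, 12, 20),     # letter-sized
    Blob(24, 10, 11, 21),     # letter-sized
    Blob(38, 12, 10, 19),     # letter-sized
    Blob(50, 15, 2, 2),       # speck
    Blob(100, 40, 300, 150),  # logo-sized
]
print(classify_blobs(blobs))
```

A per-locale version would compute the median over a neighborhood of each blob instead of over the whole page.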
Once all the extraneous information is removed from the document, the remaining characters are grouped into
single lines of text and passed to the OCR engine of choice, such as Prime. This engine returns confidences and
positions for each character. This character data is then used to construct words for use in the
extraction phase.
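As an illustration of how per-character positions and confidences might be turned into words, the sketch below groups the characters of one text line by horizontal gap. The `Char` structure, the `max_gap` threshold, and the choice to give each word the lowest confidence of its characters are all assumptions made for this example. The paper does not specify how words are built.

```python
from typing import NamedTuple


class Char(NamedTuple):
    """One OCR result: character, confidence, and horizontal extent."""
    text: str
    confidence: float
    left: int
    right: int


def build_words(chars, max_gap=6):
    """Group the characters of one text line into words.

    A new word starts whenever the horizontal gap to the previous
    character exceeds max_gap pixels (an illustrative threshold).
    Each word carries the lowest confidence of its characters.
    """
    words = []
    current = []
    for ch in sorted(chars, key=lambda c: c.left):
        if current and ch.left - current[-1].right > max_gap:
            words.append(current)
            current = []
        current.append(ch)
    if current:
        words.append(current)
    return [
        ("".join(c.text for c in w), min(c.confidence for c in w))
        for w in words
    ]


line = [
    Char("2", 0.98, 0, 8), Char("0", 0.95, 10, 18),
    Char(".", 0.90, 20, 22),
    Char("0", 0.97, 24, 32), Char("3", 0.60, 34, 42),
    Char("E", 0.99, 60, 68), Char("U", 0.99, 70, 78), Char("R", 0.97, 80, 88),
]
print(build_words(line))
```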
Building a Semantic Model for a Domain
Neurogy's semantic domain models are expressed in the neurons and connections between them. A domain is a
hierarchical relationship of tokens and semantic entities, tied together with explicit weighted links between
the neurons themselves. At the lowest level, words and connectors are used on a purely local basis to form elementary tokens.
On succeeding levels, these tokens are joined together by semantic relationships to form semantic entities.
As the complexity of the semantic entity increases, its spatial extent also increases. An underlying
positional assertion grammar (PAG) is an integral part of the extraction engine, and can be influenced
by attributes set in the neurons themselves. The PAG is used to regulate the spatial extent and character of each
neuron that the engine has access to.
Every possible neuron is constructed at every possible location in the document. The highest-level neurons, or
output neurons, can arbitrate between the various instances of themselves to determine which
(if any) is the final result that will be passed to the workflow system. Output neurons are also capable of
synonym mapping, where a given output value is automatically translated to a value that does not occur in the text
of the document. These neurons can also allow multiple instances of themselves to output, or can vote based on summed
activations or simple instance count.
There are over 40 different parameters that can be set in a neuron. The following basic example shows how some of
those parameters work together to build a date.
Example: European-Style Date
Here are three low-level neurons, designed to find every possible instance of a part of a European-style (dd-mm-yyyy)
date. Only a few of the available parameters are shown, to aid discussion.
<neuron>
NAME:EuroDayOfMonth
TYPE: Token
OUTPUTACTIVATION:0.60000 // 0.6 activation must be achieved or this neuron will not persist
EQUALSEPACTIVATION:0.300000 // the separator to the left of the day should not be the same as the separator to the right (dd.mm.yyyy)
ICOUNT:1|0|-1|0.000000|0.500000|0|0|0 // one range only
INTLIST0:1.000000|31.000000|0.900000 // any number from 1 to 31
LSEPCOUNT:1|0|0|0.000000|0.000000 // only one separator (a space) possible to the left
LSEPLIST0:1.000000|
RSEPCOUNT:4|0|0|0.000000|0.000000 // right separators can include a space, period, dash, or slash
RSEPLIST0:0.400000|
RSEPLIST1:0.900000|.
RSEPLIST2:0.400000|-
RSEPLIST3:0.400000|/
</neuron>
<neuron>
NAME:EuroMonthOfYear
TYPE: Token
OUTPUTACTIVATION:0.60000 // 0.6 activation must be achieved or this neuron will not persist
EQUALSEPACTIVATION:0.800000 // the separator to the left of the month should be the same as the separator to the right (dd.mm.yyyy)
ICOUNT:1|0|-1|0.000000|0.500000|0|0|0
INTLIST0:1.000000|12.000000|0.600000 // any number from 1 to 12
WCOUNT:13|0|0.000000|0.900000
WORDLIST0:0.600000|JANUAR // or any word from the following list
WORDLIST1:0.600000|FEBRUAR
WORDLIST2:0.600000|MARZ
WORDLIST3:0.600000|MERZ
WORDLIST4:0.600000|APRIL
WORDLIST5:0.600000|MAI
WORDLIST6:0.600000|JUNI
WORDLIST7:0.600000|JULI
WORDLIST8:0.600000|AUGUST
WORDLIST9:0.600000|SEPTEMBER
WORDLIST10:0.600000|OKTOBER
WORDLIST11:0.600000|NOVEMBER
WORDLIST12:0.600000|DEZEMBER
LSEPCOUNT:4|0|0|0.000000|0.000000 // left separators can include a space, period, dash, or slash
LSEPLIST0:0.400000|
LSEPLIST1:0.900000|.
LSEPLIST2:0.400000|-
LSEPLIST3:0.400000|/
RSEPCOUNT:4|0|0|0.000000|0.000000 // right separators can include a space, period, dash, or slash
RSEPLIST0:0.400000|
RSEPLIST1:0.900000|.
RSEPLIST2:0.400000|-
RSEPLIST3:0.400000|/
</neuron>
<neuron>
NAME:Year
OUTPUTACTIVATION:0.60000 // 0.6 activation must be achieved or this neuron will not persist
EQUALSEPACTIVATION:0.300000 // the separator to the left of the year should not be the same as the separator to the right (dd.mm.yyyy)
ICOUNT:2|0|-1|0.000000|0.500000|0|0|0 // any number from 0 to 99, or 1900 to 2010
INTLIST0:1900.000000|2010.000000|0.900000
INTLIST1:0.000000|99.000000|0.600000
WCOUNT:0|0|0|0|0
LSEPCOUNT:4|0|0|0.000000|0.000000 // left separators can include a space, period, dash, or slash
LSEPLIST0:0.500000|
LSEPLIST1:1.000000|.
LSEPLIST2:0.400000|-
LSEPLIST3:0.400000|/
RSEPCOUNT:1|0|0|0.000000|0.000000 // only one separator (a space) possible to the right
RSEPLIST0:1.000000|
</neuron>
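The definitions above follow a simple line format: a parameter name, a colon, pipe-separated fields, and an optional `//` comment. The sketch below reads one definition into a dictionary. It is an illustration based only on the listings in this paper, not the engine's actual loader, and `parse_neuron` is a hypothetical name. Because it strips whitespace from every field, a separator that is a literal space comes back as an empty string.

```python
def parse_neuron(text):
    """Parse one neuron definition into {parameter name: [fields]}."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("//", 1)[0]   # drop the trailing comment
        if ":" not in line:
            continue                   # skip neuron tags and blank lines
        key, _, value = line.partition(":")
        params[key.strip()] = [f.strip() for f in value.split("|")]
    return params


sample = """<neuron>
NAME:Year
OUTPUTACTIVATION:0.60000 // 0.6 activation must be achieved
INTLIST0:1900.000000|2010.000000|0.900000
INTLIST1:0.000000|99.000000|0.600000
LSEPLIST3:0.400000|/
</neuron>"""

params = parse_neuron(sample)
print(params["NAME"])
print(params["INTLIST0"])
```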
With these three neurons, we have the basis for turning parsed tokens into the lowest level semantic entities. The number
20, for example, may be the day of a month, the year, an amount of money, or any number of other things. The job of
each neuron is to ask the question, "Could this token or entity be an instance of myself, and if so, how closely does
it and its environment match the ideal instance of myself?" The year could still be activated even if it violates one
of the three separator rules encoded. It might even be activated if it violates two of them. It will not if it violates
all three, or if the number falls outside the range specified. However, if the original number on the form was 2003 and
the OCR returns 003 because the 2 was too faint to read, the neuron will still activate because the value hits the second
range.
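The range check in the 2003 versus 003 case can be sketched as follows. The function name and data layout are hypothetical, and reading the third field of each INTLIST line as a match weight is an assumption. The sketch covers only the integer ranges, and leaves out how the engine combines that weight with the separator rules.

```python
def match_int_ranges(token, ranges):
    """Return the weight of the first range containing the token's value.

    ranges is a list of (low, high, weight) tuples, mirroring the three
    fields of the INTLIST lines in the Year neuron. Returns None when the
    token is not a number or falls outside every range.
    """
    if not token.isdecimal():
        return None
    value = int(token)
    for low, high, weight in ranges:
        if low <= value <= high:
            return weight
    return None


# Ranges copied from INTLIST0 and INTLIST1 of the Year neuron.
YEAR_RANGES = [(1900, 2010, 0.9), (0, 99, 0.6)]

print(match_int_ranges("2003", YEAR_RANGES))  # full year, first range
print(match_int_ranges("003", YEAR_RANGES))   # OCR dropped the 2, second range
print(match_int_ranges("2050", YEAR_RANGES))  # outside both ranges
```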
With these basic token neurons activated, it is time to demonstrate a slightly more complex neuron - the actual date
assembler:
<neuron>
NAME:EuroLongDate
GROW:HORIZONTAL_ONLY // allow horizontal growth only
FAILONSKIP:1.000000 // Do not allow misc neurons to come between the pieces
FAILONORDER:1.000000 // Strictly enforce the stated order of subneurons
OUTPUTFUNC:5 // Output to higher level in date format
OUTACT:0.100000 // 0.1 activation must be achieved or this neuron will not persist
EQUALSEPACTIVATION:0.900000 // the separator to the left of the date should be the same as the separator to the right
ACTIVATIONCOUNT:3|0.500000|0.000000|0.000000 // Three subneurons activate this neuron
ACTIVATION0:0|0.500000|1.000000|-1|1|1|0.000000|0.000000 // The first neuron is the day, output weight .5, mandatory is 1 (absolutely required)
ACTIVATION1:1|0.900000|1.000000|-1|1|1|0.000000|0.000000 // The second neuron is the month, output weight .9, mandatory is 1 (absolutely required)
ACTIVATION2:2|0.700000|1.000000|-1|1|1|0.000000|0.000000 // The last neuron is the year, output weight .7, mandatory is 1 (absolutely required)
</neuron>
In the domain described by this neuron, no date can exist unless it has all three of the required subelements. In other
domains, it is possible that order of element presence could be less strictly enforced. In that case, there would be
more instances of the date neuron available to bind to attributes as described in the next section. When describing
a domain, the best results will be achieved when the tightest restrictions are placed on low and mid-level neurons
such as those in this example.
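The strict ordering and adjacency rules of the date neuron can be illustrated with a short sketch. It assumes each token on a line already carries the set of token neurons that activated for it, and it ignores weights, separators, and activation thresholds entirely. The function name and data layout are hypothetical.

```python
def assemble_dates(tokens):
    """Find day-month-year triples among adjacent tokens on one line.

    tokens is a left-to-right list of (kinds, text) pairs, where kinds is
    the set of token neurons that activated for that text. Mirroring
    FAILONSKIP and FAILONORDER, the three parts must be adjacent and in
    order, and all three are mandatory.
    """
    order = ("EuroDayOfMonth", "EuroMonthOfYear", "Year")
    dates = []
    for i in range(len(tokens) - 2):
        window = tokens[i:i + 3]
        if all(kind in kinds for kind, (kinds, _) in zip(order, window)):
            dates.append(tuple(text for _, text in window))
    return dates


line = [
    ({"EuroDayOfMonth", "EuroMonthOfYear", "Year"}, "12"),
    ({"EuroDayOfMonth", "EuroMonthOfYear", "Year"}, "03"),
    ({"Year"}, "2003"),
    (set(), "EUR"),
]
print(assemble_dates(line))
```

Note that "12" and "03" each activate all three token neurons. Only the position within the triple decides which role they play.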
Data Extraction
Once we have built the basic semantic entities, such as dates, amounts, quantities, part numbers, names, places,
addresses, and others, useful data must be extracted by assigning attributes to those entities. This is accomplished
through the use of attribute neurons. Attribute neurons make much more extensive and complex use of the positional
assertion grammar. They also enforce competition and superposition rules to iteratively decide which attribute
each semantic entity actually has, if any, in the domain.
To maximize accuracy, text in a two-dimensional field is segmented into visual blocks using a moderately complex set
of alignment and separation rules. Most attributes must be assigned within a block or within adjacent blocks.
Only certain types of neurons, such as the InvoiceTotal neuron, can jump multiple blocks to associate value with
attribute. The attribute behavior is encoded in the neuron itself. This
eliminates the distance association problem that is common in programs that attempt freeform extraction.
As higher level neurons are constructed from lower level neurons or tokens, the instances of these neurons track
the text used in their contributors with a simple concatenation operation. Neurons designated as output neurons
by the input XML file use three different selection mechanisms and nine different
output methods to transfer their data to the output database. The selection mechanisms are:
Highest activation (the strongest neuron on the page wins)
Highest count (the neurons with the most occurrences of the same string of text win)
Full output (all neurons on the page are output)
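The three selection mechanisms can be sketched as follows, assuming each neuron instance on a page is reduced to a hypothetical `(text, activation)` pair. The function names are illustrative and are not part of the product.

```python
from collections import Counter


def select_highest_activation(instances):
    """Return the single (text, activation) pair with the top activation."""
    return max(instances, key=lambda inst: inst[1])


def select_highest_count(instances):
    """Return the text that occurs most often (position is not tracked)."""
    counts = Counter(text for text, _ in instances)
    return counts.most_common(1)[0][0]


def select_full_output(instances):
    """Return every instance unchanged."""
    return list(instances)


page = [("1,250.00", 0.72), ("1,250.00", 0.65), ("1,520.00", 0.80)]

print(select_highest_activation(page))
print(select_highest_count(page))
print(select_full_output(page))
```

In this example the two mechanisms disagree: the strongest single instance reads "1,520.00", while the vote favors "1,250.00".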
When all neurons of a certain type are output for a given page, selection generally takes place at the document
level, as described below. Position information is lost for the second selection mechanism, since it becomes
ambiguous in a voting scenario. The output methods, which decide which text to gather from the low level
neurons and how to present it, include the following:
Separated strings (all strings at the bottom of the activation tree become part of the text, each separated by its original separator)
Joined strings (all strings at the bottom of the activation tree become part of the text, without separation)
Numeric (the leftmost neuron's text is converted to an integer)
Float (the leftmost neuron's text is converted to a float)
Date (the three contributing date neurons are converted to a standard date format)
TextArea (all text in the visual block is output with return characters)
TextArea, no header (all text in the visual block except the first line, if it is a header, is output with return characters)
Synonym (outputs are translated to predefined values)
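A few of these output methods can be sketched as follows. The sketch assumes the strings at the bottom of the activation tree are available as hypothetical `(text, separator)` pairs in left-to-right order. The ISO date format stands in for the unspecified "standard date format", and only numeric months are handled.

```python
from datetime import date


def separated_strings(leaves):
    """Join leaf texts, each followed by its original separator."""
    return "".join(text + sep for text, sep in leaves)


def joined_strings(leaves):
    """Join leaf texts without any separators."""
    return "".join(text for text, _ in leaves)


def numeric(leaves):
    """Convert the leftmost leaf's text to an integer."""
    return int(leaves[0][0])


def as_float(leaves):
    """Convert the leftmost leaf's text to a float."""
    return float(leaves[0][0])


def as_date(leaves):
    """Convert day, month, and year leaves to an ISO date string."""
    day, month, year = (int(text) for text, _ in leaves)
    return date(year, month, day).isoformat()


# Leaves of a date neuron for the text "20.03.2003".
leaves = [("20", "."), ("03", "."), ("2003", "")]

print(separated_strings(leaves))
print(joined_strings(leaves))
print(numeric(leaves))
print(as_date(leaves))
```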
Since a document may be composed of multiple pages, the final step in providing output data to the workflow system is
to decide which of the multiple occurrences of an output neuron on multiple pages is the best one to use as the final
result. We have found that a modified highest-activation rule provides the best results, where neurons from the middle
pages of a document are used only if their activation is significantly higher than those on the first and last pages
of a document, and neurons from the first page are slightly preferred over those from the last. For purposes of each
neuron class, the "first" page is the first page where that particular type of neuron occurs, not the first page of
the document, since some documents may be preceded by header pages that contain no useful information.
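The modified highest-activation rule can be sketched as follows. The `middle_margin` and `first_bonus` values are hypothetical stand-ins for "significantly higher" and "slightly preferred", since the paper gives no numbers. The sketch also assumes that the "last" page, like the "first", is taken relative to the pages where the neuron type occurs.

```python
def select_across_pages(candidates, middle_margin=0.2, first_bonus=0.05):
    """Pick one (page, text, activation) candidate for a neuron class.

    candidates lists the best instance per page, only for pages where
    the neuron occurs, so the lowest page number is the "first" page.
    """
    if not candidates:
        return None
    ordered = sorted(candidates, key=lambda c: c[0])
    first, last = ordered[0], ordered[-1]
    middle = ordered[1:-1]

    # Prefer the first occurrence page slightly over the last one.
    best_edge = first if first[2] + first_bonus >= last[2] else last

    # A middle page wins only with a clearly higher activation.
    if middle:
        best_middle = max(middle, key=lambda c: c[2])
        if best_middle[2] > max(first[2], last[2]) + middle_margin:
            return best_middle
    return best_edge


# Page 1 is a header page with no instance, so page 2 counts as "first".
weak_middle = [(2, "100.00", 0.70), (3, "999.00", 0.80), (5, "100.00", 0.72)]
strong_middle = [(2, "100.00", 0.70), (3, "999.00", 0.95), (5, "100.00", 0.72)]

print(select_across_pages(weak_middle))
print(select_across_pages(strong_middle))
```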
In many cases, the same kind of information (such as invoice total) can be obtained from several different sources
at the same time, and some of these sources are much more reliable than others. The system can select groups of neurons
as synonyms, and suppress instances of one class when instances of a different class exist, using various weighting
mechanisms to determine the strength of that suppression. A global winner is then selected and presented as output
for that information class.
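One way to picture this suppression is the sketch below, in which a more reliable class scales down the activation of a less reliable one whenever both are present. The multiplicative scaling, the class names, and the suppression strength are all assumptions made for illustration. The paper does not describe the actual weighting mechanisms.

```python
def pick_global_winner(instances, suppression):
    """Return the winning (class_name, text, activation) after suppression.

    instances is a list of (class_name, text, activation) tuples.
    suppression maps (stronger_class, weaker_class) to a strength between
    0 and 1. When the stronger class is present, the weaker class's
    activation is scaled by (1 - strength).
    """
    present = {cls for cls, _, _ in instances}
    adjusted = []
    for cls, text, activation in instances:
        factor = 1.0
        for (stronger, weaker), strength in suppression.items():
            if weaker == cls and stronger in present:
                factor *= 1.0 - strength
        adjusted.append((cls, text, activation * factor))
    return max(adjusted, key=lambda inst: inst[2])


# Hypothetical classes: a labeled total is trusted more than a column sum.
instances = [
    ("TotalFromLabel", "1,250.00", 0.70),
    ("TotalFromColumnSum", "1,205.00", 0.85),
]
suppression = {("TotalFromLabel", "TotalFromColumnSum"): 0.5}

winner = pick_global_winner(instances, suppression)
print(winner[0], winner[1])
```

Without the suppression rule the column sum would win on raw activation. With it, the labeled total is selected.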
The final output format for the system is an XML file with fields selected by the user's input (configuration)
XML file. At this point, extraction is finished and the workflow system moves the document to the correction phase
if confidences are low, or to the final phase if they are sufficiently high.
Performance Issues
Even when running with a full complement of 80-100 neurons on an older Pentium II, this system can complete an extraction
on a page in well under a second. Such a system is actually in use commercially, and extracts over 20 fields of interest,
including three kinds of invoice dates, supplier address, orderer, sales tax number, supplier number, and others. The
system also performs lookups into customer databases to verify and correct extracted information. The system footprint
is about 200 MB of memory and about 10 MB of disk, so it is important to supply the system with a large amount of RAM
to optimize performance.
Most field types can be extracted with an accuracy in excess of 90%, and some in excess of 99%. The limiting factor
for the system's accuracy is the input OCR accuracy, which is in turn limited by the quality of the documents. Where
extracted fields rely on the successful extraction of three or more tokens (the field value as well as the attribute), all
of those tokens must have reasonable OCR accuracy or the output neuron will not be activated. Prime gets results close to
100% with clean typewritten documents, and results close to 70% with faxed documents. For every percent of accuracy
lost in OCR, about 2% of accuracy is lost in the extraction.
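The rule of thumb above amounts to simple arithmetic, sketched below. The 96% baseline is a hypothetical field accuracy chosen for the example, and the linear relationship is only the approximation stated in the text, not a measured model.

```python
def estimate_extraction_accuracy(baseline_pct, ocr_loss_pct, loss_factor=2.0):
    """Apply the rule of thumb: each point of OCR accuracy lost costs
    about loss_factor points of extraction accuracy (floored at zero)."""
    return max(0.0, baseline_pct - loss_factor * ocr_loss_pct)


# Hypothetical field that extracts at 96% when OCR is near perfect.
for ocr_loss in (0, 1, 3, 5):
    print(ocr_loss, estimate_extraction_accuracy(96.0, ocr_loss))
```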
In human terms, the performance of the system is high. A single person working with the system to enter data from 20 fields of an invoice can
process about 200 to 400 invoices an hour, since low-confidence fields are highlighted for their benefit.
Conclusion
Neurogy's neurosemantic extraction system is a strong addition to the data acquisition efforts of any enterprise. It
delivers sufficient results to eliminate 75% of a manual workforce performing the same tasks. As we continue to
optimize both the extraction technology and the OCR preprocessing technology, the extraction accuracy will continue
to rise.