How it works
Here is a broad outline of how the Quixate text location and reading
system works. The description is divided into two main sections:
first we address the problem of finding the text within an image,
and then we look at how to read the text that we have found.
Part of a typical photographic image, magnified to show
A. Finding text
Stage A.1: Preprocessing
We first subject the photograph to a number of image processing algorithms
that help to eliminate the background along with
many non-text features while preserving the
characteristics of the interesting parts of the picture.
The baseline of a piece of text is not necessarily straight
Stage A.2: Preliminary text finding
We have two independent approaches to finding candidate regions of an image
that might contain text. The approaches are based on different particular
geometrical characteristics that distinguish text from other objects that
typically occur in photographs. Of course, these methods are not foolproof:
for example, it is easy to imagine natural or man-made features that superficially
resemble the letters ‘H’ or ‘I’. We therefore
allow a number of false positives into the next stage of processing, which
applies a more sophisticated idea of what constitutes text.
The output of this stage is a list of candidate baselines for rows of text,
each with a very rough estimate (within a factor of about 3)
of the character height of that text. The baselines are represented as
general curves within the original image rather than as simple straight lines.
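As an illustration only, the output of this stage could be held in a record like the one below. The class name `CandidateRow`, its fields, and the choice of a polynomial for the baseline curve are all assumptions made for this sketch; the description above says only that baselines are general curves and that the height estimate is good to within a factor of about 3.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CandidateRow:
    """Hypothetical record for one candidate row of text."""
    # Assumed curve model: polynomial y = f(x), coefficients highest power first.
    baseline_coeffs: Tuple[float, ...]
    x_start: float
    x_end: float
    rough_height: float  # rough character height estimate, in pixels

    def baseline_y(self, x: float) -> float:
        # Evaluate the baseline polynomial with Horner's rule.
        y = 0.0
        for c in self.baseline_coeffs:
            y = y * x + c
        return y

    def baseline_points(self, n: int = 5) -> List[Tuple[float, float]]:
        # Sample n evenly spaced points along the curve (requires n >= 2).
        step = (self.x_end - self.x_start) / (n - 1)
        xs = [self.x_start + i * step for i in range(n)]
        return [(x, self.baseline_y(x)) for x in xs]

    def height_range(self, factor: float = 3.0) -> Tuple[float, float]:
        # Interpret "within a factor of about 3" as the interval [h / 3, 3 * h].
        return (self.rough_height / factor, self.rough_height * factor)


# A gently curved baseline with a rough height estimate of 12 pixels.
row = CandidateRow(baseline_coeffs=(0.001, 0.0, 100.0),
                   x_start=0.0, x_end=200.0, rough_height=12.0)
points = row.baseline_points(3)
low, high = row.height_range()
```

Keeping the height as an interval rather than a single number makes it explicit that later stages are expected to narrow it down.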
Estimating the height of the text
Stage A.3: Refinement
In the next stage each candidate row of text is inspected more closely.
Several statistical measures are computed in order to make a more accurate
judgement as to whether the row actually contains text or whether it
is some other feature in the photograph. For those rows that pass this
test, we also obtain a better estimate of the character height.
Again, these tests are not perfect, and so we let some false positives through
to the next stage. Some of the rows that are definitely rejected as text are
nevertheless still judged to be useful
in the next stage (where we analyse perspective), and so we retain them
separately from the successful rows.
We build a three-dimensional model of the scene
Rectified view of an object in the image
Stage A.4: Perspective
Perspective is one of the main features that distinguish text in photographs
from text in a scanned document. It is very common for the text to be in a plane that
is not quite perpendicular to the axis of the camera, with the result that
the characters in a line of text appear bigger at one end
than at the other. This difference alone is enough to make many traditional
OCR (optical character recognition) systems fail on photographic images.
Our system builds up a three-dimensional model of the scene in the photograph.
Each row of text is assigned to a plane lying in three dimensions (which can be
at any orientation with respect to the axis of the camera). We take advantage
of the fact that a single plane may contain several rows of text: consider a road
sign, for example. Each plane can then be ‘rectified’: in other words,
we can work out what the plane would look like if its photograph had been taken
square-on. The images here show rectification in action.
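Rectifying a plane is commonly expressed as a planar homography. The following is a minimal sketch of that standard technique, not a description of how Quixate itself builds or uses its three-dimensional model. It assumes that the four corners of a planar region (such as a road sign) are already known in the photograph; the function names and the example coordinates are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def homography(src: Sequence[Point], dst: Sequence[Point]) -> List[float]:
    """Return h0..h7 (h8 is fixed at 1) mapping four src points onto dst."""
    a: List[List[float]] = []
    b: List[float] = []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        a.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    return solve_linear(a, b)


def apply_h(h: Sequence[float], p: Point) -> Point:
    """Map one image point through the homography (perspective division)."""
    x, y = p
    w = h[6] * x + h[7] * y + 1.0
    return ((h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w)


# A sign seen at an angle: it appears taller on the left than on the right.
quad = [(10.0, 10.0), (110.0, 30.0), (110.0, 70.0), (10.0, 90.0)]
# The same sign as it would appear square-on.
rect = [(0.0, 0.0), (200.0, 0.0), (200.0, 80.0), (0.0, 80.0)]
H = homography(quad, rect)
rectified = [apply_h(H, p) for p in quad]
```

The division by `w` in `apply_h` is what undoes the effect described above, where characters appear bigger at one end of a line than at the other.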
Once we have our rectified rows of text (plus a few false positives that
we have let through) we can proceed to try to read them.
Naïve thresholding gives poor results
B. Reading text
Stage B.1: Character segmentation
At this point in the process we have a collection of rectified rows of text.
The first job is to separate, or ‘segment’,
each row into its constituent characters. The natural
approach of thresholding the image to separate the characters into connected
components is unfortunately very poor at doing this. Traditional OCR systems
that use thresholding usually only work satisfactorily with very large characters
(typically many tens of pixels high)
or employ a range of ad hoc techniques to try to undo the damage done
to the image by the thresholding process. While these techniques
may work for a scanned
image, the various artefacts and distortions present in photographs make
them unsuitable in our case. As the example here illustrates, there is often
no threshold value that you can choose which will separate all the characters
from one another without causing some of the characters to break up.
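The problem can be reproduced with a toy example. The sketch below uses an invented one-dimensional intensity profile across a row of dark text on a light background: two characters are joined by a blurred pixel, and a third has a faint stroke in the middle. The profile values and the function name `dark_runs` are made up for illustration.

```python
from typing import List, Tuple


def dark_runs(profile: List[int], threshold: int) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of consecutive pixels darker than threshold."""
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value < threshold:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(profile) - 1))
    return runs


# Invented profile: characters at indices 1-3, 5-7 and 10-12.
# Index 4 is blur joining the first two characters (value 80);
# index 11 is a faint stroke inside the third character (value 120).
profile = [220, 40, 50, 40, 80, 45, 40, 50, 220, 220, 60, 120, 60, 220]
truth = [(1, 3), (5, 7), (10, 12)]

# Try every 8-bit threshold and keep those that recover the true segmentation.
good_thresholds = [t for t in range(256) if dark_runs(profile, t) == truth]

low_result = dark_runs(profile, 70)    # third character breaks in two
high_result = dark_runs(profile, 100)  # first two characters merge
```

Separating the first two characters needs a threshold of 80 or below, while keeping the third character whole needs a threshold above 120, so `good_thresholds` comes out empty.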
We use a more sophisticated segmentation algorithm that
is intimately bound with the character recognition algorithms described below
and which takes full advantage of the dynamic range of the original image.
In this way we avoid making irrevocable
segmentation decisions until the last possible
moment, and as a result our system is capable of reading text down to just
a few pixels in height.
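One common way to defer segmentation decisions is to score many candidate character spans and let dynamic programming choose the best overall reading. The sketch below shows that general idea only; it is not a description of the Quixate algorithm. The function `best_reading`, the `scorer` callback and the toy scores are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Hypothesis = Tuple[str, float]  # (character label, log-probability)


def best_reading(n_cols: int,
                 scorer: Callable[[int, int], Optional[Hypothesis]],
                 max_width: int) -> Optional[Tuple[float, str]]:
    """Find the best-scoring split of columns [0, n_cols) into characters.

    scorer(start, end) returns the recogniser's best label and score for the
    span of columns [start, end), or None if the span is not a plausible character.
    """
    # best[j] holds (total log-probability, text) for columns [0, j).
    best: List[Optional[Tuple[float, str]]] = [None] * (n_cols + 1)
    best[0] = (0.0, "")
    for end in range(1, n_cols + 1):
        for start in range(max(0, end - max_width), end):
            prev = best[start]
            if prev is None:
                continue
            hyp = scorer(start, end)
            if hyp is None:
                continue
            label, logp = hyp
            cand = (prev[0] + logp, prev[1] + label)
            current = best[end]
            if current is None or cand[0] > current[0]:
                best[end] = cand
    return best[n_cols]


# Toy scores: the same six columns could be read as "r" + "n" or as a single "m".
toy_scores = {(0, 3): ("r", -1.0), (3, 6): ("n", -1.2), (0, 6): ("m", -1.5)}
result = best_reading(6, lambda s, e: toy_scores.get((s, e)), max_width=6)
```

Here no cut is committed to in advance: the choice between one wide character and two narrow ones is made only after every candidate span has been scored.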
An assortment of ‘a’s
Stage B.2: Character recognition
We recognise characters using a flexible statistical model that is not specific
to a given font. Among other things, the model is capable of modelling bold,
condensed, warped and blurred characters directly. The model is often working
at close to information-theoretic limits, especially where the text in the original
image is only a few pixels high. The model can easily be adapted to non-Roman alphabets.
Word modelling based on state machines
Stage B.3: From characters into words
The final step is to assemble the recognised characters into words.
In general we find that
text in photographs, unlike text in scanned images, tends to include
a large number of comparatively rare words, and in particular proper names. It is
therefore not adequate to take the top-ranking characters from the character
recogniser and pass them through a spell-checker because no reasonably sized
dictionary will contain even 95% of the words we might want to read.
We therefore have two statistical models, based on
state machines, for the text we read: one for
‘in-vocabulary’ words (based on English language corpora) and
one for ‘out-of-vocabulary’ (‘OOV’) words.
Our OOV model also deals with numbers and punctuation. Again, the
word models can easily be changed to suit languages other than English.
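The interplay between the two word models can be sketched with a deliberately tiny example. Everything below is an assumption made for illustration: the two-word vocabulary, the uniform out-of-vocabulary model over 40 symbols, the 5% prior on a word being out of vocabulary, and the brute-force search, which stands in for the state machines mentioned above.

```python
import itertools
import math
from typing import Dict, List, Tuple

# Toy in-vocabulary model: word probabilities from an imaginary corpus.
VOCAB_LOGP = {"stop": math.log(0.6), "shop": math.log(0.4)}
P_OOV = 0.05    # assumed prior probability that a word is out of vocabulary
N_SYMBOLS = 40  # assumed OOV alphabet size (letters, digits, punctuation)


def word_logp(word: str) -> Tuple[float, str]:
    """Score a word under both models and return the better score and model name."""
    oov = math.log(P_OOV) + len(word) * math.log(1.0 / N_SYMBOLS)
    if word in VOCAB_LOGP:
        in_vocab = math.log(1.0 - P_OOV) + VOCAB_LOGP[word]
        if in_vocab >= oov:
            return in_vocab, "in-vocabulary"
    return oov, "out-of-vocabulary"


def read_word(char_hyps: List[Dict[str, float]]) -> Tuple[str, float, str]:
    """Combine per-position character log-probabilities with the word models."""
    best = None
    # Exhaustive search over all character combinations; fine for a toy example.
    for chars in itertools.product(*char_hyps):
        word = "".join(chars)
        char_score = sum(h[c] for h, c in zip(char_hyps, chars))
        prior, model = word_logp(word)
        cand = (word, char_score + prior, model)
        if best is None or cand[1] > best[1]:
            best = cand
    return best


def sure(c: str) -> Dict[str, float]:
    # A position where the recogniser is certain of the character.
    return {c: 0.0}


# The recogniser slightly prefers the digit '0' to the letter 'o' in position 3.
hyps = [sure("s"), sure("t"), {"0": math.log(0.55), "o": math.log(0.45)}, sure("p")]
word = read_word(hyps)

# A string that is not in the vocabulary is still readable via the OOV model.
name = read_word([sure(c) for c in "zyx"])
```

Taking the top-ranking characters alone would give "st0p"; the word models overturn that choice, while the out-of-vocabulary model still lets unfamiliar strings through with a lower score.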
The result of this stage is a string (the text that has been read),
accompanied by a score which represents how confident the system is that
its answer is correct.
Sample of typical output
Stage B.4: Filtering and output
Filtering involves first rejecting any rows that are determined by the
reading process to be false positives that have slipped through the earlier stages
of processing. Second, it can happen that the system finds two rows that
cover the same region of the image. Normally in this case the reading
process is much more confident of one of these results than it is of the other,
and the poorer row is discarded.
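The two filtering steps can be sketched as follows. The `ReadRow` record, the overlap measure and both threshold values are assumptions chosen for this example; the description above does not say how overlap is measured or what confidence is required.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in image pixels


@dataclass
class ReadRow:
    """Hypothetical record for one row after reading."""
    box: Box
    text: str
    score: float  # confidence from the reading stage, higher is better


def overlap_fraction(a: Box, b: Box) -> float:
    """Intersection area divided by the area of the smaller box."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    smaller = min((a[2] - a[0]) * (a[3] - a[1]),
                  (b[2] - b[0]) * (b[3] - b[1]))
    return ix * iy / smaller if smaller > 0 else 0.0


def filter_rows(rows: List[ReadRow],
                min_score: float = 0.5,
                max_overlap: float = 0.5) -> List[ReadRow]:
    kept: List[ReadRow] = []
    # Visit rows from most to least confident so the better row always wins.
    for row in sorted(rows, key=lambda r: r.score, reverse=True):
        if row.score < min_score:
            continue  # step 1: reject likely false positives
        if all(overlap_fraction(row.box, k.box) <= max_overlap for k in kept):
            kept.append(row)  # step 2: keep only if no better row covers it
    return kept


rows = [
    ReadRow((0, 0, 100, 20), "EXIT", 0.9),
    ReadRow((2, 1, 98, 19), "EX1T", 0.6),    # same region, less confident
    ReadRow((0, 50, 80, 70), "|||||", 0.1),  # false positive, e.g. a fence
    ReadRow((0, 100, 60, 115), "Gate 4", 0.8),
]
kept_texts = [r.text for r in filter_rows(rows)]
```

In this example the low-scoring fence-like row is rejected outright, and "EX1T" is discarded because the more confident "EXIT" covers the same region.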
Finally, the filtered results are output in the form of a plain text database
containing details of the location and size of each row of text found, and
the text it contains.