OCR toolkit for Gamera

Editor: Rene Baston, Christoph Dalitz
Version: 1.1.0

Use the 'Addons' section on the Gamera home page for access to file releases of this toolkit.

Overview

The purpose of the OCR toolkit is to help build optical character recognition (OCR) systems for standard text documents. Although it can be used as is, it is specifically designed so that individual steps of the recognition system can be customized and replaced. The toolkit is an addon package for Gamera: it is based on, and requires, the Gamera framework for document analysis and recognition.

A comprehensive overview of design, usage and customization of the OCR toolkit can be found in the paper

C. Dalitz, R. Baston: Optical Character Recognition with the Gamera Framework. In C. Dalitz (Ed.): "Document Image Analysis with the Gamera Framework." Schriftenreihe des Fachbereichs Elektrotechnik und Informatik, Hochschule Niederrhein, vol. 8, pp. 53-65, Shaker Verlag (2009)

The recognition process

Optical character recognition (OCR) means the extraction of machine-readable text from bitmap images of text documents. This process typically consists of the following steps:

Preprocessing:
Includes binarization, skew correction, image enhancement, text/graphics separation
Segmentation:
Segmentation of the page into text lines (page segmentation) and of the text lines into characters (character segmentation)
Classification:
Identification of the individual characters
Postprocessing:
Includes the generation of the output string and possibly the detection and correction of errors

The OCR toolkit only covers the process from segmentation to postprocessing. For preprocessing, the standard routines shipped with Gamera must be used beforehand, e.g. rotation_angle_projections for skew correction, or despeckle for noise removal.
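The preprocessing mentioned above could be sketched roughly as follows. This is only an illustration, not code from the toolkit: the helper name preprocess and the parameter speckle_size are made up, and the call signatures are assumptions that should be checked against the Gamera plugin documentation (rotation_angle_projections is assumed to return the angle as its first element, rotate to take the angle and a background pixel value, and despeckle to modify the image in place).

```python
def preprocess(onebit_image, speckle_size=5):
    """Deskew and despeckle a Gamera onebit image before OCR (sketch)."""
    # Assumption: the plugin returns a sequence whose first element
    # is the estimated skew angle.
    angle = onebit_image.rotation_angle_projections()[0]
    # Assumption: the second argument is the background pixel value
    # used to fill the corners exposed by the rotation.
    deskewed = onebit_image.rotate(angle, 0)
    # Assumption: despeckle removes small connected components in place.
    deskewed.despeckle(speckle_size)
    return deskewed
```

With Gamera installed, the argument would typically be an image loaded with load_image from gamera.core and converted with to_onebit, after init_gamera has been called.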

For classification, the kNN classifier shipped with Gamera must be used. In particular, this means that you must train the classifier on some sample pages before doing the classification. At present, the toolkit does not include training databases for common fonts.
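To illustrate why training data is required, here is a minimal, self-contained sketch of the k-nearest-neighbour idea. It is not Gamera's classifier and does not use Gamera's features; the two-dimensional feature vectors and the labels are invented for the example.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(sample, training_set, k=3):
    """Return the majority label among the k nearest training samples."""
    if not training_set:
        # Without trained samples there is nothing to compare against.
        raise ValueError("kNN needs at least one trained sample")
    nearest = sorted(training_set,
                     key=lambda item: euclidean(sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented training data: (feature vector, label)
training = [
    ((0.9, 0.1), "i"),
    ((1.0, 0.2), "i"),
    ((0.2, 0.8), "o"),
    ((0.1, 0.9), "o"),
]
```

An unknown glyph is assigned the label that dominates among its nearest trained neighbours, so the recognition quality depends directly on how well the trained samples cover the fonts in the document.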

Provided Components

The toolkit consists of two Python modules, one image plugin function, and one end-user application.

The modules are

  • classes which contains all class definitions
  • ocr_toolkit for global functions used across the classes

The end-user application is

  • ocr4gamera.py, a script that acts as a basic OCR system

There is also one image plugin, bbox_seg, for textline segmentation. It is simply a wrapper around the Gamera core plugin bbox_segmentation.

Limitations

As the segmentation of the individual characters is based on a connected component analysis, the toolkit cannot deal with touching characters unless they have been trained as ligatures. It is therefore generally applicable only to printed documents, not to handwritten documents.

From a user's perspective, there are some points to be aware of in this toolkit:
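The effect of touching characters on a connected component analysis can be seen in a small, self-contained sketch. It does not use Gamera; the function count_components and the tiny ASCII "images" are invented for illustration, with "#" standing for a black pixel.

```python
from collections import deque

def count_components(rows):
    """Count 8-connected components of '#' pixels in a list of strings."""
    height = len(rows)
    seen = set()
    count = 0
    for y in range(height):
        for x in range(len(rows[y])):
            if rows[y][x] != "#" or (y, x) in seen:
                continue
            # Found an unvisited black pixel: flood-fill its component.
            count += 1
            seen.add((y, x))
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < len(rows[ny])
                                and rows[ny][nx] == "#"
                                and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
    return count

separate = ["#.#", "#.#", "#.#"]  # two strokes with a gap between them
touching = ["#.#", "###", "#.#"]  # the same strokes joined by one pixel
```

Here count_components(separate) yields two components, while a single bridging pixel in touching merges both strokes into one component, which a classifier then sees as a single glyph.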

  • It does not provide methods for text/graphics separation. Hopefully, some generic methods for this purpose will be added at some point in the Gamera core.
  • It does not provide prototypes of Latin characters. This means that characters must be trained on sample pages before using the toolkit.
  • The standard page segmentation algorithm for textline separation is currently very basic.

User's Manual

This documentation is written for those who want to use the toolkit for OCR, but are not interested in extending the toolkit itself.

Developer's Manual

This documentation is for those who want to extend the functionality of the OCR toolkit, or who want to customize specific steps of the recognition process.

Installation

We have only tested the toolkit on Linux and MacOS X, but as the toolkit is written entirely in Python, the following instructions should work for any operating system.

Prerequisites

First you will need a working installation of Gamera 3.x. See the Gamera website for details. It is strongly recommended that you use a recent version, preferably from SVN.

If you want to generate the documentation, you will need two additional third-party Python libraries:

  • docutils for handling reStructuredText documents.
  • pygments for colorizing source code.

Note

It is generally not necessary to generate the documentation because it is included in file releases of the toolkit.

Building and Installing

To build and install this toolkit, go to the base directory of the toolkit distribution and run the setup.py script as follows:

# 1) compile
python setup.py build

# 2) install
sudo python setup.py install

Command 1) compiles the toolkit from the sources and command 2) installs it. As the latter requires root privileges, you need to use sudo on Linux and MacOS X. On Windows, sudo is not necessary.

Note that the script ocr4gamera.py is installed into /usr/bin on Linux, but into /System/Library/Frameworks/Python.framework/Versions/2.x/bin on MacOS X. As the latter directory is not in the standard search path, you can either add it to your search path or additionally install the script into /usr/bin on MacOS X with:

# install scripts into standard path (MacOS X only)
sudo python setup.py install_scripts -d /usr/bin

If you want to regenerate the documentation, go to the doc directory and run the gendoc.py script. The output will be placed in the doc/html/ directory. The contents of this directory can be placed on a webserver for convenient viewing.

Note

Before building the documentation you must install the toolkit. Otherwise gendoc.py will not find the plugin documentation.

Installing without root privileges

The above installation with python setup.py install installs the toolkit system-wide and thus requires root privileges. If you do not have root access (Linux) or are not a sudoer (MacOS X), you can install the OCR toolkit into your home directory. Note, however, that this also requires Gamera to be installed into your home directory. It is currently not possible to install Gamera globally and only the toolkits locally.

Here are the steps to install both Gamera and the OCR toolkit into ~/python:

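# install Gamera locally (run in the Gamera source directory)
mkdir ~/python
python setup.py install --prefix=~/python

# build and install the OCR toolkit locally (run in the toolkit source directory)
# $HOME is used because the shell does not expand ~ directly after -I;
# replace "2.3" with your Python version
export CFLAGS="-I$HOME/python/include/python2.3/gamera"
python setup.py build
python setup.py install --prefix=~/python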

Moreover you should set the following environment variables in your ~/.profile:

# search path for python modules as installed with --prefix
# ("2.3" stands for your Python version)
export PYTHONPATH=~/python/lib/python2.3/site-packages

# search path for executables (e.g. gamera_gui)
export PATH=~/python/bin:$PATH

Uninstallation

The installation uses the Python distutils, which do not support uninstallation. Thus you need to remove the installed files manually:

  • the installed Python library files of the toolkit
  • the installed standalone scripts

Python Library Files

All Python library files of this toolkit are installed into the gamera/toolkits/ocr subdirectory of the Python library folder. Thus it is sufficient to remove this directory for an uninstallation.

The location of the Python library folder depends on your system and Python version. Here are the folders that you need to remove on MacOS X and Debian Linux ("2.3" stands for the Python version; replace it with your actual version):

  • MacOS X: /Library/Python/2.3/gamera/toolkits/ocr
  • Debian Linux: /usr/lib/python2.3/site-packages/gamera/toolkits/ocr
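If you are unsure where the library folder is on your system, a short script along the following lines can list likely candidates. It is only a sketch: the helper name toolkit_candidates is made up, it assumes a Python version that provides the sysconfig module (2.7 or later), and the reported paths may not match every distribution's layout or a custom --prefix installation. It only prints paths and deletes nothing.

```python
import os
import sysconfig

def toolkit_candidates(subdir=os.path.join("gamera", "toolkits", "ocr")):
    """Return likely install locations of the toolkit's library files."""
    paths = sysconfig.get_paths()
    roots = []
    # "purelib" and "platlib" are the site-packages directories for
    # pure Python and platform-specific modules; they are often identical.
    for key in ("purelib", "platlib"):
        root = paths.get(key)
        if root and root not in roots:
            roots.append(root)
    return [os.path.join(root, subdir) for root in roots]

for candidate in toolkit_candidates():
    status = "found" if os.path.isdir(candidate) else "not present"
    print("%s: %s" % (candidate, status))
```

Run the script with the same Python interpreter that was used for the installation, since each interpreter reports its own library folder.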

Standalone Scripts

The standalone scripts are installed into /usr/bin (Linux) or /System/Library/Frameworks/Python.framework/Versions/2.3/bin (MacOS X), unless you have explicitly chosen a different location with the options --prefix or --home during installation.

For an uninstall, remove the following script:

  • ocr4gamera.py

Note

In older versions (1.0.0 and 1.0.1) this script was named ocr4gamera. If you are upgrading from one of these versions, remove the old script as well.

About this documentation

The documentation was written by Rene Baston and Christoph Dalitz. Permission is granted to copy, distribute and/or modify this documentation under the terms of the Creative Commons Attribution Share-Alike License (CC-BY-SA) v3.0. In addition, permission is granted to use and/or modify the code snippets from the documentation without restrictions.