Last modified: February 13, 2012
This documentation is for those who want to use the toolkit for OCR, but are not interested in extending the toolkit itself.
The toolkit provides the functionality to segment an image page into text lines, words and characters, to sort them in reading-order, and to generate an output string.
Before you can use the OCR toolkit, you must first train characters from sample pages; this training data is then used by the toolkit to classify characters.
Hence the proper use of this toolkit requires the following two steps:
There are two ways to use this toolkit: you can either use the script ocr4gamera.py provided by the toolkit, or you can build your own recognition scripts with the aid of the Python library functions the toolkit provides. Both alternatives are described below.
The ocr4gamera.py script takes an image and previously trained data, and segments the image into single glyphs. The training data is used to classify these glyphs and convert them into strings. The final text is written to standard output or can optionally be stored in a text file. A word-by-word correction can also be performed on the recognized text.
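The idea behind such a word-by-word correction can be sketched with the standard library's difflib module. This is only an illustration of the general technique (matching each recognized word against a dictionary), not the toolkit's actual implementation:

```python
# Sketch of a word-by-word correction step (not the toolkit's
# implementation): each recognized word is replaced by its
# closest match from a dictionary of known words.
import difflib

def correct_word(word, dictionary):
    """Return the closest dictionary word, or the word itself
    if no sufficiently similar entry exists."""
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=0.6)
    return matches[0] if matches else word

# Hypothetical recognized words with one OCR error ("herlo").
words = ["herlo", "world"]
corrected = [correct_word(w, ["hello", "world"]) for w in words]
```

A real correction step would use a larger dictionary and possibly weight substitutions by how easily the classifier confuses the characters involved.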
The end-user application ocr4gamera.py will be installed to /usr/bin unless you have explicitly chosen a different location. Its synopsis is:
ocr4gamera.py -x <trainingdata> [options] <imagefile>
Options can be given in short form (one dash, one character) or long form (two dashes, string). When called with -h, --help, or an invalid option, a usage message is printed. The other options are:
Use a user-defined translation table of class names to character strings. The csv_file must contain a list of comma-separated pairs (classname, output), one pair per line, as in the following example (the output string after the comma can be any string of Unicode characters):
latin.small.ligature.st,st
latin.small.ligature.ft,ft
latin.small.letter.long.s,s
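A table in this format is easy to read with Python's csv module. The following is a minimal sketch (the function name and sample data are illustrative, not part of the toolkit):

```python
# Sketch (not part of the toolkit): parse a translation table of
# comma-separated (classname, output) pairs into a dictionary.
import csv
from io import StringIO

def read_translation_table(fileobj):
    """Return a dict mapping class names to output strings."""
    table = {}
    for row in csv.reader(fileobj):
        if not row:
            continue  # skip blank lines
        classname, output = row[0], row[1]
        table[classname] = output
    return table

# Hypothetical sample; a real file would be opened with open(csv_file).
sample = "latin.small.ligature.st,st\nlatin.small.ligature.ft,ft\n"
table = read_translation_table(StringIO(sample))
```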
If you want to write your own scripts for recognition, you can use ocr4gamera.py as a good starting point.
In order to access the OCR Toolkit classes and functions, you must import them at the beginning of your script:
from gamera.toolkits.ocr.ocr_toolkit import *
from gamera.toolkits.ocr.classes import Textline,Page,ClassifyCCs
After that you can segment an image with the Page class and its method segment():
img = load_image("image.png")
if img.data.pixel_type != ONEBIT:
    img = img.to_onebit()
result_page = Page(img)
result_page.segment()
The Page object result_page now contains all segmentation information, such as text lines, words, and characters in reading order. You can then classify the characters line by line with a kNN classifier and print the document text:
# load training data into classifier
from gamera import knn
cknn = knn.kNNInteractive([], \
        ["aspect_ratio", "moments", "volume64regions"], 0)
cknn.from_xml_filename("trainingdata.xml")

# classify characters and create output text
for line in result_page.textlines:
    line.glyphs = \
        cknn.classify_and_update_list_automatic(line.glyphs)
    line.sort_glyphs()
    print "Text of line", textline_to_string(line)
Note that the function textline_to_string is global and not bound to a class instance. This function requires that class names for characters have been chosen according to the standard unicode character names, as in the examples of the following table:
Character | Unicode Name | Class Name |
---|---|---|
! | EXCLAMATION MARK | exclamation.mark |
2 | DIGIT TWO | digit.two |
A | LATIN CAPITAL LETTER A | latin.capital.letter.a |
a | LATIN SMALL LETTER A | latin.small.letter.a |
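As the table suggests, a class name is simply the official Unicode character name, lowercased and with spaces replaced by dots. Under that assumption, the mapping back to a character can be sketched with the standard library's unicodedata module (this is an illustration, not the toolkit's textline_to_string implementation):

```python
# Sketch (not the toolkit's implementation): recover a character
# from a class name that follows the Unicode naming convention.
import unicodedata

def classname_to_char(classname):
    """Convert a class name such as 'latin.capital.letter.a'
    back to the character it denotes ('A')."""
    unicode_name = classname.replace(".", " ").upper()
    return unicodedata.lookup(unicode_name)
```

A class name that does not correspond to a Unicode character name (e.g. a ligature class mapped to several characters) would need an explicit translation table such as the one accepted by ocr4gamera.py.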
For more information on how to control the segmentation process in detail, see the developer's manual.