My own OCR-program in Python My own OCR-program in Python python python

My own OCR-program in Python


OCR is not an easy task indeed. That's why text CAPTCHAs still work :)

To talk only about the letter extraction and not the pattern recognition, the technique you are using to separate the letters is called Connected Component Labeling. Since you are asking for a more efficient way to do this, try to implement the two-pass algorithm that's described in this article. Another description can be found in the article Blob extraction.

EDIT: Here's the implementation for the algorithm that I have suggested:

import sysfrom PIL import Image, ImageDrawclass Region():    def __init__(self, x, y):        self._pixels = [(x, y)]        self._min_x = x        self._max_x = x        self._min_y = y        self._max_y = y    def add(self, x, y):        self._pixels.append((x, y))        self._min_x = min(self._min_x, x)        self._max_x = max(self._max_x, x)        self._min_y = min(self._min_y, y)        self._max_y = max(self._max_y, y)    def box(self):        return [(self._min_x, self._min_y), (self._max_x, self._max_y)]def find_regions(im):    width, height  = im.size    regions = {}    pixel_region = [[0 for y in range(height)] for x in range(width)]    equivalences = {}    n_regions = 0    #first pass. find regions.    for x in xrange(width):        for y in xrange(height):            #look for a black pixel            if im.getpixel((x, y)) == (0, 0, 0, 255): #BLACK                # get the region number from north or west                # or create new region                region_n = pixel_region[x-1][y] if x > 0 else 0                region_w = pixel_region[x][y-1] if y > 0 else 0                max_region = max(region_n, region_w)                if max_region > 0:                    #a neighbour already has a region                    #new region is the smallest > 0                    new_region = min(filter(lambda i: i > 0, (region_n, region_w)))                    #update equivalences                    if max_region > new_region:                        if max_region in equivalences:                            equivalences[max_region].add(new_region)                        else:                            equivalences[max_region] = set((new_region, ))                else:                    n_regions += 1                    new_region = n_regions                pixel_region[x][y] = new_region    #Scan image again, assigning all equivalent regions the same region value.    for x in xrange(width):        for y in xrange(height):                r = pixel_region[x][y]                if r > 0:                    while r in equivalences:                        r = min(equivalences[r])                    if not r in regions:                        regions[r] = Region(x, y)                    else:                        regions[r].add(x, y)    return list(regions.itervalues())def main():    im = Image.open(r"c:\users\personal\py\ocr\test.png")    regions = find_regions(im)    draw = ImageDraw.Draw(im)    for r in regions:        draw.rectangle(r.box(), outline=(255, 0, 0))    del draw     #im.show()    output = file("output.png", "wb")    im.save(output)    output.close()if __name__ == "__main__":    main()

It's not 100% perfect, but since you are doing this only for learning purposes, it may be a good starting point. With the bounding box of each character you can now use a neural network as others have suggested here.


OCR is very, very hard. Even with computer-generated characters, it's quite challenging if you don't know the font and font size in advance. Even if you're matching characters exactly, I would not call it a "beginning" programming project; it's quite subtle.

If you want to recognize scanned, or handwritten characters, that's even harder - you'll need to use advanced math, algorithms, and machine learning. There are quite a few books and thousands of articles written about this topic, so you don't need to reinvent the wheel.

I admire your effort, but I don't think you've gotten far enough to hit any of the actual difficulties yet. So far you're just randomly exploring pixels and copying them from one array to another. You haven't actually done any comparison yet, and I'm not sure the purpose of your "random walk".

  • Why random? Writing correct randomized algorithms is quite difficult. I would recommend starting with a deterministic algorithm first.
  • Why are you copying from one array to another? Why not just compare directly?

When you get the comparison, you'll have to deal with the fact that the image is not exactly the same as the "prototype", and it's not clear how you'll deal with that.

Based on the code you've written so far, though, I have an idea for you: try writing a program that finds its way through a "maze" in an image. The input would be the image, plus the start pixel and the goal pixel. The output is a path through the maze from the start to the goal. This is a much easier problem than OCR - solving mazes is something that computers are great for - but it's still fun and challenging.


Most OCR algorithms these days are based on neural network algorithms. Hopfield networks are a good place to start. Based on the Hopfield Model available here in C, I built a very basic image recognition algorithm in python similar to what you describe. I've posted the full source here. It's a toy project and not suitable for real OCR, but can get you started in the right direction.

The Hopfield model is used as an autoassociative memory to store and recall a set of bitmap images. Images are stored by calculating a corresponding weight matrix. Thereafter, starting from an arbitrary configuration, the memory will settle on exactly that stored image, which is nearest to the starting configuration in terms of Hamming distance. Thus given an incomplete or corrupted version of a stored image, the network is able to recall the corresponding original image.

A Java applet to toy with an example can be found here; the network is trained with example inputs for the digits 0-9. Draw in the box on the right, click test and see the results from the network.

Don't let the mathematical notation intimidate you, the algorithms are straightforward once you get to source code.