
Improving OCR performance on multi-paragraph scans


Tesseract is very good on clean input text (like your example) if you tinker a bit. Some suggestions:

  • Before automating, start with tesseract at the command line
  • Restrict your character set if possible (e.g. take a look at the digits config in /usr/local/share/tessdata/configs, write a similar one for upper- and lower-case English characters, and pass it as a command-line argument)
  • Only use PNG or TIFF images (TIFF for older versions), as JPG introduces artefacts
  • Upsample the image so your text is larger than the current tiny font. Tesseract likes characters more than 10 pixels high (if memory serves); it certainly performs worse with tiny characters
  • There's no need to threshold if you're bi-level already, but it won't hurt if you do, and then you can see exactly the same image that tesseract will see
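
To make the command-line and character-set suggestions concrete, here is a minimal sketch that only builds the tesseract command. It assumes a tesseract build that accepts `-c VAR=VALUE` and the `tessedit_char_whitelist` variable (the same variable the digits config sets); how strictly the whitelist is honoured can depend on the Tesseract version and engine. The file names and the helper name `build_tesseract_command` are made up for illustration.

```python
import string


def build_tesseract_command(image_path, output_base, whitelist=None):
    """Return the argument list for one tesseract command-line run."""
    command = ["tesseract", image_path, output_base]
    if whitelist:
        # -c sets a single config variable for this run only (assumed to be
        # supported by the installed tesseract).
        command += ["-c", "tessedit_char_whitelist=" + whitelist]
    return command


# Upper- and lower-case English letters only; an illustrative choice.
ENGLISH_LETTERS = string.ascii_letters

# "scan.png" and "scan_out" are hypothetical names.
command = build_tesseract_command("scan.png", "scan_out", ENGLISH_LETTERS)
print(" ".join(command))

# To actually run it once the command looks right:
#   import subprocess
#   subprocess.run(command, check=True)
```

Printing the command first keeps the "start at the command line" step: you can paste it into a shell, check the output, and only then automate it.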

I'll check back here to see if I can help more, but do join the tesseract mailing list. They're really helpful.

Sidenote - I have some patches for pytesseract which I ought to publish for getting characters & confidences & words via the API (which wasn't possible a couple of months back). Shout if they might be useful.


The first example reads the file as a buffer and then relays it to tesseract-ocr without any modification, while the second reads the file into OpenCV format, which then lets you touch up the image (changing the aspect ratio, converting to gray scale, etc.) using the cv library. The second method is very useful if you want to manipulate the image before passing it to tesseract.

BTW, I am the owner of python-tesseract. If you want to ask a question, you are always welcome to post it at http://code.google.com/p/python-tesseract

Joe