Improving OCR performance on multi-paragraph scans

Tesseract is very good on clean input text (like your example) if you tinker a bit. Some suggestions:

Before automating, start with tesseract at the command line
Restrict your character set if possible: take a look at the digits config file in /usr/local/share/tessdata/configs, write a similar one that allows only the characters you expect (e.g. upper- and lower-case English letters), and provide it as a command-line argument
Only use PNG or TIFF images (TIFF for older versions) as JPG introduces artefacts
Upsample the image so your text is larger than the current tiny font. Tesseract likes characters more than 10 pixels high (if memory serves), and it certainly performs worse with tiny characters
There is no need to threshold if the image is already bi-level, but it won't hurt if you do, and that way you can see exactly the same image that tesseract will see
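
To make the first two suggestions concrete, here is a minimal sketch that builds the tesseract command line with a restricted character set. It assumes a tesseract binary on your PATH that accepts -c VAR=VALUE and the tessedit_char_whitelist variable; whether the whitelist is honoured depends on your Tesseract version and engine. The function names build_tesseract_command and run_tesseract are made up for this illustration.

```python
import subprocess


def build_tesseract_command(image_path, output_base, whitelist=None, config_files=()):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE [options] [configfile...]."""
    cmd = ["tesseract", image_path, output_base]
    if whitelist:
        # Assumption: this build accepts -c and tessedit_char_whitelist.
        cmd += ["-c", "tessedit_char_whitelist=" + whitelist]
    # Config file names (e.g. "digits") go last on the command line.
    cmd += list(config_files)
    return cmd


def run_tesseract(image_path, output_base, **kwargs):
    """Run tesseract and return the path of the text file it writes."""
    cmd = build_tesseract_command(image_path, output_base, **kwargs)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return output_base + ".txt"
```

For English upper- and lower-case letters you could pass whitelist=string.ascii_letters (after import string). Printing the list returned by build_tesseract_command and pasting it into a shell is an easy way to try the same settings by hand before automating.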

I'll check back here to see if I can help more, but do join the tesseract mailing list; the people there are really helpful.

Sidenote - I have some patches for pytesseract which I ought to publish for getting characters & confidences & words via the API (which wasn't possible a couple of months back). Shout if they might be useful.


The first example reads the file as a buffer and then relays it to tesseract-ocr without any modification, while the second one reads the file into OpenCV format, which then allows you to touch up the image (changing the aspect ratio, converting to gray scale, etc.) using the cv library. The second method is very useful if you want to manipulate the image before passing it to tesseract.
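
As a rough illustration of the second approach (touching up the image with OpenCV before OCR), the sketch below converts a scan to gray scale, enlarges it and binarises it, then saves a PNG that can be handed to tesseract. It is not python-tesseract's own example code. It assumes the cv2 package (opencv-python) is installed; the helper names upscale_factor and preprocess_for_ocr, the 20-pixel target height, cubic interpolation and Otsu thresholding are all choices made for this example.

```python
import math


def upscale_factor(char_height_px, target_height_px=20):
    """Integer factor that brings characters up to at least target_height_px."""
    if char_height_px <= 0:
        raise ValueError("char_height_px must be positive")
    return max(1, math.ceil(target_height_px / char_height_px))


def preprocess_for_ocr(src_path, dst_path, char_height_px):
    """Gray-scale, enlarge and binarise an image, then save it (use a .png path)."""
    import cv2  # imported here so upscale_factor works without OpenCV installed

    grey = cv2.imread(src_path, cv2.IMREAD_GRAYSCALE)
    if grey is None:  # cv2.imread returns None when the file cannot be read
        raise FileNotFoundError(src_path)

    factor = upscale_factor(char_height_px)
    enlarged = cv2.resize(grey, None, fx=factor, fy=factor,
                          interpolation=cv2.INTER_CUBIC)

    # Otsu's method picks the global threshold automatically (an assumed choice).
    _, binary = cv2.threshold(enlarged, 0, 255,
                              cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    cv2.imwrite(dst_path, binary)
    return dst_path
```

Opening the saved file shows exactly what the OCR engine will be given, which makes it easier to tell whether poor results come from the image or from the recognition step.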

BTW, I am the owner of python-tesseract. If you have a question, you are always welcome to post it at http://code.google.com/p/python-tesseract

Joe
