Python OCR Module in Linux?

python ocr

You can just wrap tesseract in a function:

    import os
    import tempfile
    import subprocess

    def ocr(path):
        # Tesseract appends '.txt' to the output base name it is given.
        temp = tempfile.NamedTemporaryFile(delete=False)

        process = subprocess.Popen(['tesseract', path, temp.name], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        process.communicate()

        with open(temp.name + '.txt', 'r') as handle:
            contents = handle.read()

        # Clean up both the OCR output and the placeholder temp file.
        os.remove(temp.name + '.txt')
        os.remove(temp.name)

        return contents
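A variant of the same idea that skips the temporary file might look like the sketch below. It assumes a Tesseract build that accepts `stdout` as the output base (the man page documents this for recent versions); the names `build_tesseract_command` and `ocr_to_string` are made up for illustration and are not part of any library.

```python
import subprocess


def build_tesseract_command(path, lang=None):
    """Build the argument list for a tesseract call that writes text to stdout."""
    # 'stdout' as the output base asks tesseract to print the recognized text
    # instead of writing <base>.txt (assumes a version that supports this).
    command = ["tesseract", path, "stdout"]
    if lang is not None:
        # -l selects the recognition language, e.g. "eng".
        command += ["-l", lang]
    return command


def ocr_to_string(path, lang=None):
    """Run tesseract on an image file and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(path, lang),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if tesseract exits with an error
    )
    return result.stdout.decode("utf-8")
```

Calling `ocr_to_string("scan.png", lang="eng")` would then return the text directly, provided the `tesseract` binary is on the PATH. Splitting out the command builder keeps the argument handling easy to check without running Tesseract.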

If you want document segmentation and more advanced features, try out OCRopus.


In addition to Blender's answer, which simply runs the Tesseract executable, I would like to add that there are other OCR tools that can also be called as external processes.

ABBYY command-line OCR utility: http://ocr4linux.com/en:start

It is not free, so it is worth considering only if Tesseract's accuracy is not good enough for your task, or if you need more sophisticated layout analysis or need to export to PDF, Word, and other formats.

Update: here's a comparison of ABBYY and Tesseract accuracy: http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison

Disclaimer: I work for ABBYY
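The general point about calling an OCR engine as an external process can be sketched without tying it to any one tool. The helper below is a hypothetical illustration: `run_ocr_command` is an invented name, and the actual command line (executable name and flags) has to come from the documentation of whichever OCR utility you use. It also assumes the tool prints its result to standard output.

```python
import shutil
import subprocess
import sys


def run_ocr_command(command, timeout=60):
    """Run an external OCR command (a list of arguments) and return its stdout."""
    executable = command[0]
    # Fail early with a clear message if the tool is not installed.
    if shutil.which(executable) is None:
        raise FileNotFoundError("%r was not found on PATH" % executable)
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,  # avoid hanging forever on a stuck process
        check=True,       # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.decode("utf-8")
```

For tools that write to an output file instead of standard output, the same pattern applies, but you would read the file afterwards, as in the Tesseract wrapper above.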


Another option is python-tesseract, a Python wrapper for the Tesseract API:

http://code.google.com/p/python-tesseract

    import cv2.cv as cv
    import tesseract

    api = tesseract.TessBaseAPI()
    api.Init(".", "eng", tesseract.OEM_DEFAULT)
    api.SetPageSegMode(tesseract.PSM_AUTO)

    image = cv.LoadImage("eurotext.jpg", cv.CV_LOAD_IMAGE_GRAYSCALE)
    tesseract.SetCvImage(image, api)
    text = api.GetUTF8Text()
    conf = api.MeanTextConf()
