Python OCR Module in Linux?
You can just wrap tesseract in a function:
    import os
    import tempfile
    import subprocess

    def ocr(path):
        # Reserve a unique base name; tesseract appends '.txt' to it.
        temp = tempfile.NamedTemporaryFile(delete=False)
        temp.close()

        process = subprocess.Popen(['tesseract', path, temp.name],
                                   stdout=subprocess.PIPE,
                                   stderr=subprocess.STDOUT)
        process.communicate()  # wait for tesseract to finish

        with open(temp.name + '.txt', 'r') as handle:
            contents = handle.read()

        # Clean up both the output file and the placeholder file.
        os.remove(temp.name + '.txt')
        os.remove(temp.name)

        return contents
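If you'd rather not clean up the temp files by hand, here is a rough variant of the same wrapper. It is only a sketch: it assumes Python 3 and a `tesseract` binary on your PATH, and the `lang` and `runner` parameters are my own additions for illustration (`runner` is just there so the subprocess call can be swapped out when testing).

```python
import os
import subprocess
import tempfile


def ocr(path, lang="eng", runner=subprocess.run):
    """Run the tesseract CLI on an image and return the recognized text.

    `lang` and `runner` are illustrative parameters, not part of the
    wrapper above; `runner` only exists so the call can be stubbed out.
    """
    # A temporary directory is removed automatically, even if tesseract
    # fails, so no manual os.remove() calls are needed.
    with tempfile.TemporaryDirectory() as workdir:
        base = os.path.join(workdir, "out")
        # tesseract appends ".txt" to the output base name itself.
        runner(
            ["tesseract", path, base, "-l", lang],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            check=True,  # raise CalledProcessError on a non-zero exit
        )
        with open(base + ".txt", "r", encoding="utf-8") as handle:
            return handle.read()
```

With `check=True` a missing image or a bad language code shows up as an exception instead of a confusing "file not found" on the `.txt` output.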
If you want document segmentation and more advanced features, try out OCRopus.
In addition to Blender's answer, which just runs the Tesseract executable, I would like to add that there are other OCR alternatives that can also be called as an external process.
ABBYY command-line OCR utility: http://ocr4linux.com/en:start
It is not free, so it is worth considering only if Tesseract's accuracy is not good enough for your task, or if you need more sophisticated layout analysis or export to PDF, Word, and other formats.
Update: here's a comparison of ABBYY and Tesseract accuracy: http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison
Disclaimer: I work for ABBYY
python-tesseract
http://code.google.com/p/python-tesseract
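    import cv2.cv as cv
    import tesseract

    # Set up the Tesseract engine for English with automatic page segmentation.
    api = tesseract.TessBaseAPI()
    api.Init(".", "eng", tesseract.OEM_DEFAULT)
    api.SetPageSegMode(tesseract.PSM_AUTO)

    # Load the image as grayscale and hand it to Tesseract.
    image = cv.LoadImage("eurotext.jpg", cv.CV_LOAD_IMAGE_GRAYSCALE)
    tesseract.SetCvImage(image, api)

    text = api.GetUTF8Text()     # recognized text
    conf = api.MeanTextConf()    # mean confidence of the recognition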
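The snippet above fetches `conf` but doesn't do anything with it. One possible use, sketched here, is to discard low-confidence results. This assumes `MeanTextConf()` reports an average confidence between 0 and 100, as the Tesseract API does; the helper name `accept_text` and the default threshold of 60 are made up for illustration, so tune the threshold for your own images.

```python
def accept_text(text, conf, min_conf=60):
    """Return the stripped text if the mean confidence is high enough.

    `accept_text` and `min_conf` are hypothetical names for this sketch;
    `conf` is expected to be on a 0-100 scale.
    """
    if conf < min_conf:
        return None  # treat the page as unreadable
    return text.strip()
```

You would call it as `accept_text(text, conf)` right after the two `api` calls.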