
Tesseract OCR on AWS Lambda via virtualenv


It's not working because these Python packages are only wrappers around tesseract. You have to compile tesseract on an Amazon Linux instance and copy the binaries and libraries into the zip file of the Lambda function.
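A quick way to see the "wrapper only" problem for yourself is to check whether a tesseract executable is reachable at all. This is just a minimal sketch; the helper name find_tesseract is my own and not part of any OCR package:

```python
import shutil


def find_tesseract(path=None):
    """Return the full path of the tesseract executable, or None if absent.

    `path` is an optional PATH-style string to search instead of $PATH.
    (find_tesseract is a hypothetical helper name, used only for illustration.)
    """
    return shutil.which("tesseract", path=path)
```

If find_tesseract() returns None inside the Lambda environment, the wrapper has no binary to call, which is why the binary has to be shipped in the zip.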

1) Start an EC2 instance with 64-bit Amazon Linux;

2) Install dependencies:

    sudo yum install gcc gcc-c++ make
    sudo yum install autoconf aclocal automake
    sudo yum install libtool
    sudo yum install libjpeg-devel libpng-devel libtiff-devel zlib-devel

3) Compile and install leptonica:

    cd ~
    mkdir leptonica
    cd leptonica
    wget http://www.leptonica.com/source/leptonica-1.73.tar.gz
    tar -zxvf leptonica-1.73.tar.gz
    cd leptonica-1.73
    ./configure
    make
    sudo make install

4) Compile and install tesseract

    cd ~
    mkdir tesseract
    cd tesseract
    wget https://github.com/tesseract-ocr/tesseract/archive/3.04.01.tar.gz
    tar -zxvf 3.04.01.tar.gz
    cd tesseract-3.04.01
    ./autogen.sh
    ./configure
    make
    sudo make install

5) Download language traineddata to tessdata

    cd /usr/local/share/tessdata
    wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/eng.traineddata
    export TESSDATA_PREFIX=/usr/local/share/
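If tesseract complains about missing language data, it can help to check that the traineddata files really sit where TESSDATA_PREFIX points. Here is a small sketch, assuming the layout used in step 5 (the prefix is the parent of the tessdata directory); the function name missing_languages is made up for this example:

```python
import os


def missing_languages(tessdata_prefix, languages):
    """Return the language codes that have no <lang>.traineddata file.

    Assumes the tesseract 3 layout from step 5: the files live in
    <tessdata_prefix>/tessdata/. (Hypothetical helper, for illustration.)
    """
    tessdata_dir = os.path.join(tessdata_prefix, "tessdata")
    return [
        lang
        for lang in languages
        if not os.path.isfile(os.path.join(tessdata_dir, lang + ".traineddata"))
    ]
```

For example, missing_languages("/usr/local/share/", ["eng"]) should return an empty list after step 5.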

At this point you should be able to use tesseract on this EC2 instance. To use it in a Lambda function, you need to copy some files from this instance into the zip file you upload to Lambda. The commands below produce a zip file with all the files you need.

6) Zip all the stuff you need to run tesseract on lambda

    cd ~
    mkdir tesseract-lambda
    cd tesseract-lambda
    cp /usr/local/bin/tesseract .
    mkdir lib
    cd lib
    cp /usr/local/lib/libtesseract.so.3 .
    cp /usr/local/lib/liblept.so.5 .
    cp /usr/lib64/libpng12.so.0 .
    cd ..
    mkdir tessdata
    cd tessdata
    cp /usr/local/share/tessdata/eng.traineddata .
    cd ..
    cd ..
    zip -r tesseract-lambda.zip tesseract-lambda
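To double-check the archive before uploading, something like the following could list any expected file that is missing. It is only a sketch: the names REQUIRED_FILES and missing_from_zip are my own, and the default prefix assumes the entries are stored under tesseract-lambda/ as the zip command in step 6 produces them.

```python
import zipfile

# Files copied in step 6, relative to the tesseract-lambda directory.
REQUIRED_FILES = (
    "tesseract",
    "lib/libtesseract.so.3",
    "lib/liblept.so.5",
    "lib/libpng12.so.0",
    "tessdata/eng.traineddata",
)


def missing_from_zip(zip_file, prefix="tesseract-lambda/"):
    """Return the required files that are not present in the archive.

    `zip_file` may be a path or a file-like object. `prefix` is the folder
    inside the archive that holds the files (an assumption; pass "" if the
    files were zipped at the root).
    """
    with zipfile.ZipFile(zip_file) as archive:
        names = set(archive.namelist())
    return [name for name in REQUIRED_FILES if prefix + name not in names]
```

An empty list from missing_from_zip("tesseract-lambda.zip") means all five files made it into the archive.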

The tesseract-lambda.zip file has everything Lambda needs to run tesseract. The last thing to do is add the Lambda function at the root of the zip file and upload it to Lambda. Here is an example that I have not tested, but it should work.

7) Create a file named main.py, write a Lambda function like the one below, and add it to the root of tesseract-lambda.zip:

    from __future__ import print_function

    import urllib
    import boto3
    import os
    import subprocess

    SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
    LIB_DIR = os.path.join(SCRIPT_DIR, 'lib')

    s3 = boto3.client('s3')

    def lambda_handler(event, context):
        # Get the bucket and object from the event
        # (urllib.unquote_plus is the Python 2 API; on Python 3 use urllib.parse.unquote_plus)
        bucket = event['Records'][0]['s3']['bucket']['name']
        key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key']).decode('utf8')

        try:
            print("Bucket: " + bucket)
            print("Key: " + key)

            imgfilepath = '/tmp/image.png'
            # tesseract appends '.txt' to the output base name it is given
            outputbase = '/tmp/result'
            jsonfilepath = outputbase + '.txt'
            exportfile = key + '.txt'

            print("Export: " + exportfile)

            s3.download_file(bucket, key, imgfilepath)

            command = 'LD_LIBRARY_PATH={} TESSDATA_PREFIX={} {}/tesseract {} {}'.format(
                LIB_DIR,
                SCRIPT_DIR,
                SCRIPT_DIR,
                imgfilepath,
                outputbase,
            )

            try:
                output = subprocess.check_output(command, shell=True)
                print(output)
                s3.upload_file(jsonfilepath, bucket, exportfile)
            except subprocess.CalledProcessError as e:
                print(e.output)
        except Exception as e:
            print(e)
            print('Error processing object {} from bucket {}.'.format(key, bucket))
            raise e
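If you are on a Python 3 runtime, the key decoding and the tesseract call could be sketched roughly as follows. This is not part of the original handler: the helper names object_key_from_event and tesseract_invocation are hypothetical, and passing an argument list plus an env dict instead of a shell string is my own choice to avoid quoting problems.

```python
import os
from urllib.parse import unquote_plus


def object_key_from_event(event):
    """Return the URL-decoded key of the first S3 record in the event."""
    return unquote_plus(event["Records"][0]["s3"]["object"]["key"])


def tesseract_invocation(script_dir, image_path, output_base):
    """Return (argv, env) for subprocess.check_output(argv, env=env).

    Mirrors the shell command in the handler above: LD_LIBRARY_PATH points
    at the bundled lib directory and TESSDATA_PREFIX at the bundle root.
    """
    env = dict(os.environ)
    env["LD_LIBRARY_PATH"] = os.path.join(script_dir, "lib")
    env["TESSDATA_PREFIX"] = script_dir
    argv = [os.path.join(script_dir, "tesseract"), image_path, output_base]
    return argv, env
```

You would then call subprocess.check_output(argv, env=env); tesseract writes its text result to output_base plus ".txt".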

When creating the AWS Lambda function in the AWS Console, upload the zip file and set the Handler to main.lambda_handler. This tells AWS Lambda to look for the main.py file inside the zip and to call the function lambda_handler.
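To make the Handler setting less magical: the string is split into a module name and a function name. The snippet below only illustrates that idea and is not Lambda's actual bootstrap code; resolve_handler is a name I made up.

```python
import importlib


def resolve_handler(handler):
    """Split 'module.function' and return the function object.

    Illustration only: shows how a string such as 'main.lambda_handler'
    maps to the function lambda_handler in main.py.
    """
    module_name, function_name = handler.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, function_name)
```

If resolve_handler("main.lambda_handler") fails locally with an ImportError, the same Handler value will most likely fail on Lambda too, usually because main.py is not at the root of the zip.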

IMPORTANT

From time to time things change in AWS Lambda's environment. For example, the current image for the Lambda environment is amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2 (it might not be this one when you read this answer). If tesseract starts returning a segmentation fault, run "ldd tesseract" inside the Lambda function and check the output to see which libraries are needed (currently libtesseract.so.3, liblept.so.5 and libpng12.so.0).
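Reading the ldd output by eye works, but a small parser can pull out the unresolved libraries for you. A minimal sketch, assuming the usual ldd line format "libname => not found"; missing_libraries is a hypothetical helper:

```python
def missing_libraries(ldd_output):
    """Return the library names that ldd reported as 'not found'.

    Expects the text printed by `ldd <binary>`, one dependency per line.
    """
    missing = []
    for line in ldd_output.splitlines():
        if "=>" in line and line.strip().endswith("not found"):
            missing.append(line.split("=>")[0].strip())
    return missing
```

You could feed it the result of subprocess.check_output(["ldd", "tesseract"]) decoded to text, then copy each listed library into the lib directory of the zip.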

Thanks for the comment, SergioArcos.


Adaptations for tesseract 4:

Tesseract offers many improvements in version 4, thanks to a neural network. I've tried it with some scans and the improvements are quite substantial. Plus, the whole package was 25% smaller in my case. The planned release date of version 4 is the first half of 2018.

The build steps are similar to those for tesseract 3, with some tweaks, which is why I wanted to share them in full. I also made a GitHub repo with ready-made binary files (most of it is based on Jose's post above, which was very helpful), plus a blog post on how to use it as a processing step after a Raspberry Pi 3 powered scanner step.

To compile the tesseract 4 binaries, do these steps on a fresh 64-bit AWS AMI instance:

Compile leptonica

    cd ~
    sudo yum install clang -y
    sudo yum install libpng-devel libtiff-devel zlib-devel libwebp-devel libjpeg-turbo-devel -y
    wget https://github.com/DanBloomberg/leptonica/releases/download/1.75.1/leptonica-1.75.1.tar.gz
    tar -xzvf leptonica-1.75.1.tar.gz
    cd leptonica-1.75.1
    ./configure && make && sudo make install

Compile autoconf-archive

Unfortunately, for the past few weeks tesseract has required autoconf-archive, which is not available for Amazon AMIs, so you need to compile it yourself:

    cd ~
    wget http://mirror.switch.ch/ftp/mirror/gnu/autoconf-archive/autoconf-archive-2017.09.28.tar.xz
    tar -xvf autoconf-archive-2017.09.28.tar.xz
    cd autoconf-archive-2017.09.28
    ./configure && make && sudo make install
    sudo cp m4/* /usr/share/aclocal/

Compile tesseract

    cd ~
    sudo yum install git-core libtool pkgconfig -y
    git clone --depth 1 https://github.com/tesseract-ocr/tesseract.git tesseract-ocr
    cd tesseract-ocr
    export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
    ./autogen.sh
    ./configure
    make
    sudo make install

Get all needed files and zip

    cd ~
    mkdir tesseract-standalone
    cd tesseract-standalone
    cp /usr/local/bin/tesseract .
    mkdir lib
    cp /usr/local/lib/libtesseract.so.4 lib/
    cp /usr/local/lib/liblept.so.5 lib/
    cp /usr/lib64/libjpeg.so.62 lib/
    cp /usr/lib64/libwebp.so.4 lib/
    cp /usr/lib64/libstdc++.so.6 lib/
    mkdir tessdata
    cd tessdata
    wget https://github.com/tesseract-ocr/tessdata_fast/raw/master/osd.traineddata
    wget https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata
    # additionally any other language you want to use, e.g. `deu` for Deutsch
    mkdir configs
    cp /usr/local/share/tessdata/configs/pdf configs/
    cp /usr/local/share/tessdata/pdf.ttf .
    cd ..
    zip -r ~/tesseract-standalone.zip *


Generate zip files using shell scripts to compile Tesseract 4 for Python 3.7

I struggled with this issue for a few days, trying to get Tesseract 4 to work on a Python 3.7 Lambda function. Finally I found this article and GitHub repo, which describe how to generate zip files for tesseract, pytesseract, opencv, and pillow using shell scripts that build the necessary .zip files with Docker images on EC2! This process takes less than 20 minutes using these steps and is reliably reproducible.

Summarized Steps:

Start an Amazon Linux EC2 instance (a t2.micro will do just fine) and run:

    sudo yum update
    sudo yum install git-core -y
    sudo yum install docker -y
    sudo service docker start
    sudo usermod -a -G docker ec2-user #allows ec2-user to call docker

After running the 5th command you will need to log out and log back in for the change to take effect.

    git clone https://github.com/amtam0/lambda-tesseract-api.git
    cd lambda-tesseract-api/
    bash build_tesseract4.sh #takes a few minutes
    bash build_py37_pkgs.sh

This will generate .zip files for tesseract, pytesseract, pillow, and opencv. To use them with Lambda, you need to complete two more steps.

  1. Create Lambda layers, one for each zip file, and attach the layers to your Lambda function.
  2. Create an Environment Variable. Key : PYTHONPATH and Value : /opt/
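
Once the layers are attached and PYTHONPATH is set, a quick sanity check inside the function can confirm that the packages are importable. This is only a sketch; missing_modules is a name I made up, and pytesseract, PIL and cv2 are the import names of pytesseract, pillow and opencv.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on sys.path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def lambda_handler(event, context):
    # An empty list means every layer package can be imported.
    return {"missing": missing_modules(["pytesseract", "PIL", "cv2"])}
```

If a name shows up under "missing", the corresponding layer is probably not attached or PYTHONPATH does not include /opt/.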

(Note: you will probably need to increase your Memory allocation and Timeout)

At this point you are all set to upload your code and start using Tesseract on AWS Lambda! Refer back to the Medium article for a test script.