
Doing OCR with R


Using the tesseract package, I put together a sample script that works. It also works for scanned PDFs.

    library(tesseract)
    library(pdftools)

    # Render each PDF page to a TIFF image (returns the image file paths)
    img_file <- pdftools::pdf_convert("F:/gowtham/A/B/invoice.pdf", format = 'tiff', dpi = 400)

    # Extract text from the rendered image(s)
    text <- ocr(img_file)
    write.table(text, "F:/gowtham/A/B/mydata.txt")

I'm new to R and programming, so please correct me if anything is wrong. Hope this helps.
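
One note on the last line of that script: write.table() treats the text as a table, so the output file gets a header line, row numbers and quotes around each string. If plain text is what you want, writeLines() may be the simpler choice. A minimal sketch, using a made-up character vector in place of the real ocr() result and a temporary file in place of the real output path (both are assumptions for illustration only):

```r
# Placeholder for the OCR result; in the script above this would be
# text <- ocr(img_file), one string per rendered page.
text <- c("first page of recognized text", "second page of recognized text")

# Temporary output file, used here instead of a real path
out_file <- tempfile(fileext = ".txt")

# writeLines() writes each element as-is on its own line,
# without the header, row names or quotes that write.table() adds
writeLines(text, out_file)

# Read the file back to check what was written
readLines(out_file)
```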


The newly released tesseract package might be worth checking out. It allows you to perform the whole process inside of R without the shell calls.

Following the procedure used in the help documentation of the tesseract package, your function would look something like this:

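    library(pdftools)
    library(tesseract)

    lapply(myfiles, function(i){
      # convert pdf to jpeg/tiff and perform tesseract OCR on the image

      # render the first page of the PDF as a bitmap
      # (numeric = TRUE requests a real-valued array, which writeTIFF accepts)
      bitmap <- pdf_render_page(i, dpi = 300, numeric = TRUE)

      # write the bitmap to a tiff file
      tiff::writeTIFF(bitmap, paste0(i, ".tiff"))

      # perform OCR on the .tiff file
      out <- ocr(paste0(i, ".tiff"))

      # delete tiff file
      file.remove(paste0(i, ".tiff"))

      # return the recognized text
      out
    })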
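
pdf_render_page() renders one page per call (page 1 by default), so the loop above only reads the first page of each file. For multi-page PDFs, something along these lines might work. It is only a sketch: the helper names page_image_paths and ocr_pdf are made up for this example, and it assumes pdftools and tesseract are installed and that myfiles holds the PDF paths.

```r
# Hypothetical helper: build one temporary TIFF path per page
page_image_paths <- function(pdf_path, n_pages, dir = tempdir()) {
  stem <- tools::file_path_sans_ext(basename(pdf_path))
  file.path(dir, sprintf("%s_%d.tiff", stem, seq_len(n_pages)))
}

# Hypothetical helper: OCR every page of one PDF, return a single string
ocr_pdf <- function(pdf_path, dpi = 300) {
  n_pages <- pdftools::pdf_info(pdf_path)$pages
  img_files <- page_image_paths(pdf_path, n_pages)

  # remove the temporary images when the function exits, even on error
  on.exit(unlink(img_files), add = TRUE)

  # render all pages to TIFF files with the names chosen above
  pdftools::pdf_convert(pdf_path, format = "tiff", filenames = img_files,
                        dpi = dpi, verbose = FALSE)

  # run OCR page by page, then join the pages with newlines
  pages <- vapply(img_files, tesseract::ocr, character(1), USE.NAMES = FALSE)
  paste(pages, collapse = "\n")
}

# Usage (not run here):
# results <- lapply(myfiles, ocr_pdf)
```

Writing the images to tempdir() keeps the intermediate files out of the folder that holds the PDFs.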