Converting a PDF file to Base64 to index into Elasticsearch


The encoding snippet is incorrect: it opens the PDF file in "text" mode instead of binary mode.
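To illustrate why text mode is a problem, here is a small sketch, assuming Python 3. The sample bytes are made up and only stand in for a real PDF, and latin-1 is chosen purely so that every byte decodes without raising an error. With the default newline handling, text mode translates "\r\n" and "\r" to "\n", so the data read back no longer matches what is on disk.

```python
import os
import tempfile

# Made-up stand-in for PDF content: a header plus binary bytes and CR/LF pairs.
data = b"%PDF-1.4\r\n\xe2\xe3\xcf\xd3\r\nbinary payload\r"

fd, path = tempfile.mkstemp(suffix=".pdf")
with os.fdopen(fd, "wb") as f:
    f.write(data)

# Binary mode returns the bytes exactly as stored.
with open(path, "rb") as f:
    binary_read = f.read()

# Text mode decodes the bytes and translates "\r\n" and "\r" to "\n".
with open(path, "r", encoding="latin-1") as f:
    text_read = f.read().encode("latin-1")

os.remove(path)

print(binary_read == data)  # True
print(text_read == data)    # False
```

Base64-encoding the text-mode result would therefore encode something other than the original file.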

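Depending on the file size, you could simply open the file in binary mode and Base64-encode its contents with the base64 module. Example:

    import base64

    def pdf_encode(pdf_filename):
        # "rb" opens the file in binary mode so the bytes are read unmodified.
        with open(pdf_filename, "rb") as f:
            return base64.b64encode(f.read()).decode("ascii")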

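If the file is large, you may have to break the encoding into chunks. I did not look into whether there is a module that does this, but it could be as simple as the example below:

    import base64

    def chunk_24_read(pdf_filename):
        # Yield 3 bytes (24 bits) at a time. Each full group encodes to exactly
        # 4 Base64 characters, so padding can only appear in the last chunk.
        with open(pdf_filename, "rb") as f:
            byte = f.read(3)
            while byte:
                yield byte
                byte = f.read(3)

    def pdf_encode(pdf_filename):
        encoded = ""
        length = 0
        for data in chunk_24_read(pdf_filename):
            # decode() turns the encoded bytes into text characters.
            for char in base64.b64encode(data).decode("ascii"):
                # Start a new line after every 76 characters.
                if length and length % 76 == 0:
                    encoded += "\n"
                    length = 0
                encoded += char
                length += 1
        return encoded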
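Reading three bytes at a time can be slow for big files. One possible variation, sketched here under the assumption that no line breaks are needed in the output, is to read larger chunks whose size is a multiple of 3, so that only the final chunk can carry "=" padding. The name pdf_encode_chunked and the default chunk size are my own choices for illustration, not anything prescribed.

```python
import base64

def pdf_encode_chunked(pdf_filename, chunk_size=57 * 1024):
    """Base64-encode a file piece by piece and return the result as text.

    chunk_size must be a multiple of 3 so that every chunk except the
    last one encodes without padding and the pieces can be concatenated.
    """
    if chunk_size <= 0 or chunk_size % 3 != 0:
        raise ValueError("chunk_size must be a positive multiple of 3")
    parts = []
    with open(pdf_filename, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            parts.append(base64.b64encode(chunk).decode("ascii"))
    # Joining once at the end avoids repeated string concatenation.
    return "".join(parts)
```

The result should be identical to encoding the whole file in one call, which is an easy way to check it on a small file before using it on a large one.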