How do I make a progress bar for loading pandas DataFrame from a large xlsx file?


That will not work. pd.read_excel blocks until the whole file has been read, and it offers no way to report its progress while it runs.

It would work for read operations which you can do chunk wise, like

chunks = []
for chunk in pd.read_csv(..., chunksize=1000):
    update_progressbar()
    chunks.append(chunk)

But as far as I understand, tqdm also needs the number of chunks in advance, so for a proper progress report you would need to read the full file first.
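On that last point: tqdm also runs without a total (it then shows a running count instead of a percentage), and the number of chunks can be estimated with a cheap line count rather than a full parse. A minimal sketch, assuming one record per line (no quoted fields containing newlines) and a single header row; count_chunks and read_csv_with_progress are names made up for this example:

```python
import math


def count_chunks(path, chunksize, header_rows=1):
    """Estimate how many chunks read_csv will yield by counting lines.

    Assumption: every record sits on exactly one line.
    """
    with open(path, "rb") as f:
        n_lines = sum(1 for _ in f)
    n_rows = max(n_lines - header_rows, 0)
    return math.ceil(n_rows / chunksize)


def read_csv_with_progress(path, chunksize=1000):
    """Read a CSV chunk by chunk, advancing a tqdm bar once per chunk."""
    # Imported here so count_chunks stays usable without pandas/tqdm.
    import pandas as pd
    from tqdm import tqdm

    total = count_chunks(path, chunksize)
    chunks = []
    for chunk in tqdm(pd.read_csv(path, chunksize=chunksize), total=total):
        chunks.append(chunk)
    return pd.concat(chunks, ignore_index=True)
```

The line count still reads the file once, but it skips parsing, so it is usually much cheaper than loading the data twice.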


DISCLAIMER: This works only with xlrd engine and is not thoroughly tested!

How does it work? We monkey-patch the xlrd.xlsx.X12Sheet.own_process_stream method, which is responsible for loading sheets from a file-like stream. We supply our own stream, which updates the progress bar on every read. Each sheet has its own progress bar.

When we want the progress bar, we use the load_with_progressbar() context manager and then call pd.read_excel('<FILE.xlsx>').

import xlrd
from tqdm import tqdm
from io import RawIOBase
from contextlib import contextmanager


class progress_reader(RawIOBase):
    def __init__(self, zf, bar):
        self.bar = bar
        self.zf = zf

    def readinto(self, b):
        n = self.zf.readinto(b)
        self.bar.update(n=n)
        return n


@contextmanager
def load_with_progressbar():
    def my_get_sheet(self, zf, *other, **kwargs):
        with tqdm(total=zf._orig_file_size) as bar:
            sheet = _tmp(self, progress_reader(zf, bar), **kwargs)
        return sheet

    _tmp = xlrd.xlsx.X12Sheet.own_process_stream
    try:
        xlrd.xlsx.X12Sheet.own_process_stream = my_get_sheet
        yield
    finally:
        xlrd.xlsx.X12Sheet.own_process_stream = _tmp


import pandas as pd

with load_with_progressbar():
    df = pd.read_excel('sample2.xlsx')

print(df)

[Screenshot of the progress bar]
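The stream-wrapping idea is not tied to xlrd. Here is a minimal sketch of the same technique in isolation, using only the standard library. ProgressReader, open_with_progress and the on_progress callback are hypothetical names for this example, and how smoothly progress is reported depends on how the consumer reads the stream:

```python
import io


class ProgressReader(io.RawIOBase):
    """Wrap a raw binary stream and report the size of every read."""

    def __init__(self, raw, on_progress):
        super().__init__()
        self.raw = raw
        self.on_progress = on_progress  # called with the number of bytes read

    def readable(self):
        return True

    def readinto(self, b):
        n = self.raw.readinto(b)
        if n:  # skip EOF (0) and non-blocking "no data" (None)
            self.on_progress(n)
        return n

    def close(self):
        # Close the wrapped stream together with the wrapper.
        if not self.closed:
            self.raw.close()
        super().close()


def open_with_progress(path, on_progress):
    """Open path for binary reading; on_progress receives byte counts."""
    raw = open(path, "rb", buffering=0)  # unbuffered, so reads map to readinto
    return io.BufferedReader(ProgressReader(raw, on_progress))
```

With tqdm, on_progress could be bar.update for a bar created with total=os.path.getsize(path).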


This might help people with a similar problem; here you can get help.

For example:

from tqdm import tqdm
import pandas as pd

# One bar step per sheet, so each sheet is read exactly once.
sheets = [0, 'Sheet1', 'Shee2']
frames = []
for sheet in tqdm(sheets, ncols=100, desc="Loading data.."):
    frames.append(pd.read_excel("some_file.xlsx", sheet_name=sheet, header=None))
df, LC_data, FC_data = frames
print("------Loading is completed ------")