How to obtain sheet names from XLS files without loading the whole file?

python excel pandas xlrd

You can use the xlrd library and open the workbook with the "on_demand=True" flag, so that the sheets won't be loaded automatically.

Then you can retrieve the sheet names in a similar way to pandas:

    import xlrd

    # on_demand=True defers loading each sheet until it is requested
    xls = xlrd.open_workbook(r'<path_to_your_excel_file>', on_demand=True)
    print(xls.sheet_names())  # <- remember: xlrd sheet_names is a function, not a property


I have tried xlrd, pandas, openpyxl and other such libraries, and all of them seem to take exponentially longer as the file size increases, because they read the entire file. The other solutions mentioned above, which use 'on_demand', did not work for me. The following function works for xlsx files.

    import os
    import shutil
    import zipfile

    import xmltodict
    from django.conf import settings  # assumption: settings.MEDIA_ROOT comes from a Django project


    def get_sheet_details(file_path):
        sheets = []
        file_name = os.path.splitext(os.path.split(file_path)[-1])[0]
        # Make a temporary directory with the file name
        directory_to_extract_to = os.path.join(settings.MEDIA_ROOT, file_name)
        os.mkdir(directory_to_extract_to)

        # Extract the xlsx file as it is just a zip file
        with zipfile.ZipFile(file_path, 'r') as zip_ref:
            zip_ref.extractall(directory_to_extract_to)

        # Open the workbook.xml which is very light and only has meta data, get sheets from it
        path_to_workbook = os.path.join(directory_to_extract_to, 'xl', 'workbook.xml')
        with open(path_to_workbook, 'r') as f:
            dictionary = xmltodict.parse(f.read())

        sheet_nodes = dictionary['workbook']['sheets']['sheet']
        if isinstance(sheet_nodes, dict):
            # xmltodict returns a single dict, not a list, when there is only one sheet
            sheet_nodes = [sheet_nodes]
        for sheet in sheet_nodes:
            sheets.append({
                # xmltodict prefixes XML attributes with '@' by default
                'id': sheet.get('@sheetId', sheet.get('sheetId')),
                'name': sheet.get('@name', sheet.get('name')),
            })

        # Delete the extracted files directory
        shutil.rmtree(directory_to_extract_to)
        return sheets

Since all xlsx files are basically zip archives, we extract the underlying XML data and read the sheet names from workbook.xml directly, which takes a fraction of a second compared to the library functions.
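The same idea can also be sketched without extracting anything to disk, using only the standard library. This is a minimal sketch, not the answer's original code: the function name `get_sheet_names_from_zip` is hypothetical, and it assumes the archive stores the workbook metadata at `xl/workbook.xml` with `sheet` elements nested under a `sheets` element, as in the function above.

```python
import zipfile
import xml.etree.ElementTree as ET


def _local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def get_sheet_names_from_zip(file_path):
    """Return sheet names by parsing xl/workbook.xml straight from the archive.

    Hypothetical helper: only the small workbook.xml member is read,
    and nothing is written to disk.
    """
    with zipfile.ZipFile(file_path) as archive:
        with archive.open("xl/workbook.xml") as workbook_xml:
            root = ET.parse(workbook_xml).getroot()

    names = []
    for child in root:
        if _local_name(child.tag) != "sheets":
            continue
        for sheet in child:
            if _local_name(sheet.tag) == "sheet":
                names.append(sheet.get("name"))
    return names
```

Matching on the local tag name rather than a fixed namespace URI is a design choice to avoid hard-coding a namespace; reading the member directly also avoids the temporary directory and its cleanup.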

Benchmarking (on a 6 MB xlsx file with 4 sheets):
Pandas, xlrd: 12 seconds
openpyxl: 24 seconds
Proposed method: 0.4 seconds
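The answer does not say how these timings were taken. One way to reproduce this kind of comparison is a small wall-clock helper like the sketch below; `time_call` is a hypothetical name, and taking the best of a few runs is an assumption about methodology, not what was done above.

```python
import time


def time_call(func, *args, repeat=3):
    """Return the fastest wall-clock time, in seconds, over `repeat` calls."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best
```

For example, `time_call(get_sheet_details, path)` could be compared against the same call on a function that wraps a library loader, using the same file for both.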


From my research with the standard / popular libs, this hasn't been implemented as of 2020 for xlsx / xls, but you can do this for xlsb. Either way, these solutions should give you vast performance improvements.

The code below was benchmarked on ~10 MB xlsx and xlsb files.

`xlsx` (openpyxl cannot open legacy xls files)

    from openpyxl import load_workbook

    def get_sheetnames_xlsx(filepath):
        wb = load_workbook(filepath, read_only=True, keep_links=False)
        names = wb.sheetnames
        wb.close()  # read-only workbooks keep the file open until closed
        return names

Benchmarks: ~ 14x speed improvement

    # get_sheetnames_xlsx vs pd.read_excel
    225 ms ± 6.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    3.25 s ± 140 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

`xlsb`

    from pyxlsb import open_workbook

    def get_sheetnames_xlsb(filepath):
        with open_workbook(filepath) as wb:
            return wb.sheets

Benchmarks: ~ 56x speed improvement

    # get_sheetnames_xlsb vs pd.read_excel
    96.4 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    5.36 s ± 162 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Notes:

This is a good resource: http://www.python-excel.org/
xlrd is no longer maintained as of 2020
