Is it possible to get an Excel document's row count without loading the entire document into memory?
Adding on to what Hubro said, apparently get_highest_row()
has been deprecated. Using the max_row
and max_column
properties returns the row and column count. For example:
wb = load_workbook(path, use_iterators=True) sheet = wb.worksheets[0] row_count = sheet.max_row column_count = sheet.max_column
The solution suggested in this answer has been deprecated, and might no longer work.
Taking a look at the source code of OpenPyXL (IterableWorksheet) I've figured out how to get the column and row count from an iterator worksheet:
wb = load_workbook(path, use_iterators=True)sheet = wb.worksheets[0]row_count = sheet.get_highest_row() - 1column_count = letter_to_index(sheet.get_highest_column()) + 1
IterableWorksheet.get_highest_column
returns a string with the column letter that you can see in Excel, e.g. "A", "B", "C" etc. Therefore I've also written a function to translate the column letter to a zero based index:
def letter_to_index(letter): """Converts a column letter, e.g. "A", "B", "AA", "BC" etc. to a zero based column index. A becomes 0, B becomes 1, Z becomes 25, AA becomes 26 etc. Args: letter (str): The column index letter. Returns: The column index as an integer. """ letter = letter.upper() result = 0 for index, char in enumerate(reversed(letter)): # Get the ASCII number of the letter and subtract 64 so that A # corresponds to 1. num = ord(char) - 64 # Multiply the number with 26 to the power of `index` to get the correct # value of the letter based on it's index in the string. final_num = (26 ** index) * num result += final_num # Subtract 1 from the result to make it zero-based before returning. return result - 1
I still haven't figured out how to get the column sizes though, so I've decided to use a fixed-width font and automatically scaled columns in my application.
This might be extremely convoluted and I might be missing the obvious, but without OpenPyXL filling in the column_dimensions in Iterable Worksheets (see my comment above), the only way I can see of finding the column size without loading everything is to parse the xml directly:
from xml.etree.ElementTree import iterparsefrom openpyxl import load_workbookwb=load_workbook("/path/to/workbook.xlsx", use_iterators=True)ws=wb.worksheets[0]xml = ws._xml_sourcexml.seek(0)for _,x in iterparse(xml): name= x.tag.split("}")[-1] if name=="col": print "Column %(max)s: Width: %(width)s"%x.attrib # width = x.attrib["width"] if name=="cols": print "break before reading the rest of the file" break