HDF5 min_itemsize error: ValueError: Trying to store a string with len [##] in [y] column but this column has a limit of [##]!
UPDATE:
You have misspelled the data_columns parameter: you passed data_column, but it should be data_columns. As a result your HDF store had no indexed data columns at all, and the columns were packed into generic values_block_X columns:
In [70]: store = pd.HDFStore(r'D:\temp\.data\my_test.h5')

Misspelled parameters are silently ignored:

In [71]: store.append('no_idx_wrong_dc', df, data_column=df.columns, index=False)

In [72]: store.get_storer('no_idx_wrong_dc').table
Out[72]:
/no_idx_wrong_dc/table (Table(10,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
  "values_block_2": StringCol(itemsize=30, shape=(1,), dflt=b'', pos=3)}
  byteorder := 'little'
  chunkshape := (1213,)
This is the same as not specifying data_columns at all:

In [73]: store.append('no_idx_no_dc', df, index=False)

In [74]: store.get_storer('no_idx_no_dc').table
Out[74]:
/no_idx_no_dc/table (Table(10,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
  "values_block_2": StringCol(itemsize=30, shape=(1,), dflt=b'', pos=3)}
  byteorder := 'little'
  chunkshape := (1213,)
Now let's spell it correctly:

In [75]: store.append('no_idx_dc', df, data_columns=df.columns, index=False)

In [76]: store.get_storer('no_idx_dc').table
Out[76]:
/no_idx_dc/table (Table(10,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "value": Float64Col(shape=(), dflt=0.0, pos=1),
  "count": Int64Col(shape=(), dflt=0, pos=2),
  "s": StringCol(itemsize=30, shape=(), dflt=b'', pos=3)}
  byteorder := 'little'
  chunkshape := (1213,)
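The practical payoff of real data columns is that they can be used in where clauses when selecting. A minimal sketch, assuming a hypothetical stand-in DataFrame with the value/count/s columns implied by the schema above, written to a temporary file:

```python
import os
import tempfile
import pandas as pd

# hypothetical stand-in for the df used in the session above
df = pd.DataFrame({
    'value': [0.1, 0.2, 0.3],
    'count': [11, 12, 13],
    's': ['aa', 'bb', 'cc'],
})

path = os.path.join(tempfile.mkdtemp(), 'demo.h5')
with pd.HDFStore(path) as store:
    # correctly spelled data_columns -> every column becomes queryable
    store.append('dc', df, data_columns=df.columns.tolist(),
                 min_itemsize={'s': 30})
    # 'count' is a real data column, so it can appear in a where clause;
    # a values_block_X column could not be queried like this
    hits = store.select('dc', where='count > 11')
```

With the misspelled keyword, the same select would fail because no indexed data columns exist.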
OLD Answer:
AFAIK you can effectively set the min_itemsize parameter on the first append only.

Demo:

In [33]: df
Out[33]:
   num                 s
0   11  aaaaaaaaaaaaaaaa
1   12    bbbbbbbbbbbbbb
2   13     ccccccccccccc
3   14       ddddddddddd

In [34]: store = pd.HDFStore(r'D:\temp\.data\my_test.h5')

In [35]: store.append('test_1', df, data_columns=True)

In [36]: store.get_storer('test_1').table.description
Out[36]:
{
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "num": Int64Col(shape=(), dflt=0, pos=1),
  "s": StringCol(itemsize=16, shape=(), dflt=b'', pos=2)}

In [37]: df.loc[4] = [15, 'X'*200]

In [38]: df
Out[38]:
   num                                                  s
0   11                                   aaaaaaaaaaaaaaaa
1   12                                     bbbbbbbbbbbbbb
2   13                                      ccccccccccccc
3   14                                        ddddddddddd
4   15  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...

In [39]: store.append('test_1', df, data_columns=True)
...skipped...
ValueError: Trying to store a string with len [200] in [s] column but
this column has a limit of [16]!
Consider using min_itemsize to preset the sizes on these columns
Now using min_itemsize, but still appending to the existing store object:

In [40]: store.append('test_1', df, data_columns=True, min_itemsize={'s': 250})
...skipped...
ValueError: Trying to store a string with len [250] in [s] column but
this column has a limit of [16]!
Consider using min_itemsize to preset the sizes on these columns
The following works if we create a new object in our store:
In [41]: store.append('test_2', df, data_columns=True, min_itemsize={'s':250})
Check column sizes:
In [42]: store.get_storer('test_2').table.description
Out[42]:
{
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "num": Int64Col(shape=(), dflt=0, pos=1),
  "s": StringCol(itemsize=250, shape=(), dflt=b'', pos=2)}
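One way to avoid guessing a width on the first write is to derive min_itemsize from the data itself, with some headroom. A sketch; the 2x headroom factor and the 64-byte floor are arbitrary choices for illustration, not pandas defaults:

```python
import pandas as pd

df = pd.DataFrame({'num': [11, 12, 13, 14],
                   's': ['a' * 16, 'b' * 14, 'c' * 13, 'd' * 11]})

# longest string currently in the column
longest = int(df['s'].str.len().max())
# arbitrary headroom: at least 64 bytes, or twice the current maximum
itemsize = max(64, 2 * longest)
# then size the column on the first write, e.g.:
# store.append('test_2', df, data_columns=True, min_itemsize={'s': itemsize})
```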
I started to get this error at around the same time as updating pandas from 0.18.1 to 0.22.0 (although this could be unrelated).
I fixed the error in the existing HDF5 file by manually reading the dataframe in, then writing a new HDF5 file with a larger min_itemsize
for the column mentioned in the error:
filename_hdf5 = r"C:\test.h5"
df = pd.read_hdf(filename_hdf5, 'table_name')
hdf = pd.HDFStore(filename_hdf5)
hdf.put('table_name', df, format='table', data_columns=True,
        min_itemsize={'ColumnNameMentionedInError': 10})
hdf.close()

(Note the raw-string prefix: without it, the \t in "C:\test.h5" is a tab escape. HDFStore also needs to be qualified as pd.HDFStore unless it was imported directly.)
I then updated the existing code to set min_itemsize
on key creation.
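Setting the width at key creation can be sketched as follows (the file path and column names are made up for the example): presetting min_itemsize on the first put lets later appends with much longer strings succeed.

```python
import os
import tempfile
import pandas as pd

path = os.path.join(tempfile.mkdtemp(), 'fixed.h5')

with pd.HDFStore(path) as store:
    # preset the column width on the *first* write of the key
    first = pd.DataFrame({'num': [1], 's': ['short']})
    store.put('t', first, format='table', data_columns=True,
              min_itemsize={'s': 250})
    # a later append with a much longer string now fits within the
    # 250-byte column, instead of raising the ValueError
    store.append('t', pd.DataFrame({'num': [2], 's': ['X' * 200]}))
    out = store.select('t')
```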
Extra for Experts
The error occurs when appending rows to an existing table whose fixed string-column width is too narrow for the new data. That fixed width was set when the table was first written, based on the longest string then present in the column (or on min_itemsize, if given).
Methinks pandas should handle this error transparently, rather than leaving what is effectively a time bomb for all future appends. The issue can take weeks or even years to surface.
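One way to approximate such transparent handling in user code is to catch the ValueError and rewrite the key with a wider column. The append_growing helper below is hypothetical (not part of pandas), and the rewrite copies the entire table, so it only suits modest table sizes:

```python
import os
import tempfile
import pandas as pd

def append_growing(store, key, df, col):
    """Append df to key; if string column `col` no longer fits,
    rewrite the whole key with a wider itemsize.

    Hypothetical helper -- the except clause re-reads everything,
    widens the column to the new maximum, and rewrites the key.
    """
    try:
        store.append(key, df, data_columns=True)
    except ValueError:
        old = store.select(key) if key in store else pd.DataFrame()
        combined = pd.concat([old, df], ignore_index=True)
        width = int(combined[col].str.len().max())
        store.put(key, combined, format='table', data_columns=True,
                  min_itemsize={col: width})

path = os.path.join(tempfile.mkdtemp(), 'grow.h5')
with pd.HDFStore(path) as store:
    append_growing(store, 'k', pd.DataFrame({'s': ['ab']}), 's')
    # a plain append would raise here: the column was sized for 2 bytes
    append_growing(store, 'k', pd.DataFrame({'s': ['X' * 100]}), 's')
    out = store.select('k')
```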