
HDF5 min_itemsize error: ValueError: Trying to store a string with len [##] in [y] column but this column has a limit of [##]!


UPDATE:

You have misspelled the data_columns parameter: you passed data_column, but it should be data_columns. As a result you didn't get any queryable data columns in your HDFStore, and the store packed the values into generic values_block_X columns:

In [70]: store = pd.HDFStore(r'D:\temp\.data\my_test.h5')

the misspelled parameter is silently ignored:

In [71]: store.append('no_idx_wrong_dc', df, data_column=df.columns, index=False)

In [72]: store.get_storer('no_idx_wrong_dc').table
Out[72]:
/no_idx_wrong_dc/table (Table(10,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
  "values_block_2": StringCol(itemsize=30, shape=(1,), dflt=b'', pos=3)}
  byteorder := 'little'
  chunkshape := (1213,)

is the same as the following:

In [73]: store.append('no_idx_no_dc', df, index=False)

In [74]: store.get_storer('no_idx_no_dc').table
Out[74]:
/no_idx_no_dc/table (Table(10,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
  "values_block_2": StringCol(itemsize=30, shape=(1,), dflt=b'', pos=3)}
  byteorder := 'little'
  chunkshape := (1213,)

let's spell it correctly:

In [75]: store.append('no_idx_dc', df, data_columns=df.columns, index=False)

In [76]: store.get_storer('no_idx_dc').table
Out[76]:
/no_idx_dc/table (Table(10,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "value": Float64Col(shape=(), dflt=0.0, pos=1),
  "count": Int64Col(shape=(), dflt=0, pos=2),
  "s": StringCol(itemsize=30, shape=(), dflt=b'', pos=3)}
  byteorder := 'little'
  chunkshape := (1213,)
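For completeness (this goes beyond the original session): once data_columns is spelled correctly, min_itemsize can be applied per column on the append that creates the table, so longer strings appended later will still fit. A minimal sketch with a hypothetical key name 'idx_dc_wide', assuming the same df as above:

# hypothetical key name, not part of the original session; df is the same frame as above
store.append('idx_dc_wide', df, data_columns=df.columns,
             min_itemsize={'s': 100},      # reserve 100 bytes for the 's' column
             index=False)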

OLD Answer:

AFAIK you can effectively set the min_itemsize parameter only on the first append, i.e. when the table is created.

Demo:

In [33]: df
Out[33]:
   num                 s
0   11  aaaaaaaaaaaaaaaa
1   12    bbbbbbbbbbbbbb
2   13     ccccccccccccc
3   14       ddddddddddd

In [34]: store = pd.HDFStore(r'D:\temp\.data\my_test.h5')

In [35]: store.append('test_1', df, data_columns=True)

In [36]: store.get_storer('test_1').table.description
Out[36]:
{
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "num": Int64Col(shape=(), dflt=0, pos=1),
  "s": StringCol(itemsize=16, shape=(), dflt=b'', pos=2)}

In [37]: df.loc[4] = [15, 'X'*200]

In [38]: df
Out[38]:
   num                                                  s
0   11                                   aaaaaaaaaaaaaaaa
1   12                                     bbbbbbbbbbbbbb
2   13                                      ccccccccccccc
3   14                                        ddddddddddd
4   15  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...

In [39]: store.append('test_1', df, data_columns=True)
...skipped...
ValueError: Trying to store a string with len [200] in [s] column but
this column has a limit of [16]!
Consider using min_itemsize to preset the sizes on these columns

now using min_itemsize, but still appending to the existing table in our store:

In [40]: store.append('test_1', df, data_columns=True, min_itemsize={'s':250})
...skipped...
ValueError: Trying to store a string with len [250] in [s] column but
this column has a limit of [16]!
Consider using min_itemsize to preset the sizes on these columns

The following works if we create a new object in our store:

In [41]: store.append('test_2', df, data_columns=True, min_itemsize={'s':250})

Check column sizes:

In [42]: store.get_storer('test_2').table.description
Out[42]:
{
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "num": Int64Col(shape=(), dflt=0, pos=1),
  "s": StringCol(itemsize=250, shape=(), dflt=b'', pos=2)}
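If you still have (or can reload) all of the data for the too-narrow key, one way around this limitation, not shown in the session above, is to drop the key and recreate it with a larger min_itemsize. A rough sketch, assuming df holds the complete data for 'test_1':

# assumption: df contains every row that should live under 'test_1'
store.remove('test_1')                       # drop the old table with the 16-byte 's' column
store.append('test_1', df, data_columns=True,
             min_itemsize={'s': 250})        # recreate it with room for longer strings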


I started to get this error at around the same time as updating pandas from 0.18.1 to 0.22.0 (although this could be unrelated).

I fixed the error in the existing HDF5 file by reading the dataframe back in, then re-writing the table with a larger min_itemsize for the column mentioned in the error:

filename_hdf5 = r"C:\test.h5"   # raw string so \t is not interpreted as a tab

# read the existing table back into memory
df = pd.read_hdf(filename_hdf5, 'table_name')

# re-write the key with a larger fixed width for the offending column
hdf = pd.HDFStore(filename_hdf5)
hdf.put('table_name', df, format='table', data_columns=True,
        min_itemsize={'ColumnNameMentionedInError': 10})
hdf.close()

I then updated the existing code to set min_itemsize on key creation.
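For illustration only (the file name, key, column name and width below are placeholders, not my actual code), that change looks roughly like this:

import pandas as pd

# placeholder dataframe; in practice this is the data being stored
df = pd.DataFrame({'ColumnNameMentionedInError': ['short', 'strings']})

hdf = pd.HDFStore(r"C:\test.h5")
# min_itemsize only takes effect on the append that creates the key,
# so set it generously the first time the table is written
hdf.append('table_name', df, data_columns=True,
           min_itemsize={'ColumnNameMentionedInError': 50})
hdf.close()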


Extra for Experts

The error occurs when appending rows to an existing HDF5 table whose fixed string-column width is too narrow for the new data. The fixed width was originally set based on the longest string in the column when the table was first written.

Methinks that pandas should handle this error transparently, rather than leaving what is effectively a timebomb for all future appends. This issue could take weeks or even years to surface.
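In the meantime, one defensive pattern is to compare the incoming string lengths against the width stored in the table and rewrite the key when it no longer fits. This is my own sketch, not from either answer; it assumes PyTables exposes the fixed column widths via table.coldescrs and that rewriting the whole key is acceptable:

import pandas as pd

def append_widening(store: pd.HDFStore, key: str, df: pd.DataFrame, col: str) -> None:
    """Append df under key, rewriting the table first if string column col is too narrow."""
    needed = int(df[col].astype(str).str.len().max())
    if key in store:
        # itemsize is the fixed byte width chosen when the table was first created
        current = store.get_storer(key).table.coldescrs[col].itemsize
        if needed > current:
            old = store.select(key)            # read the existing rows back ...
            store.remove(key)
            store.append(key, old, data_columns=True,
                         min_itemsize={col: needed})   # ... and recreate with a wider column
    store.append(key, df, data_columns=True, min_itemsize={col: needed})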