
How to specify metadata for dask.dataframe


The available basic data types are the ones which are offered through numpy. Have a look at the documentation for a list.

Datetime types (e.g. datetime64) are a special case that needs a bit more care; additional information can be found in the pandas and numpy documentation.

The meta-argument for dask dataframes usually expects an empty pandas dataframe holding definitions for columns, indices and dtypes.

One way to construct such a DataFrame is:

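    import pandas as pd
    import numpy as np

    # Empty frame: only column names, no rows
    meta = pd.DataFrame(columns=['a', 'b', 'c'])
    meta.a = meta.a.astype(np.int64)
    # datetime64 needs an explicit unit; recent pandas versions reject the unit-less np.datetime64
    meta.b = meta.b.astype('datetime64[ns]')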

The constructor of the pandas dataframe also accepts a dtype argument, but it takes only a single dtype, which is applied to all columns, so it cannot set a different dtype for each column. Note that a dtype can be given not only by its name (a string such as 'int64') but also as the actual numpy dtype (such as np.int64).
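If per-column dtypes are needed in one step, two approaches that should work are sketched below: building the frame from empty, typed Series, or casting with a dict passed to `astype`. The column names simply reuse the ones from the example above, and the choice of `float64` for column `c` is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Option 1: build each column as an empty, typed Series.
meta = pd.DataFrame({
    "a": pd.Series(dtype=np.int64),
    "b": pd.Series(dtype="datetime64[ns]"),
    "c": pd.Series(dtype="float64"),
})

# Option 2: start from an untyped empty frame and cast all columns at once.
meta2 = pd.DataFrame(columns=["a", "b", "c"]).astype(
    {"a": np.int64, "b": "datetime64[ns]", "c": "float64"}
)

print(meta.dtypes)
print(meta2.dtypes)
```

Both variants yield an empty frame with the same column names and dtypes, so either can serve as a meta object.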

Regarding your last question, the datatype you are looking for is "object". For example:

    import pandas as pd

    class Foo:
        def __init__(self, foo):
            self.bar = foo

    df = pd.DataFrame(data=[Foo(1), Foo(2)], columns=['a'], dtype='object')
    df.a
    # 0    <__main__.Foo object at 0x00000000058AC550>
    # 1    <__main__.Foo object at 0x00000000058AC358>
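Carried over to a meta definition, a column of custom Python objects would be declared with the object dtype. The following is only a sketch; the class `Foo` and the column name `a` are taken from the example above.

```python
import pandas as pd


class Foo:
    def __init__(self, foo):
        self.bar = foo


# Empty meta frame whose single column is declared as holding Python objects.
meta = pd.DataFrame({"a": pd.Series(dtype="object")})

# Real data containing Foo instances ends up with the same dtype.
df = pd.DataFrame({"a": [Foo(1), Foo(2)]})

print(meta.dtypes)
print(df.dtypes)
```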


Both Dask.dataframe and Pandas use NumPy dtypes. In particular, you can use anything that you can pass to np.dtype. This includes the following:

  1. NumPy dtype objects, like np.float64
  2. Python type objects, like float
  3. NumPy dtype strings, like 'f8'

Here is a more extensive list taken from the NumPy docs: http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html#specifying-and-constructing-data-types
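As a quick check of the three forms listed above, this small sketch passes each of them to `np.dtype` and compares the results. The variable names are arbitrary.

```python
import numpy as np

# A NumPy dtype object, a Python type object, and two dtype strings.
specs = [np.float64, float, "f8", "float64"]
dtypes = [np.dtype(s) for s in specs]

# All of them resolve to the same 64-bit float dtype.
all_same = all(d == np.dtype("float64") for d in dtypes)
print(dtypes, all_same)

# The same holds for datetimes: the long and short spellings are equivalent.
same_datetime = np.dtype("datetime64[ns]") == np.dtype("M8[ns]")
print(same_datetime)
```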