How can I subclass a Pandas DataFrame?
There is now an official guide on how to subclass Pandas data structures, which includes DataFrame as well as Series.
The guide is available here: https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extending-subclassing-pandas
The guide mentions this subclassed DataFrame from the GeoPandas project as a good example: https://github.com/geopandas/geopandas/blob/master/geopandas/geodataframe.py
As in HYRY's answer, it seems there are two things you're trying to accomplish:
- When calling methods on an instance of your class, return instances of the correct type (your type). For this, you can just add the _constructor property, which should return your type.
- Adding attributes which will be attached to copies of your object. To do this, you need to store the names of these attributes in a list, as the special _metadata attribute.
Here's an example:
from pandas import DataFrame

class SubclassedDataFrame(DataFrame):
    _metadata = ['added_property']
    added_property = 1  # This will be passed to copies

    @property
    def _constructor(self):
        return SubclassedDataFrame
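To see both behaviours together, here is a minimal usage sketch of the class above. The column names and the value "custom value" are made up for illustration, and the exact set of operations that propagate _metadata attributes may vary between pandas versions, so treat this as a quick check rather than a guarantee.

```python
import pandas as pd


class SubclassedDataFrame(pd.DataFrame):
    # Names listed in _metadata are carried over to derived objects by pandas.
    _metadata = ["added_property"]
    added_property = 1  # class-level default

    @property
    def _constructor(self):
        # Makes pandas build results as SubclassedDataFrame, not DataFrame.
        return SubclassedDataFrame


# Illustrative data; the column names are arbitrary.
df = SubclassedDataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
df.added_property = "custom value"

subset = df[["A", "C"]]  # column selection goes through _constructor
copied = df.copy()       # copy() carries the _metadata attributes along

print(type(subset).__name__)
print(subset.added_property, copied.added_property)
```

If the subclass is set up correctly, both results should report the subclass type and carry the value assigned to the original frame.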
For Requirement 1, just define _constructor:
import pandas as pd
import numpy as np

class MyDF(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDF

mydf = MyDF(np.random.randn(3, 4), columns=['A', 'B', 'C', 'D'])
print(type(mydf))

mydf_sub = mydf[['A', 'C']]
print(type(mydf_sub))
I think there is no simple solution for Requirement 2. I think you need to define __init__, copy, or do something in _constructor, for example:
import pandas as pd
import numpy as np

class MyDF(pd.DataFrame):
    _attributes_ = "myattr1,myattr2"

    def __init__(self, *args, **kw):
        super(MyDF, self).__init__(*args, **kw)
        # When built from another MyDF, carry its custom attributes over.
        if len(args) == 1 and isinstance(args[0], MyDF):
            args[0]._copy_attrs(self)

    def _copy_attrs(self, df):
        # Copy each named attribute from self onto df (None if unset).
        for attr in self._attributes_.split(","):
            df.__dict__[attr] = getattr(self, attr, None)

    @property
    def _constructor(self):
        # Return a factory that builds a MyDF and copies the attributes onto it.
        def f(*args, **kw):
            df = MyDF(*args, **kw)
            self._copy_attrs(df)
            return df
        return f

mydf = MyDF(np.random.randn(3, 4), columns=['A', 'B', 'C', 'D'])
print(type(mydf))

mydf_sub = mydf[['A', 'C']]
print(type(mydf_sub))

mydf.myattr1 = 1
mydf_cp1 = MyDF(mydf)
mydf_cp2 = mydf.copy()
print(mydf_cp1.myattr1, mydf_cp2.myattr1)