In which file is the definition of a DataFrame attribute, such as columns, located?
    >>> import pandas as pd
    >>> import inspect
    >>> inspect.getfile(pd.DataFrame)
    '/Users/.../lib/python3.7/site-packages/pandas/core/frame.py'
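To get a line number as well as a file, the same inspect module can be asked for the source lines. This is a minimal sketch: the helper name locate is hypothetical, and json.JSONDecoder is only a standard-library stand-in so the snippet runs without pandas installed. Passing pd.DataFrame instead should behave the same way, since it is an ordinary Python-level class.

```python
import inspect
import json


def locate(obj):
    """Return (source file, first line number) for a Python-level class or function."""
    path = inspect.getsourcefile(obj)
    # getsourcelines returns (list_of_lines, starting_line_number).
    _, first_line = inspect.getsourcelines(obj)
    return path, first_line


# Stand-in target; swap in pd.DataFrame when pandas is available.
path, first_line = locate(json.JSONDecoder)
print(f"{path}:{first_line}")
```

Note that this only works for objects implemented in Python; classes written in C or Cython have no source file for inspect to report.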
DataFrames are initialized via __init__:
https://github.com/pandas-dev/pandas/blob/v1.0.3/pandas/core/frame.py#L414
Specifically, when constructing a DataFrame from a dict with DataFrame.from_dict, this @classmethod instantiates the DataFrame:
https://github.com/pandas-dev/pandas/blob/v1.0.3/pandas/core/frame.py#L1169
    @classmethod
    def from_dict(cls, data, orient="columns", dtype=None, columns=None) -> "DataFrame":
        ...
        return cls(data, index=index, columns=columns, dtype=dtype)
I checked that file on GitHub and think this is where the columns attribute is set:
https://github.com/pandas-dev/pandas/blob/v1.0.3/pandas/core/frame.py#L8449
    DataFrame._setup_axes(
        ["index", "columns"],
        docs={
            "index": "The index (row labels) of the DataFrame.",
            "columns": "The column labels of the DataFrame.",
        },
    )
EDIT: Added references to def __init__ and def from_dict, and changed paths to a stable pandas version.
columns isn't defined in any single place. It is just an attribute on the DataFrame that points to an instance of another object. In particular, columns must be an instance of pandas.core.indexes.base.Index or one of its subclasses, which are defined in submodules of pandas.core.indexes but are mostly also accessible from the top-level module (e.g. pd.RangeIndex).
I am distinguishing "defined" from two possibly related ideas:
- Where the attribute is set (e.g. the line where they go, self.columns = ...).
- How the DataFrame object uses/interacts with the attribute.
Where is Index defined?
The base Index class is defined at:
https://github.com/pandas-dev/pandas/blob/v1.0.3/pandas/core/indexes/base.py#L177
Likewise, on your local installation it will be at [..]/python3.x/site-packages/pandas/core/indexes/base.py.
Where is it written that columns must be an instance of an Index?
Since Python is dynamically typed, this is kind of hard to prove or enforce. However, DataFrame inherits from NDFrame, which is its N-dimensional generalization (Series is the 1D version). At the end of the day, NDFrame stores data in an attribute called... _data, which is an instance of BlockManager. In the BlockManager definition you can see that the axes (columns is a kind of axis) are annotated as Index objects. All (orthodox) modifications to these axes are run through the function ensure_index, which converts, e.g., lists to proper indices.
How is the columns attribute set and retrieved?
(Maybe this was the main question?)
The Index object that columns refers to lives in df._data.axes[0], where df is a DataFrame instance. Custom implementations of __getattr__ and __setattr__ then ensure that accessing df.columns returns that element.
But let me back up.
The call to the _setup_axes class method alters the DataFrame class (not an instance) to have the attributes columns and index.
In particular, _setup_axes sets the columns attribute to be an AxisProperty with argument axis=0. You could maybe think of _setup_axes as a promise that each instance of the DataFrame will have labels for two axes and, further, that these axes have names.
So why do calls to df.columns return an Index rather than an AxisProperty?
A call to df.columns will:
- Enter __getattr__.
- Find columns among the entries in self._internal_names_set, so go to line 5270.
- [5270] return object.__getattribute__(self, name).
- Trigger the __get__ method of AxisProperty. Notice that the second argument here (obj) is our DataFrame instance(!).
- On line 63, access obj._data.axes, i.e. the _data.axes attribute of the dataframe.
- On line 64, return the element of obj._data.axes corresponding to self.axis. The call to _setup_axes had set self.axis=0, so we get the 0th element.
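The lookup steps above can be modelled with a toy descriptor. This is a minimal sketch in plain Python, not pandas' actual code: ToyFrame and ToyManager are hypothetical stand-ins for DataFrame and BlockManager, and the AxisProperty below only imitates the behaviour described in this answer.

```python
class AxisProperty:
    """Toy descriptor imitating the role of pandas' AxisProperty."""

    def __init__(self, axis):
        self.axis = axis

    def __get__(self, obj, objtype=None):
        if obj is None:
            # Accessed on the class itself: hand back the descriptor.
            return self
        # Accessed on an instance: obj is that instance.
        axes = obj._data.axes
        return axes[self.axis]


class ToyManager:
    """Hypothetical stand-in for BlockManager; it only holds the axes list."""

    def __init__(self, axes):
        self.axes = list(axes)


class ToyFrame:
    @classmethod
    def _setup_axes(cls, names):
        # Imitate the described outcome: given ["index", "columns"],
        # "columns" is bound to axis 0 and "index" to axis 1.
        for position, name in enumerate(names):
            setattr(cls, name, AxisProperty(len(names) - 1 - position))

    def __init__(self, columns, index):
        self._data = ToyManager([columns, index])


# Alter the class (not an instance) so it gains the two attributes.
ToyFrame._setup_axes(["index", "columns"])

frame = ToyFrame(columns=("a", "b"), index=(0, 1, 2))
class_level = ToyFrame.__dict__["columns"]  # the descriptor object itself
instance_level = frame.columns              # whatever __get__ returns
```

Here class_level is the AxisProperty stored on the class, while instance_level is the first element of frame._data.axes, which is why attribute access on an instance never exposes the descriptor.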
Setting df.columns (after initialization) works in a similar manner. When the DataFrame is initialized, the columns are coerced into an Index, added to a list of axes, and passed as an argument to initialize a BlockManager, which is then assigned to the _data attribute.
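The setting path can be sketched the same way. Again this is a toy model under stated assumptions, not the pandas implementation: ToyIndex, ensure_toy_index, ToyManager and ToyFrame are hypothetical names, and the length check is included only to illustrate why routing every assignment through one coercion function is convenient.

```python
class ToyIndex(tuple):
    """Hypothetical stand-in for Index: an immutable sequence of labels."""


def ensure_toy_index(labels):
    # Leave existing indices alone; coerce lists, ranges, etc.
    return labels if isinstance(labels, ToyIndex) else ToyIndex(labels)


class ToyManager:
    """Hypothetical stand-in for BlockManager; it owns the list of axes."""

    def __init__(self, axes):
        self.axes = [ensure_toy_index(axis) for axis in axes]

    def set_axis(self, axis, labels):
        new_labels = ensure_toy_index(labels)
        if len(new_labels) != len(self.axes[axis]):
            raise ValueError("Length mismatch")
        self.axes[axis] = new_labels


class AxisProperty:
    def __init__(self, axis):
        self.axis = axis

    def __get__(self, obj, objtype=None):
        return self if obj is None else obj._data.axes[self.axis]

    def __set__(self, obj, value):
        # Assignment is forwarded to the manager, which coerces the labels.
        obj._data.set_axis(self.axis, value)


class ToyFrame:
    columns = AxisProperty(0)
    index = AxisProperty(1)

    def __init__(self, columns, index):
        # Axes are coerced once at initialization and stored on _data.
        self._data = ToyManager([columns, index])


frame = ToyFrame(columns=["a", "b"], index=range(3))
frame.columns = ["x", "y"]  # a plain list goes in, a ToyIndex comes out
```

Because both __init__ and __set__ end up in the manager, a plain list assigned to frame.columns is stored as a ToyIndex, and a list of the wrong length is rejected.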