In which file is a specified dataframe's attribute definition, such as columns, located? pandas

In which file is a specified dataframe's attribute definition, such as columns, located?


>>> import pandas as pd
>>> import inspect
>>> inspect.getfile(pd.DataFrame)
'/Users/.../lib/python3.7/site-packages/pandas/core/frame.py'

DataFrames are initialized via __init__:
https://github.com/pandas-dev/pandas/blob/v1.0.3/pandas/core/frame.py#L414

Specifically, when constructing a DataFrame from a dict, it uses the from_dict classmethod to instantiate the DataFrame:
https://github.com/pandas-dev/pandas/blob/v1.0.3/pandas/core/frame.py#L1169

@classmethod
def from_dict(cls, data, orient="columns", dtype=None, columns=None) -> "DataFrame":
    ...
    return cls(data, index=index, columns=columns, dtype=dtype)

I checked that file on GitHub, and I think this is where the columns attribute is set:
https://github.com/pandas-dev/pandas/blob/v1.0.3/pandas/core/frame.py#L8449

DataFrame._setup_axes(
    ["index", "columns"],
    docs={
        "index": "The index (row labels) of the DataFrame.",
        "columns": "The column labels of the DataFrame.",
    },
)
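If you would rather not read the source by hand, you can ask Python which class in the inheritance chain owns an attribute and which file that class lives in. This is only a rough sketch: find_owner and source_file are hypothetical helper names, and the demo uses collections.Counter from the standard library so it runs without pandas.

```python
import inspect
from collections import Counter


def find_owner(cls, name):
    """Return the first class in cls's MRO whose own namespace defines name."""
    for klass in cls.__mro__:
        if name in vars(klass):
            return klass
    return None


def source_file(klass):
    """Return the file a class is defined in, or None for built-in classes."""
    try:
        return inspect.getsourcefile(klass)
    except (TypeError, OSError):
        return None


# most_common is defined on Counter itself.
owner = find_owner(Counter, "most_common")
print(owner.__name__, source_file(owner))

# keys is inherited from the built-in dict, which has no Python source file.
inherited = find_owner(Counter, "keys")
print(inherited.__name__, source_file(inherited))
```

Calling find_owner(pd.DataFrame, "columns") should point at DataFrame itself, since the attribute is attached to that class, but I have not checked this against every pandas version.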

EDIT: Added reference to def __init__, def from_dict and changed paths to stable pandas version


columns isn't defined in any single place. It is just an attribute on the DataFrame that points to an instance of another object. In particular, columns must be an instance of pandas.core.indexes.base.Index or one of its subclasses. The subclasses are defined in submodules of pandas.core.indexes, and most of them are also accessible from the top-level module (e.g. pd.RangeIndex).
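A quick way to see this for yourself, using only the public pandas API (the variable names are just for illustration):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])          # no column labels given
named = pd.DataFrame({"a": [1], "b": [2]})   # string column labels

# In both cases columns is an Index, possibly an instance of a subclass.
print(type(df.columns).__name__, isinstance(df.columns, pd.Index))
print(type(named.columns).__name__, isinstance(named.columns, pd.Index))

# RangeIndex is reachable from the top-level namespace and subclasses Index.
print(issubclass(pd.RangeIndex, pd.Index))
```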

I am distinguishing "defined" from two possibly-related ideas:

  1. Where the attribute is set (e.g. the line that does self.columns = ...).
  2. How the DataFrame object uses/interacts with the attribute.

Where is Index defined?

The actual path to the base Index class is at:

https://github.com/pandas-dev/pandas/blob/v1.0.3/pandas/core/indexes/base.py#L177

Likewise, on your local installation it will be at

[..]/python3.x/site-packages/pandas/core/indexes/base.py.

Where is it written that columns must be an instance of an Index?

Since Python is dynamically typed, this is kind of hard to prove/enforce. However, DataFrame inherits from NDFrame, which is its N-dimensional generalization (Series is the 1D version). At the end of the day, NDFrame stores its data in an attribute called _data, which is an instance of BlockManager. In the BlockManager source you can see that the axes (columns is a kind of axis) are annotated with the Index type. All (orthodox) modifications to these axes are run through a function, ensure_index, which converts, e.g., lists to proper Index objects.
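You can watch that coercion happen from the outside without touching any internals. The following is a small sketch using only public pandas calls; it does not call ensure_index directly, it just shows the effect described above.

```python
import pandas as pd

# A plain list is passed in at construction time...
df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
print(type(df.columns).__name__)

# ...and a plain list is assigned after construction.
df.columns = ["x", "y"]
print(type(df.columns).__name__, list(df.columns))

# The axis length is validated too: a wrong-sized list is rejected.
rejected = False
try:
    df.columns = ["only_one"]
except ValueError as exc:
    rejected = True
    print("rejected:", exc)
```

In both cases what ends up stored is an Index, not the list you handed over.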

How is the column attribute set and retrieved?

(Maybe this was the main question?)

The Index object that columns refers to lives in df._data.axes[0], where df is a DataFrame instance. Custom attribute-access machinery (__getattr__, __setattr__, and a descriptor on the class) then ensures that the call to df.columns returns that element.

But let me back up.

The call to the _setup_axes class method alters the DataFrame class (not instance) to have attributes columns and index.

In particular, _setup_axes sets the columns attribute to be an AxisProperty with argument axis=0. You could maybe think of _setup_axes as a promise that each instance of the DataFrame will have labels for two axes and, further, that these axes have names.

So why do calls to df.columns return an Index rather than an AxisProperty?

A call to df.columns will:

  1. Enter __getattr__. (Strictly speaking, Python only falls back to __getattr__ when regular lookup fails; normally object.__getattribute__ finds the class attribute directly and continues at step 4, with the same result.)
  2. Find columns among the entries in self._internal_names_set, so go to line 5270.
  3. [5270] return object.__getattribute__(self, name).
  4. This triggers the __get__ method of AxisProperty. Notice that the second argument here (obj) is our DataFrame instance(!).
  5. On line 63, access obj._data.axes, i.e. the axes attribute of the DataFrame's _data.
  6. On line 64, return the element of obj._data.axes corresponding to self.axis. The call to _setup_axes had set self.axis=0, so we get the 0th element.
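The steps above boil down to Python's descriptor protocol. Here is a toy sketch of the idea, not the pandas implementation: ToyAxisProperty, ToyManager and ToyFrame are made-up stand-ins, the axes are plain lists rather than Index objects, and the axis names are passed in storage order to keep the numbering simple.

```python
class ToyAxisProperty:
    """Descriptor that forwards attribute access to obj._data.axes[axis]."""

    def __init__(self, axis):
        self.axis = axis

    def __get__(self, obj, objtype=None):
        if obj is None:
            # Accessed on the class itself: hand back the descriptor.
            return self
        return obj._data.axes[self.axis]

    def __set__(self, obj, value):
        # Coerce to a list, loosely mimicking the coercion to an Index.
        obj._data.axes[self.axis] = list(value)


class ToyManager:
    """Stand-in for BlockManager: it only holds the list of axes."""

    def __init__(self, axes):
        self.axes = list(axes)


class ToyFrame:
    def __init__(self, columns, index):
        # Columns are stored first, mirroring the axes[0] convention above.
        self._data = ToyManager([list(columns), list(index)])

    @classmethod
    def _setup_axes(cls, names):
        # Attach one descriptor per axis name to the class, not to instances.
        for axis, name in enumerate(names):
            setattr(cls, name, ToyAxisProperty(axis))


ToyFrame._setup_axes(["columns", "index"])

frame = ToyFrame(columns=["a", "b"], index=[0, 1, 2])
print(frame.columns)                                 # goes through __get__
frame.columns = ("x", "y")                           # goes through __set__
print(frame._data.axes[0])
print(type(ToyFrame.__dict__["columns"]).__name__)   # the class holds the descriptor
```

The class attribute is the descriptor, while instance access returns whatever sits in the manager's axes list. That is why df.columns hands you an Index and not an AxisProperty.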

Setting df.columns (after initialization) works in a similar manner. When the DataFrame is initialized the columns are coerced into an Index type, added to a list of axes, and passed as an argument to init a BlockManager, which is then assigned to the _data attribute.