When does Pandas default to broadcasting Series and Dataframes?


What you are seeing is pandas' intrinsic data alignment. Pandas almost always aligns data on labels, either the row index or the column headers. Here is a quick example:

    import pandas as pd

    s1 = pd.Series([1,2,3], index=['a','b','c'])
    s2 = pd.Series([2,4,6], index=['a','b','c'])
    s1 + s2

    # Output as expected:
    a    3
    b    6
    c    9
    dtype: int64

Now, let's run a couple other examples with different indexing:

    s2 = pd.Series([2,4,6], index=['a','a','c'])
    s1 + s2

    # Output
    a    3.0
    a    5.0
    b    NaN
    c    9.0
    dtype: float64

Duplicated index labels produce a Cartesian product of the matching rows, and a label that exists in only one Series yields NaN, because NaN + value = NaN.
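To make the Cartesian product easier to see, here is a minimal sketch in which both Series carry the duplicated label. The names `left` and `right` and their values are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical data: label 'a' is duplicated on both sides,
# and label 'b' exists only in `right`.
left = pd.Series([1, 2], index=["a", "a"])
right = pd.Series([10, 20, 30], index=["a", "a", "b"])

total = left + right

# Every 'a' in `left` is paired with every 'a' in `right` (2 x 2 rows),
# and 'b' has no partner in `left`, so its result is NaN.
print(total)
```

The two `'a'` rows on each side give four `'a'` rows in the result, plus one NaN row for `'b'`.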

And, no matching indexes:

    s2 = pd.Series([2,4,6], index=['e','f','g'])
    s1 + s2

    # Output
    a   NaN
    b   NaN
    c   NaN
    e   NaN
    f   NaN
    g   NaN
    dtype: float64

So, in your first example you create a pd.Series and a pd.DataFrame whose default range indexes match, hence the comparison works as expected. In your second example, you are comparing the column headers ['cell2','cell3','cell4','cell5'] with the default range index. Nothing matches, so the result contains all 15 columns and every value is False, because a comparison with NaN returns False.


Bottom line: pandas compares each Series value to the column whose title matches that value's index label. The indices in your second example are 0..10 and the column names are cell1..4, so no column name matches, and you just get new columns appended. This is essentially treating the series as a dataframe with the index as the column titles.
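A small sketch of that label matching, using made-up data with a partial overlap between the Series index and the column names. It uses the flex method `DataFrame.gt` rather than the bare `>` operator, on the assumption that newer pandas releases may refuse the bare operator when the labels are not already aligned.

```python
import pandas as pd

# Hypothetical frame and series: only 'cell2' appears in both.
df = pd.DataFrame({"cell1": [1, 5], "cell2": [7, 2]})
ser = pd.Series([3, 3], index=["cell2", "cell3"])

# gt() aligns the Series index with the column labels (the default axis).
result = df.gt(ser)

# 'cell1' has no Series value and 'cell3' has no column,
# so both compare against NaN and come out False.
print(result)
```

Only `cell2` is really compared; the other two columns exist in the result purely because of the label union.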


You can actually see part of what pandas does in your first example if you make your series longer than the number of columns:

    >>> my_ser = pd.Series(np.random.randint(0, 100, size=20))
    >>> my_df
        0   1   2   3   4
    0   9  10  27  45  71
    1  39  61  85  97  44
    2  34  34  88  33   5
    3  36   0  75  34  69
    4  53  80  62   8  61
    5   1  81  35  91  40
    6  36  48  25  67  35
    7  30  29  33  18  17
    8  93  84   2  69  12
    9  44  66  91  85  39
    >>> my_ser
    0     92
    1     36
    2     25
    3     32
    4     42
    5     14
    6     86
    7     28
    8     20
    9     82
    10    68
    11    22
    12    99
    13    83
    14     7
    15    72
    16    61
    17    13
    18     5
    19     0
    dtype: int64
    >>> my_ser>my_df
          0      1      2      3      4      5      6      7      8      9   \
    0   True   True  False  False  False  False  False  False  False  False
    1   True  False  False  False  False  False  False  False  False  False
    2   True   True  False  False   True  False  False  False  False  False
    3   True   True  False  False  False  False  False  False  False  False
    4   True  False  False   True  False  False  False  False  False  False
    5   True  False  False  False   True  False  False  False  False  False
    6   True  False  False  False   True  False  False  False  False  False
    7   True   True  False   True   True  False  False  False  False  False
    8  False  False   True  False   True  False  False  False  False  False
    9   True  False  False  False   True  False  False  False  False  False

          10     11     12     13     14     15     16     17     18     19
    0  False  False  False  False  False  False  False  False  False  False
    1  False  False  False  False  False  False  False  False  False  False
    2  False  False  False  False  False  False  False  False  False  False
    3  False  False  False  False  False  False  False  False  False  False
    4  False  False  False  False  False  False  False  False  False  False
    5  False  False  False  False  False  False  False  False  False  False
    6  False  False  False  False  False  False  False  False  False  False
    7  False  False  False  False  False  False  False  False  False  False
    8  False  False  False  False  False  False  False  False  False  False
    9  False  False  False  False  False  False  False  False  False  False

Note what is happening: 92 is compared to the first column, so you get a single False, at 93. Then 36 is compared to the second column, and so on. If the length of your series matches the number of columns, you get the expected behavior.

But what happens when your series is longer? Pandas has to append new placeholder columns to the data frame to continue the comparison. What are they filled with? I found no documentation, but my impression is that the result is simply False there, since there is nothing to compare to. Hence you get extra columns to match the series length, all False.
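One way to look inside this step is to perform the alignment explicitly with `DataFrame.align` before comparing. The sketch below uses small made-up data, not the frames from the question. It suggests the extra columns are filled with NaN rather than False, and the False values then come from comparing against NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 3 columns (0, 1, 2) versus a Series with labels 0..4.
df = pd.DataFrame(np.arange(6).reshape(2, 3))
ser = pd.Series([10, 0, 10, 10, 10])

# Align the Series index with the column labels (outer join by default).
df_aligned, ser_aligned = df.align(ser, axis=1)

# Columns 3 and 4 did not exist in df, so they are all NaN after alignment.
print(df_aligned)

# Any comparison against NaN is False, which is why the extra columns are False.
result = df_aligned.gt(ser_aligned)
print(result)
```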

But what about your example? You do not get 11 columns, but 4+11=15! Let's make another test:

    >>> my_df = pd.DataFrame(np.random.randint(0, 100, size=100).reshape(10,10),columns=[chr(i) for i in range(10)])
    >>> my_ser = pd.Series(np.random.randint(0, 100, size=10))
    >>> (my_df>my_ser).shape
    (10, 20)

This time we got the sum of the dimensions, 10+10=20, as the number of output columns!

What was the difference? Pandas compares each series index label with the matching column title. In your first example, the index of my_ser and the column titles of my_df matched, so it compared them. If there are extra columns, the above is what happens. If all the columns have different names than the series indices, then all the columns are extra, and you get your result. That is also what happens in my example, where the titles are now characters and the indices are integers.
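If the goal was instead to compare each row with the Series entry that shares its row label, one option is the flex method `DataFrame.gt` with `axis=0`, which aligns the Series index with the row index rather than the column titles. This is a sketch on made-up data, and it assumes that row-wise comparison is what was intended.

```python
import pandas as pd

# Hypothetical data: the Series index 0..2 matches the frame's row index.
df = pd.DataFrame({"cell1": [1, 5, 9], "cell2": [7, 2, 4]})
ser = pd.Series([3, 3, 8])

# axis=0 matches Series labels against row labels, so no columns are added.
row_wise = df.gt(ser, axis=0)
print(row_wise)
```

The result keeps the original two columns because the labels are matched along the rows.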