When does Pandas default to broadcasting Series and Dataframes?
What is happening is that pandas is using intrinsic data alignment. Pandas almost always aligns data on indexes, either the row index or the column headers. Here is a quick example:

    import pandas as pd

    s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
    s2 = pd.Series([2, 4, 6], index=['a', 'b', 'c'])
    s1 + s2

    # Output as expected:
    a    3
    b    6
    c    9
    dtype: int64
Now, let's run a couple other examples with different indexing:
    s2 = pd.Series([2, 4, 6], index=['a', 'a', 'c'])
    s1 + s2

    # Output
    a    3.0
    a    5.0
    b    NaN
    c    9.0
    dtype: float64

With duplicated index labels, pandas takes the cartesian product of the matching labels. A label that exists on only one side is paired with NaN, and NaN + value = NaN.
And, no matching indexes:
    s2 = pd.Series([2, 4, 6], index=['e', 'f', 'g'])
    s1 + s2

    # Output
    a   NaN
    b   NaN
    c   NaN
    e   NaN
    f   NaN
    g   NaN
    dtype: float64
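To see the alignment step on its own, here is a minimal sketch that reuses the s1 and s2 from the last example. It calls Series.align to show the two series after they have been reindexed to the union of their labels, and Series.add with fill_value to show one way of avoiding the NaN result. The names left, right and filled are only illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([2, 4, 6], index=['e', 'f', 'g'])

# align() returns both series reindexed to the union of their labels,
# with NaN wherever a label was missing from that series.
left, right = s1.align(s2)
print(left)
print(right)

# add() with fill_value treats a label missing from one side as 0,
# so each value passes through unchanged instead of becoming NaN.
filled = s1.add(s2, fill_value=0)
print(filled)
```

Adding left and right gives the same all-NaN result as s1 + s2, because every label has a NaN on one side.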
So, in your first example you are creating a pd.Series and a pd.DataFrame with default range indexes that match, hence the comparison happens as expected. In your second example, you are comparing the column headers ['cell2','cell3','cell4','cell5'] with the default range index. No labels match, so the result contains all 15 columns and every value is False, because a comparison with NaN returns False.
Bottom line, pandas compares each series value to the column whose title matches that value's index label. The indices in your second example are 0..10, and the column names cell1..4, so no column name matches, and you just append new columns. This is essentially treating the series as a one-row dataframe with the index as the column titles.
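The label matching described above can be made visible by doing the alignment by hand. This is a sketch under assumptions: df and ser are hypothetical stand-ins for the frame and series in the question, not the original data. As far as I am aware, newer pandas versions no longer align automatically for DataFrame-vs-Series comparisons and ask you to call align first, so the sketch aligns explicitly, which should behave the same either way.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the frame and series in the question
df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=['cell2', 'cell3', 'cell4', 'cell5'])
ser = pd.Series(np.arange(11))  # default RangeIndex 0..10

# Outer-join the column labels with the series index.
# No label is shared, so the union has 4 + 11 = 15 entries.
left, right = df.align(ser, axis=1)
print(left.shape)

# Every position has NaN on one side, so every comparison is False.
result = left > right
print(result.to_numpy().any())
```

pandas may warn that the sort order of the mixed string and integer labels is undefined; that only affects the column order, not the values.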
You can actually see part of what pandas does in your first example if you make your series longer than the number of columns:

    >>> my_ser = pd.Series(np.random.randint(0, 100, size=20))
    >>> my_df
        0   1   2   3   4
    0   9  10  27  45  71
    1  39  61  85  97  44
    2  34  34  88  33   5
    3  36   0  75  34  69
    4  53  80  62   8  61
    5   1  81  35  91  40
    6  36  48  25  67  35
    7  30  29  33  18  17
    8  93  84   2  69  12
    9  44  66  91  85  39
    >>> my_ser
    0     92
    1     36
    2     25
    3     32
    4     42
    5     14
    6     86
    7     28
    8     20
    9     82
    10    68
    11    22
    12    99
    13    83
    14     7
    15    72
    16    61
    17    13
    18     5
    19     0
    dtype: int64
    >>> my_ser>my_df
           0      1      2      3      4      5      6      7      8      9  \
    0   True   True  False  False  False  False  False  False  False  False
    1   True  False  False  False  False  False  False  False  False  False
    2   True   True  False  False   True  False  False  False  False  False
    3   True   True  False  False  False  False  False  False  False  False
    4   True  False  False   True  False  False  False  False  False  False
    5   True  False  False  False   True  False  False  False  False  False
    6   True  False  False  False   True  False  False  False  False  False
    7   True   True  False   True   True  False  False  False  False  False
    8  False  False   True  False   True  False  False  False  False  False
    9   True  False  False  False   True  False  False  False  False  False

          10     11     12     13     14     15     16     17     18     19
    0  False  False  False  False  False  False  False  False  False  False
    1  False  False  False  False  False  False  False  False  False  False
    2  False  False  False  False  False  False  False  False  False  False
    3  False  False  False  False  False  False  False  False  False  False
    4  False  False  False  False  False  False  False  False  False  False
    5  False  False  False  False  False  False  False  False  False  False
    6  False  False  False  False  False  False  False  False  False  False
    7  False  False  False  False  False  False  False  False  False  False
    8  False  False  False  False  False  False  False  False  False  False
    9  False  False  False  False  False  False  False  False  False  False
Note what is happening: 92 is compared to the first column, so you get a single False at 93. Then 36 is compared to the second column, and so on. If the length of your series matches the number of columns, you get the expected behavior.
But what happens when your series is longer? Well, pandas needs to append a new fake column to the data frame to continue the comparison. What is it filled with? I found no documentation, but my impression is that it just fills in False, since there is nothing to compare to. Hence you get extra columns to match the series length, all False.
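One way to check what the extra columns hold is to reindex the frame's columns to the series' labels yourself. The sketch below uses small made-up data (df, ser and padded are illustrative names) and suggests the padding is NaN rather than False; the False values then come from the comparison, since any ordering comparison against NaN is False.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[9, 10], [39, 61]])   # columns 0 and 1
ser = pd.Series([92, 36, 25, 32])        # index 0..3, longer than the columns

# Reindex the columns to the series' labels: columns 2 and 3 do not
# exist in df, so they are created and filled with NaN.
padded = df.reindex(columns=ser.index)
print(padded)

# Compare position by position on the raw arrays.
# Wherever padded holds NaN, the comparison is False.
mask = ser.to_numpy() > padded.to_numpy()
print(mask)
```

The first two columns of mask are real comparisons (92 against column 0, 36 against column 1), and the last two are False only because of the NaN padding.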
But what about your example? You do not get 11 columns, but 4+11=15! Let's make another test:

    >>> my_df = pd.DataFrame(np.random.randint(0, 100, size=100).reshape(10,10),columns=[chr(i) for i in range(10)])
    >>> my_ser = pd.Series(np.random.randint(0, 100, size=10))
    >>> (my_df>my_ser).shape
    (10, 20)

This time we got the sum of the dimensions, 10+10=20, as the number of output columns!
What was the difference? Pandas compares each series index label with the matching column title. In your first example, the index of my_ser and the column titles of my_df matched, so it compared them. If there are extra labels, the above is what happens. If all the columns have names different from the series indices, then all the columns are extra, and you get your result. That is also what happens in my example, where the titles are now characters and the index is integers.
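If what you actually want is a position-by-position comparison regardless of labels, here is a minimal sketch of two ways to get it. The data and the names df, ser, by_position and by_label are made up for illustration, and the series is assumed to have as many values as the frame has columns.

```python
import pandas as pd

df = pd.DataFrame([[9, 10, 27], [39, 61, 85]], columns=['x', 'y', 'z'])
ser = pd.Series([20, 50, 90])  # index 0..2, shares no label with the columns

# Option 1: drop the labels and compare by position.
# A 1-D array with one value per column is compared against every row.
by_position = df < ser.to_numpy()

# Option 2: relabel the values with the column names so that
# label alignment pairs each value with the intended column.
relabeled = pd.Series(ser.to_numpy(), index=df.columns)
left, right = df.align(relabeled, axis=1)
by_label = left < right

print(by_position)
print(by_label)
```

Both options give the same 2x3 result here, with no extra columns, because the labels either are ignored or match exactly.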