Conditional mean over a Pandas DataFrame


Conditional mean is indeed a thing in pandas. You can use DataFrame.groupby():

means = data2.groupby('voteChoice').mean()

or maybe, in your case, the following would be more efficient:

means = data2.groupby('voteChoice')['socialIdeology2'].mean()

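to drill down to the mean you're looking for. (The first case will calculate means for every column it can average; in recent pandas versions you may need to pass numeric_only=True to .mean() if the DataFrame also has non-numeric columns.) This assumes that voteChoice is the name of the column you want to condition on.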
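As a rough illustration of the per-group mean, here is a minimal sketch on made-up data. Only the names data2, voteChoice and socialIdeology2 come from the question; the values and the extra age column are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; the real data2 will look different.
data2 = pd.DataFrame({
    "voteChoice": ["Clinton", "Trump", "Clinton", "Trump", "Other"],
    "socialIdeology2": [2.0, 6.0, 4.0, 5.0, 3.0],
    "age": [30, 45, 50, 60, 25],
})

# One mean of socialIdeology2 per distinct value of voteChoice.
means = data2.groupby("voteChoice")["socialIdeology2"].mean()
print(means)
```

The result is a Series indexed by voteChoice, so means['Clinton'] gives the conditional mean for that group.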


If you're only interested in the mean for a single group (e.g. Clinton voters) then you could create a boolean series that is True for members of that group, then use this to index into the rows of the DataFrame before taking the mean:

voted_for_clinton = data2['voteChoice'] == 'Clinton'
mean_for_clinton_voters = data2.loc[voted_for_clinton, 'socialIdeology2'].mean()
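To make the boolean-mask approach concrete, here is a small self-contained sketch. The sample values are invented for illustration; only the column names and the 'Clinton' label are taken from the question.

```python
import pandas as pd

# Hypothetical sample data standing in for the real data2.
data2 = pd.DataFrame({
    "voteChoice": ["Clinton", "Trump", "Clinton", "Trump", "Other"],
    "socialIdeology2": [2.0, 6.0, 4.0, 5.0, 3.0],
})

# Boolean Series: True for rows where voteChoice is 'Clinton'.
voted_for_clinton = data2["voteChoice"] == "Clinton"

# .loc selects the masked rows and a single column, then .mean() averages it.
mean_for_clinton_voters = data2.loc[voted_for_clinton, "socialIdeology2"].mean()
print(mean_for_clinton_voters)  # 3.0
```

The mask can be reused for other columns or combined with further conditions using & and |.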

If you want to get the means for multiple groups simultaneously then you can use groupby, as in Brad's answer. However, I would do it like this:

means_by_vote_choice = data2.groupby('voteChoice')['socialIdeology2'].mean()

Placing the ['socialIdeology2'] index before the .mean() means that you only compute the mean over the column you're interested in. If you instead place the indexing expression after the .mean() (i.e. data2.groupby('voteChoice').mean()['socialIdeology2']), pandas computes the means over all columns and then selects only the 'socialIdeology2' column from the result, which is less efficient.
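A quick way to convince yourself that the two orderings agree is to compare them on a toy frame. This is only a sketch: the data, the age column and the state column are hypothetical, and numeric_only=True is passed here because the made-up frame contains a text column that cannot be averaged.

```python
import pandas as pd

# Hypothetical sample data with an extra numeric and an extra text column.
data2 = pd.DataFrame({
    "voteChoice": ["Clinton", "Trump", "Clinton", "Trump", "Other"],
    "socialIdeology2": [2.0, 6.0, 4.0, 5.0, 3.0],
    "age": [30, 45, 50, 60, 25],
    "state": ["NY", "TX", "CA", "FL", "WA"],
})

# Select the column first: only socialIdeology2 is aggregated.
select_then_mean = data2.groupby("voteChoice")["socialIdeology2"].mean()

# Aggregate first: every numeric column is averaged, then one is kept.
mean_then_select = data2.groupby("voteChoice").mean(numeric_only=True)["socialIdeology2"]

# Both routes should produce the same Series.
print(select_then_mean.equals(mean_then_select))
```

On a frame this small the cost difference is negligible; it only starts to matter when the DataFrame has many columns or many rows.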

See the pandas documentation on indexing with .loc and on groupby for more info.