Converting a Pandas GroupBy output from Series to DataFrame
g1 here is a DataFrame. It has a hierarchical index, though:
    In [19]: type(g1)
    Out[19]: pandas.core.frame.DataFrame

    In [20]: g1.index
    Out[20]:
    MultiIndex([('Alice', 'Seattle'), ('Bob', 'Seattle'), ('Mallory', 'Portland'),
           ('Mallory', 'Seattle')], dtype=object)
Perhaps you want something like this?
    In [21]: g1.add_suffix('_Count').reset_index()
    Out[21]:
          Name      City  City_Count  Name_Count
    0    Alice   Seattle           1           1
    1      Bob   Seattle           2           2
    2  Mallory  Portland           2           2
    3  Mallory   Seattle           1           1
Or something like:
    In [36]: DataFrame({'count' : df1.groupby( [ "Name", "City"] ).size()}).reset_index()
    Out[36]:
          Name      City  count
    0    Alice   Seattle      1
    1      Bob   Seattle      2
    2  Mallory  Portland      2
    3  Mallory   Seattle      1
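For readers on a recent pandas release, the same Series-to-DataFrame conversion can be sketched as below. This is a minimal illustration rather than part of the original answer: it swaps the DataFrame({...}) constructor for Series.to_frame, and the variable names sizes and counts are chosen here for the example.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
    "City": ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"],
})

# size() returns a Series indexed by a (Name, City) MultiIndex.
sizes = df1.groupby(["Name", "City"]).size()

# to_frame() wraps the Series in a one-column DataFrame (index kept);
# reset_index() then moves the index levels into ordinary columns.
counts = sizes.to_frame(name="count").reset_index()
print(counts)
```

Either route should give the same three columns (Name, City, count); to_frame just makes the intermediate step explicit.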
I want to slightly change the answer given by Wes, because version 0.16.2 requires as_index=False. If you don't set it, you get an empty dataframe.
Aggregation functions will not return the groups that you are aggregating over if they are named columns, when as_index=True, the default. The grouped columns will be the indices of the returned object.

Passing as_index=False will return the groups that you are aggregating over, if they are named columns.

Aggregating functions are ones that reduce the dimension of the returned objects, for example: mean, sum, size, count, std, var, sem, describe, first, last, nth, min, max. This is what happens when you do, for example, DataFrame.sum() and get back a Series.

nth can act as a reducer or a filter (see the pandas groupby documentation).
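A small sketch of the as_index behaviour described above, as it works in recent pandas releases. The Sales column and its values are hypothetical, added only so that the aggregation has something to sum; they are not part of the original question.

```python
import pandas as pd

# Hypothetical data: "Sales" is an illustrative column, not from the question.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Bob", "Mallory"],
    "City": ["Seattle", "Seattle", "Seattle", "Portland"],
    "Sales": [10, 20, 30, 40],
})

# Default as_index=True: the group keys become the MultiIndex of the result.
indexed = df.groupby(["Name", "City"]).sum()

# as_index=False: the group keys are returned as ordinary columns.
flat = df.groupby(["Name", "City"], as_index=False).sum()

print(indexed)
print(flat)
```

With as_index=False there is no need for a separate reset_index() call, which is often the more convenient form when the result feeds into a merge or an export.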
    import pandas as pd

    df1 = pd.DataFrame({"Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
                        "City": ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"]})
    print(df1)
    #
    #       City     Name
    #0   Seattle    Alice
    #1   Seattle      Bob
    #2  Portland  Mallory
    #3   Seattle  Mallory
    #4   Seattle      Bob
    #5  Portland  Mallory
    #
    g1 = df1.groupby(["Name", "City"], as_index=False).count()
    print(g1)
    #
    #                  City  Name
    #Name    City
    #Alice   Seattle      1     1
    #Bob     Seattle      2     2
    #Mallory Portland     2     2
    #        Seattle      1     1
    #
EDIT:
In version 0.17.1 and later you can select a subset of columns before calling count, and use reset_index with the name parameter after size:
    print(df1.groupby(["Name", "City"], as_index=False).count())
    #IndexError: list index out of range

    print(df1.groupby(["Name", "City"]).count())
    #Empty DataFrame
    #Columns: []
    #Index: [(Alice, Seattle), (Bob, Seattle), (Mallory, Portland), (Mallory, Seattle)]

    print(df1.groupby(["Name", "City"])[['Name', 'City']].count())
    #                  Name  City
    #Name    City
    #Alice   Seattle      1     1
    #Bob     Seattle      2     2
    #Mallory Portland     2     2
    #        Seattle      1     1

    print(df1.groupby(["Name", "City"]).size().reset_index(name='count'))
    #      Name      City  count
    #0    Alice   Seattle      1
    #1      Bob   Seattle      2
    #2  Mallory  Portland      2
    #3  Mallory   Seattle      1
The difference between count and size is that size counts NaN values while count does not.
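The count/size difference is easiest to see on a column with a missing value. The sketch below uses a hypothetical Score column (not from the original question) in which one of Bob's two rows is NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "Score" (with one missing value) is illustrative only.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Bob"],
    "Score": [1.0, np.nan, 3.0],
})

grouped = df.groupby("Name")["Score"]

sizes = grouped.size()    # rows per group, NaN included
counts = grouped.count()  # non-NaN values per group

print(sizes)
print(counts)
```

Here size reports 2 rows for Bob while count reports 1, because the NaN row is excluded from count.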
The key is to use the reset_index() method.
Use:
    import pandas

    df1 = pandas.DataFrame({
        "Name": ["Alice", "Bob", "Mallory", "Mallory", "Bob", "Mallory"],
        "City": ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"],
    })

    g1 = df1.groupby(["Name", "City"]).count().reset_index()
Now you have your new dataframe in g1: