sorting by a custom list in pandas
I just discovered that with pandas 0.15.1 it is possible to use categorical series (http://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#categoricals).
As for your example, let's define the same data frame and sorter:
import pandas as pd

data = {
    'id': [2967, 5335, 13950, 6141, 6169],
    'Player': ['Cedric Hunter', 'Maurice Baker', 'Ratko Varda', 'Ryan Bowen', 'Adrian Caldwell'],
    'Year': [1991, 2004, 2001, 2009, 1997],
    'Age': [27, 25, 22, 34, 31],
    'Tm': ['CHH', 'VAN', 'TOT', 'OKC', 'DAL'],
    'G': [6, 7, 60, 52, 81]
}

# Create DataFrame
df = pd.DataFrame(data)

# Define the sorter
sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
          'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL', 'MIN',
          'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI', 'PHO', 'POR',
          'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN', 'WAS', 'WSB']
With the data frame and the sorter, which defines the category order, we can do the following in pandas 0.15.1:
# Convert the Tm column to a category and set the sorter as the category order.
# You could also do both steps in one line by chaining .cat.set_categories().
# Assign the result back: the inplace=True argument was removed in pandas 2.0.
df["Tm"] = df["Tm"].astype("category")
df["Tm"] = df["Tm"].cat.set_categories(sorter, ordered=True)
print(df.Tm)
Out[48]:
0    CHH
1    VAN
2    TOT
3    OKC
4    DAL
Name: Tm, dtype: category
Categories (38, object): [TOT < ATL < BOS < BRK ... UTA < VAN < WAS < WSB]

df.sort_values(["Tm"])  ## 'sort' changed to 'sort_values'
Out[49]:
   Age   G           Player   Tm  Year     id
2   22  60      Ratko Varda  TOT  2001  13950
0   27   6    Cedric Hunter  CHH  1991   2967
4   31  81  Adrian Caldwell  DAL  1997   6169
3   34  52       Ryan Bowen  OKC  2009   6141
1   25   7    Maurice Baker  VAN  2004   5335
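The two conversion steps can also be collapsed into a single call. The following is only a minimal sketch, assuming a recent pandas version; the shortened sorter and the reduced set of columns are chosen here for brevity and are not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    "Player": ["Cedric Hunter", "Maurice Baker", "Ratko Varda",
               "Ryan Bowen", "Adrian Caldwell"],
    "Tm": ["CHH", "VAN", "TOT", "OKC", "DAL"],
})

# Shortened sorter (assumption for illustration). Any team that is
# missing from this list would be turned into NaN by pd.Categorical.
sorter = ["TOT", "CHH", "DAL", "OKC", "VAN"]

# Build an ordered categorical in one step instead of
# astype("category") followed by cat.set_categories().
df["Tm"] = pd.Categorical(df["Tm"], categories=sorter, ordered=True)

# sort_values follows the category order, not the alphabetical order.
result = df.sort_values("Tm")
print(result)
```

Because the column stays categorical after sorting, later sorts on Tm keep using the same custom order without any extra work.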
Below is an example that performs a lexicographic sort on a dataframe. The idea is to create a numerical index based on the custom order, then to sort numerically on that index. A helper column is added to the dataframe for this purpose and removed afterwards.
import pandas as pd

# Create DataFrame
df = pd.DataFrame({
    'id': [2967, 5335, 13950, 6141, 6169],
    'Player': ['Cedric Hunter', 'Maurice Baker', 'Ratko Varda', 'Ryan Bowen', 'Adrian Caldwell'],
    'Year': [1991, 2004, 2001, 2009, 1997],
    'Age': [27, 25, 22, 34, 31],
    'Tm': ['CHH', 'VAN', 'TOT', 'OKC', 'DAL'],
    'G': [6, 7, 60, 52, 81]
})

# Define the sorter
sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
          'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL', 'MIN',
          'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI', 'PHO', 'POR',
          'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN', 'WAS', 'WSB']

# Create the dictionary that defines the order for sorting
sorterIndex = dict(zip(sorter, range(len(sorter))))

# Generate a rank column that will be used to sort
# the dataframe numerically
df['Tm_Rank'] = df['Tm'].map(sorterIndex)

# Here is the requested result with the lexicographic sort.
# The result may be hard to analyze, so a second sort is
# proposed next.
## NOTE:
## Newer versions of pandas use 'sort_values' instead of 'sort'
df.sort_values(['Player', 'Year', 'Tm_Rank'],
               ascending=[True, True, True], inplace=True)
# The positional axis argument was removed in pandas 2.0, so name the column
df.drop(columns='Tm_Rank', inplace=True)
print(df)

# Here is an example where 'Tm' is sorted first, so that the
# first row of the DataFrame df contains TOT as 'Tm'
df['Tm_Rank'] = df['Tm'].map(sorterIndex)
## NOTE:
## Newer versions of pandas use 'sort_values' instead of 'sort'
df.sort_values(['Tm_Rank', 'Player', 'Year'],
               ascending=[True, True, True], inplace=True)
df.drop(columns='Tm_Rank', inplace=True)
print(df)
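The add-a-column, sort, drop-the-column pattern above can also be written as one chained expression that leaves the original dataframe untouched. This is a rough sketch rather than a replacement for the code above; the shortened sorter and the names sorter_index and sorted_df are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Player": ["Cedric Hunter", "Maurice Baker", "Ratko Varda",
               "Ryan Bowen", "Adrian Caldwell"],
    "Year": [1991, 2004, 2001, 2009, 1997],
    "Tm": ["CHH", "VAN", "TOT", "OKC", "DAL"],
})

# Shortened sorter (assumption for illustration)
sorter = ["TOT", "CHH", "DAL", "OKC", "VAN"]
sorter_index = {team: rank for rank, team in enumerate(sorter)}

# assign() returns a copy with the temporary rank column, so df itself
# never carries Tm_Rank and nothing has to be cleaned up in place.
sorted_df = (
    df.assign(Tm_Rank=df["Tm"].map(sorter_index))
      .sort_values(["Tm_Rank", "Player", "Year"])
      .drop(columns="Tm_Rank")
)
print(sorted_df)
```

Teams that do not appear in the sorter map to NaN, and sort_values places those rows last by default.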
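According to the pandas 1.1.0 documentation, sort_values now accepts a key parameter, like the built-in sorted function (finally!). Unlike in sorted, the key function receives the whole column as a Series and must return a Series of the same shape. Here is how we can sort by Tm: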
import pandas as pd

data = {
    'id': [2967, 5335, 13950, 6141, 6169],
    'Player': ['Cedric Hunter', 'Maurice Baker', 'Ratko Varda', 'Ryan Bowen', 'Adrian Caldwell'],
    'Year': [1991, 2004, 2001, 2009, 1997],
    'Age': [27, 25, 22, 34, 31],
    'Tm': ['CHH', 'VAN', 'TOT', 'OKC', 'DAL'],
    'G': [6, 7, 60, 52, 81]
}

# Create DataFrame
df = pd.DataFrame(data)

def tm_sorter(column):
    """Sort function"""
    teams = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
             'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL', 'MIN',
             'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI', 'PHO', 'POR',
             'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN', 'WAS', 'WSB']
    correspondence = {team: order for order, team in enumerate(teams)}
    return column.map(correspondence)

df.sort_values(by='Tm', key=tm_sorter)
Sadly, it looks like sort_values takes only a single key function, so this works out of the box only when sorting by one column (a list of keys is not accepted). It can be circumvented with groupby:
(
    df.sort_values(['Player', 'Year'])
      .groupby(['Player', 'Year'])
      .apply(lambda x: x.sort_values(by='Tm', key=tm_sorter))
      .reset_index(drop=True)
)
If you know how to use key in sort_values with multiple columns, please tell me.
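One possible approach, sketched here under the assumption of pandas 1.1 or later: the key function is applied to each column listed in by independently, and each column arrives as a Series carrying its name, so a single function can branch on column.name. The player names, the rows, the shortened team list and the helper multi_key below are all made up for illustration.

```python
import pandas as pd

# Hypothetical rows: one player with several team entries in the same
# year, so that the custom Tm order actually decides the result.
df = pd.DataFrame({
    "Player": ["Player A", "Player A", "Player A", "Player B"],
    "Year": [2001, 2001, 2001, 2009],
    "Tm": ["DET", "TOT", "BOS", "OKC"],
})

# Shortened custom order (assumption for illustration)
teams = ["TOT", "BOS", "DET", "OKC"]
correspondence = {team: order for order, team in enumerate(teams)}

def multi_key(column):
    """Map only the Tm column to its custom rank; pass others through."""
    if column.name == "Tm":
        return column.map(correspondence)
    return column

result = df.sort_values(by=["Player", "Year", "Tm"], key=multi_key)
print(result)
```

Within Player A's 2001 rows the order is TOT, BOS, DET rather than the alphabetical BOS, DET, TOT, which suggests that the custom rank was applied to Tm while Player and Year were sorted normally.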