sorting by a custom list in pandas


I just discovered that with pandas 0.15.1 it is possible to use categorical series (http://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#categoricals).

As for your example, let's define the same data-frame and sorter:

    import pandas as pd

    data = {
        'id': [2967, 5335, 13950, 6141, 6169],
        'Player': ['Cedric Hunter', 'Maurice Baker',
                   'Ratko Varda', 'Ryan Bowen', 'Adrian Caldwell'],
        'Year': [1991, 2004, 2001, 2009, 1997],
        'Age': [27, 25, 22, 34, 31],
        'Tm': ['CHH', 'VAN', 'TOT', 'OKC', 'DAL'],
        'G': [6, 7, 60, 52, 81]
    }

    # Create DataFrame
    df = pd.DataFrame(data)

    # Define the sorter
    sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
              'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
              'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
              'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN', 'WAS', 'WSB']

With the data-frame and the sorter, which defines the category order, we can do the following in pandas 0.15.1:

    # Convert the Tm column to category and set the sorter as the category order.
    # You could also do both lines in one by chaining .cat.set_categories().
    # Recent pandas versions no longer accept inplace=True here, so assign the result.
    df.Tm = df.Tm.astype("category")
    df.Tm = df.Tm.cat.set_categories(sorter, ordered=True)

    print(df.Tm)
    Out[48]:
    0    CHH
    1    VAN
    2    TOT
    3    OKC
    4    DAL
    Name: Tm, dtype: category
    Categories (38, object): [TOT < ATL < BOS < BRK ... UTA < VAN < WAS < WSB]

    df.sort_values(["Tm"])  ## 'sort' changed to 'sort_values'
    Out[49]:
       Age   G           Player   Tm  Year     id
    2   22  60      Ratko Varda  TOT  2001  13950
    0   27   6    Cedric Hunter  CHH  1991   2967
    4   31  81  Adrian Caldwell  DAL  1997   6169
    3   34  52       Ryan Bowen  OKC  2009   6141
    1   25   7    Maurice Baker  VAN  2004   5335
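The two-step conversion can probably be collapsed into one call to `pd.Categorical`. The following is only a minimal sketch: the shortened `sorter` and the two-column frame are assumptions made for brevity, not the full data from the question.

```python
import pandas as pd

# Reduced version of the example frame (only the columns needed here)
df = pd.DataFrame({
    'Player': ['Cedric Hunter', 'Maurice Baker', 'Ratko Varda',
               'Ryan Bowen', 'Adrian Caldwell'],
    'Tm': ['CHH', 'VAN', 'TOT', 'OKC', 'DAL'],
})

# Shortened sorter, assumed for illustration; use the full list in practice
sorter = ['TOT', 'CHH', 'DAL', 'OKC', 'VAN']

# Build the ordered categorical column in a single step
df['Tm'] = pd.Categorical(df['Tm'], categories=sorter, ordered=True)

# Sorting a categorical column follows the category order, not alphabetical order
result = df.sort_values('Tm')
print(result)
```

One thing to watch for: any value in `Tm` that is missing from `sorter` becomes NaN after the conversion, so the sorter should cover every team that appears in the data.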


Below is an example that performs a lexicographic sort on a dataframe. The idea is to create a numerical index based on the specific sort order, then to perform a numerical sort based on that index. A column is added to the dataframe to do so, and is then removed.

    import pandas as pd

    # Create DataFrame
    df = pd.DataFrame({
        'id': [2967, 5335, 13950, 6141, 6169],
        'Player': ['Cedric Hunter', 'Maurice Baker',
                   'Ratko Varda', 'Ryan Bowen', 'Adrian Caldwell'],
        'Year': [1991, 2004, 2001, 2009, 1997],
        'Age': [27, 25, 22, 34, 31],
        'Tm': ['CHH', 'VAN', 'TOT', 'OKC', 'DAL'],
        'G': [6, 7, 60, 52, 81]
    })

    # Define the sorter
    sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
              'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
              'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
              'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN',
              'WAS', 'WSB']

    # Create the dictionary that defines the order for sorting
    sorterIndex = dict(zip(sorter, range(len(sorter))))

    # Generate a rank column that will be used to sort
    # the dataframe numerically
    df['Tm_Rank'] = df['Tm'].map(sorterIndex)

    # Here is the result asked for, with the lexicographic sort.
    # The result may be hard to analyze, so a second sorting is
    # proposed next.
    ## NOTE:
    ## Newer versions of pandas use 'sort_values' instead of 'sort'
    df.sort_values(['Player', 'Year', 'Tm_Rank'],
                   ascending=[True, True, True], inplace=True)
    # Pass the column by keyword; a positional axis argument fails in recent pandas
    df.drop(columns='Tm_Rank', inplace=True)
    print(df)

    # Here is an example where 'Tm' is sorted first, so that the
    # first row of the DataFrame df contains TOT as 'Tm'
    df['Tm_Rank'] = df['Tm'].map(sorterIndex)
    ## NOTE:
    ## Newer versions of pandas use 'sort_values' instead of 'sort'
    df.sort_values(['Tm_Rank', 'Player', 'Year'],
                   ascending=[True, True, True], inplace=True)
    df.drop(columns='Tm_Rank', inplace=True)
    print(df)
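If adding and dropping a helper column feels clumsy, the same rank idea might be applied without touching the dataframe: compute the ranks as a separate array, ask NumPy for the positions that would sort them, and reorder the rows with `iloc`. This is a sketch of a variation, not the method above; the shortened `sorter` and the name `sorter_index` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Player': ['Cedric Hunter', 'Maurice Baker', 'Ratko Varda',
               'Ryan Bowen', 'Adrian Caldwell'],
    'Tm': ['CHH', 'VAN', 'TOT', 'OKC', 'DAL'],
})

# Shortened sorter, assumed for illustration
sorter = ['TOT', 'CHH', 'DAL', 'OKC', 'VAN']
sorter_index = {team: rank for rank, team in enumerate(sorter)}

# Ranks live in a standalone array, so df itself is never modified
ranks = df['Tm'].map(sorter_index).to_numpy()

# Positions that would sort the ranks; a stable sort keeps ties in original order
order = ranks.argsort(kind='stable')

result = df.iloc[order]
print(result)
```

This only handles a single sort column; for several columns the temporary rank column shown earlier is likely the simpler route.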


According to the pandas 1.1.0 documentation, it has become possible to sort with a key parameter, like in the sorted function (finally!). Here is how we can sort by Tm:

    import pandas as pd

    data = {
        'id': [2967, 5335, 13950, 6141, 6169],
        'Player': ['Cedric Hunter', 'Maurice Baker',
                   'Ratko Varda', 'Ryan Bowen', 'Adrian Caldwell'],
        'Year': [1991, 2004, 2001, 2009, 1997],
        'Age': [27, 25, 22, 34, 31],
        'Tm': ['CHH', 'VAN', 'TOT', 'OKC', 'DAL'],
        'G': [6, 7, 60, 52, 81]
    }

    # Create DataFrame
    df = pd.DataFrame(data)

    def tm_sorter(column):
        """Sort function"""
        teams = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
                 'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
                 'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
                 'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN',
                 'WAS', 'WSB']
        correspondence = {team: order for order, team in enumerate(teams)}
        return column.map(correspondence)

    df.sort_values(by='Tm', key=tm_sorter)

Sadly, it looks like key takes a single function rather than a list of per-column keys, so this is most straightforward when sorting by one column. One workaround is groupby:

    df.sort_values(['Player', 'Year']) \
      .groupby(['Player', 'Year']) \
      .apply(lambda x: x.sort_values(by='Tm', key=tm_sorter)) \
      .reset_index(drop=True)

If you know how to use key in sort_values with multiple columns, please tell me.
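One possible approach: the pandas documentation for `sort_values` says the key function is applied to each column in `by` independently, and each column arrives as a Series carrying its own name. A single key can therefore branch on `column.name`. The sketch below uses a small hypothetical frame and a shortened team list; `multi_key` and `team_rank` are illustrative names, not part of pandas.

```python
import pandas as pd

# Hypothetical data with repeated players so the second sort level matters
df = pd.DataFrame({
    'Player': ['A', 'A', 'B', 'B'],
    'Tm': ['VAN', 'TOT', 'OKC', 'CHH'],
})

# Shortened team order, assumed for illustration
teams = ['TOT', 'CHH', 'DAL', 'OKC', 'VAN']
team_rank = {team: rank for rank, team in enumerate(teams)}

def multi_key(column):
    """Map 'Tm' to its custom rank; leave every other column unchanged."""
    if column.name == 'Tm':
        return column.map(team_rank)
    return column

# The key is called once per column listed in `by`
result = df.sort_values(by=['Player', 'Tm'], key=multi_key)
print(result)
```

Columns that should keep their natural ordering are simply returned as they are, so only `Tm` gets the custom order.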