Python: Pandas: two columns with same values, alphabetically sorted and stored Python: Pandas: two columns with same values, alphabetically sorted and stored pandas pandas

Python: Pandas: two columns with same values, alphabetically sorted and stored


As an alternative vectorized solution, you can use numpy.minimum() and numpy.maximum():

import numpy as npdf['restart_A'] = np.minimum(df['name_A'], df['name_B'])df['restart_B'] = np.maximum(df['name_A'], df['name_B']) 

enter image description here

Or use apply method:

df[['restated_A', 'restated_B']] = df.apply(lambda r: sorted(r), axis = 1)

enter image description here


Just send df.values to a list and sort that list for each row. Then reassign the elements in the pairs accordingly.

>>> df = pd.DataFrame([{"name_A": "john", "name_B": "mac"}, {"name_A": "mac", "name_B": "john"}])>>> restated_values = [sorted(pair) for pair in df.values.tolist()]>>> restated_values[['john', 'mac'], ['john', 'mac']]>>> df['restated_A'] = [pair[0] for pair in restated_values]>>> df  name_A name_B restated_A0   john    mac       john1    mac   john       john>>> df['restated_b'] = [pair[1] for pair in restated_values]>>> df  name_A name_B restated_A restated_b0   john    mac       john        mac1    mac   john       john        mac

Or, you could do this, using a dict and a new pandas.DataFrame object:

>>> df = pd.DataFrame([{"name_A": "john", "name_B": "mac"}, {"name_A": "mac", "name_B": "john"}])>>> restated_values = [sorted(pair) for pair in df.values.tolist()]>>> restated_values[['john', 'mac'], ['john', 'mac']]>>> new_col_rows = {'restated_A': [pair[0] for pair in restated_values], 'restated_B': [pair[1] for pair in restated_values]}>>> new_col_rows{'restated_A': ['john', 'john'], 'restated_B': ['mac', 'mac']}>>> new_df = pd.DataFrame(new_col_rows)>>> new_df  restated_A restated_B0       john        mac1       john        mac>>> df = df.join(new_df)>>> df  name_A name_B restated_A restated_B0   john    mac       john        mac1    mac   john       john        mac


You can use NumPy sort() method in order to sort "in place":

In [57]: dfOut[57]:  name_A   name_B0   john      mac1    mac     john2  Trump  ClintonIn [58]: df.values.sort(axis=1)In [59]: dfOut[59]:    name_A name_B0     john    mac1     john    mac2  Clinton  Trump

Timing against 30K rows DF:

In [69]: %%timeit    ...: big = pd.concat([df.copy()] * 10**4, ignore_index=True)    ...: big.values.sort(axis=1)    ...:1 loop, best of 3: 2.25 s per loopIn [70]: %%timeit    ...: big = pd.concat([df.copy()] * 10**4, ignore_index=True)    ...: big.apply(lambda r: sorted(r), axis = 1)    ...:1 loop, best of 3: 15.9 s per loopIn [71]: %%timeit    ...: big = pd.concat([df.copy()] * 10**4, ignore_index=True)    ...: pd.DataFrame([sorted(pair) for pair in big.values.tolist()], columns=df.columns)    ...:1 loop, best of 3: 2.29 s per loop

Timing against 300K rows DF:

In [73]: %%timeit    ...: big = pd.concat([df.copy()] * 10**5, ignore_index=True)    ...: big.values.sort(axis=1)    ...:1 loop, best of 3: 23 s per loopIn [74]: %%timeit    ...: big = pd.concat([df.copy()] * 10**5, ignore_index=True)    ...: big.apply(lambda r: sorted(r), axis = 1)    ...:1 loop, best of 3: 2min 39s per loopIn [75]: %%timeit    ...: big = pd.concat([df.copy()] * 10**5, ignore_index=True)    ...: pd.DataFrame([sorted(pair) for pair in big.values.tolist()], columns=df.columns)    ...:1 loop, best of 3: 23.4 s per loop