Finding n lowest values for each row in a dataframe
Use .argsort to get the indices that would sort the underlying array along each row. Slice those indices to the first N, then use them to pull both the values and the matching column labels. We'll create a MultiIndex on the columns so we can store both the column headers and the values in the same DataFrame. The top level separates values from column labels, and the second level is the nth-lowest indicator (0 is the smallest).
Example:
    import pandas as pd
    import numpy as np

    np.random.seed(1)
    df = pd.DataFrame(np.random.randint(1, 100000, (1739, 26)))
    df.columns = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')

    N = 7  # 150 in your case

    # Column positions of the N smallest values in each row, in ascending order
    idx = np.argsort(df.values, 1)[:, 0:N]

    # Index the column labels with idx as well, so each value keeps its header
    pd.concat([pd.DataFrame(np.take_along_axis(df.to_numpy(), idx, axis=1), index=df.index),
               pd.DataFrame(df.columns.to_numpy()[idx], index=df.index)],
              keys=['Value', 'Columns'], axis=1)
Output:
      Value                                           Columns
          0      1      2      3      4      5      6       0  1  2  3  4  5  6
    0  5193   7752   8445  19947  20610  21441  21759       C  K  U  V  I  G  P
    1   432   3607  16278  17138  19434  26104  33879       R  J  W  C  B  D  G
    2    16   1047   1845   9553  12314  13784  19432       K  S  E  F  M  O  U
    3   244   5272  10836  13682  29237  33230  34448       K  Q  A  S  X  W  G
    4  9765  11275  13160  22808  30870  33484  42760       K  T  L  U  C  D  M
    5  2034   2179   4980   7184  14826  15238  22807       Z  H  F  Q  L  R  X
    ...
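To see what the two indexing steps do, here is a minimal sketch on a tiny frame. The data and the name n are made up for illustration, chosen so the result can be checked by hand; the calls are the same ones used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, small enough to verify by eye.
df = pd.DataFrame(
    [[5, 1, 9, 3],
     [8, 7, 2, 6]],
    columns=list("ABCD"),
)
n = 2

# Column positions that sort each row ascending, cut to the first n.
idx = np.argsort(df.to_numpy(), axis=1)[:, :n]

# Use the same positions to pick the values and the column labels.
values = np.take_along_axis(df.to_numpy(), idx, axis=1)
labels = df.columns.to_numpy()[idx]

print(values)
# [[1 3]
#  [2 6]]
print(labels)
# [['B' 'D']
#  ['C' 'D']]
```

Row 0 is [5, 1, 9, 3], so its two smallest values are 1 (column B) and 3 (column D), which matches the first row of both arrays.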
You can use heapq.nsmallest to find the n smallest numbers in a list. This can be applied to each row of a dataframe using .apply:
    import pandas as pd
    import numpy as np
    import heapq

    df = pd.DataFrame(np.random.randn(1000, 1000))

    # Find the 150 smallest values in each row
    smallest = df.apply(lambda x: heapq.nsmallest(150, x), axis=1)
Each row of smallest is now a list of the 150 smallest values in the corresponding row of df, in ascending order.
This can be converted to a dataframe using:

    smallest_df = pd.DataFrame(smallest.values.tolist())

This is now a dataframe with one row per row of the original dataframe and 150 columns holding that row's 150 smallest values. Note that the original column labels are not kept.

    smallest_df.head()
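If you also need to know which column each value came from, one possible variation (a sketch, not part of the approach above) is to feed heapq.nsmallest pairs of (value, column label). The toy data and the names pairs, values_df and labels_df are made up for illustration.

```python
import heapq
import pandas as pd

# Hypothetical toy data so the result can be checked by hand.
df = pd.DataFrame([[5, 1, 9, 3], [8, 7, 2, 6]], columns=list("ABCD"))
n = 2

# For each row, pair every value with its column label and keep the n pairs
# with the smallest values. key= restricts the comparison to the value,
# so the labels themselves never need to be comparable.
pairs = [
    heapq.nsmallest(n, zip(row, df.columns), key=lambda p: p[0])
    for row in df.to_numpy()
]

# Split the pairs back into one frame of values and one of labels.
values_df = pd.DataFrame([[v for v, _ in r] for r in pairs], index=df.index)
labels_df = pd.DataFrame([[c for _, c in r] for r in pairs], index=df.index)
```

Here values_df holds 1, 3 and 2, 6, and labels_df holds B, D and C, D. This loops over rows in Python, so on large frames it will likely be slower than the NumPy-based approaches.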
If I understand correctly, the question boils down to getting the k smallest numbers in a list of M (> k) numbers, applied to each row individually.
If numpy is available and order does not matter, you could try argpartition. Given a parameter k, it returns an index order in which the element at position k is the one that would sit there in a fully sorted array; everything before it is no larger and everything after it is no smaller, with each group in unspecified order. The first k indices therefore point at the k smallest values:
    import numpy as np

    row = np.array([1, 6, 2, 12, 7, 8, 9, 11, 15, 26])
    k = 5
    idx = np.argpartition(row, k)[:k]
    print(idx)
    print(row[idx])

    -->
    [1 0 2 4 5]
    [6 1 2 7 8]
Edit: This also works row-wise for full arrays:

    import numpy as np

    data = np.array([
        [1, 6, 2, 12, 7, 8, 9, 11, 15, 26],
        [1, 65, 2, 12, 7, 8, 9, 11, 15, 26],
        [16, 6, 2, 12, 7, 8, 9, 11, 15, 26]])
    k = 5
    idx = np.argpartition(data, k)[:, :k]  # partitions along the last axis (each row)
    print(idx)

    -->
    [[1 0 2 4 5]
     [2 0 4 5 6]
     [4 2 1 5 6]]
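If the k smallest values are needed in ascending order after all, one option is to sort only the k candidates that argpartition selected rather than the whole row. The following is a minimal sketch of that idea using the same data; the names part, vals, order and smallest are made up for illustration.

```python
import numpy as np

data = np.array([
    [1, 6, 2, 12, 7, 8, 9, 11, 15, 26],
    [1, 65, 2, 12, 7, 8, 9, 11, 15, 26],
    [16, 6, 2, 12, 7, 8, 9, 11, 15, 26],
])
k = 5

# Unordered column positions of the k smallest values in each row.
part = np.argpartition(data, k, axis=1)[:, :k]

# Sort only those k candidates, then reorder the positions to match.
vals = np.take_along_axis(data, part, axis=1)
order = np.argsort(vals, axis=1)
idx = np.take_along_axis(part, order, axis=1)
smallest = np.take_along_axis(vals, order, axis=1)

print(smallest)
# [[1 2 6 7 8]
#  [1 2 7 8 9]
#  [2 6 7 8 9]]
```

For a DataFrame df, the same idx can be applied to df.columns.to_numpy() to recover the column labels. Note that np.argpartition(data, k) requires k to be smaller than the number of columns.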