Pandas: Select values from specific columns of a DataFrame by row

In [22]: df['new'] = df.lookup(df.index, columns_to_select)In [23]: dfOut[23]:   A   B    C  new0  1  10  100   101  2  20  200    22  3  30  300    33  4  40  400  400

python pandas numpy indexing

NumPy way

Here's a vectorized NumPy way using advanced indexing -

# Extract array dataIn [10]: a = df.values# Get integer based column IDsIn [11]: col_idx = np.searchsorted(df.columns, columns_to_select)# Use NumPy's advanced indexing to extract relevant elem per rowIn [12]: a[np.arange(len(col_idx)), col_idx]Out[12]: array([ 10,   2,   3, 400])

If column names of df are not sorted, we need to use sorter argument with np.searchsorted. The code to extract col_idx for such a generic df would be :

# https://stackoverflow.com/a/38489403/ @Divakardef column_index(df, query_cols):    cols = df.columns.values    sidx = np.argsort(cols)    return sidx[np.searchsorted(cols,query_cols,sorter=sidx)]

So, col_idx would be obtained like so -

col_idx = column_index(df, columns_to_select)

Further optimization

Profiling it revealed that the bottleneck was processing strings with np.searchsorted, the usual NumPy weakness of not being so great with strings. So, to overcome that and using the special case scenario of column names being single letters, we could quickly convert those to numerals and then feed those to searchsorted for much faster processing.

Thus, an optimized version of getting the integer based column IDs, for the case where the column names are single letters and sorted, would be -

def column_index_singlechar_sorted(df, query_cols):    c0 = np.fromstring(''.join(df.columns), dtype=np.uint8)    c1 = np.fromstring(''.join(query_cols), dtype=np.uint8)    return np.searchsorted(c0, c1)

This, gives us a modified version of the solution, like so -

a = df.valuescol_idx = column_index_singlechar_sorted(df, columns_to_select)out = pd.Series(a[np.arange(len(col_idx)), col_idx])

Timings -

In [149]: # Setup df with 26 uppercase column letters and many rows     ...: import string     ...: df = pd.DataFrame(np.random.randint(0,9,(1000000,26)))     ...: s = list(string.uppercase[:df.shape[1]])     ...: df.columns = s     ...: idx = np.random.randint(0,df.shape[1],len(df))     ...: columns_to_select = np.take(s, idx).tolist()# With df.lookup from @MaxU's solnIn [150]: %timeit pd.Series(df.lookup(df.index, columns_to_select))10 loops, best of 3: 76.7 ms per loop# With proposed one from this solnIn [151]: %%timeit     ...: a = df.values     ...: col_idx = column_index_singlechar_sorted(df, columns_to_select)     ...: out = pd.Series(a[np.arange(len(col_idx)), col_idx])10 loops, best of 3: 59 ms per loop

Given that df.lookup solves for a generic case, that's a probably a better choice, but the other possible optimizations as shown in this post could be handy as well!

CodeHunter

Pandas: Select values from specific columns of a DataFrame by row

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last