Broadcast 1D array against 2D array for lexsort : Permutation for sorting each column independently when considering yet another vector

Here's one approach -

def app1(a, g):    m,n = a.shape    g_idx = np.unique(g, return_inverse=1)[1]    N = g_idx.max()+1    g_idx2D = g_idx[:,None] + N*np.arange(n)    r_out = np.lexsort([a.ravel('F'), g_idx2D.ravel('F')]).reshape(-1,m).T    r_out -= m*np.arange(n)    return r_out

The idea is simply that we create a 2D grid of integer version of g array of strings and then offset each column by a barrier that would limit the lexsort search within each column.

Now, on performance, it seems for large datasets, lexsort itself would be the bottleneck. For our problem, we are dealing with just two columns. So, we can create our own custom lexsort that scales the second column based on an offset, which is the max limit of number from the first column. The implementation for the same would look something like this -

def lexsort_twocols(A, B):    S = A.max() - A.min() + 1    return (B*S + A).argsort()

Thus, incorporating this into our proposed method and optimizing the creation of g_idx2D, we would have a formal function like so -

def proposed_app(a, g):    m,n = a.shape    g_idx = np.unique(g, return_inverse=1)[1]    N = g_idx.max()+1    g_idx2D = (g_idx + N*np.arange(n)[:,None]).ravel()    r_out = lexsort_twocols(a.ravel('F'), g_idx2D).reshape(-1,m).T        r_out -= m*np.arange(n)    return r_out

Runtime test

Original approach :

def org_app(a, g):    return np.column_stack([np.lexsort([a[:, i], g]) for i in range(a.shape[1])])

Timings -

In [763]: a = np.random.randint(10, size=(20, 10000))     ...: g = np.random.choice(list('abcdefgh'), 20)     ...: In [764]: %timeit org_app(a,g)10 loops, best of 3: 27.7 ms per loopIn [765]: %timeit app1(a,g)10 loops, best of 3: 25.4 ms per loopIn [766]: %timeit proposed_app(a,g)100 loops, best of 3: 5.93 ms per loop

python pandas numpy array-broadcasting

I'm only posting this to have a good place to show my derivative work, based on Divakar's answer. His lexsort_twocols function does everything we need and can just as easily be applied to broadcast a single dimension over several others. We can forgo the additional work in in proposed_app because we can use axis=0 in the argsort in the lexsort_twocols function.

def lexsort2(a, g):    n, m = a.shape    f = np.unique(g, return_inverse=1)[1] * (a.max() - a.min() + 1)    return (f[:, None] + a).argsort(0)lexsort2(a, g)array([[5, 5, 1, 1],       [1, 1, 5, 5],       [9, 9, 9, 9],       [0, 0, 2, 2],       [2, 2, 0, 0],       [4, 4, 6, 4],       [6, 6, 4, 6],       [3, 3, 7, 3],       [7, 7, 3, 7],       [8, 8, 8, 8]])

I also thought of this... though not nearly as good because I'm still relying on np.lexsort which as Divakar pointed out, can be slow.

def lexsort3(a, g):    n, m = a.shape    a_ = a.ravel()    g_ = np.repeat(g, m)    c_ = np.tile(np.arange(m), n)    return np.lexsort([c_, a_, g_]).reshape(n, m) // mlexsort3(a, g)array([[5, 5, 1, 1],       [1, 1, 5, 5],       [9, 9, 9, 9],       [0, 0, 2, 2],       [2, 2, 0, 0],       [4, 4, 6, 4],       [6, 6, 4, 6],       [3, 3, 7, 3],       [7, 7, 3, 7],       [8, 8, 8, 8]])

Assuming my first concept is lexsort1

def lexsort1(a, g):    return np.column_stack(        [np.lexsort([a[:, i], g]) for i in range(a.shape[1])]    )

from timeit import timeitimport pandas as pdresults = pd.DataFrame(    index=[100, 300, 1000, 3000, 10000, 30000, 100000, 300000, 1000000],    columns=['lexsort1', 'lexsort2', 'lexsort3'])for i in results.index:    a = np.random.randint(100, size=(i, 4))    g = np.random.choice(list('abcdefghijklmn'), i)    for f in results.columns:        results.set_value(            i, f,            timeit('{}(a, g)'.format(f), 'from __main__ import a, g, {}'.format(f))        )results.plot()

Thanks again @Divakar. Please upvote his answer!!!

CodeHunter

Broadcast 1D array against 2D array for lexsort : Permutation for sorting each column independently when considering yet another vector

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last