How can I slice each element of a numpy array of strings?


Here's a vectorized approach -

    def slicer_vectorized(a,start,end):
        b = a.view((str,1)).reshape(len(a),-1)[:,start:end]
        return np.fromstring(b.tostring(),dtype=(str,end-start))

Sample run -

    In [68]: a = np.array(['hello', 'how', 'are', 'you'])

    In [69]: slicer_vectorized(a,1,3)
    Out[69]:
    array(['el', 'ow', 're', 'ou'],
          dtype='|S2')

    In [70]: slicer_vectorized(a,0,3)
    Out[70]:
    array(['hel', 'how', 'are', 'you'],
          dtype='|S3')
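On recent NumPy releases `tostring` and binary-mode `np.fromstring` are deprecated or removed, so the function above may not run as written. Here is a minimal sketch of the same idea using `tobytes` and `np.frombuffer`. The name `slice_fixed_width` is hypothetical, and the sketch assumes a non-empty 1-D array with a fixed-width `U` or `S` dtype and a slice that keeps at least one character.

```python
import numpy as np

def slice_fixed_width(a, start, end):
    """Slice every element of a 1-D fixed-width string array (hypothetical helper)."""
    a = np.ascontiguousarray(a)
    kind = a.dtype.kind            # 'U' for str, 'S' for bytes
    if kind not in "US":
        raise TypeError("expected a fixed-width 'U' or 'S' array")
    # View the buffer as single characters, one row per element.
    chars = a.view(kind + "1").reshape(len(a), -1)[:, start:end]
    width = chars.shape[1]         # number of characters actually kept
    # tobytes() copies the selected columns into a contiguous buffer,
    # which is then reinterpreted as strings of the new width.
    return np.frombuffer(chars.tobytes(), dtype=f"{kind}{width}")

a = np.array(['hello', 'how', 'are', 'you'])
print(slice_fixed_width(a, 1, 3))
```

`np.frombuffer` over a `bytes` object returns a read-only array, so call `.copy()` on the result if it needs to be modified.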

Runtime test -

The test covers the approaches posted by other authors that I could run at my end, plus the vectorized approach from earlier in this post.

Here are the timings -

    In [53]: # Setup input array
        ...: a = np.array(['hello', 'how', 'are', 'you'])
        ...: a = np.repeat(a,10000)
        ...:

    # @Alberto Garcia-Raboso's answer
    In [54]: %timeit slicer(1, 3)(a)
    10 loops, best of 3: 23.5 ms per loop

    # @hpaulj's answer
    In [55]: %timeit np.frompyfunc(lambda x:x[1:3],1,1)(a)
    100 loops, best of 3: 11.6 ms per loop

    # Using a list comprehension
    In [56]: %timeit np.array([i[1:3] for i in a])
    100 loops, best of 3: 12.1 ms per loop

    # From this post
    In [57]: %timeit slicer_vectorized(a,1,3)
    1000 loops, best of 3: 787 µs per loop


Most, if not all, of the functions in np.char apply existing str methods to each element of the array. They are a little faster than direct iteration (or vectorize), but not drastically so.
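As a small illustration of that point, the sketch below compares one `np.char` function with the equivalent Python-level loop. The choice of `np.char.upper` is only an example; nothing here is specific to slicing.

```python
import numpy as np

a = np.array(['hello', 'how', 'are', 'you'])

# np.char.upper applies str.upper to every element ...
via_char = np.char.upper(a)
# ... which gives the same result as iterating in Python.
via_loop = np.array([s.upper() for s in a])

print(via_char)
print(np.array_equal(via_char, via_loop))
```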

There isn't a string slicer, at least not by that sort of name. The closest is indexing with a slice:

    In [274]: 'astring'[1:3]
    Out[274]: 'st'

    In [275]: 'astring'.__getitem__
    Out[275]: <method-wrapper '__getitem__' of str object at 0xb3866c20>

    In [276]: 'astring'.__getitem__(slice(1,4))
    Out[276]: 'str'

An iterative approach is possible with frompyfunc (which vectorize also uses):

    In [277]: a = np.array(['hello', 'how', 'are', 'you'])

    In [278]: np.frompyfunc(lambda x:x[1:3],1,1)(a)
    Out[278]: array(['el', 'ow', 're', 'ou'], dtype=object)

    In [279]: np.frompyfunc(lambda x:x[1:3],1,1)(a).astype('U2')
    Out[279]:
    array(['el', 'ow', 're', 'ou'],
          dtype='<U2')

I could view it as a single-character array and slice that:

    In [289]: a.view('U1').reshape(4,-1)[:,1:3]
    Out[289]:
    array([['e', 'l'],
           ['o', 'w'],
           ['r', 'e'],
           ['o', 'u']],
          dtype='<U1')

Converting it back to 'U2' takes a copy followed by another view:

    In [290]: a.view('U1').reshape(4,-1)[:,1:3].copy().view('U2')
    Out[290]:
    array([['el'],
           ['ow'],
           ['re'],
           ['ou']],
          dtype='<U2')

The initial view step shows the data buffer as Py3 characters (these would be bytes in an S or Py2 string case):

    In [284]: a.view('U1')
    Out[284]:
    array(['h', 'e', 'l', 'l', 'o', 'h', 'o', 'w', '', '', 'a', 'r', 'e', '',
           '', 'y', 'o', 'u', '', ''],
          dtype='<U1')

Picking the 1:3 columns amounts to picking a.view('U1')[[1,2,6,7,11,12,16,17]] and then reshaping and viewing the result. Without getting into details, I'm not surprised that it requires a copy.
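The flat-index equivalence described above can be checked directly. This is only a sketch: the variable names are illustrative, and it assumes the same four-word `'<U5'` array used throughout this answer.

```python
import numpy as np

a = np.array(['hello', 'how', 'are', 'you'])
n = len(a)
width = a.dtype.itemsize // np.dtype('U1').itemsize   # characters per element
start, stop = 1, 3

# Flat positions of columns start:stop in every row of the character buffer.
flat_idx = (np.arange(n)[:, None] * width + np.arange(start, stop)).ravel()
print(flat_idx.tolist())

# Fancy indexing returns a fresh contiguous array (a copy), so it can be
# reshaped and viewed as 2-character strings.
picked = a.view('U1')[flat_idx]
result = picked.reshape(n, stop - start).view(f'U{stop - start}').ravel()
print(result)
```

The indices are not evenly spaced across the whole buffer, which is why the selected characters have to be gathered into a new buffer before they can be reinterpreted as wider strings.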


Interesting omission... I guess you can always write your own:

    import numpy as np

    def slicer(start=None, stop=None, step=1):
        return np.vectorize(lambda x: x[start:stop:step], otypes=[str])

    a = np.array(['hello', 'how', 'are', 'you'])
    print(slicer(1, 3)(a))    # => ['el' 'ow' 're' 'ou']

EDIT: Here are some benchmarks using the text of Ulysses by James Joyce. It seems the clear winner is @hpaulj's last strategy. @Divakar gets into the race by improving on it.

    import numpy as np
    import requests

    ulysses = requests.get('http://www.gutenberg.org/files/4300/4300-0.txt').text
    a = np.array(ulysses.split())

    # Ufunc
    def slicer(start=None, stop=None, step=1):
        return np.vectorize(lambda x: x[start:stop:step], otypes=[str])

    %timeit slicer(1, 3)(a)
    # => 1 loop, best of 3: 221 ms per loop

    # Non-mutating loop
    def loop1(a):
        out = np.empty(len(a), dtype=object)
        for i, word in enumerate(a):
            out[i] = word[1:3]
        return out

    %timeit loop1(a)
    # => 1 loop, best of 3: 262 ms per loop

    # Mutating loop
    def loop2(a):
        for i in range(len(a)):
            a[i] = a[i][1:3]

    b = a.copy()
    %timeit -n 1 -r 1 loop2(b)
    # => 1 loop, best of 1: 285 ms per loop

    # From @hpaulj's answer
    %timeit np.frompyfunc(lambda x:x[1:3],1,1)(a)
    # => 10 loops, best of 3: 141 ms per loop

    %timeit np.frompyfunc(lambda x:x[1:3],1,1)(a).astype('U2')
    # => 1 loop, best of 3: 170 ms per loop

    %timeit a.view('U1').reshape(len(a),-1)[:,1:3].astype(object).sum(axis=1)
    # => 10 loops, best of 3: 60.7 ms per loop

    # From @Divakar's answer
    def slicer_vectorized(a,start,end):
        b = a.view('S1').reshape(len(a),-1)[:,start:end]
        return np.fromstring(b.tostring(),dtype='S'+str(end-start))

    %timeit slicer_vectorized(a,1,3)
    # => The slowest run took 5.34 times longer than the fastest.
    #    This could mean that an intermediate result is being cached.
    #    10 loops, best of 3: 16.8 ms per loop
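For anyone who wants to repeat a comparison like this outside IPython, here is a rough sketch using the standard `timeit` module. The input is a small synthetic array rather than the Ulysses text (an assumption made so the script needs no download), and the three function names are hypothetical. Timings will differ by machine and NumPy version, so none are quoted here.

```python
import timeit
import numpy as np

# Synthetic stand-in for the word list (an assumption, not the Ulysses data).
a = np.repeat(np.array(['hello', 'how', 'are', 'you']), 1000)

def with_frompyfunc(arr):
    return np.frompyfunc(lambda x: x[1:3], 1, 1)(arr).astype('U2')

def with_comprehension(arr):
    return np.array([s[1:3] for s in arr])

def with_view(arr):
    # View as single characters, slice the columns, copy, and view back.
    chars = arr.view('U1').reshape(len(arr), -1)[:, 1:3]
    return np.ascontiguousarray(chars).view('U2').ravel()

candidates = [with_frompyfunc, with_comprehension, with_view]

# Check that every approach agrees before timing any of them.
reference = with_comprehension(a)
for func in candidates:
    assert np.array_equal(func(a), reference), func.__name__

for func in candidates:
    best = min(timeit.repeat(lambda: func(a), number=10, repeat=3)) / 10
    print(f"{func.__name__}: {best * 1e3:.3f} ms per call")
```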