Faster alternatives to Pandas pivot_table
Convert the months and industry columns to categorical columns (see https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html). This way you avoid a lot of string comparisons.
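A minimal sketch of that conversion, in case it helps. The toy values and the `orders` value column are assumptions for illustration; only the `months` and `industry` column names come from the question.

```python
import pandas as pd

# Toy data; the values and the `orders` column are illustrative assumptions.
df = pd.DataFrame({
    "months": ["Jan", "Feb", "Jan", "Mar", "Feb"],
    "industry": ["tech", "retail", "retail", "tech", "tech"],
    "orders": [4, 7, 5, 9, 1],
})

# Store the repeated strings as integer codes plus a small lookup table.
for name in ("months", "industry"):
    df[name] = df[name].astype("category")

# observed=True keeps only the category combinations that occur in the data.
table = df.pivot_table(index="months", columns="industry", values="orders",
                       aggfunc="sum", fill_value=0, observed=True)
print(table)
```

Grouping then works on the integer codes rather than on the strings themselves, which is where the speedup is expected to come from.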
You can use sparse matrices. They are fast to implement, though a little restricted. For example, you can't do indexing on a coo_matrix.
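One common workaround for the indexing restriction is to convert the COO matrix to CSR, which does support indexing and slicing. A small sketch, with made-up values:

```python
import numpy as np
from scipy import sparse

# Small COO matrix with made-up values.
row = np.array([0, 3, 1, 0])
col = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])
mat = sparse.coo_matrix((data, (row, col)), shape=(4, 4))

# CSR supports element access and row slicing.
csr = mat.tocsr()
value = csr[0, 2]             # single element
first_row = csr[0].toarray()  # first row as a dense 1 x 4 array
print(value, first_row)
```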
I recently needed to train a recommender system (LightFM), and it accepted sparse matrices as input, which made my job a lot easier. See it in action:
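    import numpy as np
    from scipy import sparse

    row = np.array([0, 3, 1, 0])   # row position of each value
    col = np.array([0, 3, 1, 2])   # column position of each value
    data = np.array([4, 5, 7, 9])  # the values themselves
    mat = sparse.coo_matrix((data, (row, col)), shape=(4, 4))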
    >>> print(mat)
      (0, 0)    4
      (3, 3)    5
      (1, 1)    7
      (0, 2)    9
    >>> print(mat.toarray())
    [[4 0 9 0]
     [0 7 0 0]
     [0 0 0 0]
     [0 0 0 5]]
As you can see, it automatically creates a pivot table for you from the row and column positions of the data you have and fills the rest with zeros. You can also convert the sparse matrix into an array or a DataFrame (df = pd.DataFrame.sparse.from_spmatrix(mat, index=..., columns=...)).
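To go from a long-format DataFrame to such a matrix, one option is to use categorical codes as the row and column positions. This is only a sketch: the toy values and the `orders` column are assumptions, and converting to CSR sums duplicate (row, column) pairs, which corresponds to aggfunc="sum" and to no other aggregation.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy long-format data; all values and the `orders` column are made up.
df = pd.DataFrame({
    "months": ["Jan", "Feb", "Jan", "Mar", "Jan"],
    "industry": ["tech", "retail", "retail", "tech", "tech"],
    "orders": [4, 7, 5, 9, 1],
})

rows = df["months"].astype("category")
cols = df["industry"].astype("category")

# Category codes serve as the integer row/column positions of each record.
mat = sparse.coo_matrix(
    (df["orders"].to_numpy(),
     (rows.cat.codes.to_numpy(), cols.cat.codes.to_numpy())),
    shape=(len(rows.cat.categories), len(cols.cat.categories)),
)

# Converting to CSR sums duplicate (row, col) entries, like aggfunc="sum".
pivot = pd.DataFrame.sparse.from_spmatrix(
    mat.tocsr(), index=rows.cat.categories, columns=cols.cat.categories
)
print(pivot)
```

Here the two Jan/tech records (4 and 1) end up in a single cell with the value 5.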
When you read the CSV file into a DataFrame, you could pass a converter function (via the read_csv parameter converters) to transform client_name into a hash and downcast orders to an appropriate int type, in particular an unsigned one.
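A rough sketch of that idea. The in-memory CSV, the `hash_name` helper and the choice of CRC32 are all assumptions for illustration (Python's built-in hash() of strings changes between processes, so a stable checksum is swapped in). For `orders`, the read_csv dtype parameter is used instead of a converter, since it sets the integer type directly.

```python
import io
import zlib

import numpy as np
import pandas as pd

# Stand-in for the CSV file; the contents are made up.
csv_data = io.StringIO(
    "client_name,orders\n"
    "Acme Corp,12\n"
    "Globex,250\n"
    "Acme Corp,7\n"
)

def hash_name(name):
    """Hypothetical helper: map a client name to a stable 32-bit integer."""
    return zlib.crc32(name.encode("utf-8"))

df = pd.read_csv(
    csv_data,
    converters={"client_name": hash_name},  # called on each raw string cell
    dtype={"orders": np.uint16},            # unsigned type, 0 to 65535
)

# crc32 returns a value in the uint32 range, so this cast is lossless.
df["client_name"] = df["client_name"].astype(np.uint32)
print(df.dtypes)
```

Keep in mind that any hash can collide, so two different clients could in principle map to the same integer.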
This function lists the types and their ranges:
    import numpy as np

    # Note: np.sctypes was removed in NumPy 2.0, so this runs on NumPy 1.x only.
    def list_np_types():
        for k, v in np.sctypes.items():
            for i, d in enumerate(v):
                if np.dtype(d).kind in 'iu':  # only int and uint have a definite range
                    fmt = '{:>7}, {:>2}: {:>26} From: {:>20}\tTo: {}'
                    print(fmt.format(k, i, str(d), str(np.iinfo(d).min), str(np.iinfo(d).max)))
                else:
                    print('{:>7}, {:>2}: {:>26}'.format(k, i, str(d)))

    list_np_types()
Output:
        int,  0:       <class 'numpy.int8'> From:                 -128    To: 127
        int,  1:      <class 'numpy.int16'> From:               -32768    To: 32767
        int,  2:      <class 'numpy.int32'> From:          -2147483648    To: 2147483647
        int,  3:      <class 'numpy.int64'> From: -9223372036854775808    To: 9223372036854775807
       uint,  0:      <class 'numpy.uint8'> From:                    0    To: 255
       uint,  1:     <class 'numpy.uint16'> From:                    0    To: 65535
       uint,  2:     <class 'numpy.uint32'> From:                    0    To: 4294967295
       uint,  3:     <class 'numpy.uint64'> From:                    0    To: 18446744073709551615
      float,  0:    <class 'numpy.float16'>
      float,  1:    <class 'numpy.float32'>
      float,  2:    <class 'numpy.float64'>
    complex,  0:  <class 'numpy.complex64'>
    complex,  1: <class 'numpy.complex128'>
     others,  0:             <class 'bool'>
     others,  1:           <class 'object'>
     others,  2:            <class 'bytes'>
     others,  3:              <class 'str'>
     others,  4:       <class 'numpy.void'>
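If you would rather not pick the type by hand from the table, pandas can choose the smallest fitting unsigned type for an existing column. A small sketch, with made-up `orders` values:

```python
import numpy as np
import pandas as pd

# Made-up order counts.
orders = pd.Series([12, 250, 7, 40000])

# Let pandas pick the smallest unsigned type that holds every value.
compact = pd.to_numeric(orders, downcast="unsigned")

# Cross-check by hand against the uint16 range from the table (0 to 65535).
fits_uint16 = orders.min() >= 0 and orders.max() <= np.iinfo(np.uint16).max

print(compact.dtype, fits_uint16)
print(orders.memory_usage(index=False), compact.memory_usage(index=False))
```

The largest value here is 40000, which is too big for uint8 but fits in uint16, so that is the type the downcast settles on.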