Faster way to accomplish this Pandas job than by using Apply for large data set? Faster way to accomplish this Pandas job than by using Apply for large data set? python python

Faster way to accomplish this Pandas job than by using Apply for large data set?


My shot at the problem:

  • extract all object_a's under a Parent Name where i) there are >1 object_a's and; ii) the object_a has 0 ticks but the other object_a has >0 ticks. i.e. just the one with zero ticks
  • extract all object_b's under a Parent Name where i) there is >=1 object_a and; ii) the object_b has 0 ticks but the object_a has >0 ticks

My first impression when reading this is that the actual "Type" doesn't really matter, we just want an existing object_a with >0 Ticks for each group, and extract all the elements with 0 ticks, regardless of their type.

Considering that, my approach was first to create a new column to count the number of object_a ticks for any parent. If this number is >0, it means that at least 1 object_a exists with Ticks>0.

In [63]: df.groupby(['Parent Name']).apply(lambda x: x[x['Type'] == 'object_a']['Ticks'].sum())Out[63]: Parent Name3217863     24556421    34dtype: int64

Let's now merge that into the original DataFrame...

In [64]: sumATicks = df.groupby(['Parent Name']).apply(lambda x: x[x['Type'] == 'object_a']['Ticks'].sum())In [65]: merged = df.merge(pd.DataFrame(sumATicks).rename(columns={0: 'nbATicks'}), left_on='Parent Name', right_index=True)In [66]: mergedOut[66]:        Type  Parent Name  Ticks  nbATicks0  object_a      4556421     34        341  object_a      4556421      0        342  object_b      4556421      0        343  object_a      3217863      2         24  object_b      3217863      1         2

...and extract all the interesting rows, according to the criteria I stated above:

In [67]: merged[(merged['nbATicks'] > 0) & (merged['Ticks'] == 0)]Out[67]:        Type  Parent Name  Ticks  nbATicks1  object_a      4556421      0        342  object_b      4556421      0        34

Hopefully I didn't forget any fringe case...

Regarding the chunk problem, why don't you just load the whole csv file in memory ? If it's that big, you can try sorting by ParentName before processing, and splitting the chunks at relevant places.


Here comes my idea for the problem:

I think the first objective is easier because we only depend on rows with object_a. We can use transform to convert the conditions into boolean list:

df_1 = df.loc[df['Type']=='object_a']object_a = df_1.loc[(df_1.groupby('Parent_Name')['Ticks'].transform(min)==0)&                    (df_1.groupby('Parent_Name')['Ticks'].transform(max)>0)&                    (a['Ticks']==0)                   ]Out[1]:        Type  Parent_Name  Ticks1  object_a      4556421      0

For the second objective i create a list of Parent_Names meeting the requirements for object_a. In the next step isin is used to selected only the corresponding rows.

a_condition = df.loc[df['Type']=='object_a'].groupby('Parent_Name').sum()a_condition = a_condition[a_condition>0].indexobject_b = df.loc[(df['Type']=='object_b')&                  (df['Ticks']==0)&                  (df['Parent_Name'].isin(a_condition))                 ]Out[2]:        Type  Parent_Name  Ticks2  object_b      4556421      0


In [35]: dfOut[32]:        Type         Parent Name   Ticks0  object_a             4556421      341  object_a             4556421       02  object_b             4556421       03  object_a             3217863       24  object_b             3217863       1

Aggregate the data into tuples

In [33]: df1 = df.groupby(['Parent Name',                           'Type']).agg(lambda x: tuple(x)).unstack(1)In [34]: df1Out[34]:                       Ticks         Type               object_a object_b       Parent Name                  3217863                (2,)     (1,)4556421             (34, 0)     (0,)

Build the Boolean mask for your case #1

In [35]: mask1 = df1.apply(lambda x: (len(x[0])>1) & ((x[0]).count(0)==1),                            axis=1)In [36]: mask1Out[36]:        Parent Name3217863    False4556421     Truedtype: bool

Build Boolean mask for your case #2

In [37]: mask2 = df1.apply(lambda x: ((len(x[0])>=1) &                                       (len(set(x[0]).difference([0]))>0) &                                      (len(x[1])==1) &                                       (x[1][0]==0)),                           axis=1)In [38]: mask2Out[38]:        Parent Name3217863    False4556421     Truedtype: bool

Get the result for case #1

In [39]: df1.loc[mask1, [('Ticks', 'object_a')]]Out[39]:                       TicksType               object_a       Parent Name         4556421             (34, 0)

Get the result for case #2

In [30]: df1.loc[mask2, [('Ticks', 'object_b')]]Out[30]:                       TicksType               object_b       Parent Name         4556421                (0,)