Faster way to accomplish this Pandas job than by using Apply for large data set?
My shot at the problem:
- extract all object_a's under a Parent Name where i) there are >1 object_a's and; ii) the object_a has 0 ticks but the other object_a has >0 ticks. i.e. just the one with zero ticks
- extract all object_b's under a Parent Name where i) there is >=1 object_a and; ii) the object_b has 0 ticks but the object_a has >0 ticks
My first impression when reading this is that the actual "Type" doesn't really matter, we just want an existing object_a
with >0 Ticks for each group, and extract all the elements with 0 ticks, regardless of their type.
Considering that, my approach was first to create a new column to count the number of object_a
ticks for any parent. If this number is >0, it means that at least 1 object_a
exists with Ticks>0.
In [63]: df.groupby(['Parent Name']).apply(lambda x: x[x['Type'] == 'object_a']['Ticks'].sum())Out[63]: Parent Name3217863 24556421 34dtype: int64
Let's now merge that into the original DataFrame...
In [64]: sumATicks = df.groupby(['Parent Name']).apply(lambda x: x[x['Type'] == 'object_a']['Ticks'].sum())In [65]: merged = df.merge(pd.DataFrame(sumATicks).rename(columns={0: 'nbATicks'}), left_on='Parent Name', right_index=True)In [66]: mergedOut[66]: Type Parent Name Ticks nbATicks0 object_a 4556421 34 341 object_a 4556421 0 342 object_b 4556421 0 343 object_a 3217863 2 24 object_b 3217863 1 2
...and extract all the interesting rows, according to the criteria I stated above:
In [67]: merged[(merged['nbATicks'] > 0) & (merged['Ticks'] == 0)]Out[67]: Type Parent Name Ticks nbATicks1 object_a 4556421 0 342 object_b 4556421 0 34
Hopefully I didn't forget any fringe case...
Regarding the chunk problem, why don't you just load the whole csv file in memory ? If it's that big, you can try sorting by ParentName before processing, and splitting the chunks at relevant places.
Here comes my idea for the problem:
I think the first objective is easier because we only depend on rows with object_a. We can use transform to convert the conditions into boolean list:
df_1 = df.loc[df['Type']=='object_a']object_a = df_1.loc[(df_1.groupby('Parent_Name')['Ticks'].transform(min)==0)& (df_1.groupby('Parent_Name')['Ticks'].transform(max)>0)& (a['Ticks']==0) ]Out[1]: Type Parent_Name Ticks1 object_a 4556421 0
For the second objective i create a list of Parent_Names meeting the requirements for object_a. In the next step isin is used to selected only the corresponding rows.
a_condition = df.loc[df['Type']=='object_a'].groupby('Parent_Name').sum()a_condition = a_condition[a_condition>0].indexobject_b = df.loc[(df['Type']=='object_b')& (df['Ticks']==0)& (df['Parent_Name'].isin(a_condition)) ]Out[2]: Type Parent_Name Ticks2 object_b 4556421 0
In [35]: dfOut[32]: Type Parent Name Ticks0 object_a 4556421 341 object_a 4556421 02 object_b 4556421 03 object_a 3217863 24 object_b 3217863 1
Aggregate the data into tuple
s
In [33]: df1 = df.groupby(['Parent Name', 'Type']).agg(lambda x: tuple(x)).unstack(1)In [34]: df1Out[34]: Ticks Type object_a object_b Parent Name 3217863 (2,) (1,)4556421 (34, 0) (0,)
Build the Boolean mask for your case #1
In [35]: mask1 = df1.apply(lambda x: (len(x[0])>1) & ((x[0]).count(0)==1), axis=1)In [36]: mask1Out[36]: Parent Name3217863 False4556421 Truedtype: bool
Build Boolean mask for your case #2
In [37]: mask2 = df1.apply(lambda x: ((len(x[0])>=1) & (len(set(x[0]).difference([0]))>0) & (len(x[1])==1) & (x[1][0]==0)), axis=1)In [38]: mask2Out[38]: Parent Name3217863 False4556421 Truedtype: bool
Get the result for case #1
In [39]: df1.loc[mask1, [('Ticks', 'object_a')]]Out[39]: TicksType object_a Parent Name 4556421 (34, 0)
Get the result for case #2
In [30]: df1.loc[mask2, [('Ticks', 'object_b')]]Out[30]: TicksType object_b Parent Name 4556421 (0,)