Faster way to accomplish this Pandas job than by using Apply for large data set?

python pandas

My shot at the problem:

extract all object_a's under a Parent Name where i) there are >1 object_a's and; ii) the object_a has 0 ticks but the other object_a has >0 ticks. i.e. just the one with zero ticks
extract all object_b's under a Parent Name where i) there is >=1 object_a and; ii) the object_b has 0 ticks but the object_a has >0 ticks

My first impression when reading this is that the actual "Type" doesn't really matter, we just want an existing object_a with >0 Ticks for each group, and extract all the elements with 0 ticks, regardless of their type.

Considering that, my approach was first to create a new column to count the number of object_a ticks for any parent. If this number is >0, it means that at least 1 object_a exists with Ticks>0.

In [63]: df.groupby(['Parent Name']).apply(lambda x: x[x['Type'] == 'object_a']['Ticks'].sum())Out[63]: Parent Name3217863     24556421    34dtype: int64

Let's now merge that into the original DataFrame...

In [64]: sumATicks = df.groupby(['Parent Name']).apply(lambda x: x[x['Type'] == 'object_a']['Ticks'].sum())In [65]: merged = df.merge(pd.DataFrame(sumATicks).rename(columns={0: 'nbATicks'}), left_on='Parent Name', right_index=True)In [66]: mergedOut[66]:        Type  Parent Name  Ticks  nbATicks0  object_a      4556421     34        341  object_a      4556421      0        342  object_b      4556421      0        343  object_a      3217863      2         24  object_b      3217863      1         2

...and extract all the interesting rows, according to the criteria I stated above:

In [67]: merged[(merged['nbATicks'] > 0) & (merged['Ticks'] == 0)]Out[67]:        Type  Parent Name  Ticks  nbATicks1  object_a      4556421      0        342  object_b      4556421      0        34

Hopefully I didn't forget any fringe case...

Regarding the chunk problem, why don't you just load the whole csv file in memory ? If it's that big, you can try sorting by ParentName before processing, and splitting the chunks at relevant places.

python pandas

Here comes my idea for the problem:

I think the first objective is easier because we only depend on rows with object_a. We can use transform to convert the conditions into boolean list:

df_1 = df.loc[df['Type']=='object_a']object_a = df_1.loc[(df_1.groupby('Parent_Name')['Ticks'].transform(min)==0)&                    (df_1.groupby('Parent_Name')['Ticks'].transform(max)>0)&                    (a['Ticks']==0)                   ]Out[1]:        Type  Parent_Name  Ticks1  object_a      4556421      0

For the second objective i create a list of Parent_Names meeting the requirements for object_a. In the next step isin is used to selected only the corresponding rows.

a_condition = df.loc[df['Type']=='object_a'].groupby('Parent_Name').sum()a_condition = a_condition[a_condition>0].indexobject_b = df.loc[(df['Type']=='object_b')&                  (df['Ticks']==0)&                  (df['Parent_Name'].isin(a_condition))                 ]Out[2]:        Type  Parent_Name  Ticks2  object_b      4556421      0

python pandas

In [35]: dfOut[32]:        Type         Parent Name   Ticks0  object_a             4556421      341  object_a             4556421       02  object_b             4556421       03  object_a             3217863       24  object_b             3217863       1

Aggregate the data into tuples

In [33]: df1 = df.groupby(['Parent Name',                           'Type']).agg(lambda x: tuple(x)).unstack(1)In [34]: df1Out[34]:                       Ticks         Type               object_a object_b       Parent Name                  3217863                (2,)     (1,)4556421             (34, 0)     (0,)

Build the Boolean mask for your case #1

In [35]: mask1 = df1.apply(lambda x: (len(x[0])>1) & ((x[0]).count(0)==1),                            axis=1)In [36]: mask1Out[36]:        Parent Name3217863    False4556421     Truedtype: bool

Build Boolean mask for your case #2

In [37]: mask2 = df1.apply(lambda x: ((len(x[0])>=1) &                                       (len(set(x[0]).difference([0]))>0) &                                      (len(x[1])==1) &                                       (x[1][0]==0)),                           axis=1)In [38]: mask2Out[38]:        Parent Name3217863    False4556421     Truedtype: bool

Get the result for case #1

In [39]: df1.loc[mask1, [('Ticks', 'object_a')]]Out[39]:                       TicksType               object_a       Parent Name         4556421             (34, 0)

Get the result for case #2

In [30]: df1.loc[mask2, [('Ticks', 'object_b')]]Out[30]:                       TicksType               object_b       Parent Name         4556421                (0,)

CodeHunter

Faster way to accomplish this Pandas job than by using Apply for large data set?

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last