Anti-Join Pandas

python pandas dataframe merge anti-join

indicator = True in merge command will tell you which join was applied by creating new column _merge with three possible values:

left_only
right_only
both

Keep right_only and left_only. That is it.

outer_join = TableA.merge(TableB, how = 'outer', indicator = True)anti_join = outer_join[~(outer_join._merge == 'both')].drop('_merge', axis = 1)

easy!

Here is a comparison with a solution from piRSquared:

1) When run on this example matching based on one column, piRSquared's solution is faster.

2) But it only works for matching on one column. If you want to match on several columns - my solution works just as fine as with one column.

So it's up for you to decide.

python pandas dataframe merge anti-join

Consider the following dataframes

TableA = pd.DataFrame(np.random.rand(4, 3),                      pd.Index(list('abcd'), name='Key'),                      ['A', 'B', 'C']).reset_index()TableB = pd.DataFrame(np.random.rand(4, 3),                      pd.Index(list('aecf'), name='Key'),                      ['A', 'B', 'C']).reset_index()

TableA

TableB

This is one way to do what you want

Method 1

# Identify what values are in TableB and not in TableAkey_diff = set(TableB.Key).difference(TableA.Key)where_diff = TableB.Key.isin(key_diff)# Slice TableB accordingly and append to TableATableA.append(TableB[where_diff], ignore_index=True)

Method 2

rows = []for i, row in TableB.iterrows():    if row.Key not in TableA.Key.values:        rows.append(row)pd.concat([TableA.T] + rows, axis=1).T

Timing

4 rows with 2 overlap

Method 1 is much quicker

10,000 rows 5,000 overlap

loops are bad

python pandas dataframe merge anti-join

I had the same problem. This answer using how='outer' and indicator=True of merge inspired me to come up with this solution:

import pandas as pdimport numpy as npTableA = pd.DataFrame(np.random.rand(4, 3),                      pd.Index(list('abcd'), name='Key'),                      ['A', 'B', 'C']).reset_index()TableB = pd.DataFrame(np.random.rand(4, 3),                      pd.Index(list('aecf'), name='Key'),                      ['A', 'B', 'C']).reset_index()print('TableA', TableA, sep='\n')print('TableB', TableB, sep='\n')TableB_only = pd.merge(    TableA, TableB,    how='outer', on='Key', indicator=True, suffixes=('_foo','')).query(        '_merge == "right_only"')print('TableB_only', TableB_only, sep='\n')Table_concatenated = pd.concat((TableA, TableB_only), join='inner')print('Table_concatenated', Table_concatenated, sep='\n')

Which prints this output:

TableA  Key         A         B         C0   a  0.035548  0.344711  0.8609181   b  0.640194  0.212250  0.2773592   c  0.592234  0.113492  0.0374443   d  0.112271  0.205245  0.227157TableB  Key         A         B         C0   a  0.754538  0.692902  0.5377041   e  0.499092  0.864145  0.0045592   c  0.082087  0.682573  0.4216543   f  0.768914  0.281617  0.924693TableB_only  Key  A_foo  B_foo  C_foo         A         B         C      _merge4   e    NaN    NaN    NaN  0.499092  0.864145  0.004559  right_only5   f    NaN    NaN    NaN  0.768914  0.281617  0.924693  right_onlyTable_concatenated  Key         A         B         C0   a  0.035548  0.344711  0.8609181   b  0.640194  0.212250  0.2773592   c  0.592234  0.113492  0.0374443   d  0.112271  0.205245  0.2271574   e  0.499092  0.864145  0.0045595   f  0.768914  0.281617  0.924693

CodeHunter

Anti-Join Pandas

Method 1

Method 2

Timing

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last