Anti-Join Pandas
indicator = True
in merge
command will tell you which join was applied by creating new column _merge
with three possible values:
left_only
right_only
both
Keep right_only
and left_only
. That is it.
outer_join = TableA.merge(TableB, how = 'outer', indicator = True)anti_join = outer_join[~(outer_join._merge == 'both')].drop('_merge', axis = 1)
easy!
Here is a comparison with a solution from piRSquared:
1) When run on this example matching based on one column, piRSquared's solution is faster.
2) But it only works for matching on one column. If you want to match on several columns - my solution works just as fine as with one column.
So it's up for you to decide.
Consider the following dataframes
TableA = pd.DataFrame(np.random.rand(4, 3), pd.Index(list('abcd'), name='Key'), ['A', 'B', 'C']).reset_index()TableB = pd.DataFrame(np.random.rand(4, 3), pd.Index(list('aecf'), name='Key'), ['A', 'B', 'C']).reset_index()
TableA
TableB
This is one way to do what you want
Method 1
# Identify what values are in TableB and not in TableAkey_diff = set(TableB.Key).difference(TableA.Key)where_diff = TableB.Key.isin(key_diff)# Slice TableB accordingly and append to TableATableA.append(TableB[where_diff], ignore_index=True)
Method 2
rows = []for i, row in TableB.iterrows(): if row.Key not in TableA.Key.values: rows.append(row)pd.concat([TableA.T] + rows, axis=1).T
Timing
4 rows with 2 overlap
Method 1 is much quicker
10,000 rows 5,000 overlap
loops are bad
I had the same problem. This answer using how='outer'
and indicator=True
of merge inspired me to come up with this solution:
import pandas as pdimport numpy as npTableA = pd.DataFrame(np.random.rand(4, 3), pd.Index(list('abcd'), name='Key'), ['A', 'B', 'C']).reset_index()TableB = pd.DataFrame(np.random.rand(4, 3), pd.Index(list('aecf'), name='Key'), ['A', 'B', 'C']).reset_index()print('TableA', TableA, sep='\n')print('TableB', TableB, sep='\n')TableB_only = pd.merge( TableA, TableB, how='outer', on='Key', indicator=True, suffixes=('_foo','')).query( '_merge == "right_only"')print('TableB_only', TableB_only, sep='\n')Table_concatenated = pd.concat((TableA, TableB_only), join='inner')print('Table_concatenated', Table_concatenated, sep='\n')
Which prints this output:
TableA Key A B C0 a 0.035548 0.344711 0.8609181 b 0.640194 0.212250 0.2773592 c 0.592234 0.113492 0.0374443 d 0.112271 0.205245 0.227157TableB Key A B C0 a 0.754538 0.692902 0.5377041 e 0.499092 0.864145 0.0045592 c 0.082087 0.682573 0.4216543 f 0.768914 0.281617 0.924693TableB_only Key A_foo B_foo C_foo A B C _merge4 e NaN NaN NaN 0.499092 0.864145 0.004559 right_only5 f NaN NaN NaN 0.768914 0.281617 0.924693 right_onlyTable_concatenated Key A B C0 a 0.035548 0.344711 0.8609181 b 0.640194 0.212250 0.2773592 c 0.592234 0.113492 0.0374443 d 0.112271 0.205245 0.2271574 e 0.499092 0.864145 0.0045595 f 0.768914 0.281617 0.924693