Finding common rows (intersection) in two Pandas dataframes
My understanding is that this question is better answered over in this post.
But briefly, the answer to the OP with this method is simply:
s1 = pd.merge(df1, df2, how='inner', on=['user_id'])
Which gives s1 with 5 columns: user_id and the other two columns from each of df1 and df2.
If I understand you correctly, you can use a combination of Series.isin()
and DataFrame.append()
:
In [80]: df1Out[80]: rating user_id0 2 0x21abL1 1 0x21abL2 1 0xdafL3 0 0x21abL4 4 0x1d14L5 2 0x21abL6 1 0x21abL7 0 0xdafL8 4 0x1d14L9 1 0x21abLIn [81]: df2Out[81]: rating user_id0 2 0x1d14L1 1 0xdbdcad72 1 0x21abL3 3 0x21abL4 3 0x21abL5 1 0x5734a81e26 2 0x1d14L7 0 0xdafL8 0 0x1d14L9 4 0x5734a81e2In [82]: ind = df2.user_id.isin(df1.user_id) & df1.user_id.isin(df2.user_id)In [83]: indOut[83]:0 True1 False2 True3 True4 True5 False6 True7 True8 True9 FalseName: user_id, dtype: boolIn [84]: df1[ind].append(df2[ind])Out[84]: rating user_id0 2 0x21abL2 1 0xdafL3 0 0x21abL4 4 0x1d14L6 1 0x21abL7 0 0xdafL8 4 0x1d14L0 2 0x1d14L2 1 0x21abL3 3 0x21abL4 3 0x21abL6 2 0x1d14L7 0 0xdafL8 0 0x1d14L
This is essentially the algorithm you described as "clunky", using idiomatic pandas
methods. Note the duplicate row indices. Also, note that this won't give you the expected output if df1
and df2
have no overlapping row indices, i.e., if
In [93]: df1.index & df2.indexOut[93]: Int64Index([], dtype='int64')
In fact, it won't give the expected output if their row indices are not equal.
In SQL, this problem could be solved by several methods:
select * from df1 where exists (select * from df2 where df2.user_id = df1.user_id)union allselect * from df2 where exists (select * from df1 where df1.user_id = df2.user_id)
or join and then unpivot (possible in SQL server)
select df1.user_id, c.ratingfrom df1 inner join df2 on df2.user_i = df1.user_id outer apply ( select df1.rating union all select df2.rating ) as c
Second one could be written in pandas with something like:
>>> df1 = pd.DataFrame({"user_id":[1,2,3], "rating":[10, 15, 20]})>>> df2 = pd.DataFrame({"user_id":[3,4,5], "rating":[30, 35, 40]})>>>>>> df4 = df[['user_id', 'rating_1']].rename(columns={'rating_1':'rating'})>>> df = pd.merge(df1, df2, on='user_id', suffixes=['_1', '_2'])>>> df3 = df[['user_id', 'rating_1']].rename(columns={'rating_1':'rating'})>>> df4 = df[['user_id', 'rating_2']].rename(columns={'rating_2':'rating'})>>> pd.concat([df3, df4], axis=0) user_id rating0 3 200 3 30