
Calculating pairwise Euclidean distance between all the rows of a dataframe


To compute the Euclidean distance between two rows i and j of a dataframe df:

import numpy as np

np.linalg.norm(df.loc[i] - df.loc[j])

To compute it between consecutive rows, i.e. 0 and 1, 1 and 2, 2 and 3, ...

np.linalg.norm(df.diff(axis=0).drop(0), axis=1)
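For instance, on a tiny frame with made-up numbers (illustrative only), both snippets can be checked against distances you can compute by hand:

```python
import numpy as np
import pandas as pd

# Three 2-D points chosen so the distances are obvious
toy = pd.DataFrame({'x': [0.0, 3.0, 3.0], 'y': [0.0, 4.0, 0.0]})

# Distance between rows 0 and 1: sqrt(3**2 + 4**2) = 5
d01 = np.linalg.norm(toy.loc[0] - toy.loc[1])

# Distances between consecutive rows: (0,1) and (1,2)
# diff() leaves a NaN in the first row, which drop(0) removes
consec = np.linalg.norm(toy.diff(axis=0).drop(0), axis=1)

print(d01)     # 5.0
print(consec)  # [5. 4.]
```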

If you want to compute it between all the rows, i.e. 0 and 1, 0 and 2, ..., 1 and 2, 1 and 3, ..., then you have to loop through all combinations of i and j (keep in mind that for 44062 rows there are 970707891 such combinations, so using a for-loop will be very slow):

import itertools

for i, j in itertools.combinations(df.index, 2):
    d_ij = np.linalg.norm(df.loc[i] - df.loc[j])

Edit:

Instead, you can use scipy.spatial.distance.cdist, which computes the distance between each pair of two collections of inputs:

from scipy.spatial.distance import cdist

cdist(df, df, 'euclidean')

This will return a symmetric (44062 by 44062) matrix of Euclidean distances between all the rows of your dataframe. The problem is that you need a lot of memory for it to work (at least 8*44062**2 bytes, i.e. ~16GB). So a better option is to use pdist:
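As a quick sanity check of that memory estimate (each entry of the full matrix is a float64, i.e. 8 bytes):

```python
n = 44062
# 8 bytes per float64 entry in a full n x n distance matrix
bytes_needed = 8 * n ** 2
print(bytes_needed)  # 15531678752 bytes, roughly 15.5 GB
```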

from scipy.spatial.distance import pdist

pdist(df.values, 'euclidean')

which will return an array (of size 970707891) of all the pairwise Euclidean distances between the rows of df.
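If you later need the full square matrix after all, scipy.spatial.distance.squareform converts the condensed pdist vector to and from square form. A small sketch with made-up points:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Four 2-D points; 4 rows give 4*3/2 = 6 pairwise distances
X = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0], [1.0, 0.0]])

d = pdist(X, 'euclidean')  # condensed vector of length 6
D = squareform(d)          # full symmetric 4 x 4 matrix

print(d.shape)   # (6,)
print(D[0, 1])   # 5.0, distance between rows 0 and 1
```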

P.S. Don't forget to exclude the 'Actual_Data' column when computing distances. E.g. you can do the following: data = df.drop('Actual_Data', axis=1).values and then cdist(data, data, 'euclidean') or pdist(data, 'euclidean'). You can also create another dataframe with the distances like this:

data = df.drop('Actual_Data', axis=1).values
d = pd.DataFrame(itertools.combinations(df.index, 2), columns=['i','j'])
d['dist'] = pdist(data, 'euclidean')

   i  j  dist
0  0  1  ...
1  0  2  ...
2  0  3  ...
3  0  4  ...
...
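On a tiny made-up frame you can verify that pdist's output order really does match itertools.combinations over the index, which is what makes this pairing valid:

```python
import itertools
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

# One feature column so the distances are just absolute differences
df_small = pd.DataFrame({'a': [0.0, 1.0, 4.0]})

pairs = pd.DataFrame(itertools.combinations(df_small.index, 2), columns=['i', 'j'])
pairs['dist'] = pdist(df_small.values, 'euclidean')

print(pairs)
#    i  j  dist
# 0  0  1   1.0
# 1  0  2   4.0
# 2  1  2   3.0
```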


Working with a subset of your data, e.g.:

df_data = [[888888, 3, 0, 0], [677767, 0, 2, 1], [212341212, 0, 0, 0], [141414141414, 0, 0, 0], [1112224, 0, 0, 0]]
# Creating the data
df = pd.DataFrame(data=df_data, columns=['Actual_Data', '8,8', '6,6', '7,7'], dtype=np.float64)
# Which looks like
#     Actual_Data  8,8  6,6  7,7
# 0  8.888880e+05  3.0  0.0  0.0
# 1  6.777670e+05  0.0  2.0  1.0
# 2  2.123412e+08  0.0  0.0  0.0
# 3  1.414141e+11  0.0  0.0  0.0
# 4  1.112224e+06  0.0  0.0  0.0
# Computing the distance matrix
dist_matrix = df.apply(lambda row: [np.linalg.norm(row.values - df.loc[[_id], :].values, 2) for _id in df.index.values], axis=1)
# Which looks like
# 0     [0.0, 211121.00003315636, 211452324.0, 141413252526.0, 223336.000020149]
# 1    [211121.00003315636, 0.0, 211663445.0, 141413463647.0, 434457.0000057543]
# 2                 [211452324.0, 211663445.0, 0.0, 141201800202.0, 211228988.0]
# 3        [141413252526.0, 141413463647.0, 141201800202.0, 0.0, 141413029190.0]
# 4      [223336.000020149, 434457.0000057543, 211228988.0, 141413029190.0, 0.0]
# Reformatting the above into readable format
dist_matrix = pd.DataFrame(
  data=dist_matrix.values.tolist(),
  columns=df.index.tolist(),
  index=df.index.tolist())
# Which gives you
#               0             1             2             3             4
# 0  0.000000e+00  2.111210e+05  2.114523e+08  1.414133e+11  2.233360e+05
# 1  2.111210e+05  0.000000e+00  2.116634e+08  1.414135e+11  4.344570e+05
# 2  2.114523e+08  2.116634e+08  0.000000e+00  1.412018e+11  2.112290e+08
# 3  1.414133e+11  1.414135e+11  1.412018e+11  0.000000e+00  1.414130e+11
# 4  2.233360e+05  4.344570e+05  2.112290e+08  1.414130e+11  0.000000e+00

Update

As pointed out in the comments, the issue is memory overflow, so we have to process the problem in batches.

# Collecting the data
# df = ....
batch = 200  # number of batches the rows are split into; increase it if you still get memory errors
# To be conservative, let's write the intermediate results to files.
dffname = []
for ifile, _slice in enumerate(np.array_split(range(df.shape[0]), batch)):
  # Compute distances for one batch of rows against all rows
  tmp_df = df.iloc[_slice, :].apply(lambda row: [np.linalg.norm(row.values - df.loc[[_id], :].values, 2) for _id in df.index.values], axis=1)
  tmp_df = pd.DataFrame(tmp_df.values.tolist(), index=df.index.values[_slice], columns=df.index.values)
  # You can change it from csv to any other file format
  tmp_df.to_csv(f"{ifile+1}.csv")
  dffname.append(f"{ifile+1}.csv")
# Reading back the DataFrames
dflist = []
for f in dffname:
  dflist.append(pd.read_csv(f, dtype=np.float64, index_col=0))
res = pd.concat(dflist)
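The same batching idea can be sketched with cdist instead of the row-wise apply, which is typically much faster; here the data, batch count, and in-memory accumulation are illustrative (for very large frames you would still write each block to disk as above):

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Made-up data standing in for the real frame
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.random((1000, 3)))

n_batches = 10  # increase if memory is still an issue
pieces = []
for _slice in np.array_split(np.arange(df_demo.shape[0]), n_batches):
    # One batch of rows against all rows in a single vectorized call
    block = cdist(df_demo.iloc[_slice].values, df_demo.values, 'euclidean')
    pieces.append(pd.DataFrame(block, index=_slice, columns=df_demo.index))

res = pd.concat(pieces)
print(res.shape)  # (1000, 1000)
```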