Merge a large Dask dataframe with a small Pandas dataframe
You can iterate over the values common to both dataframes and assign the extra columns in a loop:

```python
union_set = set(small_df['common_column']) & set(large_df['common_column'])
for el in union_set:
    for column in small_df.columns:
        if column not in large_df.columns:
            large_df.loc[large_df['common_column'] == el, column] = \
                small_df.loc[small_df['common_column'] == el, column]
```
When working with big data, partitioning the data is very important, and at the same time having enough cluster capacity and memory is mandatory.

You can try using Spark.
Dask is a pure-Python framework that does more of the same: it lets you run largely the same pandas or NumPy code either locally or on a cluster. Apache Spark, by contrast, brings a learning curve involving a new API and execution model, although it offers a Python wrapper (PySpark).
You can try partitioning the data and storing it in Parquet files.