Dask: How would I parallelize my code with dask delayed? Dask: How would I parallelize my code with dask delayed? python-3.x python-3.x

Dask: How would I parallelize my code with dask delayed?


You need to call dask.compute to eventually compute the result. See dask.delayed documentation.

Sequential code

import pandas as pdfrom sklearn.metrics import mean_squared_error as msefilenames = [...]results = []for count, name in enumerate(filenames):    file1 = pd.read_csv(name)    df = pd.DataFrame(file1)  # isn't this already a dataframe?    prediction = df['Close'][:-1]    observed = df['Close'][1:]    mean_squared_error = mse(observed, prediction)      results.append(mean_squared_error)

Parallel code

import daskimport pandas as pdfrom sklearn.metrics import mean_squared_error as msefilenames = [...]delayed_results = []for count, name in enumerate(filenames):    df = dask.delayed(pd.read_csv)(name)    prediction = df['Close'][:-1]    observed = df['Close'][1:]    mean_squared_error = dask.delayed(mse)(observed, prediction)    delayed_results.append(mean_squared_error)results = dask.compute(*delayed_results)


A much clearer solution, IMO, than the accepted answer is this snippet.

from dask import compute, delayedimport pandas as pdfrom sklearn.metrics import mean_squared_error as msefilenames = [...]def compute_mse(file_name):    df = pd.read_csv(file_name)    prediction = df['Close'][:-1]    observed = df['Close'][1:]    return mse(observed, prediction)delayed_results = [delayed(compute_mse)(file_name) for file_name in filenames]mean_squared_errors = compute(*delayed_results, scheduler="processes")