Dask: How would I parallelize my code with dask delayed?
You need to call dask.compute to eventually compute the result. See dask.delayed documentation.
Sequential code
import pandas as pdfrom sklearn.metrics import mean_squared_error as msefilenames = [...]results = []for count, name in enumerate(filenames): file1 = pd.read_csv(name) df = pd.DataFrame(file1) # isn't this already a dataframe? prediction = df['Close'][:-1] observed = df['Close'][1:] mean_squared_error = mse(observed, prediction) results.append(mean_squared_error)
Parallel code
import daskimport pandas as pdfrom sklearn.metrics import mean_squared_error as msefilenames = [...]delayed_results = []for count, name in enumerate(filenames): df = dask.delayed(pd.read_csv)(name) prediction = df['Close'][:-1] observed = df['Close'][1:] mean_squared_error = dask.delayed(mse)(observed, prediction) delayed_results.append(mean_squared_error)results = dask.compute(*delayed_results)
A much clearer solution, IMO, than the accepted answer is this snippet.
from dask import compute, delayedimport pandas as pdfrom sklearn.metrics import mean_squared_error as msefilenames = [...]def compute_mse(file_name): df = pd.read_csv(file_name) prediction = df['Close'][:-1] observed = df['Close'][1:] return mse(observed, prediction)delayed_results = [delayed(compute_mse)(file_name) for file_name in filenames]mean_squared_errors = compute(*delayed_results, scheduler="processes")