How to randomly split a DataFrame into several smaller DataFrames?
Use np.array_split
    shuffled = df.sample(frac=1)
    result = np.array_split(shuffled, 5)
df.sample(frac=1) shuffles the rows of df. np.array_split then splits the shuffled frame into parts of equal size; when the row count is not evenly divisible, the part sizes differ by at most one row.
It gives you:
    for part in result:
        print(part, '\n')
        movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
     5         6  5  0  0  0  0  0  0  5   0   0   0     10
     4         5  3  0  0  0  0  0  0  0   0   0   0      3
     7         8  1  0  0  0  4  5  0  0   0   4   0     14
    16        17  3  0  0  4  0  0  0  0   0   0   0      7
    22        23  4  0  0  0  4  3  0  0   5   0   0     16

        movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
    13        14  5  4  0  0  5  0  0  0   0   0   0     14
    14        15  5  0  0  0  3  0  0  0   0   5   5     18
    21        22  4  0  0  0  3  5  5  0   5   4   0     26
     1         2  3  0  0  3  0  0  0  0   0   0   0      6
    20        21  1  0  0  3  3  0  0  0   0   0   0      7

        movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
    10        11  2  0  4  0  0  3  3  0   4   2   0     18
     9        10  3  2  0  0  0  4  0  0   0   0   0      9
    11        12  5  0  0  0  4  5  0  0   5   2   0     21
     8         9  5  0  0  0  4  5  0  0   4   5   0     23
    12        13  5  4  0  0  2  0  0  0   3   0   0     14

        movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
    18        19  5  3  0  0  4  0  0  0   0   0   0     12
     3         4  3  0  0  0  0  5  0  0   4   0   5     17
     0         1  5  4  0  4  4  0  0  0   4   0   0     21
    23        24  3  0  0  4  0  0  0  0   0   3   0     10
     6         7  4  0  0  0  2  5  3  4   4   0   0     22

        movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
    17        18  4  0  0  0  0  0  0  0   0   0   0      4
     2         3  4  0  0  0  0  0  0  0   0   0   0      4
    15        16  5  0  0  0  0  0  0  0   4   0   0      9
    19        20  4  0  0  0  0  0  0  0   0   0   0      4
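For anyone who wants to run the shuffle-then-split approach end to end, here is a minimal sketch. The DataFrame below is a hypothetical stand-in for the question's data, and `random_state=0` is only there to make the run repeatable. Splitting the row positions rather than the DataFrame itself is my own choice for this sketch, not part of the answer above; as far as I know, passing a DataFrame straight to np.array_split can emit deprecation warnings on newer pandas releases.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 24 rows, two columns.
df = pd.DataFrame({"movie_id": np.arange(1, 25), "borda": np.arange(24) % 7})

n_parts = 5

# Shuffle all rows; random_state is fixed only to make the sketch repeatable.
shuffled = df.sample(frac=1, random_state=0)

# Split the row positions into nearly equal chunks, then select rows with iloc.
position_chunks = np.array_split(np.arange(len(shuffled)), n_parts)
parts = [shuffled.iloc[chunk] for chunk in position_chunks]

for part in parts:
    print(part, "\n")
```

Each element of `parts` is an ordinary DataFrame that keeps the original index labels, so the rows can still be traced back to df.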
A simple demo:
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"movie_id": np.arange(1, 25),
                       "borda": np.random.randint(1, 25, size=(24,))})
    n_split = 5

    # the indices used to select parts from the dataframe
    ixs = np.arange(df.shape[0])
    np.random.shuffle(ixs)

    # np.split with a section count fails when there is no equal division,
    # so we need to find the split points ourselves;
    # we need (n_split - 1) split points
    split_points = [i * df.shape[0] // n_split for i in range(1, n_split)]

    # use these indices to select the parts we want
    for ix in np.split(ixs, split_points):
        print(df.iloc[ix])
The result:
        borda  movie_id
     8      3         9
    10      2        11
    22     14        23
     7     14         8

        borda  movie_id
     0     16         1
    20      4        21
    17     15        18
    15      1        16
     6      6         7

        borda  movie_id
     9      9        10
    19      4        20
     5      1         6
    16     23        17
    21     20        22

        borda  movie_id
    11     24        12
    23      5        24
     1     22         2
    12      7        13
    18     15        19

        borda  movie_id
     3     11         4
    14     10        15
     2      6         3
     4      7         5
    13     21        14
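To make the split-point arithmetic concrete, here is a small worked sketch for 24 rows and 5 parts, the same numbers as in the demo. The variable names `n_rows`, `sizes_manual` and `sizes_array_split` are mine, introduced only for illustration. It also compares the resulting part sizes with what np.array_split would produce.

```python
import numpy as np

n_rows, n_split = 24, 5

# Integer split points: i * n_rows // n_split for i = 1 .. n_split - 1
split_points = [i * n_rows // n_split for i in range(1, n_split)]
print(split_points)        # [4, 9, 14, 19]

ixs = np.arange(n_rows)

# Part sizes when splitting at the hand-computed points
sizes_manual = [len(chunk) for chunk in np.split(ixs, split_points)]
print(sizes_manual)        # [4, 5, 5, 5, 5]

# Part sizes when np.array_split picks the boundaries itself
sizes_array_split = [len(chunk) for chunk in np.array_split(ixs, n_split)]
print(sizes_array_split)   # [5, 5, 5, 5, 4]
```

Both ways cover all 24 rows; they only differ in which part ends up one row short, which is why the first printed part in the result above has 4 rows.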
IIUC, you can do this:
    frames = {}
    for e, i in enumerate(np.split(df, 6)):
        frames.update([('df_' + str(e + 1), pd.DataFrame(np.random.permutation(i), columns=df.columns))])
    print(frames['df_1'])

       movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
    0         4  3  0  0  0  0  5  0  0   4   0   5     17
    1         3  4  0  0  0  0  0  0  0   0   0   0      4
    2         2  3  0  0  3  0  0  0  0   0   0   0      6
    3         1  5  4  0  4  4  0  0  0   4   0   0     21
Explanation: np.split(df, 6) splits df into 6 parts of equal size (the number of rows must be divisible by 6). pd.DataFrame(np.random.permutation(i), columns=df.columns) randomly reorders the rows of each part and builds a new DataFrame from them, which is then stored in a dictionary named frames.
Finally, look up each DataFrame by its key: print(frames['df_1']), print(frames['df_2']), etc. Each one is a random permutation of the rows of one split of the original DataFrame.
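A variant sketch of the same dictionary idea, in case it helps. This is not the code from the answer above: it swaps np.random.permutation for DataFrame.sample so each part stays a DataFrame the whole way through, and it computes the slice boundaries with np.linspace. The stand-in data, the names `n_parts` and `bounds`, and the per-part `random_state` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 24 rows, two columns.
df = pd.DataFrame({"movie_id": np.arange(1, 25), "borda": np.arange(24) % 7})

n_parts = 6
frames = {}

# Slice boundaries for contiguous parts: [0, 4, 8, 12, 16, 20, 24]
bounds = np.linspace(0, len(df), n_parts + 1, dtype=int)

for k, (start, stop) in enumerate(zip(bounds[:-1], bounds[1:]), start=1):
    part = df.iloc[start:stop]
    # Shuffle the rows inside this part; random_state only makes runs repeatable.
    frames[f"df_{k}"] = part.sample(frac=1, random_state=k).reset_index(drop=True)

print(frames["df_1"])
```

As in the answer, each part holds a contiguous block of the original rows in shuffled order, so frames['df_1'] always contains movie_id 1 to 4. Shuffle df first if the parts themselves should be random.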