How to randomly split a DataFrame into several smaller DataFrames?


Use np.array_split

    shuffled = df.sample(frac=1)
    result = np.array_split(shuffled, 5)

df.sample(frac=1) shuffles the rows of df. np.array_split then splits the shuffled DataFrame into 5 parts of nearly equal size (the parts differ by at most one row when the row count is not divisible by 5).

It gives you:

    for part in result:
        print(part, '\n')

        movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
    5          6  5  0  0  0  0  0  0  5   0   0   0     10
    4          5  3  0  0  0  0  0  0  0   0   0   0      3
    7          8  1  0  0  0  4  5  0  0   0   4   0     14
    16        17  3  0  0  4  0  0  0  0   0   0   0      7
    22        23  4  0  0  0  4  3  0  0   5   0   0     16

        movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
    13        14  5  4  0  0  5  0  0  0   0   0   0     14
    14        15  5  0  0  0  3  0  0  0   0   5   5     18
    21        22  4  0  0  0  3  5  5  0   5   4   0     26
    1          2  3  0  0  3  0  0  0  0   0   0   0      6
    20        21  1  0  0  3  3  0  0  0   0   0   0      7

        movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
    10        11  2  0  4  0  0  3  3  0   4   2   0     18
    9         10  3  2  0  0  0  4  0  0   0   0   0      9
    11        12  5  0  0  0  4  5  0  0   5   2   0     21
    8          9  5  0  0  0  4  5  0  0   4   5   0     23
    12        13  5  4  0  0  2  0  0  0   3   0   0     14

        movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
    18        19  5  3  0  0  4  0  0  0   0   0   0     12
    3          4  3  0  0  0  0  5  0  0   4   0   5     17
    0          1  5  4  0  4  4  0  0  0   4   0   0     21
    23        24  3  0  0  4  0  0  0  0   0   3   0     10
    6          7  4  0  0  0  2  5  3  4   4   0   0     22

        movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
    17        18  4  0  0  0  0  0  0  0   0   0   0      4
    2          3  4  0  0  0  0  0  0  0   0   0   0      4
    15        16  5  0  0  0  0  0  0  0   4   0   0      9
    19        20  4  0  0  0  0  0  0  0   0   0   0      4
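If you want the split to be repeatable, or would rather not hand the DataFrame itself to NumPy, the same shuffle-then-split idea can be sketched as below. This is only a minimal sketch, not part of the answer above: the helper name `random_split`, its `seed` argument, and the toy DataFrame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def random_split(df, n_parts, seed=None):
    """Shuffle the rows of df, then cut them into n_parts nearly equal pieces."""
    # random_state makes the shuffle repeatable when a seed is given
    shuffled = df.sample(frac=1, random_state=seed)
    # split the row positions rather than the DataFrame itself
    chunks = np.array_split(np.arange(len(shuffled)), n_parts)
    return [shuffled.iloc[chunk] for chunk in chunks]


# toy data standing in for the real table (hypothetical values)
df = pd.DataFrame({"movie_id": np.arange(1, 25), "borda": np.arange(24) % 7})

parts = random_split(df, 5, seed=0)
print([len(part) for part in parts])  # 24 rows in 5 parts: [5, 5, 5, 5, 4]
```

Each part keeps the original row labels, so you can still trace a row back to its position in df.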


A simple demo:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"movie_id": np.arange(1, 25),
                       "borda": np.random.randint(1, 25, size=(24,))})
    n_split = 5

    # the indices used to select parts from the dataframe
    ixs = np.arange(df.shape[0])
    np.random.shuffle(ixs)

    # np.split cannot work when there is no equal division,
    # so we need to find the split points ourselves;
    # we need (n_split - 1) split points
    split_points = [i * df.shape[0] // n_split for i in range(1, n_split)]

    # use these indices to select the parts we want
    for ix in np.split(ixs, split_points):
        print(df.iloc[ix])

The result:

        borda  movie_id
    8       3         9
    10      2        11
    22     14        23
    7      14         8
        borda  movie_id
    0      16         1
    20      4        21
    17     15        18
    15      1        16
    6       6         7
        borda  movie_id
    9       9        10
    19      4        20
    5       1         6
    16     23        17
    21     20        22
        borda  movie_id
    11     24        12
    23      5        24
    1      22         2
    12      7        13
    18     15        19
        borda  movie_id
    3      11         4
    14     10        15
    2       6         3
    4       7         5
    13     21        14
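To see why the first part above has 4 rows and the others 5, here is a small sketch that checks the split-point arithmetic on its own. The helper name `split_points` and the seeded generator are assumptions added for illustration; the formula is the one used in the demo.

```python
import numpy as np


def split_points(n_rows, n_split):
    """Cut positions that divide n_rows into n_split nearly equal parts."""
    return [i * n_rows // n_split for i in range(1, n_split)]


points = split_points(24, 5)
print(points)  # [4, 9, 14, 19]

# shuffled row positions; the seed is only here to make the run repeatable
rng = np.random.default_rng(0)
ixs = rng.permutation(24)

parts = np.split(ixs, points)
print([len(p) for p in parts])  # [4, 5, 5, 5, 5]

# every row position lands in exactly one part
assert sorted(np.concatenate(parts).tolist()) == list(range(24))
```

Because the cut positions use integer division, the part sizes never differ by more than one row.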


IIUC, you can do this:

    frames = {}
    for e, i in enumerate(np.split(df, 6)):
        frames['df_' + str(e + 1)] = pd.DataFrame(np.random.permutation(i), columns=df.columns)
    print(frames['df_1'])

       movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
    0         4  3  0  0  0  0  5  0  0   4   0   5     17
    1         3  4  0  0  0  0  0  0  0   0   0   0      4
    2         2  3  0  0  3  0  0  0  0   0   0   0      6
    3         1  5  4  0  4  4  0  0  0   4   0   0     21

Explanation: np.split(df, 6) splits df into 6 parts of equal size (it raises an error if the rows cannot be divided equally). np.random.permutation(i) shuffles the rows of each part, and pd.DataFrame(..., columns=df.columns) turns the shuffled rows back into a DataFrame, which is stored in a dictionary named frames.

Finally, look up each DataFrame by its key: print(frames['df_1']), print(frames['df_2']), and so on. Each value is a random permutation of one split of the original DataFrame.
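Note that the approach above shuffles rows only within each consecutive block, so df_1 always holds the first rows of df. If each named frame should instead hold a random subset of the whole DataFrame, one possible variant is sketched below. It is a sketch under assumptions, not the answerer's code: the toy data and the seeded generator are illustrative only.

```python
import numpy as np
import pandas as pd

# toy data (hypothetical values)
df = pd.DataFrame({"movie_id": np.arange(1, 25), "borda": np.arange(24) % 7})

# shuffle all row positions once, then cut them into 6 groups
rng = np.random.default_rng(0)
groups = np.array_split(rng.permutation(len(df)), 6)

# same 'df_1', 'df_2', ... keys as above; reset_index gives each frame a fresh index
frames = {
    "df_" + str(n): df.iloc[group].reset_index(drop=True)
    for n, group in enumerate(groups, start=1)
}

print(list(frames))          # ['df_1', 'df_2', 'df_3', 'df_4', 'df_5', 'df_6']
print(frames["df_1"].shape)  # (4, 2)
```

Using np.array_split on the positions also means the row count does not have to be divisible by the number of frames.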