Split a large dataframe into a list of data frames based on common value in column
You can just as easily access each element in the list using e.g. path[[1]]. You can't put a set of matrices into an atomic vector and access each element. A matrix is an atomic vector with dimension attributes. I would use the list structure returned by split; it's what it was designed for. Each list element can hold data of different types and sizes, so it's very versatile, and you can use *apply functions to further operate on each element in the list. Example below.
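To make the point about matrices concrete first, here is a minimal sketch. The names m1, m2, flat and mats are made up for illustration and are not part of the question's data.

```r
# Two small matrices (hypothetical names, for illustration only)
m1 <- matrix(1:4, nrow = 2)
m2 <- matrix(5:8, nrow = 2)

# A matrix is an atomic vector that carries a "dim" attribute
is.atomic(m1)
attributes(m1)

# c() drops the dim attribute, so both matrices are flattened into one vector
flat <- c(m1, m2)
length(flat)
dim(flat)

# A list keeps each matrix intact, and [[ retrieves it unchanged
mats <- list(m1, m2)
dim(mats[[2]])
```

This is why a list, rather than an atomic vector, is the natural container for the pieces produced by split.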
    # For reproducible data
    set.seed(1)

    # Make some data
    userid <- rep(1:2,times=4)
    data1 <- replicate(8 , paste( sample(letters , 3 ) , collapse = "" ) )
    data2 <- sample(10,8)
    df <- data.frame( userid , data1 , data2 )

    # Split on userid
    out <- split( df , f = df$userid )
    #$`1`
    #  userid data1 data2
    #1      1   gjn     3
    #3      1   yqp     1
    #5      1   rjs     6
    #7      1   jtw     5

    #$`2`
    #  userid data1 data2
    #2      2   xfv     4
    #4      2   bfe    10
    #6      2   mrx     2
    #8      2   fqd     9
Access each element using the [[ operator like this:
    out[[1]]
    #  userid data1 data2
    #1      1   gjn     3
    #3      1   yqp     1
    #5      1   rjs     6
    #7      1   jtw     5
Or use an *apply function to do further operations on each list element. For instance, to take the mean of the data2 column you could use sapply like this:
    sapply( out , function(x) mean( x$data2 ) )
    #   1    2 
    #3.75 6.25
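As a further sketch of working with the split list, the block below accesses an element by name, uses vapply as a stricter alternative to sapply, and stacks per-group summaries back into one data frame. The data2 values are copied from the output shown above so the result does not depend on the random seed. The names means, summaries and result are made up for illustration.

```r
# Rebuild the relevant columns by hand (data2 values taken from the printed output)
df <- data.frame(
  userid = rep(1:2, times = 4),
  data2  = c(3, 4, 1, 10, 6, 2, 5, 9)
)
out <- split(df, f = df$userid)

# List elements are named after the split values, so name-based access works too
names(out)
out[["2"]]

# vapply() behaves like sapply() but checks that each result is one numeric value
means <- vapply(out, function(x) mean(x$data2), numeric(1))
means

# Build a one-row summary per group, then stack the rows into a single data frame
summaries <- lapply(out, function(x) {
  data.frame(userid = x$userid[1], n = nrow(x), mean_data2 = mean(x$data2))
})
result <- do.call(rbind, summaries)
result
```

vapply is a reasonable choice when the result type is known in advance, since it fails loudly instead of silently returning a list.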
From version 0.8.0, dplyr offers a handy function called group_split():
    # On sample data from @Aus_10
    df %>% group_split(g)

    [[1]]
    # A tibble: 25 x 3
       ran_data1 ran_data2 g
           <dbl>     <dbl> <fct>
     1     2.04      0.627 A
     2     0.530    -0.703 A
     3    -0.475     0.541 A
     4     1.20     -0.565 A
     5    -0.380    -0.126 A
     6     1.25     -1.69  A
     7    -0.153    -1.02  A
     8     1.52     -0.520 A
     9     0.905    -0.976 A
    10     0.517    -0.535 A
    # … with 15 more rows

    [[2]]
    # A tibble: 25 x 3
       ran_data1 ran_data2 g
           <dbl>     <dbl> <fct>
     1     1.61      0.858 B
     2     1.05     -1.25  B
     3    -0.440    -0.506 B
     4    -1.17      1.81  B
     5     1.47     -1.60  B
     6    -0.682    -0.726 B
     7    -2.21      0.282 B
     8    -0.499     0.591 B
     9     0.711    -1.21  B
    10     0.705     0.960 B
    # … with 15 more rows
To not include the grouping column:
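    df %>% group_split(g, keep = FALSE)
    # In more recent dplyr releases the argument is spelled .keep:
    # df %>% group_split(g, .keep = FALSE)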
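For comparison, a base R sketch of the same idea (splitting while leaving out the grouping column) could look like the following. This is not dplyr's implementation, just an equivalent using split. The data values and the name pieces_no_g are made up for illustration.

```r
# Tiny made-up data frame with the same column names as the sample data
df <- data.frame(
  ran_data1 = c(0.5, 1.2, -0.3, 2.1),
  ran_data2 = c(1.0, -0.7, 0.4, 0.0),
  g = factor(c("A", "B", "A", "B"))
)

# Drop the grouping column from the data being split, but still split by it
pieces_no_g <- split(df[setdiff(names(df), "g")], df$g)

# Unlike the group_split() output above, the elements are named by group
names(pieces_no_g)
pieces_no_g[["A"]]
```

Having named elements can be convenient when a specific group needs to be looked up later.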
Stumbled across this answer and I actually wanted BOTH groups (data containing that one user and data containing everything but that one user). Not necessary for the specifics of this post, but I thought I would add in case someone was googling the same issue as me.
    df <- data.frame(
      ran_data1=rnorm(125),
      ran_data2=rnorm(125),
      g=rep(factor(LETTERS[1:5]), 25)
    )
    test_x = split(df,df$g)[['A']]
    test_y = split(df,df$g!='A')[['TRUE']]
Here's what it looks like:
    > head(test_x)
                x          y g
    1   1.1362198  1.2969541 A
    6   0.5510307 -0.2512449 A
    11  0.0321679  0.2358821 A
    16  0.4734277 -1.2889081 A
    21 -1.2686151  0.2524744 A
    > head(test_y)
                x          y g
    2 -2.23477293  1.1514810 B
    3 -0.46958938 -1.7434205 C
    4  0.07365603  0.1111419 D
    5 -1.08758355  0.4727281 E
    7  0.28448637 -1.5124336 B
    8  1.24117504  0.4928257 C
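If both pieces are wanted, one possible shortcut is a single split on the logical condition, which returns a list with elements named "FALSE" and "TRUE". A minimal sketch, assuming the same data layout as above (the seed and the name parts are arbitrary):

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
df <- data.frame(
  ran_data1 = rnorm(125),
  ran_data2 = rnorm(125),
  g = rep(factor(LETTERS[1:5]), 25)
)

# One split on the condition yields both groups at once
parts <- split(df, df$g == "A")

test_x <- parts[["TRUE"]]   # rows where g is "A"
test_y <- parts[["FALSE"]]  # all remaining rows

nrow(test_x)
nrow(test_y)
```

This avoids calling split twice, and the two pieces are guaranteed to partition the original rows.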