Split a large dataframe into a list of data frames based on common value in column

r performance matrix split dataframe

You can just as easily access each element of the list using, e.g., path[[1]]. You can't put a set of matrices into an atomic vector and access each one: a matrix is itself an atomic vector with a dim attribute, so combining several with c() would flatten them into one long vector. I would use the list structure returned by split; it's what it was designed for. Each list element can hold data of a different type and size, so a list is very versatile, and you can use the *apply functions to operate further on each element. Example below.

    #  For reproducible data
    set.seed(1)

    #  Make some data
    userid <- rep(1:2,times=4)
    data1 <- replicate(8 , paste( sample(letters , 3 ) , collapse = "" ) )
    data2 <- sample(10,8)
    df <- data.frame( userid , data1 , data2 )

    #  Split on userid
    out <- split( df , f = df$userid )
    #$`1`
    #  userid data1 data2
    #1      1   gjn     3
    #3      1   yqp     1
    #5      1   rjs     6
    #7      1   jtw     5

    #$`2`
    #  userid data1 data2
    #2      2   xfv     4
    #4      2   bfe    10
    #6      2   mrx     2
    #8      2   fqd     9

Access each element using the [[ operator like this:

    out[[1]]
    #  userid data1 data2
    #1      1   gjn     3
    #3      1   yqp     1
    #5      1   rjs     6
    #7      1   jtw     5

Or use an *apply function to do further operations on each list element. For instance, to take the mean of the data2 column you could use sapply like this:

    sapply( out , function(x) mean( x$data2 ) )
    #   1    2 
    #3.75 6.25
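When each piece should stay a data frame rather than collapse to a single number, lapply() is the usual companion to split(). The following is only a sketch built on the toy data above; the column name data2_centred is made up for illustration and is not part of the original answer.

```r
# Same toy data as in the example above
set.seed(1)
userid <- rep(1:2, times = 4)
data1 <- replicate(8, paste(sample(letters, 3), collapse = ""))
data2 <- sample(10, 8)
df <- data.frame(userid, data1, data2)
out <- split(df, f = df$userid)

# lapply() returns a list again: add a per-user centred column to each piece
# (data2_centred is a hypothetical name chosen for this sketch)
centred <- lapply(out, function(x) {
  x$data2_centred <- x$data2 - mean(x$data2)
  x
})

# Stack the pieces into one data frame, grouped by user
combined <- do.call(rbind, centred)

# Or put the rows back in their original order with unsplit()
restored <- unsplit(centred, df$userid)
```

do.call(rbind, ...) leaves the rows grouped by userid, while unsplit() needs the same factor that was used for the split and returns the rows in their original order.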


From version 0.8.0, dplyr offers a handy function called group_split():

    # On sample data from @Aus_10
    df %>%
      group_split(g)

    [[1]]
    # A tibble: 25 x 3
       ran_data1 ran_data2 g    
           <dbl>     <dbl> <fct>
     1     2.04      0.627 A    
     2     0.530    -0.703 A    
     3    -0.475     0.541 A    
     4     1.20     -0.565 A    
     5    -0.380    -0.126 A    
     6     1.25     -1.69  A    
     7    -0.153    -1.02  A    
     8     1.52     -0.520 A    
     9     0.905    -0.976 A    
    10     0.517    -0.535 A    
    # … with 15 more rows

    [[2]]
    # A tibble: 25 x 3
       ran_data1 ran_data2 g    
           <dbl>     <dbl> <fct>
     1     1.61      0.858 B    
     2     1.05     -1.25  B    
     3    -0.440    -0.506 B    
     4    -1.17      1.81  B    
     5     1.47     -1.60  B    
     6    -0.682    -0.726 B    
     7    -2.21      0.282 B    
     8    -0.499     0.591 B    
     9     0.711    -1.21  B    
    10     0.705     0.960 B    
    # … with 15 more rows

To drop the grouping column from the result (more recent dplyr releases spell this argument .keep):

    df %>% group_split(g, keep = FALSE)
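For comparison, here is a base R sketch of the same idea for when dplyr is not available. It assumes sample data in the same shape as the dplyr example (a factor column g with five levels). Unlike group_split(), split() returns a named list, so the pieces can be looked up by group label.

```r
# Sample data in the shape used by the dplyr example (values are random)
df <- data.frame(
  ran_data1 = rnorm(125),
  ran_data2 = rnorm(125),
  g = rep(factor(LETTERS[1:5]), 25)
)

# Drop the grouping column from the data, but still split by it
pieces <- split(df[setdiff(names(df), "g")], df$g)

names(pieces)         # one element per level of g
names(pieces[["A"]])  # only the two data columns remain
```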


Stumbled across this answer and I actually wanted BOTH groups (data containing that one user and data containing everything but that one user). Not necessary for the specifics of this post, but I thought I would add in case someone was googling the same issue as me.

    df <- data.frame(
         ran_data1=rnorm(125),
         ran_data2=rnorm(125),
         g=rep(factor(LETTERS[1:5]), 25)
    )
    test_x = split(df,df$g)[['A']]
    test_y = split(df,df$g!='A')[['TRUE']]

Here's what it looks like:

    > head(test_x)
                x          y g
    1   1.1362198  1.2969541 A
    6   0.5510307 -0.2512449 A
    11  0.0321679  0.2358821 A
    16  0.4734277 -1.2889081 A
    21 -1.2686151  0.2524744 A
    > head(test_y)
                x          y g
    2 -2.23477293  1.1514810 B
    3 -0.46958938 -1.7434205 C
    4  0.07365603  0.1111419 D
    5 -1.08758355  0.4727281 E
    7  0.28448637 -1.5124336 B
    8  1.24117504  0.4928257 C
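Both pieces can also come from a single split() on a logical vector, or from plain logical indexing. This is a sketch rather than a recommendation; the names parts, test_x2 and test_y2 are introduced here only for illustration.

```r
df <- data.frame(
  ran_data1 = rnorm(125),
  ran_data2 = rnorm(125),
  g = rep(factor(LETTERS[1:5]), 25)
)

# One split on a logical vector yields both pieces at once
parts <- split(df, df$g == "A")
test_x <- parts[["TRUE"]]   # rows where g is "A"
test_y <- parts[["FALSE"]]  # every other row

# The same two subsets via logical indexing, without building a list
test_x2 <- df[df$g == "A", ]
test_y2 <- df[df$g != "A", ]
```

The single split avoids evaluating the grouping twice. Logical indexing is arguably the more direct choice when only one of the two subsets is needed.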
