
Efficiently create dataframe from strings containing key-value pairs


Here you go:

Recreate the data:

    x <- c(
      "HGVSc=ENST00000495576.1:n.820-1G>A;INTRON=1//1;CANONICAL=YES",
      "DISTANCE=2179",
      "HGVSc=ENST00000466430.1:n.911C>T;EXON=4//4;CANONICAL=YES",
      "DISTANCE=27;CANONICAL=YES;common"
    )

Create a named vector with your desired names. This is used for fast lookup later:

    names <- setNames(1:15, c('ENSP', 'HGVS', 'DOMAINS', 'EXON', 'INTRON',
                              'HGVSp', 'HGVSc', 'CANONICAL', 'GMAF', 'DISTANCE',
                              'HGNC', 'CCDS', 'SIFT', 'PolyPhen', 'common'))
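To see what the named vector buys you, here is a small sketch of the lookup it enables: indexing by key returns the column position directly, with no searching. The names `keys` and `pos` are illustrative only and are not part of the answer.

```r
# Column names, in the same order as in the answer.
keys <- c('ENSP', 'HGVS', 'DOMAINS', 'EXON', 'INTRON', 'HGVSp', 'HGVSc',
          'CANONICAL', 'GMAF', 'DISTANCE', 'HGNC', 'CCDS', 'SIFT',
          'PolyPhen', 'common')

# Named integer vector: each name maps to its column position.
pos <- setNames(seq_along(keys), keys)

# Indexing by name yields the positions 5, 8 and 15.
found <- pos[c("INTRON", "CANONICAL", "common")]
print(found)

# A key that is not in the vector yields NA rather than an error.
missing_key <- pos["NOT_A_KEY"]
print(is.na(missing_key))
```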

Create a helper function that assigns each variable to the correct position in a matrix. Then use lapply and strsplit:

    assign <- function(x, names){
      xx <- sapply(x, function(i) if(length(i)==2L) i else c(i, "YES"))
      z <- rep(NA, length(names))
      z[names[xx[1, ]]] <- xx[2, ]
      z
    }

    sx <- lapply(strsplit(x, ";"), strsplit, "=")
    ret <- t(sapply(sx, assign, names))
    colnames(ret) <- names(names)
    ret
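To make the helper's input easier to picture, this sketch runs the same two-level split on a single record and then pads the bare flag. The names `rec`, `pairs` and `padded` are illustrative, not part of the answer.

```r
rec <- "DISTANCE=27;CANONICAL=YES;common"

# First split on ";" (one vector of fields), then split each field on "=".
pairs <- lapply(strsplit(rec, ";"), strsplit, "=")[[1]]
# pairs is a list of three character vectors:
#   c("DISTANCE", "27"), c("CANONICAL", "YES"), "common"

# Pad a bare flag with "YES" so every entry has exactly two elements;
# sapply then simplifies the result to a 2 x 3 character matrix.
padded <- sapply(pairs, function(p) if (length(p) == 2L) p else c(p, "YES"))

print(padded[1, ])  # keys
print(padded[2, ])  # values
```

Row 1 of `padded` holds the keys and row 2 the values, which is the layout the helper indexes with `xx[1, ]` and `xx[2, ]`.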

The results:

         ENSP HGVS DOMAINS EXON   INTRON HGVSp HGVSc                          CANONICAL GMAF DISTANCE HGNC
    [1,] NA   NA   NA      NA     "1//1" NA    "ENST00000495576.1:n.820-1G>A" "YES"     NA   NA       NA
    [2,] NA   NA   NA      NA     NA     NA    NA                             NA        NA   "2179"   NA
    [3,] NA   NA   NA      "4//4" NA     NA    "ENST00000466430.1:n.911C>T"   "YES"     NA   NA       NA
    [4,] NA   NA   NA      NA     NA     NA    NA                             "YES"     NA   "27"     NA
         CCDS SIFT PolyPhen common
    [1,] NA   NA   NA       NA
    [2,] NA   NA   NA       NA
    [3,] NA   NA   NA       NA
    [4,] NA   NA   NA       "YES"
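The result above is a character matrix, so one more step is needed if you want an actual data frame. A minimal sketch of that conversion, using a small stand-in matrix `m` (hypothetical, standing in for `ret`) and assuming you want DISTANCE as an integer column:

```r
# Small stand-in for the character matrix produced above.
m <- matrix(c(NA, "2179", "27",
              "YES", NA, "YES"),
            ncol = 2,
            dimnames = list(NULL, c("DISTANCE", "CANONICAL")))

# Keep strings as character columns rather than factors.
df <- as.data.frame(m, stringsAsFactors = FALSE)

# Convert columns that are really numeric; NA stays NA.
df$DISTANCE <- as.integer(df$DISTANCE)

str(df)
```

Which columns to convert, and to what type, depends on your data; the answer itself stops at the matrix.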


Here's another, faster, solution taking advantage of the original pairings...

    ##                   test elapsed replications relative average
    ## 2    thell_solution(x)    0.37         1000    1.000 0.00037
    ## 3   andrie_solution(x)    1.04         1000    2.811 0.00104
    ## 1 original_solution(x)    2.61         1000    7.054 0.00261

After splitting, the tokens alternate: each key is followed by its value, except for a trailing boolean flag such as common, which has no value (not that I understand why that one flag is treated differently in the original string vector). We can take advantage of that ordering, and of the fact that indexing one past the end of a vector returns NA (i.e. x[5] is NA when x has four elements), so a key without a value picks up NA, which we then replace with "YES". There is also no need to call names multiple times. And since strsplit uses regex, we can split on both delimiters at once with alternation.
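The two tricks described above can be seen in isolation on a single record. This is only an illustrative sketch; `rec`, `tokens`, `key_idx`, `keys` and `values` are names chosen here, not taken from the answer.

```r
rec <- "DISTANCE=27;CANONICAL=YES;common"

# One regex with alternation splits on both ";" and "=" in a single pass.
tokens <- strsplit(rec, "(;|=)")[[1]]
# tokens: "DISTANCE" "27" "CANONICAL" "YES" "common"

# Keys sit at the odd positions: 1, 3, 5.
key_idx <- seq.int(1, length(tokens), by = 2)
keys <- tokens[key_idx]

# Values sit one position later. Position 6 is past the end of the
# vector, so R returns NA for the value of the trailing flag.
values <- tokens[key_idx + 1]

# Treat a missing value as a boolean flag that is set.
values[is.na(values)] <- "YES"

print(setNames(values, keys))
```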

    # Let `x` be as @Andrie made it in his answer. Let `names` be as you had
    # in the original question.

    # A pre-built dummy record and empty list.
    na.record <- setNames(rep(NA, times = length(names)), names)
    y <- list()

    do.call(rbind, lapply(strsplit(x, "(;|=)"), FUN = function(x) {
        # Keys sit at odd positions, values at the position after each key.
        x_seq <- seq.int(to = length(x), by = 2)
        y[x[x_seq]] <- x[x_seq + 1]
        # A trailing flag has no value, so it comes back as NA; store "YES".
        y[is.na(y)] <- "YES"
        na.record[x[x_seq]] <- y
        na.record
    }))

    ##      ENSP HGVS DOMAINS EXON   INTRON HGVSp HGVSc
    ## [1,] NA   NA   NA      NA     "1//1" NA    "ENST00000495576.1:n.820-1G>A"
    ## [2,] NA   NA   NA      NA     NA     NA    NA
    ## [3,] NA   NA   NA      "4//4" NA     NA    "ENST00000466430.1:n.911C>T"
    ## [4,] NA   NA   NA      NA     NA     NA    NA
    ##      CANONICAL GMAF DISTANCE HGNC CCDS SIFT PolyPhen common
    ## [1,] "YES"     NA   NA       NA   NA   NA   NA       NA
    ## [2,] NA        NA   "2179"   NA   NA   NA   NA       NA
    ## [3,] "YES"     NA   NA       NA   NA   NA   NA       NA
    ## [4,] "YES"     NA   "27"     NA   NA   NA   NA       "YES"