Efficiently create dataframe from strings containing key-value pairs
Here you go:
Recreate the data:
x <- c(
  "HGVSc=ENST00000495576.1:n.820-1G>A;INTRON=1//1;CANONICAL=YES",
  "DISTANCE=2179",
  "HGVSc=ENST00000466430.1:n.911C>T;EXON=4//4;CANONICAL=YES",
  "DISTANCE=27;CANONICAL=YES;common"
)
Create a named vector with your desired names. This is used for fast lookup later:
names <- setNames(1:15, c('ENSP', 'HGVS', 'DOMAINS', 'EXON', 'INTRON',
                          'HGVSp', 'HGVSc', 'CANONICAL', 'GMAF', 'DISTANCE',
                          'HGNC', 'CCDS', 'SIFT', 'PolyPhen', 'common'))
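To make the lookup idea concrete, here is a minimal sketch of how a named integer vector maps keys to column positions. The name `lookup` and the three-key subset are assumptions chosen for illustration; they are not part of the answer's code.

```r
# Named integer vector: the names are the keys, the values are positions.
# `lookup` is a hypothetical name, used here to avoid reusing `names`.
lookup <- setNames(1:3, c("EXON", "INTRON", "common"))

# Indexing by name returns the positions in the order requested.
pos <- lookup[c("common", "EXON")]
print(pos)

# Those positions can index a pre-allocated vector directly.
z <- rep(NA_character_, length(lookup))
z[pos] <- c("YES", "4//4")
print(z)
```

Because the position comes from a single indexing step, no searching or matching is needed per key.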
Create a helper function that assigns each variable to the correct position in a matrix. Then use lapply and strsplit:
assign <- function(x, names){
  # Pad bare flags (no "=value") with "YES" so every entry is a key/value pair
  xx <- sapply(x, function(i) if(length(i)==2L) i else c(i, "YES"))
  z <- rep(NA, length(names))
  # Row 1 of xx holds the keys, row 2 the values
  z[names[xx[1, ]]] <- xx[2, ]
  z
}

sx <- lapply(strsplit(x, ";"), strsplit, "=")
ret <- t(sapply(sx, assign, names))
colnames(ret) <- names(names)
ret
The results:
     ENSP HGVS DOMAINS EXON   INTRON HGVSp HGVSc                          CANONICAL GMAF DISTANCE HGNC
[1,] NA   NA   NA      NA     "1//1" NA    "ENST00000495576.1:n.820-1G>A" "YES"     NA   NA       NA
[2,] NA   NA   NA      NA     NA     NA    NA                             NA        NA   "2179"   NA
[3,] NA   NA   NA      "4//4" NA     NA    "ENST00000466430.1:n.911C>T"   "YES"     NA   NA       NA
[4,] NA   NA   NA      NA     NA     NA    NA                             "YES"     NA   "27"     NA
     CCDS SIFT PolyPhen common
[1,] NA   NA   NA       NA
[2,] NA   NA   NA       NA
[3,] NA   NA   NA       NA
[4,] NA   NA   NA       "YES"
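The result above is a character matrix rather than a data frame. If a data frame is what you need, one possible follow-up is sketched below. The small matrix is a stand-in for `ret` with only three of the columns, and converting `DISTANCE` to integer is an assumption about how you want to use that column.

```r
# A small stand-in for `ret`: a character matrix with NA for missing keys.
ret <- matrix(
  c(NA, "4//4", "2179", "27", NA, "YES"),
  nrow = 2,
  dimnames = list(NULL, c("EXON", "DISTANCE", "common"))
)

# Convert to a data frame, keeping strings as character columns.
df <- as.data.frame(ret, stringsAsFactors = FALSE)

# Every column is still character; convert numeric ones explicitly.
df$DISTANCE <- as.integer(df$DISTANCE)
str(df)
```

Building the matrix first and converting once at the end avoids growing a data frame row by row.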
Here's another, faster, solution taking advantage of the original pairings...
##                   test elapsed replications relative average
## 2    thell_solution(x)    0.37         1000    1.000 0.00037
## 3   andrie_solution(x)    1.04         1000    2.811 0.00104
## 1 original_solution(x)    2.61         1000    7.054 0.00261
Since pairing[1] is always assigned pairing[2], except for the final boolean flag (not that I understand why that one flag is treated differently in the original string vector), we can take advantage of the sequence. Indexing past the end of a vector returns NA (i.e. is.na(x[length(x) + 1]) is TRUE), so a name given without a value picks up NA, which can then be replaced with "YES". There is also no need to call names multiple times. And since strsplit uses regex, we can split on both separators at once with alternation.
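As an illustration of that pairing trick on a single record, here is a minimal sketch. The variable names `rec`, `parts`, `key_idx`, `keys` and `vals` are assumptions for this example, and it uses `seq(1, n, by = 2)` for the odd positions rather than the `seq.int(to = ...)` form in the answer.

```r
rec <- "DISTANCE=27;CANONICAL=YES;common"

# Split on either separator in one pass using regex alternation.
parts <- strsplit(rec, "(;|=)")[[1]]

# Keys sit at the odd positions, values at the position right after each key.
key_idx <- seq(1, length(parts), by = 2)
keys <- parts[key_idx]
vals <- parts[key_idx + 1]  # the index past the end yields NA for the bare flag

# A key without a value is treated as a flag and set to "YES".
vals[is.na(vals)] <- "YES"
print(setNames(vals, keys))
```

Note that this relies on a bare flag appearing only at the end of a record; a flag in the middle would shift every later key and value by one position.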
# Let `x` be as @Andrie made it in his answer. Let `names` be as you had
# in the original question.

# A pre-built dummy record and empty list.
na.record <- setNames(rep(NA, times = length(names)), names)
y <- list()

do.call(rbind, lapply(strsplit(x, "(;|=)"), FUN = function(x) {
  # Odd positions hold the keys; each value follows its key
  x_seq <- seq.int(to = length(x), by = 2)
  y[x[x_seq]] <- x[x_seq + 1]
  # A trailing flag has no value, so it comes back as NA; set it to "YES"
  y[is.na(y)] <- "YES"
  na.record[x[x_seq]] <- y
  na.record
}))

##      ENSP HGVS DOMAINS EXON   INTRON HGVSp HGVSc
## [1,] NA   NA   NA      NA     "1//1" NA    "ENST00000495576.1:n.820-1G>A"
## [2,] NA   NA   NA      NA     NA     NA    NA
## [3,] NA   NA   NA      "4//4" NA     NA    "ENST00000466430.1:n.911C>T"
## [4,] NA   NA   NA      NA     NA     NA    NA
##      CANONICAL GMAF DISTANCE HGNC CCDS SIFT PolyPhen common
## [1,] "YES"     NA   NA       NA   NA   NA   NA       NA
## [2,] NA        NA   "2179"   NA   NA   NA   NA       NA
## [3,] "YES"     NA   NA       NA   NA   NA   NA       NA
## [4,] "YES"     NA   "27"     NA   NA   NA   NA       "YES"