How to generate one hot encoding for DNA sequences using R or python
Question:
I want to generate one hot coding matrix for a list of DNA sequences. I have tried to solve my problem from the following link How to generate one hot encoding for DNA sequences? but some of the solutions are given only for one single DNA sequence and not for a list of DNA sequences.
For example
def one_hot_encode(seq):
mapping = dict(zip("ACGT", range(4)))
seq2 = [mapping[i] for i in seq]
return np.eye(4)[seq2]
one_hot_encode("AACGT")
In the given above code, if I run one_hot_encode("AACGT","GGTAC","CGTAC")
it will fail, also i want to generate matrix as output.
Currently, I am working in R and below is my DNA sequence in the r data frame(single-column file)
ACTTTA
TTGATG
CTTACG
GTACGT
Expected output
1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0
0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0
0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0
0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
is it possible to do this in R?
Answers:
In base R, you can do:
dna <- c("A", "C", "G", "T")
dna_seq <- c("ACTTTA", "TTGATG")
one_hot_encode <- function(x){
spl <- strsplit(x, "")[[1]]
fa <- factor(spl, levels = dna)
sapply(fa, table) |>
Reduce(f = c, x = _)
}
data.frame(do.call(rbind, lapply(dna_seq, one_hot_encode)))
output
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24
1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0
2 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0
library(stringr)
dataIn <- c(
"ACTTTA",
"TTGATG",
"CTTACG",
"GTACGT"
)
one_hot_encode <- function(baseSeq) {
outSeq <- stringr::str_replace_all(baseSeq, c("A" = "1000",
"C" = "0100",
"G" = "0010",
"T" = "0001"))
outSeq <- str_extract_all(outSeq, boundary("character"))
unlist(outSeq)
}
data.frame(do.call(rbind,lapply(dataIn, one_hot_encode)))
gives
> data.frame(do.call(rbind,lapply(dataIn, one_hot_encode)))
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24
1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0
2 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0
3 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0
4 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
Some row and column names might tidy up the output, but I think this is essentially what you were after?
Here’s a solution using base R that generates a transpose of your solution. That is, it creates a one-hot column vector for each individual character and concatenates these columns (i.e. there are always four rows, regardless of the number of strings).
(For a solution that produces the same format as the question, see the bottom.)
sequences = c('ACTTTA', 'TTGATG', 'GATTACA')
strsplit(sequences, '') |>
lapply(match, table = c('A', 'C', 'G', 'T')) |>
lapply((x) {
m = diag(0L, nrow = 4L, ncol = length(x))
m[cbind(x, seq_along(x))] = 1L
m
}) |>
do.call('cbind', args = _)
Or, alternatively (shorter but not necessarily more readable):
strsplit(sequences, '') |>
lapply(match, table = c('A', 'C', 'G', 'T')) |>
unlist() %>%
{diag(1L, 4L)[, .]}
Unfortunately this requires the ‘magrittr’ pipe but we can change this (and make it more readable again) by abstracting the expression on the last line into a function:
num_to_one_hot = function (x, bits) {
diag(1L, bits)[, x]
}
This is a nice, general function to one-hot encode a numeric vector. Furthermore, we can unlist
after the first step, which allows us to avoid lapply
. And with that our DNA encoding code becomes:
strsplit(sequences, '') |>
unlist() |>
match(c('A', 'C', 'G', 'T')) |>
num_to_one_hot(bits = 4L)
… which I find by far the most self-explanatory (and most readable!) of all the alternatives. It’s also fully vectorised, and doesn’t use lapply
or similar, so it’s also more efficient.
For completeness, here’s a solution that produces the same output as requested in the question, by transforming the matrix produced by the previous algorithm from col-major to row-major orientation:
strsplit(sequences, '') |>
unlist() |>
match(c('A', 'C', 'G', 'T')) |>
num_to_one_hot(4L) |>
matrix(nrow = length(sequences), byrow = TRUE)
I benchmarked all currently posted solutions, and the ones using num_to_one_hot
(Konrad2
and Konrad3
below) are the fastest:
> bench::mark(Maël(), Paul(), Konrad1(), Konrad2(), Konrad3(), GKi1(), GKi2(), Thomas(), check = FALSE)
# # A tibble: 8 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 Maël() 776µs 818.25µs 1154. 1.36KB 13.3 522 6 452.2ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 2 Paul() 1.07ms 1.17ms 743. 4.38KB 2.04 365 1 491.2ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 3 Konrad1() 21.7µs 23.7µs 39401. 432B 15.8 9996 4 253.7ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 4 Konrad2() 4.8µs 5.7µs 148619. 1.69KB 14.9 9999 1 67.3ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 5 Konrad3() 6.6µs 7.9µs 108540. 2.11KB 10.9 9999 1 92.1ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 6 GKi1() 9.2µs 10.8µs 83596. 960B 16.7 9998 2 119.6ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 7 GKi2() 258.3µs 278µs 3442. 960B 0 1721 0 500ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 8 Thomas() 10µs 11.6µs 77604. 2.39KB 7.76 9999 1 128.8ms <NULL> <Rprofmem> <bench_tm> <tibble>
Another base way.
s <- c("ACTTTA", "TTGATG", "CTTACG", "GTACGT")
dna <- c("A", "C", "G", "T")
lup <- setNames(asplit(diag(length(dna)), 1), dna)
lapply(strsplit(s, "", TRUE), (x) unlist(lup[x], FALSE, FALSE))
#[[1]]
# [1] 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0
#
#[[2]]
# [1] 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0
#
#[[3]]
# [1] 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0
#
#[[4]]
# [1] 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
Or using gsub
. <- gsub("A", "1000", s)
. <- gsub("C", "0100", .)
. <- gsub("G", "0010", .)
. <- gsub("T", "0001", .)
cbind(s, .)
# s .
#[1,] "ACTTTA" "100001000001000100011000"
#[2,] "TTGATG" "000100010010100000010010"
#[3,] "CTTACG" "010000010001100001000010"
#[4,] "GTACGT" "001000011000010000100001"
lapply(strsplit(., "", TRUE), as.integer)
#[[1]]
# [1] 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0
#
#[[2]]
# [1] 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0
#
#[[3]]
# [1] 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0
#
#[[4]]
# [1] 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
Or
lapply(strsplit(s, "", TRUE), (x) c(diag(length(dna))[,match(x, dna)]))
Or
matrix(diag(4)[,unlist(lapply(strsplit(s, "", TRUE), match, dna), FALSE, FALSE)], length(s), byrow = TRUE)
Or
. <- chartr("ACGT", "1-4", s)
. <- strsplit(., "", TRUE)
matrix(diag(4)[, as.integer(unlist(.))], length(s), byrow = TRUE)
Or
. <- paste(s, collapse="")
. <- chartr("ACGT", "1-4", .)
matrix(diag(1L, 4L)[, utf8ToInt(.) - utf8ToInt("0")], length(s), byrow = TRUE)
Or
matrix(diag(1L, 4L)[,rep(1:4, utf8ToInt("ACGT") - 64)[utf8ToInt(paste(s, collapse="")) - 64]], length(s), byrow = TRUE)
Benchmark
library(magrittr) #For Konrad
s <- c("ACTTTA", "TTGATG", "CTTACG", "GTACGT")
dna <- c("A", "C", "G", "T")
bench::mark(check=FALSE,
GKi1 = {lup <- setNames(asplit(diag(length(dna)), 1), dna)
lapply(strsplit(s, "", TRUE), (x) unlist(lup[x], FALSE, FALSE))},
GKi2 = {. <- gsub("A", "1000", s, fixed=TRUE)
. <- gsub("C", "0100", ., fixed=TRUE)
. <- gsub("G", "0010", ., fixed=TRUE)
gsub("T", "0001", ., fixed=TRUE)},
GKi3 = lapply(strsplit(s, "", TRUE), (x) c(diag(length(dna))[,match(x, dna)])),
GKi4 = matrix(diag(4)[,unlist(lapply(strsplit(s, "", TRUE), match, dna), FALSE, FALSE)], length(s), byrow = TRUE),
GKi5 = matrix(diag(4)[, as.integer(unlist(strsplit(chartr("ACGT", "1-4", s), "", TRUE)))], length(s), byrow = TRUE),
GKi6 = matrix(diag(1L, 4L)[, utf8ToInt(chartr("ACGT", "1-4", paste(s, collapse=""))) - utf8ToInt("0")], length(s), byrow = TRUE),
GKi7 = matrix(diag(1L, 4L)[,rep(1:4, utf8ToInt("ACGT") - 64)[utf8ToInt(paste(s, collapse="")) - 64]], length(s), byrow = TRUE),
Konrad = {
strsplit(s, '') |>
lapply(match, table = c('A', 'C', 'G', 'T')) |>
unlist() %>%
{replace(diag(0L, 4L, length(.)), cbind(., seq_along(.)), 1L)}
},
Thomas = matrix(t(diag(4)[match(unlist(lapply(s, utf8ToInt)), utf8ToInt("ACGT")), ]), nrow = length(dna), byrow = TRUE)
)
Result
expression min median itr/s…¹ mem_al…² gc/se…³ n_itr n_gc total…⁴ result
<bch:expr> <bch:t> <bch:t> <dbl> <bch:by> <dbl> <int> <dbl> <bch:t> <list>
1 GKi1 23µs 26.31µs 33203. 3.96MB 69.9 9979 21 300.5ms <NULL>
2 GKi2 6.97µs 8.27µs 109392. 0B 98.5 9991 9 91.3ms <NULL>
3 GKi3 12.39µs 15.41µs 51404. 169.84KB 87.5 9983 17 194.2ms <NULL>
4 GKi4 8.21µs 10.67µs 80760. 1.59KB 121. 9985 15 123.6ms <NULL>
5 GKi5 7.03µs 8.43µs 108751. 6.2KB 76.2 9993 7 91.9ms <NULL>
6 GKi6 6.2µs 7.62µs 118818. 864B 95.1 9992 8 84.1ms <NULL>
7 GKi7 5.64µs 7.46µs 113596. 1.08KB 90.9 9992 8 88ms <NULL>
8 Konrad 11.18µs 13.91µs 60428. 8.92KB 103. 9983 17 165.2ms <NULL>
9 Thomas 8.4µs 10.75µs 77019. 6.34KB 69.4 9991 9 129.7ms <NULL>
In base R, we can use match
and utf8ToInt
to find the mapping position, and matrix
to format the output, e.g.,
dna <- c("ACTTTA", "TTGATG", "CTTACG", "GTACGT")
matrix(t(diag(4)[match(unlist(lapply(dna, utf8ToInt)), utf8ToInt("ACGT")), ]), nrow = length(dna), byrow = TRUE)
such that we will obtain
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,] 1 0 0 0 0 1 0 0 0 0 0 1 0 0
[2,] 0 0 0 1 0 0 0 1 0 0 1 0 1 0
[3,] 0 1 0 0 0 0 0 1 0 0 0 1 1 0
[4,] 0 0 1 0 0 0 0 1 1 0 0 0 0 1
[,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
[1,] 0 1 0 0 0 1 1 0 0 0
[2,] 0 0 0 0 0 1 0 0 1 0
[3,] 0 0 0 1 0 0 0 0 1 0
[4,] 0 0 0 0 1 0 0 0 0 1
I want to generate one hot coding matrix for a list of DNA sequences. I have tried to solve my problem from the following link How to generate one hot encoding for DNA sequences? but some of the solutions are given only for one single DNA sequence and not for a list of DNA sequences.
For example
def one_hot_encode(seq):
mapping = dict(zip("ACGT", range(4)))
seq2 = [mapping[i] for i in seq]
return np.eye(4)[seq2]
one_hot_encode("AACGT")
In the given above code, if I run one_hot_encode("AACGT","GGTAC","CGTAC")
it will fail, also i want to generate matrix as output.
Currently, I am working in R and below is my DNA sequence in the r data frame(single-column file)
ACTTTA
TTGATG
CTTACG
GTACGT
Expected output
1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0
0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0
0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0
0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
is it possible to do this in R?
In base R, you can do:
dna <- c("A", "C", "G", "T")
dna_seq <- c("ACTTTA", "TTGATG")
one_hot_encode <- function(x){
spl <- strsplit(x, "")[[1]]
fa <- factor(spl, levels = dna)
sapply(fa, table) |>
Reduce(f = c, x = _)
}
data.frame(do.call(rbind, lapply(dna_seq, one_hot_encode)))
output
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24
1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0
2 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0
library(stringr)
dataIn <- c(
"ACTTTA",
"TTGATG",
"CTTACG",
"GTACGT"
)
one_hot_encode <- function(baseSeq) {
outSeq <- stringr::str_replace_all(baseSeq, c("A" = "1000",
"C" = "0100",
"G" = "0010",
"T" = "0001"))
outSeq <- str_extract_all(outSeq, boundary("character"))
unlist(outSeq)
}
data.frame(do.call(rbind,lapply(dataIn, one_hot_encode)))
gives
> data.frame(do.call(rbind,lapply(dataIn, one_hot_encode)))
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24
1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0
2 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0
3 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0
4 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
Some row and column names might tidy up the output, but I think this is essentially what you were after?
Here’s a solution using base R that generates a transpose of your solution. That is, it creates a one-hot column vector for each individual character and concatenates these columns (i.e. there are always four rows, regardless of the number of strings).
(For a solution that produces the same format as the question, see the bottom.)
sequences = c('ACTTTA', 'TTGATG', 'GATTACA')
strsplit(sequences, '') |>
lapply(match, table = c('A', 'C', 'G', 'T')) |>
lapply((x) {
m = diag(0L, nrow = 4L, ncol = length(x))
m[cbind(x, seq_along(x))] = 1L
m
}) |>
do.call('cbind', args = _)
Or, alternatively (shorter but not necessarily more readable):
strsplit(sequences, '') |>
lapply(match, table = c('A', 'C', 'G', 'T')) |>
unlist() %>%
{diag(1L, 4L)[, .]}
Unfortunately this requires the ‘magrittr’ pipe but we can change this (and make it more readable again) by abstracting the expression on the last line into a function:
num_to_one_hot = function (x, bits) {
diag(1L, bits)[, x]
}
This is a nice, general function to one-hot encode a numeric vector. Furthermore, we can unlist
after the first step, which allows us to avoid lapply
. And with that our DNA encoding code becomes:
strsplit(sequences, '') |>
unlist() |>
match(c('A', 'C', 'G', 'T')) |>
num_to_one_hot(bits = 4L)
… which I find by far the most self-explanatory (and most readable!) of all the alternatives. It’s also fully vectorised, and doesn’t use lapply
or similar, so it’s also more efficient.
For completeness, here’s a solution that produces the same output as requested in the question, by transforming the matrix produced by the previous algorithm from col-major to row-major orientation:
strsplit(sequences, '') |>
unlist() |>
match(c('A', 'C', 'G', 'T')) |>
num_to_one_hot(4L) |>
matrix(nrow = length(sequences), byrow = TRUE)
I benchmarked all currently posted solutions, and the ones using num_to_one_hot
(Konrad2
and Konrad3
below) are the fastest:
> bench::mark(Maël(), Paul(), Konrad1(), Konrad2(), Konrad3(), GKi1(), GKi2(), Thomas(), check = FALSE)
# # A tibble: 8 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 Maël() 776µs 818.25µs 1154. 1.36KB 13.3 522 6 452.2ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 2 Paul() 1.07ms 1.17ms 743. 4.38KB 2.04 365 1 491.2ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 3 Konrad1() 21.7µs 23.7µs 39401. 432B 15.8 9996 4 253.7ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 4 Konrad2() 4.8µs 5.7µs 148619. 1.69KB 14.9 9999 1 67.3ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 5 Konrad3() 6.6µs 7.9µs 108540. 2.11KB 10.9 9999 1 92.1ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 6 GKi1() 9.2µs 10.8µs 83596. 960B 16.7 9998 2 119.6ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 7 GKi2() 258.3µs 278µs 3442. 960B 0 1721 0 500ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 8 Thomas() 10µs 11.6µs 77604. 2.39KB 7.76 9999 1 128.8ms <NULL> <Rprofmem> <bench_tm> <tibble>
Another base way.
s <- c("ACTTTA", "TTGATG", "CTTACG", "GTACGT")
dna <- c("A", "C", "G", "T")
lup <- setNames(asplit(diag(length(dna)), 1), dna)
lapply(strsplit(s, "", TRUE), (x) unlist(lup[x], FALSE, FALSE))
#[[1]]
# [1] 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0
#
#[[2]]
# [1] 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0
#
#[[3]]
# [1] 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0
#
#[[4]]
# [1] 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
Or using gsub
. <- gsub("A", "1000", s)
. <- gsub("C", "0100", .)
. <- gsub("G", "0010", .)
. <- gsub("T", "0001", .)
cbind(s, .)
# s .
#[1,] "ACTTTA" "100001000001000100011000"
#[2,] "TTGATG" "000100010010100000010010"
#[3,] "CTTACG" "010000010001100001000010"
#[4,] "GTACGT" "001000011000010000100001"
lapply(strsplit(., "", TRUE), as.integer)
#[[1]]
# [1] 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0
#
#[[2]]
# [1] 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0
#
#[[3]]
# [1] 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0
#
#[[4]]
# [1] 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
Or
lapply(strsplit(s, "", TRUE), (x) c(diag(length(dna))[,match(x, dna)]))
Or
matrix(diag(4)[,unlist(lapply(strsplit(s, "", TRUE), match, dna), FALSE, FALSE)], length(s), byrow = TRUE)
Or
. <- chartr("ACGT", "1-4", s)
. <- strsplit(., "", TRUE)
matrix(diag(4)[, as.integer(unlist(.))], length(s), byrow = TRUE)
Or
. <- paste(s, collapse="")
. <- chartr("ACGT", "1-4", .)
matrix(diag(1L, 4L)[, utf8ToInt(.) - utf8ToInt("0")], length(s), byrow = TRUE)
Or
matrix(diag(1L, 4L)[,rep(1:4, utf8ToInt("ACGT") - 64)[utf8ToInt(paste(s, collapse="")) - 64]], length(s), byrow = TRUE)
Benchmark
library(magrittr) #For Konrad
s <- c("ACTTTA", "TTGATG", "CTTACG", "GTACGT")
dna <- c("A", "C", "G", "T")
bench::mark(check=FALSE,
GKi1 = {lup <- setNames(asplit(diag(length(dna)), 1), dna)
lapply(strsplit(s, "", TRUE), (x) unlist(lup[x], FALSE, FALSE))},
GKi2 = {. <- gsub("A", "1000", s, fixed=TRUE)
. <- gsub("C", "0100", ., fixed=TRUE)
. <- gsub("G", "0010", ., fixed=TRUE)
gsub("T", "0001", ., fixed=TRUE)},
GKi3 = lapply(strsplit(s, "", TRUE), (x) c(diag(length(dna))[,match(x, dna)])),
GKi4 = matrix(diag(4)[,unlist(lapply(strsplit(s, "", TRUE), match, dna), FALSE, FALSE)], length(s), byrow = TRUE),
GKi5 = matrix(diag(4)[, as.integer(unlist(strsplit(chartr("ACGT", "1-4", s), "", TRUE)))], length(s), byrow = TRUE),
GKi6 = matrix(diag(1L, 4L)[, utf8ToInt(chartr("ACGT", "1-4", paste(s, collapse=""))) - utf8ToInt("0")], length(s), byrow = TRUE),
GKi7 = matrix(diag(1L, 4L)[,rep(1:4, utf8ToInt("ACGT") - 64)[utf8ToInt(paste(s, collapse="")) - 64]], length(s), byrow = TRUE),
Konrad = {
strsplit(s, '') |>
lapply(match, table = c('A', 'C', 'G', 'T')) |>
unlist() %>%
{replace(diag(0L, 4L, length(.)), cbind(., seq_along(.)), 1L)}
},
Thomas = matrix(t(diag(4)[match(unlist(lapply(s, utf8ToInt)), utf8ToInt("ACGT")), ]), nrow = length(dna), byrow = TRUE)
)
Result
expression min median itr/s…¹ mem_al…² gc/se…³ n_itr n_gc total…⁴ result
<bch:expr> <bch:t> <bch:t> <dbl> <bch:by> <dbl> <int> <dbl> <bch:t> <list>
1 GKi1 23µs 26.31µs 33203. 3.96MB 69.9 9979 21 300.5ms <NULL>
2 GKi2 6.97µs 8.27µs 109392. 0B 98.5 9991 9 91.3ms <NULL>
3 GKi3 12.39µs 15.41µs 51404. 169.84KB 87.5 9983 17 194.2ms <NULL>
4 GKi4 8.21µs 10.67µs 80760. 1.59KB 121. 9985 15 123.6ms <NULL>
5 GKi5 7.03µs 8.43µs 108751. 6.2KB 76.2 9993 7 91.9ms <NULL>
6 GKi6 6.2µs 7.62µs 118818. 864B 95.1 9992 8 84.1ms <NULL>
7 GKi7 5.64µs 7.46µs 113596. 1.08KB 90.9 9992 8 88ms <NULL>
8 Konrad 11.18µs 13.91µs 60428. 8.92KB 103. 9983 17 165.2ms <NULL>
9 Thomas 8.4µs 10.75µs 77019. 6.34KB 69.4 9991 9 129.7ms <NULL>
In base R, we can use match
and utf8ToInt
to find the mapping position, and matrix
to format the output, e.g.,
dna <- c("ACTTTA", "TTGATG", "CTTACG", "GTACGT")
matrix(t(diag(4)[match(unlist(lapply(dna, utf8ToInt)), utf8ToInt("ACGT")), ]), nrow = length(dna), byrow = TRUE)
such that we will obtain
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,] 1 0 0 0 0 1 0 0 0 0 0 1 0 0
[2,] 0 0 0 1 0 0 0 1 0 0 1 0 1 0
[3,] 0 1 0 0 0 0 0 1 0 0 0 1 1 0
[4,] 0 0 1 0 0 0 0 1 1 0 0 0 0 1
[,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
[1,] 0 1 0 0 0 1 1 0 0 0
[2,] 0 0 0 0 0 1 0 0 1 0
[3,] 0 0 0 1 0 0 0 0 1 0
[4,] 0 0 0 0 1 0 0 0 0 1