How to generate one hot encoding for DNA sequences using R or python

Question:

I want to generate one hot coding matrix for a list of DNA sequences. I have tried to solve my problem from the following link How to generate one hot encoding for DNA sequences? but some of the solutions are given only for one single DNA sequence and not for a list of DNA sequences.

For example

def one_hot_encode(seq):
    mapping = dict(zip("ACGT", range(4)))    
    seq2 = [mapping[i] for i in seq]
    return np.eye(4)[seq2]

one_hot_encode("AACGT")

In the given above code, if I run one_hot_encode("AACGT","GGTAC","CGTAC") it will fail, also i want to generate matrix as output.

Currently, I am working in R and below is my DNA sequence in the r data frame(single-column file)

ACTTTA
TTGATG
CTTACG
GTACGT

Expected output

1   0   0   0   0   1   0   0   0   0   0   1   0   0   0   1   0   0   0   1   1   0   0   0
0   0   0   1   0   0   0   1   0   0   1   0   1   0   0   0   0   0   0   1   0   0   1   0
0   1   0   0   0   0   0   1   0   0   0   1   1   0   0   0   0   1   0   0   0   0   1   0
0   0   1   0   0   0   0   1   1   0   0   0   0   1   0   0   0   0   1   0   0   0   0   1

is it possible to do this in R?

Asked By: Sunil Pachakar

||

Answers:

In base R, you can do:

dna <- c("A", "C", "G", "T")
dna_seq <- c("ACTTTA", "TTGATG")

one_hot_encode <- function(x){
  spl <- strsplit(x, "")[[1]]
  fa <- factor(spl, levels = dna)
  sapply(fa, table) |>
    Reduce(f = c, x = _)
}

data.frame(do.call(rbind, lapply(dna_seq, one_hot_encode)))

output

  X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24
1  1  0  0  0  0  1  0  0  0   0   0   1   0   0   0   1   0   0   0   1   1   0   0   0
2  0  0  0  1  0  0  0  1  0   0   1   0   1   0   0   0   0   0   0   1   0   0   1   0
Answered By: Maël
library(stringr)

dataIn <- c(
  "ACTTTA", 
  "TTGATG", 
  "CTTACG", 
  "GTACGT"
  )

one_hot_encode <- function(baseSeq) {
 outSeq <- stringr::str_replace_all(baseSeq, c("A" = "1000",
                                  "C" = "0100",
                                  "G" = "0010",
                                  "T" = "0001"))
 outSeq <- str_extract_all(outSeq, boundary("character"))
 unlist(outSeq)
}

data.frame(do.call(rbind,lapply(dataIn, one_hot_encode)))

gives

 > data.frame(do.call(rbind,lapply(dataIn, one_hot_encode)))
  X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24
1  1  0  0  0  0  1  0  0  0   0   0   1   0   0   0   1   0   0   0   1   1   0   0   0
2  0  0  0  1  0  0  0  1  0   0   1   0   1   0   0   0   0   0   0   1   0   0   1   0
3  0  1  0  0  0  0  0  1  0   0   0   1   1   0   0   0   0   1   0   0   0   0   1   0
4  0  0  1  0  0  0  0  1  1   0   0   0   0   1   0   0   0   0   1   0   0   0   0   1

Some row and column names might tidy up the output, but I think this is essentially what you were after?

Answered By: Paul Stafford Allen

Here’s a solution using base R that generates a transpose of your solution. That is, it creates a one-hot column vector for each individual character and concatenates these columns (i.e. there are always four rows, regardless of the number of strings).

(For a solution that produces the same format as the question, see the bottom.)

sequences = c('ACTTTA', 'TTGATG', 'GATTACA')
strsplit(sequences, '') |>
  lapply(match, table = c('A', 'C', 'G', 'T')) |>
  lapply((x) {
    m = diag(0L, nrow = 4L, ncol = length(x))
    m[cbind(x, seq_along(x))] = 1L
    m
  }) |>
  do.call('cbind', args = _)

Or, alternatively (shorter but not necessarily more readable):

strsplit(sequences, '') |>
  lapply(match, table = c('A', 'C', 'G', 'T')) |>
  unlist() %>%
  {diag(1L, 4L)[, .]}

Unfortunately this requires the ‘magrittr’ pipe but we can change this (and make it more readable again) by abstracting the expression on the last line into a function:

num_to_one_hot = function (x, bits) {
  diag(1L, bits)[, x]
}

This is a nice, general function to one-hot encode a numeric vector. Furthermore, we can unlist after the first step, which allows us to avoid lapply. And with that our DNA encoding code becomes:

strsplit(sequences, '') |>
  unlist() |>
  match(c('A', 'C', 'G', 'T')) |>
  num_to_one_hot(bits = 4L)

… which I find by far the most self-explanatory (and most readable!) of all the alternatives. It’s also fully vectorised, and doesn’t use lapply or similar, so it’s also more efficient.


For completeness, here’s a solution that produces the same output as requested in the question, by transforming the matrix produced by the previous algorithm from col-major to row-major orientation:

strsplit(sequences, '') |>
  unlist() |>
  match(c('A', 'C', 'G', 'T')) |>
  num_to_one_hot(4L) |>
  matrix(nrow = length(sequences), byrow = TRUE)

I benchmarked all currently posted solutions, and the ones using num_to_one_hot (Konrad2 and Konrad3 below) are the fastest:

> bench::mark(Maël(), Paul(), Konrad1(), Konrad2(), Konrad3(), GKi1(), GKi2(), Thomas(), check = FALSE)
# # A tibble: 8 × 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 Maël()        776µs 818.25µs     1154.    1.36KB    13.3    522     6    452.2ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 2 Paul()       1.07ms   1.17ms      743.    4.38KB     2.04   365     1    491.2ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 3 Konrad1()    21.7µs   23.7µs    39401.      432B    15.8   9996     4    253.7ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 4 Konrad2()     4.8µs    5.7µs   148619.    1.69KB    14.9   9999     1     67.3ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 5 Konrad3()     6.6µs    7.9µs   108540.    2.11KB    10.9   9999     1     92.1ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 6 GKi1()        9.2µs   10.8µs    83596.      960B    16.7   9998     2    119.6ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 7 GKi2()      258.3µs    278µs     3442.      960B     0     1721     0      500ms <NULL> <Rprofmem> <bench_tm> <tibble>
# 8 Thomas()       10µs   11.6µs    77604.    2.39KB     7.76  9999     1    128.8ms <NULL> <Rprofmem> <bench_tm> <tibble>
Answered By: Konrad Rudolph

Another base way.

s <- c("ACTTTA", "TTGATG", "CTTACG", "GTACGT")

dna <- c("A", "C", "G", "T")
lup <- setNames(asplit(diag(length(dna)), 1), dna)
lapply(strsplit(s, "", TRUE), (x) unlist(lup[x], FALSE, FALSE))
#[[1]]
# [1] 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0
#
#[[2]]
# [1] 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0
#
#[[3]]
# [1] 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0
#
#[[4]]
# [1] 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1

Or using gsub

. <- gsub("A", "1000", s)
. <- gsub("C", "0100", .)
. <- gsub("G", "0010", .)
. <- gsub("T", "0001", .)
cbind(s, .)
#     s        .                         
#[1,] "ACTTTA" "100001000001000100011000"
#[2,] "TTGATG" "000100010010100000010010"
#[3,] "CTTACG" "010000010001100001000010"
#[4,] "GTACGT" "001000011000010000100001"

lapply(strsplit(., "", TRUE), as.integer)
#[[1]]
# [1] 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0
#
#[[2]]
# [1] 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0
#
#[[3]]
# [1] 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0
#
#[[4]]
# [1] 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1

Or

lapply(strsplit(s, "", TRUE), (x) c(diag(length(dna))[,match(x, dna)]))

Or

matrix(diag(4)[,unlist(lapply(strsplit(s, "", TRUE), match, dna), FALSE, FALSE)], length(s), byrow = TRUE)

Or

. <- chartr("ACGT", "1-4", s)
. <- strsplit(., "", TRUE)
matrix(diag(4)[, as.integer(unlist(.))], length(s), byrow = TRUE)

Or

. <- paste(s, collapse="")
. <- chartr("ACGT", "1-4", .)
matrix(diag(1L, 4L)[, utf8ToInt(.) - utf8ToInt("0")], length(s), byrow = TRUE)

Or

matrix(diag(1L, 4L)[,rep(1:4, utf8ToInt("ACGT") - 64)[utf8ToInt(paste(s, collapse="")) - 64]], length(s), byrow = TRUE)

Benchmark

library(magrittr) #For Konrad
s <- c("ACTTTA", "TTGATG", "CTTACG", "GTACGT")
dna <- c("A", "C", "G", "T")

bench::mark(check=FALSE,
            GKi1 = {lup <- setNames(asplit(diag(length(dna)), 1), dna)
              lapply(strsplit(s, "", TRUE), (x) unlist(lup[x], FALSE, FALSE))},
GKi2 = {. <- gsub("A", "1000", s, fixed=TRUE)
  . <- gsub("C", "0100", ., fixed=TRUE)
  . <- gsub("G", "0010", ., fixed=TRUE)
  gsub("T", "0001", ., fixed=TRUE)},
GKi3 = lapply(strsplit(s, "", TRUE), (x) c(diag(length(dna))[,match(x, dna)])),
GKi4 = matrix(diag(4)[,unlist(lapply(strsplit(s, "", TRUE), match, dna), FALSE, FALSE)], length(s), byrow = TRUE),
GKi5 = matrix(diag(4)[, as.integer(unlist(strsplit(chartr("ACGT", "1-4", s), "", TRUE)))], length(s), byrow = TRUE),
GKi6 = matrix(diag(1L, 4L)[, utf8ToInt(chartr("ACGT", "1-4", paste(s, collapse=""))) - utf8ToInt("0")], length(s), byrow = TRUE),
GKi7 = matrix(diag(1L, 4L)[,rep(1:4, utf8ToInt("ACGT") - 64)[utf8ToInt(paste(s, collapse="")) - 64]], length(s), byrow = TRUE),
Konrad = {
strsplit(s, '') |>
  lapply(match, table = c('A', 'C', 'G', 'T')) |>
  unlist() %>%
  {replace(diag(0L, 4L, length(.)), cbind(., seq_along(.)), 1L)}
},
Thomas = matrix(t(diag(4)[match(unlist(lapply(s, utf8ToInt)), utf8ToInt("ACGT")), ]), nrow = length(dna), byrow = TRUE)
         )

Result

  expression     min  median itr/s…¹ mem_al…² gc/se…³ n_itr  n_gc total…⁴ result
  <bch:expr> <bch:t> <bch:t>   <dbl> <bch:by>   <dbl> <int> <dbl> <bch:t> <list>
1 GKi1          23µs 26.31µs  33203.   3.96MB    69.9  9979    21 300.5ms <NULL>
2 GKi2        6.97µs  8.27µs 109392.       0B    98.5  9991     9  91.3ms <NULL>
3 GKi3       12.39µs 15.41µs  51404. 169.84KB    87.5  9983    17 194.2ms <NULL>
4 GKi4        8.21µs 10.67µs  80760.   1.59KB   121.   9985    15 123.6ms <NULL>
5 GKi5        7.03µs  8.43µs 108751.    6.2KB    76.2  9993     7  91.9ms <NULL>
6 GKi6         6.2µs  7.62µs 118818.     864B    95.1  9992     8  84.1ms <NULL>
7 GKi7        5.64µs  7.46µs 113596.   1.08KB    90.9  9992     8    88ms <NULL>
8 Konrad     11.18µs 13.91µs  60428.   8.92KB   103.   9983    17 165.2ms <NULL>
9 Thomas       8.4µs 10.75µs  77019.   6.34KB    69.4  9991     9 129.7ms <NULL>
Answered By: GKi

In base R, we can use match and utf8ToInt to find the mapping position, and matrix to format the output, e.g.,

dna <- c("ACTTTA", "TTGATG", "CTTACG", "GTACGT")
matrix(t(diag(4)[match(unlist(lapply(dna, utf8ToInt)), utf8ToInt("ACGT")), ]), nrow = length(dna), byrow = TRUE)

such that we will obtain

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,]    1    0    0    0    0    1    0    0    0     0     0     1     0     0
[2,]    0    0    0    1    0    0    0    1    0     0     1     0     1     0
[3,]    0    1    0    0    0    0    0    1    0     0     0     1     1     0
[4,]    0    0    1    0    0    0    0    1    1     0     0     0     0     1
     [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
[1,]     0     1     0     0     0     1     1     0     0     0
[2,]     0     0     0     0     0     1     0     0     1     0
[3,]     0     0     0     1     0     0     0     0     1     0
[4,]     0     0     0     0     1     0     0     0     0     1
Answered By: ThomasIsCoding
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.