Improving the speed of one-hot encoding a list of strings

Question 1

I've recently developed two functions that convert a list of strings like this one (in my case the strings are 101 characters long):

['AGT', 'AAT']

To a numpy array:

array([[[[1],
         [0],
         [0],
         [0]],
        [[0],
         [1],
         [0],
         [0]],
        [[0],
         [0],
         [0],
         [1]]],
       [[[1],
         [0],
         [0],
         [0]],
        [[1],
         [0],
         [0],
         [0]],
        [[0],
         [0],
         [0],
         [1]]]])

Its shape is (2, 3, 4, 1) in this case: two sequences, three characters each, four channels per character, and a trailing axis of length 1.

At the moment, my code defines one function that holds a dictionary and uses it to encode a single input string, like so:

import numpy as np

def sequence_one_hot_encoder(seq):
    # Each base maps to a 4x1 column; "X" is all zeros and "N" is all ones.
    mapping = {
        "A": [[1], [0], [0], [0]],
        "G": [[0], [1], [0], [0]],
        "C": [[0], [0], [1], [0]],
        "T": [[0], [0], [0], [1]],
        "X": [[0], [0], [0], [0]],
        "N": [[1], [1], [1], [1]]
    }
    # Look up every character and stack the columns: shape (len(seq), 4, 1).
    encoded_seq = np.array([mapping[i] for i in str(seq)])
    return encoded_seq

I then define a second function that applies this encoder to every string in my list:

import numpy as np

def sequence_list_encoder(sequence_file):
    # Encode every sequence, then stack the per-sequence arrays into one 4-D array.
    one_hot_encoded_array = np.asarray(list(map(sequence_one_hot_encoder, sequence_file)))
    print(one_hot_encoded_array.shape)
    return one_hot_encoded_array

For a list of 1,688,119 strings of 101 characters each, this currently takes around 7-8 minutes. Is there a better way to rewrite my two functions to reduce the runtime?
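Before comparing alternatives, it may help to measure the baseline on a smaller sample. The following is a minimal sketch, not part of the original post: the helper `make_sequences`, the sample size of 2,000, and the random test data are all assumptions made for illustration, and the two functions are condensed into a list comprehension.

```python
import random
import time

import numpy as np

MAPPING = {
    "A": [[1], [0], [0], [0]],
    "G": [[0], [1], [0], [0]],
    "C": [[0], [0], [1], [0]],
    "T": [[0], [0], [0], [1]],
    "X": [[0], [0], [0], [0]],
    "N": [[1], [1], [1], [1]],
}


def sequence_one_hot_encoder(seq):
    # Same per-sequence encoding as in the question: shape (len(seq), 4, 1).
    return np.array([MAPPING[ch] for ch in str(seq)])


def make_sequences(count, length=101, seed=0):
    """Hypothetical helper: random equal-length sequences over A, G, C, T."""
    rng = random.Random(seed)
    return ["".join(rng.choices("AGCT", k=length)) for _ in range(count)]


sample = make_sequences(2_000)

start = time.perf_counter()
encoded = np.asarray([sequence_one_hot_encoder(s) for s in sample])
elapsed = time.perf_counter() - start

# Report the resulting shape and the wall-clock time for this sample.
print(encoded.shape, f"{elapsed:.3f} s")
```

Scaling the measured time by the ratio of the full list size to the sample size gives only a rough estimate, since memory pressure on the full data set is not reflected in a small sample.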

Answer

sequence_one_hot_encoder(seq) builds an array of shape (len(seq), 4, 1). sequence_list_encoder() puts all of these into a Python list and then converts the list into an array of shape (number_of_sequences, len(seq), 4, 1). That conversion appears to carry a lot of overhead. It is much faster to build one_hot_encoded_array as a flat 1-D array and set its shape at the end.

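import numpy as np

def sequence_list_encoder(sequence_file):
    mapping = {
        "A": (1, 0, 0, 0),
        "G": (0, 1, 0, 0),
        "C": (0, 0, 1, 0),
        "T": (0, 0, 0, 1),
        "X": (0, 0, 0, 0),
        "N": (1, 1, 1, 1)
    }
    # Here sequence_file is treated as an open text file, one sequence per line.
    sequences = sequence_file.read().splitlines()
    # Flatten every base of every sequence into one long run of 0/1 values.
    bits = [b for seq in sequences for ch in seq for b in mapping[ch]]
    one_hot_encoded_array = np.fromiter(bits, dtype=np.uint8)
    # Assumes every sequence has the same length as the first one.
    one_hot_encoded_array.shape = (len(sequences), len(sequences[0]), 4, 1)
    return one_hot_encoded_array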

This runs in about one-fifth of the time your code takes.
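The same "build flat, reshape at the end" idea can be pushed further by replacing the per-character dictionary lookup with a NumPy lookup table. This is a sketch of a swapped-in technique, not the method from the answer above: `LOOKUP` and `encode_with_lookup` are hypothetical names, and it assumes the input is a list of equal-length, ASCII-only strings. Unlike the dictionary version, a character outside the six-letter alphabet does not raise a KeyError here; it silently encodes as all zeros.

```python
import numpy as np

# Hypothetical lookup table: the row index is the ASCII code of a base.
# Rows for characters outside the alphabet stay all zeros.
LOOKUP = np.zeros((256, 4), dtype=np.uint8)
for base, bits in {
    "A": (1, 0, 0, 0),
    "G": (0, 1, 0, 0),
    "C": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
    "X": (0, 0, 0, 0),
    "N": (1, 1, 1, 1),
}.items():
    LOOKUP[ord(base)] = bits


def encode_with_lookup(sequences):
    """One-hot encode equal-length ASCII sequences via a lookup table."""
    n, length = len(sequences), len(sequences[0])
    # Join all sequences into one byte string and view it as uint8 codes.
    codes = np.frombuffer("".join(sequences).encode("ascii"), dtype=np.uint8)
    # Integer-array indexing maps every code to its 4-element row at once.
    return LOOKUP[codes].reshape(n, length, 4, 1)


encoded = encode_with_lookup(["AGT", "AAT"])
print(encoded.shape)  # (2, 3, 4, 1)
```

With dtype uint8, the full data set of 1,688,119 sequences of 101 characters needs 1,688,119 × 101 × 4 = 682,000,076 bytes, roughly 682 MB. Whether this version is actually faster on a given machine is something to measure rather than assume.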

RootTwo · Accepted Answer · 2022-07-30 23:57:46Z
