\$\begingroup\$

I've recently developed two functions that convert a list of strings like this (in my case the strings are 101 characters long):

['AGT', 'AAT']

To a numpy array:

array([[[[1],
 [0],
 [0],
 [0]],
 [[0],
 [1],
 [0],
 [0]],
 [[0],
 [0],
 [0],
 [1]]],
 [[[1],
 [0],
 [0],
 [0]],
 [[1],
 [0],
 [0],
 [0]],
 [[0],
 [0],
 [0],
 [1]]]])

The shape of this array is (2, 3, 4, 1) in this case.

At the moment, my code defines one function that builds a dictionary and maps it over a single input string, like so:

def sequence_one_hot_encoder(seq):

    import numpy as np
    mapping = {
        "A": [[1], [0], [0], [0]],
        "G": [[0], [1], [0], [0]],
        "C": [[0], [0], [1], [0]],
        "T": [[0], [0], [0], [1]],
        "X": [[0], [0], [0], [0]],
        "N": [[1], [1], [1], [1]]
    }
    encoded_seq = np.array([mapping[i] for i in str(seq)])
    return(encoded_seq)

Following from this, I then create another function to map this function to my list of strings:

def sequence_list_encoder(sequence_file):

    import numpy as np

    one_hot_encoded_array = np.asarray(list(map(sequence_one_hot_encoder, sequence_file)))
    print(one_hot_encoded_array.shape)
    return(one_hot_encoded_array)

At the moment, for a list containing 1,688,119 strings of 101 characters, it's taking around 7-8 minutes. I was curious if there was a better way of rewriting my two functions to reduce runtime?

Toby Speight
asked Jul 28, 2022 at 12:10
\$\endgroup\$

1 Answer

\$\begingroup\$

sequence_one_hot_encoder(seq) builds an array of shape (len(seq), 4, 1). sequence_list_encoder() collects all of these in a Python list and then converts the list into an array of shape (number_of_sequences, len(seq), 4, 1). That conversion appears to carry a lot of overhead. It is much faster to build one_hot_encoded_array as a 1-D array and set its shape at the end.

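import numpy as np

def sequence_list_encoder(sequence_file):
    # Flat tuples instead of nested lists; the trailing axis comes from the final shape.
    mapping = {
        "A": (1, 0, 0, 0),
        "G": (0, 1, 0, 0),
        "C": (0, 0, 1, 0),
        "T": (0, 0, 0, 1),
        "X": (0, 0, 0, 0),
        "N": (1, 1, 1, 1)
    }
    # sequence_file is read as an open text file with one sequence per line.
    sequences = sequence_file.read().splitlines()
    # Flatten every one-hot bit of every character into a single 1-D list.
    bits = [b for seq in sequences for ch in seq for b in mapping[ch]]
    one_hot_encoded_array = np.fromiter(bits, dtype=np.uint8)
    # Assumes every sequence has the same length as the first one.
    one_hot_encoded_array.shape = (len(sequences), len(sequences[0]), 4, 1)
    return one_hot_encoded_array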

This runs in about 1/5 of the time your code takes.
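The same idea of avoiding per-sequence arrays can be pushed further by replacing the dictionary lookup with NumPy indexing. The following is only a sketch of that variation, not something I have timed against the function above. It assumes the input is a list of equal-length ASCII strings (as in the question) rather than a file object; the names LOOKUP and encode_with_lookup are made up for illustration.

```python
import numpy as np

# Hypothetical lookup table (not part of the code above): one row of
# four bits per possible byte value, indexed by the character's ASCII code.
LOOKUP = np.zeros((256, 4), dtype=np.uint8)
for symbol, bits in {
    "A": (1, 0, 0, 0),
    "G": (0, 1, 0, 0),
    "C": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
    "X": (0, 0, 0, 0),
    "N": (1, 1, 1, 1),
}.items():
    LOOKUP[ord(symbol)] = bits


def encode_with_lookup(sequences):
    """One-hot encode a list of equal-length ASCII strings."""
    seq_len = len(sequences[0])
    # Join all strings and view the resulting bytes as uint8 character codes.
    codes = np.frombuffer("".join(sequences).encode("ascii"), dtype=np.uint8)
    # Integer-array indexing maps every code to its 4-bit row in one step.
    return LOOKUP[codes].reshape(len(sequences), seq_len, 4, 1)
```

One difference in behaviour to be aware of: a character that is not in the table is silently encoded as all zeros (the same as "X"), whereas the dictionary version raises a KeyError. The final reshape also relies on every string having the same length.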

answered Jul 30, 2022 at 23:57
\$\endgroup\$
