I'm a beginner at python, and this is my first major project. Inspired by this reddit post, I wrote a program that accepts a DNA sequence (in a FASTA file format) and generates a graphical representation enter image description here
Ebola Virus
I've optimized it as much as I can, but I've hit a bottleneck with generating the actual image and rendering the path onto it. I'm run into memory errors with larger source files. Here are some of the source file's I'm testing. I can render everything except for the Fruit Fly genome, which crashes before completion.
Does anyone have any tips for refactoring and optimizing this? Particularly I've been wondering if I could break up the image into chunks, render them one by one, save them to disk (like with the array
class), and then stitch them back together but I can't figure out a way to.
#!/usr/bin/env python3
# Python 3.6
from PIL import Image, ImageDraw
from re import sub
from copy import deepcopy, copy
from os import listdir, path, makedirs
from shutil import rmtree
import pickle
filepath = input("""What is the input file? FASTA format recommended (path/to/file)
?> """)
file = open(filepath,'r')
print("Serializing %s ..."%filepath)
raw = ''.join([n for n in file.readlines() if not n.startswith('>')]).replace('\n',"").lower()
file.close()
del file
raw = sub(r'[rykmswbdhv-]', "n", raw) # Handles miscellaneous FASTA characters
raw = sub(r'[^atgcn]', "", raw) # Handles 4 bases and not-lnown
sequence = deepcopy(list(raw)) # Completed filtered list containing all the bases
del sub, raw
print("The input file has been serialized. (%s items) Calculating path..." %str(len(sequence)))
action = { # The different bases and their respective colours and movements
"a": ((0,255,0),0,-1), #green - Moves up
"t": ((255,0,0),0,1), #red - Moves Down
"g": ((255,0,255),-1,0), #hot pink - Moves Left
"c": ((0,0,255),1,0), #blue - Moves Right
"n": ((0,0,0),0,0), #black - Stays on spot
}
class array(): # Array class that dynamically saves temp files to disk to conserve memory
def __init__(self):
self.a = []
self.maxmem = int(5e6) # How much data should be let accumulated in memory before being dumped to disk?
# 1e6 ~ 20,000mb and 5e6 ~ 100,000mb
self.fc = 0 # File Count
if path.exists("temp"):
rmtree('temp')
makedirs("temp")
self.path = "temp/temp%d.dat" # Name of temp files
def append(self,n):
self.a.append(n)
if len(self.a) >= self.maxmem:
self.fc += 1
with open(self.path%self.fc,'wb') as pfile:
pickle.dump(self.a,pfile) # Dump the data
del self.a[:]
def setupiterate(self):
if len(self.a) > 0:
self.fc += 1
with open(self.path%self.fc,'wb') as pfile:
pickle.dump(self.a,pfile)
self.maxfc = copy(self.fc)
self.fc = 0
def iterate(self): # This is called in a loop
self.fc += 1
with open(self.path%self.fc,'rb') as pfile:
return pickle.load(pfile) # Get the data
count = [[0,0],[0,0]] # Top left and bottom right corners of completed path
curr = [0,0]
pendingactions = array()
for i in sequence:
#get the actions associated from dict
curr[0] += action[i][1]
curr[1] += action[i][2]
if curr[0] > count[0][0]:
count[0][0] = curr[0]
elif curr[0] < count[1][0]:
count[1][0] = curr[0]
if curr[1] > count[0][1]:
count[0][1] = curr[1]
elif curr[1] < count[1][1]:
count[1][1] = curr[1]
pendingactions.append((copy(curr),action[i][0]))
pendingactions.setupiterate()
del sequence, deepcopy, copy
# Final dimensions of image + 10px border
dim = (abs(count[0][0]-count[1][0])+20,abs(count[0][1]-count[1][1])+20)
print("The path has been calculated. Rendering image... %s"%("("+str(dim[0])+"x"+str(dim[1])+")"))
img = Image.new("RGBA", dim, None) # Memory intensive with larger source files
draw = ImageDraw.Draw(img)
for i in range(0,pendingactions.maxfc):
for j in pendingactions.iterate(): # This method returns a single file's worth of data
draw.point((j[0][0]+abs(count[1][0])+10,j[0][1]+abs(count[1][1])+10), fill=j[1]) # Plots a point for every coordinate on the path
def mean(n): # I couldn't find an average function in base python
s = float(n[0] + n[1])/2
return s
# Start and End points are dynamically sized to the dimensions of the final image
draw.ellipse([((abs(count[1][0])+10)-round(mean(dim)/500),(abs(count[1][1])+10)-round(mean(dim)/500)),((abs(count[1][0])+10)+round(mean(dim)/500),(abs(count[1][1])+10)+round(mean(dim)/500))], fill = (255,255,0), outline = (255,255,0)) #yellow
draw.ellipse([((curr[0]+abs(count[1][0])+10)-round(mean(dim)/500),(curr[1]+abs(count[1][1])+10)-round(mean(dim)/500)),((curr[0]+abs(count[1][0])+10)+round(mean(dim)/500),(curr[1]+abs(count[1][1])+10)+round(mean(dim)/500))], fill = (51,255,255), outline = (51,255,255)) #neon blue
del count, curr, mean, dim, draw, ImageDraw
print("The image has been rendered. Saving...")
loc = '%s.png'%filepath.split(".", 1)[0]
img.save(loc)
img.close()
del img
print("Done! Image is saved as: %s"%loc)
rmtree('temp')
print("Temp files have been deleted.")
input("Press [enter] to exit")
-
\$\begingroup\$ Interesting and beautiful question! By any chance, would you know where input files could be found ? Thanks \$\endgroup\$SylvainD– SylvainD2018年04月11日 13:20:50 +00:00Commented Apr 11, 2018 at 13:20
-
\$\begingroup\$ @Josay I got some of the organisms I wanted to sequence from wikipedia and the actual genomes themselves from the National Center for Biotechnical Information I've linked several genomes in the dropbox folder I shared in my post. \$\endgroup\$logwet– logwet2018年04月11日 14:51:08 +00:00Commented Apr 11, 2018 at 14:51
-
\$\begingroup\$ Oh, indeed, I did not notice, sorry for that and congratulations for the high quality of your question. \$\endgroup\$SylvainD– SylvainD2018年04月11日 14:53:05 +00:00Commented Apr 11, 2018 at 14:53
1 Answer 1
You should make your array
class behave a bit more like a Python list. It would be nice if there was a nice representation of it, if you could extend
the list, if you could just iterate over it without having to worry about setting up the iteration first or even initialize it with some values already in the array. Currently you can also not have more than one instance of array
, because it deletes the temp directory upon initialization
All of this can be achieved using the so-called magic or dunder-methods. These are specially named methods that get automatically called by Python in certain circumstances. For a nice representation this is __repr__
, if you call print(x)
or str(x)
, it is __str__
and to iterate it is __iter__
. With these (and a few more methods) your class becomes:
from itertools import islice
from os import path, makedirs
from shutil import rmtree
import pickle
class array():
"""1D Array class
Dynamically saves temp files to disk to conserve memory"""
def __init__(self, a=None, maxmem=int(5e6)):
# How much data to keep in memory before dumping to disk
self.maxmem = maxmem
# 1e6 ~ 20,000mb and 5e6 ~ 100,000mb
self.fc = 0 # file counter
# make a unique subfolder (unique as long as the array exists)
self.uuid = id(self)
self.dir = ".array_cache/%d" % self.uuid
if path.exists(self.dir):
rmtree(self.dir)
makedirs(self.dir)
self.path = self.dir + "/temp%d.dat" # Name of temp files
self.a = []
if a is not None:
self.extend(a)
def append(self, n):
"""Append n to the array.
If size exceeds self.maxmem, dump to disk
"""
self.a.append(n)
if len(self.a) >= self.maxmem:
with open(self.path % self.fc, 'wb') as pfile:
pickle.dump(self.a, pfile) # Dump the data
self.a = []
self.fc += 1
def extend(self, values):
"""Convenience method to append multiple values"""
for n in values:
self.append(n)
def __iter__(self):
"""Allows iterating over the values in the array.
Loads the values from disk as necessary."""
for fc in range(self.fc):
with open(self.path % fc, 'rb') as pfile:
yield from pickle.load(pfile)
yield from self.a
def __repr__(self):
"""The values currently in memory"""
s = "[..., " if self.fc else "["
return s + ", ".join(map(str, self.a)) + "]"
def __getitem__(self, index):
"""Get the item at index or the items in slice.
Loads all dumps from disk until start of slice for the latter."""
if isinstance(index, slice):
return list(islice(self, index.start, index.stop, index.step))
else:
fc, i = divmod(index, self.maxmem)
with open(self.path % fc, 'rb') as pfile:
return pickle.load(pfile)[i]
def __len__(self):
"""Length of the array (including values on disk)"""
return self.fc * self.maxmem + len(self.a)
I also added docstrings
to describe what the methods do and made the size an optional parameter (for testing purposes, mostly)
You can use this class like this:
x = array(maxmem=5)
x.extend(range(21))
print(x)
# [..., 20]
list(x)
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
x[4]
# 4
x[:4]
# [0, 1, 2, 3]
len(x)
# 21
for i in x:
print(i)
# 0
# 1
# 2
# ...
x2 = array(range(10), maxmem=9)
list(x2)
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
This implementation is still far from perfect. Indeed extend
could probably be implemented better, by extending in batches of self.maxmem
(taking care of the first slice). Adding elements at an arbitrary location is not implemented (__setitem__
), and neither is deleting (__delitem__
). (削除) It would be better if the path was also configurable, so if you have two instances of , and so on. But for now those are left as an exercise for the reader...array
, they don't overwrite each other (削除ここまで)
-
\$\begingroup\$ Thanks for the help :D. I've implemented your class and I've managed to write _setitem_ and _delitem_ magic methods. Your class is so much more optimized than mine. It uses a fifth of the memory and runs faster too! Also thanks for your tips on using generator expressions, I could never get my head around them. \$\endgroup\$logwet– logwet2018年04月12日 03:52:58 +00:00Commented Apr 12, 2018 at 3:52
-
\$\begingroup\$ @SachilaHerarh You're welcome, yeah generator expressions take some getting used to. But I was kind of hoping somebody else would review the rest of your code. You might want to ask a follow-up question (linking to this one), because I think there is still a lot to gain there :-) \$\endgroup\$Graipher– Graipher2018年04月12日 04:33:29 +00:00Commented Apr 12, 2018 at 4:33
-
1\$\begingroup\$ I asked another question regarding optimizing PIL. \$\endgroup\$logwet– logwet2018年04月14日 09:42:53 +00:00Commented Apr 14, 2018 at 9:42
Explore related questions
See similar questions with these tags.