12
\$\begingroup\$

I've been giving my grade tens an introduction to computational phylogenomics, based around computing dissimilarity between DNA strands, and a simplified model of mutation. They've been exploring how mutations introduce variation, and how dissimilarity with the progenitor increases with the number of generations. We're working towards and understanding of a molecular clock.

So far they've developed "algorithms" of their own and been doing this by hand, but next I want to give them a tool to run multiple simulations, so they can do some statistics so they can arrive at conclusions with more confidence. Rather than just giving them a black box that will magically produce results, I want them to be able to relate the computer algorithms with the algorithms they've been using, and thereby having a better understanding of how it works. Most of the students have little experience with actual programming, so I want to keep the code as simple and intuitive as possible, with few abstractions.

For example, I'm not too happy about the use of zip() and how the lists are "transposed", but that might be a necessary evil of not having proper arrays. If the code can be made less efficient and more verbose, in order to make it easier for a novice to read and understand, I want that. I've also steered away from using modules that aren't in the standard Python 3.x distribution, as I don't want to deal with the extra hassle, otherwise I probably would have used numpy. The plan is for them to simply copy/paste the code into the Python interpreter in Terminal (MacOS).

import random
import os
import csv
# Function that counts number of dissimilarities
# between two sequences
def diss(x, y):
 n = 0
 for a, b in zip(x, y):
 if a != b:
 n += 1
 return (n)
# Function that mutates a string (Dna) Gen times.
# On average 0.75 mutations per generation, as the resampling will
# pick the existing nucleotide in one out of four times.
def mutate(Dna, Gen):
 # Store a copy of the original DNA sequence that
 # the mutated sequence can be compared to.
 Dna_o = Dna[:]
 Len = len(Dna)
 Dis = []
 
 # For loop that mutates the sequence Gen times and calculates the
 # dissimilarity after each mutation.
 for _ in range(Gen): 
 # Random location for mutation
 Mut = random.randint(0, Len-1)
 # Randomly selected nucleotide
 Dna[Mut] = random.choice(['A', 'T', 'G', 'C'])
 # Calculate proportional dissimalirity
 Dis.append(diss(Dna, Dna_o)/Len)
 
 # Print proportional dissimilarities.
 return (Dis)
# Iterate (repeat) the mutate function i times
# Outputs a nested list
def mutate_i(Dna, Gen, i):
 Mat = []
 for _ in range(i):
 Mat.append(mutate(Dna, Gen))
 return (Mat)
# Function for helping export simulation result as CSV file.
def export_mutate(Mutate_mat, Filename="mutation_simulation.csv"):
 # Add a count of the generations
 Mutate_mat.insert(0, list(range(1, len(Mutate_mat[0])+1)))
 # Transpose the list (swap rows and columns)
 Mutate_mat = list(map(list, zip(*Mutate_mat)))
 
 # Add headers
 header = ["Generation"]
 for i in range(1, len(Mutate_mat[0])):
 header.append("sim_" + str(i))
 
 Mutate_mat.insert(0, header)
 
 # Export simulation data as a CSV file that can be imported to
 # Google Sheets
 with open(Filename, 'w') as csvfile:
 csvwriter = csv.writer(csvfile, delimiter=',')
 csvwriter.writerows(Mutate_mat)
 
 # Print the path to the CSV file
 return (os.path.join(os.getcwd(), Filename))
# # # Simulation start
# Original DNA sequence
Dna = list("ATGC" * 4)
# Set random seed so the simulation is repeatable
random.seed(1)
# Generate four simulations, each ten generations long
Mutate_mat = mutate_i(Dna, 12, 4)
# Export simulation results
export_mutate(Mutate_mat)

Produces this output:

Generation,sim_1,sim_2,sim_3,sim_4
1,0.0,0.0625,0.0625,0.0625
2,0.0,0.125,0.125,0.125
3,0.0,0.125,0.125,0.125
4,0.0,0.1875,0.125,0.1875
5,0.0625,0.25,0.1875,0.25
6,0.125,0.25,0.25,0.3125
7,0.1875,0.3125,0.3125,0.3125
8,0.25,0.3125,0.375,0.3125
9,0.3125,0.375,0.375,0.3125
10,0.375,0.375,0.375,0.375
11,0.3125,0.3125,0.4375,0.375
12,0.3125,0.3125,0.5,0.375
asked Jun 2 at 17:46
\$\endgroup\$
3
  • 2
    \$\begingroup\$ Why do you transpose the data? With only 10-12 "generations", my brain, for instance, would prefer to see the stages of difference for each 'simulation' appearing across one row, not down one column. Currently all simulations differ only by some random number. One sim could represent life in a toxic environment in which mutagenetic chemicals or radiation is present (Opportunity for more mutations in a single generation)... Sorry... And, I'd prefer the rows and cols "unswapped"... Time, for me, is horizontal, not vertical... \$\endgroup\$ Commented Jun 3 at 10:09
  • 1
    \$\begingroup\$ @Fe2O3: I was see-sawing between the the two options, but i landed on having reps along the columns and generations down the rows because i think this will be easier to handle once imported into Google Docs for further processing and plotting. What feels natural has a lot to do with habits and past exposure. Coming from R I'm used to multivariate time series (mts, xts, zoo) to run down the rows. \$\endgroup\$ Commented Jun 3 at 14:58
  • 1
    \$\begingroup\$ While "it's as broad as it is long", and it's yours to decide, after loading the CSV into Google Sheets it takes 3 mouse clicks to 'transpose' the cells of a sheet. Not trying to argue... Just that the code could use a simpler 1D array instead of 2D while data gathering... \$\endgroup\$ Commented Jun 3 at 21:44

4 Answers 4

9
\$\begingroup\$

Naming

As already mentioned, you should follow the PEP 8 style guide. But simply converting the leading character of a name to lower case might not be what you want to do in all cases. In function mutate you currently have

Len = len(Dna)

If you were to change this to, ...

len = len(Dna)

... this will generate an exception:

UnboundLocalError: cannot access local variable 'len' where it is not associated with a value

In other situations, such as:

x = 'abc'
dict = len(x) # refefines the dict function
print(dict(x=1))

Since dict is now 3 after its assignment, you are essentially trying to execute print(3(x=1)) and this will generate the following exception:

TypeError: 'int' object is not callable

In general, do not use a variable name that is also the name of a built-in function.

If You Don't Wish to Use zip

You could implement the following function:

def dissimilarity_count(dna_sequence1, dna_sequence2):
 """Return the number of dissimilarities between two sequences."""
 cnt = 0 # number of dissimilarities so far
 lnth = len(dna_sequence1)
 for idx in range(lnth):
 if dna_sequence1[idx] != dna_sequence2[idx]:
 cnt += 1
 return cnt

Note that I have used function and argument names that are more descriptive.

open Function Usage with csv Module

Even though this might not be an issue with your actual usage, nevertheless it would be a good idea to follow the documentation for csv, which states:

If csvfile is a file object, it should be opened with newline=''.

If newline='' is not specified, newlines embedded inside quoted fields will not be interpreted correctly, and on platforms that use \r\n linendings on write an extra \r will be added. It should always be safe to specify newline='', since the csv module does its own (universal) newline handling. So:

...
 with open(filename, 'w', newline='') as csvfile:
 ...
answered Jun 3 at 11:28
\$\endgroup\$
9
  • \$\begingroup\$ Thank you for the alternative to zip. Not as "neat", but closer to how the students already conceptualise the process, so should be much easier for them to follow. \$\endgroup\$ Commented Jun 3 at 17:44
  • \$\begingroup\$ I would have mentioned that function mutate_i could be implemented in one line using a list comprehension: return [mutate(dna, gen) for _ in range(i)]. This would be both more efficient and more Pythonic, but perhaps not as obvious to a non-prgrammer. By the way, can you think of a better name than mutate_i (something that directly tells us that something is being mutated a specified number of times)? \$\endgroup\$ Commented Jun 3 at 17:54
  • \$\begingroup\$ Early on I used some list comprehension, but decided it would be safer to use more general programming concepts. mutate_n? :) Naming things is hard. \$\endgroup\$ Commented Jun 3 at 18:30
  • \$\begingroup\$ How about mutate_n_times or mutate_multiple_times? \$\endgroup\$ Commented Jun 3 at 20:34
  • 1
    \$\begingroup\$ @AkselA Having given your project too much thought, it occurs to me the best approach would be to "linearise" the entire code as much as you can... This is contrary to "advice to coders" wherein 'encapsulate functionality in separate functions' is encouraged... For beginners, top-to-bottom recipes might be easier to follow and understand... You're not trying to impress any gurus; you want the kids to grok what's happening... Cheers! \$\endgroup\$ Commented Jun 5 at 13:33
7
\$\begingroup\$

The code adheres to many good coding practices already, and it should be simple for beginners to follow. Here are some minor suggestions.

Documentation

The PEP 8 style guide recommends adding docstrings for functions. For example, you could convert your helpful function comments into docstrings like this:

def diss(x, y):
 """ Counts number of dissimilarities between two sequences """

There is no need to use the word "Function" in this case. It would be useful to also document the input variable types and the function return type. Documentation like this will help to achieve one of your goals (to make the code more intuitive).

It would also be useful to add a docstring at the top of the code to summarize its purpose. It is helpful to mention the code creates an output CSV file in the current directory.

Naming

In the diss function, the variables do not have very meaningful names. Since n represents a count, it could be named count.

The function inputs, in addition to having descriptions in the docstring, could have names with "seq" in them.

PEP 8 recommends snake_case for variable names. For example:

def mutate(Dna, Gen):

would be:

def mutate(dna, gen):

Comments

The comment doesn't seem to match the code here ("ten" vs. "12"):

# Generate four simulations, each ten generations long
Mutate_mat = mutate_i(Dna, 12, 4)

Layout

Simple return statements like this:

return (n)

are commonly written without the parentheses:

return n

There are lines with just a single comment character (#) on them:

Dis = []
#
# For loop that mutates the sequence Gen times and calculates the 

The code is cleaner with a true blank line:

Dis = []
# For loop that mutates the sequence Gen times and calculates the 

Simpler

Lines like this:

 header.append("sim_" + str(i))

can be simplified using an f-string:

 header.append(f"sim_{i}")
answered Jun 2 at 18:50
\$\endgroup\$
3
  • \$\begingroup\$ Thank you for the suggestions. Some linting already fixed some of your reservations, and I've added docstrings. Though now that I've started documenting the code I feel like a repo is the next step :P \$\endgroup\$ Commented Jun 3 at 15:03
  • \$\begingroup\$ I ended up following @user286929 recommendation to avoid mismatches between numbers in code and comments. \$\endgroup\$ Commented Jun 3 at 17:19
  • 1
    \$\begingroup\$ If you really want to follow PEP 8 (and PEP 257 it delegates to) about docstrings, you'd remove the surrounding spaces, end it with a period, and say "Count" instead of "Counts". \$\endgroup\$ Commented Jun 4 at 11:36
7
\$\begingroup\$

Python's true and false equate to 1 and 0 respectively.
Save one line of code with:

def diss(x, y):
 n = 0
 for a, b in zip(x, y):
# n += ( a != b ) editted to improve clarity with next line
 n += ( a != b ) # true = 1, and false = 0
 return n

using optional parentheses to improve clarity of the code.


If it were up to me, I'd move the initialisation of the original DNA string inside mutate_i().
It doesn't appear to be needed at any higher level.


As already noted by @toolic, those constants 12 and 4 should be named, not magic numbers whose purpose is explained by a (erroneous) comment.

NUM_GEN = 12 # number of generations along each run
NUM_SIM = 4 # number of simulations to run
Mutate_mat = mutate_i(NUM_GEN, NUM_SIM)

Unless this program is destined to become more complicated, those values might be simpler if declared inside mutate_i(), rather than being passed as parameters (one of which becomes an anonymous i...)


Will the real "original DNA" please stand up?

def mutate(Dna, Gen):
 # yadda-yadda
 # yadda-yadda yadda yadda
 Dna_o = Dna[:]
 ...

versus

def mutate(Dna_o, Gen):
 Dna = Dna_o[:] # working copy of original DNA
 ...

"Working copies" are made from (unchanged) parameters passed.
"Preserving copies of originals" is putting the horse after the cart.


Give serious thought to using variations on the names of variables used. Code newbies may become confused about whether Dna is referring to a local or 'global' variable.


Proofread the code from the perspective of a coding neophyte.

# Outputs a nested list <-- Not so good
# Returns a matrix of results for preserving as a CSV file. <-- More honest
def export_mutate(Mutate_mat, Filename="mutation_simulation.csv"):
 ...
# for i in range(1, len(Mutate_mat[0])):
 for i in range(1, NUM_SIM): # if only this constant were available here...

And...

 Dis.append(diss(Dna, Dna_o)/Len)

Really? Counting the number of mutations, calculating that as a fraction of the length, AND poking the result into an array... All at the same time! What's a beginner to make of this? Are you hoping to teach coding Python or how genetic mutation works across generations.

Think through all the Pythonesque techniques you've used that will bewilder a beginner. Seek to simplify (like no transposition of the matrix) so as not to lose any who might struggle to understand.


With some effort, this could be re-written to open the file, write the header, then loop generating all the data of a single simulation, write that data as a CSV, then loop generating all of the next simulation, and so on. As stated in a comment, I view time as passing left-to-right, and discrete rows for each different simulation.

EDIT:

I want to keep the code as simple and intuitive as possible, with few abstractions.

There's been a small bit of discussion in various comments regarding the orientation of the output results (ie. generation increments vertical or horizontal). The algorithm used implements "row major" population of the data store wherein there is no cross-row dependence of the data. Things are unnecessarily made more complicated by transposing the matrix and invoking a "CSV writing" module for transfer to Google Sheets.

There are no 'embedded comma/quote' concerns that require a robust CSV writer.

Simplifying the output to write each value with a print statement would eliminate much of this code's excursion away from its central purpose of simply generating some results for downstream analysis.

"Generations" across the horizontal rows, and "simulation runs" as successive rows. Transposing and decorative headings can be added, if desired, once the data values are in a Google Sheets table. KISS!


Phylogenomics is outside of my wheelhouse (having seen this word for the first time in this question.) You've noted that 0.625 is, in part, related to a "mutation" iteration that replaces an 'A' with an 'A', or 'G' with another 'G' (nett zero)... Yet every iteration is scored by comparing it to the original progenitor sequence.

Shouldn't every generation be compared to its own predecessor? Shouldn't the statistics account for 'A' -> 'G' -> 'A' across several generations whereby a subsequence may be partially restored in a descendent? This would have been two actual mutations, but, when compared to the original, the second would have 'undone' the first. Perhaps, for realism, more than one mutation per generation should occur occasionally?


To make this exercise realistic (ie. Darwin's Natural Selection), although it is much more complex, consider teaching the kids about "codons". There could be arbitrary 'good', 'bad' and 'neutral' (aka "junk DNA") codons in a longer "string" of DNA. Mutation still happens, but each generation might require, say, >3 'good' codons and <2 'bad' codons ('neutral' codons have no impact) to survive into the next generation. (John Conway's "Game of Life"). A 'bad' or 'neutral' codon may mutate to become 'good', or whatever... Insufficient 'good' or too many 'bad' codons means the lineage dies out and the 'molecular clock' stops (aka "extinction").


Faulty premise

Consider an (unlikely but plausible) edge case:
In the first mutation, one particular base is replaced; say: 'A' -> 'T'. In an unlikely sequence - but possible from a fair RNG - only that same character is ever replaced:
T -> G -> C -> T -> C -> G -> C -> G -> T ...
That run of the simulation would flat line indicating only one mutation ('A' -> something not 'A') even though there is variation between every generation... Less unlikely edge cases would, similarly yield (perhaps marginally less-) misleading values and still corrupt any statistical results.

It's looking like less time should have been given to dealing with the complexities of Python and its arrays, lists and CSVs; and more thought given to how to produce data that would demonstrate the use for Phylogenomics.
Oversimplification can fuzz important nuances out of existence.

answered Jun 3 at 11:08
\$\endgroup\$
4
  • \$\begingroup\$ Thank you for the suggestions, most of them were solved in different ways. About "original sequence" vs. "working copy", we've so far been thinking of chinese whispers and making copies of copies on a copying machine in order to get an intuition of the sort of corruption that occurs when cells divide and DNA is copied. In this context I think talking about an "original sequence" and a "generation n copy" fits best. \$\endgroup\$ Commented Jun 3 at 18:03
  • \$\begingroup\$ About your last two paragraphs: Dissimilarity between successive generations is (assumed) stationary, but mutations accumulate, so we're interested in the dissimilarity between the original and each successive mutation. One exercise I plan for the students is to plot a curve of how dissimilarity develops over time, and take an average of this across multiple simulations. Then I want them to reason why the curve flattens out, what value it flattens out at, and why it flattens out at this value. \$\endgroup\$ Commented Jun 3 at 18:15
  • \$\begingroup\$ Your point about "Chinese Whispers" (also known as "Telephone") makes sense for forward 'copying errors'. Program execution, however, is heirarchical, not linear. 'Callees' terminate earlier than 'callers'. The contrast might make a good "teaching point" for class... You could consider using recursion instead of iteration... \$\endgroup\$ Commented Jun 3 at 22:02
  • \$\begingroup\$ Re: "plotting" and "curves"... This, to me, supports the idea that "time is horizontal; not vertical"... (Although I may be missing something.) People perceive information in different ways. My background makes me struggle to "keep straight" the data as presented in the question. \$\endgroup\$ Commented Jun 3 at 23:24
6
\$\begingroup\$

I think the most confusing line for a beginner is:

Mutate_mat = list(map(list, zip(*Mutate_mat)))

I would personally remove the map for beginners in a course that isn't for python because it's just one more thing for them to learn.

Mutate_mat = list(zip(*Mutate_mat))

works exactly the same except that it is a list of tuples instead of a list of lists but you don't need it to be mutable past that point anyways

You could also avoid the zip altogether and write a transpose function:

def transpose_2d(array):
 """
 Transposes a list of lists, assuming that the inner lists
 are of the same length i.e. an M x N array.
 """
 M = len(array)
 N = len(array[0])
 
 transposed = []
 for i in range(N):
 sub_list = []
 for j in range(M):
 sub_list.append(array[j][i])
 transposed.append(sub_list)
 return transposed
Mutate_mat = transpose_2d(Mutate_mat)

Personally I wouldn't do this because I think the extra code to read through is more of a burden than the zip statement so I would use list(zip(*Mutate_mat)) but just an option.


I might also change this line:

Mutate_mat.insert(0, list(range(1, len(Mutate_mat[0])+1)))

to:

generation_count = list(range(1, len(Mutate_mat[0])+1))
Mutate_mat.insert(0, generation_count)

I think it is harder for beginners when there are multiple things happening on a single line. Also splitting up the lines allows you to choose a variable name that describes what that part of the code is.


I would also try to avoid short forms as long as variable names aren't too long. just makes it easier to read in my opinion.

Gen -> generation
diss -> dissimilarities
Dna_o -> original_dna

The suggestion to avoid magic numbers was already made but I just wanted to say that this comment really illustrates the point:

# Generate four simulations, each ten generations long
Mutate_mat = mutate_i(Dna, 12, 4)

You've changed those numbers so now isn't not clear from this comment which argument is number of simulations and which argument is number of generations.

NUM_GEN = 12 # number of generations along each run
NUM_SIM = 4 # number of simulations to run

at the top so it's easy to find the variable you want to play with is really helpful and makes it easier to make comments that don't become obsolete

# Generate NUM_SIM simulations, each NUM_GEN generations long
Mutate_mat = mutate_i(Dna, NUM_GEN , NUM_SIM )
answered Jun 3 at 13:45
\$\endgroup\$
2
  • \$\begingroup\$ Thank you for your good suggestions, Ive implemented most of them. \$\endgroup\$ Commented Jun 3 at 17:17
  • \$\begingroup\$ "at the top" ... If the 'constants' are at global scope, passing them as function parameters (in a tiny program) is not warranted (and may only confuse...) Don't anticipate what the next version of the code may require. Write to current requirements. \$\endgroup\$ Commented Jun 3 at 22:11

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.