Reading an Excel file and comparing the amino acid sequence of each data pair

Question 1

Since I am fairly new to Python I was wondering whether anyone can help me by making the code more efficient. I know the output stinks; I will be using Pandas to make this a little nicer.

from xlrd import *
def main():
 '''This Proram reads input (A:clone name, B:sequence, C:elisa) from an
 Excel file and makes a cross comparison of each sequence pair'''
 book = open_workbook("mtask.xlsx")
 Input = book.sheet_by_index(0)
 # naming of input data
 a = (Input.col_values(0,0))
 b = (Input.col_values(1,0))
 c = (Input.col_values(2,0))
 # make dictionary: keys are seq numbers; values are residues 
 y = {}
 for i in range(Input.nrows):
 x = []
 for j in b[i]:
 x.append(j)
 y[a[i]] = x
 # comparison of sequences and extraction of mutations for each sequence pair
 List = []
 for shit in range(Input.nrows):
 for seq in range(Input.nrows):
 seq12 = []
 z = 0
 for i in y[a[seq]]:
 try:
 for j in y[a[shit]][z]:
 if i == j:
 seq12.append(i.lower()+j.lower())
 else:
 seq12.append(i+j)
 z = z+1
 except IndexError:
 print("oops")
 lib = [a[seq],a[shit],c[seq],c[shit]]
 for position, item in enumerate(seq12):
 if item.isupper():
 x = (str(item[0])+str(position+1)+str(item[1]))
 lib.append(x)
 List.append(lib)
 # comparison of sequences and extraction of mutations for each sequence pair
 dic = {}
 for i in range(Input.nrows*Input.nrows):
 x = []
 for j in List[i]:
 x.append(j)
 dic[i] = x
 # sort
 a = []
 for i in dic.values():
 a.append(i)
 # collect number of mutations in data files
 import csv
 null = []
 one = []
 two = []
 three = []
 four = []
 five = []
 six = []
 seven = []
 eight = []
 nine = []
 ten = []
 for i in range(Input.nrows*Input.nrows):
 if len(a[i]) <= 4:
 null.append(a[i])
 with open("no_mut.csv", "w", newline="") as f:
 writer = csv.writer(f)
 writer.writerows(null)
 elif len(a[i]) == 5:
 one.append(a[i])
 with open("one.csv", "w", newline="") as f:
 writer = csv.writer(f)
 writer.writerows(one)
 elif len(a[i]) == 6:
 two.append(a[i])
 with open("two.csv", "w", newline="") as f:
 writer = csv.writer(f)
 writer.writerows(two)
 elif len(a[i]) == 7:
 three.append(a[i])
 with open("three.csv", "w", newline="") as f:
 writer = csv.writer(f)
 writer.writerows(three)
 elif len(a[i]) == 8:
 four.append(a[i])
 with open("four.csv", "w", newline="") as f:
 writer = csv.writer(f)
 writer.writerows(four)
 elif len(a[i]) == 9:
 five.append(a[i])
 with open("five.csv", "w", newline="") as f:
 writer = csv.writer(f)
 writer.writerows(five)
 elif len(a[i]) == 10:
 six.append(a[i])
 with open("six.csv", "w", newline="") as f:
 writer = csv.writer(f)
 writer.writerows(six)
 elif len(a[i]) == 11:
 seven.append(a[i])
 with open("seven.csv", "w", newline="") as f:
 writer = csv.writer(f)
 writer.writerows(seven)
 elif len(a[i]) == 12:
 eight.append(a[i])
 with open("eight.csv", "w", newline="") as f:
 writer = csv.writer(f)
 writer.writerows(eight)
 elif len(a[i]) == 13:
 nine.append(a[i])
 with open("nine.csv", "w", newline="") as f:
 writer = csv.writer(f)
 writer.writerows(nine)
 elif len(a[i]) == 14:
 ten.append(a[i])
 with open("ten.csv", "w", newline="") as f:
 writer = csv.writer(f)
 writer.writerows(ten)
main()

Question 2

Follow PEP8

PEP8 is the official coding style guide of Python. You have several notable violations:

Avoid wildcard imports like from xlrd import *. It makes it hard to tell which names in the code come from xlrd. frim xlrd import open_workbook would be much better and easier to understand.
Use snake_case for variable names. The variables Input and List violate this, but I admit these are a bit tricky because input and list could shadow builtin names, so you need to find something better instead. Instead of Input, sheet seems a clear choice, and instead of List, something that describes what kind of list it is would be good, for example sequence_pairs

These are just examples, there are many more PEP8 violations. Use the pep8 command line utility (you can install with pip install --user pep8) to detect all the PEP8 violations in your source files.

Although it's not a PEP8 violation, but for better readability I recommend extra spaces around operators, and to remove unnecessary parentheses. So for example:

Instead of: x = (str(item[0])+str(position+1)+str(item[1]))
Write like this: x = (str(item[0]) + str(position+1) + str(item[1]))

Use list comprehensions

List comprehensions are a powerful and elegant sexy feature of Python. For example instead of this:

x = []
for j in b[i]:
 x.append(j)
y[a[i]] = x

You can write simply:

y[a[i]] = [j for j in b[i]]

But actually, if you just want to clone a list, this is the best:

y[a[i]] = b[i][:]

Cleaning up the csv writing

This of course is a complete mess:

null = []
one = []
two = []
# ...
for i in range(Input.nrows*Input.nrows):
 if len(a[i]) <= 4:
 null.append(a[i])
 with open("no_mut.csv", "w", newline="") as f:
 writer = csv.writer(f)
 writer.writerows(null)
 elif len(a[i]) == 5:
 one.append(a[i])
 with open("one.csv", "w", newline="") as f:
 writer = csv.writer(f)
 writer.writerows(one)
 elif len(a[i]) == 6:
 # ...

The code duplication might actually be the smaller of two big problems. The biggest problem is that each file is potentially rewritten multiple times. I don't know what kind of data you have, but for example if you have 10 rows with len(a[i]) == 5, then the file one.csv will be first written with 1 line, then rewritten with 2 lines, 3 lines, ..., in the end 10 lines. This is crazy. You should rework this to write each file only once.

Here's an untested approach that I think should work, and solves both the multiple writing and the code duplication:

files = (
 (lambda length: length <= 4, "no_mut.csv", []),
 (lambda length: length == 5, "one.csv", []),
 (lambda length: length == 6, "two.csv", []),
 (lambda length: length == 7, "three.csv", []),
 # ... and so on
)
for i in range(sheet.nrows * sheet.nrows):
 for matcher, _, rows in files:
 if matcher(a[i]):
 rows.append(a[i])
 break
for _, filename, rows in files:
 with open(filename, 'w', newline='') as fh:
 writer = csv.writer(fh)
 writer.writerows(rows)

Question 3

This is a common misconception, but PEP 8 only requires spaces around assignment and comparison operators, for others it only requires equal amounts of whitespace on both sides. It does recommend using space to point out operators with different priorities, see here, but that doesn't really apply to the example you chose.

Question 4

If you meant str(position+1) instead of str(position + 1), I corrected it now, thanks! If you meant something else please let me know!

Question 5

Even the spaces between the str calls are not mandatory. I think they do increase readability, but PEP 8 leaves it to you to decide.

Question 6

On closer look, you're absolutely right! I rephrased that in my answer, and thanks a lot for pointing out!

janos janos 113k15 gold badges154 silver badges396 bronze badges · Answer 1 · 2015-02-22 10:10:22Z

Follow PEP8

PEP8 is the official coding style guide of Python. You have several notable violations:

Avoid wildcard imports like from xlrd import *. It makes it hard to tell which names in the code come from xlrd. frim xlrd import open_workbook would be much better and easier to understand.
Use snake_case for variable names. The variables Input and List violate this, but I admit these are a bit tricky because input and list could shadow builtin names, so you need to find something better instead. Instead of Input, sheet seems a clear choice, and instead of List, something that describes what kind of list it is would be good, for example sequence_pairs

These are just examples, there are many more PEP8 violations. Use the pep8 command line utility (you can install with pip install --user pep8) to detect all the PEP8 violations in your source files.

Although it's not a PEP8 violation, but for better readability I recommend extra spaces around operators, and to remove unnecessary parentheses. So for example:

Instead of: x = (str(item[0])+str(position+1)+str(item[1]))
Write like this: x = (str(item[0]) + str(position+1) + str(item[1]))

Use list comprehensions

List comprehensions are a powerful and elegant sexy feature of Python. For example instead of this:

x = []
for j in b[i]:
 x.append(j)
y[a[i]] = x

You can write simply:

y[a[i]] = [j for j in b[i]]

But actually, if you just want to clone a list, this is the best:

y[a[i]] = b[i][:]

Cleaning up the csv writing

This of course is a complete mess:

null = []
one = []
two = []
# ...
for i in range(Input.nrows*Input.nrows):
 if len(a[i]) <= 4:
 null.append(a[i])
 with open("no_mut.csv", "w", newline="") as f:
 writer = csv.writer(f)
 writer.writerows(null)
 elif len(a[i]) == 5:
 one.append(a[i])
 with open("one.csv", "w", newline="") as f:
 writer = csv.writer(f)
 writer.writerows(one)
 elif len(a[i]) == 6:
 # ...

The code duplication might actually be the smaller of two big problems. The biggest problem is that each file is potentially rewritten multiple times. I don't know what kind of data you have, but for example if you have 10 rows with len(a[i]) == 5, then the file one.csv will be first written with 1 line, then rewritten with 2 lines, 3 lines, ..., in the end 10 lines. This is crazy. You should rework this to write each file only once.

Here's an untested approach that I think should work, and solves both the multiple writing and the code duplication:

files = (
 (lambda length: length <= 4, "no_mut.csv", []),
 (lambda length: length == 5, "one.csv", []),
 (lambda length: length == 6, "two.csv", []),
 (lambda length: length == 7, "three.csv", []),
 # ... and so on
)
for i in range(sheet.nrows * sheet.nrows):
 for matcher, _, rows in files:
 if matcher(a[i]):
 rows.append(a[i])
 break
for _, filename, rows in files:
 with open(filename, 'w', newline='') as fh:
 writer = csv.writer(fh)
 writer.writerows(rows)

This is a common misconception, but PEP 8 only requires spaces around assignment and comparison operators, for others it only requires equal amounts of whitespace on both sides. It does recommend using space to point out operators with different priorities, see here, but that doesn't really apply to the example you chose.
If you meant str(position+1) instead of str(position + 1), I corrected it now, thanks! If you meant something else please let me know!
Even the spaces between the str calls are not mandatory. I think they do increase readability, but PEP 8 leaves it to you to decide.
On closer look, you're absolutely right! I rephrased that in my answer, and thanks a lot for pointing out!

Stack Exchange Network

Reading an Excel file and comparing the amino acid sequence of each data pair

1 Answer 1

Follow PEP8

Use list comprehensions

Cleaning up the csv writing

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Reading an Excel file and comparing the amino acid sequence of each data pair

1 Answer 1

Follow PEP8

Use list comprehensions

Cleaning up the csv writing

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions