Added the problem link below and my implementation of it. Took me a few hours to program. I have added comments to explain my thought process. Problem Link Goal - To implement a program that identifies to whom a sequence of DNA belongs.
should require first command-line argument the name of a CSV file containing the STR counts for a list of individuals and sits second command-line argument the name of a text file containing the DNA sequence to identify.
program should print an error message if incorrect no of arguments.
Your program should open the CSV file and dna file and read its contents into memory.
first row of the CSV file will be the column names. The first column will be the word name and the remaining columns will be the STR sequences themselves.
Your program should open the DNA sequence and read its contents into memory.
For each of the STRs (from the first line of the CSV file), your program should compute the longest run of consecutive repeats of the STR in the DNA sequence to identify.
If the STR counts match exactly with any of the individuals in the CSV file, your program should print out the name of the matching individual and no match if there are no matches.
You may assume that the STR counts will not match more than one individual.
import csv
import sys
import re
def main():
# checking if its correct no of arguments
if len(sys.argv) != 3:
print("Please specify both csv and text files")
sys.exit(1)
data = []
# opening the file into memory
with open(sys.argv[1], "r") as f:
reader = csv.DictReader(f)
for i in reader:
data.append(i)
# opening text sequence file and storing as string in seq
with open(sys.argv[2]) as f:
seq = f.read()
# getting count from sequence
countDict = count(data, seq)
name = ""
# comparing with data from csv file
name = check(countDict, data)
# printing the name of the match
if name:
print(name)
else:
print("No match")
def count(data, seq):
# getting STRs from database file so we know what STRs to search for
keys = list(data[0].keys())
keys.pop(0)
countDict = {}
# creating a dictionary to store the count for the keys with key names
for i in keys:
countDict[i] = 0
strlist = []
# looping through the STRs to see if any matches in the string and if there are adding to the count
for key in countDict:
counter = 0
for i in range(len(seq)):
end = len(key) # to get the length of the key
if key == seq[i:end+i]:
counter += 1
v = countDict[key]
if not key == seq[i+end:end+i+end]:
countDict[key] = max(v, counter)
counter = 0 # resetting the counter to 0
else:
countDict[key] = v
i += len(key)
return countDict
def check(countDict, data):
# empty string to store match name
name = ""
# going through the csv data file with potential matches ine at a time
for i in data:
# removing names from the data so we can do direct comparison with the count from dna found
dictExceptName = dict(list(i.items())[1:])
# converting string count to int count for comparison
for key, value in dictExceptName.items():
dictExceptName[key] = int(value)
# comparing each person to dna found list
if dictExceptName == countDict:
name = i["name"] # getting name of the person if there is a match
return name
main()
-
2\$\begingroup\$ While the question is excellent (perfect match for the site)! We ask that every question is self contained, what happens if your link dies? Please include the problemstatement from your link in your question =) \$\endgroup\$N3buchadnezzar– N3buchadnezzar2021年11月22日 17:09:21 +00:00Commented Nov 22, 2021 at 17:09
-
\$\begingroup\$ As a second note you would be able to read the problem statement in your code if you had added docstrings and proper documentation ;-) \$\endgroup\$N3buchadnezzar– N3buchadnezzar2021年11月22日 17:10:00 +00:00Commented Nov 22, 2021 at 17:10
-
\$\begingroup\$ also included the problem statement, and i will look into how to add docstrings and do proper documentation thanks \$\endgroup\$singhsokhal– singhsokhal2021年11月22日 17:34:39 +00:00Commented Nov 22, 2021 at 17:34
2 Answers 2
Your code seems to work properly on the whole test suite provided which is a very good start.
Also, splitting the logic in 2 parts, one preprocessing the counts and the other performing the comparison is a great way to proceeed.
However, there are still various way to improve the code.
Style
Python has a style guide called PEP 8 which is definitly worth reading and trying to apply.
Among other things, in your case, it would imply using snake_case
instead of camel_case
.
Also, tiny invisible detail: your code contains trailing whitespace which can be cleaned up.
Documentation
The functions defined have very generic name which do not bring much information about what is being done. This could be improved with a better name but also by adding documentation to your function, for instance as docstrings
More functions to count
The count
function is quite complicated because it tries to apply some logic to various elements but also needs to worry about preprocessing and storing the results in a usable form.
Here is my take on this:
def count_consecutive_repetitions(key, sequence):
"""Count the number of consecutive repetitions of string 'key' are in string 'sequence'."""
search_str = key
counter = 0
pos = 0
while True:
pos = sequence.find(search_str, pos)
if pos == -1:
return counter
counter += 1
search_str += key
def count(data, seq):
"""Get count for to be continued."""
# Extract relevant key strings from header row and get rid of first item
keys = list(data[0].keys())[1:]
# Get consecutive counts and store in a dictionnary
return { k: count_consecutive_repetitions(k, seq) for k in keys }
Note: I've:
- rewritten the algorithm to search to avoid messing with string indices
- used a dictionnary comprehension to create the dictionnary. This was not required but it is a nice tool to have in your toolkit
Also, smaller functions are also easier to test. In my case, I wrote:
assert count_consecutive_repetitions("ABC", "") == 0
assert count_consecutive_repetitions("ABC", "DEF") == 0
assert count_consecutive_repetitions("ABC", "ABDEFC") == 0
assert count_consecutive_repetitions("ABC", "ABCDEF") == 1
assert count_consecutive_repetitions("ABC", "DEFABC") == 1
assert count_consecutive_repetitions("ABC", "DEFABCABCABCABCGHI") == 4
assert count_consecutive_repetitions("ABC", "DEFABCDEFABCABCDEFABCABCABCGHI") == 3
In order to do things properly, these could have been written with a proper test framework but I was too lazy.
Improvements to check
Instead of giving name the initial value "", you could give the value
None
. It would be clearer in case of wrong data to make a difference between someone whose name was input as "" and the case where no relevant result was found. (You would need to update the main code accordingly withname is None
).Here again, we can use dictionnary comprehension to perform all the operations you are interested: filtering and converting.
Here is what I got:
def check(countDict, data):
"""Get elements in data matching the string count in `countDict`."""
# Going through the csv data file with potential matches ine at a time
for i in data:
# Get count by converting data to integers (except for the name)
count = { k: int(v) for k, v in i.items() if k != 'name' }
# Comparing each person to dna found list
if count == countDict:
return i["name"]
return None
Now, to make the filtering (based on string rather than index) more consistent in the other function, we could update it to:
def count(data, seq):
"""Get count for to be continued."""
# Get consecutive counts and store in a dictionnary
return { k: count_consecutive_repetitions(k, seq) for k in data[0].keys() if k != 'name' }
Final code
Performing various cosmetic changes, here is my final version of the code:
import csv
import sys
def main():
# Check number of arguments
if len(sys.argv) != 3:
print("Please specify both csv and text files")
sys.exit(1)
# Get list of individuals
with open(sys.argv[1], "r") as f:
data = list(csv.DictReader(f))
# Get DNA sequence
with open(sys.argv[2]) as f:
seq = f.read()
# Get count of repeatited keys
count_dict = count_repeated_keys(data, seq)
# Get name of person with same count
name = get_name_with_same_count(data, count_dict)
# Print result
print("No match" if name is None else name)
def count_consecutive_repetitions(key, sequence):
"""Count the number of consecutive repetitions of string 'key' are in string 'sequence'."""
search_str = key
counter = 0
pos = 0
while True:
pos = sequence.find(search_str, pos)
if pos == -1:
return counter
counter += 1
search_str += key
def count_repeated_keys(data, seq):
"""Get repeated count for each key from data."""
# Get consecutive counts and store in a dictionnary
return {
k: count_consecutive_repetitions(k, seq) for k in data[0].keys() if k != "name"
}
def get_name_with_same_count(data, count_dict):
"""Get element in data matching the string count in `count_dict`."""
# Going through the csv data file with potential matches ine at a time
for i in data:
# Get count by converting data to integers (except for the name)
count = {k: int(v) for k, v in i.items() if k != "name"}
# Comparing each person to dna found list
if count == count_dict:
return i["name"]
return None
main()
assert count_consecutive_repetitions("ABC", "") == 0
assert count_consecutive_repetitions("ABC", "DEF") == 0
assert count_consecutive_repetitions("ABC", "ABDEFC") == 0
assert count_consecutive_repetitions("ABC", "ABCDEF") == 1
assert count_consecutive_repetitions("ABC", "DEFABC") == 1
assert count_consecutive_repetitions("ABC", "DEFABCABCABCABCGHI") == 4
assert count_consecutive_repetitions("ABC", "DEFABCDEFABCABCDEFABCABCABCGHI") == 3
Edit
I just realised that count_consecutive_repetitions
could be simplified as:
def count_consecutive_repetitions(key, sequence):
"""Count the number of consecutive repetitions of string 'key' are in string 'sequence'."""
pos = 0
for c in itertools.count(start=1):
pos = sequence.find(key * c, pos)
if pos == -1:
return c - 1
-
\$\begingroup\$ thanks thats really really helpful, i am just starting out , even this took me a few hours to implement. I am gonna look at the style guide and the docstrings. Thank You \$\endgroup\$singhsokhal– singhsokhal2021年11月26日 02:15:47 +00:00Commented Nov 26, 2021 at 2:15
- Rather than performing your own argument parsing, consider calling into the built-in
argparse
module. countDict
should becount_dict
by PEP8.- If you insist on doing your own argument parsing, avoid hard-coded indexing like
sys.argv[1]
and instead do a tuple unpack like_, csv, sequence = sys.argv
. - You've not opened your CSV file the way that Python recommends - you should be passing
newline=''
. - I don't consider this a strong use case for
DictReader
, and a regular reader should do just fine. The only column with a static title isname
; the rest are basically positional. - Your
data
should not need to exist in memory all at once. You should be able to process CSV rows one at a time until you find a match. if not key ==
should beif key !=
- You have a loop where
i
is an index into arange()
, but you also increment it yourself. This seems like a really bad idea. - Rather than slice-comparison like
if not key == seq[i+end:end+i+end]:
, usestartswith
passing a starting index
In the way of potential improvements, you may consider:
- Adding support for archive reading, eliminating the need for unzipping your data on the filesystem
- Adding PEP484 type hints
- Adding the test cases described in the problem definition
- Adding a
__main__
guard
Suggested
#!/usr/bin/env python3
# Based on https://cs50.harvard.edu/x/2021/psets/6/dna/
import csv
from argparse import ArgumentParser, Namespace
from contextlib import contextmanager
from io import TextIOWrapper
from typing import Optional, Callable, ContextManager, TextIO
from zipfile import ZipFile
def get_args() -> Namespace:
"""
Parse command-line arguments. If they're valid, return a namespace containing the csv_name, sequence_name and
optional from_zip; if they aren't valid exit the program.
"""
parser = ArgumentParser(
description='Identify to whom a sequence of DNA belongs according to '
'the Harvard CS50 2021 PSET 6 specification.',
)
parser.add_argument(
'csv_name',
help='Name of a CSV file containing the STR counts for a list of '
'individuals',
)
parser.add_argument(
'sequence_name',
help='Name of a text file containing the DNA sequence to identify',
)
parser.add_argument(
'--from-zip', '-z',
help=f'Read files out of this zip file'
)
return parser.parse_args()
@contextmanager
def open_source(archive_name: Optional[str]) -> ContextManager[Callable]:
"""
Switch between archive and regular filesystem retrieval. If archive_name is None, use regular filesystem retrieval;
otherwise retrieve streams from the named zip archive. In either case, a nested context manager is returned: the
outer context manager yields a callable to be used as "open"; and the inner context manager is invoked when that
open callable is called and actually opens the stream.
"""
if archive_name:
with ZipFile(archive_name) as archive:
@contextmanager
def file_open(file: str, *args, **kwargs) -> TextIO:
with archive.open(file) as f, \
TextIOWrapper(f, *args, **kwargs) as wrapper:
yield wrapper
yield file_open
else:
yield open
def longest_seq(haystack: str, needle: str) -> int:
"""
Return the count of the longest contiguous sequence of needles in the haystack.
"""
index = 0
best = 0
while True:
# out-of-sequence
index = haystack.find(needle, index)
if index == -1:
break
# in sequence
index += len(needle)
run_length = 1
for index in range(index, len(haystack), len(needle)):
if not haystack.startswith(needle, index):
break
run_length += 1
if best < run_length:
best = run_length
return best
def search_files(from_zip: Optional[str], csv_name: str, sequence_name: str) -> Optional[str]:
"""
Optionally loading files from the indicated zip archive, load Short Tandem
Repeat substring definitions from the CSV file's header; find the longest
contiguous count of each STR in the indicated sequence; and return the
name of the CSV row whose STR counts match. If there is no match, return
None.
"""
with open_source(from_zip) as open_file, \
open_file(csv_name, newline='') as csv_file:
with open_file(sequence_name) as sequence_file:
sequence = sequence_file.read()
reader = csv.reader(csv_file)
strs = next(reader)[1:]
target = [
str(longest_seq(sequence, substr))
for substr in strs
]
for name, *candidate in reader:
if candidate == target:
return name
return None
def test() -> None:
cases = (
('small', 'Bob'),
('small', None),
('small', None),
('small', 'Alice'),
('large', 'Lavender'),
('large', 'Luna'),
('large', 'Ron'),
('large', 'Ginny'),
('large', 'Draco'),
('large', 'Albus'),
('large', 'Hermione'),
('large', 'Lily'),
('large', None),
('large', 'Severus'),
('large', 'Sirius'),
('large', None),
('large', 'Harry'),
('large', None),
('large', 'Fred'),
('large', None),
)
for i, (size, expected) in enumerate(cases, 1):
actual = search_files(
from_zip='dna.zip',
csv_name=f'dna/databases/{size}.csv',
sequence_name=f'dna/sequences/{i}.txt',
)
assert actual == expected
def main() -> None:
args = get_args()
result = search_files(
args.from_zip, args.csv_name, args.sequence_name
) or 'No match'
print(result)
if __name__ == '__main__':
main()
Explore related questions
See similar questions with these tags.