I set out to write a small Python script to tell me cheapest domains. I manually scraped for data from iwantmyname.com into a plain text file as follows:
.COM
Commercial
14ドル.90
The most registered domain extension on the planet.
.CO 48% off!
Colombia
19ドル.90
then 39ドル.00/year
It's .com without the m.
.APP
Generic
29ドル.00
Your app needs a new home
.NET
Network
17ドル.10
Think garage startups, beards before they were cool, and hobbyist tinkering.
This file has about 2050 lines.
First I wrote a script to get the domains and prices by reading lines that start with a .
for domain and a $
for the associated price.
from operator import itemgetter
domain_prices_raw = []
with open(r'domain_prices.txt') as f:
domain_prices_raw = f.readlines()
domain_prices_raw = [x.strip() for x in domain_prices_raw]
domains = []
prices = []
for line in domain_prices_raw:
if line.startswith('.'):
domains.append(line.split(' ')[0])
if line.startswith('$'):
prices.append(float(line.replace('$', '')))
sorted_domain_prices = sorted(
list(zip(domains, prices)), key=itemgetter(1))
with open('sorted_domain_prices.txt', 'w') as f:
for domain, price in sorted_domain_prices:
f.write("{0}\t{1}\n".format(domain, price))
A bit later I realized that some domains have a low price only for the first year and subsequently have higher price per year. So I updated the script to read the lines that start with then
.
Example from the file:
.CO 48% off!
Colombia
19ドル.90
then 39ドル.00/year
Updated script is as follows:
from operator import itemgetter
domain_prices_raw = []
with open(r'domain_prices.txt') as f:
domain_prices_raw = f.readlines()
domain_prices_raw = [x.strip() for x in domain_prices_raw]
domains = []
prices = []
prices_after_first_year = []
for line in domain_prices_raw:
if line.startswith('.'):
domains.append(line.split(' ')[0])
if line.startswith('$'):
prices.append(float(line.replace('$', '')))
prices_after_first_year.append(float(line.split(' ')[1].split(
'/')[0].replace('$', '')) if line.startswith('then') else 0)
sorted_domain_prices = sorted(
list(zip(domains, prices, prices_after_first_year)), key=itemgetter(1))
with open('sorted_domain_prices.txt', 'w') as f:
for domain, price, after in sorted_domain_prices:
f.write("{0}\t{1}\t{2}\n".format(domain, price, after))
The script ran quickly enough and I was happy that I could get a working solution, if only for a few minutes.
I know I used zip
to associate adjacent items and that took care of most of the logic that I would have had to write if I weren't aware the existence of something like zip from the knowledge of other programming languages.
Could the script be improved other ways to be more pythonic, more performance oriented and if nothing else, just a more expressive and or OO way to solve the problem?
Bonus question: Could a CS degree have guided me to arrive at a proper solution?
-
3\$\begingroup\$ A CS degree exposes you to a broad variety of theoretical topics: data structures, algorithms, operating system design, compilers, graphics, machine learning, etc. From what I have observed, CS grads don't necessarily write good code — rather, that comes with practice, experience, collaboration, and feedback. Code Review helps a lot in that improvement process. \$\endgroup\$200_success– 200_success2018年05月30日 03:47:05 +00:00Commented May 30, 2018 at 3:47
2 Answers 2
One thing that makes a script like this more readable, maintainable and extendable is giving names to things. The easiest way to do that is using functions.
First I would make a function that reads and cleans your file:
def get_domain_prices(file_name):
with open(file_name) as f:
for line in f:
yield line.strip()
This is a generator producing each cleaned line as needed. There is no need to read the whole file into memory here since you process it line by line anyway.
Next, your if
conditions are mutually exclusive (a line can only start with one character), so you could use elif
the second time to save yourself one check if the first one is true.
For the parsing of the "then 39ドル.00/year"
I would use a regex. This format is fixed enough for this to be easy and also complicated enough that splitting and replacing becomes cumbersome. The regex just searches for the string "then $x.yy/year"
, where before the decimal point there must be one or more digits and after the decimal point there are exactly two digits.
import re
for line in get_domain_prices('domain_prices.txt'):
if line.startswith('.'):
if len(prices_after_first_year) < len(domains):
prices_after_first_year.append(prices[-1])
domains.append(line.split()[0])
elif line.startswith('$'):
prices.append(float(line.replace('$', '')))
elif line.startswith("then"):
price = float(re.search(r'then \$(\d+.\d\d)/year', line).groups()[0])
prices_after_first_year.append(price)
I also merged this checking into the elif
chain and put a safe guard in the first condition to check if the previous domain had th price after one year set and if not add the price of that domain.
sorted
can take a zip
object, there is no need to cast it to list
first (even in Python 3). If you re-structured the order in the zip
(and in the write
) you don't need to set the key and rely on the normal sorting of tuples (beware that this might produce a different order for domains with the same price, because then they are ordered by the next element of the tuple).
Since I guess you are using Python 3 (and if you are not yet, you should), I used f-string
s (Python 3.6+) below to make the writing a bit easier.
sorted_domain_prices = sorted(zip(prices, prices_after_first_year, domains))
with open('sorted_domain_prices.txt', 'w') as f:
for price, after, domain in sorted_domain_prices:
f.write(f"{domain}\t{price}\t{after}\n")
Putting it all together:
import re
def clean_lines(file_name):
with open(file_name) as f:
for line in f:
yield line.strip()
def parse_domain_prices(file_name):
for line in clean_lines(file_name):
if line.startswith('.'):
if len(prices_after_first_year) < len(domains):
prices_after_first_year.append(prices[-1])
domains.append(line.split(' ')[0])
elif line.startswith('$'):
prices.append(float(line.replace('$', '')))
elif line.startswith("then"):
price = float(re.search(r'then \$(\d+.\d\d)/year', line).groups()[0])
prices_after_first_year.append(price)
return prices, prices_after_first_year, domains
def write_domain_prices(sorted_domain_prices, file_name):
with open(file_name, 'w') as f:
for price, after, domain in sorted_domain_prices:
f.write(f"{domain}\t{price}\t{after}\n")
if __name__ == "__main__":
data = parse_domain_prices('domain_prices.txt')
write_domain_prices(sorted(zip(*data)), 'sorted_domain_prices.txt')
Here I modified the order in which the tuple is returned to be able to directly pass it to zip
. I also added a if __name__ == "__main__":
guard to allow importing parts of this script from another script without running it.
As a last step you should consider adding docstring
s that describe what each function does so you can still remember in 6 months.
-
\$\begingroup\$ Thank you for taking the time to write such an elaborate response. I hope you don't mind if ask you one or two questions at the most after I am done understanding and grokking your comment. :) \$\endgroup\$Animesh D– Animesh D2018年05月31日 05:45:33 +00:00Commented May 31, 2018 at 5:45
-
\$\begingroup\$ @Animesh One or two about understanding the answer are fine ;-) \$\endgroup\$Graipher– Graipher2018年05月31日 05:47:40 +00:00Commented May 31, 2018 at 5:47
Bug
Your second script does not handle subsequent-year prices correctly. It tries to extract a "then $..." price for every line of the input, instead of once per domain.
This bug illustrates two weaknesses of the design both programs, which together make it vulnerable to such mixups:
The parsing works line by line. A more robust approach would be to immediately split the blob of input text into one stanza per domain, then process each stanza independently.
The
domains
,prices
, andprices_after_first_year
are separate data structures. You hope that the corresponding entries in each list refer to the same domain, and you blindlyzip()
them together, but there isn't anything about the code that guarantees that assumption. If, for example, one of the domains is priced in € instead of ,ドル then your data would be inconsistent, and you would never find out, because the code doesn't crash.
Parsing
To extract information from text that fits a certain pattern, use regular expressions. It requires some investment of time to learn regular expressions — it is actually a concept and language that is independent of Python — but it really is the right tool to use for these kinds of tasks.
For this problem, I would use regular expressions in two ways:
To split the input into one stanza per domain, use
re.split()
. Here, I would look for a newline character followed by a dot. One tricky aspect is to put the dot inside a lookahead assertion(?=\.)
, so that only the newline is considered to be the delimiter to be discarded, and the dot is retained as the first character of each stanza.To extract the name, price, and optional subsequent-year price of each domain, I would use a regular expression with named groups — see
(?P<name>...)
in the documentation. A very tricky technicality is that the regular expression must be correctly formulated to capture "then $..." optionally.
Input / output
The open()
calls should ideally be placed next to each other, to make it easy to see what the input and output filenames are.
To write tab-delimited output, use the csv
module.
Suggested solution
import csv
from operator import itemgetter
import re
REGEX = re.compile(
r'(?P<name>\.[A-Z]+).*^\$(?P<price>[\d.]+)(?:.*then \$(?P<then>[\d.]+))?',
flags=re.MULTILINE | re.DOTALL
)
def parse_and_sort_domains(input, output, sort_attr='price'):
domains = [
REGEX.match(description).groupdict()
for description in re.split(r'\n(?=\.)', input.read())
]
writer = csv.DictWriter(
output, domains[0].keys(), dialect=csv.excel_tab, extrasaction='ignore'
)
writer.writerows(sorted(domains, key=itemgetter(sort_attr)))
with open('domain_prices.txt') as input, \
open('sorted_domain_prices.txt', 'w') as output:
parse_and_sort_domains(input, output)
-
\$\begingroup\$ This is a great comment, thanks. I was looking for advice in terms of program design and such, but you have shown me what I was overlooking or missing completely. \$\endgroup\$Animesh D– Animesh D2018年05月31日 05:49:35 +00:00Commented May 31, 2018 at 5:49