Parse and sort domain names and their prices

Question 1

I set out to write a small Python script to tell me cheapest domains. I manually scraped for data from iwantmyname.com into a plain text file as follows:

.COM
Commercial
14ドル.90
The most registered domain extension on the planet.
.CO 48% off!
Colombia
19ドル.90
then 39ドル.00/year
It's .com without the m.
.APP
Generic
29ドル.00
Your app needs a new home
.NET
Network
17ドル.10
Think garage startups, beards before they were cool, and hobbyist tinkering.

This file has about 2050 lines.

First I wrote a script to get the domains and prices by reading lines that start with a . for domain and a $ for the associated price.

from operator import itemgetter
domain_prices_raw = []
with open(r'domain_prices.txt') as f:
 domain_prices_raw = f.readlines()
domain_prices_raw = [x.strip() for x in domain_prices_raw]
domains = []
prices = []
for line in domain_prices_raw:
 if line.startswith('.'):
 domains.append(line.split(' ')[0])
 if line.startswith('$'):
 prices.append(float(line.replace('$', '')))
sorted_domain_prices = sorted(
 list(zip(domains, prices)), key=itemgetter(1))
with open('sorted_domain_prices.txt', 'w') as f:
 for domain, price in sorted_domain_prices:
 f.write("{0}\t{1}\n".format(domain, price))

A bit later I realized that some domains have a low price only for the first year and subsequently have higher price per year. So I updated the script to read the lines that start with then.

Example from the file:

.CO 48% off!
Colombia
19ドル.90
then 39ドル.00/year

Updated script is as follows:

from operator import itemgetter
domain_prices_raw = []
with open(r'domain_prices.txt') as f:
 domain_prices_raw = f.readlines()
domain_prices_raw = [x.strip() for x in domain_prices_raw]
domains = []
prices = []
prices_after_first_year = []
for line in domain_prices_raw:
 if line.startswith('.'):
 domains.append(line.split(' ')[0])
 if line.startswith('$'):
 prices.append(float(line.replace('$', '')))
 prices_after_first_year.append(float(line.split(' ')[1].split(
 '/')[0].replace('$', '')) if line.startswith('then') else 0)
sorted_domain_prices = sorted(
 list(zip(domains, prices, prices_after_first_year)), key=itemgetter(1))
with open('sorted_domain_prices.txt', 'w') as f:
 for domain, price, after in sorted_domain_prices:
 f.write("{0}\t{1}\t{2}\n".format(domain, price, after))

The script ran quickly enough and I was happy that I could get a working solution, if only for a few minutes.

I know I used zip to associate adjacent items and that took care of most of the logic that I would have had to write if I weren't aware the existence of something like zip from the knowledge of other programming languages.

Could the script be improved other ways to be more pythonic, more performance oriented and if nothing else, just a more expressive and or OO way to solve the problem?

Bonus question: Could a CS degree have guided me to arrive at a proper solution?

Question 2

A CS degree exposes you to a broad variety of theoretical topics: data structures, algorithms, operating system design, compilers, graphics, machine learning, etc. From what I have observed, CS grads don't necessarily write good code — rather, that comes with practice, experience, collaboration, and feedback. Code Review helps a lot in that improvement process.

Question 3

One thing that makes a script like this more readable, maintainable and extendable is giving names to things. The easiest way to do that is using functions.

First I would make a function that reads and cleans your file:

def get_domain_prices(file_name):
 with open(file_name) as f:
 for line in f:
 yield line.strip()

This is a generator producing each cleaned line as needed. There is no need to read the whole file into memory here since you process it line by line anyway.

Next, your if conditions are mutually exclusive (a line can only start with one character), so you could use elif the second time to save yourself one check if the first one is true.

For the parsing of the "then 39ドル.00/year" I would use a regex. This format is fixed enough for this to be easy and also complicated enough that splitting and replacing becomes cumbersome. The regex just searches for the string "then $x.yy/year", where before the decimal point there must be one or more digits and after the decimal point there are exactly two digits.

import re
for line in get_domain_prices('domain_prices.txt'):
 if line.startswith('.'):
 if len(prices_after_first_year) < len(domains):
 prices_after_first_year.append(prices[-1])
 domains.append(line.split()[0])
 elif line.startswith('$'):
 prices.append(float(line.replace('$', '')))
 elif line.startswith("then"):
 price = float(re.search(r'then \$(\d+.\d\d)/year', line).groups()[0])
 prices_after_first_year.append(price)

I also merged this checking into the elif chain and put a safe guard in the first condition to check if the previous domain had th price after one year set and if not add the price of that domain.

sorted can take a zip object, there is no need to cast it to list first (even in Python 3). If you re-structured the order in the zip (and in the write) you don't need to set the key and rely on the normal sorting of tuples (beware that this might produce a different order for domains with the same price, because then they are ordered by the next element of the tuple).

Since I guess you are using Python 3 (and if you are not yet, you should), I used f-strings (Python 3.6+) below to make the writing a bit easier.

sorted_domain_prices = sorted(zip(prices, prices_after_first_year, domains))
with open('sorted_domain_prices.txt', 'w') as f:
 for price, after, domain in sorted_domain_prices:
 f.write(f"{domain}\t{price}\t{after}\n")

Putting it all together:

import re
def clean_lines(file_name):
 with open(file_name) as f:
 for line in f:
 yield line.strip()
def parse_domain_prices(file_name):
 for line in clean_lines(file_name):
 if line.startswith('.'):
 if len(prices_after_first_year) < len(domains):
 prices_after_first_year.append(prices[-1])
 domains.append(line.split(' ')[0])
 elif line.startswith('$'):
 prices.append(float(line.replace('$', '')))
 elif line.startswith("then"):
 price = float(re.search(r'then \$(\d+.\d\d)/year', line).groups()[0])
 prices_after_first_year.append(price)
 return prices, prices_after_first_year, domains
def write_domain_prices(sorted_domain_prices, file_name):
 with open(file_name, 'w') as f:
 for price, after, domain in sorted_domain_prices:
 f.write(f"{domain}\t{price}\t{after}\n")
if __name__ == "__main__":
 data = parse_domain_prices('domain_prices.txt')
 write_domain_prices(sorted(zip(*data)), 'sorted_domain_prices.txt')

Here I modified the order in which the tuple is returned to be able to directly pass it to zip. I also added a if __name__ == "__main__": guard to allow importing parts of this script from another script without running it.

As a last step you should consider adding docstrings that describe what each function does so you can still remember in 6 months.

Question 4

Thank you for taking the time to write such an elaborate response. I hope you don't mind if ask you one or two questions at the most after I am done understanding and grokking your comment. :)

Question 5

@Animesh One or two about understanding the answer are fine ;-)

Question 6

Bug

Your second script does not handle subsequent-year prices correctly. It tries to extract a "then $..." price for every line of the input, instead of once per domain.

This bug illustrates two weaknesses of the design both programs, which together make it vulnerable to such mixups:

The parsing works line by line. A more robust approach would be to immediately split the blob of input text into one stanza per domain, then process each stanza independently.
The domains, prices, and prices_after_first_year are separate data structures. You hope that the corresponding entries in each list refer to the same domain, and you blindly zip() them together, but there isn't anything about the code that guarantees that assumption. If, for example, one of the domains is priced in € instead of ,ドル then your data would be inconsistent, and you would never find out, because the code doesn't crash.

Parsing

To extract information from text that fits a certain pattern, use regular expressions. It requires some investment of time to learn regular expressions — it is actually a concept and language that is independent of Python — but it really is the right tool to use for these kinds of tasks.

For this problem, I would use regular expressions in two ways:

To split the input into one stanza per domain, use re.split(). Here, I would look for a newline character followed by a dot. One tricky aspect is to put the dot inside a lookahead assertion (?=\.), so that only the newline is considered to be the delimiter to be discarded, and the dot is retained as the first character of each stanza.
To extract the name, price, and optional subsequent-year price of each domain, I would use a regular expression with named groups — see (?P<name>...) in the documentation. A very tricky technicality is that the regular expression must be correctly formulated to capture "then $..." optionally.

Input / output

The open() calls should ideally be placed next to each other, to make it easy to see what the input and output filenames are.

To write tab-delimited output, use the csv module.

Suggested solution

import csv
from operator import itemgetter
import re
REGEX = re.compile(
 r'(?P<name>\.[A-Z]+).*^\$(?P<price>[\d.]+)(?:.*then \$(?P<then>[\d.]+))?',
 flags=re.MULTILINE | re.DOTALL
)
def parse_and_sort_domains(input, output, sort_attr='price'):
 domains = [
 REGEX.match(description).groupdict()
 for description in re.split(r'\n(?=\.)', input.read())
 ]
 writer = csv.DictWriter(
 output, domains[0].keys(), dialect=csv.excel_tab, extrasaction='ignore'
 )
 writer.writerows(sorted(domains, key=itemgetter(sort_attr)))
with open('domain_prices.txt') as input, \
 open('sorted_domain_prices.txt', 'w') as output:
 parse_and_sort_domains(input, output)

Question 7

This is a great comment, thanks. I was looking for advice in terms of program design and such, but you have shown me what I was overlooking or missing completely.

Graipher Graipher 41.6k7 gold badges70 silver badges134 bronze badges · Accepted Answer · 2018-05-29 09:13:46Z

One thing that makes a script like this more readable, maintainable and extendable is giving names to things. The easiest way to do that is using functions.

First I would make a function that reads and cleans your file:

def get_domain_prices(file_name):
 with open(file_name) as f:
 for line in f:
 yield line.strip()

This is a generator producing each cleaned line as needed. There is no need to read the whole file into memory here since you process it line by line anyway.

Next, your if conditions are mutually exclusive (a line can only start with one character), so you could use elif the second time to save yourself one check if the first one is true.

For the parsing of the "then 39ドル.00/year" I would use a regex. This format is fixed enough for this to be easy and also complicated enough that splitting and replacing becomes cumbersome. The regex just searches for the string "then $x.yy/year", where before the decimal point there must be one or more digits and after the decimal point there are exactly two digits.

import re
for line in get_domain_prices('domain_prices.txt'):
 if line.startswith('.'):
 if len(prices_after_first_year) < len(domains):
 prices_after_first_year.append(prices[-1])
 domains.append(line.split()[0])
 elif line.startswith('$'):
 prices.append(float(line.replace('$', '')))
 elif line.startswith("then"):
 price = float(re.search(r'then \$(\d+.\d\d)/year', line).groups()[0])
 prices_after_first_year.append(price)

I also merged this checking into the elif chain and put a safe guard in the first condition to check if the previous domain had th price after one year set and if not add the price of that domain.

sorted can take a zip object, there is no need to cast it to list first (even in Python 3). If you re-structured the order in the zip (and in the write) you don't need to set the key and rely on the normal sorting of tuples (beware that this might produce a different order for domains with the same price, because then they are ordered by the next element of the tuple).

Since I guess you are using Python 3 (and if you are not yet, you should), I used f-strings (Python 3.6+) below to make the writing a bit easier.

sorted_domain_prices = sorted(zip(prices, prices_after_first_year, domains))
with open('sorted_domain_prices.txt', 'w') as f:
 for price, after, domain in sorted_domain_prices:
 f.write(f"{domain}\t{price}\t{after}\n")

Putting it all together:

import re
def clean_lines(file_name):
 with open(file_name) as f:
 for line in f:
 yield line.strip()
def parse_domain_prices(file_name):
 for line in clean_lines(file_name):
 if line.startswith('.'):
 if len(prices_after_first_year) < len(domains):
 prices_after_first_year.append(prices[-1])
 domains.append(line.split(' ')[0])
 elif line.startswith('$'):
 prices.append(float(line.replace('$', '')))
 elif line.startswith("then"):
 price = float(re.search(r'then \$(\d+.\d\d)/year', line).groups()[0])
 prices_after_first_year.append(price)
 return prices, prices_after_first_year, domains
def write_domain_prices(sorted_domain_prices, file_name):
 with open(file_name, 'w') as f:
 for price, after, domain in sorted_domain_prices:
 f.write(f"{domain}\t{price}\t{after}\n")
if __name__ == "__main__":
 data = parse_domain_prices('domain_prices.txt')
 write_domain_prices(sorted(zip(*data)), 'sorted_domain_prices.txt')

Here I modified the order in which the tuple is returned to be able to directly pass it to zip. I also added a if __name__ == "__main__": guard to allow importing parts of this script from another script without running it.

As a last step you should consider adding docstrings that describe what each function does so you can still remember in 6 months.

Thank you for taking the time to write such an elaborate response. I hope you don't mind if ask you one or two questions at the most after I am done understanding and grokking your comment. :)
@Animesh One or two about understanding the answer are fine ;-)

Stack Exchange Network

Parse and sort domain names and their prices

2 Answers 2

Bug

Parsing

Input / output

Suggested solution

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Parse and sort domain names and their prices

2 Answers 2

Bug

Parsing

Input / output

Suggested solution

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions