The application
I'm making a small app (currently ~500 lines) to manage and download books and other media items. Each item is uniquely identify by its (downloader, code)
tuple and the index (save in a pickled file) contains the title, authors, and tags (e.g. "scfi-fi", "comedy") of each item. The "downloader" is a module in a collection of user-made modules to scrape certain pages for metadata (title, authors and tags). For example, the module example_dl
, given either the url example.com/book/123
or code 123
, handles scraping example.com
for the book with the code/ID of 123
, namely its metadata and possibly download the book to a directory.
The code below contains the dictionary config
for global configuration variables (stored in a config file shelf.yaml
), a short description of the classes Item
and Shelf
and the function I'd like to be reviewed, search_item()
.
Code
import sys
import os
import re
import argparse
import importlib.util
import yaml
import pickle
# Default configurations
PATH = os.path.expanduser("~/.config/shelf")
config = {
"config_file": PATH + "/shelf.yaml",
"index_file": PATH + "/data/index",
"favorites_file": PATH + "/data/favorites",
"downloaders_dir": PATH + "/downloaders",
"data_dir": PATH + "/data",
"threads": 10,
}
class Item:
"""Items stored in the Shelf
Attributes:
title: title of the media item
authors: set of authors
tags: set of tags
media: media type the item is stored as (e.g. png, md, pdf)
"""
title = None
authors = None
tags = None
media = None
class Shelf:
"""Shelf containing indexed data of items
Attributes:
downloaders: a dict of available downloaders
index: Shelf indexing containing data about all items saved. Data is read from index file
and is as follows:
(
(downloader1, code1) : item_1,
(downloader2, code2) : item_2
)
"""
global config
downloaders = dict()
index = dict()
favorites = set()
verbosity = 0
def search_item(
self,
title_regex: str,
authors: set[str],
tags: set[str],
blacklist: set[str],
broad_search=False,
favorites=False,
) -> list[tuple[str, str]]:
"""Search item in index
Args:
title_regex: regex to search title by
authors: set of authors OR comma-separated string
tags: set of tags OR comma-separated string
broad_search: If False (Default), returns only when *all* tags match. If True, returns when at least 1 tag matches.
favorites: only search items in favorties
Returns:
A 2-tuple containing the downloader and code. For example: ("test_dl", "test_code")
"""
if self.verbosity >= 2:
print(
"Searching with:\n"
f"\tTitle regex: {title_regex}\n"
f"\tAuthors: {authors}\n"
f"\tTags: {tags}\n"
f"\tBroad search: {broad_search}\n"
f"\tFavorites: {favorites}"
)
result = None
if favorites: # Search in favorites
result = self.favorites
if self.verbosity >= 2:
print(f"Filtered by favorites: {result}")
if title_regex: # Search with title regex
result = {
(k, v)
for (k, v) in self.index
if re.match(title_regex, self.index[(k, v)].title)
}
if self.verbosity >= 2:
print(f"Filtered by title regex: {result}")
if authors: # Search by authors (always broad search)
if type(authors) == str:
authors = set(authors.split(","))
if result:
result = {
(k, v) for (k, v) in result if authors & self.index[(k, v)].authors
}
else:
result = {
(k, v)
for (k, v) in self.index
if authors & self.index[(k, v)].authors
}
if self.verbosity >= 2:
print(f"Filtered by authors: {result}")
if tags: # Search by tags
if type(tags) == str:
tags = set(tags.split(","))
if broad_search: # Broad search
if result:
result = {
(k, v) for (k, v) in result if tags & self.index[(k, v)].tags
}
else:
result = {
(k, v)
for (k, v) in self.index
if tags & self.index[(k, v)].tags
}
if self.verbosity >= 2:
print(f"Filtered by broad_search: {result}")
else: # Normal search
if result:
result = {
(k, v)
for (k, v) in result
if not (tags - self.index[(k, v)].tags)
}
else:
result = {
(k, v)
for (k, v) in self.index
if not (tags - self.index[(k, v)].tags)
}
if self.verbosity >= 2:
print(f"Filtered by broad_search: {result}")
if blacklist:
if type(blacklist) == str:
blacklist = set(blacklist.split(","))
if result:
result = {
(k, v) for (k, v) in result if not blacklist & self.index[(k, v)].tags
}
else:
result = {
(k, v)
for (k, v) in self.index
if not blacklist & self.index[(k, v)].tags
}
return result
Description of search_item
The function search_item
goes runs the index
dictionary through 5 filters (kinda): favorites
, title_regex
, author
, tags
, and blacklist
. The switches available to affect the search are -t TITLE_REGEX
, -a AUTHOR
, -T TAGS
, --broad-search
, --favorites
, --blacklist
.
if result
I have if result
and else
in almost every filter to hopefully improve speed and memory usage, since not every filter is required to be used in a single search.
favorites
The Shelf
object has a set
of 2-tuples, each tuple being (downloader, code)
. Since favorites
is a subset of Shelf.index.keys()
, I simply copy the entire set
of favorites
to be further filtered.
title_regex
and author
These 2 are similar in implementation. The only major difference is that title_regex
is matched using re.match(user_defined_regex, titles_in_index)
, while author is filtered with `author in set_of_authors.
tags
The --broad-search
switch is only used with the search_book
function when tags
is used, and is ignored otherwise.
Normal Search
In normal search, all of the provided tags (comma-separated) must match the book's tags to show up in the result.
Broad search
In broad search, books show up in the result if at least 1 of the provided tags is found in the book's set
of tags.
blacklist
Basically the inverse of broad search.
2 Answers 2
Use libraries to join path components
Instead of using raw string concatenation, use os.path.join
or pathlib
. Both of these provide more robust ways of joining together path components.
Static variables in Item
should be instance variables
class Item:
title = None
authors = None
tags = None
media = None
This definition of Item
only supports one item, because the variables are all static when you probably want them to be instance variables. You can correct this by declaring a __init__
and initializing self.title
, self.authors
, etc.
class Item:
def __init__(
self, title: str, authors: set[str], tags: set[str], media: Media
) -> None:
self.title = title
self.authors = authors
self.tags = tags
self.media = media
Or even better, make Item
a NamedTuple
or a dataclass
to save yourself from writing a lot of boilerplate code.
(downloader, code)
deserves its own name
Since every Item
is uniquely identifiable by this 2-tuple, and since this particular 2-tuple shows up so often, you should really give it a proper name. Maybe something like ItemId
. This will make the code easier to read.
from typing NamedTuple
class ItemId(NamedTuple):
downloader: str
code: str
Unnecessary global
global config
global
is unnecessary here for several reasons:
- You're not changing what
config
points to anywhere in the program (at least, not anywhere in the code provided) - You can read, and thus mutate the dictionary pointed to by
config
without theglobal
keyword - Using
global
to begin with is a code smell. There are much cleaner ways of sharing/communicating state. More on this below.
Shelf
Missing from this class definition is the code where you load file contents from the paths in config
into Shelf
. Based on the above-mentioned global
keyword, I'm assuming you're doing the loading inside of Shelf
. This works, but there's a cleaner way of doing this.
- Load the files into data structures before instantiating
Shelf
, e.g. readindex_file
intoindex: dict
, readfavorites_file
intofavorites: set
. - Initialize
Shelf
with these data structures on instantiation.
By doing this, Shelf
becomes a lot easier to test because it's no longer responsible for doing file I/O and parsing of the files in config
. When you instantiate Shelf
for testing, you can just declare and pass in test data directly.
search_item
- If you declare a parameter's type to be
set[str]
, it should always be treated as such in the function body. Forauthors
,tags
, andblacklist
, you are actually treating them as aUnion[set[str], str]
which is something completely different. - I will go one step further and say that making
search_item
responsible for handling a parameter of typeUnion[set[str], str]
is not a good design.search_item
is not the right place to parse a user's command-linestr
input into aset[str]
. Any pre-processing/parsing of input like this should be done before callingsearch_item
. - The return type should be a
set
, not a list. Technically, the return type as it is written now is actually anOptional[set]
becauseresult
is initialized toNone
, and ifsearch_item
is called withtitle_regex="", authors=set(), tags=set(), blacklist=set(), favorites=False
, we do not step into any of the filter-related if conditions and simply returnresult
. This is probably not what you want. - An alternative to repeated
if self.verbosity >= 2
guards followed byprint
statements might be using thelogging
module. You can mapself.verbosity
to a logging level (DEBUG
,INFO
,WARNING
,ERROR
,CRITICAL
), configure your logger to have that logging level, and then print logs withlogging.debug
,logging.info
, etc. according to how chatty you want the messages to be. - All of the inner branching on
if result
for each filtering stage is what I would call a premature optimization, and the code is much harder to read as a result. It's better to write the code to be as clean and maintainable as possible first. If after that you still have concerns about performance, profile your code to see where you can make optimizations. tags <= self.index[(k, v)].tags
(testing if the set of desired tags is a subset of the item's tags) is more straightforward thannot (tags - self.index[(k, v)].tags)
blacklist.isdisjoint(self.index[(k, v)].tags)
is more straightforward thannot blacklist & self.index[(k, v)].tags
There is a lot of repetition at each stage where we filter the collection of candidate items in the same way (using set comprehensions), with the only difference being the predicate.
If we define the predicates (or checks) as functions that judge an Item
, with the help of functools.partial
we can generate custom predicates, all of the form (Item) -> bool
, based on the user's search query.
import re
from enum import Enum
from functools import partial
from typing import Callable, NamedTuple
class Media(Enum):
EPUB = 1
MARKDOWN = 2
PDF = 3
PNG = 4
class Item(NamedTuple):
title: str
authors: set[str]
tags: set[str]
media: Media
def title_check(title_regex: str, item: Item) -> bool:
return bool(re.match(title_regex, item.title))
def authors_check(authors: set[str], item: Item) -> bool:
return bool(authors & item.authors)
def tags_broad_check(tags: set[str], item: Item) -> bool:
return bool(tags & item.tags)
def tags_normal_check(tags: set[str], item: Item) -> bool:
return tags <= item.tags
def blacklisted_tags_check(blacklisted_tags: set[str], item: Item) -> bool:
return blacklisted_tags.isdisjoint(item.tags)
def generate_checks(
title_regex: str,
authors: set[str],
tags: set[str],
blacklisted_tags: set[str],
broad_search: bool = False,
) -> list[Callable[[Item], bool]]:
checks: list[Callable[[Item], bool]] = []
if title_regex:
checks.append(partial(title_check, title_regex))
if authors:
checks.append(partial(authors_check, authors))
if tags:
if broad_search:
checks.append(partial(tags_broad_check, tags))
else:
checks.append(partial(tags_normal_check, tags))
if blacklisted_tags:
checks.append(partial(blacklisted_tags_check, blacklisted_tags))
return checks
Then, assuming
self.index
is adict[ItemId, Item]
andself.favorites
is aset[ItemId]
search_item
can be refactored like so:
def search_item(
self,
title_regex: str,
authors: set[str],
tags: set[str],
blacklisted_tags: set[str],
broad_search: bool = False,
favorites: bool = False,
) -> set[ItemId]:
if favorites:
items_to_search = {
item_id: item
for item_id, item in self.index.items()
if item_id in self.favorites
}
else:
items_to_search = self.index
checks = generate_checks(
title_regex, authors, tags, blacklisted_tags, broad_search
)
return {
item_id
for item_id, item in items_to_search.items()
if all(check(item) for check in checks)
}
One advantage of writing it this way is that adding/removing/updating checks is easy, and won't require too many changes to search_item
.
I'm not going too deeply into flow of that solution, but I have some comments. Hopefully it will be useful to you.
- The first thing is that you should definitely refactor
search_item
by splitting that to smaller functions which will allow for better reading flow, but also it should help with finding code duplicates. Another benefit is that when in smaller functions, it's easier to comment such code (sometimes just by good name, sometimes with proper docstring). - You have quite a lot of redundant code, like for example
result = {
(k, v) for (k, v) in result if authors & self.index[(k, v)].authors
}
result = {
(k, v) for (k, v) in result if tags & self.index[(k, v)].tags
}
or even in broader scopes, with whole if result: .. if self.verbosity >= 2: ...
sections. Right, there are different attributes like authors
, tags
, etc., but it should be quite easy handle them especially if you'd split that into smaller functions.
these operations like
if tags & self.index[(k, v)].tags
are not very obvious. Please document them somewhere to avoid confusions when someone will work with your code in future.you don't need to specify whole tuple in these set comprehensions. That
{(k, v) for (k, v) in result if authors & self.index[(k, v)].authors}
could be rewritten to
{key for key in result if authors & self.index[key].authors}
most likely.
- I bet this is because of that you pasted here only a part of your code, but there you have not used imports (almost all of them).
os
andre
but I don't want to assume. \$\endgroup\$author
is to be matched againstself.index[(k, v)].tags
? \$\endgroup\$not blacklist & tags
\$\endgroup\$