Searching database of books

Question 1

The application

I'm making a small app (currently ~500 lines) to manage and download books and other media items. Each item is uniquely identify by its (downloader, code) tuple and the index (save in a pickled file) contains the title, authors, and tags (e.g. "scfi-fi", "comedy") of each item. The "downloader" is a module in a collection of user-made modules to scrape certain pages for metadata (title, authors and tags). For example, the module example_dl, given either the url example.com/book/123 or code 123, handles scraping example.com for the book with the code/ID of 123, namely its metadata and possibly download the book to a directory.

The code below contains the dictionary config for global configuration variables (stored in a config file shelf.yaml), a short description of the classes Item and Shelf and the function I'd like to be reviewed, search_item().

Code

import sys
import os
import re
import argparse
import importlib.util
import yaml
import pickle
# Default configurations
PATH = os.path.expanduser("~/.config/shelf")
config = {
 "config_file": PATH + "/shelf.yaml",
 "index_file": PATH + "/data/index",
 "favorites_file": PATH + "/data/favorites",
 "downloaders_dir": PATH + "/downloaders",
 "data_dir": PATH + "/data",
 "threads": 10,
}
class Item:
 """Items stored in the Shelf
 Attributes:
 title: title of the media item
 authors: set of authors
 tags: set of tags
 media: media type the item is stored as (e.g. png, md, pdf)
 """
 title = None
 authors = None
 tags = None
 media = None
class Shelf:
 """Shelf containing indexed data of items
 Attributes:
 downloaders: a dict of available downloaders
 index: Shelf indexing containing data about all items saved. Data is read from index file
 and is as follows:
 (
 (downloader1, code1) : item_1,
 (downloader2, code2) : item_2
 )
 """
 global config
 downloaders = dict()
 index = dict()
 favorites = set()
 verbosity = 0
 def search_item(
 self,
 title_regex: str,
 authors: set[str],
 tags: set[str],
 blacklist: set[str],
 broad_search=False,
 favorites=False,
 ) -> list[tuple[str, str]]:
 """Search item in index
 Args:
 title_regex: regex to search title by
 authors: set of authors OR comma-separated string
 tags: set of tags OR comma-separated string
 broad_search: If False (Default), returns only when *all* tags match. If True, returns when at least 1 tag matches.
 favorites: only search items in favorties
 Returns:
 A 2-tuple containing the downloader and code. For example: ("test_dl", "test_code")
 """
 if self.verbosity >= 2:
 print(
 "Searching with:\n"
 f"\tTitle regex: {title_regex}\n"
 f"\tAuthors: {authors}\n"
 f"\tTags: {tags}\n"
 f"\tBroad search: {broad_search}\n"
 f"\tFavorites: {favorites}"
 )
 result = None
 if favorites: # Search in favorites
 result = self.favorites
 if self.verbosity >= 2:
 print(f"Filtered by favorites: {result}")
 if title_regex: # Search with title regex
 result = {
 (k, v)
 for (k, v) in self.index
 if re.match(title_regex, self.index[(k, v)].title)
 }
 if self.verbosity >= 2:
 print(f"Filtered by title regex: {result}")
 if authors: # Search by authors (always broad search)
 if type(authors) == str:
 authors = set(authors.split(","))
 if result:
 result = {
 (k, v) for (k, v) in result if authors & self.index[(k, v)].authors
 }
 else:
 result = {
 (k, v)
 for (k, v) in self.index
 if authors & self.index[(k, v)].authors
 }
 if self.verbosity >= 2:
 print(f"Filtered by authors: {result}")
 if tags: # Search by tags
 if type(tags) == str:
 tags = set(tags.split(","))
 if broad_search: # Broad search
 if result:
 result = {
 (k, v) for (k, v) in result if tags & self.index[(k, v)].tags
 }
 else:
 result = {
 (k, v)
 for (k, v) in self.index
 if tags & self.index[(k, v)].tags
 }
 if self.verbosity >= 2:
 print(f"Filtered by broad_search: {result}")
 else: # Normal search
 if result:
 result = {
 (k, v)
 for (k, v) in result
 if not (tags - self.index[(k, v)].tags)
 }
 else:
 result = {
 (k, v)
 for (k, v) in self.index
 if not (tags - self.index[(k, v)].tags)
 }
 if self.verbosity >= 2:
 print(f"Filtered by broad_search: {result}")
 if blacklist:
 if type(blacklist) == str:
 blacklist = set(blacklist.split(","))
 if result:
 result = {
 (k, v) for (k, v) in result if not blacklist & self.index[(k, v)].tags
 }
 else:
 result = {
 (k, v)
 for (k, v) in self.index
 if not blacklist & self.index[(k, v)].tags
 }
 return result

Description of `search_item`

The function search_item goes runs the index dictionary through 5 filters (kinda): favorites, title_regex, author, tags, and blacklist. The switches available to affect the search are -t TITLE_REGEX, -a AUTHOR, -T TAGS, --broad-search, --favorites, --blacklist.

`if result`

I have if result and else in almost every filter to hopefully improve speed and memory usage, since not every filter is required to be used in a single search.

`favorites`

The Shelf object has a set of 2-tuples, each tuple being (downloader, code). Since favorites is a subset of Shelf.index.keys(), I simply copy the entire set of favorites to be further filtered.

`title_regex` and `author`

These 2 are similar in implementation. The only major difference is that title_regex is matched using re.match(user_defined_regex, titles_in_index), while author is filtered with `author in set_of_authors.

`tags`

The --broad-search switch is only used with the search_book function when tags is used, and is ignored otherwise.

Normal Search

In normal search, all of the provided tags (comma-separated) must match the book's tags to show up in the result.

Broad search

In broad search, books show up in the result if at least 1 of the provided tags is found in the book's set of tags.

`blacklist`

Basically the inverse of broad search.

Question 2

Can you include the imports you use please? I see os and re but I don't want to assume.

Question 3

are you sure that in blacklist, author is to be matched against self.index[(k, v)].tags?

Question 4

@hjpotter92 Right, it should be not blacklist & tags

Question 5

Use libraries to join path components

Instead of using raw string concatenation, use os.path.join or pathlib. Both of these provide more robust ways of joining together path components.

Static variables in `Item` should be instance variables

class Item:
 title = None
 authors = None
 tags = None
 media = None

This definition of Item only supports one item, because the variables are all static when you probably want them to be instance variables. You can correct this by declaring a __init__ and initializing self.title, self.authors, etc.

class Item:
 def __init__(
 self, title: str, authors: set[str], tags: set[str], media: Media
 ) -> None:
 self.title = title
 self.authors = authors
 self.tags = tags
 self.media = media

Or even better, make Item a NamedTuple or a dataclass to save yourself from writing a lot of boilerplate code.

`(downloader, code)` deserves its own name

Since every Item is uniquely identifiable by this 2-tuple, and since this particular 2-tuple shows up so often, you should really give it a proper name. Maybe something like ItemId. This will make the code easier to read.

from typing NamedTuple
class ItemId(NamedTuple):
 downloader: str
 code: str

Unnecessary `global`

global config

global is unnecessary here for several reasons:

You're not changing what config points to anywhere in the program (at least, not anywhere in the code provided)
You can read, and thus mutate the dictionary pointed to by config without the global keyword
Using global to begin with is a code smell. There are much cleaner ways of sharing/communicating state. More on this below.

`Shelf`

Missing from this class definition is the code where you load file contents from the paths in config into Shelf. Based on the above-mentioned global keyword, I'm assuming you're doing the loading inside of Shelf. This works, but there's a cleaner way of doing this.

Load the files into data structures before instantiating Shelf, e.g. read index_file into index: dict, read favorites_file into favorites: set.
Initialize Shelf with these data structures on instantiation.

By doing this, Shelf becomes a lot easier to test because it's no longer responsible for doing file I/O and parsing of the files in config. When you instantiate Shelf for testing, you can just declare and pass in test data directly.

`search_item`

If you declare a parameter's type to be set[str], it should always be treated as such in the function body. For authors, tags, and blacklist, you are actually treating them as a Union[set[str], str] which is something completely different.
I will go one step further and say that making search_item responsible for handling a parameter of type Union[set[str], str] is not a good design. search_item is not the right place to parse a user's command-line str input into a set[str]. Any pre-processing/parsing of input like this should be done before calling search_item.
The return type should be a set, not a list. Technically, the return type as it is written now is actually an Optional[set] because result is initialized to None, and if search_item is called with title_regex="", authors=set(), tags=set(), blacklist=set(), favorites=False, we do not step into any of the filter-related if conditions and simply return result. This is probably not what you want.
An alternative to repeated if self.verbosity >= 2 guards followed by print statements might be using the logging module. You can map self.verbosity to a logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL), configure your logger to have that logging level, and then print logs with logging.debug, logging.info, etc. according to how chatty you want the messages to be.
All of the inner branching on if result for each filtering stage is what I would call a premature optimization, and the code is much harder to read as a result. It's better to write the code to be as clean and maintainable as possible first. If after that you still have concerns about performance, profile your code to see where you can make optimizations.
tags <= self.index[(k, v)].tags (testing if the set of desired tags is a subset of the item's tags) is more straightforward than not (tags - self.index[(k, v)].tags)
blacklist.isdisjoint(self.index[(k, v)].tags) is more straightforward than not blacklist & self.index[(k, v)].tags

There is a lot of repetition at each stage where we filter the collection of candidate items in the same way (using set comprehensions), with the only difference being the predicate.

If we define the predicates (or checks) as functions that judge an Item, with the help of functools.partial we can generate custom predicates, all of the form (Item) -> bool, based on the user's search query.

import re
from enum import Enum
from functools import partial
from typing import Callable, NamedTuple
class Media(Enum):
 EPUB = 1
 MARKDOWN = 2
 PDF = 3
 PNG = 4
class Item(NamedTuple):
 title: str
 authors: set[str]
 tags: set[str]
 media: Media
def title_check(title_regex: str, item: Item) -> bool:
 return bool(re.match(title_regex, item.title))
def authors_check(authors: set[str], item: Item) -> bool:
 return bool(authors & item.authors)
def tags_broad_check(tags: set[str], item: Item) -> bool:
 return bool(tags & item.tags)
def tags_normal_check(tags: set[str], item: Item) -> bool:
 return tags <= item.tags
def blacklisted_tags_check(blacklisted_tags: set[str], item: Item) -> bool:
 return blacklisted_tags.isdisjoint(item.tags)
def generate_checks(
 title_regex: str,
 authors: set[str],
 tags: set[str],
 blacklisted_tags: set[str],
 broad_search: bool = False,
) -> list[Callable[[Item], bool]]:
 checks: list[Callable[[Item], bool]] = []
 if title_regex:
 checks.append(partial(title_check, title_regex))
 if authors:
 checks.append(partial(authors_check, authors))
 if tags:
 if broad_search:
 checks.append(partial(tags_broad_check, tags))
 else:
 checks.append(partial(tags_normal_check, tags))
 if blacklisted_tags:
 checks.append(partial(blacklisted_tags_check, blacklisted_tags))
 return checks

Then, assuming

self.index is a dict[ItemId, Item] and
self.favorites is a set[ItemId]

search_item can be refactored like so:

def search_item(
 self,
 title_regex: str,
 authors: set[str],
 tags: set[str],
 blacklisted_tags: set[str],
 broad_search: bool = False,
 favorites: bool = False,
) -> set[ItemId]:
 if favorites:
 items_to_search = {
 item_id: item
 for item_id, item in self.index.items()
 if item_id in self.favorites
 }
 else:
 items_to_search = self.index
 checks = generate_checks(
 title_regex, authors, tags, blacklisted_tags, broad_search
 )
 return {
 item_id
 for item_id, item in items_to_search.items()
 if all(check(item) for check in checks)
 }

One advantage of writing it this way is that adding/removing/updating checks is easy, and won't require too many changes to search_item.

Question 6

I'm not going too deeply into flow of that solution, but I have some comments. Hopefully it will be useful to you.

The first thing is that you should definitely refactor search_item by splitting that to smaller functions which will allow for better reading flow, but also it should help with finding code duplicates. Another benefit is that when in smaller functions, it's easier to comment such code (sometimes just by good name, sometimes with proper docstring).
You have quite a lot of redundant code, like for example

result = {
 (k, v) for (k, v) in result if authors & self.index[(k, v)].authors
}

result = {
 (k, v) for (k, v) in result if tags & self.index[(k, v)].tags
}

or even in broader scopes, with whole if result: .. if self.verbosity >= 2: ... sections. Right, there are different attributes like authors, tags, etc., but it should be quite easy handle them especially if you'd split that into smaller functions.

these operations like if tags & self.index[(k, v)].tags are not very obvious. Please document them somewhere to avoid confusions when someone will work with your code in future.
you don't need to specify whole tuple in these set comprehensions. That

 {(k, v) for (k, v) in result if authors & self.index[(k, v)].authors}

could be rewritten to

{key for key in result if authors & self.index[key].authors}

most likely.

I bet this is because of that you pasted here only a part of your code, but there you have not used imports (almost all of them).

Setris Setris 1,6881 gold badge10 silver badges17 bronze badges · Answer 1 · 2021-02-07 12:33:27Z

Use libraries to join path components

Instead of using raw string concatenation, use os.path.join or pathlib. Both of these provide more robust ways of joining together path components.

Static variables in `Item` should be instance variables

class Item:
 title = None
 authors = None
 tags = None
 media = None

This definition of Item only supports one item, because the variables are all static when you probably want them to be instance variables. You can correct this by declaring a __init__ and initializing self.title, self.authors, etc.

class Item:
 def __init__(
 self, title: str, authors: set[str], tags: set[str], media: Media
 ) -> None:
 self.title = title
 self.authors = authors
 self.tags = tags
 self.media = media

Or even better, make Item a NamedTuple or a dataclass to save yourself from writing a lot of boilerplate code.

`(downloader, code)` deserves its own name

Since every Item is uniquely identifiable by this 2-tuple, and since this particular 2-tuple shows up so often, you should really give it a proper name. Maybe something like ItemId. This will make the code easier to read.

from typing NamedTuple
class ItemId(NamedTuple):
 downloader: str
 code: str

Unnecessary `global`

global config

global is unnecessary here for several reasons:

You're not changing what config points to anywhere in the program (at least, not anywhere in the code provided)
You can read, and thus mutate the dictionary pointed to by config without the global keyword
Using global to begin with is a code smell. There are much cleaner ways of sharing/communicating state. More on this below.

`Shelf`

Missing from this class definition is the code where you load file contents from the paths in config into Shelf. Based on the above-mentioned global keyword, I'm assuming you're doing the loading inside of Shelf. This works, but there's a cleaner way of doing this.

Load the files into data structures before instantiating Shelf, e.g. read index_file into index: dict, read favorites_file into favorites: set.
Initialize Shelf with these data structures on instantiation.

By doing this, Shelf becomes a lot easier to test because it's no longer responsible for doing file I/O and parsing of the files in config. When you instantiate Shelf for testing, you can just declare and pass in test data directly.

`search_item`

If you declare a parameter's type to be set[str], it should always be treated as such in the function body. For authors, tags, and blacklist, you are actually treating them as a Union[set[str], str] which is something completely different.
I will go one step further and say that making search_item responsible for handling a parameter of type Union[set[str], str] is not a good design. search_item is not the right place to parse a user's command-line str input into a set[str]. Any pre-processing/parsing of input like this should be done before calling search_item.
The return type should be a set, not a list. Technically, the return type as it is written now is actually an Optional[set] because result is initialized to None, and if search_item is called with title_regex="", authors=set(), tags=set(), blacklist=set(), favorites=False, we do not step into any of the filter-related if conditions and simply return result. This is probably not what you want.
An alternative to repeated if self.verbosity >= 2 guards followed by print statements might be using the logging module. You can map self.verbosity to a logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL), configure your logger to have that logging level, and then print logs with logging.debug, logging.info, etc. according to how chatty you want the messages to be.
All of the inner branching on if result for each filtering stage is what I would call a premature optimization, and the code is much harder to read as a result. It's better to write the code to be as clean and maintainable as possible first. If after that you still have concerns about performance, profile your code to see where you can make optimizations.
tags <= self.index[(k, v)].tags (testing if the set of desired tags is a subset of the item's tags) is more straightforward than not (tags - self.index[(k, v)].tags)
blacklist.isdisjoint(self.index[(k, v)].tags) is more straightforward than not blacklist & self.index[(k, v)].tags

There is a lot of repetition at each stage where we filter the collection of candidate items in the same way (using set comprehensions), with the only difference being the predicate.

If we define the predicates (or checks) as functions that judge an Item, with the help of functools.partial we can generate custom predicates, all of the form (Item) -> bool, based on the user's search query.

import re
from enum import Enum
from functools import partial
from typing import Callable, NamedTuple
class Media(Enum):
 EPUB = 1
 MARKDOWN = 2
 PDF = 3
 PNG = 4
class Item(NamedTuple):
 title: str
 authors: set[str]
 tags: set[str]
 media: Media
def title_check(title_regex: str, item: Item) -> bool:
 return bool(re.match(title_regex, item.title))
def authors_check(authors: set[str], item: Item) -> bool:
 return bool(authors & item.authors)
def tags_broad_check(tags: set[str], item: Item) -> bool:
 return bool(tags & item.tags)
def tags_normal_check(tags: set[str], item: Item) -> bool:
 return tags <= item.tags
def blacklisted_tags_check(blacklisted_tags: set[str], item: Item) -> bool:
 return blacklisted_tags.isdisjoint(item.tags)
def generate_checks(
 title_regex: str,
 authors: set[str],
 tags: set[str],
 blacklisted_tags: set[str],
 broad_search: bool = False,
) -> list[Callable[[Item], bool]]:
 checks: list[Callable[[Item], bool]] = []
 if title_regex:
 checks.append(partial(title_check, title_regex))
 if authors:
 checks.append(partial(authors_check, authors))
 if tags:
 if broad_search:
 checks.append(partial(tags_broad_check, tags))
 else:
 checks.append(partial(tags_normal_check, tags))
 if blacklisted_tags:
 checks.append(partial(blacklisted_tags_check, blacklisted_tags))
 return checks

Then, assuming

self.index is a dict[ItemId, Item] and
self.favorites is a set[ItemId]

search_item can be refactored like so:

def search_item(
 self,
 title_regex: str,
 authors: set[str],
 tags: set[str],
 blacklisted_tags: set[str],
 broad_search: bool = False,
 favorites: bool = False,
) -> set[ItemId]:
 if favorites:
 items_to_search = {
 item_id: item
 for item_id, item in self.index.items()
 if item_id in self.favorites
 }
 else:
 items_to_search = self.index
 checks = generate_checks(
 title_regex, authors, tags, blacklisted_tags, broad_search
 )
 return {
 item_id
 for item_id, item in items_to_search.items()
 if all(check(item) for check in checks)
 }

One advantage of writing it this way is that adding/removing/updating checks is easy, and won't require too many changes to search_item.

Adam Tokarski Adam Tokarski 1113 bronze badges · Answer 2 · 2021-02-04 21:04:31Z

I'm not going too deeply into flow of that solution, but I have some comments. Hopefully it will be useful to you.

The first thing is that you should definitely refactor search_item by splitting that to smaller functions which will allow for better reading flow, but also it should help with finding code duplicates. Another benefit is that when in smaller functions, it's easier to comment such code (sometimes just by good name, sometimes with proper docstring).
You have quite a lot of redundant code, like for example

result = {
 (k, v) for (k, v) in result if authors & self.index[(k, v)].authors
}

result = {
 (k, v) for (k, v) in result if tags & self.index[(k, v)].tags
}

or even in broader scopes, with whole if result: .. if self.verbosity >= 2: ... sections. Right, there are different attributes like authors, tags, etc., but it should be quite easy handle them especially if you'd split that into smaller functions.

these operations like if tags & self.index[(k, v)].tags are not very obvious. Please document them somewhere to avoid confusions when someone will work with your code in future.
you don't need to specify whole tuple in these set comprehensions. That

 {(k, v) for (k, v) in result if authors & self.index[(k, v)].authors}

could be rewritten to

{key for key in result if authors & self.index[key].authors}

most likely.

I bet this is because of that you pasted here only a part of your code, but there you have not used imports (almost all of them).

Stack Exchange Network

Searching database of books

The application

Code

Description of `search_item`

`if result`

`favorites`

`title_regex` and `author`

`tags`

Normal Search

Broad search

`blacklist`

2 Answers 2

Use libraries to join path components

Static variables in `Item` should be instance variables

`(downloader, code)` deserves its own name

Unnecessary `global`

`Shelf`

`search_item`

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Searching database of books

The application

Code

Description of search_item

if result

favorites

title_regex and author

tags

Normal Search

Broad search

blacklist

2 Answers 2

Use libraries to join path components

Static variables in Item should be instance variables

(downloader, code) deserves its own name

Unnecessary global

Shelf

search_item

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

Description of `search_item`

`if result`

`favorites`

`title_regex` and `author`

`tags`

`blacklist`

Static variables in `Item` should be instance variables

`(downloader, code)` deserves its own name

Unnecessary `global`

`Shelf`

`search_item`