Backstory
I collect books. A lot of books. Whitepapers, (programming) cookbooks, transcripts, important letters, overviews, you name it. Usually in PDF, otherwise my house would be too small.
To keep track of those books, I use an arcane collection system based on notes referring to notes. This obviously has some scalability problems, so for the past year I've been trying to build a proper, relational database for it.
It still isn't finished...
However, it's a project in parts and the first part has been ready for many moons. It also hasn't changed much over time, so I guess it's ready for review.
The problem
To have a database of books, you want at least 3 things:
- Generate a list from a set of books. A directory goes in, a text file containing all books comes out. Eventually, a program will have to read it to extract keywords from it, but I'll save that for a later version. The pre-loader.
- Insert or update the list into a database. Create the database if it doesn't exist yet. The CRUDder.
- Produce a view into the database for the user. The viewer.
So, I decided to split those 3 things and give each their own program. Eventually it may end up in one, but with explicit separation. Later on it may also need an automated maintenance program as a 4th module.
The database (3NF) should at least contain the name, authors and publishing year of each book. Later versions may include the size in pages, ISBN and keywords. This program does not yet support this, but should've been written with those features in mind.
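Something like this rough sqlite3 sketch is the direction I have in mind for the schema; the table and column names are just placeholders, nothing here is final:

import sqlite3

# Draft 3NF layout: books and authors in separate tables,
# linked through a junction table so a book can have many authors.
SCHEMA = """
CREATE TABLE IF NOT EXISTS book (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    year  INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE IF NOT EXISTS author (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS book_author (
    book_id   INTEGER NOT NULL REFERENCES book(id),
    author_id INTEGER NOT NULL REFERENCES author(id),
    PRIMARY KEY (book_id, author_id)
);
"""

with sqlite3.connect("books.db") as connection:
    connection.executescript(SCHEMA)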
The pre-loader
The pre-loader will take a directory of PDFs. The test set is 222 items large, the final dataset much larger.
For now, only PDFs are supported. If I'd ever want to change that, all I'd have to do is change the following:
if file.lower().endswith(FILE_EXTENSION):
into:
for file_extension in FILE_EXTENSIONS:
    if file.lower().endswith(file_extension):
and break out of the for after handling the first successful hit, but since I'm not sure I'll ever do that, the current, simple version will suffice.
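Alternatively, since str.endswith also accepts a tuple of suffixes, a multi-extension check could stay a one-liner. A rough sketch; the FILE_EXTENSIONS tuple below is hypothetical, I haven't defined it anywhere yet:

import os

TARGET_DIRECTORY = "Books"
# Hypothetical: a tuple of supported extensions instead of a single string.
FILE_EXTENSIONS = (".pdf", ".epub", ".djvu")

for file in os.listdir(TARGET_DIRECTORY):
    # str.endswith accepts a tuple, so no inner loop or break is needed.
    if file.lower().endswith(FILE_EXTENSIONS):
        print(file)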
All PDFs should be in the following format:
String_String_4digits.pdf
An example book would look like this:
SomeLongBookTitle_AuthorName1&AuthorName2_2017.pdf
Where first there's the title of the book, second there's the authors of the book and last there's the publishing year. Author names should be separated by an ampersand; any other character except an underscore will be considered part of an author's name.
If the author of a book is unknown, it will be UNK. If a publishing year is unknown, it will be 0000. Books published before the year 1 anno Domini are not supported.
If the file is not a PDF, it will be unidentified. Such files may or may not be a problem, but a list is kept. For example, a link may have been created instead of a copy, resulting in a .lnk instead of a .pdf.
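Since the scheme is this rigid, the name could also be validated with a single regular expression. A rough sketch of my reading of the rules above, not part of the current program:

import re

# Title and authors: anything but an underscore; year: exactly four digits.
BOOK_PATTERN = re.compile(r"^([^_]+)_([^_]+)_(\d{4})\.pdf$", re.IGNORECASE)

match = BOOK_PATTERN.match("SomeLongBookTitle_AuthorName1&AuthorName2_2017.pdf")
if match:
    title, authors, year = match.groups()
    print(title, authors.split("&"), year)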
File names can be very long. Example:
FullyIntegratedMultipurposeLow-CostK-BandFMCWRadarModuleWithSub-milimeterMeasurementPrecision_PiotrKaminski&IzabelaSlomian&KrzysztofWincza&SlawomirGruszczynski_2015.pdf
Book name: 93 characters. File name (incl. extension): 168 characters.
EvaluationOfBatteryAndBatteryChargerShort-CircuitCurrentContributionsToAFaultOnTheDcDistributionSystemAtANuclearPowerPlant_BrookhavenNationalLaboratory_2015.pdf
Book name: 122 characters. File name (incl. extension): 160 characters.
There can be a lot of authors:
TROPOMIATBDOfTheTotalAndTroposphericNO2DataProducts_JHGMVanGefffen&KFBoersma&HJEskes&JDMaasakkers&JPVeefkind_2016.pdf
5 authors.
(That, by the way, is an unfortunate example of improper naming. An artefact of being inconsistent and lacking a thorough naming scheme, but irrelevant for the database.)
The current idea is the program will deliver a list formatted like this:
EffectiveSTL ScottMeyers 0000
ElectricalEngineersReferenceBook MALaughton&DFWarne 2003
ElectroMagneticComplianceWhatIsEFT Microchip 2004
A better approach would perhaps split the authors here already, so the next step doesn't have to do it. That seems the right thing to do. Perhaps even turn the whole list into a JSON file instead of a text file.
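A rough sketch of what that JSON variant could look like, using the rows above as input; the field names are just a first guess:

import json

# (title, authors, year) tuples as produced by the current pre-loader.
books = [
    ("EffectiveSTL", "ScottMeyers", "0000"),
    ("ElectricalEngineersReferenceBook", "MALaughton&DFWarne", "2003"),
    ("ElectroMagneticComplianceWhatIsEFT", "Microchip", "2004"),
]

records = [
    {"title": title, "authors": authors.split("&"), "year": int(year)}
    for title, authors, year in books
]

with open("list.json", "w") as output_file:
    json.dump(records, output_file, indent=2)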
Code
import os
TARGET_DIRECTORY = "Books"
FILE_EXTENSION = ".pdf"
PRINT_UNIDENTIFIED_LIST = True
def print_result(good_count, bad_count, unidentified_list):
length_unidentified_list = len(unidentified_list)
print("Succesful books: {0}".format(good_count))
print("Bad books: {0}".format(bad_count))
print("Unidentified files: {0}".format(length_unidentified_list))
if PRINT_UNIDENTIFIED_LIST and (length_unidentified_list > 0):
print(length_unidentified_list)
print(unidentified_list)
def make_list():
with open("list.txt", 'w') as output_file:
good_count, bad_count = 0, 0
unidentified_list = []
for file in os.listdir(TARGET_DIRECTORY):
# Check against .lower() to also catch
# accidental uppercased file extension
if file.lower().endswith(FILE_EXTENSION):
try:
book_name, author_name, publishing_year = \
file[:-len(FILE_EXTENSION)].split("_")
#print(book_name, author_name, publishing_year)
output_file.write(
"{0} {1} {2}\n".format(
book_name, author_name, publishing_year
)
)
good_count += 1
except ValueError:
# Added indentation for improved UX
print(" Could not handle the following file: {0}".format(file))
print(" Check whether it's name is properly formatted.")
bad_count += 1
else:
unidentified_list += [file]
print_result(good_count, bad_count, unidentified_list)
def main():
make_list()
if __name__ == '__main__':
main()
Yes, I know one of my lines is slightly too long according to PEP 8. That's the result of the make_list function failing the Single Responsibility Principle and a lack of helper functions.
If that gets fixed, the offending line will probably be fixed in the process.
That's also the reason there are no docstrings yet. The function structure as-is is flawed, even though the current function names are very descriptive.
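For illustration only, pulling the parsing into a helper (parse_book is a placeholder name) would already take care of the offending line, even though it's not the full refactor:

FILE_EXTENSION = ".pdf"

def parse_book(file_name):
    # Unpacking enforces exactly three parts; anything else raises ValueError.
    book_name, author_name, publishing_year = \
        file_name[:-len(FILE_EXTENSION)].split("_")
    return book_name, author_name, publishing_year

print(parse_book("EffectiveSTL_ScottMeyers_0000.pdf"))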
I've tried putting the results in a dictionary, like this:
result = {
'good_count': 0,
'bad_count': 0,
'unidentified_list': []
}
But it didn't improve readability one bit. I've also tried putting everything in a module and using more specialized functions. That got very messy, very fast.
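A namedtuple might read a little better than the bare dict for the returned result, although the counting would still happen in plain locals since namedtuples are immutable. A quick sketch:

from collections import namedtuple

# Named fields instead of dict-key strings scattered through the code.
Result = namedtuple("Result", ["good_count", "bad_count", "unidentified_list"])

def summarise():
    # Hypothetical numbers, just to show the shape of the return value.
    good_count, bad_count, unidentified_list = 2, 1, ["shortcut.lnk"]
    return Result(good_count, bad_count, unidentified_list)

result = summarise()
print(result.good_count, result.unidentified_list)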
Anything and everything is up for review, including the design of the whole process.
The current execution time of the program is low enough that I'm not worried, even for larger amounts of data.
1 Answer
I would:
- Change make_list to return good_count, bad_count and unidentified_list, rather than call print_result. This allows you to change the way that they are displayed at a later date.
- If you invert file.lower().endswith(FILE_EXTENSION), you can use a guard clause, reducing indentation and improving readability.
- You don't need to use the try:
  - Your format can be replaced with ' '.join(), with an addition of '\n'.
  - You can check if file[:-len(FILE_EXTENSION)].split("_") has a length of three.
- You should stick to one apostrophe type, unless switching to another causes the string to be easier to read. Take open("list.txt", 'w'), as opposed to 'it' and "it's".
- If you don't need to handle nested file types, such as .tar.gz, then you can use os.path.splitext to simplify getting the name and extension.
Creating:
import os
TARGET_DIRECTORY = "Books"
FILE_EXTENSION = ".pdf"
PRINT_UNIDENTIFIED_LIST = True
def print_result(good_count, bad_count, unidentified_list):
print("Succesful books: {0}".format(good_count))
print("Bad books: {0}".format(bad_count))
print("Unidentified files: {0}".format(len(unidentified_list)))
if PRINT_UNIDENTIFIED_LIST and unidentified_list:
print(len(unidentified_list))
print(unidentified_list)
def make_list():
with open("list.txt", "w") as output_file:
good_count, bad_count = 0, 0
unidentified_list = []
for file in os.listdir(TARGET_DIRECTORY):
name, ext = os.path.splitext(file)
if not ext.lower() == FILE_EXTENSION:
unidentified_list.append(file)
continue
values = name.split("_")
if len(values) == 3:
output_file.write(" ".join(values) + "\n")
good_count += 1
else:
# Added indentation for improved UX
print(" Could not handle the following file: {0}".format(file))
print(" Check whether it's name is properly formatted.")
bad_count += 1
return good_count, bad_count, unidentified_list
def main():
good_count, bad_count, unidentified_list = make_list()
print_result(good_count, bad_count, unidentified_list)
if __name__ == '__main__':
main()
I had a project where I was using the same sort of workflow as you: I had a list of inputs and needed to split them between different handlers, so I started to use coroutines. I found them very helpful in these situations, and so an example of a basic usage of them in your application could be:
import os
from functools import wraps
TARGET_DIRECTORY = "Books"
FILE_EXTENSION = ".pdf"
PRINT_UNIDENTIFIED_LIST = True
def coroutine(fn):
@wraps(fn)
def inner(*args, **kwargs):
cr = fn(*args, **kwargs)
next(cr)
return cr
return inner
@coroutine
def good_files():
count = 0
try:
with open("list.txt", "w") as output_file:
while True:
_, values = yield
output_file.write(" ".join(values) + "\n")
count += 1
except GeneratorExit:
print("Succesful books: {0}".format(count))
@coroutine
def bad_files():
count = 0
try:
while True:
file, _ = yield
print(" Could not handle the following file: {0}".format(file))
print(" Check whether it's name is properly formatted.")
count += 1
except GeneratorExit:
print("Bad books: {0}".format(count))
@coroutine
def unidentified_files():
files = []
try:
while True:
file, _ = yield
files.append(file)
except GeneratorExit:
print("Unidentified files: {0}".format(len(files)))
if PRINT_UNIDENTIFIED_LIST and files:
print(len(files))
print(files)
def make_list(good_files, bad_files, unidentified_files):
for file in os.listdir(TARGET_DIRECTORY):
name, ext = os.path.splitext(file)
values = name.split("_")
handler = unidentified_files
if ext.lower() == FILE_EXTENSION:
handler = (good_files if len(values) == 3 else bad_files)
handler.send((file, values))
for handler in (good_files, bad_files, unidentified_files):
handler.close()
def main():
make_list(good_files(), bad_files(), unidentified_files())
if __name__ == '__main__':
main()
- Interesting approach. It has some odd side-effects like breaking open wrappers, but I'll definitely look into it. The reason I went for try/except instead of checking length beforehand is it complies better with EAFP. – Commented Dec 7, 2017 at 11:22
- @Mast I'm interested in those side effects, I only had some garbage collection issues from not closing coroutines before. IMO the LBYL approach is easier to understand, as at first I was confused what you were checking against. – Commented Dec 7, 2017 at 11:29