Backstory
I collect books. A lot of books. Whitepapers, (programming) cookbooks, transcripts, important letters, overviews, you name it. Usually in PDF, otherwise my house would be too small.
To keep track of those books, I use an arcane collection system based on notes referring to notes. This obviously has some scalability problems, so for the past year I've been trying to build a proper, relational database for it.
It still isn't finished...
However, it's a project in parts and the first part has been ready for many moons. It also hasn't changed much over time, so I guess it's ready for review.
The problem
To have a database of books, you want at least 3 things:
- Generate a list from a set of books. A directory goes in, a text file containing all books comes out. Eventually, a program will have to read it to extract keywords from it, but I'll save that for a later version. The pre-loader.
- Insert or update the list into a database. Create the database if it doesn't exist yet. The CRUDder.
- Produce a view into the database for the user. The viewer.
So, I decided to split those 3 things and give each their own program. Eventually it may end up in one, but with explicit separation. Later on it may also need an automated maintenance program as a 4th module.
The database (3NF) should at least contain the name, authors and publishing year of each book. Later versions may include the size in pages, ISBN and keywords. This program does not yet support this, but should've been written with those features in mind.
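Something like this rough sqlite3 sketch is the direction I have in mind for the schema; the table and column names are just placeholders, nothing here is final:

import sqlite3

# Draft 3NF layout: books and authors in separate tables,
# linked through a junction table so a book can have many authors.
SCHEMA = """
CREATE TABLE IF NOT EXISTS book (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    year  INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE IF NOT EXISTS author (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS book_author (
    book_id   INTEGER NOT NULL REFERENCES book(id),
    author_id INTEGER NOT NULL REFERENCES author(id),
    PRIMARY KEY (book_id, author_id)
);
"""

with sqlite3.connect("books.db") as connection:
    connection.executescript(SCHEMA)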
The pre-loader
The pre-loader will take a directory of PDFs. The test set is 222 items large, the final dataset much larger.
For now, only PDFs are supported. If I'd ever want to change that, all I'd have to do is change the following:
if file.lower().endswith(FILE_EXTENSION):
into:
for file_extension in FILE_EXTENSIONS:
    if file.lower().endswith(file_extension):
and break out of the for after handling the first successful hit, but since I'm not sure I'll ever do that, the current, simple version will suffice.
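Alternatively, since str.endswith also accepts a tuple of suffixes, a multi-extension check could stay a one-liner. A rough sketch; the FILE_EXTENSIONS tuple below is hypothetical, I haven't defined it anywhere yet:

import os

TARGET_DIRECTORY = "Books"
# Hypothetical: a tuple of supported extensions instead of a single string.
FILE_EXTENSIONS = (".pdf", ".epub", ".djvu")

for file in os.listdir(TARGET_DIRECTORY):
    # str.endswith accepts a tuple, so no inner loop or break is needed.
    if file.lower().endswith(FILE_EXTENSIONS):
        print(file)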
All PDFs should be in the following format:
String_String_4digits.pdf
An example book would look like this:
SomeLongBookTitle_AuthorName1&AuthorName2_2017.pdf
Where first there's the title of the book, second there's the authors of the book and last there's the publishing year. Author names should be separated by an ampersand; any other character except an underscore will be considered part of an author's name.
If the author of a book is unknown, it will be UNK. If a publishing year is unknown, it will be 0000. Books published before the year 1 anno Domini are not supported.
If the file is not a PDF, it will be unidentified. Such files may or may not be a problem, but a list is kept. For example, a link may have been created instead of a copy, resulting in a .lnk instead of a .pdf.
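Since the scheme is this rigid, the name could also be validated with a single regular expression. A rough sketch of my reading of the rules above, not part of the current program:

import re

# Title and authors: anything but an underscore; year: exactly four digits.
BOOK_PATTERN = re.compile(r"^([^_]+)_([^_]+)_(\d{4})\.pdf$", re.IGNORECASE)

match = BOOK_PATTERN.match("SomeLongBookTitle_AuthorName1&AuthorName2_2017.pdf")
if match:
    title, authors, year = match.groups()
    print(title, authors.split("&"), year)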
File names can be very long. Example:
FullyIntegratedMultipurposeLow-CostK-BandFMCWRadarModuleWithSub-milimeterMeasurementPrecision_PiotrKaminski&IzabelaSlomian&KrzysztofWincza&SlawomirGruszczynski_2015.pdf
Book name: 93 characters. File name (incl. extension): 168 characters.
EvaluationOfBatteryAndBatteryChargerShort-CircuitCurrentContributionsToAFaultOnTheDcDistributionSystemAtANuclearPowerPlant_BrookhavenNationalLaboratory_2015.pdf
Book name: 122 characters. File name (incl. extension): 160 characters.
There can be a lot of authors:
TROPOMIATBDOfTheTotalAndTroposphericNO2DataProducts_JHGMVanGefffen&KFBoersma&HJEskes&JDMaasakkers&JPVeefkind_2016.pdf
5 authors.
(That, by the way, is an unfortunate example of improper naming. An artefact of being inconsistent and lacking a thorough naming scheme, but irrelevant for the database.)
The current idea is the program will deliver a list formatted like this:
EffectiveSTL ScottMeyers 0000
ElectricalEngineersReferenceBook MALaughton&DFWarne 2003
ElectroMagneticComplianceWhatIsEFT Microchip 2004
A better approach would perhaps split the authors here already, so the next step doesn't have to do it. That seems the right thing to do. Perhaps even turn the whole list into a JSON file instead of a text file.
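A rough sketch of what that JSON variant could look like, using the rows above as input; the field names are just a first guess:

import json

# (title, authors, year) tuples as produced by the current pre-loader.
books = [
    ("EffectiveSTL", "ScottMeyers", "0000"),
    ("ElectricalEngineersReferenceBook", "MALaughton&DFWarne", "2003"),
    ("ElectroMagneticComplianceWhatIsEFT", "Microchip", "2004"),
]

records = [
    {"title": title, "authors": authors.split("&"), "year": int(year)}
    for title, authors, year in books
]

with open("list.json", "w") as output_file:
    json.dump(records, output_file, indent=2)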
Code
import os
TARGET_DIRECTORY = "Books"
FILE_EXTENSION = ".pdf"
PRINT_UNIDENTIFIED_LIST = True
def print_result(good_count, bad_count, unidentified_list):
length_unidentified_list = len(unidentified_list)
print("Succesful books: {0}".format(good_count))
print("Bad books: {0}".format(bad_count))
print("Unidentified files: {0}".format(length_unidentified_list))
if PRINT_UNIDENTIFIED_LIST and (length_unidentified_list > 0):
print(length_unidentified_list)
print(unidentified_list)
def make_list():
with open("list.txt", 'w') as output_file:
good_count, bad_count = 0, 0
unidentified_list = []
for file in os.listdir(TARGET_DIRECTORY):
# Check against .lower() to also catch
# accidental uppercased file extension
if file.lower().endswith(FILE_EXTENSION):
try:
book_name, author_name, publishing_year = \
file[:-len(FILE_EXTENSION)].split("_")
#print(book_name, author_name, publishing_year)
output_file.write(
"{0} {1} {2}\n".format(
book_name, author_name, publishing_year
)
)
good_count += 1
except ValueError:
# Added indentation for improved UX
print(" Could not handle the following file: {0}".format(file))
print(" Check whether it's name is properly formatted.")
bad_count += 1
else:
unidentified_list += [file]
print_result(good_count, bad_count, unidentified_list)
def main():
make_list()
if __name__ == '__main__':
main()
Yes, I know one of my lines is slightly too long according to PEP 8. That's the result of the make_list function failing the Single Responsibility Principle and a lack of helper functions.
If that gets fixed, the offending line will probably be fixed in the process.
That's also the reason there are no docstrings yet. The function structure as-is is flawed, even though the current function names are very descriptive.
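For illustration only, pulling the parsing into a helper (parse_book is a placeholder name) would already take care of the offending line, even though it's not the full refactor:

FILE_EXTENSION = ".pdf"

def parse_book(file_name):
    # Unpacking enforces exactly three parts; anything else raises ValueError.
    book_name, author_name, publishing_year = \
        file_name[:-len(FILE_EXTENSION)].split("_")
    return book_name, author_name, publishing_year

print(parse_book("EffectiveSTL_ScottMeyers_0000.pdf"))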
I've tried putting the results in a dictionary, like this:
result = {
'good_count': 0,
'bad_count': 0,
'unidentified_list': []
}
But it didn't improve readability one bit. I've also tried putting everything in a module and using more specialized functions. That got very messy, very fast.
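A namedtuple might read a little better than the bare dict for the returned result, although the counting would still happen in plain locals since namedtuples are immutable. A quick sketch:

from collections import namedtuple

# Named fields instead of dict-key strings scattered through the code.
Result = namedtuple("Result", ["good_count", "bad_count", "unidentified_list"])

def summarise():
    # Hypothetical numbers, just to show the shape of the return value.
    good_count, bad_count, unidentified_list = 2, 1, ["shortcut.lnk"]
    return Result(good_count, bad_count, unidentified_list)

result = summarise()
print(result.good_count, result.unidentified_list)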
Anything and everything is up for review, including the design of the whole process.
The current execution time of the program is low enough that I'm not worried, even for larger amounts of data.
1 Answer
I would:
- Change make_list to return good_count, bad_count and unidentified_list, rather than call print_result. This allows you to change the way that they are displayed at a later date.
- If you invert file.lower().endswith(FILE_EXTENSION), you can use a guard clause, reducing indentation and improving readability.
- You don't need to use the try:
  - Your format can be replaced with ' '.join(), with an addition of '\n'.
  - You can check if file[:-len(FILE_EXTENSION)].split("_") has a length of three.
- You should stick to one apostrophe type, unless switching to another causes the string to be easier to read. Take open("list.txt", 'w'), as opposed to 'it' and "it's".
- If you don't need to handle nested file types, such as .tar.gz, then you can use os.path.splitext to simplify getting the name and extension.
Creating:
import os
TARGET_DIRECTORY = "Books"
FILE_EXTENSION = ".pdf"
PRINT_UNIDENTIFIED_LIST = True
def print_result(good_count, bad_count, unidentified_list):
print("Succesful books: {0}".format(good_count))
print("Bad books: {0}".format(bad_count))
print("Unidentified files: {0}".format(len(unidentified_list)))
if PRINT_UNIDENTIFIED_LIST and unidentified_list:
print(len(unidentified_list))
print(unidentified_list)
def make_list():
with open("list.txt", "w") as output_file:
good_count, bad_count = 0, 0
unidentified_list = []
for file in os.listdir(TARGET_DIRECTORY):
name, ext = os.path.splitext(file)
if not ext.lower() == FILE_EXTENSION:
unidentified_list.append(file)
continue
values = name.split("_")
if len(values) == 3:
output_file.write(" ".join(values) + "\n")
good_count += 1
else:
# Added indentation for improved UX
print(" Could not handle the following file: {0}".format(file))
print(" Check whether it's name is properly formatted.")
bad_count += 1
return good_count, bad_count, unidentified_list
def main():
good_count, bad_count, unidentified_list = make_list()
print_result(good_count, bad_count, unidentified_list)
if __name__ == '__main__':
main()
I had a project where I was using the same sort of workflow as you: I had a list of inputs and needed to split them between different handlers, so I started to use coroutines. I found them very helpful in these situations, and so an example of a basic usage of them in your application could be:
import os
from functools import wraps
TARGET_DIRECTORY = "Books"
FILE_EXTENSION = ".pdf"
PRINT_UNIDENTIFIED_LIST = True
def coroutine(fn):
@wraps(fn)
def inner(*args, **kwargs):
cr = fn(*args, **kwargs)
next(cr)
return cr
return inner
@coroutine
def good_files():
count = 0
try:
with open("list.txt", "w") as output_file:
while True:
_, values = yield
output_file.write(" ".join(values) + "\n")
count += 1
except GeneratorExit:
print("Succesful books: {0}".format(count))
@coroutine
def bad_files():
count = 0
try:
while True:
file, _ = yield
print(" Could not handle the following file: {0}".format(file))
print(" Check whether it's name is properly formatted.")
count += 1
except GeneratorExit:
print("Bad books: {0}".format(count))
@coroutine
def unidentified_files():
files = []
try:
while True:
file, _ = yield
files.append(file)
except GeneratorExit:
print("Unidentified files: {0}".format(len(files)))
if PRINT_UNIDENTIFIED_LIST and files:
print(len(files))
print(files)
def make_list(good_files, bad_files, unidentified_files):
for file in os.listdir(TARGET_DIRECTORY):
name, ext = os.path.splitext(file)
values = name.split("_")
handler = unidentified_files
if ext.lower() == FILE_EXTENSION:
handler = (good_files if len(values) == 3 else bad_files)
handler.send((file, values))
for handler in (good_files, bad_files, unidentified_files):
handler.close()
def main():
make_list(good_files(), bad_files(), unidentified_files())
if __name__ == '__main__':
main()
- Interesting approach. It has some odd side-effects like breaking open wrappers, but I'll definitely look into it. The reason I went for try/except instead of checking length beforehand is it complies better with EAFP. – Commented Dec 7, 2017 at 11:22
- @Mast I'm interested in those side effects, I only had some garbage collection issues from not closing coroutines before. IMO the LBYL approach is easier to understand, as at first I was confused what you were checking against. – Commented Dec 7, 2017 at 11:29