I'm working on a project where the file names contain actual dates, but the data for each date is split across multiple files.
I developed the following program to count the number of files for each date (taken from the filename) and their total size. I'm wondering if the same thing can be written in a better way.
import os
import glob
import os
import collections

directory_name = "\\SpeicifDir\\"

# Get a list of files (file paths) in the given directory
list_of_files = filter( os.path.isfile,
                        glob.glob(directory_name + '*.txt') )

mapOfDateFileSize = collections.defaultdict(list)

# For all the files
for file_path in list_of_files:
    file_size = os.stat(file_path).st_size
    filename = os.path.basename(file_path)
    # Extract the file name and split it using - as a separator
    splitFilename = filename.split('-')
    dtForFile = splitFilename[1] + "-" + splitFilename[2] + "-" + splitFilename[3]
    # Get the file count and size
    if dtForFile in mapOfDateFileSize:
        dataFromDictionary = mapOfDateFileSize[dtForFile]
        dataFromDictionary = dataFromDictionary[0]
        totalCount = dataFromDictionary[0]
        totalSize = dataFromDictionary[1]
        totalCount = totalCount + 1
        totalSize = totalSize + file_size
        # Update the file size and count
        mapOfDateFileSize[dtForFile] = [ (totalCount, totalSize) ]
    else:
        mapOfDateFileSize[dtForFile].append((1, file_size))

# For each date get the total size, total file count
for dt, elements in mapOfDateFileSize.items():
    dataFromDictionary = elements[0]
    totalCount = dataFromDictionary[0]
    totalSize = dataFromDictionary[1]
    print(dt, ",", totalCount, ",", totalSize)
- \$\begingroup\$ Can you give an example of the text files you are looking for? \$\endgroup\$ – N3buchadnezzar, Jan 29, 2022 at 11:37
2 Answers
Default Dictionary
You are using this wrong.
...
mapOfDateFileSize = collections.defaultdict(list)

for file_path in list_of_files:
    ...
    if dtForFile in mapOfDateFileSize:
        ...
        mapOfDateFileSize[dtForFile] = [ (totalCount, totalSize) ]
    else:
        mapOfDateFileSize[dtForFile].append((1, file_size))
...
The whole point of a default dictionary is that you can use mapOfDateFileSize[key] without worrying about whether the key exists or not. Since you explicitly test if dtForFile in mapOfDateFileSize, you could use a regular dictionary and create the list yourself in the else: clause with:
...
    else:
        mapOfDateFileSize[dtForFile] = [(1, file_size)]
...
Unnecessary nesting
dataFromDictionary = mapOfDateFileSize[dtForFile]
dataFromDictionary = dataFromDictionary[0]
totalCount = dataFromDictionary[0]
totalSize = dataFromDictionary[1]
You are storing a tuple, in a one-element list, in a dictionary. What is the purpose of this one-element list? It adds nothing, and it makes everything far more complicated.
if dtForFile in mapOfDateFileSize:
    total_count, total_size = mapOfDateFileSize[dtForFile]
    total_count += 1
    total_size += file_size
    # Update the file size and count
    mapOfDateFileSize[dtForFile] = (total_count, total_size)
else:
    mapOfDateFileSize[dtForFile] = (1, file_size)

# Tuple unpacking even works when iterating through the dictionary!
for dt, (total_count, total_size) in mapOfDateFileSize.items():
    print(dt, ",", total_count, ",", total_size)
A Real Default Dictionary
The if dtForFile in mapOfDateFileSize test is still annoying. You wanted to use a defaultdict. What you need is a default dictionary that returns the tuple (0, 0) when no entry exists.
...
mapOfDateFileSize = collections.defaultdict(lambda: (0, 0))

for file_path in list_of_files:
    ...
    total_count, total_size = mapOfDateFileSize[dtForFile]
    total_count += 1
    total_size += file_size
    # Update the file size and count
    mapOfDateFileSize[dtForFile] = (total_count, total_size)
...
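A further simplification, not strictly part of the approaches above: collections.Counter gives you the zero default and the += update for free, at the price of tracking counts and sizes in two separate mappings. A minimal sketch, with hypothetical sample data standing in for the (date, file size) pairs:

```python
import collections

# Hypothetical (date-from-filename, file size in bytes) pairs
files = [("29-01-2022", 100), ("29-01-2022", 250), ("30-01-2022", 40)]

counts = collections.Counter()
sizes = collections.Counter()
for date, size in files:
    counts[date] += 1     # missing keys act as 0, no membership test needed
    sizes[date] += size

for date in counts:
    print(date, ",", counts[date], ",", sizes[date])
```
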
PEP 8
The Style Guide for Python Code has many guidelines you should follow. Of particular note, variable names should be snake_case, not bumpyWordsCommonlyUsedInOtherProgrammingLanguages, since those are really hard to read.
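For instance, renaming the question's variables (nothing else changed) would look like:

```python
# snake_case renames of the question's names:
map_of_date_file_size = {}          # was: mapOfDateFileSize
dt_for_file = "29-01-2022"          # was: dtForFile (hypothetical value)
total_count, total_size = 1, 4096   # was: totalCount, totalSize
map_of_date_file_size[dt_for_file] = (total_count, total_size)
print(map_of_date_file_size)
```
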
An eye for detail
Importing your file into my editor causes several lines to change, with PEP warnings popping up. I recommend switching to an editor that runs black on save. For instance

totalSize  = dataFromDictionary[1]

should really be

totalSize = dataFromDictionary[1]
This indicates a bit of laziness, and there are several instances of it. Your imports are currently
import os
import glob
import os
import collections
which, after running isort over them, become
import collections
import glob
import os
Similarly, directory_name = "\\SpeicifDir\\" should probably be directory_name = "\\SpecificDir\\", no? And speaking of which, here you suddenly use snake_case when camelCase is used throughout the rest of the code.
Comments
If you find yourself using multiple comments, it is a good indication that your structure needs to change. Use more functions, docstrings, etc. Comments should explain why, not how.
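To illustrate, the commented date-extraction block can become a named function whose docstring carries the why. The function name and the filename layout here are assumptions, since the question never shows an example filename:

```python
import os

def date_from_filename(filename: str) -> str:
    """Return the dd-mm-yyyy part of a name like 'prefix-29-01-2022.txt'.

    The date lives in the filename because one date's data is split
    across several files.
    """
    stem, _ext = os.path.splitext(filename)  # drop the .txt suffix first
    parts = stem.split("-")
    return "-".join(parts[1:4])              # same indices the question uses

print(date_from_filename("prefix-29-01-2022.txt"))  # 29-01-2022
```
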
Modern solutions
The standard way to handle files and paths today is pathlib. However, I think a pandas dataframe would be a more natural container for this type of information. A pandas solution might look like this
import pandas as pd
from pathlib import Path

directory = Path.cwd()  # current working dir
textfiles = get_files(directory, suffix=".txt")
size_by_date, count_by_date = filesizes_and_counts(textfiles)

dates = list(size_by_date.keys())
sizes = list(size_by_date.values())
counts = list(count_by_date.values())

system_info = pd.DataFrame(list(zip(dates, sizes, counts)))
system_info.columns = ["Date", "Filesize", "Files"]
See below for how the functions are defined.
All in all, a modern solution might look like this
from collections import defaultdict
from datetime import datetime
from pathlib import Path
from typing import Generator, Iterable, Optional

MATCH_ANY = "*"

Filesizes = dict[datetime, float]
Filecounts = dict[datetime, int]


def get_files(
    directory: Path, suffix, recursive: bool = False
) -> Generator[Path, None, None]:
    folder_iter = directory.rglob if recursive else directory.glob
    yield from (path for path in folder_iter(f"{MATCH_ANY}{suffix}"))


def filename_date(filename: str, delimiter="-") -> Optional[datetime]:
    split_file = filename.split(delimiter)
    if len(split_file) < 3:
        return None
    try:
        *_, day, month, year = split_file
        return datetime(day=int(day), month=int(month), year=int(year))
    except ValueError:
        return None


def filesizes_and_counts(files: Iterable[Path]) -> tuple[Filesizes, Filecounts]:
    file_sizes = defaultdict(float)
    file_counts = defaultdict(int)
    for path in files:
        # path.stem, not path.name, so ".txt" does not end up in the year
        if (date := filename_date(path.stem)) is None:
            continue
        file_sizes[date] += path.stat().st_size
        file_counts[date] += 1
    return file_sizes, file_counts


if __name__ == "__main__":
    directory = Path.cwd()  # current working dir
    textfiles = get_files(directory, suffix=".txt")
    sizes, counts = filesizes_and_counts(textfiles)
    for date in sizes.keys():
        print(
            f"{date.strftime('%Y-%m-%d')}: "
            f"{sizes[date]:>20} bytes, "
            f"files: {counts[date]}"
        )
EDIT: Here is a full-blown pandas solution, which I think works very nicely. Most of the code is the same, but we let pandas filter our data.
from datetime import datetime
from pathlib import Path
from typing import Generator, Iterable, Optional

import pandas as pd

MATCH_ANY = "*"


def get_files(
    directory: Path, suffix, recursive: bool = False
) -> Generator[Path, None, None]:
    folder_iter = directory.rglob if recursive else directory.glob
    yield from (path for path in folder_iter(f"{MATCH_ANY}{suffix}"))


def filename_date(filename: str, delimiter="-") -> Optional[tuple[str, datetime]]:
    split_file = filename.split(delimiter)
    if len(split_file) < 3:
        return None
    try:
        *name, day, month, year = split_file
        dt = datetime(day=int(day), month=int(month), year=int(year))
        return delimiter.join(name), dt
    except ValueError:
        return None


def info(files: Iterable[Path]) -> Optional[pd.DataFrame]:
    info_list = []
    for path in files:
        if (m := filename_date(path.stem)) is None:
            continue
        name, date = m
        file_info = (date, name, path.stat().st_size, str(path))
        info_list.append(file_info)
    if not info_list:
        return None
    data = pd.DataFrame(info_list)
    data.columns = ["Date", "Name", "Filesize", "Path"]
    data = data.sort_values(by=["Date", "Filesize"], ascending=False)
    return data.reset_index(drop=True)


def group_info_by_date(data: pd.DataFrame):
    group_by_dates = data.groupby(["Date"], as_index=False)
    df_by_dates = group_by_dates["Filesize"].sum()
    df_by_dates["Files"] = group_by_dates.size()["size"]
    df_by_dates = df_by_dates.sort_values(by=["Filesize", "Date"], ascending=False)
    return df_by_dates.reset_index(drop=True)


if __name__ == "__main__":
    directory = Path.cwd()  # current working dir
    textfiles = get_files(directory, suffix=".txt")
    if (data := info(textfiles)) is not None:
        df_by_dates = group_info_by_date(data)
        print(df_by_dates)
        print()
        print(data[["Date", "Name", "Filesize"]])
- \$\begingroup\$ for -> Generator[Path, None, None] you can also write -> Iterator[Path], which is a bit less specific, but looks simpler and is enough in most cases \$\endgroup\$ – njzk2, Jan 29, 2022 at 20:43
- \$\begingroup\$ I'm not yet convinced about using the morse operator at all \$\endgroup\$ – njzk2, Jan 29, 2022 at 20:44
- \$\begingroup\$ the whole split + *name + delimiter.join can be simplified by using split_file = filename.rsplit(delimiter, maxsplit=3) \$\endgroup\$ – njzk2, Jan 29, 2022 at 20:48
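A quick sketch of that rsplit suggestion, on a hypothetical filename stem:

```python
filename = "some-report-29-01-2022"  # hypothetical stem with a hyphen in the name
# maxsplit=3 splits only at the last three hyphens,
# so hyphens inside the name part are preserved
*name, day, month, year = filename.rsplit("-", maxsplit=3)
print(name, day, month, year)  # ['some-report'] 29 01 2022
```
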
- \$\begingroup\$ @njzk2 Morse operator? Otherwise I agree! \$\endgroup\$ – N3buchadnezzar, Jan 29, 2022 at 21:06
- \$\begingroup\$ I meant walrus, sorry \$\endgroup\$ – njzk2, Jan 29, 2022 at 21:50