I'm working on a project where the file names contain actual dates, but the data for each date is split across multiple files.
I developed the following program to count the number of files for each date (taken from the filename) and their total size. I'm wondering if the same thing can be written in a better way.
import os
import glob
import os
import collections

directory_name = "\\SpeicifDir\\"

# Get a list of files (file paths) in the given directory
list_of_files = filter( os.path.isfile,
                        glob.glob(directory_name + '*.txt') )

mapOfDateFileSize = collections.defaultdict(list)

# For all the files
for file_path in list_of_files:
    file_size = os.stat(file_path).st_size
    filename = os.path.basename(file_path)
    # Extract the file name and split it using - as a separator
    splitFilename = filename.split('-')
    dtForFile = splitFilename[1] + "-" + splitFilename[2] + "-" + splitFilename[3]
    # Get the file count and size
    if dtForFile in mapOfDateFileSize:
        dataFromDictionary = mapOfDateFileSize[dtForFile]
        dataFromDictionary = dataFromDictionary[0]
        totalCount = dataFromDictionary[0]
        totalSize = dataFromDictionary[1]
        totalCount = totalCount + 1
        totalSize = totalSize + file_size
        # Update the file size and count
        mapOfDateFileSize[dtForFile] = [ (totalCount, totalSize) ]
    else:
        mapOfDateFileSize[dtForFile].append((1, file_size))

# For each date get the total size, total file count
for dt, elements in mapOfDateFileSize.items():
    dataFromDictionary = elements[0]
    totalCount = dataFromDictionary[0]
    totalSize = dataFromDictionary[1]
    print(dt, ",", totalCount, ",", totalSize)
- \$\begingroup\$ Can you give an example of the text files you are looking for? \$\endgroup\$ – N3buchadnezzar, Jan 29, 2022 at 11:37
2 Answers
Default Dictionary
You are using this wrong.
...
mapOfDateFileSize = collections.defaultdict(list)

for file_path in list_of_files:
    ...
    if dtForFile in mapOfDateFileSize:
        ...
        mapOfDateFileSize[dtForFile] = [ (totalCount, totalSize) ]
    else:
        mapOfDateFileSize[dtForFile].append((1, file_size))
...
The whole point of a default dictionary is that you can use mapOfDateFileSize[key] without worrying about whether the key exists or not. Since you explicitly test if dtForFile in mapOfDateFileSize, you could use a regular dictionary and create the list yourself in the else: clause with:
...
    else:
        mapOfDateFileSize[dtForFile] = [(1, file_size)]
...
Unnecessary nesting
dataFromDictionary = mapOfDateFileSize[dtForFile]
dataFromDictionary = dataFromDictionary[0]
totalCount = dataFromDictionary[0]
totalSize = dataFromDictionary[1]
You are storing a tuple, in a one-element list, in a dictionary. What is the purpose of this one-element list? It adds nothing, and it makes everything far more complicated.
if dtForFile in mapOfDateFileSize:
    total_count, total_size = mapOfDateFileSize[dtForFile]
    total_count += 1
    total_size += file_size
    # Update the file size and count
    mapOfDateFileSize[dtForFile] = (total_count, total_size)
else:
    mapOfDateFileSize[dtForFile] = (1, file_size)

# Tuple unpacking even works when iterating through the dictionary!
for dt, (total_count, total_size) in mapOfDateFileSize.items():
    print(dt, ",", total_count, ",", total_size)
A Real Default Dictionary
The if dtForFile in mapOfDateFileSize test is still annoying. You wanted to use a defaultdict. What you need is a default dictionary that returns the tuple (0, 0) when no entry exists.
...
mapOfDateFileSize = collections.defaultdict(lambda: (0, 0))

for file_path in list_of_files:
    ...
    total_count, total_size = mapOfDateFileSize[dtForFile]
    total_count += 1
    total_size += file_size
    # Update the file size and count
    mapOfDateFileSize[dtForFile] = (total_count, total_size)
...
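A further simplification, not strictly part of the approaches above: collections.Counter gives you the zero default and the += update for free, at the price of tracking counts and sizes in two separate mappings. A minimal sketch, with hypothetical sample data standing in for the (date, file size) pairs:

```python
import collections

# Hypothetical (date-from-filename, file size in bytes) pairs
files = [("29-01-2022", 100), ("29-01-2022", 250), ("30-01-2022", 40)]

counts = collections.Counter()
sizes = collections.Counter()
for date, size in files:
    counts[date] += 1     # missing keys act as 0, no membership test needed
    sizes[date] += size

for date in counts:
    print(date, ",", counts[date], ",", sizes[date])
```
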
PEP 8
The Style Guide for Python Code has many guidelines you should follow. Of particular note, variable names should be snake_case, not bumpyWordsCommonlyUsedInOtherProgrammingLanguages, since those are really hard to read.
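For instance, renaming the question's variables (nothing else changed) would look like:

```python
# snake_case renames of the question's names:
map_of_date_file_size = {}          # was: mapOfDateFileSize
dt_for_file = "29-01-2022"          # was: dtForFile (hypothetical value)
total_count, total_size = 1, 4096   # was: totalCount, totalSize
map_of_date_file_size[dt_for_file] = (total_count, total_size)
print(map_of_date_file_size)
```
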
An eye for detail
Importing your file into my editor causes several lines to change, with PEP warnings popping up. I recommend switching to an editor that runs black on save. For instance

totalSize  = dataFromDictionary[1]

should really be

totalSize = dataFromDictionary[1]
This indicates a bit of laziness, and there are several instances of it. Your imports are currently
import os
import glob
import os
import collections
which, after running isort over them, become
import collections
import glob
import os
Similarly, directory_name = "\\SpeicifDir\\" should probably be directory_name = "\\SpecificDir\\", no? And speaking of which, here you suddenly use snake_case when camelCase is used throughout the rest of the code.
Comments
If you find yourself using multiple comments, it is a good indication that your structure needs to change. Use more functions, docstrings, etc. Comments should explain why, not how.
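To illustrate, the commented date-extraction block can become a named function whose docstring carries the why. The function name and the filename layout here are assumptions, since the question never shows an example filename:

```python
import os

def date_from_filename(filename: str) -> str:
    """Return the dd-mm-yyyy part of a name like 'prefix-29-01-2022.txt'.

    The date lives in the filename because one date's data is split
    across several files.
    """
    stem, _ext = os.path.splitext(filename)  # drop the .txt suffix first
    parts = stem.split("-")
    return "-".join(parts[1:4])              # same indices the question uses

print(date_from_filename("prefix-29-01-2022.txt"))  # 29-01-2022
```
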
Modern solutions
The standard way to handle files and paths today is pathlib. However, I think a pandas dataframe would be a more natural container for this type of information. A pandas solution might look like this
import pandas as pd
from pathlib import Path

directory = Path.cwd()  # current working dir
textfiles = get_files(directory, suffix=".txt")
size_by_date, count_by_date = filesizes_and_counts(textfiles)

dates = list(size_by_date.keys())
sizes = list(size_by_date.values())
counts = list(count_by_date.values())

system_info = pd.DataFrame(list(zip(dates, sizes, counts)))
system_info.columns = ["Date", "Filesize", "Files"]
See below for how the functions are defined.
All in all, a modern solution might look like this
from collections import defaultdict
from datetime import datetime
from pathlib import Path
from typing import Generator, Iterable, Optional

MATCH_ANY = "*"

Filesizes = dict[datetime, float]
Filecounts = dict[datetime, int]


def get_files(
    directory: Path, suffix, recursive: bool = False
) -> Generator[Path, None, None]:
    folder_iter = directory.rglob if recursive else directory.glob
    yield from (path for path in folder_iter(f"{MATCH_ANY}{suffix}"))


def filename_date(filename: str, delimiter="-") -> Optional[datetime]:
    split_file = filename.split(delimiter)
    if len(split_file) < 3:
        return None
    try:
        *_, day, month, year = split_file
        return datetime(day=int(day), month=int(month), year=int(year))
    except ValueError:
        return None


def filesizes_and_counts(files: Iterable[Path]) -> tuple[Filesizes, Filecounts]:
    file_sizes = defaultdict(float)
    file_counts = defaultdict(int)
    for path in files:
        # path.stem, not path.name, so ".txt" does not end up in the year
        if (date := filename_date(path.stem)) is None:
            continue
        file_sizes[date] += path.stat().st_size
        file_counts[date] += 1
    return file_sizes, file_counts


if __name__ == "__main__":
    directory = Path.cwd()  # current working dir
    textfiles = get_files(directory, suffix=".txt")
    sizes, counts = filesizes_and_counts(textfiles)
    for date in sizes.keys():
        print(
            f"{date.strftime('%Y-%m-%d')}: "
            f"{sizes[date]:>20} bytes, "
            f"files: {counts[date]}"
        )
EDIT: Here is a full-blown pandas solution, which I think works very nicely. Most of the code is the same, but we let pandas filter our data.
from datetime import datetime
from pathlib import Path
from typing import Generator, Iterable, Optional

import pandas as pd

MATCH_ANY = "*"


def get_files(
    directory: Path, suffix, recursive: bool = False
) -> Generator[Path, None, None]:
    folder_iter = directory.rglob if recursive else directory.glob
    yield from (path for path in folder_iter(f"{MATCH_ANY}{suffix}"))


def filename_date(filename: str, delimiter="-") -> Optional[tuple[str, datetime]]:
    split_file = filename.split(delimiter)
    if len(split_file) < 3:
        return None
    try:
        *name, day, month, year = split_file
        dt = datetime(day=int(day), month=int(month), year=int(year))
        return delimiter.join(name), dt
    except ValueError:
        return None


def info(files: Iterable[Path]) -> Optional[pd.DataFrame]:
    info_list = []
    for path in files:
        if (m := filename_date(path.stem)) is None:
            continue
        name, date = m
        file_info = (date, name, path.stat().st_size, str(path))
        info_list.append(file_info)
    if not info_list:
        return None
    data = pd.DataFrame(info_list)
    data.columns = ["Date", "Name", "Filesize", "Path"]
    data = data.sort_values(by=["Date", "Filesize"], ascending=False)
    return data.reset_index(drop=True)


def group_info_by_date(data: pd.DataFrame):
    group_by_dates = data.groupby(["Date"], as_index=False)
    df_by_dates = group_by_dates["Filesize"].sum()
    df_by_dates["Files"] = group_by_dates.size()["size"]
    df_by_dates = df_by_dates.sort_values(by=["Filesize", "Date"], ascending=False)
    return df_by_dates.reset_index(drop=True)


if __name__ == "__main__":
    directory = Path.cwd()  # current working dir
    textfiles = get_files(directory, suffix=".txt")
    if (data := info(textfiles)) is not None:
        df_by_dates = group_info_by_date(data)
        print(df_by_dates)
        print()
        print(data[["Date", "Name", "Filesize"]])
- \$\begingroup\$ for -> Generator[Path, None, None] you can also write -> Iterator[Path], which is a bit less specific, but looks simpler and is enough in most cases \$\endgroup\$ – njzk2, Jan 29, 2022 at 20:43
- \$\begingroup\$ I'm not yet convinced about using the morse operator at all \$\endgroup\$ – njzk2, Jan 29, 2022 at 20:44
- \$\begingroup\$ the whole split + *name + delimiter.join can be simplified by using split_file = filename.rsplit(delimiter, maxsplit=3) \$\endgroup\$ – njzk2, Jan 29, 2022 at 20:48
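A quick sketch of that rsplit suggestion, on a hypothetical filename stem:

```python
filename = "some-report-29-01-2022"  # hypothetical stem with a hyphen in the name
# maxsplit=3 splits only at the last three hyphens,
# so hyphens inside the name part are preserved
*name, day, month, year = filename.rsplit("-", maxsplit=3)
print(name, day, month, year)  # ['some-report'] 29 01 2022
```
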
- \$\begingroup\$ @njzk2 Morse operator? Otherwise I agree! \$\endgroup\$ – N3buchadnezzar, Jan 29, 2022 at 21:06
- \$\begingroup\$ I meant walrus, sorry \$\endgroup\$ – njzk2, Jan 29, 2022 at 21:50