I'm working on a project where the file names contain actual dates, but the data for each date is split across multiple files.

I developed the following program to count the number of files for each date (part of the filename) and also their total size. I'm wondering if it can be written in a better way.

import os
import glob
import os
import collections

directory_name = "\\SpeicifDir\\"

# Get a list of files (file paths) in the given directory
list_of_files = filter(os.path.isfile,
                       glob.glob(directory_name + '*.txt'))

mapOfDateFileSize = collections.defaultdict(list)

# For all the files
for file_path in list_of_files:
    file_size = os.stat(file_path).st_size
    filename = os.path.basename(file_path)
    # Extract the file and split the file using - as a separator
    splitFilename = filename.split('-')
    dtForFile = splitFilename[1] + "-" + splitFilename[2] + "-" + splitFilename[3]
    # Get the file name and size
    if dtForFile in mapOfDateFileSize:
        dataFromDictionary = mapOfDateFileSize[dtForFile]
        dataFromDictionary = dataFromDictionary[0]
        totalCount = dataFromDictionary[0]
        totalSize = dataFromDictionary[1]
        totalCount = totalCount + 1
        totalSize = totalSize + file_size
        # Update the file size and count
        mapOfDateFileSize[dtForFile] = [(totalCount, totalSize)]
    else:
        mapOfDateFileSize[dtForFile].append((1, file_size))

# For each date get the total size, total file count
for dt, elements in mapOfDateFileSize.items():
    dataFromDictionary = elements[0]
    totalCount = dataFromDictionary[0]
    totalSize = dataFromDictionary[1]
    print(dt, ",", totalCount, ",", totalSize)
asked Jan 28, 2022 at 22:23
  • Can you give an example of the text files you are looking for? — Commented Jan 29, 2022 at 11:37

2 Answers


Default Dictionary

You are using this wrong.

...
mapOfDateFileSize = collections.defaultdict(list)
for file_path in list_of_files:
    ...
    if dtForFile in mapOfDateFileSize:
        ...
        mapOfDateFileSize[dtForFile] = [(totalCount, totalSize)]
    else:
        mapOfDateFileSize[dtForFile].append((1, file_size))
...
...

The whole point of a default dictionary is that you can use mapOfDateFileSize[key] without worrying about whether the key exists. Since you explicitly test if dtForFile in mapOfDateFileSize, you could use a regular dictionary and create the list yourself in the else: clause with:

...
    else:
        mapOfDateFileSize[dtForFile] = [(1, file_size)]
...
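To make the difference concrete, here is a minimal sketch contrasting the two approaches (the key and values are made-up stand-ins for the question's date/size data):

```python
from collections import defaultdict

# Plain dict: the else-branch has to create the list itself.
plain = {}
key = "28-01-2022"
if key in plain:
    plain[key].append((1, 100))
else:
    plain[key] = [(1, 100)]

# defaultdict(list): a missing key silently becomes an empty list,
# so the membership test and the else-branch disappear entirely.
auto = defaultdict(list)
auto[key].append((1, 100))

print(plain[key] == auto[key])  # True
```

If you keep the explicit `in` test, the defaultdict's default factory is never triggered, which is why a plain dict suffices there.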

Unnecessary nesting

        dataFromDictionary = mapOfDateFileSize[dtForFile]
        dataFromDictionary = dataFromDictionary[0]
        totalCount = dataFromDictionary[0]
        totalSize = dataFromDictionary[1]

You are storing a tuple, in a one-element list, in a dictionary. What is the purpose of this one-element list? It is not adding anything, and making things way more complicated.

    if dtForFile in mapOfDateFileSize:
        total_count, total_size = mapOfDateFileSize[dtForFile]
        total_count += 1
        total_size += file_size
        # Update the file size and count
        mapOfDateFileSize[dtForFile] = (total_count, total_size)
    else:
        mapOfDateFileSize[dtForFile] = (1, file_size)

# Tuple unpacking even works when iterating through the dictionary!
for dt, (total_count, total_size) in mapOfDateFileSize.items():
    print(dt, ",", total_count, ",", total_size)

A Real Default Dictionary

The if dtForFile in mapOfDateFileSize test is still annoying. You wanted to use a defaultdict. What you need is a defaultdict that returns the tuple (0, 0) when no entry exists.

...
mapOfDateFileSize = collections.defaultdict(lambda: (0, 0))
for file_path in list_of_files:
    ...
    total_count, total_size = mapOfDateFileSize[dtForFile]
    total_count += 1
    total_size += file_size
    # Update the file size and count
    mapOfDateFileSize[dtForFile] = (total_count, total_size)
...
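Putting this together, a self-contained sketch might look like the following. The in-memory (filename, size) list is a hypothetical stand-in for the real glob/os.stat calls, and the prefix-DD-MM-YYYY.txt filename layout is an assumption inferred from the question's split('-') indexing:

```python
import collections

# Hypothetical (filename, size) pairs standing in for glob.glob /
# os.stat results; the prefix-DD-MM-YYYY.txt layout is assumed.
files = [
    ("app-28-01-2022.txt", 100),
    ("db-28-01-2022.txt", 50),
    ("app-29-01-2022.txt", 70),
]

totals = collections.defaultdict(lambda: (0, 0))
for filename, size in files:
    stem = filename.rsplit(".", 1)[0]      # drop the extension
    date = "-".join(stem.split("-")[1:4])  # DD-MM-YYYY
    count, total = totals[date]
    totals[date] = (count + 1, total + size)

for date, (count, total) in totals.items():
    print(date, count, total)
```

Because the default factory hands back (0, 0) for an unseen date, the first file of each date falls through the same update path as every later one.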

PEP 8

The Style Guide for Python Code has many guidelines you should follow. Of particular note, variable names should be snake_case, not bumpyWordsCommonlyUsedInOtherProgrammingLanguages since they are really hard to read.

answered Jan 28, 2022 at 23:27

An eye for detail

Opening your file in my editor causes several lines to change, and PEP warnings pop up. I recommend switching to an editor that runs black on save. For instance

    totalSize  = dataFromDictionary[1]

Should really be

    totalSize = dataFromDictionary[1]

This indicates that you are being a bit lazy. There are several instances of this. Your imports are this

import os
import glob
import os
import collections

Which, after running isort over it, becomes this

import collections
import glob
import os

Similarly, directory_name = "\\SpeicifDir\\" should probably be directory_name = "\\SpecificDir\\", no? And speaking of, here you suddenly use snake_case while throughout the rest of the code camelCase is used?

Comments

If you find yourself using multiple comments, it is a good indication that your structure needs to change. Use more functions, docstrings, etc. Comments should explain why, not how.
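For instance, the question's "# Extract the file and split the file using - as a separator" comment could become a small named function with a docstring (the helper name and the prefix-DD-MM-YYYY.txt layout here are illustrative assumptions, not the OP's actual format):

```python
def date_from_filename(filename: str) -> str:
    """Return the DD-MM-YYYY part of a name like 'app-28-01-2022.txt'.

    The function name now says *what* happens at the call site, so
    no '# Extract ... split ...' comment explaining *how* is needed.
    """
    stem = filename.rsplit(".", 1)[0]
    return "-".join(stem.split("-")[1:4])

print(date_from_filename("app-28-01-2022.txt"))  # 28-01-2022
```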

Modern solutions

The standard way to handle files and paths today is pathlib. However, I think a pandas DataFrame would be a more natural container for this type of information.

A pandas solution might look like this

from pathlib import Path

import pandas as pd

# get_files and filesizes_and_counts are defined below
directory = Path.cwd()  # current working dir
textfiles = get_files(directory, suffix=".txt")
size_by_date, count_by_date = filesizes_and_counts(textfiles)

dates = list(size_by_date.keys())
sizes = list(size_by_date.values())
counts = list(count_by_date.values())

system_info = pd.DataFrame(list(zip(dates, sizes, counts)))
system_info.columns = ["Date", "Filesize", "Files"]

See below for how the functions are defined.

All in all a modern solution might look like this

from collections import defaultdict
from datetime import datetime
from pathlib import Path
from typing import Generator, Iterable, Optional

Filesizes = dict[datetime, float]
Filecounts = dict[datetime, int]

MATCH_ANY = "*"


def get_files(
    directory: Path, suffix, recursive: bool = False
) -> Generator[Path, None, None]:
    folder_iter = directory.rglob if recursive else directory.glob
    yield from (path for path in folder_iter(f"{MATCH_ANY}{suffix}"))


def filename_date(filename: str, delimiter="-") -> Optional[datetime]:
    split_file = filename.split(delimiter)
    if len(split_file) < 3:
        return None
    try:
        *_, day, month, year = split_file
        return datetime(day=int(day), month=int(month), year=int(year))
    except ValueError:
        return None


def filesizes_and_counts(files: Iterable[Path]) -> tuple[Filesizes, Filecounts]:
    file_sizes = defaultdict(float)
    file_counts = defaultdict(int)
    for path in files:
        # Use path.stem so the '.txt' suffix does not end up in the year
        if (date := filename_date(path.stem)) is None:
            continue
        file_sizes[date] += path.stat().st_size
        file_counts[date] += 1
    return file_sizes, file_counts


if __name__ == "__main__":
    directory = Path.cwd()  # current working dir
    textfiles = get_files(directory, suffix=".txt")
    sizes, counts = filesizes_and_counts(textfiles)

    dates = sizes.keys()
    for date in dates:
        print(
            f"{date.strftime('%Y-%m-%d')}: "
            f"{sizes[date]:>20} bytes, "
            f"files: {counts[date]}"
        )

EDIT: Here is a full blown pandas solution which I think works very nicely


Most of the code is the same, but we let pandas filter our data.

from datetime import datetime
from pathlib import Path
from typing import Generator, Iterable, Optional

import pandas as pd

MATCH_ANY = "*"


def get_files(
    directory: Path, suffix, recursive: bool = False
) -> Generator[Path, None, None]:
    folder_iter = directory.rglob if recursive else directory.glob
    yield from (path for path in folder_iter(f"{MATCH_ANY}{suffix}"))


def filename_date(filename: str, delimiter="-") -> Optional[tuple[str, datetime]]:
    split_file = filename.split(delimiter)
    if len(split_file) < 3:
        return None
    try:
        *name, day, month, year = split_file
        dt = datetime(day=int(day), month=int(month), year=int(year))
        return delimiter.join(name), dt
    except ValueError:
        return None


def info(files: Iterable[Path]) -> Optional[pd.DataFrame]:
    info_list = []
    for path in files:
        if (m := filename_date(path.stem)) is None:
            continue
        name, date = m
        file_info = (date, name, path.stat().st_size, str(path))
        info_list.append(file_info)
    if not info_list:
        return None
    data = pd.DataFrame(info_list)
    data.columns = ["Date", "Name", "Filesize", "Path"]
    data = data.sort_values(by=["Date", "Filesize"], ascending=False)
    return data.reset_index(drop=True)


def group_info_by_date(data: pd.DataFrame):
    group_by_dates = data.groupby(["Date"], as_index=False)
    df_by_dates = group_by_dates["Filesize"].sum()
    df_by_dates["Files"] = group_by_dates.size()["size"]
    df_by_dates = df_by_dates.sort_values(by=["Filesize", "Date"], ascending=False)
    return df_by_dates.reset_index(drop=True)


if __name__ == "__main__":
    directory = Path.cwd()  # current working dir
    textfiles = get_files(directory, suffix=".txt")
    if (data := info(textfiles)) is not None:
        df_by_dates = group_info_by_date(data)
        print(df_by_dates)
        print()
        print(data[["Date", "Name", "Filesize"]])
answered Jan 29, 2022 at 12:37
  • for -> Generator[Path, None, None] you can also write -> Iterator[Path], which is a bit less specific, but looks simpler and is enough in most cases — Commented Jan 29, 2022 at 20:43
  • I'm not yet convinced about using the morse operator at all — Commented Jan 29, 2022 at 20:44
  • the whole split + *name + delimiter.join can be simplified by using split_file = filename.rsplit(delimiter, maxsplit=3) — Commented Jan 29, 2022 at 20:48
  • @njzk2 Morse operator? Otherwise I agree! — Commented Jan 29, 2022 at 21:06
  • I meant walrus, sorry — Commented Jan 29, 2022 at 21:50
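The rsplit simplification suggested in the comments can be sketched like this (the sample filename is made up; maxsplit=3 splits from the right at most three times, so a hyphenated prefix survives as one piece):

```python
# rsplit from the right: the name keeps its internal hyphens, so the
# *name unpacking plus delimiter.join(name) dance is no longer needed.
name, day, month, year = "my-report-28-01-2022".rsplit("-", maxsplit=3)
print(name, day, month, year)  # my-report 28 01 2022
```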
