I want to check for null values in Excel files spread across many subfolders. I wrote a Python script that checks for null values in the Excel files in each and every folder, and it works fine, but I would like to simplify the code.
import os
import numpy as np
import pandas as pd
import logging
path =r"C:\Users\Documents\python\Emp"
files = os.listdir(path)
for filename in files:
    pathchild = path + '\\' + filename
    file = os.listdir(pathchild)
    files_xls = [f for f in file if f[-4:] == 'xlsx']
    child_folders = [f for f in file if f[-4:] != 'xlsx']
    for f1 in files_xls:
        filepath = pathchild + '\\' + f1
        df = pd.read_excel(filepath, engine='openpyxl')
        count_nan = df.isnull().sum().sum()
        logging.basicConfig(filename='nanTest.log', level=logging.INFO, format='%(message)s')
        if count_nan == 0:
            none = ' No None values in the Excel file'
        else:
            none = ' None values found in the Excel file'
        output = 'File name: ' + f1, 'File path: ' + filepath, none
        logging.info(output)
    for f2 in child_folders:
        patha = pathchild + '\\' + f2
        file1 = os.listdir(patha)
        files_xls1 = [f for f in file1 if f[-4:] == 'xlsx']
        for f3 in files_xls1:
            filechildpath = patha + '\\' + f3
            df = pd.read_excel(filechildpath, engine='openpyxl')
            count_nan = df.isnull().sum().sum()
            if count_nan == 0:
                none = ' No None values in the Excel file'
            else:
                none = ' None values found in the Excel file'
            output = 'File name: ' + f3, 'File path: ' + filechildpath, none
            logging.info(output)
1 Answer
Use print() or write() for normal program output. The logging module is generally for recording information about how the program is running, e.g., for debugging, performance monitoring, etc.
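For example (a rough sketch; the file names here are placeholders):

import logging

# logging: a diagnostic record of how the program is running
logging.basicConfig(filename='debug.log', level=logging.DEBUG)
logging.debug('opened workbook, starting null scan')

# print()/write(): the program's actual output
print('emp.xlsx: 3 null values')
with open('results.txt', 'wt') as output:
    output.write('emp.xlsx: 3 null values\n')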
I'd recommend using pathlib from the standard library. Path.rglob() would be useful: it recursively searches directories for files whose names match a pattern, such as "*.xlsx". Using it lets you eliminate the nested loops and a lot of the file/path handling code. pd.read_excel() accepts a pathlib.Path for the first argument.
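As a quick illustration (assuming the same starting folder as the question):

import pathlib

start_path = pathlib.Path(r"C:\Users\Documents\python\Emp")

# rglob() walks every subfolder, however deeply nested, so no
# os.listdir() loops or manual '\\' joining are needed.
for file_path in start_path.rglob('*.xlsx'):
    print(file_path)  # a pathlib.Path, usable directly by pd.read_excel()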
The code below collects the count and path for each file, so the information can be sorted. This makes it easier to see which files have nulls, because they are grouped together.
Use f-strings instead of string concatenation.
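Note that the commas in the question's output = 'File name: ' + f1, ... actually build a tuple, so logging.info() records the parenthesized tuple repr rather than one string. An f-string produces the intended single string:

f1 = 'emp.xlsx'
filepath = r'C:\Users\Documents\python\Emp\emp.xlsx'

# Commas make this a tuple of strings, not one string:
output = 'File name: ' + f1, 'File path: ' + filepath

# f-string: a single readable string
output = f'File name: {f1} File path: {filepath}'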
I don't have Excel files on this machine, so this code isn't tested:
import pandas as pd
import pathlib

null_counts = []
start_path = pathlib.Path(r"C:\Users\Documents\python\Emp")

# Recursively find every .xlsx file under start_path.
for file_path in start_path.rglob('*.xlsx'):
    df = pd.read_excel(file_path, engine='openpyxl')
    count_nan = df.isnull().sum().sum()
    null_counts.append((count_nan, file_path))

# Largest null counts first.
null_counts.sort(reverse=True)

with open('nanTest.log', 'wt') as output:
    for count, filepath in null_counts:
        print(f"{count:5} {filepath}", file=output)
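Because the list holds (count, path) tuples, sort(reverse=True) puts the files with the most nulls at the top of the log; tuples compare element by element, and pathlib.Path supports ordering, so files with equal counts end up grouped by path.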