First program in Python for data analysis

Question 1

I'm just starting my adventure with python and I wanted to share my first project with you and ask for feedback and advice. This is a script for my friend to automate all calculations and database operations.

What my code does:

Merge together the csv files in the specified folder forming two sets.
Calculate the difference in days between the two date columns.
Filter the data over a given date range.
Checking that the id from the data frame L is not part of the M.
Creates output tables where the results will be stored.
Mainly calculation of the average of individual columns after preparation
Saving the results in a table

import re
from datetime import datetime
import pandas as pd
import numpy as np
from tkinter import filedialog
from tkinter import *
import os, sys
import glob
from matplotlib.backends.backend_pdf import PdfPages
import time
listOfFiles = []
tempDF = pd.DataFrame()
tempDF2 = pd.DataFrame()
def get_filenames():
 Tk().withdraw()
 print("Initializing Dialogue... \\nPlease select a file.")
 tk_filenames = filedialog.askdirectory()
 tempDir = tk_filenames
 return tempDir
choosen_dir = get_filenames()
os.chdir(choosen_dir)
for file in glob.glob("*.csv"):
 listOfFiles.append(file)
for file in listOfFiles:
 if file.startswith('L'):
 Table = pd.read_csv(file, sep=';')
 DF1 = pd.DataFrame(Table)
 tempDF = tempDF.append(DF1, sort=False)
 elif file.startswith('M'):
 Table2 = pd.read_csv(file, sep=';')
 DF2 = pd.DataFrame(Table2)
 tempDF2 = tempDF2.append(DF2, sort=False)
 else:
 print('404')
DFL = tempDF
DFM = tempDF2
class ToDate():
 def change_to_date(self, a, b):
 a[b] = pd.to_datetime(a[b])
dat = ToDate()
dat.change_to_date(DFL, 'LDUR')
dat.change_to_date(DFL, 'LDOP')
dat.change_to_date(DFM, 'LDOP')
dat.change_to_date(DFM, 'LDUR')
DFL['M_WAGUR'] = DFL['LDOP'] - DFL['LDUR']
DFM['M_WAGUR'] = DFM['LDUR'] - DFM['LDOP']
DFL['M_WAGUR'] = DFL['M_WAGUR'] / np.timedelta64(1, 'D')
DFM['M_WAGUR'] = DFM['M_WAGUR'] / np.timedelta64(1, 'D')
date1 = input('Date range from YYYY-MM-DD')
date2 = input('Date range to YYYY-MM-DD')
DFL = DFL[(DFL['LDOP'] >= date1) & (DFL['LDOP'] <= date2)]
DFM = DFM[(DFM['LDUR'] >= date1) & (DFM['LDUR'] <= date2)]
DFM_BL = DFM
ListL = DFL.ID1.tolist()
DFM_BL = DFM_BL[~DFM_BL.ID1.isin(ListL)]
def createDT(name):
 temp = pd.DataFrame(columns = ['Race','number of individuals','littering youngsters',
 'first litter.','more than first litter',
 'amount_21','lose_21','nip_count',
 'age','between_litter'],index =[1])
 temp['Race'] = name
 return temp
def sortedDT(DT, col, number):
 newDT = DT[DT[col] == number]
 return newDT
def months(DT, col, col2):
 newDT = DT[DT[col] != DT[col2]]
 return newDT
def birth(DFL, DFM):
 if len(DFL) > 0:
 pigs = DFL[['LIL11', 'LIL21']]
 pigs2 = DFM[['LIL11', 'LIL21']]
 pigs = pigs.append(pigs2)
 pigs['loses'] = pigs['LIL11'] - pigs['LIL21']
 pigs['ST21'] = (pigs['loses'] / pigs['LIL11']) * 100
 return pigs
 else:
 pigs = DFM[['LIL11', 'LIL21']]
 pigs['loses'] = pigs['LIL11'] - pigs['LIL21']
 pigs['ST21'] = (pigs['loses'] / pigs['LIL11']) * 100
 return pigs
def sum_digits(n):
 s = 0
 while n.all():
 s += n % 10
 n //= 10
 return s
def all_count(DFL, DFM, DFM_BL, n, new):
 if len(DFL) > 0:
 s = sum_digits(n)
 pigs = birth(DFL, DFM)
 DFM_BL = DFM_BL.drop_duplicates(subset='ID1', keep="first").copy()
 new['first litter'] = len(DFL)
 new['more than first litter'] = len(DFL) + len(DFM)
 new['number of individuals'] = len(DFL) + len(DFM_BL)
 new['littering youngsters'] = np.around(pigs['LIL11'].mean(), decimals=2)
 new['amount_21'] = np.around(pigs['LIL21'].mean(), decimals=2)
 new['lose_21'] = np.around(pigs['ST21'].mean(), decimals=2)
 new['age'] = np.around(DFL['M_WAGUR'].mean())
 new['between_litter'] = np.around(DFM['M_WAGUR'].mean())
 new['nip_count'] = np.around(s.mean(), decimals=2)
 return new
 else:
 pigs = birth(DFL, DFM)
 dfm = DFM.drop_duplicates(subset='LPROS1', keep="first").copy()
 new['first litter'] = "-"
 new['more than first litter'] = len(DFM)
 new['number of individuals'] = len(dfm)
 new['littering youngsters'] = np.around(pigs['LIL11'].mean(), decimals=2)
 new['amount_21'] = np.around(pigs['LIL21'].mean(), decimals=2)
 new['lose_21'] = np.around(pigs['ST21'].mean(), decimals=2)
 new['age'] = "-"
 new['between_litter'] = np.around(DFM['M_WAGUR'].mean())
 new['nip_count'] = "-"
 return new
W1R43 = createDT('W1R43')
L_W1R43 = sortedDT(DFL,'ID1',10).copy()
M_W1R43 = sortedDT(DFM,'ID1',10).copy()
BL_W1R43 = sortedDT(DFM_BL,'ID1',10).copy()
all_count(L_W1R43,MW1R43,BL_W1R43,W1R43['LSUT1'],W1R43)
last = pd.concat([W1R43, ... ])
last.to_csv('Results.txt', header=True, index=False, sep='\t', mode='w')
last.to_html('Results.html',index=False)

In the future, I would like to transfer this code to Django as one of the tools.

Please advise me what I should change, what I should avoid and if there is something correct in this code you can also let me know.

Question 2

I would begin to use MVP since you want to use Django. Separate the view from the logic. You have several functions spread in the code together with dataframes definitions etc. Put them into classes (one specifically for the view, Tkinter in this case). Another for callbacks (you use tkinter and CLI at the same time, use callbacks to return values to the user via Tkinter, CLI, Django etc.). Also use if __name__ == '__main__': and a main() so it's easier to know where we should begin to read your code. If you need help, just ping.

Question 3

@Raphael Please post answers in the answer boxes.

Question 4

@Peilonrayz This is far from be an answer in my opinion, just some tips.

Question 5

@Raphael, I agree with Peilonrayz. I think that you should take your comment and make it into a review, it sounds like you have some good information there, why not get some recognition for it?

Question 6

Here are some of the issues I've found with your code. But there are many more left and it is very hard to review your code without seeing some sample data files. Consider rewriting your code, taking these suggestions into account and then posting another request for code review.

Avoid mixing import styles

import re
from datetime import datetime
import pandas as pd
import numpy as np
from tkinter import filedialog
from tkinter import *
import os, sys
import glob
from matplotlib.backends.backend_pdf import PdfPages
import time

It is ugly and can be confusing to mix import styles. Choose the one you like, either from foo import bar or import foo [as bar], and stick with that as much as possible.

Wrap code in `main`

Writing code at the module level means that the global namespace is modified when the code runs. Suppose you have the following code:

y = ...
...
for i in range(x):
 ...
for i in range(y):
 ...

You decide to refactor the second for-loop into its own function, but you forget to pass y as a parameter:

def new_fun():
 for i in range(y):
 ...
y = ...
...
for i in range(x):
 ...
new_fun()

The code contains a bug but still runs. It can cause nasty and hard to debug bugs (has happened to me several times before I wised up). Therefore, it's good practice to always wrap the main code in a main, even for small scripts. Like this:

def main():
 ...
if __name__ == '__main__':
 main()

Follow PEP 8

Follow PEP 8 - Python's style guide. So use snake-case instead of camel-case for variable names. E.g list_of_files instead of listOfFiles.

Naming

Ensure that names are accurate, understandable and follows PEP 8. E.g

choosen_dir = get_filenames()
os.chdir(choosen_dir)

The function gets a directory not filenames, so it should be called get_dir.

The name file should be avoided because it clashes with a Python builtin. I'd replace it with fname.

Also names such as W1R43, L_W1R43 and ListL are very cryptic. I have looked at the code for a while but I still have no idea what they are about.

Intermediate lists

You have code like

listOfFiles = []
...
for file in glob.glob("*.csv"):
 listOfFiles.append(file)
for file in listOfFiles:
 ...

Instead you can just write

for file in glob.glob("*.csv")

without creating the listOfFiles list.

ToDate

I don't understand the purpose of this class. I think you can just write:

df_l['LDUR'] = pd.to_datetime(df_l['LDUR'])
df_l['LDOP'] = pd.to_datetime(df_l['LDOP'])
df_m['LDUR'] = pd.to_datetime(df_m['LDUR'])
df_m['LDOP'] = pd.to_datetime(df_m['LDOP'])

Question 7

The first thing I would do is change the

from tkinter import filedialog
from tkinter import *

to

import tkinter as tk

It is generally seen as bad form to 'pollute' the namespace by doing from tkinter import *, see the last paragraph in PEP 8's section on imports. By using import tkinter as tk and referencing everything as e.g. tk.filedialog it stays clear where the function came from.

Question 8

Welcome to Code Review! Would you please explain your reasoning a little bit, add some meat to your answer?

Gaslight Deceive Subvert 1,39210 silver badges18 bronze badges · Accepted Answer · 2020-01-08 07:45:05Z

Here are some of the issues I've found with your code. But there are many more left and it is very hard to review your code without seeing some sample data files. Consider rewriting your code, taking these suggestions into account and then posting another request for code review.

Avoid mixing import styles

import re
from datetime import datetime
import pandas as pd
import numpy as np
from tkinter import filedialog
from tkinter import *
import os, sys
import glob
from matplotlib.backends.backend_pdf import PdfPages
import time

It is ugly and can be confusing to mix import styles. Choose the one you like, either from foo import bar or import foo [as bar], and stick with that as much as possible.

Wrap code in `main`

Writing code at the module level means that the global namespace is modified when the code runs. Suppose you have the following code:

y = ...
...
for i in range(x):
 ...
for i in range(y):
 ...

You decide to refactor the second for-loop into its own function, but you forget to pass y as a parameter:

def new_fun():
 for i in range(y):
 ...
y = ...
...
for i in range(x):
 ...
new_fun()

The code contains a bug but still runs. It can cause nasty and hard to debug bugs (has happened to me several times before I wised up). Therefore, it's good practice to always wrap the main code in a main, even for small scripts. Like this:

def main():
 ...
if __name__ == '__main__':
 main()

Follow PEP 8

Follow PEP 8 - Python's style guide. So use snake-case instead of camel-case for variable names. E.g list_of_files instead of listOfFiles.

Naming

Ensure that names are accurate, understandable and follows PEP 8. E.g

choosen_dir = get_filenames()
os.chdir(choosen_dir)

The function gets a directory not filenames, so it should be called get_dir.

The name file should be avoided because it clashes with a Python builtin. I'd replace it with fname.

Also names such as W1R43, L_W1R43 and ListL are very cryptic. I have looked at the code for a while but I still have no idea what they are about.

Intermediate lists

You have code like

listOfFiles = []
...
for file in glob.glob("*.csv"):
 listOfFiles.append(file)
for file in listOfFiles:
 ...

Instead you can just write

for file in glob.glob("*.csv")

without creating the listOfFiles list.

ToDate

I don't understand the purpose of this class. I think you can just write:

df_l['LDUR'] = pd.to_datetime(df_l['LDUR'])
df_l['LDOP'] = pd.to_datetime(df_l['LDOP'])
df_m['LDUR'] = pd.to_datetime(df_m['LDUR'])
df_m['LDOP'] = pd.to_datetime(df_m['LDOP'])

Stack Exchange Network

First program in Python for data analysis

2 Answers 2

Avoid mixing import styles

Wrap code in `main`

Follow PEP 8

Naming

Intermediate lists

ToDate

You must log in to answer this question.

Hot Network Questions

First program in Python for data analysis

2 Answers 2

Avoid mixing import styles

Wrap code in main

Follow PEP 8

Naming

Intermediate lists

ToDate

You must log in to answer this question.

Related

Hot Network Questions

Wrap code in `main`