Create a summary report by summing the amounts of accounts with the same name within a specified date range

Question 1

A friend told me that looping through the whole database to find the rows that meet certain criteria is usually bad practice. He mentioned that the proper way is to index the objects of interest.

I want to produce a report for our company. To do so, I sum the amounts of all entries that share the same account and whose date falls within the range set by start_date and end_date.

Some variable definitions:

Account name: data_entries.iloc[j, 4]

Account type: data_listofaccounts.iloc[i, 1]

Account amount: data_entries.iloc[j, 5]

Is there a more efficient way to write this code, one that minimizes the computational load on the machine?

import pandas as pd
import datetime
entries_csv = "C:\\Users\\Pops\\Desktop\\Entries.csv"
listofaccounts_csv = "C:\\Users\\Pops\\Desktop\\List of Accounts.csv"
data_entries = pd.read_csv(entries_csv)
data_listofaccounts = pd.read_csv(listofaccounts_csv)
data_entries['VOUCHER DATE'] = pd.to_datetime(data_entries['VOUCHER DATE'], format="%m/%d/%Y").dt.date
summary_amount = [0]*(len(data_listofaccounts) + 1)
summary = (('DEBIT ACCOUNT', 'DEBIT AMOUNT'),)
start_date = datetime.date(2018, 4, 1)
end_date = datetime.date(2018, 10, 30)
for i in range(0, len(data_listofaccounts)):
    for j in range(0, len(data_entries)):
        if start_date <= data_entries.iloc[j, 1] <= end_date:
            if data_listofaccounts.iloc[i, 0] == data_entries.iloc[j, 4]\
                    and (data_listofaccounts.iloc[i, 1] == "CURRENT ASSET" or
                         data_listofaccounts.iloc[i, 1] == "FIXED ASSET" or
                         data_listofaccounts.iloc[i, 1] == "EXPENSE"):
                summary_amount[i] += data_entries.iloc[j, 5]
            elif data_listofaccounts.iloc[i, 0] == data_entries.iloc[j, 4]\
                    and (data_listofaccounts.iloc[i, 1] == "CURRENT LIABILITY" or
                         data_listofaccounts.iloc[i, 1] == "LONG TERM LIABILITY" or
                         data_listofaccounts.iloc[i, 1] == "EQUITY"):
                summary_amount[i] -= data_entries.iloc[j, 5]
    summary += ((data_listofaccounts.iloc[i, 0], "{:,}".format(round(summary_amount[i], 2))),)

Entries sample data: (image not included)

List of Accounts sample data: (image not included)

List of Accounts contains unique account names, whereas in the Entries worksheet an account name can be repeated.

How does this work: data_listofaccounts.iloc[i, 1] == "CURRENT ASSET" or "FIXED ASSET" or "EXPENSE"?
@hjpotter92 If the account type of row i, which is located in the 2nd column, is equal to any of those 3 strings, the condition is satisfied.
@MarcSantos I'm not familiar with pandas, but in plain Python it would be evaluated as <condition> or True or True, which is always True irrespective of the condition (which is <value> == "CURRENT ASSET"). Can you provide a link to the docs where pandas mentions this comparison behaviour?
@TobySpeight Is that revision okay? I am not quite sure how to phrase my concern by stating what the code does. Or should I just write "create a summary report with criteria" or something similar?
@hjpotter92 I believe I already fixed it with the new edit.
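
To illustrate the point raised in these comments, here is a minimal sketch in plain Python (no pandas involved; account_type is a made-up stand-in for data_listofaccounts.iloc[i, 1]). It suggests why chaining string literals with `or` does not test membership, and shows the `in` form that does.

```python
# Hypothetical value standing in for data_listofaccounts.iloc[i, 1]
account_type = "EQUITY"

# Parsed as (account_type == "CURRENT ASSET") or "FIXED ASSET" or "EXPENSE".
# The comparison is False, so `or` falls through to the first truthy operand,
# which is the non-empty string "FIXED ASSET".
chained = account_type == "CURRENT ASSET" or "FIXED ASSET" or "EXPENSE"

# A membership test compares the value against every candidate.
member = account_type in ("CURRENT ASSET", "FIXED ASSET", "EXPENSE")

print(repr(chained))  # 'FIXED ASSET' (truthy, so an `if` would always pass)
print(member)         # False
```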


score 2 · Accepted Answer · 2018-07-23 11:26:04Z

The kind of operation you’re doing is called a join: you want to associate data from one DataFrame with data from another one based on shared information in a given column.

To join a DataFrame to another one or to a Series, you need to respect a simple rule: either you join index on index, or a column of one is joined to the index of the other; and the joined values must be of the same nature. In your case, since you need to join on the name of the account, one of your DataFrames must be indexed by this name. Since holding the account names is its purpose, you should set the index of data_listofaccounts to its 'Account Name' column:

data_listofaccounts = pd.read_csv(listofaccounts_csv)
data_listofaccounts = data_listofaccounts.set_index(['Account Name'])

Then, before joining, you can filter out the entries that fall outside your date range so that the join is performed on less data:

filtered = data_entries[(start_date <= data_entries['VOUCHER DATE']) & (data_entries['VOUCHER DATE'] <= end_date)]
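
As a rough illustration of this filtering step, the sketch below builds a tiny stand-in for Entries.csv (the rows are invented; only the column names come from the post) and applies the same kind of boolean mask. It assumes the bounds are pd.Timestamp objects, and it also shows Series.between, which is inclusive on both ends by default, as an equivalent spelling.

```python
import pandas as pd

# Invented sample rows; column names follow the ones used in the post.
entries = pd.DataFrame({
    "VOUCHER DATE": pd.to_datetime(
        ["03/15/2018", "04/01/2018", "07/04/2018", "11/02/2018"],
        format="%m/%d/%Y",
    ),
    "DEBIT ACCOUNT": ["Cash", "Cash", "Rent", "Rent"],
    "DEBIT AMOUNT": [100.0, 250.0, 75.5, 40.0],
})

# Assumption: bounds are Timestamps so they compare with the datetime64 column.
start_date = pd.Timestamp(2018, 4, 1)
end_date = pd.Timestamp(2018, 10, 30)

# Element-wise comparisons produce boolean Series; & combines them row by row.
date_mask = (start_date <= entries["VOUCHER DATE"]) & (entries["VOUCHER DATE"] <= end_date)
filtered = entries[date_mask]

# Equivalent filter using Series.between (both bounds included by default).
filtered_between = entries[entries["VOUCHER DATE"].between(start_date, end_date)]

print(filtered)
```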

And thus the data you’re interested in is accessed using:

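import pandas as pd

# entries_csv and listofaccounts_csv are the file paths defined in the question
data_entries = pd.read_csv(entries_csv)
data_entries['VOUCHER DATE'] = pd.to_datetime(data_entries['VOUCHER DATE'], format="%m/%d/%Y")
data_listofaccounts = pd.read_csv(listofaccounts_csv)
data_listofaccounts = data_listofaccounts.set_index(['Account Name'])
# pd.Timestamp bounds match the datetime64 column; newer pandas versions may
# reject comparisons between such a column and a plain datetime.date
start_date = pd.Timestamp(2018, 4, 1)
end_date = pd.Timestamp(2018, 10, 30)
date_mask = (start_date <= data_entries['VOUCHER DATE']) & (data_entries['VOUCHER DATE'] <= end_date)
# Look up each entry's 'DEBIT ACCOUNT' in the index of data_listofaccounts
interesting = data_entries[date_mask].join(data_listofaccounts, on='DEBIT ACCOUNT')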

And then each row of interesting will have all the information needed: the transaction date, the name of the account, its type and the amount spent.
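
To make the shape of the joined result more concrete, here is a small sketch on invented data. The 'Type' column name is taken from the mask used later in this answer; the account names and amounts are made up for illustration.

```python
import pandas as pd

# Invented stand-in for "List of Accounts", indexed by account name.
accounts = pd.DataFrame({
    "Account Name": ["Cash", "Rent", "Loan"],
    "Type": ["CURRENT ASSET", "EXPENSE", "LONG TERM LIABILITY"],
}).set_index("Account Name")

# Invented stand-in for the already date-filtered entries.
entries = pd.DataFrame({
    "VOUCHER DATE": pd.to_datetime(["04/01/2018", "07/04/2018"], format="%m/%d/%Y"),
    "DEBIT ACCOUNT": ["Cash", "Rent"],
    "DEBIT AMOUNT": [250.0, 75.5],
})

# The 'DEBIT ACCOUNT' column of entries is matched against the index of accounts.
joined = entries.join(accounts, on="DEBIT ACCOUNT")

print(joined)
```

Each row keeps its original columns and gains the columns of the matching account, here only 'Type'.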

But this is all without taking into account the kind of operations you want to perform afterwards: grouping by name and summing the amounts. You can perform this operation directly before joining and it will simplify the process altogether:

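data_entries = pd.read_csv(entries_csv)
data_entries['VOUCHER DATE'] = pd.to_datetime(data_entries['VOUCHER DATE'], format="%m/%d/%Y")
# pd.Timestamp bounds match the datetime64 column; newer pandas versions may
# reject comparisons between such a column and a plain datetime.date
start_date = pd.Timestamp(2018, 4, 1)
end_date = pd.Timestamp(2018, 10, 30)
date_mask = (start_date <= data_entries['VOUCHER DATE']) & (data_entries['VOUCHER DATE'] <= end_date)
# Sum only the amount column so that dates and text columns stay out of the aggregation
amount_per_account = data_entries[date_mask].groupby(['DEBIT ACCOUNT'])[['DEBIT AMOUNT']].sum()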

This will return a DataFrame indexed by the account names whose 'DEBIT AMOUNT' column is the sum of the rows pertaining to each account. You then just need to join it with data_listofaccounts to know whether this sum should be positive or negative, based on the account type held in the 'Type' column.

summary = data_listofaccounts.join(amount_per_account, on='Account Name', how='outer').fillna(0)
debit_mask = (summary.Type == 'CURRENT LIABILITY') | (summary.Type == 'LONG TERM LIABILITY') | (summary.Type == 'EQUITY')
# Assign through .loc: summary[debit_mask]['DEBIT AMOUNT'] = ... would only modify a temporary copy
summary.loc[debit_mask, 'DEBIT AMOUNT'] = -summary.loc[debit_mask, 'DEBIT AMOUNT']
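
The group, join and sign-flip steps can be sketched end to end on invented data as follows. This is only an illustration under assumptions: the rows are made up, the join is written as a plain index-on-index left join, and isin is used as a shorter spelling of the three comparisons.

```python
import pandas as pd

# Invented stand-in for "List of Accounts", indexed by account name.
accounts = pd.DataFrame({
    "Account Name": ["Cash", "Rent", "Loan"],
    "Type": ["CURRENT ASSET", "EXPENSE", "LONG TERM LIABILITY"],
}).set_index("Account Name")

# Invented stand-in for the date-filtered entries.
entries = pd.DataFrame({
    "DEBIT ACCOUNT": ["Cash", "Cash", "Loan"],
    "DEBIT AMOUNT": [100.0, 250.0, 40.0],
})

# One total per account name; the result is a Series named 'DEBIT AMOUNT'.
amount_per_account = entries.groupby("DEBIT ACCOUNT")["DEBIT AMOUNT"].sum()

# Left join on the index keeps every account, even those without entries.
summary = accounts.join(amount_per_account, how="left").fillna({"DEBIT AMOUNT": 0})

# Negate the totals of liability and equity accounts, assigning through .loc
# so that the change lands in summary itself rather than in a copy.
negative_types = ["CURRENT LIABILITY", "LONG TERM LIABILITY", "EQUITY"]
flip = summary["Type"].isin(negative_types)
summary.loc[flip, "DEBIT AMOUNT"] *= -1

print(summary)
```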

Other improvements pertaining to coding style:

you should define functions to organize your code;
you should guard your top-level code using if __name__ == '__main__';
you don't need to say that a variable contains some data_; the same goes for naming a collection: you don't need to say what kind of collection holds the data (besides, in your case it is misleading, as your listofaccounts is in fact a DataFrame); so data_listofaccounts => accounts;
you should follow PEP8 naming conventions.
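
One possible way to apply these style points is sketched below. The function names (load_entries, amounts_in_range, main) are hypothetical, and an in-memory CSV with invented rows replaces the real file so the example runs on its own.

```python
import io

import pandas as pd


def load_entries(source):
    """Read the entries CSV and parse its date column (hypothetical helper)."""
    entries = pd.read_csv(source)
    entries["VOUCHER DATE"] = pd.to_datetime(entries["VOUCHER DATE"], format="%m/%d/%Y")
    return entries


def amounts_in_range(entries, start_date, end_date):
    """Sum 'DEBIT AMOUNT' per account for entries dated within the range."""
    in_range = entries["VOUCHER DATE"].between(start_date, end_date)
    return entries[in_range].groupby("DEBIT ACCOUNT")["DEBIT AMOUNT"].sum()


def main():
    # Invented sample standing in for Entries.csv.
    sample_csv = io.StringIO(
        "VOUCHER DATE,DEBIT ACCOUNT,DEBIT AMOUNT\n"
        "04/01/2018,Cash,250\n"
        "07/04/2018,Cash,100\n"
        "11/02/2018,Rent,40\n"
    )
    entries = load_entries(sample_csv)
    totals = amounts_in_range(entries, pd.Timestamp(2018, 4, 1), pd.Timestamp(2018, 10, 30))
    print(totals)


if __name__ == "__main__":
    main()
```

With the guard in place, importing the module (for instance from a test) does not trigger the report.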

And pertaining to pandas:

you can limit the amount of data retrieved from your CSVs by using the usecols argument of pd.read_csv; this leads to less data manipulation afterwards and thus more speed.
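
A minimal sketch of usecols, assuming a CSV with a few extra columns. 'VOUCHER NO' and 'PAYEE' are invented here, since the real layout of Entries.csv is not shown in the post.

```python
import io

import pandas as pd

# Invented sample: two of the five columns are not needed for the report.
sample_csv = io.StringIO(
    "VOUCHER NO,VOUCHER DATE,PAYEE,DEBIT ACCOUNT,DEBIT AMOUNT\n"
    "1,04/01/2018,ACME,Cash,250\n"
    "2,07/04/2018,Landlord,Rent,75.5\n"
)

# Only the listed columns are parsed and kept in memory.
entries = pd.read_csv(
    sample_csv,
    usecols=["VOUCHER DATE", "DEBIT ACCOUNT", "DEBIT AMOUNT"],
)

print(entries.columns.tolist())
```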

Thanks for the very detailed answer. What did you mean by "you don't need to say that a variable contains some data_"?
By the way, I get an error on this line: debit_mask = (summary.Type == 'CURRENT LIABILITY') | (summary.Type == 'LONG TERM LIABILITY') | (summary.type == 'EQUITY'). The error message is AttributeError: 'DataFrame' object has no attribute 'type'. Did you mean debit_mask = (summary['Type'] == 'CURRENT LIABILITY') | (summary['Type'] == 'LONG TERM LIABILITY') | (summary['Type'] == 'EQUITY')?
@MarcSantos There was a typo: I wrote summary.type instead of summary.Type. Your version should work too.
Thanks a lot. Are my version and your version exactly the same? I never knew you could reference a column the way you did.
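
On the last question, a small sketch on invented data suggests that attribute access and bracket access select the same column when the column name is a valid Python identifier. Bracket access is the more general form, since it also works for names containing spaces.

```python
import pandas as pd

# Invented summary table with the two column names used in the answer.
summary = pd.DataFrame({
    "Type": ["EQUITY", "EXPENSE"],
    "DEBIT AMOUNT": [10.0, 20.0],
})

# Both spellings return the same column, so the masks are identical.
by_attribute = summary.Type == "EQUITY"
by_bracket = summary["Type"] == "EQUITY"
print(by_attribute.equals(by_bracket))

# Column names are case sensitive: there is no 'type' column or attribute.
print(hasattr(summary, "type"))

# 'DEBIT AMOUNT' contains a space, so only bracket access can reach it.
amounts = summary["DEBIT AMOUNT"]
```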
