I have heard from my friend that it is bad practice to normally loop though the whole database to meet certain criteria. He mentioned something about the proper way being that you index the objects of interest.
What I want to achieve here is make a report for our company. So I do this by summing up all the items that have the same account that is within the filter date range set by start_date
and end_date
to start.
Some variable definitions:
Account name: data_entries.iloc[j, 4]
Account type: data_listofaccounts.iloc[i, 1]
Account amount: data_entries.iloc[j, 5]
So is there a more efficient way to write this code that will be less computationally taxing on the computer specifically. (minimize computational requirement)
import pandas as pd
import datetime
entries_csv = "C:\\Users\\Pops\\Desktop\\Entries.csv"
listofaccounts_csv = "C:\\Users\\Pops\\Desktop\\List of Accounts.csv"
data_entries = pd.read_csv(entries_csv)
data_listofaccounts = pd.read_csv(listofaccounts_csv)
data_entries['VOUCHER DATE'] = pd.to_datetime(data_entries['VOUCHER DATE'], format="%m/%d/%Y").dt.date
summary_amount = [0]*(len(data_listofaccounts) + 1)
summary = (('DEBIT ACCOUNT', 'DEBIT AMOUNT'),)
start_date = datetime.date(2018, 4, 1)
end_date = datetime.date(2018, 10, 30)
for i in range(0, len(data_listofaccounts)):
for j in range(0, len(data_entries)):
if start_date <= data_entries.iloc[j, 1] <= end_date:
if data_listofaccounts.iloc[i, 0] == data_entries.iloc[j, 4]\
and (data_listofaccounts.iloc[i, 1] == "CURRENT ASSET" or
data_listofaccounts.iloc[i, 1] == "FIXED ASSET" or
data_listofaccounts.iloc[i, 1] == "EXPENSE"):
summary_amount[i] += data_entries.iloc[j, 5]
elif data_listofaccounts.iloc[i, 0] == data_entries.iloc[j, 4]\
and (data_listofaccounts.iloc[i, 1] == "CURRENT LIABILITY" or
data_listofaccounts.iloc[i, 1] == "LONG TERM LIABILITY" or
data_listofaccounts.iloc[i, 1] == "EQUITY"):
summary_amount[i] -= data_entries.iloc[j, 5]
summary += ((data_listofaccounts.iloc[i, 0], "{:,}".format(round(summary_amount[i], 2))),)
Entries sample data: enter image description here
List of Accounts sample data: enter image description here
List of Accounts
contains unique account names while in the Entries
worksheet, it can be repeated.
1 Answer 1
The kind of operation you’re doing is called a join: you want to associate data from a DataFrame
to data from another one based on a shared information on a given column.
To join a DataFrame
to another one or to a Series
, you need to respect a simple rule: either you join on index or a column is joined to an index; and they must be of similar nature. So in your case, since you need to join on the name of the account, one of your DataFrame
must be indexed by this name. Since it is its purpose, you need to reindex data_listofaccounts
by its 'Account Name'
column:
data_listofaccounts = pd.read_csv(listofaccounts_csv)
data_listofaccounts = data_listofaccounts.set_index(['Account Name'])
Then, before joining, you can filter out data that is out of your study range so the join is performed on less data:
filtered = data_entries[(start_date <= data_entries['VOUCHER DATE']) & (data_entries['VOUCHER DATE'] <= end_date)]
And thus the data you’re interested in is accessed using:
data_entries = pd.read_csv(entries_csv)
data_entries['VOUCHER DATE'] = pd.to_datetime(data_entries['VOUCHER DATE'], format="%m/%d/%Y")
data_listofaccounts = pd.read_csv(listofaccounts_csv)
data_listofaccounts = data_listofaccounts.set_index(['Account Name'])
start_date = datetime.date(2018, 4, 1)
end_date = datetime.date(2018, 10, 30)
date_mask = (start_date <= data_entries['VOUCHER DATE']) & (data_entries['VOUCHER DATE'] <= end_date)
interesting = data_entries[date_mask].join(data_listofaccounts, on='DEBIT ACCOUNT')
And then each row of interesting
will have all the information needed: the transaction date, the name of the account, its type and the amount spent.
But this is all without taking into account the kind of operations you want to perform afterwards: grouping by name and summing the amounts. You can perform this operation directly before joining and it will simplify the process altogether:
data_entries = pd.read_csv(entries_csv)
data_entries['VOUCHER DATE'] = pd.to_datetime(data_entries['VOUCHER DATE'], format="%m/%d/%Y")
start_date = datetime.date(2018, 4, 1)
end_date = datetime.date(2018, 10, 30)
date_mask = (start_date <= data_entries['VOUCHER DATE']) & (data_entries['VOUCHER DATE'] <= end_date)
amount_per_account = data_entries[date_mask].groupby(['DEBIT ACCOUNT']).sum()
This will return a DataFrame
indexed by the accounts names whose 'DEBIT AMOUNT'
column is the sum of each row pertaining to this account. You then just need to join with data_listofaccounts
to know if this sum should be positive or negative based on the 'PARENT NODE'
column.
summary = data_listofaccounts.join(amount_per_account, on='Account Name', how='outer').fillna(0)
debit_mask = (summary.Type == 'CURRENT LIABILITY') | (summary.Type == 'LONG TERM LIABILITY') | (summary.Type == 'EQUITY')
summary[debit_mask]['DEBIT AMOUNT'] = -summary[debit_mask]['DEBIT AMOUNT']
Other improvements pertaining to coding style:
- you should define functions to organize your code
- you should guard your code using
if __name__ == '__main__'
- you don't need to say that a variable contain some
data_
; same for namming a collection, you don't need to say what kind of collection hold the data (besides, in your case it is misleading as yourlistofaccounts
is in fact aDataFrame
); sodata_listofaccounts
=>accounts
- you should follow PEP8 namming conventions
And to pandas
:
- you can limit the amount of data retrieved from your CSVs by using the
usecols
argument; this will lead to less data manipulation afterwards and thus more speed.
-
\$\begingroup\$ Thanks for the very detailed answer. What did you mean by
you don't need to say that a variable contain some data_
\$\endgroup\$Pherdindy– Pherdindy2018年07月23日 21:16:45 +00:00Commented Jul 23, 2018 at 21:16 -
\$\begingroup\$ @MarcSantos
data_something
=>something
\$\endgroup\$301_Moved_Permanently– 301_Moved_Permanently2018年07月23日 21:18:49 +00:00Commented Jul 23, 2018 at 21:18 -
\$\begingroup\$ Btw, I get an error on this line
debit_mask = (summary.Type == 'CURRENT LIABILITY') | (summary.Type == 'LONG TERM LIABILITY') | (summary.type == 'EQUITY')
. The error message isAttributeError: 'DataFrame' object has no attribute 'type'
. Did you meandebit_mask = (summary['Type'] == 'CURRENT LIABILITY') | (summary['Type'] == 'LONG TERM LIABILITY') | (summary['Type'] == 'EQUITY')
. \$\endgroup\$Pherdindy– Pherdindy2018年07月26日 10:48:53 +00:00Commented Jul 26, 2018 at 10:48 -
\$\begingroup\$ @MarcSantos there is a typo, I wrote
summary.type
instead ofsummary.Type
. Your version should work too. \$\endgroup\$301_Moved_Permanently– 301_Moved_Permanently2018年07月26日 11:03:58 +00:00Commented Jul 26, 2018 at 11:03 -
\$\begingroup\$ Thanks a lot. Is my version and your version exactly the same? I never knew you could reference it they way you did \$\endgroup\$Pherdindy– Pherdindy2018年07月26日 15:35:29 +00:00Commented Jul 26, 2018 at 15:35
data_listofaccounts.iloc[i, 1] == "CURRENT ASSET" or "FIXED ASSET" or "EXPENSE"
? \$\endgroup\$i
which is located in the 2nd column equal to any of those 3 strings, it will be satisfied \$\endgroup\$<condition> or True or True
, which would always beTrue
irrespective of thecondition
(which is<value> == "CURRENT ASSET"
. Can you provide a link to docs where pandas mentions this behaviour of comparison? \$\endgroup\$create a summary report with criteria
or something \$\endgroup\$