This is my first time coding in my life. I don't really know much about coding. I googled whatever I need and combined them all together. It works great, but want to know if there is any improvement needed. Is there a better language to write in?
import os, csv, re, time
from datetime import date, timedelta
SendIDs =[]
BouncedEAes=[]
SubKeys=[]
MerchantIDs=[]
EventDates=[]
BounceCategories=[]
BounceReasones=[]
#SendTimes=[] #To find the most recent date that a welcome email was sent.
BounceDate=0
f = open("SendJobs.csv") #To get Send IDs and dates for Welcome emails
for row in csv.reader(f):
if "Welcome" in row[7]:
SendIDs.append(row[1])
# SendTimes.append(
# time.strftime("%Y%m%d",
# time.strptime(row[5],"%m/%d/%Y %H:%M:%S %p")))
f.close()
# if not os.path.exists('SendIDs.csv'):
# open('SendIDs.csv', 'w').close()
# f = open("SendIDs.csv")
# for row in csv.reader(f):
# SendIDs.append(row[0])
# UniqSendIDs = {}.fromkeys(SendIDs).keys()
# f.close()
# f = open("SendIDs.csv","w")
# with f as output:
# writer = csv.writer(output, lineterminator='\n')
# for item in UniqSendIDs:
# writer.writerow([item])
# f.close()
f = open('Bounces.csv')
for row in csv.reader(f):
for item in SendIDs: #OR UniqSendIDs
if item == row[1]:
SubKeys.append(row[2])
BouncedEAes.append(row[3])
BounceCategories.append(row[8])
BounceReasones.append(row[10])
#EventDate: Only need bounce date, NO time required.
BounceDate = time.strptime(row[6],"%m/%d/%Y %H:%M:%S %p")
BounceDate = time.strftime("%m/%d/%Y", BounceDate)
EventDates.append(BounceDate)
f.close()
f = open('Attributes.csv')
for row in csv.reader(f):
for item in BouncedEAes:
if item == row[2]:
MerchantIDs.append(row[4])
f.close()
SubKeys.insert(0,"SubscriberKey")
BouncedEAes.insert(0,"EmailAddress")
MerchantIDs.insert(0,"Merchant_Number")
EventDates.insert(0,"EventDate")
BounceCategories.insert(0,"BounceCategory")
BounceReasones.insert(0,"BounceReason")
new_data=[[SubKeys[i],BouncedEAes[i], MerchantIDs[i], EventDates[i], BounceCategories[i], BounceReasones[i]] for i in range(len(BouncedEAes))]
# if len(SendTimes)==0:
# yesterday = date.today() - timedelta(1)
# SendTimes.append("NoWelcomeEmailsSent_"+yesterday.strftime("%Y%m%d"))
# if len(BouncedEAes)==1:
# SendTimes.append("NoBouncedEmails_"+SendTimes[-1])
# FILENAME FORMAT: Welcome-Cadence-#2-RESULT-YYYYMMDD
# YYYYMMDD: One day before the day the file was received.
yesterday = date.today() - timedelta(1)
f=open("Welcome-"+yesterday.strftime('%Y%m%d')+".csv","wb")
output=csv.writer(f)
for row in new_data:
output.writerow(row)
f.close()
"""
ERROR MESSAGES
"""
#new_data=[[SubKeys[i],BouncedEAes[i], MerchantIDs[i], EventDates[i], BounceCategories[i], BounceReasones[i]] for i in range(len(BouncedEAes))]
#IndexError: list index out of range
"""
Check Attributes.csv
"""
3 Answers 3
If this is your first program ever, then I am impressed. The code is easy to understand, and aside from the commented-out code that you left there, quite clean.
What you could improve
Better use of data structures, part 1:
When processing
Bounces.csv
, you matchSendIDs
with column 1 using...for row in csv.reader(f): for item in SendIDs: if item == row[1]: ...
In Python, a simpler way to see if an item exists in an array is using the
in
operator:for row in csv.reader(f): if row[1] in SendIDs: ...
However, if
SendIDs
is a list, then checking for membership could require looking at every element of the list. A more efficient data structure for such lookups is aset
. Aset
supports constant-time lookups, no matter how manySendIDs
there are.Building
SendIDs
:Your first loop would be better written using a list comprehension. Files should be opened using a
with
block, as @Morwenn mentioned.with open('SendJobs.csv') as f: SendIDs = set(row[1] for row in csv.reader(f) if "Welcome" in row[7])
Better use of data structures, part 2:
The same inefficiency occurs in this loop, where you have to scan the whole list of
BouncedEAs
to match the merchant number:f = open('Attributes.csv') for row in csv.reader(f): for item in BouncedEAes: if item == row[2]: MerchantIDs.append(row[4]) f.close()
Here, the kind of lookup you want is not just to find out whether an entry exists, but to do a mapping. For that, you need a dictionary. Here's how you can build one:
# Lookup table from BouncedEAs to MerchantIDs with open('Attributes.csv') as f: MerchantIDForBouncedEA = {row[2]: row[4] for row in csv.reader(f)}
Avoid using a list for each column.
You have six lists, each representing a column. Each bounce record is split across those list, correlated by the position within the list. The code will be much cleaner if you store all of the data in one list, where each element represents an entire row in the
Bounces.csv
file.With that dictionary, you can easily look up the merchant number for any e-mail address. The preparation work above lets us build the desired data structure by processing
Bounces.csv
after the other two files.with open('Bounces.csv') as f: Bounces = [{ 'SubscriberKey': row[2], 'EmailAddress': row[3], 'Merchant_Number': MerchantIDForBouncedEA[row[3]], 'BounceCategory': row[8], 'BounceReason': row[10], #EventDate: Only need bounce date, NO time required. 'EventDate': time.strftime('%m/%d/%Y', time.strptime(row[6], "%m/%d/%Y %H:%M:%S %p")), } for row in csv.reader(f) if row[1] in SendIDs]
Suggested solution
In addition to the main concerns above, I've tweaked the output routine a bit for maintainability.
Note that strftime()
also doubles as a string formatter.
import csv, time
from datetime import date, timedelta
with open('SendJobs.csv') as f:
SendIDs = set(row[1] for row in csv.reader(f) if "Welcome" in row[7])
# Lookup table from BouncedEAs to MerchantIDs
with open('Attributes.csv') as f:
MerchantIDForBouncedEA = {row[2]: row[4] for row in csv.reader(f)}
with open('Bounces.csv') as f:
Bounces = [{
'SubscriberKey': row[2],
'EmailAddress': row[3],
'Merchant_Number': MerchantIDForBouncedEA[row[3]],
'BounceCategory': row[8],
'BounceReason': row[10],
#EventDate: Only need bounce date, NO time required.
'EventDate': time.strftime('%m/%d/%Y', time.strptime(row[6], "%m/%d/%Y %H:%M:%S %p")),
} for row in csv.reader(f) if row[1] in SendIDs]
OUTPUT_FIELDS = [
'SubscriberKey',
'EmailAddress',
'Merchant_Number',
'EventDate',
'BounceCategory',
'BounceReason',
]
yesterday = date.today() - timedelta(1)
with open(yesterday.strftime('Welcome-%Y%m%d.csv'), "wb") as f:
output = csv.writer(f)
output.writerow(OUTPUT_FIELDS)
for bounce in Bounces:
output.writerow([bounce[field] for field in OUTPUT_FIELDS])
if it's really your first time coding, then that's great: you actually managed to use some parts of the Python standard library that people being exposed for the first time to the language tend to needlessly reinvent, such as the csv
module :)
Python version
The first big question: you used Python, but why did you use the version 2.7? This version (minus the bugfixes) is already 5 years old and its maintenance should have been dropped this year (it has been extended to 2020 since it's still widely used). But if you start from scratch, you might as well start with a fresh new version, such as Python 3.4. Generally speaking, don't use Python 2.7 unless you're forced to (there are still reasons); try to adapt to the most recent version instead.
Commented out code
Generally speaking, you should never leave commented-out pieces of code in your code. They will only hinder readability. I must admit that you probably don't use source control source control software if it's your first time programming, but you might want to look at software like Git in the future so that every version of your code is saved somewhere and you don't have to keep outdated remnants of code in your source.
Useless stuff in general
More generally, remove from your code what is unneeded, such as unused variables or unused includes. In your case, you could get rid of the re
, os
and time
imports: you don't use re
, you only use os
in commented-out code that shouldn't belong, and what you use from time
is actually also available in datetime
.
The with
statement
One thing that is easy to forget is closing files. You didn't forget to close them, which is great, but Python provides a superior alternative to closing files in the form of the with
statement. Here are your first lines of code rewritten with it:
with open("Welcome-"+yesterday.strftime('%Y%m%d')+".csv","wb") as f:
output=csv.writer(f)
for row in new_data:
output.writerow(row)
As you can see, I didn't manually call f.close()
, this is automatically done when we leave the with
block. Actually, with
isn't limited to closing files, it can do many other things, but it depends on the type you use it with.
Use the PEP 8
Python has a style guide which is rather good and people tend to follow it. Therefore, you could try to rewrite your code following that style guide so that it looks more like what people want to read when they read Python code. For example, I will rewrite once again the previous piece of code, following the PEP8:
with open("Welcome-" + yesterday.strftime('%Y%m%d') + ".csv", "wb") as f:
output = csv.writer(f)
for row in new_data:
output.writerow(row)
Basically, it's still the same. I only added spaces. But even those can matter when you read code :)
-
\$\begingroup\$ Thank you!!! I used Python 2.7 because it came with Mac. I didn't even know Python 2.7 was that old. I will rewrite my code using Python 3.4. Git seems really useful too. I named my files *_v1, *_v2 like that. Thank you again. \$\endgroup\$gamekaze– gamekaze2015年04月30日 01:42:48 +00:00Commented Apr 30, 2015 at 1:42
-
\$\begingroup\$ I don't see any reason why the code couldn't just run on Python 3 as is. \$\endgroup\$200_success– 200_success2015年04月30日 21:23:10 +00:00Commented Apr 30, 2015 at 21:23
-
\$\begingroup\$ @200_success True, it can. But the question was tagged Python 2.7, so I found it valuable to mention it. \$\endgroup\$Morwenn– Morwenn2015年05月04日 07:24:52 +00:00Commented May 4, 2015 at 7:24
Consider using a database
The best kind of code is code that you don't have to write.
A completely different approach to this problem is not to solve it in Python, but to take advantage of SQL. Your three CSV files are basically tables in a relational database. Joining tables is a very routine thing in SQL.
If you have no database at all, the quickest way to get started would be with SQLite. A setup to solve your problem would look something like this (I've used placeholders for irrelevant column names):
CREATE TABLE SendJobs (Zero, ID, Two, Three, Four, Five, Six, Subject);
CREATE TABLE Attributes (Zero, One, EmailAddress, Three, Merchant_Number);
CREATE TABLE Bounces
( Zero, SendID, SubscriberKey, EmailAddress, Four, Five, Timestamp, Seven, Category, Nine, Reason
);
CREATE VIEW WelcomeBounces AS
SELECT Bounces.SubscriberKey
, Bounces.EmailAddress
, Attributes.Merchant_Number
, date(Bounces.Timestamp) AS EventDate
, Bounces.Category AS BounceCategory
, Bounces.Reason AS BounceReason
FROM
Bounces
INNER JOIN SendJobs
ON SendJobs.ID = Bounces.SendID
INNER JOIN Attributes
ON Attributes.EmailAddress = Bounces.EmailAddress
WHERE
SendJobs.Subject LIKE '%Welcome%';
Then, after importing the three CSV files, it's just a matter of executing
sqlite> .mode csv
sqlite> .header on
sqlite> .output Welcome.csv
sqlite> SELECT * FROM WelcomeBounces;
in the SQLite shell. Your problem is such a good match for SQL that it would be worthwhile to create and discard a temporary SQLite database just to solve it.
-
\$\begingroup\$ Hello, thank you for mentioning SQL. I did write a sql query for that .The problem is that I have to automate the process. Also, the original file is a .zip file with three csv files which is exported to my FTP from a company's CRM everyday. SELECT a.SubscriberKey, a.EmailAddress, a.Merchant_Number, CONVERT(DATE, b.EventDate) AS EventDate, b.BounceCategory, b.BounceReason FROM Attributes a INNER JOIN Bounces b ON a.SubscriberKey = b.SubscriberKey INNER JOIN SendJobs c ON b.SendID = c.SendID WHERE c.EmailName LIKE 'Welcome%' \$\endgroup\$gamekaze– gamekaze2015年04月30日 21:18:50 +00:00Commented Apr 30, 2015 at 21:18