Merging three CSV files with a primary key

Question 1

This is my first time coding in my life. I don't really know much about coding. I googled whatever I need and combined them all together. It works great, but want to know if there is any improvement needed. Is there a better language to write in?

import os, csv, re, time
from datetime import date, timedelta
SendIDs =[]
BouncedEAes=[]
SubKeys=[]
MerchantIDs=[]
EventDates=[]
BounceCategories=[]
BounceReasones=[]
#SendTimes=[] #To find the most recent date that a welcome email was sent.
BounceDate=0
f = open("SendJobs.csv") #To get Send IDs and dates for Welcome emails
for row in csv.reader(f):
 if "Welcome" in row[7]:
 SendIDs.append(row[1])
 # SendTimes.append(
 # time.strftime("%Y%m%d",
 # time.strptime(row[5],"%m/%d/%Y %H:%M:%S %p")))
f.close()
# if not os.path.exists('SendIDs.csv'):
# open('SendIDs.csv', 'w').close()
# f = open("SendIDs.csv")
# for row in csv.reader(f):
# SendIDs.append(row[0])
# UniqSendIDs = {}.fromkeys(SendIDs).keys()
# f.close()
# f = open("SendIDs.csv","w")
# with f as output:
# writer = csv.writer(output, lineterminator='\n')
# for item in UniqSendIDs:
# writer.writerow([item])
# f.close()
f = open('Bounces.csv')
for row in csv.reader(f):
 for item in SendIDs: #OR UniqSendIDs
 if item == row[1]:
 SubKeys.append(row[2])
 BouncedEAes.append(row[3])
 BounceCategories.append(row[8])
 BounceReasones.append(row[10])
 #EventDate: Only need bounce date, NO time required.
 BounceDate = time.strptime(row[6],"%m/%d/%Y %H:%M:%S %p")
 BounceDate = time.strftime("%m/%d/%Y", BounceDate)
 EventDates.append(BounceDate)
f.close()
f = open('Attributes.csv')
for row in csv.reader(f):
 for item in BouncedEAes:
 if item == row[2]:
 MerchantIDs.append(row[4])
f.close()
SubKeys.insert(0,"SubscriberKey")
BouncedEAes.insert(0,"EmailAddress")
MerchantIDs.insert(0,"Merchant_Number")
EventDates.insert(0,"EventDate")
BounceCategories.insert(0,"BounceCategory")
BounceReasones.insert(0,"BounceReason")
new_data=[[SubKeys[i],BouncedEAes[i], MerchantIDs[i], EventDates[i], BounceCategories[i], BounceReasones[i]] for i in range(len(BouncedEAes))]
# if len(SendTimes)==0:
# yesterday = date.today() - timedelta(1)
# SendTimes.append("NoWelcomeEmailsSent_"+yesterday.strftime("%Y%m%d"))
# if len(BouncedEAes)==1:
# SendTimes.append("NoBouncedEmails_"+SendTimes[-1])
# FILENAME FORMAT: Welcome-Cadence-#2-RESULT-YYYYMMDD
# YYYYMMDD: One day before the day the file was received.
yesterday = date.today() - timedelta(1)
f=open("Welcome-"+yesterday.strftime('%Y%m%d')+".csv","wb")
output=csv.writer(f)
for row in new_data:
 output.writerow(row)
f.close()
"""
ERROR MESSAGES
"""
#new_data=[[SubKeys[i],BouncedEAes[i], MerchantIDs[i], EventDates[i], BounceCategories[i], BounceReasones[i]] for i in range(len(BouncedEAes))]
#IndexError: list index out of range
"""
Check Attributes.csv
"""

Question 2

If this is your first program ever, then I am impressed. The code is easy to understand, and aside from the commented-out code that you left there, quite clean.

What you could improve

Better use of data structures, part 1:

When processing Bounces.csv, you match SendIDs with column 1 using...
```
for row in csv.reader(f):
 for item in SendIDs:
 if item == row[1]:
 ...
```
In Python, a simpler way to see if an item exists in an array is using the in operator:
```
for row in csv.reader(f):
 if row[1] in SendIDs:
 ...
```
However, if SendIDs is a list, then checking for membership could require looking at every element of the list. A more efficient data structure for such lookups is a set. A set supports constant-time lookups, no matter how many SendIDs there are.
Building SendIDs:

Your first loop would be better written using a list comprehension. Files should be opened using a with block, as @Morwenn mentioned.
```
with open('SendJobs.csv') as f:
 SendIDs = set(row[1] for row in csv.reader(f) if "Welcome" in row[7])
```
Better use of data structures, part 2:

The same inefficiency occurs in this loop, where you have to scan the whole list of BouncedEAs to match the merchant number:
```
f = open('Attributes.csv')
for row in csv.reader(f):
 for item in BouncedEAes:
 if item == row[2]:
 MerchantIDs.append(row[4])
f.close()
```
Here, the kind of lookup you want is not just to find out whether an entry exists, but to do a mapping. For that, you need a dictionary. Here's how you can build one:
```
# Lookup table from BouncedEAs to MerchantIDs
with open('Attributes.csv') as f:
 MerchantIDForBouncedEA = {row[2]: row[4] for row in csv.reader(f)}
```
Avoid using a list for each column.

You have six lists, each representing a column. Each bounce record is split across those list, correlated by the position within the list. The code will be much cleaner if you store all of the data in one list, where each element represents an entire row in the Bounces.csv file.

With that dictionary, you can easily look up the merchant number for any e-mail address. The preparation work above lets us build the desired data structure by processing Bounces.csv after the other two files.
```
with open('Bounces.csv') as f:
 Bounces = [{
 'SubscriberKey': row[2],
 'EmailAddress': row[3],
 'Merchant_Number': MerchantIDForBouncedEA[row[3]],
 'BounceCategory': row[8],
 'BounceReason': row[10],
 #EventDate: Only need bounce date, NO time required.
 'EventDate': time.strftime('%m/%d/%Y', time.strptime(row[6], "%m/%d/%Y %H:%M:%S %p")),
 } for row in csv.reader(f) if row[1] in SendIDs]
```

Python version

The first big question: you used Python, but why did you use the version 2.7? This version (minus the bugfixes) is already 5 years old and its maintenance should have been dropped this year (it has been extended to 2020 since it's still widely used). But if you start from scratch, you might as well start with a fresh new version, such as Python 3.4. Generally speaking, don't use Python 2.7 unless you're forced to (there are still reasons); try to adapt to the most recent version instead.

Commented out code

Generally speaking, you should never leave commented-out pieces of code in your code. They will only hinder readability. I must admit that you probably don't use source control source control software if it's your first time programming, but you might want to look at software like Git in the future so that every version of your code is saved somewhere and you don't have to keep outdated remnants of code in your source.

Useless stuff in general

More generally, remove from your code what is unneeded, such as unused variables or unused includes. In your case, you could get rid of the re, os and time imports: you don't use re, you only use os in commented-out code that shouldn't belong, and what you use from time is actually also available in datetime.

The `with` statement

One thing that is easy to forget is closing files. You didn't forget to close them, which is great, but Python provides a superior alternative to closing files in the form of the with statement. Here are your first lines of code rewritten with it:

with open("Welcome-"+yesterday.strftime('%Y%m%d')+".csv","wb") as f:
 output=csv.writer(f)
 for row in new_data:
 output.writerow(row)

As you can see, I didn't manually call f.close(), this is automatically done when we leave the with block. Actually, with isn't limited to closing files, it can do many other things, but it depends on the type you use it with.

Use the PEP 8

Python has a style guide which is rather good and people tend to follow it. Therefore, you could try to rewrite your code following that style guide so that it looks more like what people want to read when they read Python code. For example, I will rewrite once again the previous piece of code, following the PEP8:

with open("Welcome-" + yesterday.strftime('%Y%m%d') + ".csv", "wb") as f:
 output = csv.writer(f)
 for row in new_data:
 output.writerow(row)

Basically, it's still the same. I only added spaces. But even those can matter when you read code :)

Question 4

Thank you!!! I used Python 2.7 because it came with Mac. I didn't even know Python 2.7 was that old. I will rewrite my code using Python 3.4. Git seems really useful too. I named my files *_v1, *_v2 like that. Thank you again.

Question 5

I don't see any reason why the code couldn't just run on Python 3 as is.

Question 6

@200_success True, it can. But the question was tagged Python 2.7, so I found it valuable to mention it.

Question 7

Consider using a database

The best kind of code is code that you don't have to write.

A completely different approach to this problem is not to solve it in Python, but to take advantage of SQL. Your three CSV files are basically tables in a relational database. Joining tables is a very routine thing in SQL.

If you have no database at all, the quickest way to get started would be with SQLite. A setup to solve your problem would look something like this (I've used placeholders for irrelevant column names):

CREATE TABLE SendJobs (Zero, ID, Two, Three, Four, Five, Six, Subject);
CREATE TABLE Attributes (Zero, One, EmailAddress, Three, Merchant_Number);
CREATE TABLE Bounces
( Zero, SendID, SubscriberKey, EmailAddress, Four, Five, Timestamp, Seven, Category, Nine, Reason
);
CREATE VIEW WelcomeBounces AS
 SELECT Bounces.SubscriberKey
 , Bounces.EmailAddress
 , Attributes.Merchant_Number
 , date(Bounces.Timestamp) AS EventDate
 , Bounces.Category AS BounceCategory
 , Bounces.Reason AS BounceReason
 FROM
 Bounces
 INNER JOIN SendJobs
 ON SendJobs.ID = Bounces.SendID 
 INNER JOIN Attributes
 ON Attributes.EmailAddress = Bounces.EmailAddress 
 WHERE
 SendJobs.Subject LIKE '%Welcome%';

Then, after importing the three CSV files, it's just a matter of executing

sqlite> .mode csv
sqlite> .header on
sqlite> .output Welcome.csv
sqlite> SELECT * FROM WelcomeBounces;

in the SQLite shell. Your problem is such a good match for SQL that it would be worthwhile to create and discard a temporary SQLite database just to solve it.

Question 8

Hello, thank you for mentioning SQL. I did write a sql query for that .The problem is that I have to automate the process. Also, the original file is a .zip file with three csv files which is exported to my FTP from a company's CRM everyday. SELECT a.SubscriberKey, a.EmailAddress, a.Merchant_Number, CONVERT(DATE, b.EventDate) AS EventDate, b.BounceCategory, b.BounceReason FROM Attributes a INNER JOIN Bounces b ON a.SubscriberKey = b.SubscriberKey INNER JOIN SendJobs c ON b.SendID = c.SendID WHERE c.EmailName LIKE 'Welcome%'

200_success 200_success 145k22 gold badges190 silver badges478 bronze badges · Accepted Answer · 2015-04-29 19:11:06Z

If this is your first program ever, then I am impressed. The code is easy to understand, and aside from the commented-out code that you left there, quite clean.

What you could improve

Better use of data structures, part 1:

When processing Bounces.csv, you match SendIDs with column 1 using...
```
for row in csv.reader(f):
 for item in SendIDs:
 if item == row[1]:
 ...
```
In Python, a simpler way to see if an item exists in an array is using the in operator:
```
for row in csv.reader(f):
 if row[1] in SendIDs:
 ...
```
However, if SendIDs is a list, then checking for membership could require looking at every element of the list. A more efficient data structure for such lookups is a set. A set supports constant-time lookups, no matter how many SendIDs there are.
Building SendIDs:

Your first loop would be better written using a list comprehension. Files should be opened using a with block, as @Morwenn mentioned.
```
with open('SendJobs.csv') as f:
 SendIDs = set(row[1] for row in csv.reader(f) if "Welcome" in row[7])
```
Better use of data structures, part 2:

The same inefficiency occurs in this loop, where you have to scan the whole list of BouncedEAs to match the merchant number:
```
f = open('Attributes.csv')
for row in csv.reader(f):
 for item in BouncedEAes:
 if item == row[2]:
 MerchantIDs.append(row[4])
f.close()
```
Here, the kind of lookup you want is not just to find out whether an entry exists, but to do a mapping. For that, you need a dictionary. Here's how you can build one:
```
# Lookup table from BouncedEAs to MerchantIDs
with open('Attributes.csv') as f:
 MerchantIDForBouncedEA = {row[2]: row[4] for row in csv.reader(f)}
```
Avoid using a list for each column.

You have six lists, each representing a column. Each bounce record is split across those list, correlated by the position within the list. The code will be much cleaner if you store all of the data in one list, where each element represents an entire row in the Bounces.csv file.

With that dictionary, you can easily look up the merchant number for any e-mail address. The preparation work above lets us build the desired data structure by processing Bounces.csv after the other two files.
```
with open('Bounces.csv') as f:
 Bounces = [{
 'SubscriberKey': row[2],
 'EmailAddress': row[3],
 'Merchant_Number': MerchantIDForBouncedEA[row[3]],
 'BounceCategory': row[8],
 'BounceReason': row[10],
 #EventDate: Only need bounce date, NO time required.
 'EventDate': time.strftime('%m/%d/%Y', time.strptime(row[6], "%m/%d/%Y %H:%M:%S %p")),
 } for row in csv.reader(f) if row[1] in SendIDs]
```

Stack Exchange Network

Merging three CSV files with a primary key

3 Answers 3

What you could improve

Suggested solution

Python version

Commented out code

Useless stuff in general

The `with` statement

Use the PEP 8

Consider using a database

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Merging three CSV files with a primary key

3 Answers 3

What you could improve

Suggested solution

Python version

Commented out code

Useless stuff in general

The with statement

Use the PEP 8

Consider using a database

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

The `with` statement