Python/ Pandas CSV Parsing

Question 1

I used JotForm Configurable list widget to collect data, but having troubles parsing the resulting data correctly. When I use

testdf = pd.read_csv ("TestLoad.csv")

The data is read in as two records and the details are stored in the "Information" column. I understand why it is parsed the way it is, but I would like to break out the details into multiple records as noted below.

Any help would be appreciated.

Sample CSV

"Date","Information","Type"
"2015-12-06","First: Tom, Last: Smith, School: MCAA; First: Tammy, Last: Smith, School: MCAA;","New"
"2015-12-06","First: Jim, Last: Jones, School: MCAA; First: Jane, Last: Jones, School: MCAA;","New"

Current Result

Date Information Type
2015年12月06日 First: Tom, Last: Smith, School: MCAA; First: Tammy, Last: Smith, School: MCAA; New
2015年12月06日 First: Jim, Last: Jones, School: MCAA; First: Jane, Last: Jones, School: MCAA; New

Desired Result

Date First Last School Type
2015年12月06日 Tom Smith MCAA New
2015年12月06日 Tammy Smith MCAA New
2015年12月06日 Jim Jones MCAA New
2015年12月06日 Jane Jones MCAA New

Question 2

This is useless text that is required to keep an answer from being downvoted by the moderators. Here is the data I used:

"Date","Information","Type"
"2015-12-07","First: Jim, Last: Jones, School: MCAA; First: Jane, Last: Jones, School: MCAA;","Old"
"2015-12-06","First: Tom, Last: Smith, School: MCAA; First: Tammy, Last: Smith, School: MCAA;","New"

import pandas as pd
import numpy as np
import csv
import re
import itertools as it
import pprint
import datetime as dt
records = [] #Construct a complete record for each person
colon_pairs = r"""
 (\w+) #Match a 'word' character, one or more times, captured in group 1, followed by..
 : #A colon, followed by...
 \s* #Whitespace, 0 or more times, followed by...
 (\w+) #A 'word' character, one or more times, captured in group 2.
"""
colon_pairs_per_person = 3
with open("csv1.csv", encoding='utf-8') as f:
 next(f) #skip header line
 record = {}
 for date, info, the_type in csv.reader(f):
 info_parser = re.finditer(colon_pairs, info, flags=re.X)
 for i, match_obj in enumerate(info_parser):
 key, val = match_obj.groups()
 record[key] = val
 if (i+1) % colon_pairs_per_person == 0: #then done with info for a person
 record['Date'] = dt.datetime.strptime(date, '%Y-%m-%d') #So that you can sort the DataFrame rows by date.
 record['Type'] = the_type
 records.append(record)
 record = {}
pprint.pprint(records)
df = pd.DataFrame(
 sorted(records, key=lambda record: record['Date'])
)
print(df)
df.set_index('Date', inplace=True)
print(df)
--output:--
[{'Date': datetime.datetime(2015, 12, 7, 0, 0),
 'First': 'Jim',
 'Last': 'Jones',
 'School': 'MCAA',
 'Type': 'Old'},
 {'Date': datetime.datetime(2015, 12, 7, 0, 0),
 'First': 'Jane',
 'Last': 'Jones',
 'School': 'MCAA',
 'Type': 'Old'},
 {'Date': datetime.datetime(2015, 12, 6, 0, 0),
 'First': 'Tom',
 'Last': 'Smith',
 'School': 'MCAA',
 'Type': 'New'},
 {'Date': datetime.datetime(2015, 12, 6, 0, 0),
 'First': 'Tammy',
 'Last': 'Smith',
 'School': 'MCAA',
 'Type': 'New'}]
 Date First Last School Type
0 2015年12月06日 Tom Smith MCAA New
1 2015年12月06日 Tammy Smith MCAA New
2 2015年12月07日 Jim Jones MCAA Old
3 2015年12月07日 Jane Jones MCAA Old
 First Last School Type
Date 
2015年12月06日 Tom Smith MCAA New
2015年12月06日 Tammy Smith MCAA New
2015年12月07日 Jim Jones MCAA Old
2015年12月07日 Jane Jones MCAA Old

Question 3

7stud - thanks for the solution. This is the method approach I ended up using since the number of people within a record could be 1:n

Question 4

I used regex separators with the python engine so I could specify multiple separators. Then, I used the usecols parameter to specify which columns in the csv file you want in your dataframe. The header will not be read from file, and I skipped the first row since it doesn't have any data. I read in the first and second set of records into 2 dataframes, and then concatenate the 2 data frames.

a = pd.read_csv('sample.csv', sep=',|:|;', skiprows = 1, usecols = (0,2,4,6, 14), header = None, engine='python')
b = pd.read_csv('sample.csv', sep=',|:|;', skiprows = 1, usecols = (0,8,10,12,14), header = None, engine='python')
a.columns = ['Date', 'First', "Last", 'School', 'Type']
b.columns = ['Date', 'First', "Last", 'School', 'Type']
final_data = pd.concat([a,b], axis = 0)

If you need the order preserved, such that the second names appear right below the first name, you can sort using the indices. I use mergesort, because it is a stable sort and this ensures that the first Information record (record on the right) will be above the Information record on the left.

final_data.sort_index(kind='mergesort', inplace = True)
>>>final_data
 Date First Last School Type
0 "2015-12-06" Tom Smith MCAA "New"
0 "2015-12-06" Tammy Smith MCAA "New"
1 "2015-12-06" Jim Jones MCAA "New"
1 "2015-12-06" Jane Jones MCAA "New"

Edit: Including the second set of the record into the data. Changed the axis to 0.

Question 5

Thanks for the approach. I was able to replicate, however when I tried it, the code didn't pick up the second names in each of the rows (e.g. Tammy Smith and Jane Jones). Is there something I need to do differently to iterate through the text in the "Information" column?

Question 6

@Zymurgist66 Do the records have to appear such that Tom Smith has to appear right above Timmy Smith? Anyways, I edited my response to read both set of names and provided an option so the order can be maintained.

Question 7

user1435522 - no the order is not relevant. The initial examples I was testing only had 2 people per record. As I tried to use on the entire dataset, I found that the number of people could be 1:n so I ended up needing to iterate through the people.

7stud 48.8k14 gold badges107 silver badges137 bronze badges · Accepted Answer · 2015-12-08 05:23:05Z

This is useless text that is required to keep an answer from being downvoted by the moderators. Here is the data I used:

"Date","Information","Type"
"2015-12-07","First: Jim, Last: Jones, School: MCAA; First: Jane, Last: Jones, School: MCAA;","Old"
"2015-12-06","First: Tom, Last: Smith, School: MCAA; First: Tammy, Last: Smith, School: MCAA;","New"

import pandas as pd
import numpy as np
import csv
import re
import itertools as it
import pprint
import datetime as dt
records = [] #Construct a complete record for each person
colon_pairs = r"""
 (\w+) #Match a 'word' character, one or more times, captured in group 1, followed by..
 : #A colon, followed by...
 \s* #Whitespace, 0 or more times, followed by...
 (\w+) #A 'word' character, one or more times, captured in group 2.
"""
colon_pairs_per_person = 3
with open("csv1.csv", encoding='utf-8') as f:
 next(f) #skip header line
 record = {}
 for date, info, the_type in csv.reader(f):
 info_parser = re.finditer(colon_pairs, info, flags=re.X)
 for i, match_obj in enumerate(info_parser):
 key, val = match_obj.groups()
 record[key] = val
 if (i+1) % colon_pairs_per_person == 0: #then done with info for a person
 record['Date'] = dt.datetime.strptime(date, '%Y-%m-%d') #So that you can sort the DataFrame rows by date.
 record['Type'] = the_type
 records.append(record)
 record = {}
pprint.pprint(records)
df = pd.DataFrame(
 sorted(records, key=lambda record: record['Date'])
)
print(df)
df.set_index('Date', inplace=True)
print(df)
--output:--
[{'Date': datetime.datetime(2015, 12, 7, 0, 0),
 'First': 'Jim',
 'Last': 'Jones',
 'School': 'MCAA',
 'Type': 'Old'},
 {'Date': datetime.datetime(2015, 12, 7, 0, 0),
 'First': 'Jane',
 'Last': 'Jones',
 'School': 'MCAA',
 'Type': 'Old'},
 {'Date': datetime.datetime(2015, 12, 6, 0, 0),
 'First': 'Tom',
 'Last': 'Smith',
 'School': 'MCAA',
 'Type': 'New'},
 {'Date': datetime.datetime(2015, 12, 6, 0, 0),
 'First': 'Tammy',
 'Last': 'Smith',
 'School': 'MCAA',
 'Type': 'New'}]
 Date First Last School Type
0 2015年12月06日 Tom Smith MCAA New
1 2015年12月06日 Tammy Smith MCAA New
2 2015年12月07日 Jim Jones MCAA Old
3 2015年12月07日 Jane Jones MCAA Old
 First Last School Type
Date 
2015年12月06日 Tom Smith MCAA New
2015年12月06日 Tammy Smith MCAA New
2015年12月07日 Jim Jones MCAA Old
2015年12月07日 Jane Jones MCAA Old

7stud - thanks for the solution. This is the method approach I ended up using since the number of people within a record could be 1:n

CollectivesTM on Stack Overflow

Python/ Pandas CSV Parsing

2 Answers 2

1 Comment

3 Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

CollectivesTM on Stack Overflow

2 Answers 2

1 Comment

3 Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related