I'm looking for a simple way of parsing complex text files into a pandas DataFrame. Below is a sample file, what I want the result to look like after parsing, and my current method.
Is there any way to make it more concise/faster/more pythonic/more readable?
I have also put this question on Stack Overflow. I eventually wrote a blog article to explain this to beginners.
Here is a sample file:
Sample text
A selection of students from Riverdale High and Hogwarts took part in a quiz. This is a record of their scores.
School = Riverdale High
Grade = 1
Student number, Name
0, Phoebe
1, Rachel
Student number, Score
0, 3
1, 7
Grade = 2
Student number, Name
0, Angela
1, Tristan
2, Aurora
Student number, Score
0, 6
1, 3
2, 9
School = Hogwarts
Grade = 1
Student number, Name
0, Ginny
1, Luna
Student number, Score
0, 8
1, 7
Grade = 2
Student number, Name
0, Harry
1, Hermione
Student number, Score
0, 5
1, 10
Grade = 3
Student number, Name
0, Fred
1, George
Student number, Score
0, 0
1, 0
Here is what I want the result to look like after parsing:
Name Score
School Grade Student number
Hogwarts 1 0 Ginny 8
1 Luna 7
2 0 Harry 5
1 Hermione 10
3 0 Fred 0
1 George 0
Riverdale High 1 0 Phoebe 3
1 Rachel 7
2 0 Angela 6
1 Tristan 3
2 Aurora 9
Here is how I currently parse it:
import re
import pandas as pd
def parse(filepath):
"""
Parse text at given filepath
Parameters
----------
filepath : str
Filepath for file to be parsed
Returns
-------
data : pd.DataFrame
Parsed data
"""
data = []
with open(filepath, 'r') as file:
line = file.readline()
while line:
reg_match = _RegExLib(line)
if reg_match.school:
school = reg_match.school.group(1)
if reg_match.grade:
grade = reg_match.grade.group(1)
grade = int(grade)
if reg_match.name_score:
value_type = reg_match.name_score.group(1)
line = file.readline()
while line.strip():
number, value = line.strip().split(',')
value = value.strip()
dict_of_data = {
'School': school,
'Grade': grade,
'Student number': number,
value_type: value
}
data.append(dict_of_data)
line = file.readline()
line = file.readline()
data = pd.DataFrame(data)
data.set_index(['School', 'Grade', 'Student number'], inplace=True)
# consolidate df to remove nans
data = data.groupby(level=data.index.names).first()
# upgrade Score from float to integer
data = data.apply(pd.to_numeric, errors='ignore')
return data
class _RegExLib:
"""Set up regular expressions"""
# use https://regexper.com to visualise these if required
_reg_school = re.compile('School = (.*)\n')
_reg_grade = re.compile('Grade = (.*)\n')
_reg_name_score = re.compile('(Name|Score)')
def __init__(self, line):
# check whether line has a positive match with all of the regular expressions
self.school = self._reg_school.match(line)
self.grade = self._reg_grade.match(line)
self.name_score = self._reg_name_score.search(line)
if __name__ == '__main__':
filepath = 'sample.txt'
data = parse(filepath)
print(data)
1 Answer 1
There are a few performance tricks we can apply here:
add
__slots__
to the class definition should help with memory and performance as well:class _RegExLib: """Set up regular expressions""" # use https://regexper.com to visualise these if required _reg_school = re.compile(r'School = (.*)\n') _reg_grade = re.compile(r'Grade = (.*)\n') _reg_name_score = re.compile(r'(Name|Score)') __slots__ = ['school', 'grade', 'name_score'] def __init__(self, line): # check whether line has a positive match with all of the regular expressions self.school = self._reg_school.match(line) self.grade = self._reg_grade.match(line) self.name_score = self._reg_name_score.search(line)
using
next()
instead of.readline()
should be faster as it uses a lookahead buffer internally:with open(filepath, 'r') as file: line = next(file) while line: reg_match = _RegExLib(line) if reg_match.school: school = reg_match.school.group(1) if reg_match.grade: grade = reg_match.grade.group(1) grade = int(grade) if reg_match.name_score: value_type = reg_match.name_score.group(1) line = next(file, None) while line and line.strip(): number, value = line.strip().split(',') value = value.strip() dict_of_data = { 'School': school, 'Grade': grade, 'Student number': number, value_type: value } data.append(dict_of_data) line = next(file, None) line = next(file, None)
Some of the code style and other notes:
file
is a builtin (in Python 2.x only), consider a different variable name- define your regular expression strings as raw strings
- you can probably replace
.*
wildcard with a more concrete\d+
for the "grade" regex:Grade = (\d+)\n
-
3\$\begingroup\$ Note that
file
is only a standardized name in Python 2 (it isn't reserved, anyway). \$\endgroup\$Daniel– Daniel2017年12月26日 20:59:59 +00:00Commented Dec 26, 2017 at 20:59 -
1\$\begingroup\$ @Coal_ ah, good point, used a more accurate
builtin
word, thanks! \$\endgroup\$alecxe– alecxe2017年12月26日 21:00:58 +00:00Commented Dec 26, 2017 at 21:00