13
\$\begingroup\$

I'm looking for a simple way of parsing complex text files into a pandas DataFrame. Below is a sample file, what I want the result to look like after parsing, and my current method.

Is there any way to make it more concise/faster/more pythonic/more readable?

I have also put this question on Stack Overflow. I eventually wrote a blog article to explain this to beginners.

Here is a sample file:

Sample text
A selection of students from Riverdale High and Hogwarts took part in a quiz. This is a record of their scores.
School = Riverdale High
Grade = 1
Student number, Name
0, Phoebe
1, Rachel
Student number, Score
0, 3
1, 7
Grade = 2
Student number, Name
0, Angela
1, Tristan
2, Aurora
Student number, Score
0, 6
1, 3
2, 9
School = Hogwarts
Grade = 1
Student number, Name
0, Ginny
1, Luna
Student number, Score
0, 8
1, 7
Grade = 2
Student number, Name
0, Harry
1, Hermione
Student number, Score
0, 5
1, 10
Grade = 3
Student number, Name
0, Fred
1, George
Student number, Score
0, 0
1, 0

Here is what I want the result to look like after parsing:

 Name Score
School Grade Student number 
Hogwarts 1 0 Ginny 8
 1 Luna 7
 2 0 Harry 5
 1 Hermione 10
 3 0 Fred 0
 1 George 0
Riverdale High 1 0 Phoebe 3
 1 Rachel 7
 2 0 Angela 6
 1 Tristan 3
 2 Aurora 9

Here is how I currently parse it:

import re
import pandas as pd
def parse(filepath):
 """
 Parse text at given filepath
 Parameters
 ----------
 filepath : str
 Filepath for file to be parsed
 Returns
 -------
 data : pd.DataFrame
 Parsed data
 """
 data = []
 with open(filepath, 'r') as file:
 line = file.readline()
 while line:
 reg_match = _RegExLib(line)
 if reg_match.school:
 school = reg_match.school.group(1)
 if reg_match.grade:
 grade = reg_match.grade.group(1)
 grade = int(grade)
 if reg_match.name_score:
 value_type = reg_match.name_score.group(1)
 line = file.readline()
 while line.strip():
 number, value = line.strip().split(',')
 value = value.strip()
 dict_of_data = {
 'School': school,
 'Grade': grade,
 'Student number': number,
 value_type: value
 }
 data.append(dict_of_data)
 line = file.readline()
 line = file.readline()
 data = pd.DataFrame(data)
 data.set_index(['School', 'Grade', 'Student number'], inplace=True)
 # consolidate df to remove nans
 data = data.groupby(level=data.index.names).first()
 # upgrade Score from float to integer
 data = data.apply(pd.to_numeric, errors='ignore')
 return data
class _RegExLib:
 """Set up regular expressions"""
 # use https://regexper.com to visualise these if required
 _reg_school = re.compile('School = (.*)\n')
 _reg_grade = re.compile('Grade = (.*)\n')
 _reg_name_score = re.compile('(Name|Score)')
 def __init__(self, line):
 # check whether line has a positive match with all of the regular expressions
 self.school = self._reg_school.match(line)
 self.grade = self._reg_grade.match(line)
 self.name_score = self._reg_name_score.search(line)
if __name__ == '__main__':
 filepath = 'sample.txt'
 data = parse(filepath)
 print(data)
Jamal
35.2k13 gold badges134 silver badges238 bronze badges
asked Dec 26, 2017 at 20:26
\$\endgroup\$

1 Answer 1

15
+50
\$\begingroup\$

There are a few performance tricks we can apply here:

  • add __slots__ to the class definition should help with memory and performance as well:

    class _RegExLib:
     """Set up regular expressions"""
     # use https://regexper.com to visualise these if required
     _reg_school = re.compile(r'School = (.*)\n')
     _reg_grade = re.compile(r'Grade = (.*)\n')
     _reg_name_score = re.compile(r'(Name|Score)')
     __slots__ = ['school', 'grade', 'name_score']
     def __init__(self, line):
     # check whether line has a positive match with all of the regular expressions
     self.school = self._reg_school.match(line)
     self.grade = self._reg_grade.match(line)
     self.name_score = self._reg_name_score.search(line)
    
  • using next() instead of .readline() should be faster as it uses a lookahead buffer internally:

    with open(filepath, 'r') as file:
     line = next(file)
     while line:
     reg_match = _RegExLib(line)
     if reg_match.school:
     school = reg_match.school.group(1)
     if reg_match.grade:
     grade = reg_match.grade.group(1)
     grade = int(grade)
     if reg_match.name_score:
     value_type = reg_match.name_score.group(1)
     line = next(file, None)
     while line and line.strip():
     number, value = line.strip().split(',')
     value = value.strip()
     dict_of_data = {
     'School': school,
     'Grade': grade,
     'Student number': number,
     value_type: value
     }
     data.append(dict_of_data)
     line = next(file, None)
     line = next(file, None)
    

Some of the code style and other notes:

  • file is a builtin (in Python 2.x only), consider a different variable name
  • define your regular expression strings as raw strings
  • you can probably replace .* wildcard with a more concrete \d+ for the "grade" regex: Grade = (\d+)\n
answered Dec 26, 2017 at 20:56
\$\endgroup\$
2
  • 3
    \$\begingroup\$ Note that file is only a standardized name in Python 2 (it isn't reserved, anyway). \$\endgroup\$ Commented Dec 26, 2017 at 20:59
  • 1
    \$\begingroup\$ @Coal_ ah, good point, used a more accurate builtin word, thanks! \$\endgroup\$ Commented Dec 26, 2017 at 21:00

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.