
I have a class that imports a CSV file into a Pandas DataFrame. However, this CSV has a multiline header that changes length seemingly at random. Sometimes it is 27 lines, sometimes 31, sometimes something else. The data is dirty in plenty of other ways as well; just look at the header separator.

As a result, the working code I have reads this file three times. The first pass checks the length of the header, the second reads the body of data into one DataFrame, and the third reads the header into another DataFrame (this is later parsed out into a dictionary; it is super messy, but I find it easiest to do it this way). The code I have is below (taken from a larger class; I can add more of the class if required):

"""
Program to import and process Valeport SVP files
"""
import pandas as pd
from dateutil import parser
from icecream import ic
class ValeportSVPImport:
 """
 Importer Class for Valeport .000 format SVP files
 The header header is parsed off the main body of data.
 The data is added to a Pandas DataFrame
 The processed header data is exported as a dictionary.
 - Keys - Valeport Settings
 - Values - Values from the profile
 """
 def __init__(self, path):
 """
 Set up for the
 :param path: Filepath of the .000 file that is to be parsed
 :return data: Pandas DataFrame containing the SVP Raw data.
 :return header_dict: Dictionary containing the SVP header data.
 """
 self.path = path
 # Find header length, even if irregular
 self.header_size = self.__header_size__()
 # split data into the header and the import
 self.data, self.header = self.__import_svp__()
 # Parse the relevant information out of the header
 self.header_dict = self.__parse_header__()
 def __header_size__(self):
 """
 Finds the header row for the data
 The size of the header file in a .000 file can change, this code takes that into account
 :return: number of header row
 """
 with open(self.path, "r") as filename:
 lines = filename.readlines()
 i = 0
 for line in lines:
 split = line.split('\t')
 if len(split) == 9:
 break
 else:
 i = i + 1
 return i
 def __import_svp__(self):
 """
 Parse the CSV files into Pandas DataFrames
 :return: DataFrame (data) - RAW SVP Data, DataFrame (data_header) RAW Header Data
 """
 i = self.header_size
 # Import body of data
 data = pd.read_csv(self.path, sep=r"\t", header=i, engine='python')
 # Drop any columns that are completed void
 data = data.dropna(axis=1, how='all')
 # Import the header
 data_header = pd.read_csv(self.path, sep=':\t', nrows=i - 1, header=None, engine='python')
 return data, data_header
 def __parse_header__(self):
 """
 Strip out all of the relevant header material.
 :return: Dictionary - {Setting: Value}
 """
 df = self.header
 df.columns = ['KEY', 'VALUE']
 df['KEY'] = df['KEY'].str.strip(':')
 df['KEY'] = df['KEY'].str.strip()
 # Rows from the header that will be added into the header dictionary
 required_data = ['Model Name', 'Site Information', 'Serial No.', 'Sample Mode',
 'Tare Setting', 'Tare Time Stamp', 'Density', 'Gravity', 'Time Stamp']
 # return a dict of {Key Name : [Line Number, Value]
 header_dict = {x: [df[df['KEY'] == x].index[0], df['VALUE'].iloc[df[df['KEY'] == x].index[0]]]
 for x in required_data}
 # Turn timestamp into a DateTime string
 header_dict['Time Stamp'][1] = parser.parse(header_dict['Time Stamp'][1])
 # Process Tare Date/Time
 header_dict['Tare Time Stamp'][1], header_dict['Tare Setting'][1] = self.__tare_time__(header_dict)
 # Remove the line number out of the dictionary before it is returned
 return {k: header_dict[k][1] for k in header_dict}
 @staticmethod
 def __tare_time__(header_dict):
 """
 Process the Tare data out of the header.path
 - Sometimes the Tare time and data is in with the settings, other times it is not.
 - In both cases the Time and Date are parsed into DateTime format.
 :param header_dict: Dictionary containing all header information.
 :return: Tare DateTime, Tare Vale
 """
 # If the
 if header_dict['Tare Time Stamp'][1] is None and len(header_dict['Tare Setting'][1].split(';')) == 2:
 tare_split = header_dict['Tare Setting'][1].split(';')
 tare_dt = parser.parse(tare_split[1])
 tare_setting = float(tare_split[0])
 else:
 tare_dt = parser.parse(header_dict['Tare Time Stamp'][1])
 tare_setting = float(header_dict['Tare Setting'])
 return tare_dt, tare_setting
if __name__ == '__main__':
 svp = ValeportSVPImport(path=r"C:\Users\...")
 ic(svp.header_dict)
 ic(svp.data)

Below is an example of the data that I am trying to import:

Previous File Location : 0
No of Bytes Stored in Previous File : 0
Model Name : MIDAS SVX2 3000
File Name : C:\....
Site Information : xxx
Serial No. : 41850
No. of Modules Connected : 
Fitted Address List : 
Parameters for each module : 
User Calibrations : 
Secondary Cal Used : 
Gain : 
Offset : 
Gain Control Settings : 
SD Selected Flag : 
Average Mode : NONE
Moving Average Length: 1
Sample Mode : CONT
Sample Interval : 1
Sample Rate : 1
Sample Period : 1
Tare Setting : 10.822;15/06/2021 06:20:05
Tare Time Stamp : 
Density : 
Gravity : 
Time Stamp : 15/06/2021 07:51:17
External PSU Voltage : 24.553
Date / Time SOUND VELOCITY;M/SEC PRESSURE;DBAR TEMPERATURE;C CONDUCTIVITY;MS/CM Calc. SALINITY;PSU Calc. DENSITY;KG/M3 Calc. SOUND VELOCITY;M/SEC 
15/06/2021 07:51:17 0.000 0.006 24.021 0.002 0.012 -2.698 1494.063 
15/06/2021 07:51:18 0.000 -0.002 24.023 0.002 0.012 -2.699 1494.069 
15/06/2021 07:51:19 0.000 0.015 24.025 0.002 0.012 -2.699 1494.074 
15/06/2021 07:51:20 0.000 0.019 24.025 0.002 0.012 -2.699 1494.074 
15/06/2021 07:51:21 0.000 -0.012 24.026 0.002 0.012 -2.700 1494.077 
15/06/2021 07:51:22 0.000 0.007 24.025 0.002 0.012 -2.699 1494.074 
15/06/2021 07:51:23 0.000 0.008 24.028 0.002 0.012 -2.700 1494.082 
15/06/2021 07:51:24 0.000 0.009 24.029 0.002 0.012 -2.700 1494.085 
15/06/2021 07:51:25 0.000 0.002 24.028 0.002 0.012 -2.700 1494.082 
15/06/2021 07:51:26 0.000 0.002 24.024 0.002 0.012 -2.699 1494.071 

and

Previous File Location : 786432
No of Bytes Stored in Previous File : 0
Model Name : MIDAS SVX2 3000
File Name : UNKNOWN
Site Information : xxxx
Serial No. : 29681
No. of Modules Connected : 3
Fitted Address List : 12;21;49;
Parameters for each module : 1;2;1;
User Calibrations :
15;0.000000e+00;0.000000e+00;0.000000e+00;0.000000e+00;1.000000e+00;0.000000e+00
15;0.000000e+00;0.000000e+00;0.000000e+00;0.000000e+00;1.000000e+00;0.000000e+00
15;0.000000e+00;0.000000e+00;0.000000e+00;0.000000e+00;1.000000e+00;0.000000e+00
15;0.000000e+00;0.000000e+00;0.000000e+00;0.000000e+00;1.000000e+00;0.000000e+00
Secondary Cal Used : 1;0;0;1;1;0;1;0;0;
Gain : 1000;0;0;10000;1000;0;500;0;0;
Offset : 0;0;0;0;-20000;0;-20000;0;0;
Gain Control Settings : 226;19;0;
SD Selected Flag : 1
Average Mode : NONE
Moving Average Length : 1
Sample Mode : CONT
Sample Interval : 60
Sample Rate : 1
Sample Period : 1
Tare Setting : 10.353
Tare Time Stamp : 15/06/2021 07:20:18
Density : 1027.355
Gravity : 9.807
Time Stamp : 15/06/2021 07:21:52
External PSU Voltage : 11
Date / Time SOUND VELOCITY;M/SEC PRESSURE;DBAR TEMPERATURE;C CONDUCTIVITY;MS/CM Calc. SALINITY; PSU Calc. DENSITY ANOMALY; KG/M3 [EOS-80] Calc. SOS; M/SEC 
15/06/2021 07:21:52.000 0.000 0.019 24.309 0.000 0.012 -2.769 1494.852 
15/06/2021 07:21:53.000 0.000 -0.002 24.310 0.000 0.012 -2.770 1494.855 
15/06/2021 07:21:54.000 0.000 0.012 24.311 -0.002 0.012 -2.770 1494.858 
15/06/2021 07:21:55.000 0.000 0.008 24.315 0.000 0.012 -2.771 1494.868 
15/06/2021 07:21:56.000 0.000 0.006 24.325 0.000 0.012 -2.774 1494.896 
15/06/2021 07:21:57.000 0.000 0.004 24.324 0.000 0.012 -2.773 1494.893

The program as a whole is starting to get quite large and is slowing down a bit, so I am going back over it to spot any choke points, and I am hoping this is one of them.

Any suggestions would be greatly welcomed.

asked Jun 27, 2021 at 23:40
  • I think this would benefit from some sample data. Ideally, this would be done not in an attached file but just formatted as code, with as few headers and rows as possible while still conveying the complexity of the problem. Commented Jun 28, 2021 at 0:04
  • @Kraigolas - Good point. Data examples now added. Commented Jun 28, 2021 at 0:22
  • Other notes: 1) We would probably benefit from having the whole class. It would be nice to see the expected output or even be able to run the code ourselves. 2) "So I am going back over it to spot any choke points. Hoping this is one of them." is a bad strategy. Python provides cProfile to profile your code. Always use a profiler to see where the choke points are instead of guessing; cProfile is not very difficult to get started with. It's made to be easy. Commented Jun 28, 2021 at 0:30
  • @Kraigolas - Thanks for the tip about cProfile, I will look into that. I have also added the entire class now, so hopefully that will help. This is just one class in a larger program that is way too chunky to upload here (and I haven't uploaded it to GitHub yet, so I can't add a link to that). Commented Jun 28, 2021 at 0:40
  • The sample data doesn't appear to contain any \ts in it, so the code doesn't parse the sample data correctly. Commented Jun 29, 2021 at 4:38
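
Following up on the cProfile suggestion in the comments, a minimal profiling sketch. The profiled function here is a hypothetical stand-in; in practice you would profile the real `ValeportSVPImport(path=...)` call:

```python
import cProfile
import io
import pstats


def import_stub(n):
    # Hypothetical stand-in for ValeportSVPImport(path=...);
    # profile the real constructor call instead.
    return sum(i * i for i in range(n))


profiler = cProfile.Profile()
result = profiler.runcall(import_stub, 100_000)

# Print the five most expensive calls by cumulative time
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats('cumulative').print_stats(5)
print(stream.getvalue())
```

The report names each function with its call count and cumulative time, which tells you whether the triple file read is actually the bottleneck before you optimise it.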

1 Answer

Per PEP 8, never use leading and trailing double underscores in names of your own functions or methods; they are reserved for Python's magic methods. If you want to indicate a method is protected (not part of the public API, but usable by subclasses), use a single leading underscore. To indicate it is private (not for use by subclasses), use a leading double underscore (but not a trailing one).
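
A minimal illustration of the two conventions (the class names are hypothetical):

```python
class Importer:
    def _header_size(self):
        # Single leading underscore: protected; subclasses may call or override this.
        return 0

    def __parse(self):
        # Double leading underscore (no trailing): private; Python name-mangles
        # this attribute to _Importer__parse, making accidental override hard.
        return {}


class CsvImporter(Importer):
    def _header_size(self):
        # Overriding a protected method is the expected usage.
        return 1
```

Dunder names like `__init__` get no mangling because of the trailing underscores, which is exactly why you should not invent your own.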

open() returns a file object; filename is a mis-descriptive variable name.

Don't use readlines() to slurp the entire file into a list and then iterate over the list of lines. Iterate over the file directly.

Use enumerate() instead of a loop counter. So, __header_size__() could be something like:

def _header_size(self):
    with open(self.path, "r") as datafile:
        for i, line in enumerate(datafile):
            parts = line.split('\t')
            if len(parts) == 9:
                break
    return i

The first argument to read_csv() can be a file-like object, e.g. an object with a read() method (like an open file or StringIO). So, the strategy for reading these data files in a single pass is to read the file line by line, parsing the header as you go. When you reach the line with the column names, parse the column names, then let read_csv() parse the table.

# These are the keys in the header that we care about
HEADER_KEYS = ['Model Name', 'Site Information', 'Serial No.',
               'Sample Mode', 'Tare Setting', 'Tare Time Stamp',
               'Density', 'Gravity', 'Time Stamp']


class ValeportSVPImport:
    def __init__(self, path):
        self.header = dict.fromkeys(HEADER_KEYS)

        with open(path) as datafile:
            for line in datafile:
                # Is this line the start of the data table?
                # The original code used `if len(line.split('\t')) == 9:`
                if line.startswith('Date / Time'):
                    columns = parse_column_names(line)
                    break

                # Is this part of a key : value pair?
                if ':' in line:
                    key, value = line.split(':', maxsplit=1)
                    key = key.strip()

                    # Is it a key we're interested in?
                    if key in self.header:
                        self.header[key] = value.strip()

            # The file handle is now positioned at the first data row
            self.data = pd.read_csv(datafile, sep=r'\s+', header=None, names=columns)

Note 1: I didn't implement parse_column_names() because the sample data appears to be missing the '\t' characters.
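
If the real .000 files are tab-separated (as the original `line.split('\t')` check suggests, even though the pasted sample lost its tabs), a sketch of parse_column_names could be as simple as:

```python
def parse_column_names(line):
    """Split the column-name row on tabs and drop empty trailing fields.

    Assumes the real file is tab-separated; the pasted sample data appears
    to have lost its '\t' characters, so this is untested against real files.
    """
    return [name.strip() for name in line.rstrip('\n').split('\t') if name.strip()]


row = "Date / Time\tSOUND VELOCITY;M/SEC\tPRESSURE;DBAR\n"
names = parse_column_names(row)
```

Splitting on whitespace alone would not work here, because names like "Date / Time" contain internal spaces.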

Note 2: If the line with the column names can be parsed along with the data by read_csv(), you could call datafile.tell() before reading each line and datafile.seek() to back up in the file, then let read_csv() read the column names and the data.
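
A sketch of that tell()/seek() approach, using StringIO with made-up tab-separated content standing in for the open file (tell() works with readline(), but not while iterating a real file with a for loop):

```python
import io

import pandas as pd

# Hypothetical file content; real .000 headers and rows will differ
sample = (
    "Model Name : MIDAS SVX2 3000\n"
    "Serial No. : 41850\n"
    "Date / Time\tSOUND VELOCITY;M/SEC\tPRESSURE;DBAR\n"
    "15/06/2021 07:51:17\t0.000\t0.006\n"
    "15/06/2021 07:51:18\t0.000\t-0.002\n"
)

datafile = io.StringIO(sample)       # stands in for open(path)
while True:
    pos = datafile.tell()            # remember where this line starts
    line = datafile.readline()
    if not line or line.startswith('Date / Time'):
        break

datafile.seek(pos)                   # back up to the column-name row
df = pd.read_csv(datafile, sep='\t') # read_csv parses names and data together
```

This keeps the single pass over the header while letting read_csv own all of the column handling.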

answered Jun 29, 2021 at 8:13
