Parsing oscilloscope binary data

Question 1

I wrote a simply parser for Tektronix's internal binary file format .isf. The code is supposed to read in the file and return the header data as well as the actual oszi-data.

Since I mostly code by myself, I would like your opinion on whether my code follows common practice and whether it can be improved.

import numpy as np
import os.path
def read_curve(binaryfile):
 """
 Reads one tektronix .isf file and returns a dictionary containing
 all tags as keys. The actual data is stored in the key "data".
 """
 postfixs = [".isf", ".ISF"]
 if os.path.splitext(binaryfile)[-1] not in postfixs:
 raise ValueError("File type unkown.")
 with open(binaryfile, 'rb') as bfile:
 # read header
 header = {}
 current = 0
 while True:
 current, name = _read_chunk(bfile, " ")
 if name != ":CURVE":
 current, value = _read_chunk(bfile, ";")
 assert name not in header
 header[name] = value
 else:
 # ":CURVE" is the last tag of the header, followed by
 # a '#' and a 7 digit number.
 header[name] = bfile.read(8)
 current = bfile.tell()
 break
 assert header["ENCDG"] == "BINARY"
 # read data as numpy array
 header["data"] = _read_data(bfile, current, header)
 return header
def _read_chunk(headerfile, delimiter):
 """
 Reads one chunk of header data. Based on delimiter, this may be a tag
 (ended by " ") or the value of a tag (ended by ";").
 """
 chunk = []
 while True:
 c = headerfile.read(1)
 if c != delimiter:
 chunk.append(c)
 else:
 return headerfile.tell(), "".join(chunk)
def _read_data(bfile, position, header):
 """
 Reads in the binary data as numpy array.
 Apparently, there are only 1d-signals stored in .isf files, so a 1d-array
 is read.
 Returns a 2d-array with timepoints and datapoints aligned.
 """
 # determine the datatype from header tags
 datatype = ">" if header["BYT_OR"] == "MSB" else "<"
 if header["BN_FMT"] == "RI":
 datatype += "i"
 else:
 datatype += "u"
 datatype += header[":WFMPRE:BYT_NR"]
 bfile.seek(position)
 data = np.fromfile(bfile, datatype)
 assert data.size == int(header["NR_PT"])
 # calculate true values
 data = data * float(header["YMULT"]) + float(header["YZERO"])
 # create timepoints
 t = np.arange(data.size)
 t = t * float(header["XINCR"]) + float(header["XZERO"])
 # create single array
 res = np.concatenate((t[None, :], data[None, :]), axis=0)
 return res

Edit: I posted my revised code here: Parsing oscilloscope data, follow up

Question 2

How is this supposed to work? Do I incorporate suggested changes in my code and update it here as an edit?

Question 3

I have rolled back Rev 4 → 3. Please see What to do when someone answers .

Question 4

Sorry for that... Some of these "what to do" articles I only find when presented by a moderator...

Question 5

Disclaimer: I haven't touch a tektronix device for years.

_read_chunk fails if WFID contains a semicolon.
Don't read data blindly till EOF. The header promises that many samples, but they might be trailing data.
The 7 digit number following a :CURVE # is not necessarily 7 digit. You must parse it as well (it is a single byte of count, aka X, followed by count digit number, aka YYY, which represents the byte length of the data).

I would also assert that YYY equals to NR_PT * BYT_NR

Question 6

Thanks for pointing that out! As explained in a comment above (a few seconds ago), I have no documentation of the data format at hand. As far as I understood it, the ";" should be the delimiter of different tags, how could that fail? Or did I misunderstand you? Thank you especially for the explanation about the number following :CURVE, I will incorporate that.

Question 7

Regarding EOF: I believe numpy.fromfile deals with that, since it is capable of reading the whole file without the need to know its size first. What do you suppose I should do instead?

Question 8

Sorry for spamming comments here, I only now realized what you meant exactly with the semicolon. You are right of course, I have to check for " and escape the part inside from searching for ;.

Question 9

I chose this as accepted answer because it showed the most severe design flaws.

Question 10

This seems pretty nice. I have just some nitpicks about this part:

postfixs = [".isf", ".ISF"]
if os.path.splitext(binaryfile)[-1] not in postfixs:
 raise ValueError("File type unkown.")

Nitpicks:

I generally don't like repeating the same thing in lowercase and uppercase versions. I prefer to use lowercase, and convert untrusted input to lowercase. However, that will make this check less strict: iSF and ISf also become valid extensions, which you might not want.
I'd spell out "postfixs" as "postfixes", or better yet, use the more common term "extensions"
Instead of a list of allowed extensions, I'd use a set. Although in this example you surely won't be hitting performance issues, it's a matter of principle and habit

So I'd write this as:

extensions = set([".isf"])
if os.path.splitext(binaryfile)[-1].lower() not in extensions:
 raise ValueError("File type unknown.")

As for the rest, this seems pretty clean to me. I especially like the assert statements documenting your assumptions.

Question 11

I honestly never thought about using a set at that point. Thanks for pointing is out! The .lower() part is nice as well - I don't really believe there will be file associated with .iSf that won't fit to this parser...

Question 12

Anytime. You have my vote too ;-)

Question 13

_read_chunk() does not need to return the filepointer position as you never use it. After leaving the main loop you (have to) read the filepointer explicitely. Thus, current is oblivious.

Speaking of naming, 'current' can be a lot of things...'currentposition' or the like would be self-documenting. Likewise, a 'bfile' and a 'binaryfile' might be an 'isf_file'; it isn't strictly binary after all.

The guidelines for python style (PEP8) point out that blank lines be used sparsely. I would omit them especially at the beginning of a code block like a loop as to visually underline the block.

_read_chunk() makes me feel uncomfortable - you know it's text data but resort to byte-wise reading and searching (for the delims). Built-in string search is much cleaner and easier to follow. You could read in the first 1 MB or so and parse all string data up to the '#' (if this is unique). Its string position gives you the starting point for reading the binary data then. Anyway, very concise and clean programming style, a pleasure to read!

Edit: According to this Tektronix doc you need to parse the next few chars after the ':CURVE' token like this (same what vnp hinted at):

else:
 # ":CURVE" is the last tag of the header, followed by
 # a '#', a single ASCII length char and a variable-length ASCII digit number.
 skip = int(bfile.read(1))
 assert 1 <= skip <= 9
 numrecs = int(bfile.read(skip))
 assert numrecs == header['NR_PT'] * header['BYT_NR']
 current = bfile.tell()
 break

Question 14

The problem is, I am not sure about the exact specifications of the data format, since it does not seem to be documented (at least not for customers). I therefore do not know whether '#' is unique, which is why I tested for :CURVE, which seems to be the last tag. Is reading a precompiled amount of data and parsing it using string searching really a better solution? I do not safely know, how much data I need to read in to be sure i got the whole header... Why is reading text byte by byte (equals character by character, right?) a bad solution?

Question 15

I compare it to human reading - you wouldn't read a book letter by letter either. Based on the assumption that the header data is small (a few KB at most) it won't make much difference if you keep it the way you coded it, performance-wise. You are relying on assumptions a lot - for instance, that all header tags you are using later on are indeed found in the header.

Question 16

That is true, I depend on the relevant tags to be present. But then, if they are not present, I do not know how to interpret the data anyways, so the file is broken then. I may of course catch that case and print a meaningful message...

vnp vnp 58.5k4 gold badges55 silver badges144 bronze badges · Accepted Answer · 2015-05-17 03:37:30Z

1

\$\begingroup\$

Disclaimer: I haven't touch a tektronix device for years.

_read_chunk fails if WFID contains a semicolon.
Don't read data blindly till EOF. The header promises that many samples, but they might be trailing data.
The 7 digit number following a :CURVE # is not necessarily 7 digit. You must parse it as well (it is a single byte of count, aka X, followed by count digit number, aka YYY, which represents the byte length of the data).

I would also assert that YYY equals to NR_PT * BYT_NR

Share

answered May 17, 2015 at 3:37

vnp's user avatar

vnp vnp

58.5k4 gold badges55 silver badges144 bronze badges

\$\endgroup\$

4

\$\begingroup\$ Thanks for pointing that out! As explained in a comment above (a few seconds ago), I have no documentation of the data format at hand. As far as I understood it, the ";" should be the delimiter of different tags, how could that fail? Or did I misunderstand you? Thank you especially for the explanation about the number following :CURVE, I will incorporate that. \$\endgroup\$

Dux
– Dux

2015年05月17日 10:33:04 +00:00
Commented May 17, 2015 at 10:33
\$\begingroup\$ Regarding EOF: I believe numpy.fromfile deals with that, since it is capable of reading the whole file without the need to know its size first. What do you suppose I should do instead? \$\endgroup\$

Dux
– Dux

2015年05月17日 10:45:07 +00:00
Commented May 17, 2015 at 10:45
\$\begingroup\$ Sorry for spamming comments here, I only now realized what you meant exactly with the semicolon. You are right of course, I have to check for " and escape the part inside from searching for ;. \$\endgroup\$

Dux
– Dux

2015年05月17日 10:47:01 +00:00
Commented May 17, 2015 at 10:47
\$\begingroup\$ I chose this as accepted answer because it showed the most severe design flaws. \$\endgroup\$

Dux
– Dux

2015年05月18日 07:21:37 +00:00
Commented May 18, 2015 at 7:21

Add a comment |

Stack Exchange Network

Parsing oscilloscope binary data

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Parsing oscilloscope binary data

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions