I wrote a simply parser for Tektronix's internal binary file format .isf
. The code is supposed to read in the file and return the header data as well as the actual oszi-data.
Since I mostly code by myself, I would like your opinion on whether my code follows common practice and whether it can be improved.
import numpy as np
import os.path
def read_curve(binaryfile):
"""
Reads one tektronix .isf file and returns a dictionary containing
all tags as keys. The actual data is stored in the key "data".
"""
postfixs = [".isf", ".ISF"]
if os.path.splitext(binaryfile)[-1] not in postfixs:
raise ValueError("File type unkown.")
with open(binaryfile, 'rb') as bfile:
# read header
header = {}
current = 0
while True:
current, name = _read_chunk(bfile, " ")
if name != ":CURVE":
current, value = _read_chunk(bfile, ";")
assert name not in header
header[name] = value
else:
# ":CURVE" is the last tag of the header, followed by
# a '#' and a 7 digit number.
header[name] = bfile.read(8)
current = bfile.tell()
break
assert header["ENCDG"] == "BINARY"
# read data as numpy array
header["data"] = _read_data(bfile, current, header)
return header
def _read_chunk(headerfile, delimiter):
"""
Reads one chunk of header data. Based on delimiter, this may be a tag
(ended by " ") or the value of a tag (ended by ";").
"""
chunk = []
while True:
c = headerfile.read(1)
if c != delimiter:
chunk.append(c)
else:
return headerfile.tell(), "".join(chunk)
def _read_data(bfile, position, header):
"""
Reads in the binary data as numpy array.
Apparently, there are only 1d-signals stored in .isf files, so a 1d-array
is read.
Returns a 2d-array with timepoints and datapoints aligned.
"""
# determine the datatype from header tags
datatype = ">" if header["BYT_OR"] == "MSB" else "<"
if header["BN_FMT"] == "RI":
datatype += "i"
else:
datatype += "u"
datatype += header[":WFMPRE:BYT_NR"]
bfile.seek(position)
data = np.fromfile(bfile, datatype)
assert data.size == int(header["NR_PT"])
# calculate true values
data = data * float(header["YMULT"]) + float(header["YZERO"])
# create timepoints
t = np.arange(data.size)
t = t * float(header["XINCR"]) + float(header["XZERO"])
# create single array
res = np.concatenate((t[None, :], data[None, :]), axis=0)
return res
Edit: I posted my revised code here: Parsing oscilloscope data, follow up
-
\$\begingroup\$ How is this supposed to work? Do I incorporate suggested changes in my code and update it here as an edit? \$\endgroup\$Dux– Dux2015年05月17日 10:36:04 +00:00Commented May 17, 2015 at 10:36
-
\$\begingroup\$ I have rolled back Rev 4 → 3. Please see What to do when someone answers . \$\endgroup\$200_success– 200_success2015年05月18日 07:55:56 +00:00Commented May 18, 2015 at 7:55
-
\$\begingroup\$ Sorry for that... Some of these "what to do" articles I only find when presented by a moderator... \$\endgroup\$Dux– Dux2015年05月18日 07:58:24 +00:00Commented May 18, 2015 at 7:58
3 Answers 3
Disclaimer: I haven't touch a tektronix device for years.
_read_chunk
fails if WFID contains a semicolon.- Don't read data blindly till EOF. The header promises that many samples, but they might be trailing data.
The
7 digit number
following a:CURVE #
is not necessarily 7 digit. You must parse it as well (it is a single byte ofcount
, akaX
, followed bycount
digit number, akaYYY
, which represents the byte length of the data).I would also assert that
YYY
equals toNR_PT * BYT_NR
-
\$\begingroup\$ Thanks for pointing that out! As explained in a comment above (a few seconds ago), I have no documentation of the data format at hand. As far as I understood it, the ";" should be the delimiter of different tags, how could that fail? Or did I misunderstand you? Thank you especially for the explanation about the number following
:CURVE
, I will incorporate that. \$\endgroup\$Dux– Dux2015年05月17日 10:33:04 +00:00Commented May 17, 2015 at 10:33 -
\$\begingroup\$ Regarding EOF: I believe
numpy.fromfile
deals with that, since it is capable of reading the whole file without the need to know its size first. What do you suppose I should do instead? \$\endgroup\$Dux– Dux2015年05月17日 10:45:07 +00:00Commented May 17, 2015 at 10:45 -
\$\begingroup\$ Sorry for spamming comments here, I only now realized what you meant exactly with the semicolon. You are right of course, I have to check for
"
and escape the part inside from searching for;
. \$\endgroup\$Dux– Dux2015年05月17日 10:47:01 +00:00Commented May 17, 2015 at 10:47 -
\$\begingroup\$ I chose this as accepted answer because it showed the most severe design flaws. \$\endgroup\$Dux– Dux2015年05月18日 07:21:37 +00:00Commented May 18, 2015 at 7:21
This seems pretty nice. I have just some nitpicks about this part:
postfixs = [".isf", ".ISF"] if os.path.splitext(binaryfile)[-1] not in postfixs: raise ValueError("File type unkown.")
Nitpicks:
I generally don't like repeating the same thing in lowercase and uppercase versions. I prefer to use lowercase, and convert untrusted input to lowercase. However, that will make this check less strict:
iSF
andISf
also become valid extensions, which you might not want.I'd spell out "postfixs" as "postfixes", or better yet, use the more common term "extensions"
Instead of a
list
of allowed extensions, I'd use aset
. Although in this example you surely won't be hitting performance issues, it's a matter of principle and habit
So I'd write this as:
extensions = set([".isf"])
if os.path.splitext(binaryfile)[-1].lower() not in extensions:
raise ValueError("File type unknown.")
As for the rest, this seems pretty clean to me. I especially like the assert
statements documenting your assumptions.
-
\$\begingroup\$ I honestly never thought about using a set at that point. Thanks for pointing is out! The
.lower()
part is nice as well - I don't really believe there will be file associated with.iSf
that won't fit to this parser... \$\endgroup\$Dux– Dux2015年05月16日 11:01:20 +00:00Commented May 16, 2015 at 11:01 -
\$\begingroup\$ Anytime. You have my vote too ;-) \$\endgroup\$janos– janos2015年05月16日 18:21:37 +00:00Commented May 16, 2015 at 18:21
_read_chunk()
does not need to return the filepointer position as you never use it. After leaving the main loop you (have to) read the filepointer explicitely. Thus, current
is oblivious.
Speaking of naming, 'current' can be a lot of things...'currentposition' or the like would be self-documenting. Likewise, a 'bfile' and a 'binaryfile' might be an 'isf_file'; it isn't strictly binary after all.
The guidelines for python style (PEP8) point out that blank lines be used sparsely. I would omit them especially at the beginning of a code block like a loop as to visually underline the block.
_read_chunk()
makes me feel uncomfortable - you know it's text data but resort to byte-wise reading and searching (for the delims). Built-in string search is much cleaner and easier to follow. You could read in the first 1 MB or so and parse all string data up to the '#' (if this is unique). Its string position gives you the starting point for reading the binary data then. Anyway, very concise and clean programming style, a pleasure to read!
Edit: According to this Tektronix doc you need to parse the next few chars after the ':CURVE' token like this (same what vnp hinted at):
else:
# ":CURVE" is the last tag of the header, followed by
# a '#', a single ASCII length char and a variable-length ASCII digit number.
skip = int(bfile.read(1))
assert 1 <= skip <= 9
numrecs = int(bfile.read(skip))
assert numrecs == header['NR_PT'] * header['BYT_NR']
current = bfile.tell()
break
-
\$\begingroup\$ The problem is, I am not sure about the exact specifications of the data format, since it does not seem to be documented (at least not for customers). I therefore do not know whether '#' is unique, which is why I tested for
:CURVE
, which seems to be the last tag. Is reading a precompiled amount of data and parsing it using string searching really a better solution? I do not safely know, how much data I need to read in to be sure i got the whole header... Why is reading text byte by byte (equals character by character, right?) a bad solution? \$\endgroup\$Dux– Dux2015年05月17日 10:31:52 +00:00Commented May 17, 2015 at 10:31 -
\$\begingroup\$ I compare it to human reading - you wouldn't read a book letter by letter either. Based on the assumption that the header data is small (a few KB at most) it won't make much difference if you keep it the way you coded it, performance-wise. You are relying on assumptions a lot - for instance, that all header tags you are using later on are indeed found in the header. \$\endgroup\$user1016274– user10162742015年05月17日 13:27:51 +00:00Commented May 17, 2015 at 13:27
-
\$\begingroup\$ That is true, I depend on the relevant tags to be present. But then, if they are not present, I do not know how to interpret the data anyways, so the file is broken then. I may of course catch that case and print a meaningful message... \$\endgroup\$Dux– Dux2015年05月17日 13:55:37 +00:00Commented May 17, 2015 at 13:55
Explore related questions
See similar questions with these tags.