I have a very simple Python script. All it does is open two data files from a given directory, read the data, make a series of plots and save as PDF. It works, but it is very slow. It takes almost 20 seconds for data files that have 50-100 lines and <30 variables.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
with open('file1.out') as f:
    var1 = f.readline().split()
with open('file2.out') as f:
    var2 = f.readline().split()
df1 = np.loadtxt('file1.out', skiprows=1, unpack=True)
df2 = np.loadtxt('file2.out', skiprows=1, unpack=True)
nc1 = df1.shape[0]
nc2 = df2.shape[0]
with PdfPages('file_output.pdf') as pdf:
    ## file1.out
    fig = plt.figure(figsize=(11,7))
    j = 1
    for i in range(1,nc1):
        ax = fig.add_subplot(3,2,j)
        ax.plot(df1[0], df1[i], linestyle='-', color='black')
        ax.set(title=var1[i], xlabel='seconds', ylabel='')
        if j == 6:
            pdf.savefig(fig)
            fig = plt.figure(figsize=(11,7))
            j = 1
        else:
            j = j + 1
    pdf.savefig(fig)
    ## file2.out
    fig = plt.figure(figsize=(11,7))
    j = 1
    for i in range(1,nc2):
        ... # and it continues like the block of code above
My questions are:
- Do I need all those imports and are they slowing down the execution?
- Is there a better way to read the data files than opening them twice (once to get the file header and once to get the data)?
- Am I using the matplotlib commands correctly/efficiently (I am not very familiar with matplotlib, and this is basically my first attempt to use it)?
Please keep in mind that ideally this script should have as few dependencies as possible, because it is meant to be used on different systems by different users.
The data files have the following format:
t X1 X2 X3 X4 X5 X6 X7 X8 X11 X12 X13 X14 X15 X16
6.000000E+001 4.309764E-007 2.059219E-004 9.055840E-007 2.257223E-003 1.148868E-002 7.605114E-002 4.517820E-004 3.228596E-008 2.678874E-006 7.095441E-006 1.581115E-007 1.010346E-006 1.617892E-006 9.706194E-007
1.200000E+002 4.309764E-007 2.059219E-004 9.055840E-007 2.257223E-003 1.148868E-002 7.605114E-002 4.517820E-004 3.228596E-008 2.678874E-006 7.095441E-006 1.581115E-007 1.010346E-006 1.617892E-006 9.706194E-007
1.800000E+002 3.936234E-007 2.027775E-004 8.644279E-007 2.180931E-003 1.131226E-002 7.476778E-002 4.353550E-004 3.037527E-008 2.534515E-006 6.778434E-006 1.470889E-007 9.488175E-007 1.531702E-006 9.189112E-007
2 Answers
coding style
Your code is almost PEP 8 compliant. There are a few spaces missing after commas, but all in all this is not too bad. I myself use black to take care of this formatting for me.
Some of the variable names could be clearer. What does nc1 mean, for example?
magic numbers
The numbers 3, 2 and 6 are the number of rows, the number of columns, and the resulting number of plots per page. It would be better to make rows and columns real variables, and replace 6 with rows * columns. If you ever decide you want 4 columns, you don't have to chase down all those magic numbers.
looping
You are looping over the indexes of var and df. Better here would be to use zip() to iterate over both tables together. If you want to group them per 6, you can use the grouper itertools recipe, and enumerate() to get the index of the different subplots.
rows, columns = 3, 2
for group in grouper(zip(var1[1:], df1[1:]), rows * columns):
    fig = plt.figure(figsize=(11, 7))
    for i, (label, row) in enumerate(filter(None, group)):
        ax = fig.add_subplot(rows, columns, i + 1)
        ax.plot(df1[0], row, linestyle="-", color="black")
        ax.set(title=label, xlabel="seconds", ylabel="")
The filter(None, ...) is there to eliminate the items that get the fillvalue in the grouper. This is a lot clearer than the juggling with nc1 and j.
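Note that grouper is not a built-in: it is one of the recipes listed in the itertools documentation and has to be defined in the script itself. A minimal sketch of it follows, using made-up labels and rows as stand-ins for the (label, row) pairs above, to show how filter(None, ...) drops the padding from the last, incomplete group:

```python
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    """Collect items into fixed-length chunks, padding the last one.

    Adapted from the recipes section of the itertools documentation.
    """
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


# Made-up stand-ins for the (label, row) pairs from zip(var1[1:], df1[1:]).
pairs = [("X%d" % k, [k, k + 1]) for k in range(1, 9)]  # 8 pairs

pages = []
for group in grouper(pairs, 6):
    # The last group is padded with None; filter(None, ...) removes the padding.
    page = [label for label, row in filter(None, group)]
    pages.append(page)

print(pages)
```

With 8 pairs and groups of 6, this gives one full page of six labels and a second page with the remaining two.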
functions
This would be a lot easier to test and handle if you separated the different parts of the script into functions:
- reading the file
- making 1 page plot
- appending the different pages
This will also allow each of those parts to be tested separately
reading the file
Instead of loading the file twice and using numpy, using pandas, which supports data with names and indices, will simplify this part a lot.
import pandas as pd
df = pd.read_csv(<filename>, sep=r"\s+", index_col=0)
This is a labelled DataFrame, so there is no more need to use var1 for the column names.
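If pandas is an unwanted extra dependency (the question asks for as few dependencies as possible), the header and the data can also be read in a single pass with only the standard library. The following is a rough sketch rather than a drop-in replacement; read_table is a hypothetical helper name, and the sample text is just the first lines of the format shown in the question, cut down to three columns:

```python
import io


def read_table(lines):
    """Parse a whitespace-separated table with one header line.

    `lines` is any iterable of text lines, e.g. an open file.
    Returns (names, columns), where columns[i] is the list of floats
    belonging to names[i].
    """
    lines = iter(lines)
    names = next(lines).split()
    rows = [[float(value) for value in line.split()]
            for line in lines if line.strip()]
    # Transpose rows into columns, similar to unpack=True in np.loadtxt.
    columns = [list(column) for column in zip(*rows)]
    return names, columns


# Small in-memory sample in the format shown in the question.
sample = io.StringIO(
    "t X1 X2\n"
    "6.000000E+001 4.309764E-007 2.059219E-004\n"
    "1.200000E+002 4.309764E-007 2.059219E-004\n"
)
names, columns = read_table(sample)
print(names)       # ['t', 'X1', 'X2']
print(columns[0])  # [60.0, 120.0]
```

With a real file, the same function would be called on the open file object, so each file is opened and read only once.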
making the individual plot:
group the columns per 6
def column_grouper(df, n):
    for i in range(0, df.shape[1], n):
        yield df.iloc[:, i:i+n]
This simple helper generator yields the data in groups of n columns (6 in this case).
make the plot
def generate_plots(df, rows=3, columns=2):
    for group in column_grouper(df, rows * columns):
        fig = plt.figure(figsize=(11, 7))
        for i, (label, column) in enumerate(group.items()):
            ax = fig.add_subplot(rows, columns, i + 1)
            ax.plot(column, linestyle='-', color='black')
            ax.set(title=label, xlabel='seconds', ylabel='')
        yield fig
saving the pdf
Here, a simple function that accepts an iterable of figures and a filename will do the trick.
def save_plots(figures, output_file):
    with PdfPages(output_file) as pdf:
        for fig in figures:
            pdf.savefig(fig)
pulling it together
def parse_file(input_file, output_file, rows=3, columns=2):
    df = pd.read_csv(input_file, sep=r"\s+", index_col=0)
    figures = generate_plots(df, rows, columns)
    save_plots(figures, output_file)
and then calling this behind a main guard:
if __name__ == "__main__":
    input_files = ['file1.out', 'file2.out']
    output_file = 'file_output.pdf'
    for input_file in input_files:
        parse_file(input_file, output_file)
If this is still too slow, at least the different parts of the program are now split, and you can start looking at which part is slowing everything down.
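One low-tech way to find the slow part, using only the standard library, is to time each step separately. The sketch below is only an illustration: timed and timings are hypothetical names, and the summation is a stand-in for a real step such as reading a file or building and saving the figures.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(label):
    """Record the wall-clock time spent inside the with block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


# Stand-in workload; in the script this would be e.g. the file reading
# or the loop that builds and saves the figures.
with timed("example step"):
    total = sum(k * k for k in range(100_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```

For the question about imports, running the script with python -X importtime (Python 3.7 or later) prints how long each import takes, and python -m cProfile -s cumtime followed by the script name gives a per-function breakdown without changing the code.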
- Thanks, looks great. I am getting an error 'DataFrame' object has no attribute 'items', which I think is related to the output of column_grouper(), but I am not sure I understand exactly what that function does. – point618, Apr 24, 2019 at 12:19
- DataFrame.items. This should work. It worked for me at least. I didn't test the PDF creation, but the plot generation worked. – Maarten Fabré, Apr 24, 2019 at 12:36
- Ah, so it should be for i, (label, column) in enumerate(group.iteritems()): in the generate_plots() function. – point618, Apr 24, 2019 at 13:21
- I think both items and iteritems should work. – Maarten Fabré, Apr 25, 2019 at 7:09
- Sure, go ahead. – Maarten Fabré, Apr 26, 2019 at 9:45
Maarten's answer contains a lot of useful advice, but I think his code won't work as expected, because the parse_file() function, which is called for each of the input files, re-opens the output_file each time, and help(PdfPages.__init__) says (cf. "Parameters" in the doc):
The file is opened at once and any older file with the same name is overwritten.
(Anyway, I don't think one can just append something to a PDF file, so unless PdfPages() would read in the whole PDF and re-write it entirely after appending content, it is not possible to reopen an existing file to simply append stuff.)
So I think we must open the output file only once and then loop over the input files, thus replacing the main loop and the two functions parse_file() and save_plots() with one single function, e.g., as follows:
def plotToPDF(output_file, *input_files, rows=3, columns=2):
    import pandas as pd
    from matplotlib.backends.backend_pdf import PdfPages
    # Open the output once so the pages from every input file land in the same PDF.
    with PdfPages(output_file) as pdf:
        for input_file in input_files:
            df = pd.read_csv(input_file, sep=r"\s+", index_col=0)
            for fig in generate_plots(df, rows, columns):
                pdf.savefig(fig)
which you would simply call as plotToPDF('file_output.pdf', 'file1.out', 'file2.out').
This uses Maarten's function generate_plots() (requiring import matplotlib.pyplot as plt and his other function column_grouper()), which can remain exactly as given in his answer.
with statements which define the same var1, together at the top? Shouldn't it (maybe?) be var2 in the second with statement and in the omitted second code block? (Obviously it would be nicer to have that code block only once!)