A Python script to plot data and save to PDF

Question 1

I have a very simple Python script. All it does is open two data files from a given directory, read the data, make a series of plots and save as PDF. It works, but it is very slow. It takes almost 20 seconds for data files that have 50-100 lines and <30 variables.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
with open('file1.out') as f:
    var1 = f.readline().split()
with open('file2.out') as f:
    var2 = f.readline().split()
df1 = np.loadtxt('file1.out', skiprows=1, unpack=True)
df2 = np.loadtxt('file2.out', skiprows=1, unpack=True)
nc1 = df1.shape[0]
nc2 = df2.shape[0]

with PdfPages('file_output.pdf') as pdf:
    ## file1.out
    fig = plt.figure(figsize=(11,7))
    j = 1
    for i in range(1,nc1):
        ax = fig.add_subplot(3,2,j)
        ax.plot(df1[0], df1[i], linestyle='-', color='black')
        ax.set(title=var1[i], xlabel='seconds', ylabel='')
        if j == 6:
            pdf.savefig(fig)
            fig = plt.figure(figsize=(11,7))
            j = 1
        else:
            j = j + 1
    pdf.savefig(fig)
    ## file2.out
    fig = plt.figure(figsize=(11,7))
    j = 1
    for i in range(1,nc2):
        ... # and it continues like the block of code above

My questions are:

Do I need all those imports and are they slowing down the execution?
Is there a better way to read the data files than opening them twice (once to get the file header and once to get data)?
Am I using the matplotlib commands correctly/efficiently (I am not very familiar with matplotlib, and this is basically my first attempt to use it)?

Please keep in mind that ideally this script should have as few dependencies as possible, because it is meant to be used on different systems by different users.

The data files have the following format:

 t X1 X2 X3 X4 X5 X6 X7 X8 X11 X12 X13 X14 X15 X16
 6.000000E+001 4.309764E-007 2.059219E-004 9.055840E-007 2.257223E-003 1.148868E-002 7.605114E-002 4.517820E-004 3.228596E-008 2.678874E-006 7.095441E-006 1.581115E-007 1.010346E-006 1.617892E-006 9.706194E-007 
 1.200000E+002 4.309764E-007 2.059219E-004 9.055840E-007 2.257223E-003 1.148868E-002 7.605114E-002 4.517820E-004 3.228596E-008 2.678874E-006 7.095441E-006 1.581115E-007 1.010346E-006 1.617892E-006 9.706194E-007 
 1.800000E+002 3.936234E-007 2.027775E-004 8.644279E-007 2.180931E-003 1.131226E-002 7.476778E-002 4.353550E-004 3.037527E-008 2.534515E-006 6.778434E-006 1.470889E-007 9.488175E-007 1.531702E-006 9.189112E-007

Question 2

Can you provide a small example input file for which the code works exactly as intended, to demonstrate its capabilities? This tends to make writing reviews easier, leading to higher quality reviews.

Question 3

Is it correct to have the two with statements, which define the same var1, together at the top? Shouldn't it (maybe?) be var2 in the second with statement and in the omitted second code block? (Obviously it would be nicer to have that code block only once!)

Question 4

@Max yes you are right, I have corrected the original question, thanks.

Question 5

Please do not edit the question, especially the code, after an answer has been posted. Changing the question may invalidate answers; everyone needs to be able to see what the reviewer was referring to. See "What to do after the question has been answered".

Question 6

@Max The edit was rolled back around the same time when I posted my comment.

Question 7

coding style

Your code is almost PEP 8 compliant. There are a few spaces missing after commas, but all in all this is not too bad. I use black myself to take care of this formatting.

Some of the variable names could be clearer. What does nc1 mean, for example?

magic numbers

The numbers 3, 2 and 6 are the number of rows, the number of columns and the number of plots per page of the grid. It would be better to make them real variables and replace 6 with rows * columns. If you ever decide you want 4 columns, you won't have to chase down all those magic numbers.

looping

You are looping over the indexes of var1 and df1. It would be better to use zip to iterate over both tables together. If you want to group them per 6, you can use the grouper itertools recipe, with enumerate to get the index of the different subplots.

rows, columns = 3, 2
for group in grouper(zip(var1[1:], df1[1:]), rows * columns):
    fig = plt.figure(figsize=(11, 7))
    for i, (label, row) in enumerate(filter(None, group)):
        ax = fig.add_subplot(rows, columns, i + 1)
        ax.plot(df1[0], row, linestyle="-", color="black")
        ax.set(title=label, xlabel="seconds", ylabel="")

The filter(None, ...) call drops the padding items that grouper adds through its fillvalue.

This is a lot clearer than the juggling with nc1 and j.
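
The grouper helper used above is not a built-in; it is one of the recipes listed in the itertools documentation. A minimal sketch of it follows, assuming made-up labels and values as stand-ins for var1[1:] and df1[1:] (they are not taken from the question's files):

```python
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    """Collect items into fixed-length chunks, padding the last one with fillvalue."""
    args = [iter(iterable)] * n  # n references to the same iterator
    return zip_longest(*args, fillvalue=fillvalue)


# Hypothetical stand-ins for var1[1:] and df1[1:]: eight labelled columns.
labels = ["X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8"]
columns = [[float(k)] * 3 for k in range(8)]

# filter(None, ...) removes the None padding from the last, incomplete group.
pages = [list(filter(None, group)) for group in grouper(zip(labels, columns), 6)]
page_sizes = [len(page) for page in pages]
print(page_sizes)  # six plots on the first page, two on the second
```

Each entry of pages then corresponds to one figure, and the (label, column) pairs inside it to the subplots on that figure.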

functions

This would be a lot easier to test and handle if you separated the different parts of the script into functions:

reading the file
making 1 page plot
appending the different pages

This will also allow each of those parts to be tested separately

reading the file

Instead of loading the file twice with numpy, use pandas, which supports data with names and indices. This simplifies this part a lot:

df = pd.read_csv(<filename>, sep=r"\s+", index_col=0)

This is a labelled DataFrame, so there is no more need to use var1 for the column names.
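
If pandas is not an option (the question asks for as few dependencies as possible), the header and the data can probably still be read in a single pass with numpy alone, because np.loadtxt accepts an already open file object and continues from the current position. The sketch below assumes a small in-memory sample in the same layout as the question's files; read_table is a hypothetical helper name, not something from the original script:

```python
import io

import numpy as np

# Hypothetical file contents in the same layout as the question's data files.
sample = io.StringIO(
    " t X1 X2\n"
    " 6.000000E+001 4.309764E-007 2.059219E-004\n"
    " 1.200000E+002 4.309764E-007 2.059219E-004\n"
    " 1.800000E+002 3.936234E-007 2.027775E-004\n"
)


def read_table(handle):
    """Read the header line, then let numpy parse the remaining rows."""
    names = handle.readline().split()
    data = np.loadtxt(handle, unpack=True)  # one array per column
    return names, data


names, data = read_table(sample)
print(names, data.shape)
```

With a real file this would be used as `with open('file1.out') as f: names, data = read_table(f)`, so each file is opened only once.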

making the individual plot:

group the columns per 6

def column_grouper(df, n):
    for i in range(0, df.shape[1], n):
        yield df.iloc[:, i:i+n]

This simple helper generator can group the data per 6 columns.
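
To make it clearer what column_grouper yields, here is a small sketch on a made-up DataFrame (three time steps, eight data columns; the values are hypothetical). Each chunk is itself a DataFrame holding at most n columns and sharing the original index:

```python
import pandas as pd


def column_grouper(df, n):
    """Yield consecutive slices of df, each holding at most n columns."""
    for i in range(0, df.shape[1], n):
        yield df.iloc[:, i:i + n]


# Hypothetical frame: 3 time steps as the index, 8 data columns X1..X8.
df = pd.DataFrame(
    {f"X{k}": [k, k + 0.1, k + 0.2] for k in range(1, 9)},
    index=pd.Index([60.0, 120.0, 180.0], name="t"),
)

chunks = [list(chunk.columns) for chunk in column_grouper(df, 6)]
print(chunks)  # first chunk has six column names, second has the remaining two
```

Because the slicing stops at the last column, no padding values are produced and no filter step is needed.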

make the plot

def generate_plots(df, rows=3, columns=2):
    for group in column_grouper(df, rows * columns):
        fig = plt.figure(figsize=(11, 7))
        for i, (label, column) in enumerate(group.items()):
            ax = fig.add_subplot(rows, columns, i + 1)
            ax.plot(column, linestyle='-', color='black')
            ax.set(title=label, xlabel='seconds', ylabel='')
        yield fig

saving the pdf

Here a simple function that accepts an iterable of figures and a filename will do the trick:

def save_plots(figures, output_file):
    with PdfPages(output_file) as pdf:
        for fig in figures:
            pdf.savefig(fig)

pulling it together

def parse_file(input_file, output_file, rows=3, columns=2):
    df = pd.read_csv(input_file, sep=r"\s+", index_col=0)
    figures = generate_plots(df, rows, columns)
    save_plots(figures, output_file)

and then calling this behind a main guard

if __name__ == "__main__":
    input_files = ['file1.out', 'file2.out']
    output_file = 'file_output.pdf'
    for input_file in input_files:
        parse_file(input_file, output_file)

If this is still too slow, at least the different parts of the program are now split, and you can start looking at which part is slowing everything down.
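
One way to find the slow part is to time each step separately. The following is only a sketch using the standard library; the timed helper is a hypothetical name, and the two timed blocks are trivial stand-ins for the real reading and plotting steps:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record how long the enclosed block takes, in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("read", timings):
    rows = [line.split() for line in ["1 2 3", "4 5 6"]]  # stand-in for reading a file
with timed("plot", timings):
    total = sum(float(x) for row in rows for x in row)  # stand-in for plotting

for label, seconds in timings.items():
    print(f"{label}: {seconds:.6f} s")
```

In the real script the blocks would wrap the calls to pd.read_csv, generate_plots and pdf.savefig. Running the whole script under the standard profiler, for example with `python -m cProfile -s cumtime script.py`, gives a per-function breakdown without changing the code.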

Question 8

Thanks, looks great. I am getting an error 'DataFrame' object has no attribute 'items' which I think is related to the output of column_grouper(), but I am not sure I understand exactly what that function does.

Question 9

DataFrame.items. This should work. It worked for me at least. I didn't test the pdf creation, but the plot generation worked.

Question 10

Ah, so it should be for i, (label, column) in enumerate(group.iteritems()): in the generate_plots() function.

Question 11

I think both items and iteritems should work

Question 12

Sure, go ahead.

Question 13

Maarten's answer contains a lot of useful advice, but I think his code won't work as expected: the parse_file() function, which is called for each of the input files, re-opens the output_file, while help(PdfPages.__init__) says (cf. "Parameters" in the doc):

The file is opened at once and any older file with the same name is overwritten.

(Anyway, I don't think one can just append something to a PDF file, so unless PdfPages() would read in the whole PDF and re-write it entirely after appending content, it is not possible to reopen an existing file to simply append stuff.)

So I think we must open the output file only once and then loop over the input files, thus replacing the main loop and the two functions parse_file and save_plots by one single function, e.g., as follows:

def plotToPDF(output_file, *input_files, rows=3, columns=2):
    import pandas as pd
    from matplotlib.backends.backend_pdf import PdfPages
    # Open the PDF once so pages from every input file end up in the same document.
    with PdfPages(output_file) as pdf:
        for input_file in input_files:
            df = pd.read_csv(input_file, sep=r"\s+", index_col=0)
            for fig in generate_plots(df, rows, columns):
                pdf.savefig(fig)

which you would simply call as plotToPDF('file_output.pdf', 'file1.out', 'file2.out').

This uses Maarten's function generate_plots() (requiring import matplotlib.pyplot as plt and his other function column_grouper()) which can remain exactly as given in his answer.
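
The effect of opening the PDF only once can be checked with a small self-contained sketch. It assumes made-up data in place of the two input files and writes to a temporary directory; the plt.close(fig) call is an addition not found in either answer, included because figures created through pyplot otherwise stay in memory until they are closed:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

# Hypothetical data standing in for the contents of two input files.
datasets = {"file1": [1, 2, 3], "file2": [3, 2, 1]}

output_file = os.path.join(tempfile.mkdtemp(), "demo_output.pdf")
with PdfPages(output_file) as pdf:  # opened once for all inputs
    for name, values in datasets.items():
        fig, ax = plt.subplots(figsize=(11, 7))
        ax.plot(values, linestyle="-", color="black")
        ax.set(title=name, xlabel="seconds", ylabel="")
        pdf.savefig(fig)
        plt.close(fig)  # release the figure once its page is written
    page_count = pdf.get_pagecount()

print(page_count)  # one page per dataset
```

If the with PdfPages(...) line were moved inside the loop instead, each iteration would overwrite the file and only the last dataset's page would remain.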

Accepted answer (the review beginning with "coding style") by Maarten Fabré, 2019-04-19 12:49:00Z

