Pythonic way of managing data for plotting 2D variance plots

Question 1

I have collected large dataset of trajectories in 3D space. Below is the information about dataset-

Each trajectory (shape: n x 4) is saved into CSV file with the following header: time, p_x, p_y, p_z
The filenames are defined in following way: subject_i_trail_i.csv, where i starts from 1 to 10.

Below are first 10 lines from subject_1_trail_1.csv file-

time p_x p_y p_z
0 0.4333 0.1107 0.1259
0.0103 0.4336 0.1106 0.126
0.02 0.4334 0.1108 0.1259
0.03 0.4333 0.1106 0.1259
0.04 0.4334 0.1107 0.1259
0.0501 0.4328 0.1103 0.126
0.06 0.4331 0.1107 0.1255
0.0703 0.4331 0.1103 0.126
0.08 0.4324 0.1102 0.126

For each subject, I want to plot the trajectory showing median and variance as shown below-

fill_between(x, perc_25, perc_75)

I am using NumPy but the code is relatively long. Below is the code snippet-

from pylab import *
import sys
# make a 3d numpy array (chop off the extra rows)
def load(location, subject, suffix):
 all_data = []
 min_rows = sys.maxint
 for i in range(1, 11):
 file_name = location + subject + '_trail_' + str(i) + '.csv'
 data = np.loadtxt(file_name, delimiter=',', skiprows=1)
 if data.shape[0] < min_rows:
 min_rows = data.shape[0]
 all_data.append(data)
 all_crop_data = []
 for data in all_data:
 all_crop_data.append(data[:min_rows,:])
 return np.array(all_crop_data)
def proc(data):
 t = data[:, :, 0] # first column is time
 x = data[:, :, 1] # second column is x
 y = data[:, :, 2] # third column is y
 z = data[:, :, 3] # fourth column is z
 t_median = np.zeros(data.shape[1])
 t_perc_25 = np.zeros(data.shape[1])
 t_perc_75 = np.zeros(data.shape[1])
 x_median = np.zeros(data.shape[1])
 x_perc_25 = np.zeros(data.shape[1])
 x_perc_75 = np.zeros(data.shape[1])
 y_median = np.zeros(data.shape[1])
 y_perc_25 = np.zeros(data.shape[1])
 y_perc_75 = np.zeros(data.shape[1])
 z_median = np.zeros(data.shape[1])
 z_perc_25 = np.zeros(data.shape[1])
 z_perc_75 = np.zeros(data.shape[1])
 for i in range(data.shape[1]): # for each timestamp
 t_median[i] = np.median(t[:, i])
 t_perc_25[i] = np.percentile(t[:, i], 25)
 t_perc_75[i] = np.percentile(t[:, i], 75)
 x_median[i] = np.median(x[:, i])
 x_perc_25[i] = np.percentile(x[:, i], 25)
 x_perc_75[i] = np.percentile(x[:, i], 75)
 y_median[i] = np.median(y[:, i])
 y_perc_25[i] = np.percentile(y[:, i], 25)
 y_perc_75[i] = np.percentile(y[:, i], 75)
 z_median[i] = np.median(z[:, i])
 z_perc_25[i] = np.percentile(z[:, i], 25)
 z_perc_75[i] = np.percentile(z[:, i], 75)
 all_median = np.vstack((t_median, x_median, y_median, z_median)).T
 all_perc_25 = np.vstack((t_perc_25, x_perc_25, y_perc_25, z_perc_25)).T
 all_perc_75 = np.vstack((t_perc_75, x_perc_75, y_perc_75, z_perc_75)).T
 return all_median, all_perc_25, all_perc_75
s1 = load('/Desktop/', 'subject_1') # subject 1
s2 = load('/Desktop/', 'subject_2') # subject 2
x = np.arange(0, s1.shape[1])
s1_med, s1_perc_25, s1_perc_75 = proc(s1)
s2_med, s2_perc_25, s2_perc_75 = proc(s2)
# lets plot only x (second column)
index = 1
s1_med = s1_med[:, index]
s1_perc_25 = s1_perc_25[:, index]
s1_perc_75 = s1_perc_75[:, index]
s2_med = s2_med[:, index]
s2_perc_25 = s2_perc_25[:, index]
s2_perc_75 = s2_perc_75[:, index]
fill_between(x, s1_perc_25, s1_perc_75, alpha=0.25, linewidth=0, color='#B22400')
fill_between(x, s2_perc_25, s2_perc_75, alpha=0.25, linewidth=0, color='#006BB2')
plot(x, s1_med, linewidth=2, color='#B22400')
plot(x, s2_med, linewidth=2, color='#006BB2')

I am looking for a better, i.e., pythonic way to achieve the same. I am not sure if Pandas can be useful here. Looking for your suggestions, please.

Question 2

You should add your imports to make this self-contained code. I would also help if you could post how to generate some example data.

Question 3

Your function proc can be greatly reduced by using the fact that numpy functions are vectorized and most of them take the axis to act upon as an argument. This is certainly true for numpy.median and numpy.percentile.

import numpy as np
def proc(data):
 return (np.median(data, axis=0),
 np.percentile(data, 25, axis=0),
 np.percentile(data, 75, axis=0))

You can also use a for loop for plotting, to avoid duplicating everything:

if __name__ == "__main__":
 colors = '#B22400', '#006BB2'
 file_names = 'subject_1', 'subject_2'
 for file_name, color in zip(file_names, colors):
 subject = load('/Desktop/', file_name)
 med, perc_25, perc_75 = proc(subject)
 # lets plot only x (second column)
 med = med[:, 1]
 perc_25 = perc_25[:, 1]
 perc_75 = perc_75[:, 1]
 fill_between(x, perc_25, perc_75, alpha=0.25, linewidth=0, color=color)
 plot(x, med, linewidth=2, color=color)

Note that x is currently undefined (as in your code).

I also added a if __name__ == "__main__": guard to allow importing from this script without executing the plotting.

In addition I changed your indentation to conform to Python's official style-guide, PEP8:

This

med = med[:, 1]
perc_25 = perc_25[:, 1]
perc_75 = perc_75[:, 1]

should just be

med = med[:, 1]
perc_25 = perc_25[:, 1]
perc_75 = perc_75[:, 1]

as noted under Pet Peeves (of Guido van Rossum, the creator and BDFL of Python).

Question 4

I like PEP8, but sometimes I align a series of assignments like Ravi did. I feel excused by the seventh line of The Zen (PEP20): Readability counts.

Question 5

@JanKuiken I agree that sometimes this makes it more readable. I mainly commented on it 1. Out of habit but 2. Because when collaborating on code, one should have agreed upon code styles (ideally enforced/applied by automatic tools). And for Python the agreed upon style is per default PEP8. If you wish reviewers not to comment on one particular violation of some common style guide, you can also mention it as a note in your question. Reviewers are still free to comment on it (if they think it is a particularly bad idea), but they usually won't in that case.

Question 6

Thanks, Graipher. The axis argument is really helpful!

Graipher Graipher 41.6k7 gold badges70 silver badges134 bronze badges · Accepted Answer · 2018-04-23 16:15:56Z

Your function proc can be greatly reduced by using the fact that numpy functions are vectorized and most of them take the axis to act upon as an argument. This is certainly true for numpy.median and numpy.percentile.

import numpy as np
def proc(data):
 return (np.median(data, axis=0),
 np.percentile(data, 25, axis=0),
 np.percentile(data, 75, axis=0))

You can also use a for loop for plotting, to avoid duplicating everything:

if __name__ == "__main__":
 colors = '#B22400', '#006BB2'
 file_names = 'subject_1', 'subject_2'
 for file_name, color in zip(file_names, colors):
 subject = load('/Desktop/', file_name)
 med, perc_25, perc_75 = proc(subject)
 # lets plot only x (second column)
 med = med[:, 1]
 perc_25 = perc_25[:, 1]
 perc_75 = perc_75[:, 1]
 fill_between(x, perc_25, perc_75, alpha=0.25, linewidth=0, color=color)
 plot(x, med, linewidth=2, color=color)

Note that x is currently undefined (as in your code).

I also added a if __name__ == "__main__": guard to allow importing from this script without executing the plotting.

In addition I changed your indentation to conform to Python's official style-guide, PEP8:

This

med = med[:, 1]
perc_25 = perc_25[:, 1]
perc_75 = perc_75[:, 1]

should just be

med = med[:, 1]
perc_25 = perc_25[:, 1]
perc_75 = perc_75[:, 1]

as noted under Pet Peeves (of Guido van Rossum, the creator and BDFL of Python).

I like PEP8, but sometimes I align a series of assignments like Ravi did. I feel excused by the seventh line of The Zen (PEP20): Readability counts.
@JanKuiken I agree that sometimes this makes it more readable. I mainly commented on it 1. Out of habit but 2. Because when collaborating on code, one should have agreed upon code styles (ideally enforced/applied by automatic tools). And for Python the agreed upon style is per default PEP8. If you wish reviewers not to comment on one particular violation of some common style guide, you can also mention it as a note in your question. Reviewers are still free to comment on it (if they think it is a particularly bad idea), but they usually won't in that case.

Stack Exchange Network

Pythonic way of managing data for plotting 2D variance plots

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Pythonic way of managing data for plotting 2D variance plots

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions