Pythonic way to manipulate arrays: a case study

Question 1

I just spent almost one day to figure out the following code, which analyzes the output data from material science programs called VASP and Bader. As a Python beginner, I earnestly ask some aces to help me make this code more pythonic, or efficient. Because I know that although the code can run smoothly, it definitely needs more modification from professionals. The code is listed below, with the data file ACF.dat after it.

#!/usr/bin/python
import numpy as np
def charge(x): # function to calculate charge difference 
 if x<4:
 return 4-x # atom1
 else:
 return 6-x # atom2
# read from file ACF.dat
f = open('ACF.dat', 'r')
lines = []
for row in f.readlines():
 tmp = row.split()
 lines.append(tmp)
lines = lines[2:-4] # remove unecessary rows
lines = np.array(lines, dtype=float)
lines = lines[:,0:5] # remove unecessary columns
# calculate residual charge
chg = np.vectorize(charge)
lines[:,4] = chg(lines[:,4])
# calculate interatomic distances
dist = []
for i in range(0,len(lines)):
 d = np.sqrt(np.sum((lines[i,1:4]-lines[0,1:4])**2))
 dist.append(d)
lines = np.column_stack((lines, dist)) # attach dist to lines
lines = lines[lines[:,5].argsort()]
for row in lines:
 print '{0:<3.0f} {1:12.6f} {2:12.6f}'.format( row[0], row[5], row[4])
f.close()

ACF.dat

The structure of the file is seen as the heading. What I need are the first five columns: the atomic index, the x, y, z coordinates of each atom, the atomic charge.

# X Y Z CHARGE MIN DIST ATOMIC VOL
--------------------------------------------------------------------------------
1 7.5119 7.5119 7.5119 2.9875 1.1144 20.7692
2 18.0286 18.0286 18.0286 2.8514 1.3688 23.0095
3 6.0058 18.0205 18.0205 2.8500 1.3599 22.9265
4 12.0323 18.0342 18.0342 2.8480 1.3638 22.9816
5 18.0205 6.0058 18.0205 2.8500 1.3599 22.9265
6 5.9968 5.9968 17.9789 2.7979 1.2317 22.3582
--------------------------------------------------------------------------------
VACUUM CHARGE: 0.0000
VACUUM VOLUME: 0.0000
NUMBER OF ELECTRONS: 1085.9999

What the code does is:

Read in the content of ACF.dat, and removes the first two rows and last four rows. Then saves only the first five columns into array lines.
Calculates the residual charge using function charge(x).
Calculates the interatomic distances using the x, y, z coordinates of the atoms, and attach the array of distances to lines as its last column.
Sorts lines according to the interatomic distances (column #6 of lines).
Prints out the atomic index, interatomic distances, and atomic charges.

I understand that this is a lengthy question, but I still would like to ask someone helping me, and other newbies, learn Python faster by optimizing our own code. Any suggestion is welcomed.

Question 2

Coding style

For starters, read PEP 8, which is the Python style guide. This gives suggestions on whitespace around operators and commas, variable names, and the like. Your code isn’t bad if it doesn’t follow these conventions, but it will slow down more experienced Python programmers reading your code, and might make it harder for you later when you read other people’s Python.

You should also read PEP 257, which explain the Python conventions for docstrings. Rather than including a description for the function on the same line as the def, we put it on a special line inside the function definition.

With those two in mind, I might rewrite your charge function as:

def residual_charge(x):
 """Returns the residual charge of an atom."""
 if x < 4:
 return 4 - x # atom 1
 else:
 return 6 - x # atom 2

The rest of the style changes are only minor, and you can find them yourself.

Opening/closing the file

As far as I can tell, the six lines after you’ve defined charge are the only lines where you’re directly interacting with the file ACF.dat, but you don’t issue a close() command until the end of the script. That can cause problems if the input file is large, or if you want to load multiple files.

I’d make three changes:

Use Python’s with open() ... construction – this keeps all the file handling code in one place, and automatically closes it for you as well.
Wrap this in a function: it keeps the code for processing the file input in one place, and means you can use it again if you want to feed it a different file.
Go through the file a line at a time (you might not need this, but it saves loading the whole file into memory if ACF.dat is large).

So I’d write something like:

def parse_dat_file(filename):
 """Returns a numpy array based on the contents of the input file."""
 # read the file contents
 with open(filename, 'r') as f:
 for line in f:
 parsed_lines.append(line.split())
 # remove unnecessary rows and columns
 parsed_lines = parsed_lines[2:-4]
 parsed_lines = np.array(parsed_lines, dtype=float)
 parsed_lines = parsed_lines[:,0:5]
 return parsed_lines

Doing the necessary calculations

Again, I’d break this into a single function. It separates that block of work from the rest of the file, and makes it easier to organise. This means that if you get the data from another source, you can use the same code untouched.

Some notes:

In Python, a range() starts at 0 by default, so you don’t need to add it. If no start is supplied, it just counts from 0 upwards. You can find out more by typing help(range) at a Python shell.
If the data set is large, you should use xrange() instead. This is faster when you’re using a large range of objects. See What is the difference between range and xrange? for more details.

This is what I get at the end:

def process_lines(lines):
 """Returns the lines with the residual charge and
 interatomic distances computed and added.
 """
 # Calculate residual charge
 v_charge = np.vectorize(residual_charge)
 lines[:, 4] = v_charge(lines[:, 4])
 # Calculate atomic distances
 distances = []
 for i in xrange(len(lines)):
 line_sum = np.sum((lines[i, 1:4] - lines[0, 1:4]) ** 2)
 min_dist = np.sqrt(line_sum)
 distances.append(min_dist)
 lines = np.column_stack((lines, dist))
 lines = lines[lines[:, 5].argsort()]
 return

Printing the file

Without wishing to sound like a broken record, put this in a function as well. This one is easy:

def print_data(lines):
 """Prints the parsed and formatted data."""
 for row in lines:
 print '{0:<3.0f} {1:12.6f} {2:12.6f}'.format( row[0], row[5], row[4])

Script vs. module

At the end of the script, we can add the lines

acf_lines = parse_dat_file('ACF.dat')
print_data(process_lines(acf_lines))

and it will do the printing we originally set out to do.

But now the code has been broken into separate functions, we can use it in other scripts. Say this was atomic.py; then we could write from atomic import residual_charge in another file, and we’d have access to the residual_charge function.

However, when we import this file in another script, we don’t want it to start reading and printing ACF.dat; we just want the definitions. If we put the particular code for ACF.dat in a special if statement, then it only gets run if the file is called directly:

if __name__ == '__main__':
 acf_lines = parse_dat_file('ACF.dat')
 print_data(process_lines(acf_lines))

Now, if we type python atomic.py at a command line, this code gets run and the data is printed. If we use import atomic elsewhere, then we just get the function definitions. You can find out more in What does if __name__ == "__main__": do?

Summary

If I combine everything I’ve done above and put it into a single file, then this is what I’m left with:

#!/usr/bin/python
import numpy as np
def residual_charge(x):
 """Returns the residual charge of an atom."""
 if x < 4:
 return 4 - x # atom 1
 else:
 return 6 - x # atom 2
def parse_dat_file(filename):
 """Returns a numpy array based on the contents of the input file."""
 # read the file contents
 parsed_lines = []
 with open(filename, 'r') as f:
 for line in f:
 parsed_lines.append(line.split())
 # Remove unnecessary rows and columns
 parsed_lines = parsed_lines[2:-4]
 parsed_lines = np.array(parsed_lines, dtype=float)
 parsed_lines = parsed_lines[:,0:5]
 return parsed_lines
def process_lines(lines):
 """Returns the lines with the residual charge and
 interatomic distances computed and added.
 """ 
 # Calculate residual charge
 v_charge = np.vectorize(residual_charge)
 lines[:, 4] = v_charge(lines[:, 4])
 # Calculate atomic distances
 distances = []
 for i in xrange(len(lines)):
 line_sum = np.sum((lines[i, 1:4] - lines[0, 1:4]) ** 2)
 min_dist = np.sqrt(line_sum)
 distances.append(min_dist)
 lines = np.column_stack((lines, distances))
 lines = lines[lines[:, 5].argsort()]
 return lines
def print_data(lines):
 """Prints the parsed and formatted data."""
 for row in lines:
 print '{0:<3.0f} {1:12.6f} {2:12.6f}'.format( row[0], row[5], row[4])
if __name__ == '__main__':
 acf_lines = parse_dat_file('ACF.dat')
 print_data(process_lines(acf_lines))

Question 3

This is fantastic, thank you, Alex! Then I realize that it is always more readable to use functions.

alexwlchan alexwlchan 8,6751 gold badge23 silver badges55 bronze badges · Accepted Answer · 2014-04-05 19:33:25Z

Coding style

For starters, read PEP 8, which is the Python style guide. This gives suggestions on whitespace around operators and commas, variable names, and the like. Your code isn’t bad if it doesn’t follow these conventions, but it will slow down more experienced Python programmers reading your code, and might make it harder for you later when you read other people’s Python.

You should also read PEP 257, which explain the Python conventions for docstrings. Rather than including a description for the function on the same line as the def, we put it on a special line inside the function definition.

With those two in mind, I might rewrite your charge function as:

def residual_charge(x):
 """Returns the residual charge of an atom."""
 if x < 4:
 return 4 - x # atom 1
 else:
 return 6 - x # atom 2

The rest of the style changes are only minor, and you can find them yourself.

Opening/closing the file

As far as I can tell, the six lines after you’ve defined charge are the only lines where you’re directly interacting with the file ACF.dat, but you don’t issue a close() command until the end of the script. That can cause problems if the input file is large, or if you want to load multiple files.

I’d make three changes:

Use Python’s with open() ... construction – this keeps all the file handling code in one place, and automatically closes it for you as well.
Wrap this in a function: it keeps the code for processing the file input in one place, and means you can use it again if you want to feed it a different file.
Go through the file a line at a time (you might not need this, but it saves loading the whole file into memory if ACF.dat is large).

So I’d write something like:

def parse_dat_file(filename):
 """Returns a numpy array based on the contents of the input file."""
 # read the file contents
 with open(filename, 'r') as f:
 for line in f:
 parsed_lines.append(line.split())
 # remove unnecessary rows and columns
 parsed_lines = parsed_lines[2:-4]
 parsed_lines = np.array(parsed_lines, dtype=float)
 parsed_lines = parsed_lines[:,0:5]
 return parsed_lines

Doing the necessary calculations

Again, I’d break this into a single function. It separates that block of work from the rest of the file, and makes it easier to organise. This means that if you get the data from another source, you can use the same code untouched.

Some notes:

In Python, a range() starts at 0 by default, so you don’t need to add it. If no start is supplied, it just counts from 0 upwards. You can find out more by typing help(range) at a Python shell.
If the data set is large, you should use xrange() instead. This is faster when you’re using a large range of objects. See What is the difference between range and xrange? for more details.

This is what I get at the end:

def process_lines(lines):
 """Returns the lines with the residual charge and
 interatomic distances computed and added.
 """
 # Calculate residual charge
 v_charge = np.vectorize(residual_charge)
 lines[:, 4] = v_charge(lines[:, 4])
 # Calculate atomic distances
 distances = []
 for i in xrange(len(lines)):
 line_sum = np.sum((lines[i, 1:4] - lines[0, 1:4]) ** 2)
 min_dist = np.sqrt(line_sum)
 distances.append(min_dist)
 lines = np.column_stack((lines, dist))
 lines = lines[lines[:, 5].argsort()]
 return

Printing the file

Without wishing to sound like a broken record, put this in a function as well. This one is easy:

def print_data(lines):
 """Prints the parsed and formatted data."""
 for row in lines:
 print '{0:<3.0f} {1:12.6f} {2:12.6f}'.format( row[0], row[5], row[4])

Script vs. module

At the end of the script, we can add the lines

acf_lines = parse_dat_file('ACF.dat')
print_data(process_lines(acf_lines))

and it will do the printing we originally set out to do.

But now the code has been broken into separate functions, we can use it in other scripts. Say this was atomic.py; then we could write from atomic import residual_charge in another file, and we’d have access to the residual_charge function.

However, when we import this file in another script, we don’t want it to start reading and printing ACF.dat; we just want the definitions. If we put the particular code for ACF.dat in a special if statement, then it only gets run if the file is called directly:

if __name__ == '__main__':
 acf_lines = parse_dat_file('ACF.dat')
 print_data(process_lines(acf_lines))

Now, if we type python atomic.py at a command line, this code gets run and the data is printed. If we use import atomic elsewhere, then we just get the function definitions. You can find out more in What does if __name__ == "__main__": do?

Summary

If I combine everything I’ve done above and put it into a single file, then this is what I’m left with:

#!/usr/bin/python
import numpy as np
def residual_charge(x):
 """Returns the residual charge of an atom."""
 if x < 4:
 return 4 - x # atom 1
 else:
 return 6 - x # atom 2
def parse_dat_file(filename):
 """Returns a numpy array based on the contents of the input file."""
 # read the file contents
 parsed_lines = []
 with open(filename, 'r') as f:
 for line in f:
 parsed_lines.append(line.split())
 # Remove unnecessary rows and columns
 parsed_lines = parsed_lines[2:-4]
 parsed_lines = np.array(parsed_lines, dtype=float)
 parsed_lines = parsed_lines[:,0:5]
 return parsed_lines
def process_lines(lines):
 """Returns the lines with the residual charge and
 interatomic distances computed and added.
 """ 
 # Calculate residual charge
 v_charge = np.vectorize(residual_charge)
 lines[:, 4] = v_charge(lines[:, 4])
 # Calculate atomic distances
 distances = []
 for i in xrange(len(lines)):
 line_sum = np.sum((lines[i, 1:4] - lines[0, 1:4]) ** 2)
 min_dist = np.sqrt(line_sum)
 distances.append(min_dist)
 lines = np.column_stack((lines, distances))
 lines = lines[lines[:, 5].argsort()]
 return lines
def print_data(lines):
 """Prints the parsed and formatted data."""
 for row in lines:
 print '{0:<3.0f} {1:12.6f} {2:12.6f}'.format( row[0], row[5], row[4])
if __name__ == '__main__':
 acf_lines = parse_dat_file('ACF.dat')
 print_data(process_lines(acf_lines))

This is fantastic, thank you, Alex! Then I realize that it is always more readable to use functions.

Stack Exchange Network

Pythonic way to manipulate arrays: a case study

1 Answer 1

Coding style

Opening/closing the file

Doing the necessary calculations

Printing the file

Script vs. module

Summary

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Pythonic way to manipulate arrays: a case study

1 Answer 1

Coding style

Opening/closing the file

Doing the necessary calculations

Printing the file

Script vs. module

Summary

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions