I just spent almost one day to figure out the following code, which analyzes the output data from material science programs called VASP and Bader. As a Python beginner, I earnestly ask some aces to help me make this code more pythonic, or efficient. Because I know that although the code can run smoothly, it definitely needs more modification from professionals. The code is listed below, with the data file ACF.dat
after it.
#!/usr/bin/python
import numpy as np
def charge(x): # function to calculate charge difference
if x<4:
return 4-x # atom1
else:
return 6-x # atom2
# read from file ACF.dat
f = open('ACF.dat', 'r')
lines = []
for row in f.readlines():
tmp = row.split()
lines.append(tmp)
lines = lines[2:-4] # remove unecessary rows
lines = np.array(lines, dtype=float)
lines = lines[:,0:5] # remove unecessary columns
# calculate residual charge
chg = np.vectorize(charge)
lines[:,4] = chg(lines[:,4])
# calculate interatomic distances
dist = []
for i in range(0,len(lines)):
d = np.sqrt(np.sum((lines[i,1:4]-lines[0,1:4])**2))
dist.append(d)
lines = np.column_stack((lines, dist)) # attach dist to lines
lines = lines[lines[:,5].argsort()]
for row in lines:
print '{0:<3.0f} {1:12.6f} {2:12.6f}'.format( row[0], row[5], row[4])
f.close()
ACF.dat
The structure of the file is seen as the heading. What I need are the first five columns: the atomic index, the x, y, z coordinates of each atom, the atomic charge.
# X Y Z CHARGE MIN DIST ATOMIC VOL
--------------------------------------------------------------------------------
1 7.5119 7.5119 7.5119 2.9875 1.1144 20.7692
2 18.0286 18.0286 18.0286 2.8514 1.3688 23.0095
3 6.0058 18.0205 18.0205 2.8500 1.3599 22.9265
4 12.0323 18.0342 18.0342 2.8480 1.3638 22.9816
5 18.0205 6.0058 18.0205 2.8500 1.3599 22.9265
6 5.9968 5.9968 17.9789 2.7979 1.2317 22.3582
--------------------------------------------------------------------------------
VACUUM CHARGE: 0.0000
VACUUM VOLUME: 0.0000
NUMBER OF ELECTRONS: 1085.9999
What the code does is:
- Read in the content of
ACF.dat
, and removes the first two rows and last four rows. Then saves only the first five columns into arraylines
. - Calculates the residual charge using function
charge(x)
. - Calculates the interatomic distances using the x, y, z coordinates of the atoms, and attach the array of distances to
lines
as its last column. - Sorts lines according to the interatomic distances (column #6 of
lines
). - Prints out the atomic index, interatomic distances, and atomic charges.
I understand that this is a lengthy question, but I still would like to ask someone helping me, and other newbies, learn Python faster by optimizing our own code. Any suggestion is welcomed.
1 Answer 1
Coding style
For starters, read PEP 8, which is the Python style guide. This gives suggestions on whitespace around operators and commas, variable names, and the like. Your code isn’t bad if it doesn’t follow these conventions, but it will slow down more experienced Python programmers reading your code, and might make it harder for you later when you read other people’s Python.
You should also read PEP 257, which explain the Python conventions for docstrings. Rather than including a description for the function on the same line as the def
, we put it on a special line inside the function definition.
With those two in mind, I might rewrite your charge
function as:
def residual_charge(x):
"""Returns the residual charge of an atom."""
if x < 4:
return 4 - x # atom 1
else:
return 6 - x # atom 2
The rest of the style changes are only minor, and you can find them yourself.
Opening/closing the file
As far as I can tell, the six lines after you’ve defined charge
are the only lines where you’re directly interacting with the file ACF.dat
, but you don’t issue a close()
command until the end of the script. That can cause problems if the input file is large, or if you want to load multiple files.
I’d make three changes:
Use Python’s
with open() ...
construction – this keeps all the file handling code in one place, and automatically closes it for you as well.Wrap this in a function: it keeps the code for processing the file input in one place, and means you can use it again if you want to feed it a different file.
Go through the file a line at a time (you might not need this, but it saves loading the whole file into memory if
ACF.dat
is large).
So I’d write something like:
def parse_dat_file(filename):
"""Returns a numpy array based on the contents of the input file."""
# read the file contents
with open(filename, 'r') as f:
for line in f:
parsed_lines.append(line.split())
# remove unnecessary rows and columns
parsed_lines = parsed_lines[2:-4]
parsed_lines = np.array(parsed_lines, dtype=float)
parsed_lines = parsed_lines[:,0:5]
return parsed_lines
Doing the necessary calculations
Again, I’d break this into a single function. It separates that block of work from the rest of the file, and makes it easier to organise. This means that if you get the data from another source, you can use the same code untouched.
Some notes:
In Python, a
range()
starts at 0 by default, so you don’t need to add it. If no start is supplied, it just counts from 0 upwards. You can find out more by typinghelp(range)
at a Python shell.If the data set is large, you should use
xrange()
instead. This is faster when you’re using a large range of objects. See What is the difference between range and xrange? for more details.
This is what I get at the end:
def process_lines(lines):
"""Returns the lines with the residual charge and
interatomic distances computed and added.
"""
# Calculate residual charge
v_charge = np.vectorize(residual_charge)
lines[:, 4] = v_charge(lines[:, 4])
# Calculate atomic distances
distances = []
for i in xrange(len(lines)):
line_sum = np.sum((lines[i, 1:4] - lines[0, 1:4]) ** 2)
min_dist = np.sqrt(line_sum)
distances.append(min_dist)
lines = np.column_stack((lines, dist))
lines = lines[lines[:, 5].argsort()]
return
Printing the file
Without wishing to sound like a broken record, put this in a function as well. This one is easy:
def print_data(lines):
"""Prints the parsed and formatted data."""
for row in lines:
print '{0:<3.0f} {1:12.6f} {2:12.6f}'.format( row[0], row[5], row[4])
Script vs. module
At the end of the script, we can add the lines
acf_lines = parse_dat_file('ACF.dat')
print_data(process_lines(acf_lines))
and it will do the printing we originally set out to do.
But now the code has been broken into separate functions, we can use it in other scripts. Say this was atomic.py
; then we could write from atomic import residual_charge
in another file, and we’d have access to the residual_charge
function.
However, when we import
this file in another script, we don’t want it to start reading and printing ACF.dat
; we just want the definitions. If we put the particular code for ACF.dat
in a special if
statement, then it only gets run if the file is called directly:
if __name__ == '__main__':
acf_lines = parse_dat_file('ACF.dat')
print_data(process_lines(acf_lines))
Now, if we type python atomic.py
at a command line, this code gets run and the data is printed. If we use import atomic
elsewhere, then we just get the function definitions. You can find out more in What does if __name__ == "__main__":
do?
Summary
If I combine everything I’ve done above and put it into a single file, then this is what I’m left with:
#!/usr/bin/python
import numpy as np
def residual_charge(x):
"""Returns the residual charge of an atom."""
if x < 4:
return 4 - x # atom 1
else:
return 6 - x # atom 2
def parse_dat_file(filename):
"""Returns a numpy array based on the contents of the input file."""
# read the file contents
parsed_lines = []
with open(filename, 'r') as f:
for line in f:
parsed_lines.append(line.split())
# Remove unnecessary rows and columns
parsed_lines = parsed_lines[2:-4]
parsed_lines = np.array(parsed_lines, dtype=float)
parsed_lines = parsed_lines[:,0:5]
return parsed_lines
def process_lines(lines):
"""Returns the lines with the residual charge and
interatomic distances computed and added.
"""
# Calculate residual charge
v_charge = np.vectorize(residual_charge)
lines[:, 4] = v_charge(lines[:, 4])
# Calculate atomic distances
distances = []
for i in xrange(len(lines)):
line_sum = np.sum((lines[i, 1:4] - lines[0, 1:4]) ** 2)
min_dist = np.sqrt(line_sum)
distances.append(min_dist)
lines = np.column_stack((lines, distances))
lines = lines[lines[:, 5].argsort()]
return lines
def print_data(lines):
"""Prints the parsed and formatted data."""
for row in lines:
print '{0:<3.0f} {1:12.6f} {2:12.6f}'.format( row[0], row[5], row[4])
if __name__ == '__main__':
acf_lines = parse_dat_file('ACF.dat')
print_data(process_lines(acf_lines))
-
\$\begingroup\$ This is fantastic, thank you, Alex! Then I realize that it is always more readable to use functions. \$\endgroup\$Murphy– Murphy2014年04月06日 17:42:35 +00:00Commented Apr 6, 2014 at 17:42