I have been working on a file parser that takes a very specific file format and parses it into a list that is arranged into node data and the neighbors that it relates to. I am new to Python (this is my very first program in this language), so I am not familiar with more advanced methods of solving the problem using Python.
The program runs very quickly: I get an average of about Elapsed time: 0.0006923562137193841
with a test file, but think it could be even better, especially if I task it with a significantly larger file.
Seeking from question:
- Optimization in the form of cleaner methods
- Decrease the overall runtime
- Verification of estimated runtime: \$O(N * E)\$. I got this because there are
N
nodes, which each containE
edges. However, this may or may not be incorrect. - General style comments for Python coding
Input file example:
The following would be 1 line of data in the file. This file could contain thousands of lines, each line identifying a node and the neighbors that it has.
100 Alpha 123 321 ((101,Beta,123,321)(101,Gamma,123,321)(102,Alpha,123,321)(103,Alpha,123,321)(104,Beta,123,321)(105,Alpha,123,321)(099,Gamma,123,321)(098,Beta,123,321)(097,Beta,123,321)(222,Gamma,123,321)(223,Beta,123,321)(234,Gamma,123,321)(451,Beta,123,321)(999,Beta,123,321)(879,Gamma,123,321)(369,Gamma,123,321)(741,Beta,123,321)(753,Beta,123,321)(357,Beta,123,321)(159,Alpha,123,321))
The parsing would end with the line containing only "At the last row"
.
import os
import timeit
__author__ = 'Evan Bechtol'
"""
Parses through a file given the appropriate filepath. Once a filepath
has been received, the Parser instance opens and begins parsing out the
individual nodes and their neighbor node relationships. Each node is an
index of the nodeList, which contains a sub-list of the nodes that are neighbors
of that node.
The structure is created as follows:
nodeList: A list that is 'n' nodes long. The sub-list containing neighbors is of
length 'e', where 'e' is the number of neighbor-edges.
numNeighbors: A list that contains the number of neighbors for each node from 0 to (n-1)
Resulting runtime of class: O(N*E)
"""
class Parser:
# Constructor accepting filepath for file to read
# as am argument. The constructor also calls readFile with
# the filepath to begin parsing the specific file.
def __init__(self, filePath):
self.nodeList = []
self.numNeighbors = []
self.readFile(filePath)
# Add nodes the the nodeList in order that they appear
def setNodeData(self, id, sector, freq, band, neighborList):
tmpData = ((id), (sector), (freq), (band), (neighborList))
return tmpData
# Add neighbors to the neighborList in the order that they appear
def setNeighborData(self, id, sector, freq, band):
tmpData = ((id), (sector), (freq), (band))
return tmpData
# Returns the entire nodeList as a string
def getNodeList(self):
return str(self.nodeList)
# Return a specific node of the nodeList with all of its' neighbors
def getNodeListIndex(self, index):
return str(self.nodeList[index])
# Return a specific neighbor for a given node
def getNodeNeighbor(self, node, neighbor):
return str(self.nodeList[node][4][neighbor])
# Retrieves the location of the line to begin retrieving node and
# neighbor data in the file. This eliminates any data above the actual
# data required to build node and neighbor relationships.
def searchForStartLine(self, data):
startLine = "-default- - - - - "
numLines = 0
for line in data:
numLines += 1
if startLine in line:
return numLines
# Removes parenthesis from the line so that neighbors can be parsed.
# Returns the line with all parenthesis removed and individual neighbors
# are separated by spaces.
def removeParens(self, line):
# First, remove all parenthesis
line = line.strip("((")
line = line.strip("))")
line = line.replace(")(", " ")
return line
# Splits the provided line into the required sections for
# placement into the appropriate lists.
# The reference node is parsed first and stored into the nodeList
#
# Once the nodeList is updated, the neighbor data is then parsed from
# the line and stored in the neighborList for the reference node.
def splitLine(self, line):
# Separate into individual reference nodes
splitLine = line.split()
line = self.extractNode(line, splitLine)
# Get each individual node from the specific line. This is referred to as the
# "reference node", which represents the node that we will be creating a specific
# list of neighbors for.
# Each reference node is unique and contains a unique neighborList.
def extractNode(self, line, splitLine):
# Get all of the node data first and store in the nodeList
nodeId = splitLine[0]
sector = splitLine[1]
freq = splitLine[2]
band = splitLine[3]
line = self.removeParens(splitLine[4])
# Separate into individual neighbors
neighbor = line.split()
# Contains the number of neighbors for each reference node
self.numNeighbors.append(len(neighbor))
# Place each neighbor tuple into the neighborList
neighborList = self.extractNeighbors(neighbor)
self.nodeList.append(self.setNodeData(nodeId, sector, freq, band, neighborList))
return line
# Get the parsed list of neighbors for all nodes, then append
# them to the neighborList in order that they are read.
def extractNeighbors(self, neighbor):
# Create a temporary storage for the neighbors of the reference node
neighborList = []
# Separate each neighbor string into individual neighbor components
for i in range(len(neighbor)):
neighbor[i] = neighbor[i].replace(",", " ")
neighbor[i] = neighbor[i].split()
nodeId = neighbor[i][0]
sector = neighbor[i][1]
freq = neighbor[i][2]
band = neighbor[i][3]
# Append the components to the neighborList
neighborList.append(self.setNeighborData(nodeId, sector, freq, band))
return neighborList
# Read the file and remove junk data, leaving only the node and neighbor
# data behind for storage in the data structure
def readFile(self, fileName):
# Check if the file exists at the specified path
if not os.path.isfile(fileName):
print ('File does not exist.')
# File exists, will attempt parsing
else:
with open(str(fileName)) as file:
data = file.readlines()
# Look for the first sign of data that we can use, read from that location
currentLine = self.searchForStartLine(data)
# Read from file until we find the last line of data that we need
lastLine = "At the last row"
for line in data:
if lastLine in data[currentLine + 1]:
break
else:
nodeId = data[0]
self.splitLine(data[currentLine])
currentLine += 1
return file.read()
# Read file, given the exact file path
startTime = timeit.timeit()
parse = Parser("<file_path>")
#print(parse.getNodeNeighbor(1, 0))
print (parse.nodeList[0][4][0])
endTime = timeit.timeit()
print ("Elapsed time: " + str(endTime - startTime))
2 Answers 2
Style
- Use
snake_case
for functions and variables. - Leave 1 line between methods.
- Don't overwrite keywords,
id
,file
. Use synonyms orid_
, the former is preferred. - Keep lines to a maximum of 79. (Comments should be a maximum of 72 however.)
- Avoid excessive white-space.
- Use one space between both sides of the assignment operator. E.G.
a = 1
, nota = 1
.
As @Quill said docstrings are good. And you should write them instead of some of your comments in your code.
Algorithms
setNodeData
Can be shortened to just:
def setNodeData(self, id_, sector, freq, band, neighborList):
return ((id_), (sector), (freq), (band), (neighborList))
searchForStartLine
should use the builtin enumerate.
def searchForStartLine(self, data):
start_line = "-default- - - - - "
for line_num, line in enumerate(data):
if start_line in line:
return line_num
splitLine
is only used once. Also you overwrite line
and don't use it after the overwrite.
Consider removing it.
extractNeighbors
can be simplified to a list comprehension, and could be simpler that way.
It is also faster than appending to an existing list.
I first used the *
operator on the slice [:4]
. This way you don't have to define sector
, etc.
I then removed the need for i
, there is no need for it, as everything uses neighbor[i]
.
def extractNeighbors(self, neighbors):
return [
self.setNeighborData(
*(neighbor.replace(",", " ").split()[:4])
)
for neighbor in neighbors
]
We can do roughly the same thing above to extractNode
, to minimise code.
First remove all the noise of sector
etc. We will use *(splitLine[:5])
.
You only make changes to splitLine[4]
, and the other informaton just clutters that.
def extractNode(self, splitLine):
splitLine[4] = self.extractNeighbors(
self.removeParens(splitLine[4]).split()
)
self.numNeighbors.append(len(splitLine[4]))
self.nodeList.append(self.setNodeData(*(splitLine[:5])))
In readFile
you should change the for loop.
for line in data:
You don't use line
, and you do currentLine += 1
.
for line_num, _ in enumerate(data, currentLine):
if lastLine in data[line_num + 1]:
break
else:
nodeId = data[0]
self.splitLine(data[line_num])
It may be better to use range
however.
Overall, this will have no effect on the speed.
But it highlights that str.split
and str.replace
are probably the reasons that it is slow.
This is as they are, if I recall correctly, linear time.
You could try using the optional argument of str.split
, maxsplit, to not go through the entire string.
And you wouldn't need to use [:4]
and [:5]
.
The only other way I can see speeding this up would be to use a different algorithm to handle each line in roughly O(n). I know that we aren't allowed to write compleat rewrites so here is an example I wrote.
- There are no speed optimisations.
- Same as 1. But try
str.split
's maxsplit. - I can't comment.
- It's throughout.
-
1\$\begingroup\$ Feel free to add the rewrite to the answer. It's just that posting only a rewrite without commenting on the original code would not be a good answer. meta \$\endgroup\$Janne Karila– Janne Karila2015年07月08日 07:07:23 +00:00Commented Jul 8, 2015 at 7:07
Here is my updated code per the suggestions given in Joe Wallis' answer.
I have changed the main loop in the readFile
method, eliminating the if/else statements. I have also added additional methods to aid in easily retrieving data later form a separate module that is used.
import os
import time
__author__ = 'Evan Bechtol'
"""
Parses through a file given the appropriate filepath. Once a filepath
has been received, the Parser instance opens and begins parsing out the
individual nodes and their neighbor node relationships. Each node is an
index of the nodeList, which contains a sub-list of the nodes that are neighbors
of that node.
The structure is created as follows:
nodeList: A list that is 'n' nodes long. The sub-list containing neighbors is of
length 'e', where 'e' is the number of neighbor-vertices.
numNeighbors: A list that contains the number of neighbors for each node from 0 to (n-1)
Resulting runtime of class: O(N*E)
"""
class Parser:
# Constructor accepting filepath for file to read
# as am argument. The constructor also calls readFile with
# the filepath to begin parsing the specific file.
def __init__(self, filePath):
self.nodeList = []
self.numNeighbors = []
self.readFile(filePath)
# Add nodes the the nodeList in order that they appear
def setNodeData(self, id, sector, freq, band, neighborList):
return ((id), (sector), (freq), (band), (neighborList))
# Add neighbors to the neighborList in the order that they appear
def setNeighborData(self, id, sector, freq, band):
return ((id), (sector), (freq), (band))
# Returns the entire nodeList as a string
def getNodeList(self):
return str(self.nodeList)
# Return a specific node of the nodeList with all of its' neighbors
def getNodeListIndex(self, index):
return str(self.nodeList[index])
# Returns a list of neighbors represented by a string, for a given node.
# Each index of the list is an individual neighbor of that node.
def getNodeNeighborList(self, node):
neighborString = []
i = 0
while i < self.numNeighbors[node]:
neighborString.append(self.getNodeNeighbor(node, i))
i += 1
return neighborString
# Return a specific neighbor for a given node
def getNodeNeighbor(self, node, neighbor):
return str(self.nodeList[node][4][neighbor][0] + "_" + \
self.nodeList[node][4][neighbor][1] + "_" + \
self.nodeList[node][4][neighbor][2] + "_" + \
self.nodeList[node][4][neighbor][3])
# Returns the node information, as a string, for a given index on the
# nodeList.
# Data is returned as follows:
# NodeID_Sector_Frequency_Band
def getNode(self, node):
return (self.nodeList[node][0] + "_" + \
self.nodeList[node][1] + "_" + \
self.nodeList[node][2] + "_" + \
self.nodeList[node][3])
# Return the NodeID for a given index on the nodeList
def getNodeId(self, node):
return (self.nodeList[node][0])
# Returns the number of nodes in the nodeList
def getNumNodes(self):
return (len(self.nodeList) + 1)
# Returns the number of neighbors that exist for a given node as an int
def getNodeNumNeighbors(self, node):
return self.numNeighbors[node]
# Retrieves the location of the line to begin retrieving node and
# neighbor data in the file. This eliminates any data above the actual
# data required to build node and neighbor relationships.
def searchForStartLine(self, data):
startLine = "-default- - - - - "
numLines = 0
for line in data:
numLines += 1
if startLine in line:
return numLines
# Removes parenthesis from the line so that neighbors can be parsed.
# Returns the line with all parenthesis removed and individual neighbors
# are separated by spaces.
def removeParens(self, line):
# First, remove all parenthesis
line = line.strip("((")
line = line.strip("))")
line = line.replace(")(", " ")
return line
# Get each individual node from the specific line.
# This is referred to as the "reference node",
# which represents the node that we will be creating a specific
# list of neighbors for.
# Each reference node is unique and contains a unique neighborList.
def extractNode(self, splitLine):
splitLine[4] = self.extractNeighbors(
self.removeParens(splitLine[4]).split()
)
self.numNeighbors.append(len(splitLine[4]))
self.nodeList.append(self.setNodeData(*(splitLine[:5])))
# Get the parsed list of neighbors for all nodes, then append
# them to the neighborList in order that they are read.
def extractNeighbors(self, neighbors):
return [
self.setNeighborData(
*(neighbor.replace(",", " ").split()[:4])
)
for neighbor in neighbors
]
# Read the file and remove junk data, leaving only the node and neighbor
# data behind for storage in the data structure
def readFile(self, fileName):
# Check if the file exists at the specified path
if not os.path.isfile(fileName):
print ('File does not exist.')
# File exists, will attempt parsing
else:
with open(str(fileName)) as file:
data = file.readlines()
# Look for the first sign of data that we can use, read from that location
currentLine = self.searchForStartLine(data)
# Read from file until we find the last line of data that we need
lastLine = "At the last row"
while lastLine not in data[currentLine + 1]:
nodeId = data[0]
self.extractNode(data[currentLine].split())
currentLine += 1
-
1\$\begingroup\$ Don't forget to reward the best answer by accepting it if it helped you improve your code. \$\endgroup\$Phrancis– Phrancis2015年07月09日 00:47:53 +00:00Commented Jul 9, 2015 at 0:47
-
\$\begingroup\$ I'm not the one who down-voted \$\endgroup\$Phrancis– Phrancis2015年07月09日 02:08:29 +00:00Commented Jul 9, 2015 at 2:08
-
\$\begingroup\$ @Phrancis my bad :( Not really sure why someone would downvote and not explain why. \$\endgroup\$Evan Bechtol– Evan Bechtol2015年07月09日 02:08:46 +00:00Commented Jul 9, 2015 at 2:08
-
\$\begingroup\$ No worries friend \$\endgroup\$Phrancis– Phrancis2015年07月09日 02:09:13 +00:00Commented Jul 9, 2015 at 2:09
readFile
checks for a file's existence and then simply prints a message if it can't find it. It's not really a style issue, but as a potential user of your Parser class, I think I would expect it to at least throw an exception in that case. You could bypess the existence check altogether and just letopen
throw an exception for you if it can't be opened. By the way, a file not existing is only one of the reasons a file could fail to be opened. See also "EAFP" in the Python glossary. \$\endgroup\$