File parser convert to node data and neighbouring relational data

Question 1

I have been working on a file parser that takes a very specific file format and parses it into a list that is arranged into node data and the neighbors that it relates to. I am new to Python (this is my very first program in this language), so I am not familiar with more advanced methods of solving the problem using Python.

The program runs very quickly: I get an average of about Elapsed time: 0.0006923562137193841 with a test file, but think it could be even better, especially if I task it with a significantly larger file.

Seeking from question:

Optimization in the form of cleaner methods
Decrease the overall runtime
Verification of estimated runtime: \$O(N * E)\$. I got this because there are N nodes, which each contain E edges. However, this may or may not be incorrect.
General style comments for Python coding

Input file example:

The following would be 1 line of data in the file. This file could contain thousands of lines, each line identifying a node and the neighbors that it has.

100 Alpha 123 321 ((101,Beta,123,321)(101,Gamma,123,321)(102,Alpha,123,321)(103,Alpha,123,321)(104,Beta,123,321)(105,Alpha,123,321)(099,Gamma,123,321)(098,Beta,123,321)(097,Beta,123,321)(222,Gamma,123,321)(223,Beta,123,321)(234,Gamma,123,321)(451,Beta,123,321)(999,Beta,123,321)(879,Gamma,123,321)(369,Gamma,123,321)(741,Beta,123,321)(753,Beta,123,321)(357,Beta,123,321)(159,Alpha,123,321))

The parsing would end with the line containing only "At the last row".

import os
import timeit
__author__ = 'Evan Bechtol'
"""
Parses through a file given the appropriate filepath. Once a filepath
has been received, the Parser instance opens and begins parsing out the
individual nodes and their neighbor node relationships. Each node is an
index of the nodeList, which contains a sub-list of the nodes that are neighbors
of that node.
The structure is created as follows:
nodeList: A list that is 'n' nodes long. The sub-list containing neighbors is of
 length 'e', where 'e' is the number of neighbor-edges.
numNeighbors: A list that contains the number of neighbors for each node from 0 to (n-1)
Resulting runtime of class: O(N*E)
"""
class Parser:
 # Constructor accepting filepath for file to read
 # as am argument. The constructor also calls readFile with
 # the filepath to begin parsing the specific file.
 def __init__(self, filePath):
 self.nodeList = []
 self.numNeighbors = []
 self.readFile(filePath)
 # Add nodes the the nodeList in order that they appear
 def setNodeData(self, id, sector, freq, band, neighborList):
 tmpData = ((id), (sector), (freq), (band), (neighborList))
 return tmpData
 # Add neighbors to the neighborList in the order that they appear
 def setNeighborData(self, id, sector, freq, band):
 tmpData = ((id), (sector), (freq), (band))
 return tmpData
 # Returns the entire nodeList as a string
 def getNodeList(self):
 return str(self.nodeList)
 # Return a specific node of the nodeList with all of its' neighbors
 def getNodeListIndex(self, index):
 return str(self.nodeList[index])
 # Return a specific neighbor for a given node
 def getNodeNeighbor(self, node, neighbor):
 return str(self.nodeList[node][4][neighbor])
 # Retrieves the location of the line to begin retrieving node and
 # neighbor data in the file. This eliminates any data above the actual
 # data required to build node and neighbor relationships.
 def searchForStartLine(self, data):
 startLine = "-default- - - - - "
 numLines = 0
 for line in data:
 numLines += 1
 if startLine in line:
 return numLines
 # Removes parenthesis from the line so that neighbors can be parsed.
 # Returns the line with all parenthesis removed and individual neighbors
 # are separated by spaces.
 def removeParens(self, line):
 # First, remove all parenthesis
 line = line.strip("((")
 line = line.strip("))")
 line = line.replace(")(", " ")
 return line
 # Splits the provided line into the required sections for
 # placement into the appropriate lists.
 # The reference node is parsed first and stored into the nodeList
 #
 # Once the nodeList is updated, the neighbor data is then parsed from
 # the line and stored in the neighborList for the reference node.
 def splitLine(self, line):
 # Separate into individual reference nodes
 splitLine = line.split()
 line = self.extractNode(line, splitLine)
 # Get each individual node from the specific line. This is referred to as the
 # "reference node", which represents the node that we will be creating a specific
 # list of neighbors for.
 # Each reference node is unique and contains a unique neighborList.
 def extractNode(self, line, splitLine):
 # Get all of the node data first and store in the nodeList
 nodeId = splitLine[0]
 sector = splitLine[1]
 freq = splitLine[2]
 band = splitLine[3]
 line = self.removeParens(splitLine[4])
 # Separate into individual neighbors
 neighbor = line.split()
 # Contains the number of neighbors for each reference node
 self.numNeighbors.append(len(neighbor))
 # Place each neighbor tuple into the neighborList
 neighborList = self.extractNeighbors(neighbor)
 self.nodeList.append(self.setNodeData(nodeId, sector, freq, band, neighborList))
 return line
 # Get the parsed list of neighbors for all nodes, then append
 # them to the neighborList in order that they are read.
 def extractNeighbors(self, neighbor):
 # Create a temporary storage for the neighbors of the reference node
 neighborList = []
 # Separate each neighbor string into individual neighbor components
 for i in range(len(neighbor)):
 neighbor[i] = neighbor[i].replace(",", " ")
 neighbor[i] = neighbor[i].split()
 nodeId = neighbor[i][0]
 sector = neighbor[i][1]
 freq = neighbor[i][2]
 band = neighbor[i][3]
 # Append the components to the neighborList
 neighborList.append(self.setNeighborData(nodeId, sector, freq, band))
 return neighborList
 # Read the file and remove junk data, leaving only the node and neighbor
 # data behind for storage in the data structure
 def readFile(self, fileName):
 # Check if the file exists at the specified path
 if not os.path.isfile(fileName):
 print ('File does not exist.')
 # File exists, will attempt parsing
 else:
 with open(str(fileName)) as file:
 data = file.readlines()
 # Look for the first sign of data that we can use, read from that location
 currentLine = self.searchForStartLine(data)
 # Read from file until we find the last line of data that we need
 lastLine = "At the last row"
 for line in data:
 if lastLine in data[currentLine + 1]:
 break
 else:
 nodeId = data[0]
 self.splitLine(data[currentLine])
 currentLine += 1
 return file.read()
# Read file, given the exact file path
startTime = timeit.timeit()
parse = Parser("<file_path>")
#print(parse.getNodeNeighbor(1, 0))
print (parse.nodeList[0][4][0])
endTime = timeit.timeit()
print ("Elapsed time: " + str(endTime - startTime))

Question 2

As a side note, I just read another question here on the site and have discovered that using """ to indicate comments is generally considered bad practice. :)

Question 3

Rule of thumb: 80% of the execution time is spent in 20% of the code. To do meaningful optimizations, use a profiler to measure the time that's spent inside each of your functions. Test your code on a large data set. The performance for a small file doesn't matter. Use empirical studies to verify your estimated runtime. Measure the running time multiple times for different values of N and E and analyze the result.

Question 4

I notice your readFile checks for a file's existence and then simply prints a message if it can't find it. It's not really a style issue, but as a potential user of your Parser class, I think I would expect it to at least throw an exception in that case. You could bypess the existence check altogether and just let open throw an exception for you if it can't be opened. By the way, a file not existing is only one of the reasons a file could fail to be opened. See also "EAFP" in the Python glossary.

Question 5

@Brandin I had the print statement there for testing purposes. Once I put this code into production similar print statements (including the one mentioned) were replaced with exceptions that are then displayed in a GUI

Question 6

Style

Use snake_case for functions and variables.
Leave 1 line between methods.
Don't overwrite keywords, id, file. Use synonyms or id_, the former is preferred.
Keep lines to a maximum of 79. (Comments should be a maximum of 72 however.)
Avoid excessive white-space.
Use one space between both sides of the assignment operator. E.G. a = 1, not a = 1.

As @Quill said docstrings are good. And you should write them instead of some of your comments in your code.

Algorithms

setNodeData Can be shortened to just:

def setNodeData(self, id_, sector, freq, band, neighborList):
 return ((id_), (sector), (freq), (band), (neighborList))

searchForStartLine should use the builtin enumerate.

def searchForStartLine(self, data):
 start_line = "-default- - - - - "
 for line_num, line in enumerate(data):
 if start_line in line:
 return line_num

splitLine is only used once. Also you overwrite line and don't use it after the overwrite. Consider removing it.

extractNeighbors can be simplified to a list comprehension, and could be simpler that way. It is also faster than appending to an existing list.

I first used the * operator on the slice [:4]. This way you don't have to define sector, etc. I then removed the need for i, there is no need for it, as everything uses neighbor[i].

def extractNeighbors(self, neighbors):
 return [
 self.setNeighborData(
 *(neighbor.replace(",", " ").split()[:4])
 )
 for neighbor in neighbors
 ]

We can do roughly the same thing above to extractNode, to minimise code.

First remove all the noise of sector etc. We will use *(splitLine[:5]). You only make changes to splitLine[4], and the other informaton just clutters that.

def extractNode(self, splitLine):
 splitLine[4] = self.extractNeighbors(
 self.removeParens(splitLine[4]).split()
 )
 self.numNeighbors.append(len(splitLine[4]))
 self.nodeList.append(self.setNodeData(*(splitLine[:5])))

In readFile you should change the for loop.

for line in data:

You don't use line, and you do currentLine += 1.

for line_num, _ in enumerate(data, currentLine):
 if lastLine in data[line_num + 1]:
 break
 else:
 nodeId = data[0]
 self.splitLine(data[line_num])

It may be better to use range however.

Overall, this will have no effect on the speed. But it highlights that str.split and str.replace are probably the reasons that it is slow. This is as they are, if I recall correctly, linear time. You could try using the optional argument of str.split, maxsplit, to not go through the entire string. And you wouldn't need to use [:4] and [:5].

The only other way I can see speeding this up would be to use a different algorithm to handle each line in roughly O(n). I know that we aren't allowed to write compleat rewrites so here is an example I wrote.

There are no speed optimisations.
Same as 1. But try str.split's maxsplit.
I can't comment.
It's throughout.

Question 7

Feel free to add the rewrite to the answer. It's just that posting only a rewrite without commenting on the original code would not be a good answer. meta

Question 8

Here is my updated code per the suggestions given in Joe Wallis' answer.

I have changed the main loop in the readFile method, eliminating the if/else statements. I have also added additional methods to aid in easily retrieving data later form a separate module that is used.

import os
import time
__author__ = 'Evan Bechtol'
"""
Parses through a file given the appropriate filepath. Once a filepath
has been received, the Parser instance opens and begins parsing out the
individual nodes and their neighbor node relationships. Each node is an
index of the nodeList, which contains a sub-list of the nodes that are neighbors
of that node.
The structure is created as follows:
nodeList: A list that is 'n' nodes long. The sub-list containing neighbors is of
 length 'e', where 'e' is the number of neighbor-vertices.
numNeighbors: A list that contains the number of neighbors for each node from 0 to (n-1)
Resulting runtime of class: O(N*E)
"""
class Parser:
 # Constructor accepting filepath for file to read
 # as am argument. The constructor also calls readFile with
 # the filepath to begin parsing the specific file.
 def __init__(self, filePath):
 self.nodeList = []
 self.numNeighbors = []
 self.readFile(filePath)
 # Add nodes the the nodeList in order that they appear
 def setNodeData(self, id, sector, freq, band, neighborList):
 return ((id), (sector), (freq), (band), (neighborList))
 # Add neighbors to the neighborList in the order that they appear
 def setNeighborData(self, id, sector, freq, band):
 return ((id), (sector), (freq), (band))
 # Returns the entire nodeList as a string
 def getNodeList(self):
 return str(self.nodeList)
 # Return a specific node of the nodeList with all of its' neighbors
 def getNodeListIndex(self, index):
 return str(self.nodeList[index])
 # Returns a list of neighbors represented by a string, for a given node.
 # Each index of the list is an individual neighbor of that node.
 def getNodeNeighborList(self, node):
 neighborString = []
 i = 0
 while i < self.numNeighbors[node]:
 neighborString.append(self.getNodeNeighbor(node, i))
 i += 1
 return neighborString
 # Return a specific neighbor for a given node
 def getNodeNeighbor(self, node, neighbor):
 return str(self.nodeList[node][4][neighbor][0] + "_" + \
 self.nodeList[node][4][neighbor][1] + "_" + \
 self.nodeList[node][4][neighbor][2] + "_" + \
 self.nodeList[node][4][neighbor][3])
 # Returns the node information, as a string, for a given index on the
 # nodeList.
 # Data is returned as follows:
 # NodeID_Sector_Frequency_Band
 def getNode(self, node):
 return (self.nodeList[node][0] + "_" + \
 self.nodeList[node][1] + "_" + \
 self.nodeList[node][2] + "_" + \
 self.nodeList[node][3])
 # Return the NodeID for a given index on the nodeList
 def getNodeId(self, node):
 return (self.nodeList[node][0])
 # Returns the number of nodes in the nodeList
 def getNumNodes(self):
 return (len(self.nodeList) + 1)
 # Returns the number of neighbors that exist for a given node as an int
 def getNodeNumNeighbors(self, node):
 return self.numNeighbors[node]
 # Retrieves the location of the line to begin retrieving node and
 # neighbor data in the file. This eliminates any data above the actual
 # data required to build node and neighbor relationships.
 def searchForStartLine(self, data):
 startLine = "-default- - - - - "
 numLines = 0
 for line in data:
 numLines += 1
 if startLine in line:
 return numLines
 # Removes parenthesis from the line so that neighbors can be parsed.
 # Returns the line with all parenthesis removed and individual neighbors
 # are separated by spaces.
 def removeParens(self, line):
 # First, remove all parenthesis
 line = line.strip("((")
 line = line.strip("))")
 line = line.replace(")(", " ")
 return line
 # Get each individual node from the specific line.
 # This is referred to as the "reference node",
 # which represents the node that we will be creating a specific
 # list of neighbors for.
 # Each reference node is unique and contains a unique neighborList.
 def extractNode(self, splitLine):
 splitLine[4] = self.extractNeighbors(
 self.removeParens(splitLine[4]).split()
 )
 self.numNeighbors.append(len(splitLine[4]))
 self.nodeList.append(self.setNodeData(*(splitLine[:5])))
 # Get the parsed list of neighbors for all nodes, then append
 # them to the neighborList in order that they are read.
 def extractNeighbors(self, neighbors):
 return [
 self.setNeighborData(
 *(neighbor.replace(",", " ").split()[:4])
 )
 for neighbor in neighbors
 ]
 # Read the file and remove junk data, leaving only the node and neighbor
 # data behind for storage in the data structure
 def readFile(self, fileName):
 # Check if the file exists at the specified path
 if not os.path.isfile(fileName):
 print ('File does not exist.')
 # File exists, will attempt parsing
 else:
 with open(str(fileName)) as file:
 data = file.readlines()
 # Look for the first sign of data that we can use, read from that location
 currentLine = self.searchForStartLine(data)
 # Read from file until we find the last line of data that we need
 lastLine = "At the last row"
 while lastLine not in data[currentLine + 1]:
 nodeId = data[0]
 self.extractNode(data[currentLine].split())
 currentLine += 1

Question 9

Don't forget to reward the best answer by accepting it if it helped you improve your code.

Question 10

I'm not the one who down-voted

Question 11

@Phrancis my bad :( Not really sure why someone would downvote and not explain why.

Question 12

No worries friend

Peilonrayz ♦Peilonrayz 44.4k7 gold badges80 silver badges157 bronze badges · Accepted Answer · 2015-07-07 16:49:02Z

Style

Use snake_case for functions and variables.
Leave 1 line between methods.
Don't overwrite keywords, id, file. Use synonyms or id_, the former is preferred.
Keep lines to a maximum of 79. (Comments should be a maximum of 72 however.)
Avoid excessive white-space.
Use one space between both sides of the assignment operator. E.G. a = 1, not a = 1.

As @Quill said docstrings are good. And you should write them instead of some of your comments in your code.

Algorithms

setNodeData Can be shortened to just:

def setNodeData(self, id_, sector, freq, band, neighborList):
 return ((id_), (sector), (freq), (band), (neighborList))

searchForStartLine should use the builtin enumerate.

def searchForStartLine(self, data):
 start_line = "-default- - - - - "
 for line_num, line in enumerate(data):
 if start_line in line:
 return line_num

splitLine is only used once. Also you overwrite line and don't use it after the overwrite. Consider removing it.

extractNeighbors can be simplified to a list comprehension, and could be simpler that way. It is also faster than appending to an existing list.

I first used the * operator on the slice [:4]. This way you don't have to define sector, etc. I then removed the need for i, there is no need for it, as everything uses neighbor[i].

def extractNeighbors(self, neighbors):
 return [
 self.setNeighborData(
 *(neighbor.replace(",", " ").split()[:4])
 )
 for neighbor in neighbors
 ]

We can do roughly the same thing above to extractNode, to minimise code.

First remove all the noise of sector etc. We will use *(splitLine[:5]). You only make changes to splitLine[4], and the other informaton just clutters that.

def extractNode(self, splitLine):
 splitLine[4] = self.extractNeighbors(
 self.removeParens(splitLine[4]).split()
 )
 self.numNeighbors.append(len(splitLine[4]))
 self.nodeList.append(self.setNodeData(*(splitLine[:5])))

In readFile you should change the for loop.

for line in data:

You don't use line, and you do currentLine += 1.

for line_num, _ in enumerate(data, currentLine):
 if lastLine in data[line_num + 1]:
 break
 else:
 nodeId = data[0]
 self.splitLine(data[line_num])

It may be better to use range however.

Overall, this will have no effect on the speed. But it highlights that str.split and str.replace are probably the reasons that it is slow. This is as they are, if I recall correctly, linear time. You could try using the optional argument of str.split, maxsplit, to not go through the entire string. And you wouldn't need to use [:4] and [:5].

The only other way I can see speeding this up would be to use a different algorithm to handle each line in roughly O(n). I know that we aren't allowed to write compleat rewrites so here is an example I wrote.

There are no speed optimisations.
Same as 1. But try str.split's maxsplit.
I can't comment.
It's throughout.

Feel free to add the rewrite to the answer. It's just that posting only a rewrite without commenting on the original code would not be a good answer. meta

Stack Exchange Network

File parser convert to node data and neighbouring relational data

2 Answers 2

Style

Algorithms

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

File parser convert to node data and neighbouring relational data

2 Answers 2

Style

Algorithms

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions