I am trying to write a simple (trivial?) "compare" program for Eclipse preferences files.
Eclipse preferences files take more of less this form:
# optional comment line /a/sequence/of/path/elements=string /another/sequence/of/path/elements=42 # ... (possibly repeated)
Let's call the path sequences "keys" and what follows the =
sign "values".
The rules of the program should be:
- Exit upon detecting an invalid # of arguments (must be 2)
- argument1 and argument2 are the files to compare
- Exit if any of the input files are empty
- One preference entry per line
- Lines with a path key will have a path value, guaranteed
The output should be as follows:
Part 1
All lines from argument1 which have keys not present in argument2, like so:
/path 42 /path2 banana
Blank line
Part 2
All keys that are present in both argument1 and argument2, like so:
/shared (valueInArgument1, valueInArgument2) /shared2 (valueInArgument1, valueInArgument2)
Blank line
Part 3
All lines from
argument2
which have keys not present inargument1
, like so:/path3 24 /path4 ananab
Notice Part 1, Part 2, and Part 3 must all be sorted.
I would like to get something:
- Pythonic
- Efficient (do as little work as possible computationally, use as little space as possible)
- Clear and readable from a logical point of view
- Correct (gets the right result even in edge cases)
- Instructive (uses data structures and algorithms properly)
Here's my attempt:
#!/usr/bin/env /opt/local/bin/python2.7
import sys
import os
import re
import pprint
COMMAND_SYNTAX_ERROR = 2
EMPTY_PREFS_FILE_ERROR = 3
PREFS_REGEX_PATTERN = '^(.*?)=(.*)$'
PREFS_REGEX = re.compile(PREFS_REGEX_PATTERN)
def parse_prefs_line(line):
regex_result = PREFS_REGEX.match(line)
if not regex_result:
return None, None
return regex_result.group(1), regex_result.group(2)
arguments = sys.argv
n_arguments = len(arguments)
if n_arguments != 3:
print 'usage: eclipse_diff file1.epf file2.epf'
sys.exit(COMMAND_SYNTAX_ERROR)
file1, file2 = open(sys.argv[1]), open(sys.argv[2])
prefs1 = file1.readlines()
file1.close()
prefs2 = file2.readlines()
file2.close()
length1, length2 = len(prefs1), len(prefs2)
if length1 == 0 or length2 == 0:
print 'both files must contain at least one configuration line'
sys.exit(EMPTY_PREFS_FILE_ERROR)
prefs1.sort()
prefs2.sort()
onlyin1 = []
onlyin2 = []
inboth = []
index1 = 0
index2 = 0
stop1 = False
stop2 = False
pause1 = False
pause2 = False
lastkey1 = None
lastkey2 = None
while True:
if not stop1 and not pause1:
line1 = prefs1[index1]
if not line1:
stop1 = True
else:
line1_left, line1_right = parse_prefs_line(line1)
if line1_left and line1_right:
onlyin1.append((line1_left, line1_right))
lastkey1 = line1_left
index1 += 1
if index1 == length1:
stop1 = True
if not stop2 and not pause2:
line2 = prefs2[index2]
if not line2:
stop2 = True
else:
line2_left, line2_right = parse_prefs_line(line2)
if line2_left and line2_right:
onlyin2.append((line2_left, line2_right))
lastkey2 = line2_left
index2 += 1
if index2 == length2:
stop2 = True
if lastkey1 and lastkey2:
if lastkey1 == lastkey2:
inboth.append(
(lastkey1,
(onlyin1.pop()[1], onlyin2.pop()[1]
)))
lastkey1 = onlyin1[len(onlyin1)-1][0] if onlyin1 else None
lastkey2 = onlyin2[len(onlyin2)-1][0] if onlyin2 else None
if not stop1:
if lastkey1 and not stop2 and lastkey1 > lastkey2:
pause1 = True
else:
pause1 = False
if not stop2:
if lastkey2 and not stop1 and lastkey2 > lastkey1:
pause2 = True
else:
pause2 = False
if stop1 and stop2:
break
for pref in onlyin1:
print pref[0], pref[1]
print
for pref in inboth:
print pref[0], pref[1]
print
for pref in onlyin2:
print pref[0], pref[1]
How can I improve this?
To make my question specific: I would like for it to execute in less cycles.
My other thought was to create a hash map of keys, add values to it, sort the keys and split the output based on how many values correspond to the key...
-
\$\begingroup\$ Where's the profiler output? \$\endgroup\$Ignacio Vazquez-Abrams– Ignacio Vazquez-Abrams2012年08月17日 03:52:44 +00:00Commented Aug 17, 2012 at 3:52
-
\$\begingroup\$ I don't know how to do that. \$\endgroup\$Robottinosino– Robottinosino2012年08月17日 03:53:13 +00:00Commented Aug 17, 2012 at 3:53
-
\$\begingroup\$ You are still working on your algorithm. Once you get an algorithm you are happy with, then optimize it, not before. Your "PS" comment sounds like a good direction to me. Give that a try. \$\endgroup\$wberry– wberry2012年08月17日 04:05:37 +00:00Commented Aug 17, 2012 at 4:05
-
\$\begingroup\$ Actually no, I have decided against the hash map in the end and this is what I have got as my algorithm. I think this is superior to the hash map approach. I may be mistaken. Do you think differently? Why? \$\endgroup\$Robottinosino– Robottinosino2012年08月17日 04:08:31 +00:00Commented Aug 17, 2012 at 4:08
-
\$\begingroup\$ @mhawke: a good pointer. I did not know about that... Still in Beta though, no wonder I did not have it at the top of my list. \$\endgroup\$Robottinosino– Robottinosino2012年08月17日 04:16:55 +00:00Commented Aug 17, 2012 at 4:16
1 Answer 1
Don't optimize unless your profiler says so.
You could start with the simplest code that works e.g., here's a straightforward translation of your requirements:
import sys
def get_entries(filename):
with open(filename) as file:
# extract 'key = value' entries
entries = (map(str.strip, line.partition('=')[::2]) for line in file)
#note: if keys are repeated the last value wins
# enforce non-empty values, skip comments
return {key: value for key, value in entries
if value and not key.startswith('#')}
if len(sys.argv) != 3:
sys.exit(2) # wrong number of arguments
d1, d2 = map(get_entries, sys.argv[1:])
if not (d1 and d2):
sys.exit(1) # no entries in a file
def print_entries(keys, d, d2=None):
for k in sorted(keys):
value = d[k] if d2 is None else "(%s, %s)" % (d[k], d2[k])
print k, value
print
print_entries(d1.viewkeys() - d2.viewkeys(), d1)
print_entries(d1.viewkeys() & d2.viewkeys(), d1, d2)
print_entries(d2.viewkeys() - d1.viewkeys(), d2)
You could compare results and the performance with your code.
You could also compare it the comm
command from coreutils:
$ comm <(sort file1) <(sort file2)