"Compare" program for Eclipse preference files

Question 1

I am trying to write a simple (trivial?) "compare" program for Eclipse preferences files.

Eclipse preferences files take more of less this form:

# optional comment line
/a/sequence/of/path/elements=string
/another/sequence/of/path/elements=42
# ... (possibly repeated)

Let's call the path sequences "keys" and what follows the = sign "values".

The rules of the program should be:

Exit upon detecting an invalid # of arguments (must be 2)
argument1 and argument2 are the files to compare
Exit if any of the input files are empty
One preference entry per line
Lines with a path key will have a path value, guaranteed

The output should be as follows:

Part 1

All lines from argument1 which have keys not present in argument2, like so:
```
/path 42
/path2 banana
```
Blank line

Part 2

All keys that are present in both argument1 and argument2, like so:

/shared (valueInArgument1, valueInArgument2)
/shared2 (valueInArgument1, valueInArgument2)

Blank line

Part 3

All lines from argument2 which have keys not present in argument1, like so:
```
/path3 24
/path4 ananab
```

Notice Part 1, Part 2, and Part 3 must all be sorted.

I would like to get something:

Pythonic
Efficient (do as little work as possible computationally, use as little space as possible)
Clear and readable from a logical point of view
Correct (gets the right result even in edge cases)
Instructive (uses data structures and algorithms properly)

Here's my attempt:

#!/usr/bin/env /opt/local/bin/python2.7
import sys
import os
import re
import pprint
COMMAND_SYNTAX_ERROR = 2
EMPTY_PREFS_FILE_ERROR = 3
PREFS_REGEX_PATTERN = '^(.*?)=(.*)$'
PREFS_REGEX = re.compile(PREFS_REGEX_PATTERN)
def parse_prefs_line(line):
 regex_result = PREFS_REGEX.match(line)
 if not regex_result:
 return None, None
 return regex_result.group(1), regex_result.group(2)
arguments = sys.argv
n_arguments = len(arguments)
if n_arguments != 3:
 print 'usage: eclipse_diff file1.epf file2.epf'
 sys.exit(COMMAND_SYNTAX_ERROR)
file1, file2 = open(sys.argv[1]), open(sys.argv[2])
prefs1 = file1.readlines()
file1.close()
prefs2 = file2.readlines()
file2.close()
length1, length2 = len(prefs1), len(prefs2)
if length1 == 0 or length2 == 0:
 print 'both files must contain at least one configuration line'
 sys.exit(EMPTY_PREFS_FILE_ERROR)
prefs1.sort()
prefs2.sort()
onlyin1 = []
onlyin2 = []
inboth = []
index1 = 0
index2 = 0
stop1 = False
stop2 = False
pause1 = False
pause2 = False
lastkey1 = None
lastkey2 = None
while True:
 if not stop1 and not pause1:
 line1 = prefs1[index1]
 if not line1:
 stop1 = True
 else:
 line1_left, line1_right = parse_prefs_line(line1)
 if line1_left and line1_right:
 onlyin1.append((line1_left, line1_right))
 lastkey1 = line1_left
 index1 += 1
 if index1 == length1:
 stop1 = True
 if not stop2 and not pause2:
 line2 = prefs2[index2]
 if not line2:
 stop2 = True
 else:
 line2_left, line2_right = parse_prefs_line(line2)
 if line2_left and line2_right:
 onlyin2.append((line2_left, line2_right))
 lastkey2 = line2_left
 index2 += 1
 if index2 == length2:
 stop2 = True
 if lastkey1 and lastkey2:
 if lastkey1 == lastkey2:
 inboth.append(
 (lastkey1,
 (onlyin1.pop()[1], onlyin2.pop()[1]
 )))
 lastkey1 = onlyin1[len(onlyin1)-1][0] if onlyin1 else None
 lastkey2 = onlyin2[len(onlyin2)-1][0] if onlyin2 else None
 if not stop1:
 if lastkey1 and not stop2 and lastkey1 > lastkey2:
 pause1 = True
 else:
 pause1 = False
 if not stop2:
 if lastkey2 and not stop1 and lastkey2 > lastkey1:
 pause2 = True
 else:
 pause2 = False
 if stop1 and stop2:
 break
for pref in onlyin1:
 print pref[0], pref[1]
print
for pref in inboth:
 print pref[0], pref[1]
print
for pref in onlyin2:
 print pref[0], pref[1]

How can I improve this?

To make my question specific: I would like for it to execute in less cycles.

My other thought was to create a hash map of keys, add values to it, sort the keys and split the output based on how many values correspond to the key...

Question 2

Where's the profiler output?

Question 3

I don't know how to do that.

Question 4

You are still working on your algorithm. Once you get an algorithm you are happy with, then optimize it, not before. Your "PS" comment sounds like a good direction to me. Give that a try.

Question 5

Actually no, I have decided against the hash map in the end and this is what I have got as my algorithm. I think this is superior to the hash map approach. I may be mistaken. Do you think differently? Why?

Question 6

@mhawke: a good pointer. I did not know about that... Still in Beta though, no wonder I did not have it at the top of my list.

Question 7

Don't optimize unless your profiler says so.

You could start with the simplest code that works e.g., here's a straightforward translation of your requirements:

import sys
def get_entries(filename):
  with open(filename) as file:
    # extract 'key = value' entries
    entries = (map(str.strip, line.partition('=')[::2]) for line in file)
    #note: if keys are repeated the last value wins
    # enforce non-empty values, skip comments
    return {key: value for key, value in entries
        if value and not key.startswith('#')}
if len(sys.argv) != 3:
  sys.exit(2) # wrong number of arguments
d1, d2 = map(get_entries, sys.argv[1:])
if not (d1 and d2):
  sys.exit(1) # no entries in a file
def print_entries(keys, d, d2=None):
  for k in sorted(keys):
    value = d[k] if d2 is None else "(%s, %s)" % (d[k], d2[k])
    print k, value
  print
print_entries(d1.viewkeys() - d2.viewkeys(), d1)
print_entries(d1.viewkeys() & d2.viewkeys(), d1, d2)
print_entries(d2.viewkeys() - d1.viewkeys(), d2)

You could compare results and the performance with your code.

You could also compare it the comm command from coreutils:

$ comm <(sort file1) <(sort file2)

jfs jfs 7333 silver badges12 bronze badges · Accepted Answer · 2012-08-17 07:29:57Z

Don't optimize unless your profiler says so.

You could start with the simplest code that works e.g., here's a straightforward translation of your requirements:

import sys
def get_entries(filename):
  with open(filename) as file:
    # extract 'key = value' entries
    entries = (map(str.strip, line.partition('=')[::2]) for line in file)
    #note: if keys are repeated the last value wins
    # enforce non-empty values, skip comments
    return {key: value for key, value in entries
        if value and not key.startswith('#')}
if len(sys.argv) != 3:
  sys.exit(2) # wrong number of arguments
d1, d2 = map(get_entries, sys.argv[1:])
if not (d1 and d2):
  sys.exit(1) # no entries in a file
def print_entries(keys, d, d2=None):
  for k in sorted(keys):
    value = d[k] if d2 is None else "(%s, %s)" % (d[k], d2[k])
    print k, value
  print
print_entries(d1.viewkeys() - d2.viewkeys(), d1)
print_entries(d1.viewkeys() & d2.viewkeys(), d1, d2)
print_entries(d2.viewkeys() - d1.viewkeys(), d2)

You could compare results and the performance with your code.

You could also compare it the comm command from coreutils:

$ comm <(sort file1) <(sort file2)

Stack Exchange Network

"Compare" program for Eclipse preference files

Part 1

Part 2

Part 3

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

"Compare" program for Eclipse preference files

Part 1

Part 2

Part 3

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions