Build a dictionary based on split string

Question 1

I've written a simple script that takes a string filled with brackets and strings and builds them to a dictionary.

Example Input:

string_1 = '(()())'
string_2 = '((a)(b))'
string_3 = '(a(a)b(b)c)'
string_4 = 'a(a(a)b(b)c)b'
string_4 = 'a[a{a}b{b}c]b'
string_5 = 'beginning(content[word]{another word})outside'

Output for string5 and the other examples:

[
('0-8', 'beginning'),
('9', '('),
('10-16', 'content'),
('17', '['),
('18-21', 'word'),
('22', ']'),
('23', '{'),
('24-35', 'another word'),
('36', '}'),
('37', ')'),
('38-44', 'outside')
]

def sort(dictionary):
 return sorted(dictionary.items(), key=lambda v: int(v[0].split("-")[0]))
def join(a, b):
 return ''.join([repr(a), '-', repr(b - 1)]) if a != b else repr(b)
class Positions:
 def __init__(self, name):
 self.name = name
 self.content = {} # creates a new empty list for each bracket
 def push(self, pos, content):
 self.content[pos] = content
def convert_string(string):
 string_content = string
 string = Positions(string)
 pos = 0
 start_of_str_pos = 0
 internal_string = ''
 for char in string_content:
 if char in ['(', '{', '[', ')', '<', '}', ']', '>']:
 string.push(repr(pos), repr(char))
 if internal_string != '':
 string.push(join(start_of_str_pos, pos), internal_string)
 internal_string = ''
 start_of_str_pos = pos + 1
 else:
 internal_string += char
 pos += 1
 if internal_string != '':
 string.push(''.join([repr(start_of_str_pos),
 '-', repr(pos - 1)]), internal_string)
 print sort(string.content)

Question 2

Is this a simple implementation of my parentheses matching idea? It looks a little like it...

Question 3

1. Why is the output on Pastebin? 2. PEP-8 recommends lowercase_with_underscore function/variable names, so you're not yet fully compliant.

Question 4

Obvious but easy

There are some issues that are immediately obvious and easy to correct.

In this statement:

return ''.join([repr(a), '-', repr(b - 1)]) if a != b else repr(b)

The multiple repr are awkward, and so is the join. It would be less awkward, simpler and more readable by using string formatting:

return '{}-{}'.format(a, b - 1) if a != b else repr(a)

This can be written a lot simpler:

if char in ['(', '{', '[', ')', '<', '}', ']', '>']:

Like this:

if char in '({[)<}]>':

But actually, it's a bit disturbing that the opening and closing brackets are ordered inconsistently. For example, looking at this string, it's not immediately obvious that it includes all matching closing brackets for all opening brackets. This way would be immediately obvious and less error prone:

if char in '(){}[]<>':

This is both confusing and disturbing:

def convert_string(string):
 string_content = string
 string = Positions(string)

Why shuffle around the string parameter? It's a bad practice to reassign the value of a parameter, as it makes it confusing what it really stands for throughout the function.

Another problem here is calling an instance of Positions a "string". Later in the code when I read "string", I think it's an actually string, but in fact it's not, it's an instance of "Positions". Very misleading.

Since empty strings are falsy, instead of this:

if internal_string != '':
 # ...

You can write simply:

if internal_string:
 # ...

Obvious but harder

There are some issues that are sort of obviously wrong, but not so easy to correct.

This sorting function smells:

def sort(dictionary):
 return sorted(dictionary.items(), key=lambda v: int(v[0].split("-")[0]))

Instead of splitting a string and parsing to int, my instinct says there should be a way to use the original int values directly.

In general, the code formats these strings from int values, and it seems silly to later un-format the strings so you can sort. There should be a better way.

The Positions class stinks. It's called "Positions", but its fields are "name" and "content". Sounds more like a key-value pair.

Then there's this code with the comment:

 self.content = {} # creates a new empty list for each bracket

The comment talks about "list", but the code uses a dictionary.

And there's the push method:

def push(self, pos, content):
 self.content[pos] = content

push is a term commonly used with stacks, or sometimes with lists. But here the method puts a value into a dictionary, an operation normally called "put".

Alternative implementation

Based on the above suggestions, and reorganizing to more meaningful elements, you could do something like this:

from functools import total_ordering
@total_ordering
class Item:
 def __init__(self, start, end, content):
 self.start = start
 self.end = end
 self.content = content
 @property
 def position(self):
 a = self.start
 b = self.end
 return '{}-{}'.format(a, b - 1) if a != b else repr(a)
 def __lt__(self, other):
 return self.start < other.start
def convert_string(string):
 start = 0
 internal_string = ''
 items = []
 for (pos, char) in enumerate(string):
 if char in '(){}[]<>':
 items.append(Item(pos, pos, char))
 if internal_string:
 items.append(Item(start, pos, internal_string))
 internal_string = ''
 start = pos + 1
 else:
 internal_string += char
 if internal_string:
 items.append(Item(start, len(string), internal_string))
 return items

Although with the implementation of __lt__, I made it possible to sort simply by sorted(items), actually there is no need to sort, as the items are naturally sorted.

Note: the purpose of @total_ordering is to implement the other rich comparison operators like __le__, as recommended by PEP8.

Question 5

Just looking at the first few lines of convert_string()...

def convert_string(string):
 string_content = string
 string = Positions(string)
 pos = 0
 start_of_str_pos = 0
 ...

... and the last line...

 ...
 print sort(string.content)

... I'm pretty confused! Why are you renaming the string parameter to string_content? Then why is string not a string, but a Positions object? And that object's name is actually the string, whereas its content is a dictionary? On top of that, string_content looks confusingly like string.content, but it's actually the same as string.name???

Honestly, I'd rather rewrite it from scratch than try to figure out what you did. I'd use regular expressions, because they are a handy way to do string matching and the right tool for the job. Instead of printing the results in convert_string(), it would be better to return a data structure. And the data structure I would choose is collections.OrderedDict(), because it preserves the order in which the keys were added, so you don't have to sort the result.

from collections import OrderedDict
import re
def convert_string(s):
 result = OrderedDict()
 for match in re.finditer(r'[^<({\[\]})>]+|(?P<delim>.)', s):
 if match.group('delim'):
 result[str(match.start())] = match.group()
 else:
 result['%d-%d' % (match.start(), match.end() - 1)] = match.group()
 return result

Question 6

You have an odd mix of class and function. Rather than a standalone function that creates a Positions object from a string, for example, I would make that a class method:

class Positions(object): # note new-style class
 CHARS = set('()[]{}<>') # class attribute, uses set to speed up 'in'
 ...
 @classmethod
 def from_string(cls, string):
 ...

Now rather than calling convert_string(foo), you can get a Positions instance with Positions.from_string(foo). Similarly, you could move the join and sort functionality into the class, as they're all connected to each other. You should move the "pretty print" formatting into a __str__ method.

Your sort function has a problem; it only works for the first "level". Rather than encapsulate the whole sort in a function, I would write a custom sort key function, then replace e.g.

print sort(string.content)

with:

print sorted(string.content, key=sort_key)

For example, you could use map to apply int to all of the parts of the key:

def sort_key(item):
 """Convert string for multi-level numerical sort.
 >>> sort_key(('1-2-3', None))
 [1, 2, 3]
 """
 key, _ = item
 return map(int, key.split('-'))

Or, alternatively, use a tuple of integers ((1, 2, 3), in this case) as the key rather than a string.

Note the docstring to explain what the function does and provide built-in demonstration/testing with doctest. A docstring would be particularly useful in join, which does some odd things I couldn't quite figure out and would really benefit from explanation. As it stands, your script only has a single explanatory comment:

self.content = {} # creates a new empty list for each bracket

which isn't actually true!

janos janos 113k15 gold badges154 silver badges396 bronze badges · Accepted Answer · 2015-05-24 09:16:21Z

Obvious but easy

There are some issues that are immediately obvious and easy to correct.

In this statement:

return ''.join([repr(a), '-', repr(b - 1)]) if a != b else repr(b)

The multiple repr are awkward, and so is the join. It would be less awkward, simpler and more readable by using string formatting:

return '{}-{}'.format(a, b - 1) if a != b else repr(a)

This can be written a lot simpler:

if char in ['(', '{', '[', ')', '<', '}', ']', '>']:

Like this:

if char in '({[)<}]>':

But actually, it's a bit disturbing that the opening and closing brackets are ordered inconsistently. For example, looking at this string, it's not immediately obvious that it includes all matching closing brackets for all opening brackets. This way would be immediately obvious and less error prone:

if char in '(){}[]<>':

This is both confusing and disturbing:

def convert_string(string):
 string_content = string
 string = Positions(string)

Why shuffle around the string parameter? It's a bad practice to reassign the value of a parameter, as it makes it confusing what it really stands for throughout the function.

Another problem here is calling an instance of Positions a "string". Later in the code when I read "string", I think it's an actually string, but in fact it's not, it's an instance of "Positions". Very misleading.

Since empty strings are falsy, instead of this:

if internal_string != '':
 # ...

You can write simply:

if internal_string:
 # ...

Obvious but harder

There are some issues that are sort of obviously wrong, but not so easy to correct.

This sorting function smells:

def sort(dictionary):
 return sorted(dictionary.items(), key=lambda v: int(v[0].split("-")[0]))

Instead of splitting a string and parsing to int, my instinct says there should be a way to use the original int values directly.

In general, the code formats these strings from int values, and it seems silly to later un-format the strings so you can sort. There should be a better way.

The Positions class stinks. It's called "Positions", but its fields are "name" and "content". Sounds more like a key-value pair.

Then there's this code with the comment:

 self.content = {} # creates a new empty list for each bracket

The comment talks about "list", but the code uses a dictionary.

And there's the push method:

def push(self, pos, content):
 self.content[pos] = content

push is a term commonly used with stacks, or sometimes with lists. But here the method puts a value into a dictionary, an operation normally called "put".

Alternative implementation

Based on the above suggestions, and reorganizing to more meaningful elements, you could do something like this:

from functools import total_ordering
@total_ordering
class Item:
 def __init__(self, start, end, content):
 self.start = start
 self.end = end
 self.content = content
 @property
 def position(self):
 a = self.start
 b = self.end
 return '{}-{}'.format(a, b - 1) if a != b else repr(a)
 def __lt__(self, other):
 return self.start < other.start
def convert_string(string):
 start = 0
 internal_string = ''
 items = []
 for (pos, char) in enumerate(string):
 if char in '(){}[]<>':
 items.append(Item(pos, pos, char))
 if internal_string:
 items.append(Item(start, pos, internal_string))
 internal_string = ''
 start = pos + 1
 else:
 internal_string += char
 if internal_string:
 items.append(Item(start, len(string), internal_string))
 return items

Although with the implementation of __lt__, I made it possible to sort simply by sorted(items), actually there is no need to sort, as the items are naturally sorted.

Note: the purpose of @total_ordering is to implement the other rich comparison operators like __le__, as recommended by PEP8.

Stack Exchange Network

Build a dictionary based on split string

3 Answers 3

Obvious but easy

Obvious but harder

Alternative implementation

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Build a dictionary based on split string

3 Answers 3

Obvious but easy

Obvious but harder

Alternative implementation

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions