Check consistency of a list of statements, with fuzzy rhyme matching

Question 1

I made a program today that solves the following problem on a programming competition site (open.kattis.com): Check if all statements with the following format X is Y or X not Y are consistent. X and Y are words of the length 1-20 and a-z.

The catch is that every word that rhymes with X in any provided statement is said to be equivalent to X. For simplicity it is said that two words rhyme if the last min(3, |X|, |Y|) character are the same, so for example foo and boo would be the same word but not skoo and troo). Also words which do not appear in any statement do not exist, and should not be considered rhyming with anything. For example, the two words foo and moo do not rhyme with each other, unless the word oo or o exists in the input.

So to break it down, here are the rules:

Check if all statements of the form "X is Y" or "X not Y" is consistent.
Two words rhyme if the last min(3, |X|, |Y|) characters are the same. So if this condition holds for two words X and Y that are in the list of statements, it is equivalent to having a statement "X is Y".
Words which do not appear in any statement do not exist, and should not be considered rhyming with anything

The first line consists of an integer 0≤N≤100000, the number of statements. The next N lines consists of statements of the two given forms.

Output should the string "yes", if the statements are consistent with each other or if there is a contradiction the output should be "wait what?".

The problem provides 4 sample inputs and corresponding outputs:

4
herp is derp
derp is herp
herp is herp
derp is derp

Output to the input above should be yes.

3
oskar not lukas
oskar is poptart
lukas is smart

... should output wait what?

1
moo not foo

... should output yes

2
moo not foo
oo is blah

... should output wait what?

Here is my program that should hopefully be commented well enough to understand:

import sys
# read all input
indata = sys.stdin.readlines()
inlen = int(indata[0])
# return yes if the first line was 0
if inlen == 0:
 print("yes")
 exit(0)
Xs = []
Ys = []
equivalent = {}
words = []
# build lists of words in data
for i in range(inlen):
 item = indata[i+1]
 parts = item.split(" ")
 word1 = parts[0].strip()
 word2 = parts[2].strip()
 Xs.append(word1)
 Ys.append(word2)
 words.append(word1)
 words.append(word2)
 # add word in equivalent list if not present
 if word1 not in equivalent:
 equivalent[word1] = [word1]
 if word2 not in equivalent:
 equivalent[word2] = [word2]
# check which words rhyme and add to equivalent list
for i in range(len(words)):
 word1 = words[i].strip()
 for j in range(len(words)):
 word2 = words[j]
 if word1 == word2:
 continue
 endinglen = min(3, len(word1), len(word2)) # min(3, |X|, |Y|)
 ending1 = word1[-endinglen:]
 ending2 = word2[-endinglen:]
 if ending1 == ending2:
 equivalent[word1].append(word2)
 equivalent[word2].append(word1)
# check if any word in eq1 matches any word in eq2
def any_equals(eq1, eq2):
 for i in eq1:
 for j in eq2:
 if i == j:
 return True
 return False
# build the statements data structure
statements = {}
for i in range(inlen):
 item = indata[i+1]
 parts = item.split(" ")
 X = parts[0].strip()
 b = True if parts[1] == "is" else False
 Y = parts[2].strip()
 if X not in statements:
 statements[X] = []
 # this tuiple is of the format (boolean, word) where boolean means if X is equal to or not equal to Y
 statements[X].append((b, Y))
# check consistency of statements
for i in range(inlen):
 item = indata[i+1]
 parts = item.split(" ")
 X = parts[0].strip()
 Y = parts[2].strip()
 for eq1 in equivalent[X]:
 if eq1 not in statements:
 continue
 for key in statements[eq1]:
 equals = any_equals(equivalent[X], equivalent[Y])
 # we check if the statement is true for all words that rhyme with X
 if (key[0] and equals) or (not key[0] and not equals):
 continue
 else:
 print("wait what?") # the statement was not consistent
 sys.exit(0)
print("yes") # all statements were consistent

I think due to the fact I have two nested loops the time complexity is O(n^2) but I think there is a O(n) algorithm but I can't figure it out. Any tips on how to improve my code to run faster would be much appreciated!

Question 2

General remarks

Organizing your code into functions is a good thing. Unfortunately, you only have one function, any_equals(), and its definition is buried in the middle of a sea of sequential statements, which makes it weird and hard to find. As it turns out, your function
```
# check if any word in eq1 matches any word in eq2
def any_equals(eq1, eq2):
 for i in eq1:
 for j in eq2:
 if i == j:
 return True
 return False
```
... would be better written as
```
def any_equals(eq1, eq2):
 """Check if any element of eq1 equals any element of eq2"""
 return not(set.isdisjoint(set(eq1), set(eq2)))
```
... with a docstring, and a data structure that takes advantage of hashing to avoid O(n²) performance. With an implementation that is so succinct, it might not be worth defining as a function after all.

So, basically, you have no functions in your code. It's hard to analyze eighty-something lines of code all at once without having it broken down into smaller self-contained chunks. There is also no way to get an overview of how your program works.
Code should be compartmentalized and composable. This is just a corollary to the first point. Say, hypothetically, that we no longer cared about rhyming. With a well designed program, you should be able to easily disable the code that implements rhyming. That's not the case with your code — your equivalent dictionary is mentioned all over the place.
Trivial special cases aren't worth the trouble. Why bother writing this?
```
# return yes if the first line was 0
if inlen == 0:
 print("yes")
 exit(0)
```
It's a very unlikely case — basically dead code. Furthermore, the existence of the special case makes it hard for you to verify whether your code works in the general case.

A more interesting special case, if you wanted to code it, is that any ruleset with no "not" rules is automatically consistent.
Avoid parsing the input repeatedly. You have not just two, but three loops that look like this:
```
for i in range(inlen):
 item = indata[i+1]
 parts = item.split(" ")
 X = parts[0].strip()
 Y = parts[2].strip()
```
That means that you failed to parse the input into a good representation. You should be able to read the input in only one pass.

Algorithms and efficiency

Building the equivalent structure for rhymes

This is the code you use to read the rules and note any rhyming equivalence:
```
# build lists of words in data
for i in range(inlen):
 item = indata[i+1]
 parts = item.split(" ")
 word1 = parts[0].strip()
 word2 = parts[2].strip()
 Xs.append(word1)
 Ys.append(word2)
 words.append(word1)
 words.append(word2)
 # add word in equivalent list if not present
 if word1 not in equivalent:
 equivalent[word1] = [word1]
 if word2 not in equivalent:
 equivalent[word2] = [word2]
# check which words rhyme and add to equivalent list
for i in range(len(words)):
 word1 = words[i].strip()
 for j in range(len(words)):
 word2 = words[j]
 if word1 == word2:
 continue
 endinglen = min(3, len(word1), len(word2)) # min(3, |X|, |Y|)
 ending1 = word1[-endinglen:]
 ending2 = word2[-endinglen:]
 if ending1 == ending2:
 equivalent[word1].append(word2)
 equivalent[word2].append(word1)
```
The result, I believe, is a dictionary, whose keys are every word encountered, and whose values are a list of words that rhyme with it. That's not what you really want, though. You want a dictionary that maps each word to the shortest word that rhymes with it. In other words, you want one canonical representation of each word; you don't want to expand each word into more alternative representations.

You can actually take a shortcut here. Only the last three letters of each word matter. You might as well immediately ignore the first few characters of any longer word. One way to accomplish would be using a regular expression match, instead of splitting on spaces.

Note that the second loop above, with i and j, is O(N²).

As a nitpick: word1 = words[i].strip() is not appropriate. The time for sanitizing the input is when you read it. You already stripped them earlier anyway. And if you had done item.split(), it would have split on any whitespace, such that you wouldn't even have to strip anything.
Building the statements structure
```
# build the statements data structure
statements = {}
for i in range(inlen):
 item = indata[i+1]
 parts = item.split(" ")
 X = parts[0].strip()
 b = True if parts[1] == "is" else False
 Y = parts[2].strip()
 if X not in statements:
 statements[X] = []
 # this tuiple is of the format (boolean, word) where boolean means if X is equal to or not equal to Y
 statements[X].append((b, Y))
```
statements[X].append((b, Y)) results in a weird data structure. The most important aspect of a statement is whether it is an "is" rule or a "not" rule, so I would classify the statements primarily according to that dichotomy. The order of the statements does not matter, nor does it matter whether each word appears on the left or right. So, it seems arbitrary that statements is keyed by X, when in fact X and Y should be treated symmetrically.
Statement consistency check
```
# check consistency of statements
for i in range(inlen):
 item = indata[i+1]
 parts = item.split(" ")
 X = parts[0].strip()
 Y = parts[2].strip()
 for eq1 in equivalent[X]:
 if eq1 not in statements:
 continue
 for key in statements[eq1]:
 equals = any_equals(equivalent[X], equivalent[Y])
 # we check if the statement is true for all words that rhyme with X
 if (key[0] and equals) or (not key[0] and not equals):
 continue
 else:
 print("wait what?") # the statement was not consistent
 sys.exit(0)
```
The most important line here is if (key[0] and equals) or (not key[0] and not equals) — and unfortunately it is buried quite deep in the loops.

The right strategy to use, I believe, is to transform the ruleset twice: once by rhyming equivalence, then by "is" equivalence. Then, examine the "not" rules, and see if any of them is of the form X not X, which is an indication of a contradiction.

So, the consistency check should primarily iterate through the "not" rules. (This observation leads to the insight I pointed out above: that any ruleset without any "not" rules is automatically self-consistent.)

General remarks

Organizing your code into functions is a good thing. Unfortunately, you only have one function, any_equals(), and its definition is buried in the middle of a sea of sequential statements, which makes it weird and hard to find. As it turns out, your function
```
# check if any word in eq1 matches any word in eq2
def any_equals(eq1, eq2):
 for i in eq1:
 for j in eq2:
 if i == j:
 return True
 return False
```
... would be better written as
```
def any_equals(eq1, eq2):
 """Check if any element of eq1 equals any element of eq2"""
 return not(set.isdisjoint(set(eq1), set(eq2)))
```
... with a docstring, and a data structure that takes advantage of hashing to avoid O(n²) performance. With an implementation that is so succinct, it might not be worth defining as a function after all.

So, basically, you have no functions in your code. It's hard to analyze eighty-something lines of code all at once without having it broken down into smaller self-contained chunks. There is also no way to get an overview of how your program works.
Code should be compartmentalized and composable. This is just a corollary to the first point. Say, hypothetically, that we no longer cared about rhyming. With a well designed program, you should be able to easily disable the code that implements rhyming. That's not the case with your code — your equivalent dictionary is mentioned all over the place.
Trivial special cases aren't worth the trouble. Why bother writing this?
```
# return yes if the first line was 0
if inlen == 0:
 print("yes")
 exit(0)
```
It's a very unlikely case — basically dead code. Furthermore, the existence of the special case makes it hard for you to verify whether your code works in the general case.

A more interesting special case, if you wanted to code it, is that any ruleset with no "not" rules is automatically consistent.
Avoid parsing the input repeatedly. You have not just two, but three loops that look like this:
```
for i in range(inlen):
 item = indata[i+1]
 parts = item.split(" ")
 X = parts[0].strip()
 Y = parts[2].strip()
```
That means that you failed to parse the input into a good representation. You should be able to read the input in only one pass.

Algorithms and efficiency

Building the equivalent structure for rhymes

This is the code you use to read the rules and note any rhyming equivalence:
```
# build lists of words in data
for i in range(inlen):
 item = indata[i+1]
 parts = item.split(" ")
 word1 = parts[0].strip()
 word2 = parts[2].strip()
 Xs.append(word1)
 Ys.append(word2)
 words.append(word1)
 words.append(word2)
 # add word in equivalent list if not present
 if word1 not in equivalent:
 equivalent[word1] = [word1]
 if word2 not in equivalent:
 equivalent[word2] = [word2]
# check which words rhyme and add to equivalent list
for i in range(len(words)):
 word1 = words[i].strip()
 for j in range(len(words)):
 word2 = words[j]
 if word1 == word2:
 continue
 endinglen = min(3, len(word1), len(word2)) # min(3, |X|, |Y|)
 ending1 = word1[-endinglen:]
 ending2 = word2[-endinglen:]
 if ending1 == ending2:
 equivalent[word1].append(word2)
 equivalent[word2].append(word1)
```
The result, I believe, is a dictionary, whose keys are every word encountered, and whose values are a list of words that rhyme with it. That's not what you really want, though. You want a dictionary that maps each word to the shortest word that rhymes with it. In other words, you want one canonical representation of each word; you don't want to expand each word into more alternative representations.

You can actually take a shortcut here. Only the last three letters of each word matter. You might as well immediately ignore the first few characters of any longer word. One way to accomplish would be using a regular expression match, instead of splitting on spaces.

Note that the second loop above, with i and j, is O(N²).

As a nitpick: word1 = words[i].strip() is not appropriate. The time for sanitizing the input is when you read it. You already stripped them earlier anyway. And if you had done item.split(), it would have split on any whitespace, such that you wouldn't even have to strip anything.
Building the statements structure
```
# build the statements data structure
statements = {}
for i in range(inlen):
 item = indata[i+1]
 parts = item.split(" ")
 X = parts[0].strip()
 b = True if parts[1] == "is" else False
 Y = parts[2].strip()
 if X not in statements:
 statements[X] = []
 # this tuiple is of the format (boolean, word) where boolean means if X is equal to or not equal to Y
 statements[X].append((b, Y))
```
statements[X].append((b, Y)) results in a weird data structure. The most important aspect of a statement is whether it is an "is" rule or a "not" rule, so I would classify the statements primarily according to that dichotomy. The order of the statements does not matter, nor does it matter whether each word appears on the left or right. So, it seems arbitrary that statements is keyed by X, when in fact X and Y should be treated symmetrically.
Statement consistency check
```
# check consistency of statements
for i in range(inlen):
 item = indata[i+1]
 parts = item.split(" ")
 X = parts[0].strip()
 Y = parts[2].strip()
 for eq1 in equivalent[X]:
 if eq1 not in statements:
 continue
 for key in statements[eq1]:
 equals = any_equals(equivalent[X], equivalent[Y])
 # we check if the statement is true for all words that rhyme with X
 if (key[0] and equals) or (not key[0] and not equals):
 continue
 else:
 print("wait what?") # the statement was not consistent
 sys.exit(0)
```
The most important line here is if (key[0] and equals) or (not key[0] and not equals) — and unfortunately it is buried quite deep in the loops.

The right strategy to use, I believe, is to transform the ruleset twice: once by rhyming equivalence, then by "is" equivalence. Then, examine the "not" rules, and see if any of them is of the form X not X, which is an indication of a contradiction.

So, the consistency check should primarily iterate through the "not" rules. (This observation leads to the insight I pointed out above: that any ruleset without any "not" rules is automatically self-consistent.)

Stack Exchange Network

Check consistency of a list of statements, with fuzzy rhyme matching

1 Answer 1

General remarks

Algorithms and efficiency

Suggested solution

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Check consistency of a list of statements, with fuzzy rhyme matching

1 Answer 1

General remarks

Algorithms and efficiency

Suggested solution

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions