Correct logic for string comparison in Wordle

Question 1

I'm a beginner trying to make a Wordle clone in python. I already have most of it sorted out but am concerned about my code being difficult to read. The part I'm struggling with the most is the algorithm for how guesses are displayed.

An easily overlooked aspect of the original game is that it tries to make the player's life easier by hiding additional instances of a character in a guess whenever all instances of that letter have been found. That is, if the solution was "walls", and the player inputs "lulls", the last three characters would show as green as expected, but the first 'L' would be greyed out to indicate that all instances of that letter have been found (only for that line, subsequent lines/guesses reset this).

This may seem obvious if you've played the game a lot, but many clones don't do this, which has the side effect of making the game harder. This answer to another wordle related question touches on this.

I'm trying to implement this functionality in the function that compares player input with the solution in my version, the relevant section of which currently looks like this:

from collections import Counter, defaultdict
green = 3
yellow = 2
grey = 1
def compare_word(string, solution):
 def clear_yellow_letters():
 # use counter to count instaces
 # of a letter in each string
 str_c_count = Counter(string)
 sol_c_count = Counter(solution)
 for k, v in c_green_count.items():
 if v == sol_c_count[k]:
 # if there are more instances of letter than
 # have been located in solution, grey-out
 # additional instances
 if str_c_count[k] > sol_c_count[k]:
 for i in range(len(color_dic[0])):
 char = color_dic[0][i]
 if char == k and color_dic[1][i] == yellow:
 color_dic[1][i] = grey
 return color_dic
 # color_dic is actually a list >_>. since the same letter can appear
 # multiple times in a word, we prefer to use the index as a key
 # instead of a real dictionary, where repeated letters would all
 # point to the same color regardless of location.
 color_dic = string, [0 for c in string]
 c_green_count = defaultdict(int) # creates keys if they don't exist
 # this loop compares the chars in string and solution based on index,
 # and assigns a color to the respective position in the array.
 for i in range(len(color_dic[0])):
 char = color_dic[0][i]
 if char == solution[i]:
 color = green
 c_green_count[char] += 1
 elif char in solution:
 color = yellow
 else:
 color = grey
 color_dic[1][i] = color
 if c_green_count.items():
 color_dic = clear_yellow_letters()
 return color_dic
print(compare_word('lulls', 'walls'))

This prints a tuple containing a string, and a list of ints referencing each color used for displaying guesses. Note that this use of print is just for demonstration, the actual output is handled in curses (that's what the numbers are for).

Here clear_yellow_letters() does what I just described. Without it compare_word() would return the int that maps to l in our string as 2 (yellow). While this works as it is, I feel there must be a simpler way to do it, as that section is a bit difficult to follow, even with all the comments.

Another thing I'm confused about is code style/formatting. This is only an excerpt of my full project, which I've split into several modules due to its length, but one issue I keep coming across is how to know when to break-down functions into smaller functions and when to not. With compare_word() for example, clear_yellow_letters() is only called inside of it. Should I give the latter any parameters if all the vars it uses exist in the same scope, or is it okay to use the same names in this case? Would it be best to define it as a separate function for legibility?

I hope this excerpt is sufficient for understanding the logic for how I'm trying to get the words displayed, I am not including the whole thing because as I mentioned, it's rather long, but if more context is needed I'll oblige. I'd appreciate any suggestions on what could be improved.

Question 2

First of all, I think it's great that you ask for feedback on a well isolated and explained piece of code. I also like your attention to detail in the rules of the game.

I also find this code complex, I think for two main reasons:

The algorithm itself is a bit complex
The implementation of the algorithm is a bit complex

Simplifying the algorithm

The current algorithm goes something like this:

In a 1st pass, check the letters pairwise:
- if the pair is a match:
  - mark the position green
  - update the count of matches of this letter
- or else if the guess exists somewhere else in the solution:
  - mark the position yellow
- or else mark the position grey
If there were any matches, make a 2nd pass to clear some yellows
- for each matched letter and its count
  - if all instances of this letter were matched
    - if the guess contains more instances of this letter
      - find all unmatched positions of this letter and mark them grey

Consider a simpler alternative:

Initialize all positions to grey
Compute a count of all letters in the solution
In a 1st pass, check the letters pairwise:
- if the pair is a match:
  - mark the position green
  - update the count of matches of this letter
In a 2st pass, check the letters pairwise:
- if the pair is not a match, and the solution contains unmatched positions of this letter
  - mark the position yellow

That's fewer steps, and also simpler to implement: (Using the Status enum idea from the answer of @Reinderien)

class Status(Enum):
 MATCH = 3
 MISPLACE = 2
 MISMATCH = 1
status = [Status.MISMATCH] * len(guess)
counts_in_solution = Counter(solution)
matches = defaultdict(int)
for index, (g, s) in enumerate(zip(guess, solution)):
 if g == s:
 status[index] = Status.MATCH
 matches[g] += 1
for index, (g, s) in enumerate(zip(guess, solution)):
 if g != s and matches[g] < counts_in_solution[g]:
 status[index] = Status.MISPLACE
return [s.value for s in status]

Simplifying the implementation

Even if we don't change the main algorithm, the implementation of the original algorithm can be written simpler.

First of all, some of the variable names don't help understanding the logic.

string would describe itself better as guess
- Note that string was also a poor name because it shadows a common Python package
the color codes would be better in all-caps
c_green_count would be less cryptic as green_counts or matches

Then there is color_dic...

color_dic = string, [0 for c in string]

In a comment you had the right instinct: "color_dic is actually a list". Even further:

color_dic is a tuple, and the comment was probably referring to the 2nd element only
color_dic was not useful at all:
- the first value is just the guess, and there is no need to store it in another tuple
- the second value is probably just what you intended for color_dic. A better name would be colors.

Even with just these renames, and the minor transformation of dropping color_dic, I think the code already looks a bit less complex:

matches = defaultdict(int)
colors = [0 for c in guess]
for i in range(len(guess)):
 char = guess[i]
 if char == solution[i]:
 color = GREEN
 matches[char] += 1
 elif char in solution:
 color = YELLOW
 else:
 color = GREY
 colors[i] = color
if matches.items():
 clear_yellow_letters()
return guess, colors

Next, let's apply some idiomatic transformations:

colors = [GREY] * len(guess)
for index, [g, s] in enumerate(zip(guess, solution)):
 if g == s:
 colors[index] = GREEN
 matches[g] += 1
 elif g in solution:
 colors[index] = YELLOW
if matches:
 clear_yellow_letters()

That is:

To create a list of the same value, you can use [the_value] * count
It's good to initialize colors to all grey because:
- It's a valid value, unlike the original 0
- It's a sensible default
- This way we don't need to overwrite values to grey later
Instead iterating over indexes, enumerate lets us iterate over (index, value) pairs
Instead using index to get pairs of characters from guess and solution, we can use zip to line them up
Instead of if matches.items() it's equivalent and shorter to write if matches

For the 2nd pass, again I would start with renaming variables that look cryptic and hard to understand, to names that seem more natural and descriptive:

str_c_count -> counts_in_guess
sol_c_count -> counts_in_solution
k, v -> letter, count

With the above renames, and other minor transformations we get:

def clear_yellow_letters():
 counts_in_guess = Counter(guess)
 counts_in_solution = Counter(solution)
 for letter, count in matches.items():
 if count == counts_in_solution[letter] and counts_in_guess[letter] > count:
 for index, g in enumerate(guess):
 if g == letter and colors[index] == YELLOW:
 colors[index] = GREY

Where the other transformations are:

Joined the two ifs in the outer loop: this is equivalent, and looks simpler
Replaced the second reference to counts_in_solution[letter] with count, because at that point we know they are the same
Using enumerate to loop over index + value pairs
Not returning colors, since we can manipulate the variable in the context

Having a for loop nested within another for loop still looks complex of course. If you think about it, the inner loop is clumsy, because what we don't really want to iterate over all the letters in guess, what we really want is operate on the matched letters only.

In other words, currently we're iterating over dictionary entries and then doing a search in a list. If we reverse these operations we would iterate over a list and then do a search in a dictionary, which is easier and faster too:

def clear_yellow_letters():
 counts_in_solution = Counter(solution)
 for index, (g, s) in enumerate(zip(guess, solution)):
 if g != s and matches[g] == counts_in_solution[g]:
 colors[index] = GREY

Lastly, since this function doesn't really buy us much, we as well inline it, arriving at something that looks simpler, with simple equivalent transformations, and a very small change in the algorithm:

def compare_word(guess, solution):
 matches = defaultdict(int)
 colors = [GREY] * len(guess)
 for index, (g, s) in enumerate(zip(guess, solution)):
 if g == s:
 colors[index] = GREEN
 matches[g] += 1
 elif g in solution:
 colors[index] = YELLOW
 if matches:
 counts_in_solution = Counter(solution)
 for index, (g, s) in enumerate(zip(guess, solution)):
 if g != s and matches[g] == counts_in_solution[g]:
 colors[index] = GREY
 return guess, colors

Defining functions inside functions

Defining functions inside functions is reasonable when:

The function is not useful outside
The function is called more than once

In the posted code you only call the inner function once, so I think it doesn't add value.

And when defining inner functions, beware of accidentally modifying variables that our defined in their outer scope. Code editors like PyCharm will give a warning for variables in inner functions whose names shadow variables in the outer scope, which can be confusing for readers, and a source of mistakes.

Question 3

Greatly appreciate the answer :). I have one additional question, is there any advantage to declaring Status as an Enum if I plan on using its values as ints anyway? Are they preferable solely because they're constants?

Question 4

Enum is the right type when you need to represent a fixed set of constants. Here, status can only have 3 possible values and nothing else. Using an enum correctly expresses this intent. Even if you will use the int value right after the function returns, using the enum within the function helps you express your intent with precision that's not possible with simple int values. When reading code, understanding the intent helps to understand the code.

Question 5

You should reframe your colour constants as an enum. Also, since this is all purely business logic and not presentation, the first-class names should be match status descriptions with the colours being secondary.

I don't see a lot of value in having your inner clear_yellow_letters as a nested function.

I think you've done the right thing in having two counters, one for each of string and solution. However, the algorithm that uses them can be simplified by deriving a dictionary of minimum counts as I will demonstrate.

color_dic is a troubled name for a number of reasons, including that it isn't a dictionary and it shouldn't contain colours per se.

I do not think that a defaultdict is necessary.

Suggested

I believe this to be equivalent but you will want to test.

from collections import Counter
from enum import Enum
class Status(Enum):
 MATCH = 3
 MISPLACE = 2
 MISMATCH = 1
 GREEN = MATCH
 YELLOW = MISPLACE
 GREY = MISMATCH
def compare_word(provided: str, solution: str):
 matches = [
 Status.MATCH if p == s else Status.MISMATCH
 for p, s in zip(provided, solution)
 ]
 prov_remain = Counter(
 p for p, status in zip(provided, matches)
 if status is not Status.MATCH
 )
 soln_remain = Counter(
 s for s, status in zip(solution, matches)
 if status is not Status.MATCH
 )
 reassignable = {
 p: min(n_p, soln_remain[p])
 for p, n_p in prov_remain.items()
 }
 for i, p in enumerate(provided):
 if matches[i] is not Status.MATCH and reassignable.get(p):
 matches[i] = Status.MISPLACE
 reassignable[p] -= 1
 return matches
print(compare_word('lulls', 'walls'))
print(compare_word('abedc', 'abcde'))
print(compare_word('bbccc', 'aabdd'))

janos janos 113k15 gold badges154 silver badges396 bronze badges · Accepted Answer · 2022-08-19 17:30:26Z

First of all, I think it's great that you ask for feedback on a well isolated and explained piece of code. I also like your attention to detail in the rules of the game.

I also find this code complex, I think for two main reasons:

The algorithm itself is a bit complex
The implementation of the algorithm is a bit complex

Simplifying the algorithm

The current algorithm goes something like this:

In a 1st pass, check the letters pairwise:
- if the pair is a match:
  - mark the position green
  - update the count of matches of this letter
- or else if the guess exists somewhere else in the solution:
  - mark the position yellow
- or else mark the position grey
If there were any matches, make a 2nd pass to clear some yellows
- for each matched letter and its count
  - if all instances of this letter were matched
    - if the guess contains more instances of this letter
      - find all unmatched positions of this letter and mark them grey

Consider a simpler alternative:

Initialize all positions to grey
Compute a count of all letters in the solution
In a 1st pass, check the letters pairwise:
- if the pair is a match:
  - mark the position green
  - update the count of matches of this letter
In a 2st pass, check the letters pairwise:
- if the pair is not a match, and the solution contains unmatched positions of this letter
  - mark the position yellow

That's fewer steps, and also simpler to implement: (Using the Status enum idea from the answer of @Reinderien)

class Status(Enum):
 MATCH = 3
 MISPLACE = 2
 MISMATCH = 1
status = [Status.MISMATCH] * len(guess)
counts_in_solution = Counter(solution)
matches = defaultdict(int)
for index, (g, s) in enumerate(zip(guess, solution)):
 if g == s:
 status[index] = Status.MATCH
 matches[g] += 1
for index, (g, s) in enumerate(zip(guess, solution)):
 if g != s and matches[g] < counts_in_solution[g]:
 status[index] = Status.MISPLACE
return [s.value for s in status]

Simplifying the implementation

Even if we don't change the main algorithm, the implementation of the original algorithm can be written simpler.

First of all, some of the variable names don't help understanding the logic.

string would describe itself better as guess
- Note that string was also a poor name because it shadows a common Python package
the color codes would be better in all-caps
c_green_count would be less cryptic as green_counts or matches

Then there is color_dic...

color_dic = string, [0 for c in string]

In a comment you had the right instinct: "color_dic is actually a list". Even further:

color_dic is a tuple, and the comment was probably referring to the 2nd element only
color_dic was not useful at all:
- the first value is just the guess, and there is no need to store it in another tuple
- the second value is probably just what you intended for color_dic. A better name would be colors.

Even with just these renames, and the minor transformation of dropping color_dic, I think the code already looks a bit less complex:

matches = defaultdict(int)
colors = [0 for c in guess]
for i in range(len(guess)):
 char = guess[i]
 if char == solution[i]:
 color = GREEN
 matches[char] += 1
 elif char in solution:
 color = YELLOW
 else:
 color = GREY
 colors[i] = color
if matches.items():
 clear_yellow_letters()
return guess, colors

Next, let's apply some idiomatic transformations:

colors = [GREY] * len(guess)
for index, [g, s] in enumerate(zip(guess, solution)):
 if g == s:
 colors[index] = GREEN
 matches[g] += 1
 elif g in solution:
 colors[index] = YELLOW
if matches:
 clear_yellow_letters()

That is:

To create a list of the same value, you can use [the_value] * count
It's good to initialize colors to all grey because:
- It's a valid value, unlike the original 0
- It's a sensible default
- This way we don't need to overwrite values to grey later
Instead iterating over indexes, enumerate lets us iterate over (index, value) pairs
Instead using index to get pairs of characters from guess and solution, we can use zip to line them up
Instead of if matches.items() it's equivalent and shorter to write if matches

For the 2nd pass, again I would start with renaming variables that look cryptic and hard to understand, to names that seem more natural and descriptive:

str_c_count -> counts_in_guess
sol_c_count -> counts_in_solution
k, v -> letter, count

With the above renames, and other minor transformations we get:

def clear_yellow_letters():
 counts_in_guess = Counter(guess)
 counts_in_solution = Counter(solution)
 for letter, count in matches.items():
 if count == counts_in_solution[letter] and counts_in_guess[letter] > count:
 for index, g in enumerate(guess):
 if g == letter and colors[index] == YELLOW:
 colors[index] = GREY

Where the other transformations are:

Joined the two ifs in the outer loop: this is equivalent, and looks simpler
Replaced the second reference to counts_in_solution[letter] with count, because at that point we know they are the same
Using enumerate to loop over index + value pairs
Not returning colors, since we can manipulate the variable in the context

Having a for loop nested within another for loop still looks complex of course. If you think about it, the inner loop is clumsy, because what we don't really want to iterate over all the letters in guess, what we really want is operate on the matched letters only.

In other words, currently we're iterating over dictionary entries and then doing a search in a list. If we reverse these operations we would iterate over a list and then do a search in a dictionary, which is easier and faster too:

def clear_yellow_letters():
 counts_in_solution = Counter(solution)
 for index, (g, s) in enumerate(zip(guess, solution)):
 if g != s and matches[g] == counts_in_solution[g]:
 colors[index] = GREY

Lastly, since this function doesn't really buy us much, we as well inline it, arriving at something that looks simpler, with simple equivalent transformations, and a very small change in the algorithm:

def compare_word(guess, solution):
 matches = defaultdict(int)
 colors = [GREY] * len(guess)
 for index, (g, s) in enumerate(zip(guess, solution)):
 if g == s:
 colors[index] = GREEN
 matches[g] += 1
 elif g in solution:
 colors[index] = YELLOW
 if matches:
 counts_in_solution = Counter(solution)
 for index, (g, s) in enumerate(zip(guess, solution)):
 if g != s and matches[g] == counts_in_solution[g]:
 colors[index] = GREY
 return guess, colors

Defining functions inside functions

Defining functions inside functions is reasonable when:

The function is not useful outside
The function is called more than once

In the posted code you only call the inner function once, so I think it doesn't add value.

And when defining inner functions, beware of accidentally modifying variables that our defined in their outer scope. Code editors like PyCharm will give a warning for variables in inner functions whose names shadow variables in the outer scope, which can be confusing for readers, and a source of mistakes.

Greatly appreciate the answer :). I have one additional question, is there any advantage to declaring Status as an Enum if I plan on using its values as ints anyway? Are they preferable solely because they're constants?
Enum is the right type when you need to represent a fixed set of constants. Here, status can only have 3 possible values and nothing else. Using an enum correctly expresses this intent. Even if you will use the int value right after the function returns, using the enum within the function helps you express your intent with precision that's not possible with simple int values. When reading code, understanding the intent helps to understand the code.

Stack Exchange Network

Correct logic for string comparison in Wordle

2 Answers 2

Simplifying the algorithm

Simplifying the implementation

Defining functions inside functions

Suggested

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Correct logic for string comparison in Wordle

2 Answers 2

Simplifying the algorithm

Simplifying the implementation

Defining functions inside functions

Suggested

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions