Markov country name generator

Question 1

I wrote a country name generator in Python 3.5. My goal was to get randomized names that would look as much like real-world names as possible. Each name needed to have a noun and an adjective form (e.g., Italy and Italian).

I started with a list of real countries, regions, and cities, which I stored in a text file. The names are broken up by syllables, with the noun and adjective endings separated (e.g., i-ta-l y/ian). The program splits each name into syllables, and each syllable into three segments: onset, nucleus, and coda (i.e., the leading consonants, the vowels, and the trailing consonants). It then uses the frequency of these segments in relation to each other to drive a Markov process that generates a name. (It's not a pure Markov process because I wanted to ensure a distribution of syllable counts similar to the input set. I also special-cased the endings.) Several types of undesirable names are rejected.
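To make the segmentation step concrete, here is a minimal sketch that applies the same regex used in the main code below to a few stems from the sample data. The helper names `split_syllable` and `split_stem` are assumptions made for illustration; they are not part of the original program.

```python
import re

# Same pattern as syllableRgx in the main code: onset, nucleus, coda.
SYLLABLE_RGX = re.compile(r"(y|[^aeiouy]*)([aeiouy]+|$)([^aeiouy]*)")


def split_syllable(syllable):
    """Return the (onset, nucleus, coda) segments of one syllable."""
    return SYLLABLE_RGX.match(syllable).groups()


def split_stem(stem):
    """Split a hyphenated stem such as "i-ta-l" into per-syllable segments."""
    return [split_syllable(syllable) for syllable in stem.split("-")]


if __name__ == "__main__":
    for stem in ["i-ta-l", "ar-me-n", "ge-no-"]:
        print(stem, split_stem(stem))
```

A trailing partial syllable such as the `l` in `i-ta-l` comes back with an empty nucleus and coda, which is the case the main code treats as the onset right before the ending.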

Main code

#!/usr/bin/python3
import re, random
# A regex that matches a syllable, with three groups for the three
# segments of the syllable: onset (initial consonants), nucleus (vowels),
# and coda (final consonants).
# The regex also matches if there is just an onset (even an empty
# onset); this case corresponds to the final partial syllable of the
# stem, which is usually the consonant before a vowel ending (for
# example, the d in "ca-na-d a").
syllableRgx = re.compile(r"(y|[^aeiouy]*)([aeiouy]+|$)([^aeiouy]*)")
nameFile = "names.txt"
# Dictionary that holds the frequency of each syllable count (note that these
# are the syllables *before* the ending, so "al-ba-n ia" only counts two)
syllableCounts = {}
# List of four dictionaries (for onsets, nuclei, codas, and endings):
# Each dictionary's key/value pairs are prevSegment:segmentDict, where
# segmentDict is a frequency dictionary of various onsets, nuclei, codas,
# or endings, and prevSegment is a segment that can be the last nonempty
# segment preceding them. A prevSegment of None marks segments at the
# beginnings of names.
segmentData = [{}, {}, {}, {}]
ONSET = 0
NUCLEUS = 1
CODA = 2
ENDING = 3
# Read names from file and generate the segmentData structure
with open(nameFile) as f:
    for line in f.readlines():
        # Strip whitespace, ignore blank lines and comments
        line = line.strip()
        if not line:
            continue
        if line[0] == "#":
            continue
        stem, ending = line.split()
        # Endings should be of the format noun/adj
        if "/" not in ending:
            # The noun ending is given; the adjective ending can be
            # derived by appending -n
            ending = "{}/{}n".format(ending, ending)
        # Syllable count is the number of hyphens
        syllableCount = stem.count("-")
        if syllableCount in syllableCounts:
            syllableCounts[syllableCount] += 1
        else:
            syllableCounts[syllableCount] = 1
        # Add the segments in this name to segmentData
        prevSegment = None
        for syllable in stem.split("-"):
            segments = syllableRgx.match(syllable).groups()
            if segments[NUCLEUS] == segments[CODA] == "":
                # A syllable with empty nucleus and coda comes right before
                # the ending, so we only process the onset
                segments = (segments[ONSET],)
            for segType, segment in enumerate(segments):
                if prevSegment not in segmentData[segType]:
                    segmentData[segType][prevSegment] = {}
                segFrequencies = segmentData[segType][prevSegment]
                if segment in segFrequencies:
                    segFrequencies[segment] += 1
                else:
                    segFrequencies[segment] = 1
                if segment:
                    prevSegment = segment
        # Add the ending to segmentData
        if prevSegment not in segmentData[ENDING]:
            segmentData[ENDING][prevSegment] = {}
        endFrequencies = segmentData[ENDING][prevSegment]
        if ending in endFrequencies:
            endFrequencies[ending] += 1
        else:
            endFrequencies[ending] = 1

def randFromFrequencies(dictionary):
    "Returns a random dictionary key, where the values represent frequencies."
    keys = dictionary.keys()
    frequencies = dictionary.values()
    index = random.randrange(sum(dictionary.values()))
    for key, freq in dictionary.items():
        if index < freq:
            # Select this one
            return key
        else:
            index -= freq
    # Weird, should have returned something
    raise ValueError("randFromFrequencies didn't pick a value "
                     "(index remainder is {})".format(index))

def markovName(syllableCount):
    "Generate a country name using a Markov-chain-like process."
    prevSegment = None
    stem = ""
    for syll in range(syllableCount):
        for segType in [ONSET, NUCLEUS, CODA]:
            try:
                segFrequencies = segmentData[segType][prevSegment]
            except KeyError:
                # In the unusual situation that the chain fails to find an
                # appropriate next segment, it's too complicated to try to
                # roll back and pick a better prevSegment; so instead,
                # return None and let the caller generate a new name
                return None
            segment = randFromFrequencies(segFrequencies)
            stem += segment
            if segment:
                prevSegment = segment
    endingOnset = None
    # Try different onsets for the last syllable till we find one that's
    # legal before an ending; we also allow empty onsets. Because it's
    # possible we won't find one, we also limit the number of retries
    # allowed.
    retries = 10
    while (retries and endingOnset != ""
           and endingOnset not in segmentData[ENDING]):
        segFrequencies = segmentData[ONSET][prevSegment]
        endingOnset = randFromFrequencies(segFrequencies)
        retries -= 1
    stem += endingOnset
    if endingOnset != "":
        prevSegment = endingOnset
    if prevSegment in segmentData[ENDING]:
        # Pick an ending that goes with the prevSegment
        endFrequencies = segmentData[ENDING][prevSegment]
        endings = randFromFrequencies(endFrequencies)
    else:
        # It can happen, if we used an empty last-syllable onset, that
        # the previous segment does not appear before any ending in the
        # data set. In this case, we'll just use -a(n) for the ending.
        endings = "a/an"
    endings = endings.split("/")
    nounForm = stem + endings[0]
    # Filter out names that are too short or too long
    if len(nounForm) < 3:
        # This would give two-letter names like Mo, which don't appeal
        # to me
        return None
    if len(nounForm) > 11:
        # This would give very long names like Imbadossorbia that are too
        # much of a mouthful
        return None
    # Filter out names with weird consonant clusters at the end
    for consonants in ["bl", "tn", "sr", "sn", "sm", "shm"]:
        if nounForm.endswith(consonants):
            return None
    # Filter out names that sound like anatomical references
    for bannedSubstring in ["vag", "coc", "cok", "kok", "peni"]:
        if bannedSubstring in stem:
            return None
    if nounForm == "ass":
        # This isn't a problem if it's part of a larger name like Assyria,
        # so filter it out only if it's the entire name
        return None
    return stem, endings

Testing code

def printCountryNames(count):
    for i in range(count):
        syllableCount = randFromFrequencies(syllableCounts)
        nameInfo = markovName(syllableCount)
        while nameInfo is None:
            nameInfo = markovName(syllableCount)
        stem, endings = nameInfo
        stem = stem.capitalize()
        noun = stem + endings[0]
        adjective = stem + endings[1]
        print("{} ({})".format(noun, adjective))

if __name__ == "__main__":
    printCountryNames(10)

Example `names.txt` contents

# Comments are ignored
i-ta-l y/ian
# A suffix can be empty
i-ra-q /i
# The stem can end with a syllable break
ge-no- a/ese
# Names whose adjective suffix just adds an -n need only list the noun suffix
ar-me-n ia
sa-mo- a

My full names.txt file can be found, along with the code, in this Gist.

Example output

Generated using the full data file:

Slorujarnia (Slorujarnian)
Ashmar (Ashmari)
Babya (Babyan)
Randorkia (Randorkian)
Esanoa (Esanoese)
Manglalia (Manglalic)
Konara (Konaran)
Lilvispia (Lilvispian)
Cenia (Cenian)
Rafri (Rafrian)

Questions

Is my code readable? Clear variable and function names? Sufficient comments?
Should I restructure anything?
Are there any Python 3 features I could use or use better? I'm particularly unused to format and the various approaches for using it.

If you see anything else that can be improved, please let me know. Just one exception: I know the PEP standard is snake_case, but I want to use camelCase and I don't intend to change that. Other formatting tips are welcome.

Question 2

You might consider running pep8 and pyflakes on your code.

Question 3

I was really hoping to see "Elbonia" in the example output.

Question 4

@That1Guy I did get "Elbonia" in one test run. :^D

Question 5

It is better to follow PEP 8, which says that imports such as yours should be on separate lines:
```
import re
import random
```
Whatever programming language you use, it is better to avoid input/output operations whenever you can. So instead of storing the country names in a text file, you could choose a suitable Python data structure for the purpose.
When someone reads your main program, they should understand immediately what it is doing. That is not the case with your main.py file, where I see a lot of distracting information and noise. For example, you should move all those constants into a separate module, which you might call configurations.py, cfg.py, settings.py, or whatever name fits your project's architecture.
Choose meaningful names: while most of the names you chose are edible, I believe you can still improve a few of them. Take nameFile, for example, which is too vague and conveys nothing beyond the assignment itself: nameFile = "names.txt". Sure, it is the name of a file, but only after reading the rest of your program can one guess what you mean by nameFile and suggest a more informative name such as countries_names. Note that my suggestion does not include the container's name: the reader of your code should not need to know implementation details such as whether the information is stored in a file or in some other data structure. Names should be "high level" and independent of the data structure they represent. The advantage is that you will not have to hunt down and rename the variable throughout your program just because you moved the data from a file to another data structure. The same applies to syllableRgx: anyone who reads syllableRgx = re.compile(r"...") understands that you are storing a compiled regular expression, but you should change this name to something better for the reasons I just explained.
You should follow the standard naming conventions. For example, syllableCounts and segmentData should be written syllable_counts and segment_data respectively.
"I want to use camelCase": No. When you develop in a given programming language, adopt its spirit and philosophy. It is like joining a new company's development team: you adapt to them; you do not ask them to adapt to your habits and wishes.
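As one possible reading of the suggestion to keep the names in a Python data structure rather than a text file, the sketch below stores each entry as a (stem, ending) tuple in the same format as the sample names.txt. The names `COUNTRY_NAMES` and `normalize_ending` are hypothetical, chosen only for this illustration.

```python
# Hypothetical module-level data, mirroring the sample names.txt entries.
COUNTRY_NAMES = [
    ("i-ta-l", "y/ian"),
    ("i-ra-q", "/i"),
    ("ge-no-", "a/ese"),
    ("ar-me-n", "ia"),
    ("sa-mo-", "a"),
]


def normalize_ending(ending):
    """Return (noun_suffix, adjective_suffix) for an ending string.

    An ending without "/" lists only the noun suffix; the adjective
    suffix is then derived by appending "n", as in the original parser.
    """
    if "/" not in ending:
        ending = "{}/{}n".format(ending, ending)
    noun_suffix, adjective_suffix = ending.split("/")
    return noun_suffix, adjective_suffix


if __name__ == "__main__":
    for stem, ending in COUNTRY_NAMES:
        noun_suffix, adjective_suffix = normalize_ending(ending)
        base = stem.replace("-", "").capitalize()
        print(base + noun_suffix, "/", base + adjective_suffix)
```

Whether this is preferable to a data file is a design choice: a separate file keeps the data editable without touching code, while an in-module list removes the parsing step.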

Question 6

Mmm. Edible names. (Readable?)

Question 7

By "edible" (rather a French way of expressing myself) I meant "meaningful names that convey the purpose of the variable, function, or class".

Question 8

Often the information needs to be stored in a language-agnostic way at some point. In such a case, would you keep the structure as-is, or first put everything in a variable and print the whole variable in one go at the end?

Question 9

@Wyrmwood You are right: constants should be written in capital letters, as you did and as the OP did. But syllable_counts is a dictionary that is initialized empty and later filled with data, so it is not a constant.

Question 10

I think 'digestible' (or better still, 'easy to digest') is better English, while still being faithful to your intention in French.

Question 11

Looping over lines of a file

Probably a minor nitpick, but when using a file object returned by open(), you can just iterate over the object, instead of calling readlines(), like so:

# Read names from file and generate the segmentData structure
with open(nameFile) as input_names:
    for line in input_names:

From the docs:

readlines(hint=-1)

Read and return a list of lines from the stream. hint can be specified to control the number of lines read: no more lines will be read if the total size (in bytes/characters) of all lines so far exceeds hint.

Note that it’s already possible to iterate on file objects using for line in file: ... without calling file.readlines().

So, if you're not going to limit the data being read, then there's no need to use readlines().
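A minimal, self-contained sketch of the same idea follows. It uses `io.StringIO` as a stand-in for the real file so that it runs without a names.txt on disk; the `SAMPLE` text and the `read_entries` helper are assumptions for illustration only.

```python
import io

# io.StringIO stands in for open("names.txt") so the sketch is self-contained.
SAMPLE = """# Comments are ignored
i-ta-l y/ian

ar-me-n ia
"""


def read_entries(lines):
    """Yield (stem, ending) pairs, skipping blank lines and comments."""
    for line in lines:  # iterates line by line; no readlines() call needed
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        stem, ending = line.split()
        yield stem, ending


if __name__ == "__main__":
    with io.StringIO(SAMPLE) as input_names:
        print(list(read_entries(input_names)))
```

Because `read_entries` accepts any iterable of lines, the same function should work unchanged with a real file object returned by `open()`.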

Testing if any element matches a condition

... can be done using any(), map(), and an appropriate function, so:

# Filter out names with weird consonant clusters at the end
weird_consonant_clusters = ["bl", "tn", "sr", "sn", "sm", "shm"]
if any(map(nounForm.endswith, weird_consonant_clusters)):
    return None

With the bannedSubstring case, though, there's no direct equivalent of in: you'd have to use count() or write a lambda, so there might not be much to be gained here.
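One further option for the bannedSubstring case, offered as a sketch rather than as part of the original answer, is a generator expression inside `any()`, which keeps the `in` operator without needing `count()` or a lambda. The function names below are hypothetical.

```python
BANNED_SUBSTRINGS = ["vag", "coc", "cok", "kok", "peni"]
WEIRD_CONSONANT_CLUSTERS = ("bl", "tn", "sr", "sn", "sm", "shm")


def has_banned_substring(stem):
    """Return True if any banned substring occurs anywhere in stem."""
    return any(banned in stem for banned in BANNED_SUBSTRINGS)


def has_weird_ending(noun_form):
    """Return True if noun_form ends with one of the listed clusters."""
    # str.endswith accepts a tuple of suffixes, so no loop is needed here.
    return noun_form.endswith(WEIRD_CONSONANT_CLUSTERS)


if __name__ == "__main__":
    print(has_banned_substring("konar"), has_weird_ending("ashm"))
```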

Increment-or-set

For the frequencies, where you do an increment or set, you could use the get method, or a defaultdict, so that this:

if ending in endFrequencies:
    endFrequencies[ending] += 1
else:
    endFrequencies[ending] = 1

Becomes:

endFrequencies[ending] = endFrequencies.get(ending, 0) + 1

Or if endFrequencies is a defaultdict(int), just:

endFrequencies[ending] += 1
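
As a rough sketch of how the `defaultdict` variant might extend to the nested prevSegment-to-frequency dictionaries in the question, the helper below (a hypothetical `build_frequencies`, not taken from the original code) counts how often each segment follows each previous segment.

```python
from collections import defaultdict


def build_frequencies(pairs):
    """Count how often each segment follows each previous segment.

    pairs is an iterable of (prev_segment, segment) tuples; a
    prev_segment of None marks the beginning of a name.
    """
    frequencies = defaultdict(lambda: defaultdict(int))
    for prev_segment, segment in pairs:
        # No "if key in dict" check needed: missing keys start at 0.
        frequencies[prev_segment][segment] += 1
    return frequencies


if __name__ == "__main__":
    onsets = build_frequencies([(None, "t"), (None, "t"), ("a", "n")])
    print(dict(onsets[None]))
```

One caveat: indexing a `defaultdict` with a missing key creates an entry instead of raising `KeyError`, so code that relies on `except KeyError`, as markovName does, would need a membership test (`key in frequencies`) instead.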

Question 12

There’s a specialized collection class made just for this purpose: collections.Counter. Just initialize with endFrequencies = Counter() and update with endFrequencies[ending] += 1.
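
A short sketch of the `Counter` approach, assuming a hypothetical helper `count_endings` and a made-up list of endings in the noun/adjective format used by the question:

```python
from collections import Counter


def count_endings(endings):
    """Return a Counter mapping each ending string to its frequency."""
    end_frequencies = Counter()
    for ending in endings:
        end_frequencies[ending] += 1  # missing keys count from 0
    return end_frequencies


if __name__ == "__main__":
    frequencies = count_endings(["ia/ian", "a/an", "ia/ian", "y/ian"])
    print(frequencies.most_common(1))
    # A Counter is still a dict, so sum(frequencies.values()) and
    # frequencies.items() work as they do in randFromFrequencies.
    print(sum(frequencies.values()))
```

Unlike a `defaultdict`, looking up a missing key in a `Counter` returns 0 without inserting the key.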

Question 13

[facepalm] Forgot about iterating over the file. Due to another answer's suggestion I won't be reading a file anymore, but thanks. defaultdict is another good idea, although it looks like @Chortos-2's collections.Counter is exactly what I need.

Billal BEGUERADJ · Accepted Answer · 2017-08-14 06:14:46Z

