I wrote a country name generator in Python 3.5. My goal was to get randomized names that would look as much like real-world names as possible. Each name needed to have a noun and an adjective form (e.g., Italy
and Italian
).
I started with a list of real countries, regions, and cities, which I stored in a text file. The names are broken up by syllables, with the noun and adjective endings separated (e.g., i-ta-l y/ian
). The program splits each name into syllables, and each syllable into three segments: onset, nucleus, and coda (i.e., the leading consonants, the vowels, and the trailing consonants). It then uses the frequency of these segments in relation to each other to drive a Markov process that generates a name. (It's not a pure Markov process because I wanted to ensure a distribution of syllable counts similar to the input set. I also special-cased the endings.) Several types of undesirable names are rejected.
Main code
#!/usr/bin/python3
import re, random
# A regex that matches a syllable, with three groups for the three
# segments of the syllable: onset (initial consonants), nucleus (vowels),
# and coda (final consonants).
# The regex also matches if there is just an onset (even an empty
# onset); this case corresponds to the final partial syllable of the
# stem, which is usually the consonant before a vowel ending (for
# example, the d in "ca-na-d a").
syllableRgx = re.compile(r"(y|[^aeiouy]*)([aeiouy]+|$)([^aeiouy]*)")
nameFile = "names.txt"
# Dictionary that holds the frequency of each syllable count (note that these
# are the syllables *before* the ending, so "al-ba-n ia" only counts two)
syllableCounts = {}
# List of four dictionaries (for onsets, nuclei, codas, and endings):
# Each dictionary's key/value pairs are prevSegment:segmentDict, where
# segmentDict is a frequency dictionary of various onsets, nuclei, codas,
# or endings, and prevSegment is a segment that can be the last nonempty
# segment preceding them. A prevSegment of None marks segments at the
# beginnings of names.
segmentData = [{}, {}, {}, {}]
ONSET = 0
NUCLEUS = 1
CODA = 2
ENDING = 3
# Read names from file and generate the segmentData structure
with open(nameFile) as f:
for line in f.readlines():
# Strip whitespace, ignore blank lines and comments
line = line.strip()
if not line:
continue
if line[0] == "#":
continue
stem, ending = line.split()
# Endings should be of the format noun/adj
if "/" not in ending:
# The noun ending is given; the adjective ending can be
# derived by appending -n
ending = "{}/{}n".format(ending, ending)
# Syllable count is the number of hyphens
syllableCount = stem.count("-")
if syllableCount in syllableCounts:
syllableCounts[syllableCount] += 1
else:
syllableCounts[syllableCount] = 1
# Add the segments in this name to segmentData
prevSegment = None
for syllable in stem.split("-"):
segments = syllableRgx.match(syllable).groups()
if segments[NUCLEUS] == segments[CODA] == "":
# A syllable with emtpy nucleus and coda comes right before
# the ending, so we only process the onset
segments = (segments[ONSET],)
for segType, segment in enumerate(segments):
if prevSegment not in segmentData[segType]:
segmentData[segType][prevSegment] = {}
segFrequencies = segmentData[segType][prevSegment]
if segment in segFrequencies:
segFrequencies[segment] += 1
else:
segFrequencies[segment] = 1
if segment:
prevSegment = segment
# Add the ending to segmentData
if prevSegment not in segmentData[ENDING]:
segmentData[ENDING][prevSegment] = {}
endFrequencies = segmentData[ENDING][prevSegment]
if ending in endFrequencies:
endFrequencies[ending] += 1
else:
endFrequencies[ending] = 1
def randFromFrequencies(dictionary):
"Returns a random dictionary key, where the values represent frequencies."
keys = dictionary.keys()
frequencies = dictionary.values()
index = random.randrange(sum(dictionary.values()))
for key, freq in dictionary.items():
if index < freq:
# Select this one
return key
else:
index -= freq
# Weird, should have returned something
raise ValueError("randFromFrequencies didn't pick a value "
"(index remainder is {})".format(index))
def markovName(syllableCount):
"Generate a country name using a Markov-chain-like process."
prevSegment = None
stem = ""
for syll in range(syllableCount):
for segType in [ONSET, NUCLEUS, CODA]:
try:
segFrequencies = segmentData[segType][prevSegment]
except KeyError:
# In the unusual situation that the chain fails to find an
# appropriate next segment, it's too complicated to try to
# roll back and pick a better prevSegment; so instead,
# return None and let the caller generate a new name
return None
segment = randFromFrequencies(segFrequencies)
stem += segment
if segment:
prevSegment = segment
endingOnset = None
# Try different onsets for the last syllable till we find one that's
# legal before an ending; we also allow empty onsets. Because it's
# possible we won't find one, we also limit the number of retries
# allowed.
retries = 10
while (retries and endingOnset != ""
and endingOnset not in segmentData[ENDING]):
segFrequencies = segmentData[ONSET][prevSegment]
endingOnset = randFromFrequencies(segFrequencies)
retries -= 1
stem += endingOnset
if endingOnset != "":
prevSegment = endingOnset
if prevSegment in segmentData[ENDING]:
# Pick an ending that goes with the prevSegment
endFrequencies = segmentData[ENDING][prevSegment]
endings = randFromFrequencies(endFrequencies)
else:
# It can happen, if we used an empty last-syllable onset, that
# the previous segment does not appear before any ending in the
# data set. In this case, we'll just use -a(n) for the ending.
endings = "a/an"
endings = endings.split("/")
nounForm = stem + endings[0]
# Filter out names that are too short or too long
if len(nounForm) < 3:
# This would give two-letter names like Mo, which don't appeal
# to me
return None
if len(nounForm) > 11:
# This would give very long names like Imbadossorbia that are too
# much of a mouthful
return None
# Filter out names with weird consonant clusters at the end
for consonants in ["bl", "tn", "sr", "sn", "sm", "shm"]:
if nounForm.endswith(consonants):
return None
# Filter out names that sound like anatomical references
for bannedSubstring in ["vag", "coc", "cok", "kok", "peni"]:
if bannedSubstring in stem:
return None
if nounForm == "ass":
# This isn't a problem if it's part of a larger name like Assyria,
# so filter it out only if it's the entire name
return None
return stem, endings
Testing code
def printCountryNames(count):
for i in range(count):
syllableCount = randFromFrequencies(syllableCounts)
nameInfo = markovName(syllableCount)
while nameInfo is None:
nameInfo = markovName(syllableCount)
stem, endings = nameInfo
stem = stem.capitalize()
noun = stem + endings[0]
adjective = stem + endings[1]
print("{} ({})".format(noun, adjective))
if __name__ == "__main__":
printCountryNames(10)
Example names.txt
contents
# Comments are ignored
i-ta-l y/ian
# A suffix can be empty
i-ra-q /i
# The stem can end with a syllable break
ge-no- a/ese
# Names whose adjective suffix just adds an -n need only list the noun suffix
ar-me-n ia
sa-mo- a
My full names.txt
file can be found, along with the code, in this Gist.
Example output
Generated using the full data file:
Slorujarnia (Slorujarnian)
Ashmar (Ashmari)
Babya (Babyan)
Randorkia (Randorkian)
Esanoa (Esanoese)
Manglalia (Manglalic)
Konara (Konaran)
Lilvispia (Lilvispian)
Cenia (Cenian)
Rafri (Rafrian)
Questions
- Is my code readable? Clear variable and function names? Sufficient comments?
- Should I restructure anything?
- Are there any Python 3 features I could use or use better? I'm particularly unused to
format
and the various approaches for using it.
If you see anything else that can be improved, please let me know. Just one exception: I know the PEP standard is snake_case, but I want to use camelCase and I don't intend to change that. Other formatting tips are welcome.
2 Answers 2
It is better to follow PEP8 which says that import statements, such as in your case, should use multiple lines:
import re import random
- Whatever the programming language you use, it is better to avoid input/output operations whenever you can. So instead of storing the countries' names in a text file, you can choose a suitable Python data structure for the purpose.
- When someone reads your main program, he must straightforwardly knows what it is doing. It is not the case with your main.py file where I see lot of distracting information and noise. For example, you should save all those constants in a separate module you may call configurations.py, cfg.py, settings.py or whatever name you think it fits into the your project's architecture.
- Choose meaningful names: while most of the names you chose are edible, I believe you still can make some improvements for few of them. That is the case, for example, of
nameFile
which is too vague and does not bring any information apart from the one of the assignment operation itself:nameFile = "names.txt"
. Sure it is a name of a file, but only after reading your program's statement that one can guess what you would mean bynameFile
and start to suggest you right away a more suitable and informative name such ascountries_names
. Note in my suggestion there is not the container's name. I mean, I am not pushing the reader of your code to know programming details such as whether you stored the information in a file or this or that data structure. Names should be "high level" and independent from the data structure they represent. This offers you the advantage not to spot the same name in your program to re-write it just because you changed the store data from a file to an other data structure. This also applies tosyllableRgx
: for sure when someone readssyllableRgx = re.compile(r"...")
he understands you are storing the result of a regex expression. But you should change this name to something better for the reasons I explained just before. - You should follow the standard naming conventions. For example,
syllableCounts
andsegmentData
should be writtensyllable_counts
andsegment_data
respectively. - I want to use camelCase No. When develop in a given programming language, adopt its spirit and philosophy. It is like when you join a new company's developer team: adapt yourself to themselves, and do not ask them to adapt themselves to your habits and wishes.
-
5\$\begingroup\$ Mmm. Edible names. (Readable?) \$\endgroup\$TRiG– TRiG2017年08月14日 10:19:46 +00:00Commented Aug 14, 2017 at 10:19
-
3\$\begingroup\$ By edible (rather a French way of expressing myself) I meant "meaningful names that tell about the variable's, function's or class' purpose* \$\endgroup\$Billal BEGUERADJ– Billal BEGUERADJ2017年08月14日 10:56:06 +00:00Commented Aug 14, 2017 at 10:56
-
\$\begingroup\$ Often the information needs to get stored in a language-agnostic way at some point. In such a case, would you keep the structure as-is or first put everything in a variable and print the whole variable in 1 go at the end? \$\endgroup\$Dennis Jaheruddin– Dennis Jaheruddin2017年08月14日 15:15:32 +00:00Commented Aug 14, 2017 at 15:15
-
3\$\begingroup\$ You are right, constants should be written in capital letters as you did, and as the OP did , but
syllable_counts
is a dictionary which is, in the beginning inititalized to be empty, later it is filled with data, so it is not a constant @Wyrmwood \$\endgroup\$Billal BEGUERADJ– Billal BEGUERADJ2017年08月14日 15:56:25 +00:00Commented Aug 14, 2017 at 15:56 -
10\$\begingroup\$ I think 'digestible' (or better still, 'easy to digest') is better English, while still being faithful to your intention in French. \$\endgroup\$Sanchises– Sanchises2017年08月14日 20:04:00 +00:00Commented Aug 14, 2017 at 20:04
Looping over lines of a file
Probably a minor nitpick, but when using a file object returned by open()
, you can just iterate over the object, instead of calling readlines()
, like so:
# Read names from file and generate the segmentData structure
with open(nameFile) as input_names:
for line in input_names:
From the docs:
readlines(hint=-1)
Read and return a list of lines from the stream. hint can be specified to control the number of lines read: no more lines will be read if the total size (in bytes/characters) of all lines so far exceeds hint.
Note that it’s already possible to iterate on file objects using
for line in file: ...
without callingfile.readlines()
.
So, if you're not going to limit the data being read, then there's no need to use readlines()
.
Testing if any element matches a condition
... can be done using any()
, map()
, and an appropriate function, so:
# Filter out names with weird consonant clusters at the end
weird_consonant_clusters = ["bl", "tn", "sr", "sn", "sm", "shm"]
if any(map(nounForm.endswith, weird_consonant_clusters)):
return None
With the bannedSubstring
case, though, there's no direct equivalent of in
: you'd have to use count()
or write a lambda, so there might not be much to be gained here.
Increment-or-set
For the frequencies, where you do an increment or set, you could use the get
method, or a defaultdict, so that this:
if ending in endFrequencies:
endFrequencies[ending] += 1
else:
endFrequencies[ending] = 1
Becomes:
endFrequencies[ending] = endFrequencies.get(ending, 0) + 1
Or if endFrequencies
is a defaultdict(int)
, just:
endFrequencies[ending] += 1
-
7\$\begingroup\$ There’s a specialized collection class made just for this purpose: collections.Counter. Just initialize with
endFrequencies = Counter()
and update withendFrequencies[ending] += 1
. \$\endgroup\$Chortos-2– Chortos-22017年08月14日 12:09:13 +00:00Commented Aug 14, 2017 at 12:09 -
\$\begingroup\$ [facepalm] Forgot about iterating over the file. Due to another answer's suggestion I won't be reading a file anymore, but thanks.
defaultdict
is another good idea, although it looks like @Chortos-2'scollections.Counter
is exactly what I need. \$\endgroup\$DLosc– DLosc2017年08月14日 20:07:26 +00:00Commented Aug 14, 2017 at 20:07
pep8
andpyflakes
on your code \$\endgroup\$