Inspired by CtCI 16.8: Integer to English phrase in C++, I wrote a program to show the Japanese reading of an integer (positive, negative, or zero). The Japanese is written in Hepburn romanization to avoid encoding problems. Also, traditional Hepburn is used to avoid macrons.
Unlike English, Japanese has sound changes when two words come together. The details are given at the end of the question.
The program handles integers \$n\$ in the range \$-2^{63} \le n < 2^{63}\$.
Code
/**
 * Integer to Japanese reading
 *
 * Japanese reading is written in Hepburn Roomaji. Traditional
 * Hepburn is used to avoid the need of macrons.
 */
#include <cassert>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

using number_t = std::int_fast64_t;

const std::vector<std::string> magnitudes = {
    "", "man", "oku", "choo", "kee",
};
const std::vector<std::string> thousands = {
    "", "sen", "nisen", "sanzen", "yonsen",
    "gosen", "rokusen", "nanasen", "hassen", "kyuusen",
};
const std::vector<std::string> hundreds = {
    "", "hyaku", "nihyaku", "sanbyaku", "yonhyaku",
    "gohyaku", "roppyaku", "nanahyaku", "happyaku", "kyuuhyaku",
};
const std::vector<std::string> tens = {
    "", "juu", "nijuu", "sanjuu", "yonjuu",
    "gojuu", "rokujuu", "nanajuu", "hachijuu", "kyuujuu",
};
const std::vector<std::string> ones = {
    "", "ichi", "ni", "san", "yon",
    "go", "roku", "nana", "hachi", "kyuu",
};

// returns 10000^n
constexpr number_t magnitude(std::size_t n)
{
    number_t result = 1;
    while (n--)
        result *= 10000;
    return result;
}

constexpr bool is_vowel(char c)
{
    switch (c) {
    case 'a': case 'e': case 'i': case 'o': case 'u': case 'y':
        return true;
    default:
        return false;
    }
}

// joins two strings according to Japanese rules
void push(std::string& lhs, const std::string& rhs)
{
    if (lhs.back() == 'n' && is_vowel(rhs.front()))
        lhs += '\'';
    lhs += rhs;
}

// converts nonnegative numbers less than 10000
std::string group_name(number_t number)
{
    assert(0 <= number && number < 10000);
    std::string result = thousands[number / 1000];
    number %= 1000;
    push(result, hundreds[number / 100]);
    number %= 100;
    push(result, tens[number / 10]);
    number %= 10;
    push(result, ones[number]);
    return result;
}

std::string to_Japanese(number_t number)
{
    if (number == 0)
        return "zero";
    std::string result;
    if (number < 0) {
        result = "mainasu";
        number = -number;
    }
    number_t mag = magnitude(magnitudes.size() - 1);
    for (std::size_t i = magnitudes.size(); i-- > 0; mag /= 10000) {
        if (auto group = number / mag; group > 0) {
            push(result, group_name(group));
            push(result, magnitudes[i]);
        }
        number %= mag;
    }
    return result;
}

int main()
{
    for (number_t number; std::cin >> number;)
        std::cout << to_Japanese(number) << "\n";
}
Example session
0
zero
1
ichi
2
ni
3
san
6789678967896789
rokusennanahyakuhachijuukyuuchoorokusennanahyakuhachijuukyuuokurokusennanahyakuhachijuukyuumanrokusennanahyakuhachijuukyuu
-1234567898765432
mainasusennihyakusanjuuyonchoogosenroppyakunanajuuhachiokukyuusenhappyakunanajuurokumangosen'yonhyakusanjuuni
Japanese numerals
(You can skip this part if you are familiar with Japanese.)
In the following table, red entries involve sound change.
\begin{array}{ll} \text{Number} & \text{Japanese reading} \\ 1 & \text{一 (ichi)} \\ 2 & \text{二 (ni)} \\ 3 & \text{三 (san)} \\ 4 & \text{四 (yon)} \\ 5 & \text{五 (go)} \\ 6 & \text{六 (roku)} \\ 7 & \text{七 (nana)} \\ 8 & \text{八 (hachi)} \\ 9 & \text{九 (kyuu)} \\ 10 & \text{十 (juu)} \\ 20 & \text{二十 (nijuu)} \\ 30 & \text{三十 (sanjuu)} \\ 40 & \text{四十 (yonjuu)} \\ 50 & \text{五十 (gojuu)} \\ 60 & \text{六十 (rokujuu)} \\ 70 & \text{七十 (nanajuu)} \\ 80 & \text{八十 (hachijuu)} \\ 90 & \text{九十 (kyuujuu)} \\ 100 & \text{百 (hyaku)} \\ 200 & \text{二百 (nihyaku)} \\ 300 & \text{三百 } \color{red}{\text{(sanbyaku)}} \\ 400 & \text{四百 (yonhyaku)} \\ 500 & \text{五百 (gohyaku)} \\ 600 & \text{六百 } \color{red}{\text{(roppyaku)}} \\ 700 & \text{七百 (nanahyaku)} \\ 800 & \text{八百 } \color{red}{\text{(happyaku)}} \\ 900 & \text{九百 (kyuuhyaku)} \\ 1000 & \text{千 (sen)} \\ 2000 & \text{二千 (nisen)} \\ 3000 & \text{三千 } \color{red}{\text{(sanzen)}} \\ 4000 & \text{四千 (yonsen)} \\ 5000 & \text{五千 (gosen)} \\ 6000 & \text{六千 (rokusen)} \\ 7000 & \text{七千 (nanasen)} \\ 8000 & \text{八千 } \color{red}{\text{(hassen)}} \\ 9000 & \text{九千 (kyuusen)} \\ \end{array}
Larger numbers are considered the sums of smaller numbers. For example:
\begin{array}{ccccccc} 2019 & = & 2000 & + & 10 & + & 9 \\ \text{二千十九 (nisenjuukyuu)} & & \text{二千 (nisen)} & & \text{十 (juu)} & & \text{九 (kyuu)} \end{array}
The missing hundred place is simply ignored.
Four digits are considered a group, unlike English, where three digits are a group. The group markers are:
\begin{array}{cccc} 10^4 & 10^8 & 10^{12} & 10^{16} \\ \text{万 (man)} & \text{億 (oku)} & \text{兆 (choo)} & \text{京 (kee)} \end{array}
For example, \$1,2345,6789\$ is read as 一億二千三百四十五万六千七百八十九 (ichioku nisensanbyakuyonjuugoman rokusennanahyakuhachijuukyuu). (The spaces are for ease of recognition only.) Note that 一 (ichi) is required before these group markers, unlike 十 (juu), 百 (hyaku), and 千 (sen).
0 is ゼロ (zero).
A negative integer \$-n\$ is read as マイナス (mainasu) followed by its absolute value \$n\$. For example, -5 is read as マイナス五 (mainasugo), because 5 is read as 五 (go).
When two syllables are joined, if the first syllable ends with "n" and the second starts with one of "aeiouy", then a separator ' is added between. For example, 1001 is 千一 (sen'ichi) and 1004 is 千四 (sen'yon).
5 Answers
The push function may invoke undefined behavior. When you create a new std::string, I don't think it is guaranteed to be null-terminated since that would incur a runtime cost for no benefit other than C compatibility. Only after calling c_str() is the terminating null character there. Because the string is really empty, calling back() accesses an out-of-bounds element.
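One way to avoid the problem is to check both strings for emptiness before looking at their end characters. The following is only a sketch, assuming the rest of the program stays as posted; it reuses the question's `push` and `is_vowel` names. Note that `rhs.front()` needs the same guard as `lhs.back()`, since the lookup tables contain empty strings.

```cpp
#include <string>

// Same vowel test as in the question ('y' counts as a vowel for the separator rule).
constexpr bool is_vowel(char c)
{
    switch (c) {
    case 'a': case 'e': case 'i': case 'o': case 'u': case 'y':
        return true;
    default:
        return false;
    }
}

// Joins two strings; back() and front() are only called on non-empty strings.
void push(std::string& lhs, const std::string& rhs)
{
    if (!lhs.empty() && !rhs.empty()
        && lhs.back() == 'n' && is_vowel(rhs.front()))
        lhs += '\'';
    lhs += rhs;
}
```

With this guard, pushing onto an empty `result` or pushing an empty table entry simply appends without inspecting any character.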
You claim that your code works for \$-2^{63}\$ but you didn't add that number as a test case. I would expect it to not work since -(-2**63) is still negative.
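Strictly speaking, negating the minimum value of a signed 64-bit integer overflows, which is undefined behavior in C++. A possible workaround, sketched below, is to compute the absolute value in unsigned arithmetic. The helper name `unsigned_magnitude` is hypothetical, and the sketch assumes the rest of the conversion is switched to `std::uint64_t`.

```cpp
#include <cstdint>

// Returns |n| as an unsigned value, including for n == -2^63.
// The conversion to unsigned is well-defined (modulo 2^64), and
// unsigned subtraction wraps instead of overflowing.
constexpr std::uint64_t unsigned_magnitude(std::int64_t n)
{
    const auto u = static_cast<std::uint64_t>(n);
    return n < 0 ? 0 - u : u;
}

static_assert(unsigned_magnitude(-5) == 5, "small negative value");
static_assert(unsigned_magnitude(INT64_MIN) == 9223372036854775808ULL,
              "minimum value maps to 2^63");
```

The largest group of \$2^{63}\$ at the 京 level is 922, so the four-digit grouping in the question still applies unchanged.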
Besides these two issues, your code reads well and is easy to understand. Adding the additional documentation was a very good idea for everyone not fluent in Japanese.
- I see. The problem is that back accesses the nonexistent element. (The terminating null character doesn't change that.) (L. F., Jul 29, 2019 at 5:22)
- @L.F. Yes, in reality most std::string implementations will have a terminating 0 byte at all times. And in fact C++11 and later requires that operator[] with pos = size() references the terminating 0. What makes back() UB is that en.cppreference.com/w/cpp/string/basic_string/back says it's UB when .empty() == true. And moreover that it's equivalent to operator[](size() - 1), which would wrap the unsigned position. (Peter Cordes, Jul 29, 2019 at 6:28)
- The code definitely needs a check. Also, operator[](size() - 1) only applies if empty() is false, so wrapping is irrelevant I guess. (L. F., Jul 29, 2019 at 6:31)
- In practice g++ -O0 with libstdc++ (on Arch Linux) gives 0 when doing .back() on an empty std::string, so it doesn't help you detect this bug by crashing on that UB. (Peter Cordes, Jul 29, 2019 at 6:31)
- @L.F.: fun facts: "Legality of COW std::string implementation in C++11" shows how some of the new requirements in C++11 basically make efficient COW implementation of std::string impossible. With .data() also requiring a 0-terminated string, the obvious/intended implementation is to always maintain that so .data() and .c_str() can be a no-op that just returns one of the class member vars. (Peter Cordes, Jul 29, 2019 at 6:36)
Your magnitudes should be chou and kei (as in the supercomputer, as an aside). Also, to_Japanese() would be better named to_romaji(). As an exercise, you could try to_Japanese(number, KANJI|HIRAGANA|ROMAJI) also.
Furthermore, it's icchou, not ichichou, and ikkei, not ichikei.
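The ichi correction could be handled where a group reading is joined to its magnitude word. Below is a rough sketch using the question's spellings (choo, kee); `join_magnitude` and `ends_with` are hypothetical helpers, not part of the posted code. It covers only the case named above, a group ending in ichi before a magnitude starting with "c" or "k", and assumes the same change applies when a longer group ends in ichi. Other sound changes and the n' separator rule are deliberately left out.

```cpp
#include <string>
#include <string_view>

// True if text ends with suffix.
bool ends_with(std::string_view text, std::string_view suffix)
{
    return text.size() >= suffix.size()
        && text.substr(text.size() - suffix.size()) == suffix;
}

// Joins a group reading and a magnitude word, turning "...ichi" + "choo"
// into "...icchoo" and "...ichi" + "kee" into "...ikkee".
std::string join_magnitude(std::string group, std::string_view magnitude)
{
    const bool doubled = !magnitude.empty()
        && (magnitude.front() == 'c' || magnitude.front() == 'k')
        && ends_with(group, "ichi");
    if (doubled) {
        group.erase(group.size() - 3);  // drop "chi", keep the leading "i"
        group += magnitude.front();     // double the following consonant
    }
    group += magnitude;
    return group;
}
```

Groups that do not end in ichi, and the magnitudes man and oku, pass through unchanged.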
- Thank you for correcting my Japanese! I am a beginner and sorry for the mistakes. Regarding chou vs choo: what I learned is that when おう is a long "o", it is written as "oo" instead of "ou". I figured out that different romaji systems differ on this. And "ichichou" is so silly ;) (L. F., Jul 29, 2019 at 0:50)
- Yes, there's lots of different romaji versions - I looked up the most common Hepburn, and choo is correct, although I always use chou as it's IME-friendly, but then sennana would be typed sennnana. We're now straying into Japanese Language territory... (Ken Y-N, Jul 29, 2019 at 5:59)
I think it would be better to generate the 漢字 versions of the numbers, as it's more useful in general. Make a separate class responsible for romanization (or phonetic conversion in general so it can support kana).
Using kanji also allows for somewhat trivially adding support for formal numbers (大字), which is kind of cool/useful.
I would also recommend adding support for breaking up the romaji sequences as reading very large numbers in romaji with no spaces is not at all fun. It could be an optional flag. A natural place to add spaces would be between magnitudes at least, or maybe between all of the various groupings you've already put together (ones, tens, hundreds, etc.).
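As a rough illustration of the optional flag, the sketch below joins already-converted groups either with spaces or, when spacing is off, with the question's n' separator rule. `join_groups` and its `spaced` parameter are hypothetical names; the sketch assumes each element already holds one four-digit group plus its magnitude word.

```cpp
#include <string>
#include <vector>

// Same vowel test as in the question.
constexpr bool is_vowel(char c)
{
    switch (c) {
    case 'a': case 'e': case 'i': case 'o': case 'u': case 'y':
        return true;
    default:
        return false;
    }
}

// Joins group readings. With spaced == true a space separates the groups;
// otherwise the apostrophe rule from the question is applied at each joint.
std::string join_groups(const std::vector<std::string>& groups, bool spaced)
{
    std::string result;
    for (const auto& group : groups) {
        if (group.empty())
            continue;  // skipped groups contribute nothing
        if (!result.empty()) {
            if (spaced)
                result += ' ';
            else if (result.back() == 'n' && is_vowel(group.front()))
                result += '\'';
        }
        result += group;
    }
    return result;
}
```

For the question's example 1,2345,6789, passing the three group readings with `spaced` set to true gives the spaced form shown there.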
As for using romaji to avoid encoding issues, I recommend biting the bullet and learning how to support Unicode correctly, as it will be extremely useful.
- Insightful. Although romanization is nontrivial AFAIK since one kanji can have multiple readings ... Spaces in between romaji is also a good idea. (L. F., Jul 29, 2019 at 3:25)
- @L.F. romanization doesn't have to support the entirety of Japanese, as this code is number focused, and you've already decided on the readings for each of the kanji involved anyway (or rather; there's only one right answer for these in this context); and that means katakana and hiragana phonetic readings are just as easy at that point. (briantist, Jul 29, 2019 at 3:30)
I know that in English each word would be separate, whereas the program prints all of the number words merged together. Is this how numbers are actually written in Japanese?
Since the code might be useful in many places, it might be better to put the translation code in a class.
Use of Vertical Space
Generally code is easier to read and maintain when only one value is on a line. This would apply to the initialization of the vectors and the switch statement in the function is_vowel(). For maintenance reasons it is much easier to insert a line where it needs to be than it is to add a value to a comma separated list.
is_vowel function
There would be less code if the vowels were in a std::map rather than a switch statement. Here are discussions on Stack Overflow and Software Engineering.
This portion of the answer has been modified to remove the statement that there might be a performance improvement using std::map. If map used a simple index into an array that might be true, however it is not a simple index into an array.
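If the goal is simply less code, a lookup in a constant string is another option that needs no map at all. This is a sketch of an alternative technique, not what the answer proposes; it assumes C++17 for `std::string_view` and makes no performance claim relative to the switch.

```cpp
#include <string_view>

// True for the characters the question treats as vowels ('y' included).
constexpr bool is_vowel(char c)
{
    return std::string_view("aeiouy").find(c) != std::string_view::npos;
}

static_assert(is_vowel('y'), "y counts as a vowel here");
static_assert(!is_vowel('n'), "n is a consonant");
```

The function stays `constexpr`, so it can still be used in compile-time checks like the two shown.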
Assert
Assert statements are generally used for debugging purposes and terminate the program. Assert statements may be removed when the code is compiled without debugging as well. I don't expect to see asserts in production level code because it implies that the code is not yet debugged.
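The distinction can be sketched as follows: an assert documents a precondition that only a programming error can violate, while input arriving from outside the program gets an ordinary check that reports failure. Both function names here are hypothetical and not taken from the question; `parse_number` assumes C++17 for `std::optional`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>

// Internal helper: callers must pass a value in [0, 10000).
// A violation is a bug in the caller, so it is asserted, not reported.
int thousands_digit(std::int64_t group)
{
    assert(0 <= group && group < 10000);
    return static_cast<int>(group / 1000);
}

// Program boundary: user text may be malformed, so failure is a normal
// outcome and is returned to the caller instead of aborting.
std::optional<std::int64_t> parse_number(const std::string& text)
{
    try {
        std::size_t used = 0;
        const long long value = std::stoll(text, &used);
        if (used != text.size())
            return std::nullopt;  // trailing characters after the number
        return value;
    } catch (const std::invalid_argument&) {
        return std::nullopt;      // no digits at all
    } catch (const std::out_of_range&) {
        return std::nullopt;      // does not fit in long long
    }
}
```

Compiling with NDEBUG removes the assert but leaves the input check in place, which matches the different roles of the two mechanisms.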
- "There would be less code and the performance might be better if the vowels were in a std::map rather than a switch statement." - do you have anything to support a claim that std::map is faster than a raw switch statement? Because I very much doubt that. (Tomáš Zato, Jul 29, 2019 at 9:26)
- ^ Agreed with Tomas. Provide proof. std::map is dynamically-allocated, node-based, and definitely not cache friendly. (Vittorio Romeo, Jul 29, 2019 at 10:41)
- "Assert statements are generally used for debugging purposes and throw an exception." - NO. Assert statements exist to test preconditions. Exceptions have nothing to do with preconditions. This answer is harmful. (Vittorio Romeo, Jul 29, 2019 at 10:42)
- Failed assert()s do not throw - they terminate the program. (Toby Speight, Jul 29, 2019 at 11:14)
- "It is generally better to use if statements that provide error messages rather than assert statements." - This is still severely wrong and harmful. They have completely different use cases. Assert -> precondition check, contract breakage, unrecoverable error. if statement + error -> recoverable error, can be usually handled by a human. (Vittorio Romeo, Jul 29, 2019 at 13:15)
Your use of std::vector<std::string> for the arrays of string constants is wasteful. Both std::vector and std::string are dynamic types that can potentially allocate. A much more lightweight choice would be constexpr std::array<const char*> or constexpr std::array<std::string_view>.
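A minimal sketch of what such tables might look like, assuming C++17. Note that std::array also needs the element count as a second template argument; the table contents are copied from the question.

```cpp
#include <array>
#include <string_view>

// Compile-time tables: no dynamic allocation and no runtime initialization.
constexpr std::array<std::string_view, 5> magnitudes = {
    "", "man", "oku", "choo", "kee",
};
constexpr std::array<std::string_view, 10> ones = {
    "", "ichi", "ni", "san", "yon",
    "go", "roku", "nana", "hachi", "kyuu",
};

// The contents can be checked at compile time.
static_assert(ones[3] == "san", "table lookup works in constant expressions");
```

Functions that currently take const std::string& would need to accept std::string_view (or convert explicitly) to work with these tables.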
- Fair point. string is probably not dynamic in this case due to SSO, but vector is quite wasteful. (L. F., Jul 29, 2019 at 10:42)
- Still, I like using the most minimal tool for the job whenever possible. Even if the constants are going to fit in SSO, it's good practice to use std::string_view or const char*. The compiler will also very likely be able to optimize better. (Vittorio Romeo, Jul 29, 2019 at 10:44)