Japanese reading of an integer

Question 1

Inspired by CtCI 16.8: Integer to English phrase in C++, I wrote a program to show the Japanese reading of an integer (positive, negative, or zero). The Japanese is written in Hepburn romanization to avoid encoding problems. Also, traditional Hepburn is used to avoid macrons.

Unlike English, Japanese has sound changes when two words come together. The details are given at the end of the question.

The program handles integers \$n\$ in the range \$-2^{63} \le n < 2^{63}\$.

Code

/**
 * Integer to Japanese reading
 *
 * Japanese reading is written in Hepburn Roomaji. Traditional
 * Hepburn is used to avoid the need of macrons.
 */
#include <cassert>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

using number_t = std::int_fast64_t;

const std::vector<std::string> magnitudes = {
    "", "man", "oku", "choo", "kee",
};
const std::vector<std::string> thousands = {
    "", "sen", "nisen", "sanzen", "yonsen",
    "gosen", "rokusen", "nanasen", "hassen", "kyuusen",
};
const std::vector<std::string> hundreds = {
    "", "hyaku", "nihyaku", "sanbyaku", "yonhyaku",
    "gohyaku", "roppyaku", "nanahyaku", "happyaku", "kyuuhyaku",
};
const std::vector<std::string> tens = {
    "", "juu", "nijuu", "sanjuu", "yonjuu",
    "gojuu", "rokujuu", "nanajuu", "hachijuu", "kyuujuu",
};
const std::vector<std::string> ones = {
    "", "ichi", "ni", "san", "yon",
    "go", "roku", "nana", "hachi", "kyuu",
};

// returns 10000^n
constexpr number_t magnitude(std::size_t n)
{
    number_t result = 1;
    while (n--)
        result *= 10000;
    return result;
}

constexpr bool is_vowel(char c)
{
    switch (c) {
    case 'a': case 'e': case 'i': case 'o': case 'u': case 'y':
        return true;
    default:
        return false;
    }
}

// joins two strings according to Japanese rules
void push(std::string& lhs, const std::string& rhs)
{
    if (lhs.back() == 'n' && is_vowel(rhs.front()))
        lhs += '\'';
    lhs += rhs;
}

// converts nonnegative numbers less than 10000
std::string group_name(number_t number)
{
    assert(0 <= number && number < 10000);
    std::string result = thousands[number / 1000];
    number %= 1000;
    push(result, hundreds[number / 100]);
    number %= 100;
    push(result, tens[number / 10]);
    number %= 10;
    push(result, ones[number]);
    return result;
}

std::string to_Japanese(number_t number)
{
    if (number == 0)
        return "zero";
    std::string result;
    if (number < 0) {
        result = "mainasu";
        number = -number;
    }
    number_t mag = magnitude(magnitudes.size() - 1);
    for (std::size_t i = magnitudes.size(); i-- > 0; mag /= 10000) {
        if (auto group = number / mag; group > 0) {
            push(result, group_name(group));
            push(result, magnitudes[i]);
        }
        number %= mag;
    }
    return result;
}

int main()
{
    for (number_t number; std::cin >> number;)
        std::cout << to_Japanese(number) << "\n";
}

Example session

0
zero
1
ichi
2
ni
3
san
6789678967896789
rokusennanahyakuhachijuukyuuchoorokusennanahyakuhachijuukyuuokurokusennanahyakuhachijuukyuumanrokusennanahyakuhachijuukyuu
-1234567898765432
mainasusennihyakusanjuuyonchoogosenroppyakunanajuuhachiokukyuusenhappyakunanajuurokumangosen'yonhyakusanjuuni

Japanese numerals

(You can skip this part if you are familiar with Japanese.)

In the following table, red entries involve sound change.

\begin{array}{ll} \text{Number} & \text{Japanese reading} \\ 1 & \text{一 (ichi)} \\ 2 & \text{二 (ni)} \\ 3 & \text{三 (san)} \\ 4 & \text{四 (yon)} \\ 5 & \text{五 (go)} \\ 6 & \text{六 (roku)} \\ 7 & \text{七 (nana)} \\ 8 & \text{八 (hachi)} \\ 9 & \text{九 (kyuu)} \\ 10 & \text{十 (juu)} \\ 20 & \text{二十 (nijuu)} \\ 30 & \text{三十 (sanjuu)} \\ 40 & \text{四十 (yonjuu)} \\ 50 & \text{五十 (gojuu)} \\ 60 & \text{六十 (rokujuu)} \\ 70 & \text{七十 (nanajuu)} \\ 80 & \text{八十 (hachijuu)} \\ 90 & \text{九十 (kyuujuu)} \\ 100 & \text{百 (hyaku)} \\ 200 & \text{二百 (nihyaku)} \\ 300 & \text{三百 } \color{red}{\text{(sanbyaku)}} \\ 400 & \text{四百 (yonhyaku)} \\ 500 & \text{五百 (gohyaku)} \\ 600 & \text{六百 } \color{red}{\text{(roppyaku)}} \\ 700 & \text{七百 (nanahyaku)} \\ 800 & \text{八百 } \color{red}{\text{(happyaku)}} \\ 900 & \text{九百 (kyuuhyaku)} \\ 1000 & \text{千 (sen)} \\ 2000 & \text{二千 (nisen)} \\ 3000 & \text{三千 } \color{red}{\text{(sanzen)}} \\ 4000 & \text{四千 (yonsen)} \\ 5000 & \text{五千 (gosen)} \\ 6000 & \text{六千 (rokusen)} \\ 7000 & \text{七千 (nanasen)} \\ 8000 & \text{八千 } \color{red}{\text{(hassen)}} \\ 9000 & \text{九千 (kyuusen)} \\ \end{array}

Larger numbers are considered the sums of smaller numbers. For example:

\begin{array}{ccccccc} 2019 & = & 2000 & + & 10 & + & 9 \\ \text{二千十九 (nisenjuukyuu)} & & \text{二千 (nisen)} & & \text{十 (juu)} & & \text{九 (kyuu)} \end{array}

The missing hundreds place is simply skipped.

Four digits are considered a group, unlike English, where three digits are a group. The group markers are:

\begin{array}{cccc} 10^4 & 10^8 & 10^{12} & 10^{16} \\ \text{万 (man)} & \text{億 (oku)} & \text{兆 (choo)} & \text{京 (kee)} \end{array}

For example, \$1,2345,6789\$ is read as 一億二千三百四十五万六千七百八十九 (ichioku nisensanbyakuyonjuugoman rokusennanahyakuhachijuukyuu). (The spaces are for ease of recognition only.) Note that 一 (ichi) is required before these group markers, unlike 十 (juu), 百 (hyaku), and 千 (sen).

0 is ゼロ (zero).

A negative integer \$-n\$ is read as マイナス (mainasu) followed by its absolute value \$n\$. For example, -5 is read as マイナス五 (mainasugo), because 5 is read as 五 (go).

When two syllables are joined, if the first syllable ends with "n" and the second starts with one of "aeiouy", then a separator ' is added between. For example, 1001 is 千一 (sen'ichi) and 1004 is 千四 (sen'yon).

Question 2

The push function may invoke undefined behavior. When you create a new std::string, I don't think it is guaranteed to be null-terminated since that would incur a runtime cost for no benefit other than C compatibility. Only after calling c_str() the terminating null character is there. Because the string is really empty, calling back() accesses an out-of-bounds element.
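One way to avoid the problem is to check both operands before looking at their ends. Note that rhs can be empty too (hundreds[0] is ""), so front() needs the same guard as back(). A minimal sketch; the name push_checked is hypothetical and is_vowel is copied from the question:

```cpp
#include <string>

// Same character set as the question's is_vowel ('y' is included on purpose).
constexpr bool is_vowel(char c)
{
    switch (c) {
    case 'a': case 'e': case 'i': case 'o': case 'u': case 'y':
        return true;
    default:
        return false;
    }
}

// Hypothetical replacement for push(): appends rhs to lhs and inserts an
// apostrophe after a final 'n' when rhs begins with a vowel or 'y'.
// back() and front() are only called on non-empty strings.
void push_checked(std::string& lhs, const std::string& rhs)
{
    if (!lhs.empty() && !rhs.empty()
        && lhs.back() == 'n' && is_vowel(rhs.front()))
        lhs += '\'';
    lhs += rhs;
}
```

With this guard, joining onto an empty result or appending an empty table entry is a plain append.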

You claim that your code works for \$-2^{63}\$ but you didn't add that number as a test case. I would expect it to not work since -(-2**63) is still negative.
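A common way around the negation problem is to take the absolute value in an unsigned type, where arithmetic wraps modulo \$2^{64}\$ instead of overflowing. A minimal sketch, assuming the rest of the conversion is switched to std::uint64_t; the helper name unsigned_abs is hypothetical:

```cpp
#include <cstdint>

// Hypothetical helper: absolute value of a 64-bit signed integer, returned
// as an unsigned value. The conversion to unsigned happens before the
// negation, so no signed overflow can occur, even for the minimum value.
constexpr std::uint64_t unsigned_abs(std::int64_t n)
{
    const auto u = static_cast<std::uint64_t>(n);
    return n < 0 ? 0 - u : u;
}
```

The sign itself still has to be remembered separately so that "mainasu" can be emitted.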

Besides these two issues, your code reads well and is easy to understand. Adding the additional documentation was a very good idea for everyone not fluent in Japanese.

Question 3

I see. The problem is that back accesses the nonexistent element. (The terminating null character doesn't change that.)

Question 4

@L.F. Yes, in reality most std::string implementations will have a terminating 0 byte at all times. And in fact C++11 and later requires that operator[] with pos=size() references the terminating 0. What makes back() UB is that en.cppreference.com/w/cpp/string/basic_string/back says it's UB when .empty() == true. And moreover that it's equivalent to operator[](size() - 1) which would wrap the unsigned position.

Question 5

The code definitely needs a check. Also, operator[](size() - 1) only applies if empty() is false, so wrapping is irrelevant I guess.

Question 6

In practice g++ -O0 with libstdc++ (on Arch Linux) gives 0 when doing .back() on an empty std::string, so it doesn't help you detect this bug by crashing on that UB.

Question 7

@L.F.: fun facts: Legality of COW std::string implementation in C++11 shows how some of the new requirements in C++11 basically make efficient COW implementation of std::string impossible. With .data() also requiring a 0-terminated string, the obvious/intended implementation is to always maintain that so .data() and .c_str() can be a no-op that just returns one of the class member vars.

Question 8

Your magnitudes should be chou and kei (as in the supercomputer, as an aside). Also, to_Japanese() would be better named to_romaji(). As an exercise, you could try to_Japanese(number, KANJI|HIRAGANA|ROMAJI) also.

Furthermore, it's icchou, not ichichou, and ikkei, not ichikei.
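A rough sketch of how those two contractions could be applied when a group marker is appended. It only covers the ichi + chou and ichi + kei cases named above and uses this answer's spellings; other digits may need similar treatment, which the sketch does not attempt. The name append_marker is hypothetical:

```cpp
#include <string>

// Hypothetical helper: appends a group marker to a group reading, applying
// only two contractions: ichi + chou -> icchou and ichi + kei -> ikkei.
std::string append_marker(std::string group, const std::string& marker)
{
    const std::string ichi = "ichi";
    const bool ends_in_ichi =
        group.size() >= ichi.size()
        && group.compare(group.size() - ichi.size(), ichi.size(), ichi) == 0;
    if (ends_in_ichi && (marker == "chou" || marker == "kei")) {
        // Drop the final "chi" and double the marker's first consonant.
        group.erase(group.size() - 3);
        group += marker.front();
    }
    return group + marker;
}
```

Readings that merely end in "chi", such as "hachi", are left untouched by this check because it matches the full "ichi" suffix.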

Question 9

Thank you for correcting my Japanese! I am a beginner and sorry for the mistakes. Regarding chou vs choo: what I learned is that when おう is a long "o", it is written as "oo" instead of "ou". I figured out that different romaji systems differ on this. And "ichichou" is so silly ;)

Question 10

Yes, there's lots of different romaji versions - I looked up the most common Hepburn, and choo is correct, although I always use chou as it's IME-friendly, but then sennana would be typed sennnana. We're now straying into Japanese Language territory...

Question 11

I think it would be better to generate the 漢字 versions of the numbers, as it's more useful in general. Make a separate class responsible for romanization (or phonetic conversion in general so it can support kana).

Using kanji also allows for somewhat trivially adding support for formal numbers (大字), which is kind of cool/useful.
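To give an idea of the kanji version, here is a minimal sketch that follows the rules described in the question (四 digits per group, 一 omitted before 十, 百 and 千 but kept before group markers). It handles nonnegative values only, assumes a UTF-8 source file, and the names group_kanji and to_kanji are hypothetical:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Kanji for one group of four digits (0..9999).
std::string group_kanji(unsigned group)
{
    static const std::array<const char*, 10> digits = {
        "", "一", "二", "三", "四", "五", "六", "七", "八", "九",
    };
    static const std::array<const char*, 4> units = {"千", "百", "十", ""};
    std::string result;
    unsigned divisor = 1000;
    for (const char* unit : units) {
        const unsigned digit = group / divisor % 10;
        // 一 is written only in the ones place, not before 十, 百 or 千.
        if (digit > 1 || (digit == 1 && divisor == 1))
            result += digits[digit];
        if (digit > 0)
            result += unit;
        divisor /= 10;
    }
    return result;
}

// Nonnegative values only, to keep the sketch short.
std::string to_kanji(std::uint64_t number)
{
    if (number == 0)
        return "ゼロ";
    static const std::array<const char*, 5> markers = {
        "", "万", "億", "兆", "京",
    };
    std::string result;
    std::uint64_t mag = 10000ull * 10000 * 10000 * 10000;  // 10^16
    for (std::size_t i = markers.size(); i-- > 0; mag /= 10000) {
        const auto group = static_cast<unsigned>(number / mag);
        if (group > 0) {
            result += group_kanji(group);
            result += markers[i];
        }
        number %= mag;
    }
    return result;
}
```

A romanizer or kana converter could then work from this output or, more simply, from the same digit and unit tables.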

I would also recommend adding support for breaking up the romaji sequences as reading very large numbers in romaji with no spaces is not at all fun. It could be an optional flag. A natural place to add spaces would be between magnitudes at least, or maybe between all of the various groupings you've already put together (ones, tens, hundreds, etc.).
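The optional spacing could look roughly like this. The sketch assumes the caller has already produced one reading per four-digit group (marker included), and it assumes the apostrophe rule is only needed when the pieces are written together. The name join_readings is hypothetical:

```cpp
#include <string>
#include <vector>

// True for the characters the question treats as vowels ('y' included).
bool is_vowel(char c)
{
    return std::string("aeiouy").find(c) != std::string::npos;
}

// Hypothetical helper: joins per-group readings, either separated by spaces
// or run together with the n + vowel apostrophe rule applied.
std::string join_readings(const std::vector<std::string>& pieces, bool spaced)
{
    std::string result;
    for (const std::string& piece : pieces) {
        if (piece.empty())
            continue;  // groups that are zero contribute nothing
        if (!result.empty()) {
            if (spaced)
                result += ' ';
            else if (result.back() == 'n' && is_vowel(piece.front()))
                result += '\'';
        }
        result += piece;
    }
    return result;
}
```

Called with the three groups of the question's own example and spaced set to true, it reproduces the spaced reading given there.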

As for using romaji to avoid encoding issues, I recommend biting the bullet and learning how to support Unicode correctly, as it will be extremely useful.

Question 12

Insightful. Although romanization is nontrivial AFAIK since one kanji can have multiple readings ... Spaces in between romaji is also a good idea.

Question 13

@L.F. romanization doesn't have to support the entirety of Japanese, as this code is number focused, and you've already decided on the readings for each of the kanji involved anyway (or rather; there's only one right answer for these in this context); and that means katakana and hiragana phonetic readings are just as easy at that point.

Question 14

I know that in English each word would be separate, whereas the program prints all of the number words merged together. Is this how it is actually written in Japanese?

Since the code might be useful in many places, it might be better to put the translation code in a class.

Use of Vertical Space
Generally code is easier to read and maintain when only one value is on a line. This would apply to the initialization of the vectors and the switch statement in the function is_vowel(). For maintenance reasons it is much easier to insert a line where it needs to be than it is to add a value to a comma separated list.

is_vowel function
There would be less code if the vowels were in a std::map rather than a switch statement. Here are discussions on stack overflow and software engineering.

This portion of the answer has been modified to remove the statement that there might be a performance improvement using std::map. If map used a simple index into an array that might be true, however it is not a simple index into an array.
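If the goal is simply less code, a lookup in a short string is one option that needs no node-based container. A minimal sketch, assuming C++17 for std::string_view; no claim is made about performance relative to the switch:

```cpp
#include <string_view>

// Same set as the question's switch: the five vowels plus 'y'.
constexpr bool is_vowel(char c)
{
    constexpr std::string_view vowels = "aeiouy";
    return vowels.find(c) != std::string_view::npos;
}

// The function stays usable in constant expressions.
static_assert(is_vowel('y') && !is_vowel('n'));
```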

Assert
Assert statements are generally used for debugging purposes and terminate the program. Assert statements may be removed when the code is compiled without debugging as well. I don't expect to see asserts in production level code because it implies that the code is not yet debugged.

Question 15

"There would be less code and the performance might be better if the vowels were in a std::map rather than a switch statement." - do you have anything to support a claim that std::map is faster than a raw switch statement? Because I very much doubt that.

Question 16

^ Agreed with Tomas. Provide proof. std::map is dynamically-allocated, node-based, and definitely not cache friendly

Question 17

"Assert statements are generally used for debugging purposes and throw an exception." - NO. Assert statements exist to test preconditions. Exceptions have nothing to do with preconditions. This answer is harmful.

Question 18

Failed assert()s do not throw - they terminate the program.
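To illustrate the distinction: an assert documents a promise made by the calling code, while input coming from outside the program is checked with ordinary control flow. A small sketch with hypothetical names (digit_reading, reading_for_input), not taken from the question:

```cpp
#include <cassert>
#include <optional>
#include <string>

// Internal helper: callers promise 0 <= digit <= 9, so a violation is a
// programming error and is checked with assert.
const char* digit_reading(int digit)
{
    static const char* const readings[] = {
        "zero", "ichi", "ni", "san", "yon",
        "go", "roku", "nana", "hachi", "kyuu",
    };
    assert(0 <= digit && digit <= 9);
    return readings[digit];
}

// Boundary with the outside world: bad input is expected and recoverable,
// so it is reported to the caller instead of asserted.
std::optional<std::string> reading_for_input(const std::string& text)
{
    if (text.size() != 1 || text[0] < '0' || text[0] > '9')
        return std::nullopt;
    return std::string(digit_reading(text[0] - '0'));
}
```

Here the if statement guarantees the precondition, so the assert can only fire if someone later calls digit_reading incorrectly from elsewhere.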

Question 19

"It is generally better to use if statements that provide error messages rather than assert statements." - This is still severely wrong and harmful. They have completely different use cases. Assert -> precondition check, contract breakage, unrecoverable error. if statement + error -> recoverable error, can be usually handled by a human.

Question 20

Your use of std::vector<std::string> for the arrays of string constants is wasteful. Both std::vector and std::string are dynamic types that can potentially allocate. A much more lightweight choice would be constexpr std::array<const char*, N> or constexpr std::array<std::string_view, N>.
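For example, two of the tables could be declared like this (a sketch assuming C++17; the sizes have to be spelled out or deduced, since std::array carries its length in the type):

```cpp
#include <array>
#include <string_view>

// Compile-time tables: no allocation and no dynamic initialization.
constexpr std::array<std::string_view, 10> ones = {
    "", "ichi", "ni", "san", "yon",
    "go", "roku", "nana", "hachi", "kyuu",
};
constexpr std::array<std::string_view, 5> magnitudes = {
    "", "man", "oku", "choo", "kee",
};

// Entries can be checked at compile time.
static_assert(ones[4] == "yon");
static_assert(magnitudes.size() == 5);
```

Functions that currently take const std::string&, such as push, would then need to accept std::string_view instead.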

Question 21

Fair point. string is probably not dynamic in this case due to SSO, but vector is quite wasteful.

Question 22

Still, I like using the most minimal tool for the job whenever possible. Even if the constants are going to fit in SSO, it's good practice to use std::string_view or const char*. The compiler will also very likely be able to optimize better.

Roland Illig · Accepted Answer · 2019-07-29 05:17:49Z


