Transliterate wide-character input

Question 1

One shortcoming of GNU's implementation of the tr utility is that it operates on bytes rather than characters. As the man page says under BUGS:

Full support is available only for safe single-byte locales, in which every possible input byte represents a single character.

I've created a limited replacement that works with wide characters according to the inherited locale.

Known limitations

None of the options of tr are supported, such as -d to delete matched characters instead of replacing them, or -c to complement the match set. These two would be fairly easy to implement; the -s option to squeeze successive matches less so.
The "from" and "to" arguments must match in length, and there's no support for ranges, character classes or octal notation. Notably, this prevents us matching or substituting null characters.
Replacement is by wide-character code-point. So there's no normalisation of combining accents, and if your wide-character coding uses multi-char sequences (hello, UTF-16) then it's not usable for those.

These limitations are acceptable for my use-cases, so I don't need to change anything.

The code

I'll start with the core function:

#include <map>
#include <ranges>
#include <stdexcept>
#include <string_view>
#include <utility>
// Returns a function object that converts any character present in
// "from" to the corresponding character in "to".
auto make_substitutor(std::wstring_view from, std::wstring_view to)
{
 if (from.size() != to.size()) {
 throw std::invalid_argument("Replacement length mismatch");
 }
 auto map = std::views::zip(from, to) | std::ranges::to<std::map>();
 return [map=std::move(map)](wchar_t c) {
 auto it = map.find(c);
 return it == map.end() ? c : it->second;
 };
}

I have unit tests for it:

#include <gtest/gtest.h>
#include <algorithm>
#include <string>
auto tr_string(auto const& tr, std::wstring s)
{
 std::ranges::transform(s, s.begin(), tr);
 return s;
}
TEST(tr, bad_args)
{
 EXPECT_THROW(make_substitutor(L"abcde", L"bcde"), std::invalid_argument);
}
TEST(tr, noop)
{
 auto const tr = make_substitutor(L"", L"");
 EXPECT_EQ(tr_string(tr, L""), L"");
 EXPECT_EQ(tr_string(tr, L"hello"), L"hello");
}
TEST(tr, english)
{
 auto const tr = make_substitutor(L"ehlo", L"ipza");
 EXPECT_EQ(tr_string(tr, L""), L"");
 EXPECT_EQ(tr_string(tr, L"hello"), L"pizza");
}
TEST(tr, greek)
{
 auto const tr = make_substitutor(L"αβγδεζηθικλμνξοπρσςτυφχψω",
 L"ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΣΤΥΦΧΨΩ");
 EXPECT_EQ(tr_string(tr, L"Γεια σας"), L"ΓΕΙΑ ΣΑΣ");
}

And a main() that makes it a complete program:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <locale>
#include <cstdlib>
#include <cstring>
#include <print>
// Convert from platform string to wide string
static std::wstring arg_to_wstring(const char* arg)
{
 std::wstring s(std::strlen(arg), '0円'); // likely over-allocation
 auto len = std::mbstowcs(s.data(), arg, s.size());
 if (len == -1uz) {
 throw std::invalid_argument("Malformed multibyte string");
 }
 s.resize(len);
 return s;
}
int main(int argc, char **argv)
{
 if (!std::setlocale(LC_ALL, "")) {
 println(std::cerr, "Invalid locale settings");
 return EXIT_FAILURE;
 }
 if (argc != 3) {
 println(std::cerr, "Usage: {} from_chars to_chars", *argv ? *argv : "tr");
 return EXIT_FAILURE;
 }
 try {
 std::wstring const from = arg_to_wstring(argv[1]);
 std::wstring const to = arg_to_wstring(argv[2]);
 std::wcin.exceptions(std::ios::badbit | std::ios::failbit);
 std::wcout.exceptions(std::ios::badbit | std::ios::failbit);
 std::transform(std::istreambuf_iterator<wchar_t>(std::wcin),
 std::istreambuf_iterator<wchar_t>(),
 std::ostreambuf_iterator<wchar_t>(std::wcout),
 make_substitutor(from, to));
 } catch (std::exception& e) {
 println(std::cerr, "{}", e.what());
 return EXIT_FAILURE;
 }
}

Quick demo

$ ./wchar-tr αβγδεζηθικλμνξοπρσςτυφχψω ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΣΤΥΦΧΨΩ <<<'Γεια σας'
ΓΕΙΑ ΣΑΣ

Question 2

Well, that's funny! I thank you for the very honest "known limitations" discussion. As I was reading that I thought to myself that I occasionally use tr -d X for some unwanted X, but far and away the most common invocation is tr A-Z a-z. And low and behold, the "quick demo" immediately goes, in inconvenient form, to exactly that for Γεια σου κόσμο!.

Question 3

@J_H, I cheated a bit, removing the accent from Γειά for a simpler test.

Question 4

BTW, @indi, how is the wide-character concept invented to accommodate Windows? My understanding is that Unix systems introduced wchar_t to support (mainly east Asian) character sets, later also used for Unicode code-points. Was Windows even a thing in the 1980s?

Question 5

Actually, I should have said 1970s, given that e.g. JIS X 0208 dates from 1978, and others are probably earlier (ISO 2022 was published in 1971).

Question 6

Eh, it was overstating to say it was created for Windows, but it has never served any major real-world usage other than Windows. You mentioned ISO 2022, for example: that is a shift-style encoding that uses bytes (either 7 or 8, I can’t recall), so it would be done with char, not wchar_t. Whatever the ancient origins of wchar_t, its standardization has been entirely dictated by "what Windows needs", and even Microsoft admits they have made it a "unique burden".

Question 7

Make a function to transliterate streams

There's a nice make_substitutor() function, but the rest of the code all lives in main(). Why not make a function that takes two std::wstrings, a std::istream and a std::ostream, and that does the transliteration? Ideally you'd be able to write:

int main(int argc, char **argv)
{
 ...
 std::wstring const from = arg_to_wstring(argv[1]);
 std::wstring const to = arg_to_wstring(argv[2]);
 std::wcin.exceptions(std::ios::badbit | std::ios::failbit);
 std::wcout.exceptions(std::ios::badbit | std::ios::failbit);
 transliterate(from, std::wcin, to, std::wcout);
}

Make it more generic

Your code is a straightforward extension of tr's handling of chars to wchar_ts. But what if you want to handle any of the other standard character types? Or for that matter, non-standard types? You can easily turn this into a template that would make it much more flexible.

How to store the map

You are using a std::map() to store the mapping, but this has a \$O(\log N)\$ lookup cost, where \$N\$ is the length of the from string. Also, a std::map here has a lot of overhead, as each entry in the map is allocated separately on the heap.

Some alternatives available in the standard library are:

std::unordered_map for \$O(1)\$ lookup complexity
std::flat_map for storing the map in a single contiguous memory region, which improves cache efficiency
std::vector indexed by character values for just the unicode range spanned by the from string. If that's not too large, it should be even faster than a std::unordered_map

Question 8

Oh, I meant to comment on the choice of std::map. There's probably a size where the overhead of std::unordered_map pays off, but I haven't benchmarked to find where that is. std::vector is an interesting choice, given that on my (Unicode) platform that would be 68 GB (0-0x10FFFF ✕ 4 bytes/codepoint) and I only have about 50 GB of virtual memory to share amongst all processes. I'm less familiar with std::flat_map; it seems that it's a way to use the two parallel strings without directly converting to pairs?

Question 9

You wouldn't add all Unicode code points to the vector, only those spanned by the lowest one in from to the highest one. So unless you have both 0x0 and 0x10FFFF in it, you wouldn't use all 68 GB. As for std::flat_map, that's just a sorted vector of pairs, with the interface of std::map, and it just uses binary search for lookups and keeps the vector sorted on inserts.

Question 10

It seems that the flat-map is a good choice for a lookup table that's created once and never modified. Lookup should be cheaper than std::map, and also scales as O(log n) in the length of from; we could use an unordered map when we exceed whatever threshold makes that faster for lookup (possibly larger than one might pass in process arguments?). If we're really clever, we might use a table for one or more subranges, when the from values are in clusters. All that can be transparent if we return a std::function to erase the type.

Question 11

doc comment

// Returns a function object that converts any character present in
// "from" to the corresponding character in "to".
auto make_substitutor(std::wstring_view from, std::wstring_view to)

nit: I don't get the sense that doxygen or similar approaches to publishing documentation would do a great job with that. However the contract is admirably clear and concise.

time complexity

I am willing to believe that map.find(c) completes in \$O(1)\$ constant time.

Overall it looks good. Ship it!

Question 12

Lookup scales as O(n), where n is the length of the "from" string. If it gets really large, it may be advantageous to switch over to std::unordered_map - but I haven't benchmarked to find that inflection point.

Question 13

Yeah, that was my concern. So for twenty-ish letters we could call that "constant". But if we ever introduce "complement" or similar, then it seems we'd want a hash table lookup.

G. Sliepen G. Sliepen 68.6k3 gold badges73 silver badges178 bronze badges · Accepted Answer · 2025-03-02 19:04:06Z

Make a function to transliterate streams

There's a nice make_substitutor() function, but the rest of the code all lives in main(). Why not make a function that takes two std::wstrings, a std::istream and a std::ostream, and that does the transliteration? Ideally you'd be able to write:

int main(int argc, char **argv)
{
 ...
 std::wstring const from = arg_to_wstring(argv[1]);
 std::wstring const to = arg_to_wstring(argv[2]);
 std::wcin.exceptions(std::ios::badbit | std::ios::failbit);
 std::wcout.exceptions(std::ios::badbit | std::ios::failbit);
 transliterate(from, std::wcin, to, std::wcout);
}

Make it more generic

Your code is a straightforward extension of tr's handling of chars to wchar_ts. But what if you want to handle any of the other standard character types? Or for that matter, non-standard types? You can easily turn this into a template that would make it much more flexible.

How to store the map

You are using a std::map() to store the mapping, but this has a \$O(\log N)\$ lookup cost, where \$N\$ is the length of the from string. Also, a std::map here has a lot of overhead, as each entry in the map is allocated separately on the heap.

Some alternatives available in the standard library are:

std::unordered_map for \$O(1)\$ lookup complexity
std::flat_map for storing the map in a single contiguous memory region, which improves cache efficiency
std::vector indexed by character values for just the unicode range spanned by the from string. If that's not too large, it should be even faster than a std::unordered_map

Oh, I meant to comment on the choice of std::map. There's probably a size where the overhead of std::unordered_map pays off, but I haven't benchmarked to find where that is. std::vector is an interesting choice, given that on my (Unicode) platform that would be 68 GB (0-0x10FFFF ✕ 4 bytes/codepoint) and I only have about 50 GB of virtual memory to share amongst all processes. I'm less familiar with std::flat_map; it seems that it's a way to use the two parallel strings without directly converting to pairs?
You wouldn't add all Unicode code points to the vector, only those spanned by the lowest one in from to the highest one. So unless you have both 0x0 and 0x10FFFF in it, you wouldn't use all 68 GB. As for std::flat_map, that's just a sorted vector of pairs, with the interface of std::map, and it just uses binary search for lookups and keeps the vector sorted on inserts.
It seems that the flat-map is a good choice for a lookup table that's created once and never modified. Lookup should be cheaper than std::map, and also scales as O(log n) in the length of from; we could use an unordered map when we exceed whatever threshold makes that faster for lookup (possibly larger than one might pass in process arguments?). If we're really clever, we might use a table for one or more subranges, when the from values are in clusters. All that can be transparent if we return a std::function to erase the type.

Stack Exchange Network

Transliterate wide-character input

Known limitations

The code

Quick demo

2 Answers 2

Make a function to transliterate streams

Make it more generic

How to store the map

doc comment

time complexity

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Transliterate wide-character input

Known limitations

The code

Quick demo

2 Answers 2

Make a function to transliterate streams

Make it more generic

How to store the map

doc comment

time complexity

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions