One shortcoming of GNU's implementation of the tr
utility is that it operates on bytes rather than characters. As the man page says under BUGS:
Full support is available only for safe single-byte locales, in which every possible input byte represents a single character.
I've created a limited replacement that works with wide characters according to the inherited locale.
Known limitations
- None of the options of
tr
are supported, such as-d
to delete matched characters instead of replacing them, or-c
to complement the match set. These two would be fairly easy to implement; the-s
option to squeeze successive matches less so. - The "from" and "to" arguments must match in length, and there's no support for ranges, character classes or octal notation. Notably, this prevents us matching or substituting null characters.
- Replacement is by wide-character code-point. So there's no normalisation of combining accents, and if your wide-character coding uses multi-char sequences (hello, UTF-16) then it's not usable for those.
These limitations are acceptable for my use-cases, so I don't need to change anything.
The code
I'll start with the core function:
#include <map>
#include <ranges>
#include <stdexcept>
#include <string_view>
#include <utility>
// Returns a function object that converts any character present in
// "from" to the corresponding character in "to".
auto make_substitutor(std::wstring_view from, std::wstring_view to)
{
if (from.size() != to.size()) {
throw std::invalid_argument("Replacement length mismatch");
}
auto map = std::views::zip(from, to) | std::ranges::to<std::map>();
return [map=std::move(map)](wchar_t c) {
auto it = map.find(c);
return it == map.end() ? c : it->second;
};
}
I have unit tests for it:
#include <gtest/gtest.h>
#include <algorithm>
#include <string>
auto tr_string(auto const& tr, std::wstring s)
{
std::ranges::transform(s, s.begin(), tr);
return s;
}
TEST(tr, bad_args)
{
EXPECT_THROW(make_substitutor(L"abcde", L"bcde"), std::invalid_argument);
}
TEST(tr, noop)
{
auto const tr = make_substitutor(L"", L"");
EXPECT_EQ(tr_string(tr, L""), L"");
EXPECT_EQ(tr_string(tr, L"hello"), L"hello");
}
TEST(tr, english)
{
auto const tr = make_substitutor(L"ehlo", L"ipza");
EXPECT_EQ(tr_string(tr, L""), L"");
EXPECT_EQ(tr_string(tr, L"hello"), L"pizza");
}
TEST(tr, greek)
{
auto const tr = make_substitutor(L"αβγδεζηθικλμνξοπρσςτυφχψω",
L"ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΣΤΥΦΧΨΩ");
EXPECT_EQ(tr_string(tr, L"Γεια σας"), L"ΓΕΙΑ ΣΑΣ");
}
And a main()
that makes it a complete program:
#include <algorithm>
#include <iostream>
#include <iterator>
#include <locale>
#include <cstdlib>
#include <cstring>
#include <print>
// Convert from platform string to wide string
static std::wstring arg_to_wstring(const char* arg)
{
std::wstring s(std::strlen(arg), '0円'); // likely over-allocation
auto len = std::mbstowcs(s.data(), arg, s.size());
if (len == -1uz) {
throw std::invalid_argument("Malformed multibyte string");
}
s.resize(len);
return s;
}
int main(int argc, char **argv)
{
if (!std::setlocale(LC_ALL, "")) {
println(std::cerr, "Invalid locale settings");
return EXIT_FAILURE;
}
if (argc != 3) {
println(std::cerr, "Usage: {} from_chars to_chars", *argv ? *argv : "tr");
return EXIT_FAILURE;
}
try {
std::wstring const from = arg_to_wstring(argv[1]);
std::wstring const to = arg_to_wstring(argv[2]);
std::wcin.exceptions(std::ios::badbit | std::ios::failbit);
std::wcout.exceptions(std::ios::badbit | std::ios::failbit);
std::transform(std::istreambuf_iterator<wchar_t>(std::wcin),
std::istreambuf_iterator<wchar_t>(),
std::ostreambuf_iterator<wchar_t>(std::wcout),
make_substitutor(from, to));
} catch (std::exception& e) {
println(std::cerr, "{}", e.what());
return EXIT_FAILURE;
}
}
Quick demo
$ ./wchar-tr αβγδεζηθικλμνξοπρσςτυφχψω ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΣΤΥΦΧΨΩ <<<'Γεια σας'
ΓΕΙΑ ΣΑΣ
2 Answers 2
Make a function to transliterate streams
There's a nice make_substitutor()
function, but the rest of the code all lives in main()
. Why not make a function that takes two std::wstring
s, a std::istream
and a std::ostream
, and that does the transliteration? Ideally you'd be able to write:
int main(int argc, char **argv)
{
...
std::wstring const from = arg_to_wstring(argv[1]);
std::wstring const to = arg_to_wstring(argv[2]);
std::wcin.exceptions(std::ios::badbit | std::ios::failbit);
std::wcout.exceptions(std::ios::badbit | std::ios::failbit);
transliterate(from, std::wcin, to, std::wcout);
}
Make it more generic
Your code is a straightforward extension of tr
's handling of char
s to wchar_t
s. But what if you want to handle any of the other standard character types? Or for that matter, non-standard types? You can easily turn this into a template that would make it much more flexible.
How to store the map
You are using a std::map()
to store the mapping, but this has a \$O(\log N)\$ lookup cost, where \$N\$ is the length of the from
string. Also, a std::map
here has a lot of overhead, as each entry in the map is allocated separately on the heap.
Some alternatives available in the standard library are:
std::unordered_map
for \$O(1)\$ lookup complexitystd::flat_map
for storing the map in a single contiguous memory region, which improves cache efficiencystd::vector
indexed by character values for just the unicode range spanned by thefrom
string. If that's not too large, it should be even faster than astd::unordered_map
-
\$\begingroup\$ Oh, I meant to comment on the choice of
std::map
. There's probably a size where the overhead ofstd::unordered_map
pays off, but I haven't benchmarked to find where that is.std::vector
is an interesting choice, given that on my (Unicode) platform that would be 68 GB (0-0x10FFFF ✕ 4 bytes/codepoint) and I only have about 50 GB of virtual memory to share amongst all processes. I'm less familiar withstd::flat_map
; it seems that it's a way to use the two parallel strings without directly converting to pairs? \$\endgroup\$Toby Speight– Toby Speight2025年03月02日 20:16:11 +00:00Commented Mar 2 at 20:16 -
2\$\begingroup\$ You wouldn't add all Unicode code points to the vector, only those spanned by the lowest one in
from
to the highest one. So unless you have both 0x0 and 0x10FFFF in it, you wouldn't use all 68 GB. As forstd::flat_map
, that's just a sorted vector of pairs, with the interface ofstd::map
, and it just uses binary search for lookups and keeps the vector sorted on inserts. \$\endgroup\$G. Sliepen– G. Sliepen2025年03月02日 22:30:54 +00:00Commented Mar 2 at 22:30 -
1\$\begingroup\$ It seems that the flat-map is a good choice for a lookup table that's created once and never modified. Lookup should be cheaper than
std::map
, and also scales as O(log n) in the length offrom
; we could use an unordered map when we exceed whatever threshold makes that faster for lookup (possibly larger than one might pass in process arguments?). If we're really clever, we might use a table for one or more subranges, when thefrom
values are in clusters. All that can be transparent if we return astd::function
to erase the type. \$\endgroup\$Toby Speight– Toby Speight2025年03月06日 07:30:51 +00:00Commented Mar 6 at 7:30
doc comment
// Returns a function object that converts any character present in
// "from" to the corresponding character in "to".
auto make_substitutor(std::wstring_view from, std::wstring_view to)
nit: I don't get the sense that doxygen or similar approaches to publishing documentation would do a great job with that. However the contract is admirably clear and concise.
time complexity
I am willing to believe that map.find(c)
completes in
\$O(1)\$ constant time.
Overall it looks good. Ship it!
-
\$\begingroup\$ Lookup scales as O(n), where n is the length of the "from" string. If it gets really large, it may be advantageous to switch over to
std::unordered_map
- but I haven't benchmarked to find that inflection point. \$\endgroup\$Toby Speight– Toby Speight2025年03月02日 20:22:56 +00:00Commented Mar 2 at 20:22 -
1\$\begingroup\$ Yeah, that was my concern. So for twenty-ish letters we could call that "constant". But if we ever introduce "complement" or similar, then it seems we'd want a hash table lookup. \$\endgroup\$J_H– J_H2025年03月02日 20:37:17 +00:00Commented Mar 2 at 20:37
tr -d X
for some unwanted X, but far and away the most common invocation istr A-Z a-z
. And low and behold, the "quick demo" immediately goes, in inconvenient form, to exactly that for Γεια σου κόσμο!. \$\endgroup\$wchar_t
to support (mainly east Asian) character sets, later also used for Unicode code-points. Was Windows even a thing in the 1980s? \$\endgroup\$char
, notwchar_t
. Whatever the ancient origins ofwchar_t
, its standardization has been entirely dictated by "what Windows needs", and even Microsoft admits they have made it a "unique burden". \$\endgroup\$