
Intro

This snippet is a translation of the Java version.

Code

string_utils.hpp:

#ifndef IO_GITHUB_CODERODDE_STRING_UTILS_HPP
#define IO_GITHUB_CODERODDE_STRING_UTILS_HPP

#include <unordered_map>
#include <string_view>

namespace io::github::coderodde::string::utils {

    template <typename Char>
    constexpr bool are_isomorphic(const std::basic_string_view<Char>& s,
                                  const std::basic_string_view<Char>& z) {
        if (s.length() != z.length()) {
            return false;
        }

        std::unordered_map<Char, Char> mapping1;
        std::unordered_map<Char, Char> mapping2;

        for (std::size_t i = 0; i < s.length(); ++i) {
            Char chs = s[i];
            Char chz = z[i];

            if (!mapping1.contains(chs)) {
                mapping1[chs] = chz;
            }

            if (!mapping2.contains(chz)) {
                mapping2[chz] = chs;
            }

            if (mapping1[chs] != chz || mapping2[chz] != chs) {
                return false;
            }
        }

        return true;
    }
} // namespace io::github::coderodde::string::utils

#endif // IO_GITHUB_CODERODDE_STRING_UTILS_HPP

main.cpp:

#include "string_utils.hpp"
#include <string>
#include <iostream>

using io::github::coderodde::string::utils::are_isomorphic;
using std::string_view;
using std::wstring_view;
using std::u16string_view;

int main() {
    std::cout << are_isomorphic(std::string_view{ "egg" }, std::string_view{ "add" }) << '\n';
    std::cout << are_isomorphic(std::u16string_view{ u"paper" }, std::u16string_view{ u"title" }) << '\n';
    std::cout << are_isomorphic(std::wstring_view{ L"ab" }, std::wstring_view{ L"cc" }) << '\n';
    return 0;
}

Critique request

My only concern here is whether string_utils.hpp adheres to the principles of modern C++.

asked Sep 11 at 10:34
  • An answer on the Java version of the question (Checking for string isomorphism in Java) proposes a single mapping: on a non-match for chs, check that there's no mapping for chz; otherwise something else already maps to this chs. That's similar to what @indi is proposing with a mapping using a vector of pairs, although even more efficient would be a std::array<Char, 256> which you index for one map, and linear-search for the other direction (with memchr to take advantage of libc's hand-written SIMD asm). But only with an 8-bit Char. Commented Sep 12 at 3:42

3 Answers

namespace io::github::coderodde::string::utils {

I know this is ported from a Java solution, but in C++ we don’t really use the "reversed domain name" notation for namespacing. Because of that, things that would probably not be a problem in Java become problematic. In particular, the top-level identifier io is extremely likely to be used for something else. (Also, what would you do if you had a .int domain?)

If you really must use the reversed domain name for your namespace, the better way to do it in C++ would be:

namespace io_github_coderodde::string::utils {

But frankly, there is pretty much zero chance of just coderodde colliding with anything else, so I’d just do namespace coderodde::string::utils. (I’d probably also ditch the utils sub-namespace, because what else would be in a string namespace other than string utilities? String algorithms and/or string concepts?... sure, but are you really going to need separate namespaces for those?)

 template <typename Char>
 constexpr bool are_isomorphic(const std::basic_string_view<Char>& s,
                               const std::basic_string_view<Char>& z) {

Really not a fan of indenting at the namespace level, not least because it will change depending on how you write the namespace:

namespace io::github::coderodde::string::utils {
 // indent one level, but...
namespace io {
namespace github {
namespace coderodde {
namespace string {
namespace utils {
 // indent five levels? for the exact same thing?

You may argue that it would be silly to break up a namespace preamble as above, and for that extreme case it would be, but in reality it often happens organically. Very often you are in a namespace and have to open and close a sub-namespace—like a detail namespace—or close the current namespace to refer to something higher up—like what you would have to do to add std::hash support to std::chrono::duration.

It also just wastes at least one indentation level’s worth of horizontal space, for no good reason, because all code should be in a namespace.

Anywho, as I’m sure you’re aware, basic_string_view has two template parameters: the character and the character traits. If you want to ignore the character traits, that’s fine—I wouldn’t blame you, they’re a bad idea that should be obsoleted—but you should at least document that choice.

And finally, never take views by reference, const or otherwise. That is not what views are about. Views are defined to be lightweight and copyable references to other data. Re-referencing them is redundant.
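To make the point concrete, here is a minimal sketch of the by-value style (the helper function and its name are made up purely for illustration):

```cpp
#include <cstddef>
#include <string_view>

// Views are just a pointer and a length; copying one is as cheap as
// passing a reference, and often cheaper. So take them by value.
// (has_repeats is a hypothetical helper, not part of the reviewed code.)
bool has_repeats(std::string_view sv)   // by value, not const&
{
    for (std::size_t i = 1; i < sv.size(); ++i)
        if (sv[i] == sv[i - 1])
            return true;
    return false;
}
```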

 std::unordered_map<Char, Char> mapping1;
 std::unordered_map<Char, Char> mapping2;

This strikes me as a pretty good case study of something I observe in a lot of computer science-y types: for them, the big-O asymptotic behaviour of an algorithm is everything, to the point where it completely blinds the coder to some otherwise pretty obvious realities. I mean, "obviously" a hash map will be faster than a non-hashed map, because the former has \$O(1)\$ lookup while the latter has \$O(\log n)\$, right? Right?

No. Not right.

Think about it. What do you think the hash of a character—any character, of any character type that would be legal to use in a basic_string_view—would be?

Would it not be... just the character?

That means the hash is "perfect" and free, but it also means that you are paying for all the hashing machinery and bookkeeping for no reason at all. Without even thinking about the structure of a non-hashed map versus a hash map, it should be painfully obvious that if a hash is just "the thing", then a plain map of "the thing" will be more efficient than something that has to have all the extra hashing machinery.

So even without leaving the theoretical domain, it should be pretty obvious that a map without hashing will be more efficient than a map with hashing, when the hash function is the identity operation.

But now we need to leave the theoretical realm and consider the reality of modern computer hardware, the expected sizes of the data you will be working with this function, and what you are actually doing with that data.

There is a very famous experiment that Bjarne Stroustrup did, and mentioned in a talk years ago, where he measured the performance of randomly inserting data into a sorted std::list versus a sorted std::vector, and then randomly removing elements from them. (Note: even though both containers were sorted, I believe Stroustrup still used linear searches, rather than binary searches, just to be fair to std::list.) Now, you’d probably think—as Stroustrup did—that while std::vector might dominate with small data sets, surely std::list with its \$O(1)\$ insertion and removal will outperform std::vector... which requires shifting all subsequent elements anytime anything is inserted or removed in the middle. But it turns out that std::vector wins, even up to data sets with hundreds of thousands of elements.

Why? Because of modern computer architecture. Modern computers basically always have caches... often multiple levels of caches, and often caches that are huge, measured in several megabytes. They are also often specially tuned to access data sequentially, using prefetch to get data that it can expect will be accessed next given that you have accessed the previous data in order. Put those facts together, and even though node-based accesses technically have "better" big-O performance, the hidden constants that come from cache miss costs are so huge that simple linear access wins out for virtually all real-world data set sizes.

And let’s think about what you’re actually doing here... you are doing two searches... through two totally different node-based containers. Even if each of those were faster than a simple linear search through a vector—which, again, Stroustrup’s experiment showed it is not, for data set sizes up into the hundreds of thousands—you’re doing two of them. (Actually, in some cases you’re doing up to six of them. More on that shortly.) And given that you can do a single pass through the mapping... yeah, using node-based containers really doesn’t make a lot of sense.

If you doubt that, then by all means, profile. But I would bet that for any size string you would ever expect to see, a simple, flat std::vector mapping will be much faster.

The algorithm becomes simpler, too! And you can reserve if you want to, to pay only once for an allocation, rather than for every character (twice!!!).

template <typename Char>
constexpr auto are_isomorphic(
    std::basic_string_view<Char> s,
    std::basic_string_view<Char> z) -> bool
{
    if (s.length() != z.length())
        return false;

    auto mapping = std::vector<std::pair<Char, Char>>{};
    // You can reserve here if you want. The tough part is knowing how much
    // to reserve. Reserving s.length() will almost certainly be an
    // over-estimate. I suppose you could use a heuristic, which you can tune.

    for (auto const [cs, cz] : std::views::zip(s, z))
    {
        auto p = std::ranges::begin(mapping);
        auto const end = std::ranges::end(mapping);

        // Go through the mapping, searching for *either* cs *or* cz.
        // Note that this loop could be parallelized, or even vectorized!
        // You could just use a parallel algorithm instead of a raw loop.
        //
        // You should probably use an algorithm in any case, but I thought
        // that would make the algorithm less clear. The *correct* way to
        // do this would probably be to use std::ranges::find_if() (and also
        // maybe std::ranges::for_each() for the outer loop). How, though?
        // I'll say that's an "exercise for the reader". (It's not hard!)
        for (; p != end; ++p)
        {
            // If either cs or cz is found, check the other half of the
            // mapping. If it doesn't match, the strings are not isomorphic.
            // If it does match, we're done looping through the mapping.
            if (std::get<0>(*p) == cs)
            {
                if (std::get<1>(*p) != cz)
                    return false;
                else
                    break;
            }

            if (std::get<1>(*p) == cz)
            {
                if (std::get<0>(*p) != cs)
                    return false;
                else
                    break;
            }
        }

        // If we got to the end of the mapping without finding either cs or
        // cz, we need to add a new mapping item for them.
        if (p == end)
            mapping.emplace_back(cs, cz);
    }

    return true;
}

If you are expecting really, really long strings, then it might be a good idea to use std::deque rather than std::vector. std::deque is almost std::vector, but where std::vector has to completely reallocate and move all data every time it needs to grow, std::deque does not; it just allocates a new chunk. It’s a little slower, but probably not noticeably so for this use case. Especially when compared to node-based containers.

Back to the code:

 if (!mapping1.contains(chs)) {
     mapping1[chs] = chz;
 }

This is a really, really bad way to go about this. This requires doing up to two runs through the map: first for the .contains(), and possibly second for the insert. This is what .insert_or_assign() or .try_emplace() were made for. What you probably want to do here is to try-emplace the new data, and if that returns that the data already exists, then do the check:

auto const [it1, inserted1] = mapping1.try_emplace(chs, chz);

Then later you do:

 if (mapping1[chs] != chz || mapping2[chz] != chs) {
     return false;
 }

... which requires running through the mapping again, even though you already had the mapping item you wanted! Either it was found within .contains(), or created in the assignment... but either way, you threw that information away, and now have to search through the map again.

Instead, because you already have the iterator to the mapping item you are interested in—whether it was newly inserted or not—you should use it:

if (std::get<1>(*it1) != chz or std::get<1>(*it2) != chs)
    return false;

So, altogether:

auto const [it1, inserted1] = mapping1.try_emplace(chs, chz);
auto const [it2, inserted2] = mapping2.try_emplace(chz, chs);
if (std::get<1>(*it1) != chz or std::get<1>(*it2) != chs)
    return false;

Or to avoid an unnecessary romp through the second map when you already know the first one failed the test:

// Basically: if chs was already in the map (not inserted), but its partner
// was not chz, we're done.
if (auto const [p, inserted] = mapping1.try_emplace(chs, chz); not inserted and std::get<1>(*p) != chz)
    return false;

// Same for the reverse map.
if (auto const [p, inserted] = mapping2.try_emplace(chz, chs); not inserted and std::get<1>(*p) != chs)
    return false;

Either way, we get, at max, a single pass through the two maps (and possibly only a single pass through one of them).

But again, a single pass through a flat bi-mapped vector would probably be much, much faster than two passes through node-based maps (whether unnecessarily hashed or not).

QUICK EDIT: Ah, one thing I forgot I wanted to mention: if you really want to follow best practices, then you should have an overload that takes an allocator, which you can use in your mappings. That gives the user control over memory allocation... which is crucial for me in many of the domains I work in. You can still have an overload that just takes the two string views and uses the default standard allocator, for ergonomic purposes.

Extension

Implicit string views

I want to demonstrate how to solve a problem that comes up all the time when trying to write string algorithms... and also came up here, but I didn’t notice it in the original review.

Here is the problem: You want to write a function that can work with strings—or more specifically, views of strings, because you only want to inspect the string data, not take it or change it. For plain old char strings, this is trivial:

auto f(std::string_view sv) { ... }

This will automatically work for anything string-like... provided the character type is char:

f("NUL terminated char arrays work");
f("std::strings work"s);
f("std::string_views work (obv)"sv);
f(my_custom_string_type{"works so long as it is implicitly convertible to std::string_view (which anything 'string-like' should be)"});

Great!

Okay, but now you want to support all character types, not just char. So you naturally do this:

template<typename Char>
auto f(std::basic_string_view<Char> sv) { ... }

Only now... everything breaks. Not only do non-char strings (still) not work (except for specializations of basic_string_view)... even the char strings that worked before no longer work:

f("nope");
f("nope"s);
f("well, obviously this works"sv);
f(my_custom_string_type{"nope"});

Your compiler will probably tell you that it could not find a matching function. The only candidate requires a std::basic_string_view<Char> (and then it would deduce Char), and even though everything you threw at it will implicitly convert to a std::basic_string_view<Char>, they are not, in fact, actually std::basic_string_view<Char>.

So how to solve this problem?

As always, the solution is just another level of indirection:

template<typename Char>
auto f(std::basic_string_view<Char> sv) { ... }
template<typename S>
auto f(S&& s) { return f(std::basic_string_view<???>{std::forward<S>(s)}); }

Putting aside the question of what the ??? for a moment, this works. Anything string-like, with any character type, will either be converted to a string view... or will be a string view... so there will be no problem finding a match.

But this is hacky, to say the least. So let’s tighten things up a bit.

The first issue is what the ??? should be. How do you get the character type from... literally anything?

Well... you don’t. You just let std::basic_string_view do the work. There are three possible cases here:

  1. The type is something that std::basic_string_view<Char> can be constructed from directly... which is specifically just Char const*. In that case, Char is obvious.
  2. The type is something that std::basic_string_view<Char> has a converting constructor for... specifically as a contiguous character range. In that case, std::basic_string_view<Char> has a deduction guide that deduces Char.
  3. The type is something that has a conversion operator to std::basic_string_view<Char>... that is, anything string-like that isn’t already a contiguous range. In that case, the type itself will tell you what Char is via its conversion operator.

In any case, we can let class template argument deduction (CTAD) do the work:

template<typename Char>
auto f(std::basic_string_view<Char> sv) { ... }
template<typename S>
auto f(S&& s) { return f(std::basic_string_view{std::forward<S>(s)}); }

But this is still not so great, because the second overload claims it takes literally everything. We clearly need to constrain it a bit.

What we need is a "string-like" concept:

template<typename T>
concept string_like = ???;
template<typename Char>
auto f(std::basic_string_view<Char> sv) { ... }
template<string_like S>
auto f(S&& s) { return f(std::basic_string_view{std::forward<S>(s)}); }

Okay, but how do we define "string-like"?

Well, anything that is implicitly convertible to a basic_string_view must be string-like. So:

template<typename T>
concept string_like = std::convertible_to<T, std::basic_string_view<???>>;
template<typename Char>
auto f(std::basic_string_view<Char> sv) { ... }
template<string_like S>
auto f(S&& s) { return f(std::basic_string_view{std::forward<S>(s)}); }

But again we have the problem... what do we put in the ????

This is a problem we already solved! We don’t put anything there. We just let CTAD do its thing:

template<typename T>
concept string_like = std::convertible_to<T, decltype(std::basic_string_view{std::declval<T>()})>;
template<typename Char>
auto f(std::basic_string_view<Char> sv) { ... }
template<string_like S>
auto f(S&& s) { return f(std::basic_string_view{std::forward<S>(s)}); }

The concept works in three steps:

  1. Use CTAD to deduce a string view type for the given type.
  2. Use decltype to get the type that was deduced.
  3. Make sure T is implicitly convertible to that kind of string view.

(We ultimately do an explicit conversion in the function itself, but we require that the type is implicitly convertible. That is on purpose. Anything that is only explicitly convertible to a string view may not be all that string-like. But anything that implicitly converts to a string view must be string-like.)

string_like or something like it is definitely something that should be in your toolkit—or at least you should keep the technique in mind for when you need it.

In the specific case here, it would look like this:

template<typename T>
concept string_like = std::convertible_to<T, decltype(std::basic_string_view{std::declval<T>()})>;
template<typename Char>
constexpr auto are_isomorphic(std::basic_string_view<Char> s1, std::basic_string_view<Char> s2) { ... }
template<string_like S1, string_like S2>
constexpr auto are_isomorphic(S1&& s1, S2&& s2)
{
 return are_isomorphic(std::basic_string_view{std::forward<S1>(s1)}, std::basic_string_view{std::forward<S2>(s2)});
}

Because there are multiple strings involved here, you may also want to confirm that they both have the same character type.

For that, you would need a type trait, like so:

template<string_like S>
using string_char_t = typename decltype(std::basic_string_view{std::declval<S>()})::value_type;

Which you would use to constrain the generic template:

template<string_like S1, string_like S2>
requires std::same_as<string_char_t<S1>, string_char_t<S2>>
constexpr auto are_isomorphic(S1&& s1, S2&& s2)
{
 return are_isomorphic(std::basic_string_view{std::forward<S1>(s1)}, std::basic_string_view{std::forward<S2>(s2)});
}

And now you’re off to the races.

Avoiding linear search

As suggested by @PeterCordes in the comments, if the character size is small enough, you could avoid linear searches for matches by using indexing instead. There is one little wrinkle, which is why this was worth extending the review for, but it’s otherwise pretty straightforward. The following is not at all hand-optimized, but any decent optimizing compiler should have an easy time with it.

template<typename Char>
// Specialize if the bit size of the type is 8 (or less, I suppose).
//
// This is perfectly safe because any type that is legit for use as a
// character in a basic_string_view *must* have a numeric_limits
// specialization.
requires ((std::numeric_limits<Char>::digits + std::numeric_limits<Char>::is_signed) <= 8)
constexpr auto are_isomorphic(
    std::basic_string_view<Char> s,
    std::basic_string_view<Char> z) -> bool
{
    // Get the easy check out of the way first.
    if (s.length() != z.length())
        return false;

    constexpr auto size = std::size_t{std::numeric_limits<std::make_unsigned_t<Char>>::max()} + 1;

    auto mapping = std::array<Char, size>{};
    auto seen = std::bitset<size>{};

    return std::ranges::all_of(std::views::zip(s, z),
        [&mapping, &seen](auto&& t)
        {
            auto const [cs, cz] = t;
            auto const i = static_cast<std::make_unsigned_t<Char>>(cs);

            // If we've already seen the character, and it is *not* mapped to
            // its partner, then we're done.
            //
            // Otherwise, record the mapping.
            if (seen.test(i) and mapping[i] != cz)
            {
                return false;
            }
            else
            {
                mapping[i] = cz;
                seen.set(i);
            }

            // This is the only tricky part!
            //
            // If the target character is "zero", then we can't just search
            // for it in the mapping... because "zero" will be the value set
            // for every character we have not yet seen.
            //
            // When the target character is *not* zero, we just do the easy
            // thing.
            //
            // When the target character *is* zero, we need to loop through
            // all seen mappings, and check them. Unfortunate, but without a
            // sentinel value, it's probably the best we can do. And we can
            // hope that "zero" values are vanishingly rare in strings (though
            // undeniably not so in generic data).
            if (cz != Char{})
            {
                if (auto const p = std::ranges::find(mapping, cz); p != std::ranges::end(mapping))
                {
                    auto const found_index = std::ranges::distance(std::ranges::begin(mapping), p);

                    // In theory we should test that the mapping we just found
                    // is actually a seen mapping.
                    //
                    // However, because we have special-cased "zero", the only
                    // way there can be a match is if this *is* a seen
                    // mapping.
                    //
                    // If we did not zero-fill the mapping at the beginning,
                    // it would have garbage data in it, and so then we would
                    // have to do the check. But we did zero-fill the mapping,
                    // and cz is not "zero" in this branch, so... we're cool.
                    //
                    // So there is only one thing left to check:
                    if (i != found_index)
                        return false;
                }
            }
            else
            {
                // So cz is "zero". That means we can't just search for it in
                // the mapping, because the mapping holds "zero" for every
                // unseen pair.
                //
                // So instead we loop through the seen bitset, and for every
                // mapping that has been seen - other than the one we found or
                // created earlier - we check to see if the mapped value is
                // cz. If it is, then the strings can't be isomorphic.
                for (auto n = 0uz; n != seen.size(); ++n)
                {
                    if (i != n and seen.test(n))
                    {
                        if (mapping[n] == cz)
                            return false;
                    }
                }
            }

            return true;
        }
    );
}

Can this be parallelized or, dare we hope, vectorized?

Maybe!

What you could do is split long strings up into chunks, and for each chunk, generate the array<Char, size> mapping and bitset<size> seen table, then merge them all.

Merging the seen tables is trivial; it’s just merged = seen1 | seen2.

Merging the mapping is a bit trickier. What you want to do is when the mapped element does not match, check that the seen flag is only set in one of the two seen tables—that is, for every position n where mapping1[n] != mapping2[n], seen1.test(n) ^ seen2.test(n) must be true. When the element does match... no biggie; either it was the same mapped pair, or both were never seen (and thus both still "zero"), or one was seen and it happened to be "zero" (which happens to match the unseen case)... all three cases are fine. I’m not really a SIMD guy, but the operations I just described—producing a mask where the mappings don’t match, and then getting their indices and doing the XOR tests with those indices—sure seem doable.
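The merge step just described might be sketched like this for 8-bit characters. merge_chunks and its shape are illustrative only, and only the forward half of the mapping is shown; the reverse map would be merged the same way:

```cpp
#include <array>
#include <bitset>
#include <cstddef>

constexpr std::size_t table_size = 256;

// Merge chunk 2's tables into chunk 1's. Returns false if the chunks
// disagree on a mapping that both of them actually saw.
bool merge_chunks(std::array<unsigned char, table_size>& mapping1,
                  std::bitset<table_size>& seen1,
                  std::array<unsigned char, table_size> const& mapping2,
                  std::bitset<table_size> const& seen2)
{
    for (std::size_t n = 0; n != table_size; ++n)
    {
        if (mapping1[n] != mapping2[n])
        {
            // A disagreement is only allowed when the entry was seen in
            // exactly one of the two chunks.
            if (!(seen1.test(n) ^ seen2.test(n)))
                return false;
            if (seen2.test(n))          // chunk 2 owns this entry
                mapping1[n] = mapping2[n];
        }
    }
    seen1 |= seen2;                     // merging "seen" is trivial
    return true;
}
```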

But all through this section we have not asked the Goldblum question. Sure, this can be done... but...

...

... should it?

In my opinion? No.

Let’s be real. This function, while useful and possibly used in some utility or another, is hardly a massively important, linchpin operation. It probably won’t be called thousands or millions of times a second. And it probably won’t be called on strings that are tens of thousands or millions of characters long.

Even in the situation where everything is stacked in such a way as to give the best possible advantage to the algorithm that avoids the linear search as opposed to the indexing, you’re literally just talking about avoiding the cost of iterating two bytes at a time over 512 bytes (so 256 iterations) that are always right there, hot in the closest cache. And that is the absolute worst case. In the average case, you’d only be iterating two bytes at a time over 256 bytes (128 iterations), and in the expected realistic case, your strings will usually not be 100+ characters long... so the actual amount of iterating will be much, much smaller. I’m actually skeptical that any cost difference would even be measurable above noise. (And remember that even the indexed solution might need to loop too! Granted, it is far less likely, because NUL does not usually appear in strings, but there is actually a theoretically possible worst case with string data that is just a million NUL characters long. The indexed solution would have to do 256 loop iterations for every character, while the linear search would only ever do one.)

Yes, sure, an algorithm should be as efficient as reasonably possible... but the key word there is "reasonably".

I get the allure of being nerd sniped by a deep, low-level optimization problem that could even get you into managing individual registers and flags and farting around with vector operations. But I am not a scientist or mathematician; I am an engineer. And as an engineer, I can’t allow myself to be blinded by the shininess of the problem, because there are important things that are being forgotten by the kinds of people who want to drop everything and dive into problems like this.

Probably the biggest one, is this: nothing is free. Every tiny little bit of extra complication or complexity you add to a problem brings with it an order of magnitude increase in the cost of making your solution real. And by real I mean: actually coded... AND FULLY TESTED... so that it can actually be used in real-world systems.

If you write a single solution to the problem—either like the original code, or any of the versions offered in the reviews—then you need a battery of tests to verify it. So long as you have only a single solution, that same battery of tests should suffice... meaning that you can freely refactor and experiment with tweaks to try and improve the efficiency, without increasing the maintenance burden. You would have working code, that is verifiably tested, and can be improved.

But the moment you add a second solution—one that special-cases for some factor—you not only double the maintenance burden—because now you need a second battery of tests to exercise the new solution—you actually more than double it—because you also need tests to confirm that the second solution is being selected when you think it is, and is not being selected when you don’t want it to be.

And that is the thing that too many people fail to consider: any additional complexity you add to the solution... even if it may increase the efficiency of the actual code... always requires more testing... and that testing very often costs far, far more than any tiny gains you get from twiddling bits in individual registers, however satisfying that may be. I am perpetually horrified by how few people think of testing at all, or at best give it only a cursory thought. As far as I am concerned, any code that is not properly tested is garbage. I don’t care how sexy or efficient it is; if it has not been verified by a battery of tests, it is absolutely useless. I cannot put it in any project I am responsible for. Only a truly incompetent engineer ships untested code. (And yes, I am straight up calling out the 90+% of coders out there who never test their code... many of whom don’t even know how!)

Of course, if you don’t care about producing code for real-world systems... if you’re actually just in it for the intellectual thrill... then by all means, twiddle bits as much as you please. But if you care about being a real, professional programmer, who produces code that gets used in real-world systems, then you might need to tamp down that nerd lust to fart around with a problem until you’ve squeezed every interesting bit out of it, and instead of sexiness or shininess or coolness, make robustness and proven reliability the thing you focus on.

And yes yes yes of course, there are cases where it is worthwhile, and maybe even necessary, to squeeze every last bit of performance out of some bit of code, and where special-casing is not only useful, but possibly even required to get that performance in the key use cases. That is not generally the case, but yes, there are some functions that are so important, and called so often, that they should be hyper-optimized. But, again, if you want to make real-world code, you need to be able to identify when that is the case... and when it is not.

This case? Sure seems like it’s a "not".

My advice is that your time as a programmer is far better spent on writing (and learning how to write) proper tests for functions like this, than on dicking around with hyper-optimizing it by special-casing. That skill—properly testing and validating code—is worth a billion times more than whatever you might gain counting bytes and cycles. You could literally spend years hyper-fixating on making this function as fast as possible in every possible case... all time which will otherwise be effectively wasted. Meanwhile, every minute you spend doing testing and other validation of whatever code you have now will pay off in piles, because not only will the skills themselves be immediately valuable and transferable... you will also have made this function usable in real-world projects that require verified code (even if you never actually use it).

Don’t get nerd sniped by esoteric crap. Focus on reliability and testing. It is and always will be far more important, and useful, to make code that works... provably works... than to make code that is a few nanoseconds faster.

answered Sep 11 at 21:53
  • \$\begingroup\$ Do you have evidence that Stroustrup thought that list will outperform vector? \$\endgroup\$ Commented Sep 11 at 22:33
  • 2
    \$\begingroup\$ @KellyBundy Yes. But how is that in any way relevant to the code review? \$\endgroup\$ Commented Sep 11 at 23:36
  • \$\begingroup\$ CHAR_BIT is allowed to be larger than 8 (unless that changed in C++23 or something). Would it make sense to use uint8_t arrays instead of Char, to make sure you don't do out-of-bounds access with std::array<Char,256>? Or wait, you're using a std::vector< std::pair<Char,Char> > and searching it instead of just indexing. I guess that avoids the array-size problem of potentially-large Char; with CHAR_BIT = 24 on a DSP or something, you could potentially have inputs that needed a lot of vector elements, but only with large strings, not every string like my 2-array strategy. \$\endgroup\$ Commented Sep 12 at 3:03
  • \$\begingroup\$ With an array as a map for uint8_t-sized Chars, one lookup can be indexing, the other a linear search. (Which as you say can go very fast with SIMD if you call memchr to get hand-written asm. But unfortunately there isn't a libc function that takes two chars and searches for either, so no chance to get optimal asm for that without manual intrinsics. The ones that take a string of accept-chars like strpbrk may only special-case short sets with SSE4.2 stricmps, and only work on 0-terminated strings not arrays.) \$\endgroup\$ Commented Sep 12 at 3:30
  • 1
    \$\begingroup\$ @PeterCordes Char ≠ char. The OP's examples show use with UTF-16 code units, and wide characters (which are usually UTF-32 anywhere but Windows). If you want to specialize for char to avoid the dynamic allocation, sure. But you could also just use an allocator to get basically the same speed-up, and that would work for any character type. \$\endgroup\$ Commented Sep 12 at 9:45
7
\$\begingroup\$

For a more modern C++ take, I think we'd likely replace your index-based for loop with a range-based for loop that zips the two views together and destructures each element into chs and chz.

E.g.

    for (const auto& [chs, chz] : std::ranges::zip_view(s, z)) {
        if (!mapping1.contains(chs)) {
            mapping1[chs] = chz;
        }
        if (!mapping2.contains(chz)) {
            mapping2[chz] = chs;
        }
        if (mapping1[chs] != chz || mapping2[chz] != chs) {
            return false;
        }
    }

We might also use the algorithms library so that we don't have to write the short-circuiting semantics of this ourselves. (Not that it was difficult.)

    template <typename Char>
    constexpr bool are_isomorphic(const std::basic_string_view<Char>& s,
                                  const std::basic_string_view<Char>& z) {
        if (s.length() != z.length()) {
            return false;
        }
        std::unordered_map<Char, Char> mapping1;
        std::unordered_map<Char, Char> mapping2;
        const auto zip = std::ranges::zip_view(s, z);
        return std::all_of(
            zip.cbegin(), zip.cend(),
            [&mapping1, &mapping2](const auto& e) {
                const auto chs = std::get<0>(e);
                const auto chz = std::get<1>(e);
                if (!mapping1.contains(chs)) {
                    mapping1[chs] = chz;
                }
                if (!mapping2.contains(chz)) {
                    mapping2[chz] = chs;
                }
                return mapping1[chs] == chz && mapping2[chz] == chs;
            });
    }
answered Sep 11 at 14:24
\$\endgroup\$
1
  • 2
    \$\begingroup\$ std::ranges::all_of() has a simpler interface: return std::ranges::all_of(std::ranges::zip_view(s, z), ...). \$\endgroup\$ Commented Sep 12 at 10:00
4
\$\begingroup\$

The problem statement

There's no definition of what it means for two strings to be "isomorphic" (not in comments, nor even the review request itself), so readers have to deduce which meaning of the word is intended by inspection of the source code. That's not ideal. Saying that it's a translation of code from some other language isn't really helpful unless that other code is well-known amongst the C++ community.

Translation from Java

This code looks like it is Java code that has been syntactically rewritten; it doesn't look like idiomatic C++ code. For example, parallel indexing of the two strings rather than simply iterating a zip-view looks very Java to me.

Interface

Because this is a template, it will bind only to std::basic_string_view arguments, requiring callers to perform that conversion explicitly. It would be more convenient if it were more general, accepting arguments that can both be converted to a common string-view type. Then we could call it with a std::wstring and a wchar_t const *, for example.

Data types

It's quite likely that this will be used for Char = char, where the overhead of std::unordered_map is going to be quite significant compared to a simple std::array for the mapping. Consider specialising the representation for Char types that have a small range. Even for wider types, real-life texts tend to use characters that are clustered in the code-point range (e.g. in Unicode, texts that use characters from more than a couple of Unicode blocks are rare), so you might choose a two-level structure for mapping from characters (probably a worthwhile utility class of its own).

Map interface

The standard map classes have insert() and try_emplace() members that test for existence, which we could use instead of the separate contains():

    if (mapping1.try_emplace(chs, chz).first->second != chz) {
        return false;
    }
    if (mapping2.try_emplace(chz, chs).first->second != chs) {
        return false;
    }

I might consider a small helper to make those firsts and seconds more palatable:

    static auto const insert_or_check =
        [](auto& map, auto key, auto value) -> bool
        {
            auto const [it, _] = map.try_emplace(key, value);
            auto const& actual = it->second;
            return actual == value;
        };
    if (!insert_or_check(mapping1, chs, chz) || !insert_or_check(mapping2, chz, chs)) {
        return false;
    }

Demo program

The demo program doesn't seem to know whether it's a usage example or a set of unit tests. If it's to be an example, I'd expect it to be more flexible, probably comparing two command-line arguments, and have more human-friendly output (even something as simple as std::cout << std::boolalpha). If it's a unit-test, I would expect it to actually confirm whether the results matched expectations, and return appropriate success/fail exit code.

It includes <string> but never uses it, and declares alias names with using that are never used. Also, there's a duplicate using for are_isomorphic.

If it's intended to be a test suite, then there are some cases that haven't been considered: empty strings and unequal-length strings for starters. A more advanced test would exercise non-default character traits on the views (the usual simple example is a traits that compares case-insensitively according to the "C" locale). Note that you'll have to replace the character equality check with one that uses the traits object's comparison method for this to work, and also use that for the maps' comparison parameters.

answered Sep 12 at 8:09
\$\endgroup\$
