I wrote a simple string pool for my compiler in C++ and want to hear your thoughts on its design.
String pool has 2 functions: intern
and internLiteral
. The only difference is that after interning literal, its memory is used. So no allocation happens. If memory for the same string was already allocated, it will be deallocated.
Hence, I have 2 destruction behaviours differentiated by StringType
enum.
I also want my InternedString
s to have interface of const std::string_view
(basically because of size
method).
EDIT:
- Added concept for char allocator
- Strings are always null-terminated
- Added tests (all passed)
- Fixed compilation
EDIT:
- Removed
shared_ptr
andweak_ptr
. Use raw pointer - Now, after interning literals, previously allocated memory for the same non-literal string will be deallocated
- Simplified logic
- Updated test
FAQ:
- How to use
InternedString
?
InternedString
is only meant to do 2 things:
- O(1) equality check (as it is a pointer to
const std::string_view
) - read-only access to string
StringPool
shouldn't be destroyed, while InternedString
exists.
- Is
StringPool
worth it?
Well, it depends.
My use case is a compiler, where I have a lot of repeated strings for keywords, common functions names, etc.
Let's take while
keyword for example.
Null-terminated "while"
string contains 6 bytes.
StringPool
uses additional sizeof(std::string_view)
bytes for it.
After 4 while
keywords we start gaining profit.
And comparing pointers is extremely fast enter image description here
- Why care about literals?
I want to "preallocate" keywords, I already use in my program as literals. But it's a very small optimization, and I may delete it to simplify pool.
#pragma once
#include <string_view>
#include <unordered_map>
/**
* @file StringPool.hpp
* @brief Manage pool of strings.
* Pros:
* - No allocation for repeated strings
* - No allocation for literals
* - Strings have stable addresses -- fast to check for equality
*/
namespace ppl::common
{
/// Reference to string, interned in the pool
using InternedString = const std::string_view *;
namespace concepts
{
/// Concept for char allocator
template<typename T>
concept CharAllocator = requires(T t, char *str, size_t size)
{
{ t.allocate(size) } -> std::same_as<char *>;
{ t.deallocate(str, size) };
};
}
/// Pool of strings
template<concepts::CharAllocator Allocator = std::allocator<char>>
class StringPool
{
public:
/// Allocator, used by pool
Allocator allocator;
/// Perfect forward arguments to allocator
template<typename ...Args>
requires std::is_constructible_v<Allocator, Args...>
StringPool(Args &&... args) noexcept
: allocator(std::forward<Args>(args)...) {}
/// Forbid copy
StringPool(const StringPool &) = delete;
/// Forbid copy
StringPool &operator=(const StringPool &) = delete;
/// Allow move
StringPool(StringPool &&) = default;
/// Allow move
StringPool &operator=(StringPool &&) = default;
/// @brief Intern a string literal on the pool
/// @note Never allocates memory for string
/// @note Deallocates memory for previously interned non-literal string
InternedString internLiteral(const char *literal)
{
std::string_view view{literal};
auto [it, _] = strings.emplace(view, StringType::literal);
if (it->second == StringType::allocated)
{
// We can safely change it to be a view over literal,
// because it's exactly the same string.
//
// ...But I'm not sure
auto &oldView = const_cast<std::string_view &>(it->first);
deallocateViewContent(oldView);
oldView = view;
it->second = StringType::literal;
}
return &it->first;
}
/// @brief Intern a string on the pool
/// @note
/// If string is a literal,
/// use @c internLiteral() instead for memory efficiency
InternedString intern(std::string_view str)
{
if (auto it = strings.find(str); it != strings.end())
{
return &it->first;
}
auto *allocated = allocator.allocate(str.size() + 1);
if (!allocated) { return nullptr; }
std::uninitialized_copy(str.begin(), str.end(), allocated);
allocated[str.size()] = '0円';
return &strings.emplace(
std::string_view{allocated, str.size()},
StringType::allocated
).first->first;
}
/// Deallocate memory
~StringPool()
{
for (auto &[str, type] : strings)
{
if (type == StringType::allocated)
{
deallocateViewContent(str);
}
}
}
private:
/// Deallocate string that is viewed
void deallocateViewContent(std::string_view view)
{
allocator.deallocate(
const_cast<char *>(view.data()),
view.size() + 1
);
}
/// Type of string
enum class StringType
{
literal, // Doesn't need deallocation
allocated // Needs deallocation
};
/// Interned strings
std::unordered_map<
std::string_view,
StringType
> strings;
};
} // namespace ppl::common
Google tests:
#include <gtest/gtest.h>
#include "ppl/common/StringPool.hpp"
template<typename Allocator>
struct TrackAllocator
{
Allocator allocator;
char * allocate(size_t size)
{
++allocations;
return allocator.allocate(size);
}
void deallocate(char *str, size_t size)
{
++deallocations;
allocator.deallocate(str, size);
}
size_t allocations = 0;
size_t deallocations = 0;
};
/// Doesn't allocate memory
struct NullAllocator
{
char * allocate(size_t) { return nullptr; }
void deallocate([[maybe_unused]]char *ptr, size_t)
{
assert(!ptr);
}
};
using namespace ppl::common;
TEST(StringPool, internLiteralFirst)
{
auto pool =
StringPool<TrackAllocator<NullAllocator>> {
TrackAllocator<NullAllocator>{}
};
const char *literal = "Hello";
auto internedLiteral = pool.internLiteral(literal);
ASSERT_EQ(pool.allocator.allocations, 0);
ASSERT_EQ(pool.allocator.deallocations, 0);
ASSERT_EQ(internedLiteral->data(), literal);
auto internedLiteral1 = pool.internLiteral(literal);
ASSERT_EQ(pool.allocator.allocations, 0);
ASSERT_EQ(pool.allocator.deallocations, 0);
ASSERT_EQ(internedLiteral1, internedLiteral);
std::string str = literal;
auto internedStr = pool.intern(str);
ASSERT_EQ(pool.allocator.allocations, 0);
ASSERT_EQ(pool.allocator.deallocations, 0);
ASSERT_EQ(internedStr, internedLiteral);
}
TEST(StringPool, internFirst)
{
TrackAllocator<std::allocator<char>> trackAllocator;
{
StringPool<TrackAllocator<std::allocator<char>> &> pool(trackAllocator);
const char *literal = "Hello";
std::string str = literal;
auto internedStr = pool.intern(str);
ASSERT_EQ(pool.allocator.allocations, 1);
ASSERT_EQ(pool.allocator.deallocations, 0);
ASSERT_EQ(*internedStr, literal);
auto internedLiteral = pool.internLiteral(literal);
ASSERT_EQ(pool.allocator.allocations, 1);
// Note that internedStr was deallocated
ASSERT_EQ(pool.allocator.deallocations, 1);
ASSERT_EQ(internedLiteral, internedStr);
// Note that literal's memory is used instead
ASSERT_EQ(internedLiteral->data(), literal);
}
ASSERT_EQ(trackAllocator.deallocations, 1);
}
TEST(StringPool, allocationFailure)
{
TrackAllocator<NullAllocator> trackAllocator;
{
auto pool =
StringPool<TrackAllocator<NullAllocator> &> {
trackAllocator
};
std::string str = "hello";
auto internedStr = pool.intern(str);
ASSERT_FALSE(internedStr);
ASSERT_EQ(pool.allocator.allocations, 1);
ASSERT_EQ(pool.allocator.deallocations, 0);
}
ASSERT_EQ(trackAllocator.deallocations, 0);
}
P.S: don't know how to handle if InternedString
outlives the pool. I guess, it's not an issue of string pool, but an incorrect usage.
-
2\$\begingroup\$ Incorporating advice from an answer into the question violates the question-and-answer nature of this site. You could post improved code as a new question, as an answer, or as a link to an external site - as described in I improved my code based on the reviews. What next?. I have rolled back the edit, so the answers make sense again. \$\endgroup\$Toby Speight– Toby Speight2022年09月21日 05:51:06 +00:00Commented Sep 21, 2022 at 5:51
4 Answers 4
Make it more generic
You have already written somewhat generic code, as StringPool
is a template. However, you can go further than that. First, consider that std::string
is actually a std::basic_string<char>
. You could add the same template parameters as std::basic_string
has:
template<typename CharT = char,
typename Traits = std::char_traits<CharT>,
typename Allocator = std::allocator<CharT>>
class StringPool
{
const Allocator& allocator;
...
};
Note that the allocator is stored by reference, just like std::basic_string<>
does. This allows multiple pools to use the same allocator object.
Replacing an allocated string with a literal is dangerous
There are potential issues with replacing an allocated string with a literal. While InternedString
is a stable pointer to a std::string_view
stored in the pool, consider that some code might at some point copy a dereferenced InternedString
somewhere. That is of course something it probably shouldn't do, but still.
A more realistic problem is if you have multi-threaded code. In that case, one thread might have dereferenced an allocated string, while another thread is replacing it with a literal.
You could place restrictions on the users of your code, and forbid multi-threaded access, but it's hard to enforce that programmatically. Make sure that these restrictions are documented, and possibly add run-time assertions to the code to catch misuse.
Also consider that pre-C++11, std::string
was copy-on-write, which caused lots of issues, and ultimately led to the standards committee to forbid that.
Don't handle literals at all
I don't think there is much point in letting your pool handle the difference between literals and non-literals. Most compilers will already detect duplicate string literals and will merge them.
If you don't need to deal with literals, all strings will be allocated, and then you can avoid the whole manual allocation dance by storing strings in a std::unordered_set<std::string>
.
Is it worth it?
Consider whether it is worth having a pool at all. How many bytes are saved by deduplicating strings? Conversely, how much memory is now used by all the std::string_view
objects (one in the pool itself, and one copy the caller of intern()
has to keep), and how much is the overhead of the std::unordered_map
?
Also consider that if you are mostly interested in deduplicating and comparing small strings: since C++11, on most platforms std::string
comes with small string optimization (SSO). That means if the string is smaller than almost sizeof(std::string)
, it doesn't perform an allocation, and comparing two such std::string
s is almost as fast as comparing two std::string_view
s. In fact, it is likely to be faster to compare two small std::string
s if they are unequal!
-
1\$\begingroup\$ I check for
it->second == StringType::allocated
to see, if this string was already interned withintern
. There is a testinternFirst
that shows that specific case \$\endgroup\$gavrilikhin.d– gavrilikhin.d2022年09月20日 13:56:50 +00:00Commented Sep 20, 2022 at 13:56 -
\$\begingroup\$
Note that the allocator is stored by reference, just like std::basic_string<> does. This allows multiple pools to use the same allocator object.
... and disallows allocator to have state, I suppose \$\endgroup\$gavrilikhin.d– gavrilikhin.d2022年09月20日 13:58:54 +00:00Commented Sep 20, 2022 at 13:58 -
\$\begingroup\$
You cannot safely deallocate an allocated string and replace it with a literal, as someone might still have an InternedString that points to the old allocated string.
Interned strings are pointers to keys in map. I update key with const_cast (it only changes address, not the data, so it's somewhat safe), hence it must work \$\endgroup\$gavrilikhin.d– gavrilikhin.d2022年09月20日 14:00:01 +00:00Commented Sep 20, 2022 at 14:00 -
1\$\begingroup\$
const
allocators can still havemutable
state. \$\endgroup\$G. Sliepen– G. Sliepen2022年09月20日 14:01:43 +00:00Commented Sep 20, 2022 at 14:01 -
\$\begingroup\$ Fair enough XD Thanks \$\endgroup\$gavrilikhin.d– gavrilikhin.d2022年09月20日 14:02:26 +00:00Commented Sep 20, 2022 at 14:02
Reconsider your design:
- Why do you limit yourself to strings, and only those fitting the specialization string_view?
With a slight adjustment you could use it for any kind of read-only data. Mostly, you would need to make InternedString a full class, best templated like basic_string_view, provide the same for span, and use reference_wrapper for arbitrary literal objects. - Deallocating and replacing interned strings is suicidal.
Thus, don't try. - You only remove an interned string when the pool is destroyed.
Thus, use a custom linear allocator for string-data and map-nodes to avoid overhead, potentially one each. - Do you really want/need your strings to be 0-terminated?
If not, that saves space, and would synergize with the next point. - Depending on your space/time tradeoff, it might make sense to get a list of read-only regions and search it as well as accumulated string-data for a match before adding the string.
- Consider chaining your string-pools, if that makes sense for your application.
Try different combinations (including foregoing it completely) and measure with a real workload.
-
\$\begingroup\$ What do you mean by read-only regions? \$\endgroup\$gavrilikhin.d– gavrilikhin.d2022年09月20日 23:10:04 +00:00Commented Sep 20, 2022 at 23:10
-
\$\begingroup\$ And I just don't need to have pool of anything but utf-8 encoded strings, so making it more generic won't give me anything \$\endgroup\$gavrilikhin.d– gavrilikhin.d2022年09月21日 02:29:52 +00:00Commented Sep 21, 2022 at 2:29
-
\$\begingroup\$ Yes, YAGNI applies. And your program and libraries are most likely mapped ro, and the mappung can be retrieved on most modern OSs, though that is dreadfully implementation specific.. \$\endgroup\$Deduplicator– Deduplicator2022年09月21日 05:01:11 +00:00Commented Sep 21, 2022 at 5:01
-
\$\begingroup\$ @gavrilikhin.d, if you're working with UTF-8, the preferred string type would be
std::u8string
(and its corresponding view). \$\endgroup\$Toby Speight– Toby Speight2022年09月21日 06:27:14 +00:00Commented Sep 21, 2022 at 6:27
Errors, warnings and style
std::uninitialized_copy()
requires <memory>
to be included.
std::size_t
is misspelt throughout.
Some g++ -Weffc++
warnings were easily addressed by adding = {}
initialisers to member variables. This is mostly a limitation of the warning, that doesn't understand default-initialization of aggregate types - but being explicit about the initialisation is always good for readers, so worth fixing.
#pragma once
isn't standard; prefer a portable include-guard.
I dislike the trailing whitespace that's present on many lines. That can be dangerous if you ever continue lines using \
. I configure my editor to delete extraneous whitespace when saving C++ source.
Lifetime and ownership
A pool is generally used as a long-lived resource, so it's not completely unreasonable to specify that references to its contents are valid only during the pool's lifetime.
However, this is exactly the kind of scenario for which std::shared_ptr
was created. That's the way to express ownership when you don't have a clear foreknowledge of which owner will live longest.
And the shared pointer carries around with it the information needed to delete its value, so you can support a mixed pool of literals and allocated strings. That said, I'd advise against that, but instead support layering of pools, so that you can have a constant pool of literals that uses a modifiable pool of allocated strings when an element isn't found.
Using shared pointers even allows use of a weak pointer inside the pool (once there are no external users, the exact pointer value is no longer necessary). This approach should be used with care, as it could lead to frequent re-creation of values (e.g. in your use case of a compiler, we might end up with creation and deletion of "tmp"
in each function that contains that identifier).
String literal handling
The complexity of managing string literal lifetimes (including the additional cognitive load imposed on the users is unlikely to be worth the gain). Just intern those like any other string - the expense of constructing a string object will be incurred only once per string, after all. You might be able to shift the burden to compile-time with suitable constexpr
constructor.
Performance
You should consider that std::unordered_map
might not be the best container for strings, depending on the contents. The hash function can cause lookups to be slower than simple <
comparison as used by std::map
. This is one circumstance where it is useful to pass a template as template parameter, to give the user a choice of implementation strategy.
Generality
This could easily be extended to support other string types (wide, UTF-8, etc). But why stop there? If we free ourselves from std::basic_string_view<>
, this could become an intern table for any kind of object (at least, any regular object that can be hashed and/or sorted). That would increase its value, and decouple it from serving just one particular application.
-
\$\begingroup\$
#pragma once
is de-facto standard as it works withgcc
,clang
and Microsoft's compiler. It's much convenient and safer to use that header guards. So, while I understand your concerns, I'll stick to pragma \$\endgroup\$gavrilikhin.d– gavrilikhin.d2022年09月21日 10:09:28 +00:00Commented Sep 21, 2022 at 10:09 -
2\$\begingroup\$ Yes, it's common, and may one day become a standard. Which advice you heed and which you disregard is entirely your own concern, but I thought I ought to point that out so you have made an informed decision. \$\endgroup\$Toby Speight– Toby Speight2022年09月21日 11:54:14 +00:00Commented Sep 21, 2022 at 11:54
Applied changes
- Completely removed special handling of literals.
- Use
unordered_set
as an underlying container
Handling of literals overcomplicated my pool without much gain.
A lot of suggested changes were connected with this problem.
What I considered
Make it more generic.
This StringPool is generic enough for me, at least for now. I will generalise it once there is a need.
Use header guards instead of #pragma once
#pragma once
is supported by all major compilers and is much safer to use than header guards
Code
#pragma once
/**
* @file StringPool.hpp
* @brief Manage pool of strings.
* Pros:
* - No allocation for repeated strings
* - Strings have stable addresses -- fast to check for equality
*/
#include <string_view>
#include <unordered_set>
#include <memory> // uninitialized_copy
namespace ppl::common
{
/// Pointer to view of string, interned on the pool
using InternedString = const std::string_view *;
namespace concepts
{
/// Concept for char allocator
template<typename T>
concept CharAllocator = requires(T t, char *str, size_t size)
{
{ t.allocate(size) } -> std::same_as<char *>;
{ t.deallocate(str, size) };
};
}
/// Pool of strings.
/// @warning isn't thread safe
template<concepts::CharAllocator Allocator = std::allocator<char>>
class StringPool
{
public:
/// Allocator, used by pool
Allocator allocator;
/// Perfect forward arguments to allocator
template<typename ...Args>
requires std::is_constructible_v<Allocator, Args...>
StringPool(Args &&... args) noexcept
: allocator(std::forward<Args>(args)...) {}
/// Forbid copy
StringPool(const StringPool &) = delete;
/// Forbid copy
StringPool &operator=(const StringPool &) = delete;
/// Allow move
StringPool(StringPool &&) = default;
/// Allow move
StringPool &operator=(StringPool &&) = default;
/// Intern a string on the pool.
/// @warning isn't thread safe
InternedString intern(std::string_view str)
{
if (auto it = strings.find(str); it != strings.end())
{
return &*it;
}
auto *allocated = allocator.allocate(str.size() + 1);
if (!allocated) { return nullptr; }
std::uninitialized_copy(str.begin(), str.end(), allocated);
allocated[str.size()] = '0円';
return &*strings.emplace(allocated, str.size()).first;
}
/// Deallocate memory
~StringPool()
{
for (auto &str : strings)
{
deallocateViewContent(str);
}
}
private:
/// Deallocate string that is viewed
void deallocateViewContent(std::string_view view)
{
allocator.deallocate(
const_cast<char *>(view.data()),
view.size() + 1
);
}
/// Interned strings
std::unordered_set<std::string_view> strings;
};
} // namespace ppl::common
```
-
1\$\begingroup\$ Just .insert() and handle success by fixing up the new element. Also use allocator_traits instead of naked allocator. Btw: If you used the right allocator, special-casing true constants would be simply omitting the copy, as no manual deletes would happen at all. \$\endgroup\$Deduplicator– Deduplicator2022年09月21日 15:17:19 +00:00Commented Sep 21, 2022 at 15:17
Explore related questions
See similar questions with these tags.