I've written a bytes-to-number converter for C++, similar to Python's int.from_bytes.
It does not make any assumptions about:
- The endianness of the input buffer
- The endianness of the system
- The endianness consistency of the system (i.e. no assumption that all numeric types share the same endianness)
Additionally, the source buffer can be smaller than the target type.
#include <cstdint>
#include <cstring>
#include <span>
#include <stdexcept>

namespace Binary
{
    // Used to indicate the endianness of a type.
    // Better readability than a simple bool.
    enum class Endianness
    {
        Little,
        Big,
    };

    namespace Int
    {
        // Base is the underlying type that will be used to store the eventual result.
        template< typename Base >
        Base FromBytes( std::span< std::uint8_t > inBuffer, Binary::Endianness inEndianness )
        {
            // Check if we have enough space
            if( sizeof( Base ) < inBuffer.size() )
            {
                throw std::length_error( "unable to fit data in requested type" );
            }

            // Determine the endianness of the Base type on this system
            union {
                Base i;
                char c[ sizeof( Base ) ];
            } test = { .i = 1 };
            bool tmp = test.c[ sizeof( Base ) - 1 ] == '1円';
            Binary::Endianness type_endianness = tmp ? Binary::Endianness::Big : Binary::Endianness::Little;

            // Always initialize your variables
            Base result = Base{ 0 };
            std::uint8_t * destination = reinterpret_cast< std::uint8_t * >( &result );

            // If the type is big endian and the buffer is smaller, the first bytes
            // should remain zero, meaning we should start filling further back
            if( type_endianness == Binary::Endianness::Big ) { destination += sizeof( Base ) - inBuffer.size(); }

            if( type_endianness == inEndianness )
            {
                // equal endianness => simple copy
                std::memcpy( destination, inBuffer.data(), inBuffer.size() );
            }
            else
            {
                // differing endianness => copy in reverse
                for( std::size_t i = 0; i < inBuffer.size(); i += 1 )
                {
                    destination[ i ] = inBuffer[ inBuffer.size() - i - 1 ];
                }
            }

            return result;
        }
    }
}
Are there any improvements I could make?
1 Answer
I like the enum, though note that it duplicates the std::endian enum that C++20 already defines in <bit>.
We're missing some review context, such as the range of use cases we're trying to address. This submission contains no unit tests or other demo code, and it's unclear whether the caller is likely to call into this repeatedly through an array loop. Perhaps "exotic" 24-bit quantities motivated this code.
use constexpr

Every call to FromBytes asks about native endianness. It's a nice piece of code, very clear, but we should cache the result as it won't change. Or better, find the result at compile time. Even better, dispense with it and rely on std::endian::native. Or write down the underlying requirements so we understand what systems are being targeted and what language / compiler variants are within scope.
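Since the code already relies on std::span, C++20 appears to be available, so the runtime union test could be replaced with something along these lines (a sketch; NativeEndianness is a name I made up for illustration):

#include <bit>

// Compile-time equivalent of the union/tmp test, built on std::endian.
constexpr Binary::Endianness NativeEndianness()
{
    return std::endian::native == std::endian::big ? Binary::Endianness::Big
                                                   : Binary::Endianness::Little;
}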
nit: Assigning tmp = test.c[ 0 ] == '1円'; seems a little more convenient, but perhaps you wanted the order of the ternary to be "big, little" for aesthetic reasons. Consider renaming tmp to is_big.
Delete uninformative comments like "Always initialize your variables".
I am glad to see that .size() is the first thing we check. Good.
nit: I find the second form slightly easier to read.

... inBuffer[ inBuffer.size() - i - 1 ]
... inBuffer[ inBuffer.size() - 1 - i ]

Why? I read it as "constant minus i", where I can reason about the constant's location independent of the loop. Rather than having to puzzle out the numeric relationship between size and i, I observe that the constant is always positive, and then decrement to get to the zero-th element.
document language version

std::reverse_copy has been available since C++98 (and constexpr since C++20), yet we choose not to use it; it's unclear why. Using a standard routine might yield performance identical to a naïve loop, but maybe a vendor went to some trouble to vectorize it for your target processor, and it will go quicker. Use it for the same reason you prefer memcpy().
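For example, the two copy branches inside FromBytes could collapse to something like this (a sketch reusing the destination pointer that was already set up):

#include <algorithm>   // std::copy, std::reverse_copy

if( type_endianness == inEndianness )
{
    // same endianness => forward copy
    std::copy( inBuffer.begin(), inBuffer.end(), destination );
}
else
{
    // differing endianness => reverse copy
    std::reverse_copy( inBuffer.begin(), inBuffer.end(), destination );
}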
Tell us which language variants are acceptable for this project, and write it down so future maintainers will know the rules.
I was slightly surprised not to see ntohl mentioned, nor be64toh, but maybe {3, 5, 6, 7}-byte inBuffers are just as important for the use case we're addressing. Those standard functions should absolutely appear in your automated test suite.
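A cross-check of that sort might look like this (a sketch; be64toh is glibc/BSD, not standard C++, so treat <endian.h> as an assumption about the platform):

#include <endian.h>   // be64toh -- glibc/BSD, not standard C++
#include <cstring>

// The big-endian interpretation of eight bytes should agree with be64toh.
bool MatchesBe64toh( std::span< std::uint8_t > bytes )   // expects bytes.size() == 8
{
    std::uint64_t raw = 0;
    std::memcpy( &raw, bytes.data(), sizeof raw );
    return Binary::Int::FromBytes< std::uint64_t >( bytes, Binary::Endianness::Big )
        == be64toh( raw );
}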
document the sign bit

The brief review context explicitly says that we implement a well-known interface, and we do support a corresponding endianness parameter. But the interpretation of the source sign bit is left implicit; presumably it matches the Base type. This should be explicitly written down.
(Also, "Base" seems a slightly odd name as it suggests "Radix" which isn't relevant here. Maybe "NumericType"?)
API design
Imagine the caller is assigning a great many 16-bit integers via an array loop.
I worry about whether we're giving the compiler a good chance at vectorizing such a bulk assignment. Consider offering a second method that moves K integers of identical size.
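Such a method might look roughly like this if added next to FromBytes in the Int namespace (a sketch; FromBytesBulk is a name I made up, and it still funnels through FromBytes rather than being hand-vectorized):

// Hypothetical bulk variant: converts outValues.size() integers packed
// back-to-back in inBuffer, each sizeof( Base ) bytes wide.
template< typename Base >
void FromBytesBulk( std::span< std::uint8_t > inBuffer,
                    std::span< Base > outValues,
                    Binary::Endianness inEndianness )
{
    if( inBuffer.size() != outValues.size() * sizeof( Base ) )
    {
        throw std::length_error( "buffer / output size mismatch" );
    }
    for( std::size_t k = 0; k < outValues.size(); ++k )
    {
        outValues[ k ] = FromBytes< Base >(
            inBuffer.subspan( k * sizeof( Base ), sizeof( Base ) ), inEndianness );
    }
}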
Imagine the input numbers are already in native order and they have a standard {16, 32, 64}-bit size.
Would it help your use case if this routine were given the flexibility to report a no-op by returning the address of the input buffer, avoiding the memcpy()?
This code achieves its design goals.
Before merging it to main it should be better documented and accompanied by unit tests.
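A minimal starting point (plain asserts rather than a test framework; the expected values follow from the code posted above, assuming it is visible via a header):

#include <cassert>

int main()
{
    std::uint8_t bytes[] = { 0x12, 0x34 };
    assert( Binary::Int::FromBytes< std::uint32_t >( bytes, Binary::Endianness::Big ) == 0x1234u );
    assert( Binary::Int::FromBytes< std::uint32_t >( bytes, Binary::Endianness::Little ) == 0x3412u );
}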
I would be happy to delegate or accept maintenance tasks on this codebase.