i'm currently trying to rewrite sql server newSequentialId() function in pure c# that is performing as fast as possible and eats minimum memory.
Question is can I it be done better - current benchamark shows it is at most 2x slower than native function used by SqlServer (p/invoke)?
using System;
using System.Buffers;
using System.Linq;
using System.Net.NetworkInformation;
using System.Threading;
namespace PerfBenchmarkDotNet
{
internal static class FastGuid
{
private static readonly byte[] macBytes;
private static long baseDateTicks;
private static long ClockSequenceNumber = 0;
static FastGuid()
{
macBytes = GetDefaultMacAdress();
baseDateTicks = new DateTime(1582, 10, 15, 0, 0, 0, DateTimeKind.Utc).Ticks;
}
public static Guid NewGuid()
{
var guidArray = ArrayPool<byte>.Shared.Rent(16);
var nowTicks = DateTime.UtcNow.Ticks;
var sequenceNumber = Interlocked.Increment(ref ClockSequenceNumber) % 65536;
var ticksDiff = nowTicks - baseDateTicks;
var ticksDiffBytes = BitConverter.GetBytes(ticksDiff);
// Copy the bytes into the guid
// mac
guidArray[15] = macBytes[5];
guidArray[14] = macBytes[4];
guidArray[13] = macBytes[3];
guidArray[12] = macBytes[2];
guidArray[11] = macBytes[1];
guidArray[10] = macBytes[0];
// sequence number
guidArray[8] = (byte)(sequenceNumber >> 8);
guidArray[9] = (byte)(sequenceNumber);
// time
guidArray[7] = ticksDiffBytes[0];
guidArray[6] = ticksDiffBytes[1];
guidArray[5] = ticksDiffBytes[2];
guidArray[4] = ticksDiffBytes[3];
guidArray[3] = ticksDiffBytes[4];
guidArray[2] = ticksDiffBytes[5];
guidArray[1] = ticksDiffBytes[6];
guidArray[0] = ticksDiffBytes[7];
var guid = new Guid(guidArray);
ArrayPool<byte>.Shared.Return(guidArray);
return guid;
}
private static byte[] GetDefaultMacAdress()
{
var nic = NetworkInterface.GetAllNetworkInterfaces().FirstOrDefault();
if (nic != null)
{
return nic.GetPhysicalAddress().GetAddressBytes();
}
var fallback = Guid.NewGuid().ToByteArray();
return fallback.Take(6).ToArray();
}
}
}
-
1\$\begingroup\$ Did you try to benchmark against this library? \$\endgroup\$Gert Arnold– Gert Arnold2020年01月01日 21:46:26 +00:00Commented Jan 1, 2020 at 21:46
-
\$\begingroup\$ Yup, newid is very fast and it looks like a have to do better ;) \$\endgroup\$DevOnBike– DevOnBike2020年01月02日 09:34:19 +00:00Commented Jan 2, 2020 at 9:34
2 Answers 2
You want to reduce both computation time and memory allocation. Memory allocation itself takes time, so that's a good place to start -- try to find all the places where you're allocating memory and see if they can be avoided.
ArrayPool<byte>.Shared.Rent(16);
can avoid an allocation, but not always. Can we do better? Turns out we can; we'll see how.
BitConverter.GetBytes(ticksDiff)
makes an allocation. ticksDiff
is just an integer so it's quite easy to extract its binary representation using binary operators. We'll see how we can use this later on.
But first:
Profiling
Profilers are your friend! You should always start the optimization journey by asking "Where are the bottlenecks?", which you can promptly answer with cold hard data from a profiling tool.
As suspected, dotTrace reports that the ArrayPool<byte>.Shared.Rent/Return
and BitConverter.GetBytes(ticksDiff)
calls take a significant chunk (~21%) of the total running time. Let's get rid of those.
Guid constructor
You're using the Guid
constructor that takes a byte[]
, which has perpetuated an implementation that necessarily allocates a byte array somewhere. You can get clever with this to allocate on the stack instead of the heap (Span<byte>
with stackalloc
) or to minimize the number of allocations that actually need to happen (ThreadLocal<byte[]>
), but the best solution is to not allocate anything at all. To accomplish this, you can use one of the other Guid
constructors that takes its values directly, instead of taking a byte array.
There's a constructor overload that takes an int
, two short
s, and eight byte
s. You can pack these values directly (see the reference source for endianness on the int
and short
s):
int a = ticksDiffBytes[7] | (ticksDiffBytes[6] << 8) | (ticksDiffBytes[5] << 16) | (ticksDiffBytes[4] << 24);
short b = (short)(ticksDiffBytes[3] | (ticksDiffBytes[2] << 8));
short c = (short)(ticksDiffBytes[1] | (ticksDiffBytes[0] << 8));
var guid = new Guid(
a,
b,
c,
(byte)(sequenceNumber >> 8),
(byte)(sequenceNumber),
macBytes[0],
macBytes[1],
macBytes[2],
macBytes[3],
macBytes[4],
macBytes[5]);
Binary stuff
You can extract bits directly from ticksDiff
to avoid the call to BitConverter.GetBytes
:
var ticksDiff = nowTicks - baseDateTicks;
int a = (int)(
((ticksDiff >> 56) & 0xFF) |
((ticksDiff & 0x00FF000000000000) >> 40) |
((ticksDiff & 0x0000FF0000000000) >> 24) |
((ticksDiff & 0x000000FF00000000) >> 8));
short b = (short)(
((ticksDiff & 0x00000000FF000000) >> 24) |
((ticksDiff & 0x0000000000FF0000) >> 8));
short c = (short)(
((ticksDiff & 0x000000000000FF00) >> 8) |
((ticksDiff & 0x00000000000000FF) << 8));
If you can tolerate a change in endianness, the following is a lot cleaner to read (but provides no measurable change in runtime for me):
var ticksDiff = nowTicks - baseDateTicks;
int a = (int)(ticksDiff);
short b = (short)(ticksDiff >> 32);
short c = (short)(ticksDiff >> 48);
Remainder
You don't need to do a remainder when computing sequenceNumber
because we're extracting the remainder bits directly anyhow when we pass (byte)(sequenceNumber >> 8), (byte)(sequenceNumber)
to the Guid
constructor.
My numbers below make it hard to conclude if this helps at all. This is the kind of thing that might be optimized away by the jitter anyway.
Benchmarks
With these changes, my benchmarks look as follows:
| Method | Mean | Error | StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------------------------------------ |---------:|--------:|--------:|------:|-------:|------:|------:|----------:|
| NewGuid_Original | 224.2 ns | 4.37 ns | 4.49 ns | 1.00 | 0.0041 | - | - | 20 B |
| NewGuid_WithPackedValues | 153.6 ns | 2.51 ns | 2.35 ns | 1.00 | 0.0041 | - | - | 20 B |
| NewGuid_WithPackedValuesAndNoBitConverter | 148.6 ns | 2.43 ns | 2.28 ns | 1.00 | - | - | - | - |
| NewGuid_WithReorderedBytes | 148.9 ns | 2.92 ns | 3.00 ns | 1.00 | - | - | - | - |
| NewGuid_WithReorderedBytesAndNoRemainder | 146.4 ns | 2.11 ns | 1.98 ns | 1.00 | - | - | - | - |
Final code:
public static Guid NewGuid_WithBinaryOptimizationsAndWithoutRemainder()
{
var nowTicks = DateTime.UtcNow.Ticks;
var sequenceNumber = Interlocked.Increment(ref ClockSequenceNumber);
var ticksDiff = nowTicks - baseDateTicks;
int a = (int)(ticksDiff);
short b = (short)(ticksDiff >> 32);
short c = (short)(ticksDiff >> 48);
var guid = new Guid(
a,
b,
c,
(byte)(sequenceNumber >> 8),
(byte)(sequenceNumber),
macBytes[0],
macBytes[1],
macBytes[2],
macBytes[3],
macBytes[4],
macBytes[5]);
return guid;
}
-
\$\begingroup\$ how did u format benchmark results? \$\endgroup\$DevOnBike– DevOnBike2020年01月02日 21:43:27 +00:00Commented Jan 2, 2020 at 21:43
-
\$\begingroup\$ I used BenchmarkDotNet which does the table formatting by default. \$\endgroup\$Jeff– Jeff2020年01月03日 14:28:37 +00:00Commented Jan 3, 2020 at 14:28
So after updating my source code it looks that it performs fast enough and allocates no memory. Here are benchmark for comparing the most popular methods of generating GUIDs.
Runtime=.NET Core 3.0 Force=True Server=True
| Method | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated | |------------------------------------ |----------:|---------:|---------:|------:|--------:|-------:|------:|------:|----------:| | SqlServerNewSequentialGuid | 38.69 ns | 0.787 ns | 1.643 ns | 1.00 | 0.00 | 0.0009 | - | - | 80 B | | Guid_Standard | 61.44 ns | 0.665 ns | 0.622 ns | 1.60 | 0.06 | - | - | - | - | | NHibernate_GuidCombGenerator | 160.22 ns | 3.070 ns | 2.721 ns | 4.16 | 0.20 | 0.0012 | - | - | 104 B | | Guid_Comb_New | 291.56 ns | 5.421 ns | 4.805 ns | 7.57 | 0.37 | 0.0010 | - | - | 104 B | | EFCore_SequentialGuidValueGenerator | 82.04 ns | 1.623 ns | 1.667 ns | 2.13 | 0.11 | 0.0007 | - | - | 72 B | | NewId_Lib | 78.93 ns | 1.150 ns | 1.076 ns | 2.05 | 0.09 | - | - | - | - | | NewSequentialGuid_PureNetCore | 70.28 ns | 1.332 ns | 1.481 ns | 1.81 | 0.08 | - | - | - | - |
Full source code in c# .net core 2.2+:
using System;
using System.Buffers;
using System.Linq;
using System.Net.NetworkInformation;
using System.Threading;
namespace PerfBenchmarkDotNet
{
internal readonly struct FastGuid
{
private static readonly byte[] _macBytes;
private static long _baseDateTicks;
private static long _clockSequenceNumber;
static FastGuid()
{
_clockSequenceNumber = 0;
_macBytes = GetDefaultMacAdress();
_baseDateTicks = new DateTime(1582, 10, 15, 0, 0, 0, DateTimeKind.Utc).Ticks;
}
public static Guid NewGuid()
{
var nowTicks = DateTime.UtcNow.Ticks;
var sequenceNumber = Interlocked.Increment(ref _clockSequenceNumber);
var ticksDiff = nowTicks - _baseDateTicks;
var a = (int)(ticksDiff);
short b = (short)(ticksDiff >> 32);
short c = (short)(ticksDiff >> 48);
return new Guid(
a,
b,
c,
(byte)(sequenceNumber >> 8),
(byte)(sequenceNumber),
_macBytes[0],
_macBytes[1],
_macBytes[2],
_macBytes[3],
_macBytes[4],
_macBytes[5]);
}
private static byte[] GetDefaultMacAdress()
{
var nic = NetworkInterface.GetAllNetworkInterfaces().FirstOrDefault();
if (nic != null)
{
return nic.GetPhysicalAddress().GetAddressBytes();
}
var fallback = Guid.NewGuid().ToByteArray();
return fallback.Take(6).ToArray();
}
}
}
-
\$\begingroup\$ I'm curious as to how you profiled SqlServerNewSequentialGuid (
NEWSEQUENTIALID()
) since that can only be used in a default constraint expression, AFAIK. \$\endgroup\$Dan Guzman– Dan Guzman2020年02月17日 14:00:17 +00:00Commented Feb 17, 2020 at 14:00