NewSequentialId - pure net core 2.2 implementation of sql server function

Question 1

i'm currently trying to rewrite sql server newSequentialId() function in pure c# that is performing as fast as possible and eats minimum memory.

Question is can I it be done better - current benchamark shows it is at most 2x slower than native function used by SqlServer (p/invoke)?

using System;
using System.Buffers;
using System.Linq;
using System.Net.NetworkInformation;
using System.Threading;
namespace PerfBenchmarkDotNet
{
 internal static class FastGuid
 {
 private static readonly byte[] macBytes;
 private static long baseDateTicks;
 private static long ClockSequenceNumber = 0;
 static FastGuid()
 {
 macBytes = GetDefaultMacAdress();
 baseDateTicks = new DateTime(1582, 10, 15, 0, 0, 0, DateTimeKind.Utc).Ticks;
 }
 public static Guid NewGuid()
 {
 var guidArray = ArrayPool<byte>.Shared.Rent(16);
 var nowTicks = DateTime.UtcNow.Ticks;
 var sequenceNumber = Interlocked.Increment(ref ClockSequenceNumber) % 65536;
 var ticksDiff = nowTicks - baseDateTicks;
 var ticksDiffBytes = BitConverter.GetBytes(ticksDiff);
 // Copy the bytes into the guid 
 // mac
 guidArray[15] = macBytes[5];
 guidArray[14] = macBytes[4];
 guidArray[13] = macBytes[3];
 guidArray[12] = macBytes[2];
 guidArray[11] = macBytes[1];
 guidArray[10] = macBytes[0];
 // sequence number
 guidArray[8] = (byte)(sequenceNumber >> 8);
 guidArray[9] = (byte)(sequenceNumber);
 // time
 guidArray[7] = ticksDiffBytes[0];
 guidArray[6] = ticksDiffBytes[1];
 guidArray[5] = ticksDiffBytes[2];
 guidArray[4] = ticksDiffBytes[3];
 guidArray[3] = ticksDiffBytes[4];
 guidArray[2] = ticksDiffBytes[5];
 guidArray[1] = ticksDiffBytes[6];
 guidArray[0] = ticksDiffBytes[7];
 var guid = new Guid(guidArray);
 ArrayPool<byte>.Shared.Return(guidArray);
 return guid;
 }
 private static byte[] GetDefaultMacAdress()
 {
 var nic = NetworkInterface.GetAllNetworkInterfaces().FirstOrDefault();
 if (nic != null)
 {
 return nic.GetPhysicalAddress().GetAddressBytes();
 }
 var fallback = Guid.NewGuid().ToByteArray();
 return fallback.Take(6).ToArray();
 }
 }
}

Question 2

Did you try to benchmark against this library?

Question 3

Yup, newid is very fast and it looks like a have to do better ;)

Question 4

You want to reduce both computation time and memory allocation. Memory allocation itself takes time, so that's a good place to start -- try to find all the places where you're allocating memory and see if they can be avoided.

ArrayPool<byte>.Shared.Rent(16); can avoid an allocation, but not always. Can we do better? Turns out we can; we'll see how.

BitConverter.GetBytes(ticksDiff) makes an allocation. ticksDiff is just an integer so it's quite easy to extract its binary representation using binary operators. We'll see how we can use this later on.

But first:

Profiling

Profilers are your friend! You should always start the optimization journey by asking "Where are the bottlenecks?", which you can promptly answer with cold hard data from a profiling tool.

As suspected, dotTrace reports that the ArrayPool<byte>.Shared.Rent/Return and BitConverter.GetBytes(ticksDiff) calls take a significant chunk (~21%) of the total running time. Let's get rid of those.

Guid constructor

You're using the Guid constructor that takes a byte[], which has perpetuated an implementation that necessarily allocates a byte array somewhere. You can get clever with this to allocate on the stack instead of the heap (Span<byte> with stackalloc) or to minimize the number of allocations that actually need to happen (ThreadLocal<byte[]>), but the best solution is to not allocate anything at all. To accomplish this, you can use one of the other Guid constructors that takes its values directly, instead of taking a byte array.

There's a constructor overload that takes an int, two shorts, and eight bytes. You can pack these values directly (see the reference source for endianness on the int and shorts):

int a = ticksDiffBytes[7] | (ticksDiffBytes[6] << 8) | (ticksDiffBytes[5] << 16) | (ticksDiffBytes[4] << 24);
short b = (short)(ticksDiffBytes[3] | (ticksDiffBytes[2] << 8));
short c = (short)(ticksDiffBytes[1] | (ticksDiffBytes[0] << 8));
var guid = new Guid(
 a,
 b,
 c,
 (byte)(sequenceNumber >> 8),
 (byte)(sequenceNumber),
 macBytes[0],
 macBytes[1],
 macBytes[2],
 macBytes[3],
 macBytes[4],
 macBytes[5]);

Binary stuff

You can extract bits directly from ticksDiff to avoid the call to BitConverter.GetBytes:

var ticksDiff = nowTicks - baseDateTicks;
int a = (int)(
 ((ticksDiff >> 56) & 0xFF) |
 ((ticksDiff & 0x00FF000000000000) >> 40) |
 ((ticksDiff & 0x0000FF0000000000) >> 24) |
 ((ticksDiff & 0x000000FF00000000) >> 8));
short b = (short)(
 ((ticksDiff & 0x00000000FF000000) >> 24) |
 ((ticksDiff & 0x0000000000FF0000) >> 8));
short c = (short)(
 ((ticksDiff & 0x000000000000FF00) >> 8) |
 ((ticksDiff & 0x00000000000000FF) << 8));

If you can tolerate a change in endianness, the following is a lot cleaner to read (but provides no measurable change in runtime for me):

var ticksDiff = nowTicks - baseDateTicks;
int a = (int)(ticksDiff);
short b = (short)(ticksDiff >> 32);
short c = (short)(ticksDiff >> 48);

Remainder

You don't need to do a remainder when computing sequenceNumber because we're extracting the remainder bits directly anyhow when we pass (byte)(sequenceNumber >> 8), (byte)(sequenceNumber) to the Guid constructor.

My numbers below make it hard to conclude if this helps at all. This is the kind of thing that might be optimized away by the jitter anyway.

Benchmarks

With these changes, my benchmarks look as follows:

| Method | Mean | Error | StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------------------------------------ |---------:|--------:|--------:|------:|-------:|------:|------:|----------:|
| NewGuid_Original | 224.2 ns | 4.37 ns | 4.49 ns | 1.00 | 0.0041 | - | - | 20 B |
| NewGuid_WithPackedValues | 153.6 ns | 2.51 ns | 2.35 ns | 1.00 | 0.0041 | - | - | 20 B |
| NewGuid_WithPackedValuesAndNoBitConverter | 148.6 ns | 2.43 ns | 2.28 ns | 1.00 | - | - | - | - |
| NewGuid_WithReorderedBytes | 148.9 ns | 2.92 ns | 3.00 ns | 1.00 | - | - | - | - |
| NewGuid_WithReorderedBytesAndNoRemainder | 146.4 ns | 2.11 ns | 1.98 ns | 1.00 | - | - | - | - |

Final code:

public static Guid NewGuid_WithBinaryOptimizationsAndWithoutRemainder()
{
 var nowTicks = DateTime.UtcNow.Ticks;
 var sequenceNumber = Interlocked.Increment(ref ClockSequenceNumber);
 var ticksDiff = nowTicks - baseDateTicks;
 int a = (int)(ticksDiff);
 short b = (short)(ticksDiff >> 32);
 short c = (short)(ticksDiff >> 48);
 var guid = new Guid(
 a,
 b,
 c,
 (byte)(sequenceNumber >> 8),
 (byte)(sequenceNumber),
 macBytes[0],
 macBytes[1],
 macBytes[2],
 macBytes[3],
 macBytes[4],
 macBytes[5]);
 return guid;
}

Question 5

how did u format benchmark results?

Question 6

I used BenchmarkDotNet which does the table formatting by default.

Question 7

So after updating my source code it looks that it performs fast enough and allocates no memory. Here are benchmark for comparing the most popular methods of generating GUIDs.

Runtime=.NET Core 3.0 Force=True Server=True

| Method | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------------------------------ |----------:|---------:|---------:|------:|--------:|-------:|------:|------:|----------:|
| SqlServerNewSequentialGuid | 38.69 ns | 0.787 ns | 1.643 ns | 1.00 | 0.00 | 0.0009 | - | - | 80 B |
| Guid_Standard | 61.44 ns | 0.665 ns | 0.622 ns | 1.60 | 0.06 | - | - | - | - |
| NHibernate_GuidCombGenerator | 160.22 ns | 3.070 ns | 2.721 ns | 4.16 | 0.20 | 0.0012 | - | - | 104 B |
| Guid_Comb_New | 291.56 ns | 5.421 ns | 4.805 ns | 7.57 | 0.37 | 0.0010 | - | - | 104 B |
| EFCore_SequentialGuidValueGenerator | 82.04 ns | 1.623 ns | 1.667 ns | 2.13 | 0.11 | 0.0007 | - | - | 72 B |
| NewId_Lib | 78.93 ns | 1.150 ns | 1.076 ns | 2.05 | 0.09 | - | - | - | - |
| NewSequentialGuid_PureNetCore | 70.28 ns | 1.332 ns | 1.481 ns | 1.81 | 0.08 | - | - | - | - |

Full source code in c# .net core 2.2+:

using System;
using System.Buffers;
using System.Linq;
using System.Net.NetworkInformation;
using System.Threading;
namespace PerfBenchmarkDotNet
{
 internal readonly struct FastGuid
 {
 private static readonly byte[] _macBytes;
 private static long _baseDateTicks;
 private static long _clockSequenceNumber;
 static FastGuid()
 {
 _clockSequenceNumber = 0;
 _macBytes = GetDefaultMacAdress();
 _baseDateTicks = new DateTime(1582, 10, 15, 0, 0, 0, DateTimeKind.Utc).Ticks;
 }
 public static Guid NewGuid()
 {
 var nowTicks = DateTime.UtcNow.Ticks;
 var sequenceNumber = Interlocked.Increment(ref _clockSequenceNumber);
 var ticksDiff = nowTicks - _baseDateTicks;
 var a = (int)(ticksDiff);
 short b = (short)(ticksDiff >> 32);
 short c = (short)(ticksDiff >> 48);
 return new Guid(
 a,
 b,
 c,
 (byte)(sequenceNumber >> 8),
 (byte)(sequenceNumber),
 _macBytes[0],
 _macBytes[1],
 _macBytes[2],
 _macBytes[3],
 _macBytes[4],
 _macBytes[5]);
 }
 private static byte[] GetDefaultMacAdress()
 {
 var nic = NetworkInterface.GetAllNetworkInterfaces().FirstOrDefault();
 if (nic != null)
 {
 return nic.GetPhysicalAddress().GetAddressBytes();
 }
 var fallback = Guid.NewGuid().ToByteArray();
 return fallback.Take(6).ToArray();
 }
 }
}

Question 8

I'm curious as to how you profiled SqlServerNewSequentialGuid (NEWSEQUENTIALID()) since that can only be used in a default constraint expression, AFAIK.

Jeff Jeff 8604 silver badges7 bronze badges · Answer 1 · 2020-01-02 17:36:46Z

You want to reduce both computation time and memory allocation. Memory allocation itself takes time, so that's a good place to start -- try to find all the places where you're allocating memory and see if they can be avoided.

ArrayPool<byte>.Shared.Rent(16); can avoid an allocation, but not always. Can we do better? Turns out we can; we'll see how.

BitConverter.GetBytes(ticksDiff) makes an allocation. ticksDiff is just an integer so it's quite easy to extract its binary representation using binary operators. We'll see how we can use this later on.

But first:

Profiling

Profilers are your friend! You should always start the optimization journey by asking "Where are the bottlenecks?", which you can promptly answer with cold hard data from a profiling tool.

As suspected, dotTrace reports that the ArrayPool<byte>.Shared.Rent/Return and BitConverter.GetBytes(ticksDiff) calls take a significant chunk (~21%) of the total running time. Let's get rid of those.

Guid constructor

You're using the Guid constructor that takes a byte[], which has perpetuated an implementation that necessarily allocates a byte array somewhere. You can get clever with this to allocate on the stack instead of the heap (Span<byte> with stackalloc) or to minimize the number of allocations that actually need to happen (ThreadLocal<byte[]>), but the best solution is to not allocate anything at all. To accomplish this, you can use one of the other Guid constructors that takes its values directly, instead of taking a byte array.

There's a constructor overload that takes an int, two shorts, and eight bytes. You can pack these values directly (see the reference source for endianness on the int and shorts):

int a = ticksDiffBytes[7] | (ticksDiffBytes[6] << 8) | (ticksDiffBytes[5] << 16) | (ticksDiffBytes[4] << 24);
short b = (short)(ticksDiffBytes[3] | (ticksDiffBytes[2] << 8));
short c = (short)(ticksDiffBytes[1] | (ticksDiffBytes[0] << 8));
var guid = new Guid(
 a,
 b,
 c,
 (byte)(sequenceNumber >> 8),
 (byte)(sequenceNumber),
 macBytes[0],
 macBytes[1],
 macBytes[2],
 macBytes[3],
 macBytes[4],
 macBytes[5]);

Binary stuff

You can extract bits directly from ticksDiff to avoid the call to BitConverter.GetBytes:

var ticksDiff = nowTicks - baseDateTicks;
int a = (int)(
 ((ticksDiff >> 56) & 0xFF) |
 ((ticksDiff & 0x00FF000000000000) >> 40) |
 ((ticksDiff & 0x0000FF0000000000) >> 24) |
 ((ticksDiff & 0x000000FF00000000) >> 8));
short b = (short)(
 ((ticksDiff & 0x00000000FF000000) >> 24) |
 ((ticksDiff & 0x0000000000FF0000) >> 8));
short c = (short)(
 ((ticksDiff & 0x000000000000FF00) >> 8) |
 ((ticksDiff & 0x00000000000000FF) << 8));

If you can tolerate a change in endianness, the following is a lot cleaner to read (but provides no measurable change in runtime for me):

var ticksDiff = nowTicks - baseDateTicks;
int a = (int)(ticksDiff);
short b = (short)(ticksDiff >> 32);
short c = (short)(ticksDiff >> 48);

Remainder

You don't need to do a remainder when computing sequenceNumber because we're extracting the remainder bits directly anyhow when we pass (byte)(sequenceNumber >> 8), (byte)(sequenceNumber) to the Guid constructor.

My numbers below make it hard to conclude if this helps at all. This is the kind of thing that might be optimized away by the jitter anyway.

Benchmarks

With these changes, my benchmarks look as follows:

| Method | Mean | Error | StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------------------------------------ |---------:|--------:|--------:|------:|-------:|------:|------:|----------:|
| NewGuid_Original | 224.2 ns | 4.37 ns | 4.49 ns | 1.00 | 0.0041 | - | - | 20 B |
| NewGuid_WithPackedValues | 153.6 ns | 2.51 ns | 2.35 ns | 1.00 | 0.0041 | - | - | 20 B |
| NewGuid_WithPackedValuesAndNoBitConverter | 148.6 ns | 2.43 ns | 2.28 ns | 1.00 | - | - | - | - |
| NewGuid_WithReorderedBytes | 148.9 ns | 2.92 ns | 3.00 ns | 1.00 | - | - | - | - |
| NewGuid_WithReorderedBytesAndNoRemainder | 146.4 ns | 2.11 ns | 1.98 ns | 1.00 | - | - | - | - |

Final code:

public static Guid NewGuid_WithBinaryOptimizationsAndWithoutRemainder()
{
 var nowTicks = DateTime.UtcNow.Ticks;
 var sequenceNumber = Interlocked.Increment(ref ClockSequenceNumber);
 var ticksDiff = nowTicks - baseDateTicks;
 int a = (int)(ticksDiff);
 short b = (short)(ticksDiff >> 32);
 short c = (short)(ticksDiff >> 48);
 var guid = new Guid(
 a,
 b,
 c,
 (byte)(sequenceNumber >> 8),
 (byte)(sequenceNumber),
 macBytes[0],
 macBytes[1],
 macBytes[2],
 macBytes[3],
 macBytes[4],
 macBytes[5]);
 return guid;
}

I used BenchmarkDotNet which does the table formatting by default.

DevOnBike DevOnBike 713 bronze badges · Answer 2 · 2020-01-02 21:38:21Z

So after updating my source code it looks that it performs fast enough and allocates no memory. Here are benchmark for comparing the most popular methods of generating GUIDs.

Runtime=.NET Core 3.0 Force=True Server=True

| Method | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------------------------------ |----------:|---------:|---------:|------:|--------:|-------:|------:|------:|----------:|
| SqlServerNewSequentialGuid | 38.69 ns | 0.787 ns | 1.643 ns | 1.00 | 0.00 | 0.0009 | - | - | 80 B |
| Guid_Standard | 61.44 ns | 0.665 ns | 0.622 ns | 1.60 | 0.06 | - | - | - | - |
| NHibernate_GuidCombGenerator | 160.22 ns | 3.070 ns | 2.721 ns | 4.16 | 0.20 | 0.0012 | - | - | 104 B |
| Guid_Comb_New | 291.56 ns | 5.421 ns | 4.805 ns | 7.57 | 0.37 | 0.0010 | - | - | 104 B |
| EFCore_SequentialGuidValueGenerator | 82.04 ns | 1.623 ns | 1.667 ns | 2.13 | 0.11 | 0.0007 | - | - | 72 B |
| NewId_Lib | 78.93 ns | 1.150 ns | 1.076 ns | 2.05 | 0.09 | - | - | - | - |
| NewSequentialGuid_PureNetCore | 70.28 ns | 1.332 ns | 1.481 ns | 1.81 | 0.08 | - | - | - | - |

Full source code in c# .net core 2.2+:

using System;
using System.Buffers;
using System.Linq;
using System.Net.NetworkInformation;
using System.Threading;
namespace PerfBenchmarkDotNet
{
 internal readonly struct FastGuid
 {
 private static readonly byte[] _macBytes;
 private static long _baseDateTicks;
 private static long _clockSequenceNumber;
 static FastGuid()
 {
 _clockSequenceNumber = 0;
 _macBytes = GetDefaultMacAdress();
 _baseDateTicks = new DateTime(1582, 10, 15, 0, 0, 0, DateTimeKind.Utc).Ticks;
 }
 public static Guid NewGuid()
 {
 var nowTicks = DateTime.UtcNow.Ticks;
 var sequenceNumber = Interlocked.Increment(ref _clockSequenceNumber);
 var ticksDiff = nowTicks - _baseDateTicks;
 var a = (int)(ticksDiff);
 short b = (short)(ticksDiff >> 32);
 short c = (short)(ticksDiff >> 48);
 return new Guid(
 a,
 b,
 c,
 (byte)(sequenceNumber >> 8),
 (byte)(sequenceNumber),
 _macBytes[0],
 _macBytes[1],
 _macBytes[2],
 _macBytes[3],
 _macBytes[4],
 _macBytes[5]);
 }
 private static byte[] GetDefaultMacAdress()
 {
 var nic = NetworkInterface.GetAllNetworkInterfaces().FirstOrDefault();
 if (nic != null)
 {
 return nic.GetPhysicalAddress().GetAddressBytes();
 }
 var fallback = Guid.NewGuid().ToByteArray();
 return fallback.Take(6).ToArray();
 }
 }
}

I'm curious as to how you profiled SqlServerNewSequentialGuid (NEWSEQUENTIALID()) since that can only be used in a default constraint expression, AFAIK.

Stack Exchange Network

NewSequentialId - pure net core 2.2 implementation of sql server function

2 Answers 2

Profiling

Guid constructor

Binary stuff

Remainder

Benchmarks

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

NewSequentialId - pure net core 2.2 implementation of sql server function

2 Answers 2

Profiling

Guid constructor

Binary stuff

Remainder

Benchmarks

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions