I want an integer parser so fast it risks discovering time travel.
I'm stuck in C# for a language, but that doesn't mean I can't try to use c++. So I wrote the following:
The best I could do
public static unsafe bool FastTryParseInt(string input, out int result)
{
fixed (char* cString = input) {
unchecked {
char* nextChar = cString;
bool isNegative = false;
if (*nextChar == CharNegative) {
isNegative = true;
nextChar++;
}
result = 0;
while (*nextChar >= '0' && *nextChar <= '9')
result = result * 10 + (*nextChar++ - '0');
if (*nextChar != Char.MinValue) return false;
long ptrLen = nextChar - cString;
if (isNegative) {
result = -result;
return ptrLen < 11L || ptrLen <= 11L && result <= 0;
}
return ptrLen < 10L || ptrLen <= 10L && result >= 0;
}
}
}
Shoutout to @chux who gave me many good suggestions while reviewing our custom Double parser at Custom double parser optimized for performance .
With comments
Here it is with comments, because I'm not a monster (at least not that kind of monster):
// High performance integer parser with rudimentary flexibility.
// Supports leading negatives, but no white-space or other non-numeric characters.
public static unsafe bool FastTryParseInt(string input, out int result)
{
fixed (char* cString = input) { // Allows pointer math
unchecked { // Allows faster integer math by not testing for overflows
char* nextChar = cString;
bool isNegative = false;
// Handle a possible negative sign at the beginning of the string.
if (*nextChar == CharNegative)
{
isNegative = true;
nextChar++;
}
// Now process each character of the string
result = 0;
while (*nextChar >= '0' && *nextChar <= '9')
result = result * 10 + (*nextChar++ - '0');
// If the non-numeric character encountered to end the while loop
// wasn't the null terminator, the string is invalid.
if (*nextChar != Char.MinValue) return false;
// We need to check for an integer overflow based on the length of the string
long ptrLen = nextChar - cString;
// Result and overflow logic is different if there was a minus sign.
if (isNegative)
{
result = -result;
// Less than 11 characters (including negative) is no risk of overflow
// Longest possible negative int is 11 chars (-2147483648)
// If exactly 11 characters, we know our negative integer overflowed
// if the final result is greater than zero, otherwise it's fine.
return ptrLen < 11L || ptrLen <= 11L && result <= 0;
}
// Otherwise, overflow logic is the same, but opposite, and one fewer
// characters is allowed (because there was no minus sign)
return ptrLen < 10L || ptrLen <= 10L && result >= 0;
}
}
}
This is ~6x faster than a billion calls to Int.TryParse
.
Native parser took 13808 ms. Custom parser took 2191 ms. Performance gain was 530%
Unsafe? Gross!
I tried getting as close as I could to the above algorithm without using unsafe
, and to my surprise, it's about as good. It made me sad, because I was proud of my pointer math:
public static bool FastTryParseIntOld(string input, out int result)
{
result = 0;
int length = input.Length;
if (length == 0) return false;
bool isNegative = false;
int currentIndex = 0;
char nextChar = input[0];
unchecked
{
if (nextChar == CharNegative)
{
isNegative = true;
++currentIndex;
}
while (currentIndex < length)
{
nextChar = input[currentIndex++];
if (nextChar < '0' || nextChar > '9') return false;
result = result * 10 + (nextChar - '0');
}
if (isNegative)
{
result = -result;
return length < 11 || length <= 11 && result <= 0;
}
return length < 10 || length <= 10 && result >= 0;
}
}
Native parser took 13727 ms. Custom parser took 2377 ms. Performance gain was 477%
Frequently Asked Questions
Is
Double.Parse
really your performance bottleneck?
Yes! We benchmarked it in release mode and everything. We're ripping through billions of strings loaded into memory from a super-fast CSV reader, and calling native Double.TryParse
was around ~70% of our execution time. More than I/O even.
Then why are you using C#?
It's much nicer for the 100 other things our app has to do that aren't performance critical.
What's your benchmark code?
Here it is. I'm pretty sure it's a good measure:
const int iterations = 100000000;
string[] randomIntegers = Enumerable.Repeat(0, iterations )
.Select(i => generateRandomInput()).ToArray();
CultureInfo cachedCulture = CultureInfo.CurrentCulture;
Stopwatch timer = Stopwatch.StartNew();
foreach (string value in randomIntegers)
Int32.TryParse(value, nativeNumberStyles, cachedCulture, out T _);
Console.WriteLine($"Native parser took {timer.ElapsedMilliseconds} ms.");
timer.Restart();
foreach (string value in randomIntegers)
FastTryParseInt(value, out T _);
Console.WriteLine($"Custom parser took {timer.ElapsedMilliseconds} ms.");
Compiled and run as "Release|Any CPU" on a 64 bit machine.
My Question
Can we do better? Did I miss something in unsafe
mode that can make it way better than boring, safe C#?
3 Answers 3
I normally don't write performance-sensitive code like this, but I think I managed to find a few improvements:
- Writing to
out
parameters is slower than using local variables, so you only want to assign toresult
once. This makes a big difference. - Rejecting null/empty input right away might improve performance in some cases, perhaps because it enables some JIT optimizations.
- Overflow always causes a mismatch between the first input digit and the resulting value, so a single check should be sufficient.
And a few bug-fixes:
"00000000001"
was rejected because it has 11 digits, but it's a valid input. Leading zeroes should be ignored for length/overflow checks."405円"
was parsed as4
, but it's an invalid input.input.Length
should be used instead of checking for null-terminators.""
was parsed as0
, but it's an invalid input.
According to my tests the following (safe) implementation is roughly 10%-15% faster. An unsafe variant should increase that to some 20%:
public static bool TryParseInt(string input, out int result)
{
// This check sometimes improves performance - perhaps the extra guarantees enable some JIT optimizations?
if (input == null || input.Length == 0)
{
result = default(int);
return false;
}
var length = input.Length;
var isNegative = input[0] == '-';
var offset = isNegative ? 1 : 0;
// It's faster to not operate directly on 'out' parameters:
int value = 0;
for (int i = offset; i < length; i++)
{
var c = input[i];
if (c < '0' || c > '9')
{
result = default(int);
return false;
}
else
{
value = (value * 10) + (c - '0');
}
}
// Inputs with 10 digits or more might not fit in an integer, so they'll require additional checks:
if (length - offset >= 10)
{
// Overflow/length checks should ignore leading zeroes:
var meaningfulDigits = length - offset;
for (int i = offset; i < length && input[i] == '0'; i++)
meaningfulDigits -= 1;
if (meaningfulDigits > 10)
{
// Too many digits, this certainly won't fit:
result = default(int);
return false;
}
else if (meaningfulDigits == 10)
{
// 10-digit numbers can be several times larger than int.MaxValue, so overflow may result in any possible value.
// However, we only need to check the most significant digit to see if there's a mismatch.
// Note that int.MinValue always overflows, making it the only case where overflow is allowed:
if (!isNegative || value != int.MinValue)
{
// Any overflow will cause a leading digit mismatch:
if (value / 1000000000 != (input[length - 10] - '0'))
{
result = default(int);
return false;
}
}
}
}
// -int.MinValue overflows back into int.MinValue, so that's ok:
result = isNegative ? -value : value;
return true;
}
-
\$\begingroup\$ Wow, just switching to a local
int
for the accumulation rather than using theout result
directly upped the performance from +530% to +710% vs the native integer parser. Great find! \$\endgroup\$Alain– Alain2018年08月08日 16:24:53 +00:00Commented Aug 8, 2018 at 16:24
Like @202_accepted has mentioned in his comment you have a bug in your code which is hidden because of using unchecked in addition with your resulting condition.
I would divide this algorithm into 3 branches.
If the length of
input
is greater than the length of eitherint.MaxValue
orint.MìnValue
we can justreturn false
.If the length of
input
is smaller than the length of eitherint.MaxValue
orint.MìnValue
we can returnfalse
if any char is> 9
or< 0
.If the length is the same we iterate over
input
and compare it to a string representation of eitherint.MaxValue
orint.MinValue
and if any char, which isn't> 9
or< 0
, is greater than the char to compare we returnfalse
.
Because you may use this method in a different project as well I would validate the
input
fornull
and whitespace instead of checking theLength
property.If I see a
xxxTryParse
method I would assume theout
parameter beingdefault(T)
if the method returnsfalse
.
Implementing the mentioned points the code would loke like so
private const char CharNegative = '-';
private const int possibleMaxLength = 10;
private const int possibleMaxNegativeLength = 11;
private static string comparePositive = int.MaxValue.ToString();
private static string compareNegative =int.MinValue.ToString();
public static bool FastTryParseInt(string input, out int result)
{
result = 0;
if (string.IsNullOrWhiteSpace(input)) { return false; }
int length = input.Length;
bool isNegative = false;
int currentIndex = 0;
if (input[0] == CharNegative)
{
if (length > possibleMaxNegativeLength) { return false; }
isNegative = true;
++currentIndex;
}
int maxLength = isNegative ? possibleMaxNegativeLength : possibleMaxLength;
if (length > maxLength)
{
return false;
}
else if (length < maxLength)
{
char nextChar;
while (currentIndex < length)
{
nextChar = input[currentIndex++];
if (nextChar < '0' || nextChar > '9')
{
result = 0;
return false;
}
result = result * 10 + (nextChar - '0');
}
}
else
{
bool needsToBeCompared = true;
string valueToCompare = isNegative ? compareNegative : comparePositive;
char nextChar;
char compareChar;
while (currentIndex < maxLength)
{
nextChar = input[currentIndex];
compareChar = valueToCompare[currentIndex];
if (nextChar < '0' || nextChar > '9' || (needsToBeCompared && compareChar < nextChar))
{
result = 0;
return false;
}
if (needsToBeCompared)
{
needsToBeCompared = compareChar == nextChar;
}
result = result * 10 + (nextChar - '0');
currentIndex++;
}
}
if (isNegative) { result = -result; }
return true;
}
Timings (100000000 values, compiled AnyCpu, running on x64):
int.TryParse()
: 11200 ms
FastTryParse()
: 2730 ms
with generating random input like
private static Random random = new Random();
private static IEnumerable<string> GenerateRandomInput(int iterations)
{
if (iterations == 0) yield break;
for (int i = 0; i < iterations; i++)
{
if (i % 2 == 0)
{
yield return LongRandom((long)int.MinValue, ((long)int.MaxValue)).ToString();
}
else if (i % 3 == 0)
{
yield return LongRandom((long)int.MaxValue, ((long)int.MaxValue + 10000)).ToString();
}
else
{
yield return LongRandom(long.MinValue, long.MaxValue).ToString().Insert(3, "a");
}
}
}
private static long LongRandom(long min, long max)
{
byte[] buf = new byte[8];
random.NextBytes(buf);
long longRand = BitConverter.ToInt64(buf, 0);
return (Math.Abs(longRand % (max - min)) + min);
}
Edit
Since you don't bother about special cases like null
being passed to the method and performance is your target, I have changed it like so
public static bool FastTryParseInt(string input, out int result)
{
result = 0;
int length = input.Length;
if (length == 0) { return false; }
bool isNegative = false;
int currentIndex = 0;
if (input[0] == CharNegative)
{
if (length > possibleMaxNegativeLength) { return false; }
isNegative = true;
++currentIndex;
}
int maxLength = isNegative ? possibleMaxNegativeLength : possibleMaxLength;
if (length > maxLength)
{
return false;
}
char nextChar;
while (currentIndex < length)
{
nextChar = input[currentIndex++];
if (nextChar < '0' || nextChar > '9')
{
result = 0;
return false;
}
result = result * 10 + (nextChar - '0');
if (result < 0)
{
result = 0;
return false;
}
}
if (isNegative) { result = -result; }
return true;
}
If int result
is overflowing it becomes negative so we just check if result < 0
to return false
.
This runs in 2065 ms vs int.TryParse
in 11220 ms.
Btw. you only need the unchecked
keyword if you have checked "Check for arithmetic overflow/underflow" in the advanced build settings.
-
\$\begingroup\$
string.IsNullOrWhiteSpace
is latin sensitive, which could be optimized in this case, maybe? Seeing howchar.IsDigit
was not used in favor ofnextChar < '0' || nextChar > '9'
, which ignores globalization. I don't know what satisfies the requirements of the parser though. \$\endgroup\$Brad M– Brad M2018年08月07日 13:28:34 +00:00Commented Aug 7, 2018 at 13:28 -
\$\begingroup\$ That's an interesting concept, but in practice, it looks like the performance cost is too great. This new approach is only 310% faster than
int.TryParse
(down from 530%). Any 'checks' that can be saved until absolutely necessary are important to maximizing performance in the success case. Even just getting the length of the string could force an enumeration of the underlying character array that consumes half as many operations as the parsing of the string itself. \$\endgroup\$Alain– Alain2018年08月07日 13:42:48 +00:00Commented Aug 7, 2018 at 13:42 -
1\$\begingroup\$ Since you're rejecting anything with too many digits you should only need to check for overflow after the last digit. Also note that
int.MinValue
is a special case. \$\endgroup\$Pieter Witvoet– Pieter Witvoet2018年08月07日 15:13:00 +00:00Commented Aug 7, 2018 at 15:13 -
1\$\begingroup\$ Note that your latest edit has the same bug as the OP - a value like 4294967300 doesn't exceed the max length, but will overflow so far that it will become positive again - so your final check to see if the result is negative won't catch the overflow. \$\endgroup\$Alain– Alain2018年08月07日 18:00:00 +00:00Commented Aug 7, 2018 at 18:00
-
\$\begingroup\$ @Alain thanks. I just shouldn't change code without checking. Fixed \$\endgroup\$Heslacher– Heslacher2018年08月07日 18:47:42 +00:00Commented Aug 7, 2018 at 18:47
Here's the most performant version I've got so far based on suggestions by @PieterWitvoet and 202_accepted. It also corrects the overflow bug from my OP.
/// <summary>High performance integer parser with rudimentary flexibility.</summary>
/// <remarks>Supports negative numbers, but no whitespace or other non-digit characters
/// may be present.
/// Will not parse strings with more than 10 numeric characters,
/// even if there are leading zeros (such that the integer does not overflow).</remarks>
public static unsafe bool FastTryParseInt(string input, out int result)
{
// We never expect null, but enforcing this may enable some JIT optimizations.
if (input == null)
{
result = default(int);
return false;
}
fixed (char* cString = input)
{
unchecked
{
char* nextChar = cString;
bool isNegative = false;
// Check whether the first character is numeric
if (*nextChar < '0' || *nextChar > '9')
{
// Only allow a negative sign at the beginning of the string.
if (*nextChar != CharNegative)
{
result = default(int);
return false;
}
isNegative = true;
// Any other non-numeric characters after this is an error.
if (*++nextChar < '0' || *nextChar > '9')
{
result = default(int);
// Special Case: Excel has been known to format zero as "-".
// So return true here IFF this non-digit char is the end-of-string.
return *nextChar == Char.MinValue;
}
}
// Now process each character of the string
int localValue = *nextChar++ - '0';
while (*nextChar >= '0' && *nextChar <= '9')
localValue = localValue * 10 + (*nextChar++ - '0');
// If the non-numeric character encountered to end the while loop
// wasn't the null terminator, the string is invalid.
if (*nextChar != Char.MinValue)
{
result = default(int);
return false;
}
// We need to check for an integer overflow based on the length of the string
long ptrLen = nextChar - cString;
// Result and overflow logic is different if there was a minus sign.
if (isNegative)
{
result = -localValue;
// Longest possible negative int is 11 chars (-2147483648)
// Less than 11 characters (including negative) is no risk of overflow
if (ptrLen < 11L) return true;
// More than 11 characters is definitely an overflow.
if (ptrLen > 11L) return false;
// Exactly 11 characters needs to be checked for overflow.
// Neat Trick: An overflow will always result in the first digit changing.
return *(cString + 1) - '0' == localValue / 1000000000
// Special case, parsing 2147483648 overflows to -2147483648, but this
// value should be supported if there was a leading minus sign.
|| localValue == Int32.MinValue;
}
// Same logic if positive, but one fewer characters is allowed (no minus sign)
result = localValue;
if (ptrLen < 10L) return true;
if (ptrLen > 10L) return false;
return *cString - '0' == localValue / 1000000000;
}
}
}
Benchmarks
I was seeing a lot of variability in my benchmark runs when I looped over a billion strings at once, so to mange noise, I changed my test code to run several smaller test in succession. Here are the results for the above code:
Native parser took 1202 ms. Custom parser took 181 ms. Performance gain was 564.09% Native parser took 1211 ms. Custom parser took 177 ms. Performance gain was 584.18% Native parser took 1226 ms. Custom parser took 241 ms. Performance gain was 408.71% Native parser took 1315 ms. Custom parser took 180 ms. Performance gain was 630.56% Native parser took 1402 ms. Custom parser took 182 ms. Performance gain was 670.33% Native parser took 1212 ms. Custom parser took 181 ms. Performance gain was 569.61% Native parser took 1221 ms. Custom parser took 178 ms. Performance gain was 585.96% Native parser took 1203 ms. Custom parser took 178 ms. Performance gain was 575.84% Native parser took 1203 ms. Custom parser took 178 ms. Performance gain was 575.84% Native parser took 1220 ms. Custom parser took 179 ms. Performance gain was 581.56%
The average is about 575% faster than native parse.
Performance Improvements
Uses
unsafe
andfixed
to treat the string as a null-terminated character array, avoiding the need to monitor or pre-compute the length of the string as we traverse it.Initializes the local value directly using the first numeric digit, avoiding a superfluous multiplication and addition with zero in the first loop.
A null check at the beginning may enable some JIT optimizations.
Uses a local integer value for accumulation during parsing, and only assigns
out result
once - since manipulatingout
variables directly is more expensive.Corrected the overflow check. In the majority case (less than 11/10 characters) there are no additional ops for overflow detection. Otherwise, for a number close to the limit, it takes just a couple additional ops to check whether the first digit changed due to overflow.
Omissions
Still does not account for leading zeros in overflow detection. Doing so slows down the expected case too much.
Still does not allow white-space, thousand separators, a leading positive sign, or trailing signs.
Note that in all above 'unhandled' cases, we're very careful to return false
- we will never return true
with an incorrect result.
Note that you could replace all instances of return false
with return Int.TryParse(input, out result)
if you wanted to "fall-back" to the native integer parser in these rare cases to strike a balance between performance and flexibility. In our case, something similar is done further up the chain, so I haven't included it in this code.
-
\$\begingroup\$ This fails to handle inputs like
""
,"00000000001"
and"405円"
. The overflow/length checks should ignore leading zeroes, and you may want to check the actual length of the input rather than relying on null-terminators. \$\endgroup\$Pieter Witvoet– Pieter Witvoet2018年08月08日 07:04:46 +00:00Commented Aug 8, 2018 at 7:04 -
\$\begingroup\$ One more thing: what about inputs with a leading positive sign, like
"+1"
? \$\endgroup\$Pieter Witvoet– Pieter Witvoet2018年08月08日 11:06:35 +00:00Commented Aug 8, 2018 at 11:06 -
1\$\begingroup\$
string.Length
is precomputed, so it's quite inexpensive, and the leading-zeroes check doesn't need to be done up-front, so it should only affect performance for 10+ digit input. Regarding the null-check, did you actually notice an improvement? For me it was very hit-and-miss. \$\endgroup\$Pieter Witvoet– Pieter Witvoet2018年08月10日 09:25:33 +00:00Commented Aug 10, 2018 at 9:25 -
\$\begingroup\$ I noticed no performance change trends with the null check (especially given the inherent noise in the results as seen above.) But so long as there's no negative impact, it's nice to be able to handle that case rather than throw a null reference exception. \$\endgroup\$Alain– Alain2018年08月10日 18:50:28 +00:00Commented Aug 10, 2018 at 18:50
-
\$\begingroup\$ Good to know about
string.Length
. I intuitively thought it might get lazily computed as needed, but it looks like it is populated at construction time. \$\endgroup\$Alain– Alain2018年08月10日 18:59:42 +00:00Commented Aug 10, 2018 at 18:59
FastTryParseInt(4294967296)
will returnTrue / 0
. This works for[4294967296,6442450943]
,[8589934592,9999999999]
,[-6442450944,-4294967296]
, and[-9999999999,-8589934592]
. \$\endgroup\$