High throughput Fizz Buzz

Fizz Buzz is a common challenge given during interviews. The challenge goes something like this:

Write a program that prints the numbers from 1 to n. If a number is divisible by 3, write Fizz instead. If a number is divisible by 5, write Buzz instead. However, if the number is divisible by both 3 and 5, write FizzBuzz instead.

The goal of this question is to write a FizzBuzz implementation that goes from 1 to infinity (or some arbitrary very very large number), and that implementation should do it as fast as possible.

Checking throughput

Write your FizzBuzz program, then run it with its output piped through pv: `<your_program> | pv > /dev/null`. The higher the throughput pv reports, the better you did.

Example

A naive implementation written in C gets you about 170MiB/s on an average machine:

#include <stdio.h>
int main() {
    for (int i = 1; i < 1000000000; i++) {
        if ((i % 3 == 0) && (i % 5 == 0)) {
            printf("FizzBuzz\n");
        } else if (i % 3 == 0) {
            printf("Fizz\n");
        } else if (i % 5 == 0) {
            printf("Buzz\n");
        } else {
            printf("%d\n", i);
        }
    }
}

There is a lot of room for improvement here. In fact, I've seen an implementation that can get more than 3GiB/s on the same machine.

I want to see what clever ideas the community can come up with to push the throughput to its limits.

Rules

All languages are allowed.
The program output must be exactly valid fizzbuzz. No playing tricks such as writing null bytes in between the valid output - null bytes that don't show up in the console but do count towards pv throughput.

Here's an example of valid output:

1
2
Fizz
4
Buzz
Fizz
7
8
Fizz
Buzz
11
Fizz
13
14
FizzBuzz
16
17
Fizz
19
Buzz
# ... and so on

Valid output must be plain ASCII with one byte per character, and line endings must be a single \n, not \r\n. The numbers and strings must be correct as per the FizzBuzz requirements. The output must go on forever (or up to an astronomically high number, at least 2^58) and must not halt or change prematurely.

Parallel implementations are allowed (and encouraged).
Architecture specific optimizations / assembly is also allowed. This is not a real contest - I just want to see how people push fizz buzz to its limit - even if it only works in special circumstances/platforms.
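
The output rules above (ASCII only, `\n` line endings, no stray bytes, correct sequence) can be checked mechanically. Here is a minimal sketch of such a check; `check_fizzbuzz` is a hypothetical helper written for illustration and is not part of the challenge tooling. It only accepts buffers that end on a complete line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: verify that buf[0..len) holds complete FizzBuzz
   lines for the numbers first, first+1, ...  Returns the number of valid
   lines, or -1 at the first byte sequence that breaks the rules
   (wrong text, "\r\n", embedded NUL bytes, truncated last line). */
long check_fizzbuzz(const char *buf, size_t len, uint64_t first)
{
    assert(buf != NULL || len == 0);
    long lines = 0;
    size_t pos = 0;
    uint64_t n = first;

    while (pos < len) {
        char expect[32];
        int elen;
        if (n % 15 == 0)
            elen = snprintf(expect, sizeof expect, "FizzBuzz\n");
        else if (n % 5 == 0)
            elen = snprintf(expect, sizeof expect, "Buzz\n");
        else if (n % 3 == 0)
            elen = snprintf(expect, sizeof expect, "Fizz\n");
        else
            elen = snprintf(expect, sizeof expect, "%llu\n",
                            (unsigned long long)n);

        /* The next bytes must match the expected line exactly. */
        if ((size_t)elen > len - pos ||
            memcmp(buf + pos, expect, (size_t)elen) != 0)
            return -1;

        pos += (size_t)elen;
        ++n;
        ++lines;
    }
    return lines;
}
```

A checker like this is far slower than the fast submissions, so it is only practical on a bounded prefix of the output rather than on the full stream.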

Scores

Scores are from running on my desktop with an AMD 5950X CPU (16C / 32T) and 64GB of 3200 MHz RAM. CPU mitigations are disabled.

By far the best score so far is in C++ by @David Frank, generating FizzBuzz at around 1.7 Terabit/s.

In second place is @ais523, generating FizzBuzz at 61 GiB/s with x86 assembly.

Results for java:

Author	Throughput
ioan2	2.6 GiB/s
randy	0.6 GiB/s
randy_ioan	3.3 GiB/s
ioan	4.6 GiB/s
olivier	5.8 GiB/s

Results for python:

Author	Throughput
bconstanzo	0.1 GiB/s
michal	0.1 GiB/s
ksousa_chunking	0.1 GiB/s
ksousa_initial	0.0 GiB/s
arrmansa	0.2 GiB/s
antoine	0.5 GiB/s

Results for pypy:

Author	Throughput
bconstanzo_pypy	0.3 GiB/s

Results for rust:

Author	Throughput
aiden	3.4 GiB/s
xavier	0.9 GiB/s

Results for ruby:

Author	Throughput
lonelyelk	0.1 GiB/s
dgutov	1.7 GiB/s

Results for asm:

Author	Throughput
ais523	60.8 GiB/s
paolo	39.0 GiB/s

Results for julia:

Author	Throughput
marc	0.7 GiB/s
tkluck	15.5 GiB/s

Results for c:

Author	Throughput
random832	0.5 GiB/s
neil	8.4 GiB/s
kamila	8.4 GiB/s
xiver77	20.9 GiB/s
isaacg	5.7 GiB/s

Results for cpp:

Author	Throughput
jdt	4.8 GiB/s
tobycpp	5.4 GiB/s
david	208.3 GiB/s

Results for numba:

Author	Throughput
arrmansa	0.1 GiB/s

Results for numpy:

Author	Throughput
arrmansa	0.3 GiB/s
arrmansa-multiprocessing	0.7 GiB/s
arrmansa-multiprocessing-2	0.7 GiB/s

Results for go:

Author	Throughput
Bysmyyr	3.7 GiB/s
psaikko	6.8 GiB/s

Results for php:

Author	Throughput
no_gravity	0.5 GiB/s

Results for elixir:

Author	Throughput
technusm1	0.3 GiB/s

Results for csharp:

Author	Throughput
neon-sunset	1.2 GiB/s

[Plots of submission throughput per language: asm, java, pypy, python, ruby, rust, c, cpp, julia, numba, numpy, go, php, csharp]

Plots generated using https://github.com/omertuc/fizzgolf

Answer

My solution maintains a buffer with a batch of lines (6000 lines worked best on my system), and updates all the numbers in the buffer in a parallelisable loop. We use an auxiliary array `nl[]` to keep track of where each newline lies, so we have random access to all the numbers.
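
As a rough illustration of the buffer-plus-index idea (a simplified sketch, not the tuned code further down), the following fills a small batch and records the line boundaries. The names `fill_batch`, `buf`, `bounds`, `BATCH` and `MAXLEN` are made up for this example, and the batch size is deliberately tiny.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

enum { BATCH = 15, MAXLEN = 25 };  /* illustrative sizes, not tuned values */

char buf[BATCH * MAXLEN + 1];      /* +1 for sprintf's terminating NUL */
char *bounds[BATCH + 1];           /* line i occupies bounds[i] .. bounds[i+1]-1 */

/* Fill buf with the FizzBuzz lines for first .. first+BATCH-1 and record
   where every line begins and ends.  Returns the number of bytes used. */
size_t fill_batch(unsigned long first)
{
    char *p = buf;
    bounds[0] = p;
    for (int i = 0; i < BATCH; ++i) {
        unsigned long n = first + (unsigned long)i;
        if (n % 15 == 0)
            p += sprintf(p, "FizzBuzz\n");
        else if (n % 5 == 0)
            p += sprintf(p, "Buzz\n");
        else if (n % 3 == 0)
            p += sprintf(p, "Fizz\n");
        else
            p += sprintf(p, "%lu\n", n);
        bounds[i + 1] = p;         /* one past this line's newline */
    }
    assert(p <= buf + BATCH * MAXLEN);
    return (size_t)(p - buf);
}
```

After `fill_batch(1)`, `bounds[10]` points straight at the line for 11, and a test such as `bounds[3][-2] == 'z'` identifies the third line as a Fizz/Buzz line without parsing it.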

The addition is all in-place decimal character-by-character arithmetic, with no arithmetic division after the buffer is initialised (I could have created the buffer without division, too, but opted for shorter, readable code!). Every so often, when the number of digits rolls over, we have to stop and re-position all the numbers within the buffer (that's what the `shuffle` counter is for), and update the corresponding entries in `nl[]`; this happens more and more infrequently as we proceed.
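
To make the in-place arithmetic concrete, here is a small sketch of adding one decimal digit to an ASCII number and of making room for a new leading digit. It is a simplification under stated assumptions: `add_digit_in_place` and `prepend_one` are hypothetical helpers, and overflow is reported through a return value, whereas the full program below lets the carry spill into the preceding newline byte and fixes up the whole buffer in one pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Add digit d (0..9) to the ASCII digit at *p and ripple the carry toward
   the most significant digit at *start.  Returns true if the carry ran off
   the front, in which case the digits hold the result without its leading
   '1' and the caller must make room for it. */
bool add_digit_in_place(char *start, char *p, int d)
{
    assert(d >= 0 && d <= 9 && p >= start);
    *p += d;
    while (*p > '9') {
        *p -= 10;
        if (p == start)
            return true;           /* number grew by one digit */
        --p;
        ++*p;                      /* carry into the next digit up */
    }
    return false;
}

/* Shift len bytes starting at start one place to the right and write the
   new leading '1'.  The caller guarantees one spare byte after them. */
void prepend_one(char *start, size_t len)
{
    memmove(start + 1, start, len);
    *start = '1';
}
```

For example, adding 6 to the thousands digit of `"12001\n"` gives `"18001\n"` with no overflow, while doing the same to `"6001\n"` leaves `"2001\n"` and returns true, after which `prepend_one` turns it into `"12001\n"`.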

I compiled using `gcc -std=gnu17 -Wall -Wextra -fopenmp -O3 -march=native`, and ran with `OMP_NUM_THREADS=3` set in the environment (a different number of threads may be optimal on another host).
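
For readers unfamiliar with OpenMP, here is a tiny sketch of the pattern those flags enable: a `parallel for` loop whose iterations share a C11 atomic counter. The function `count_fizzbuzz` is invented for this illustration. Mixing `atomic_int` with OpenMP is assumed to behave as it does with GCC; it is not something I'd rely on across all compilers.

```c
#include <assert.h>
#include <stdatomic.h>

/* Count the multiples of 15 in 1 .. n-1.  Built with -fopenmp, the loop
   iterations are divided among OMP_NUM_THREADS threads; built without it,
   the pragma is ignored and the loop runs serially with the same result. */
int count_fizzbuzz(int n)
{
    atomic_int hits = 0;
    #pragma omp parallel for schedule(static)
    for (int i = 1; i < n; ++i) {
        if (i % 15 == 0)
            ++hits;                /* atomic increment, safe from any thread */
    }
    return hits;
}
```

Running the same binary with different `OMP_NUM_THREADS` values is then enough to compare thread counts on a given host.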

    #include <stdatomic.h>
    #include <stdio.h>              /* sprintf */
    #include <string.h>             /* memset */
    #include <unistd.h>             /* write */

    /* This is the single tunable you need to adjust for your platform */
    #define chunk 6000 /* must be multiple of 3*5, with only one nonzero digit */
                       /* i.e. 3, 6 or 9 times an exact power of ten */

    /* Select a number of digits to use.  If we produce one billion numbers
       per second, then we'll finish all the 18-digit numbers in just 30
       years.  24 digits should suffice until next geological epoch, at least. */
    #define numlen 25               /* 24 decimal digits plus newline */

    #define STR_(x) #x
    #define STR(x) STR_(x)
    #define chunk_str STR(chunk)

    #define unlikely(e) __builtin_expect((e), 0)

    char format[chunk * numlen];
    char *nl[chunk+1];

    int main()
    {
        /* Create the format string. */
        /* We do this twice, as the numbers written first time round are
           too short for the addition. */
        for (int j = 0, n = 1;  j < 2;  ++j)
        {
            nl[0] = format;
            char *p = format;
            for (int i = 0;  i <= chunk;  ++i, ++n) {
                if ((n % 15) == 0) {
                    p += sprintf(p, "FizzBuzz\n");
                } else if ((n % 5) == 0) {
                    p += sprintf(p, "Buzz\n");
                } else if ((n % 3) == 0) {
                    p += sprintf(p, "Fizz\n");
                } else {
                    p += sprintf(p, "%d\n", n);
                }
                nl[i] = p;          /* end of line i, start of line i+1 */
            }
            write(1, format, nl[chunk] - format);
        }

        atomic_int shuffle = 0;
        for (;;) {
    #pragma omp parallel for schedule(static)
            for (int i = 0;  i < chunk;  ++i) {
                if (nl[i+1][-2] == 'z') {
                    /* fizz and/or buzz - not a number */
                    continue;
                }
                /* else add 'chunk' to the number */
                static const int units_offset = sizeof chunk_str;
                static const int digit = chunk_str[0] - '0';
                char *p = nl[i+1] - units_offset;
                *p += digit;
                while (*p > '9') {
                    *p-- -= 10;
                    ++*p;
                }
                if (unlikely(p < nl[i])) {
                    /* digit rollover */
                    ++shuffle;
                }
            }
            if (unlikely(shuffle)) {
                /* add a leading one to each overflowing number */
                char **nlp = nl + chunk;
                char *p = *nlp;
                char *dest = p + shuffle;
                while (p < dest) {
                    if (*p == '\n') {
                        *nlp-- = dest + 1;
                    } else if (*p == '\n'+1) {
                        --*p;
                        *dest-- = '1';
                        *nlp-- = dest + 1;
                    }
                    *dest-- = *p--;
                }
                shuffle = 0;
            }
            /* Line 0 (format .. nl[0]) is never updated by the loop above,
               so output starts at nl[0] to avoid repeating a stale line */
            write(1, nl[0], nl[chunk] - nl[0]);
        }
    }


Wouldn't `memcpy(p,"Fizz\n",6); p+=5` be faster than `sprintf`?
– EasyasPi
Commented Jan 19, 2021 at 22:05

Optimal with `OMP_NUM_THREADS=1` 🤔. @EasyasPi tried that, didn't make a noticeable difference
– Omer Tuchfeld
Commented Jan 19, 2021 at 22:22
@EasyasPi, ITYM `memcpy(p,"Fizz\n",5);` to avoid pointlessly writing the null character each time? memcpy/sprintf tuning makes no measurable difference, as that's only the setup, outside the main loop. `sprintf()` makes for more maintainable code. (I'm new to programming for all-out speed; in my day job that comes in third, after robustness and maintainability)
– Toby Speight
Commented Jan 20, 2021 at 7:24
@Omer: yes, the poor parallelisation was a disappointment. I tried some other parallelisations, too (separate array of formatted numbers, and putting the serialisation and writing into its own thread). If I'm to get any real benefit, I might have to go low-level and hand-craft the threads and their synchronisation (probably switch to C++ for that).
– Toby Speight
Commented Jan 20, 2021 at 7:31
I think I manage slightly better if I have `atomic_int shuffle` instead of the reduction. (time passes...) Yes, and I've updated to code that actually works faster in parallel, at last!
– Toby Speight
Commented Jan 20, 2021 at 7:39

