High throughput Fizz Buzz

Question 1

Fizz Buzz is a common challenge given during interviews. The challenge goes something like this:

Write a program that prints the numbers from 1 to n. If a number is divisible by 3, write Fizz instead. If a number is divisible by 5, write Buzz instead. However, if the number is divisible by both 3 and 5, write FizzBuzz instead.

The goal of this question is to write a FizzBuzz implementation that goes from 1 to infinity (or some arbitrary very very large number), and that implementation should do it as fast as possible.

Checking throughput

Write your fizz buzz program. Run it. Pipe the output through <your_program> | pv > /dev/null. The higher the throughput, the better you did.

Example

A naive implementation written in C gets you about 170MiB/s on an average machine:

#include <stdio.h>
int main() {
 for (int i = 1; i < 1000000000; i++) {
 if ((i % 3 == 0) && (i % 5 == 0)) {
 printf("FizzBuzz\n");
 } else if (i % 3 == 0) {
 printf("Fizz\n");
 } else if (i % 5 == 0) {
 printf("Buzz\n");
 } else {
 printf("%d\n", i);
 }
 }
}

There is a lot of room for improvement here. In fact, I've seen an implementation that can get more than 3GiB/s on the same machine.

I want to see what clever ideas the community can come up with to push the throughput to its limits.

Rules

All languages are allowed.
The program output must be exactly valid fizzbuzz. No playing tricks such as writing null bytes in between the valid output - null bytes that don't show up in the console but do count towards pv throughput.

Here's an example of valid output:

1
2
Fizz
4
Buzz
Fizz
7
8
Fizz
Buzz
11
Fizz
13
14
FizzBuzz
16
17
Fizz
19
Buzz
# ... and so on

Valid output must be simple ASCII, single-byte per character, new lines are a single \n and not \r\n. The numbers and strings must be correct as per the FizzBuzz requirements. The output must go on forever (or a very high astronomical number, at-least 2^58) and not halt / change prematurely.

Parallel implementations are allowed (and encouraged).
Architecture specific optimizations / assembly is also allowed. This is not a real contest - I just want to see how people push fizz buzz to its limit - even if it only works in special circumstances/platforms.

Scores

Scores are from running on my desktop with an AMD 5950x CPU (16C / 32T). I have 64GB of 3200Mhz RAM. CPU mitigations are disabled.

**By far the best score so far is in C++ by @David Frank - generating FizzBuzz at around 1.7 Terrabit/s

At the second place - @ais523 - generating FizzBuzz at 61GiB/s with x86 assembly.

Results for java:

Author	Throughput
ioan2	2.6 GiB/s
randy	0.6 GiB/s
randy_ioan	3.3 GiB/s
ioan	4.6 GiB/s
olivier	5.8 GiB/s

Results for python:

Author	Throughput
bconstanzo	0.1 GiB/s
michal	0.1 GiB/s
ksousa_chunking	0.1 GiB/s
ksousa_initial	0.0 GiB/s
arrmansa	0.2 GiB/s
antoine	0.5 GiB/s

Results for pypy:

Author	Throughput
bconstanzo_pypy	0.3 GiB/s

Results for rust:

Author	Throughput
aiden	3.4 GiB/s
xavier	0.9 GiB/s

Results for ruby:

Author	Throughput
lonelyelk	0.1 GiB/s
dgutov	1.7 GiB/s

Results for asm:

Author	Throughput
ais523	60.8 GiB/s
paolo	39.0 GiB/s

Results for julia:

Author	Throughput
marc	0.7 GiB/s
tkluck	15.5 GiB/s

Results for c:

Author	Throughput
random832	0.5 GiB/s
neil	8.4 GiB/s
kamila	8.4 GiB/s
xiver77	20.9 GiB/s
isaacg	5.7 GiB/s

Results for cpp:

Author	Throughput
jdt	4.8 GiB/s
tobycpp	5.4 GiB/s
david	208.3 GiB/s

Results for numba:

Author	Throughput
arrmansa	0.1 GiB/s

Results for numpy:

Author	Throughput
arrmansa	0.3 GiB/s
arrmansa-multiprocessing	0.7 GiB/s
arrmansa-multiprocessing-2	0.7 GiB/s

Results for go:

Author	Throughput
Bysmyyr	3.7 GiB/s
psaikko	6.8 GiB/s

Results for php:

Author	Throughput
no_gravity	0.5 GiB/s

Results for elixir:

Author	Throughput
technusm1	0.3 GiB/s

Results for csharp:

Author	Throughput
neon-sunset	1.2 GiB/s

asm submissions java submissions pypy submissions python submissions ruby submissions rust submissions c submissions cpp submissions julia submissions numba submissions numpy submissions go submissions php submissions csharp submissions

Plots generated using https://github.com/omertuc/fizzgolf

Question 2

How are we scoring speed? If you change the machine the speed will change drastically. You mentioned architecture specific optimizations as well, so it seems like it is tested on multiple different architectures. How do you ensure that comparison is fair?

Question 3

That's definitely a problem. I will try and run the implementations posted here on my desktop and keep score in the main question by editing it when new answers are posted. If it's something architecture specific I can't run, it'll still interesting to see, doesn't have to compete in the same score table.

Question 4

@agtoever note that the benchmark is done using piping to pv, which means that the Linux default pipe buffer size of 64K is used, as well as pv's default 64K buffer size. My entry exploits these properties by setting the buffer size to 64K, but it still blocks on write due to many factors; one of which is the generation speed being affected by the varying length of the printed numbers.

Question 5

related: My answer on an SO question about x86-64 asm fizz buzz includes a tutorial on tuning some parts of it for performance. (e.g. unroll by 3 and use a down-counter starting at 5 instead of modulo on i). But it still uses a slow div for int->string, and critically uses write for each group of 15 output lines built up in a buffer. Full-on perf tuning would probably just unroll by 15, and of course use much bigger buffers.

Question 6

Hoe do you measure throughput of a program which runs "for ever"? If I have a program which first does calculations for a year, then outputs a googol lines in a minute, it will lose if you stop measuring if you cut off the measurement within a year, but it'll be a winner if you measure for a year and a minute. This is of course an extreme example, but many programs will get slower when numbers get larger, so how you long you measure matters.

Question 7

this is inspired by the previous Rust entry by aiden4 at https://codegolf.stackexchange.com/a/217455/118590

it's a threaded version of aiden4's code, with the macros replaced by inline functions, because i haven't learned macros yet =)

this will utilize all logical CPUs up to MAX_THREADS, and uses one thread to handle synchronization.

build with cargo build --release and run with ./target/release/fizz_buzz_threaded

/.cargo/config.toml (please adjust as necessary):

[target.x86_64-unknown-linux-gnu]
rustflags = ["-Ctarget-cpu=native"]
[target.x86_64-pc-windows-msvc]
rustflags = ["-Ctarget-cpu=native"]
[target.x86_64-pc-windows-gnu]
rustflags = ["-Ctarget-cpu=native"]

main.rs:

use std::io::*;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
extern crate num_cpus;
const FIZZ: *const u8 = "Fizz\n".as_ptr();
const BUZZ: *const u8 = "Buzz\n".as_ptr();
const FIZZBUZZ: *const u8 = "FizzBuzz\n".as_ptr();
const MAX_NUM: usize = 90_000; // results to cache before writing output
const MAX_THREADS: usize = 256;
const BUF_SIZE: usize = MAX_NUM * 207; // i think max size for one round of fizz buzz ANSI is 206 bytes?
const UATOMIC: AtomicUsize = AtomicUsize::new(0);
static mut BUFS: [[u8; BUF_SIZE]; MAX_THREADS] = [[0u8; BUF_SIZE]; MAX_THREADS];
static mut BUF_COUNTS: [usize; MAX_THREADS] = [0usize; MAX_THREADS];
static mut LOCKS: [AtomicUsize; MAX_THREADS] = [UATOMIC; MAX_THREADS];
unsafe fn sync_threads(threads: usize) {
 let mut synced = false;
 let mut val: usize;
 let mut out = stdout();
 loop {
 for x in 0..threads {
 val = LOCKS[x].load(Ordering::SeqCst);
 if val == 1 {
 synced = true;
 } else {
 synced = false;
 break;
 }
 }
 if synced {
 for x in 0..threads {
 let res = out.write_all(BUFS[x].get_unchecked(..BUF_COUNTS[x]));
 if res.is_err() {
 println!("{:?}", res.err());
 }
 LOCKS[x].store(0, Ordering::SeqCst);
 }
 }
 }
}
unsafe fn thread(id: usize, mut num: usize, worker_threads: usize) {
 // alas, i haven't learned macros yet
 unsafe fn itoap_write(id: usize, buf_count: &mut usize, num: usize) {
 *buf_count +=
 itoap::write_to_ptr(BUFS[id].get_unchecked_mut(*buf_count..).as_mut_ptr(), num);
 BUFS[id].as_mut_ptr().add(*buf_count).write(b'\n');
 *buf_count += 1;
 }
 unsafe fn str_write(id: usize, buf_count: &mut usize, string: *const u8, len: usize) {
 let ptr = BUFS[id].get_unchecked_mut(*buf_count..).as_mut_ptr();
 ptr.copy_from_nonoverlapping(string, len);
 *buf_count += len;
 }
 let mut val: usize;
 let mut count: usize = 0;
 num += 1;
 let number_skip = (worker_threads - 1) * MAX_NUM;
 loop {
 val = LOCKS[id].load(Ordering::SeqCst);
 if val == 0 {
 if count > MAX_NUM - 15 {
 count = 0;
 LOCKS[id].store(1, Ordering::SeqCst);
 loop {
 val = LOCKS[id].load(Ordering::SeqCst);
 if val == 0 {
 BUF_COUNTS[id] = 0;
 num += number_skip;
 break;
 }
 }
 continue;
 }
 itoap_write(id, &mut BUF_COUNTS[id], num);
 num += 1;
 itoap_write(id, &mut BUF_COUNTS[id], num);
 str_write(id, &mut BUF_COUNTS[id], FIZZ, 5);
 num += 2;
 itoap_write(id, &mut BUF_COUNTS[id], num);
 str_write(id, &mut BUF_COUNTS[id], BUZZ, 5);
 str_write(id, &mut BUF_COUNTS[id], FIZZ, 5);
 num += 3;
 itoap_write(id, &mut BUF_COUNTS[id], num);
 num += 1;
 itoap_write(id, &mut BUF_COUNTS[id], num);
 str_write(id, &mut BUF_COUNTS[id], FIZZ, 5);
 str_write(id, &mut BUF_COUNTS[id], BUZZ, 5);
 num += 3;
 itoap_write(id, &mut BUF_COUNTS[id], num);
 str_write(id, &mut BUF_COUNTS[id], FIZZ, 5);
 num += 2;
 itoap_write(id, &mut BUF_COUNTS[id], num);
 num += 1;
 itoap_write(id, &mut BUF_COUNTS[id], num);
 str_write(id, &mut BUF_COUNTS[id], FIZZBUZZ, 9);
 num += 2;
 count += 15;
 }
 }
}
fn main() {
 let max_threads = num_cpus::get();
 if max_threads < 2 {
 panic!("multi-cpu only!")
 }
 let mut thread_ids: Vec<usize> = Vec::with_capacity(max_threads);
 for x in 0..max_threads {
 thread_ids.push(x);
 }
 thread::scope(|s| {
 for x in thread_ids.iter_mut() {
 if *x == max_threads - 1 {
 let builder = thread::Builder::new();
 let res = builder.spawn_scoped(s, || unsafe { sync_threads(max_threads - 1) });
 if res.is_err() {
 println!("Error spinning up sync thread: {}", res.err().unwrap());
 }
 } else {
 let builder = thread::Builder::new();
 let res = builder
 .spawn_scoped(s, || unsafe { thread(*x, *x * MAX_NUM, max_threads - 1) });
 if res.is_err() {
 println!("Error spinning up worker thread: {}", res.err().unwrap());
 }
 }
 }
 });
}

Cargo.toml:

[package]
name = "fizz_buzz_threaded"
version = "0.1.0"
authors = ["0xDEADFED5","aiden4"]
edition = "2021"
[dependencies]
num_cpus = "1.15.0"
itoap = "0.1"
[profile.release]
lto = "fat"

also available at https://github.com/0xDEADFED5/fizz_buzz_threaded

Question 8

It segfaults unless I adjust MAX_THREADS. 32 works. 64 works. 96 works. 231 to 256 segfaults at runtime. If I set it to anything between 116 and 230 inclusive the compiler itself crashes! Anyway the throughput seems to be 4.3GB/s-ish. I'll try to find some time to add it to the tables since it seems to be the fastest

Question 9

i did fix a bug in the code shortly after posting. was hoping it would be faster on your machine since for some strange reason it runs faster on my laptop than that beautiful AVX2 code.

Question 10

My comment was referring to the latest code on GitHub and seems to still be relevant. I will hold back from including this submission in the scores table until those issues are resolved

Question 11

Ruby

Here's some refinements based on the initial solution by @lonelyelk. Tested using the now latest Ruby 3.3. His original five-liner gives ~70 MiB/s on my machine.

Step 1: Drop the each_slice iteration, using a simple loop:

# frozen_string_literal: true
FMT = "%d\n%d\nFizz\n%d\nBuzz\nFizz\n%d\n%d\nFizz\nBuzz\n%d\nFizz\n%d\n%d\nFizzBuzz\n"
i = 1
loop do
 printf(FMT, i, i + 1, i + 3, i + 6, i + 7, i + 10, i + 12, i + 13)
 i += 15
end

The result runs at 135 MiB/s, almost twice as fast.

Step 2: Create much longer (by x400) strings each iteration step, using some runtime-generated code.

# frozen_string_literal: true
MULT = 400
FMT = "%d\n%d\nFizz\n%d\nBuzz\nFizz\n%d\n%d\nFizz\nBuzz\n%d\nFizz\n%d\n%d\nFizzBuzz\n" * MULT
idxs = (0..MULT).flat_map do |i|
 [0, 1, 3, 6, 7, 10, 12, 13].map { |j| j + i * 15 }
end
eval <<EOS
 def run_loop
 i = 1
 loop do
 printf(FMT, #{idxs.map { |n| "i + #{n}" }.join(', ') })
 i += 15 * MULT
 end
 end
EOS
run_loop

This gets up to 180-200 MiB/s, overtaking the "naive C".

Finally, parallelize the computation using ractors (16 seems like the sweet spot on my 6-core machine). This chunk of work is big enough that message-passing overhead is not a problem.

# frozen_string_literal: true
MULT = 400
FMT = ("%d\n%d\nFizz\n%d\nBuzz\nFizz\n%d\n%d\nFizz\nBuzz\n%d\nFizz\n%d\n%d\nFizzBuzz\n" * MULT).freeze
WORKERS = 16
idxs = (0..MULT).flat_map do |i|
 [0, 1, 3, 6, 7, 10, 12, 13].map { |j| j + i * 15 }
end
eval <<EOS
 def spawn_worker
 Ractor.new do
 loop do
 i = Ractor.receive
 Ractor.yield format(FMT, #{idxs.map { |n| "i + #{n}" }.join(', ') }).freeze
 end
 end
 end
EOS
workers = (1...WORKERS).map { spawn_worker }
ii = 1
loop do
 workers.each do |worker|
 worker.send(ii)
 ii += 15 * MULT
 end
 workers.each { |worker| puts worker.take }
end

This gives ~400 MiB/s, with fluctuations 380-420.

Question 12

A C++ program for Linux. I use the same method of arithmetic as in my C answer, but here I create a team of threads by hand rather than using OpenMP.

We divide the problem into ranges of numbers, so that we don't have to touch the low-order digits each iteration.

The workers are arranged in a circular chain, and each is responsible for a subrange. We do all our addition while other threads are writing, then take a turn at writing. In order to write an exact number of pages with each write, there's generally one write that straddles two workers, so we combine writes using writev(), temporarily blocking the other worker from beginning its arithmetic. The timing diagram looks like this:

 +--------------------------------------------+
 | |
 | wait |
-------> |
 | write (end of prev, start of this) | +--------------------------------------------+
<------- | | |
 | write (just this one) | | wait |
 | --------------------> |
 | wait | | write (end of prev, start of this) |
 | <-------------------- |
 | update each number | | write (just this one) |
 | | | ------+----->
 +--------------------------------------------+ | wait |
 | <------------
 | update each number |
 | |
 +--------------------------------------------+

On my workstation (8 thread Ivybridge), I measured at around 90% the throughput of cat /dev/zero, or about 3⁄4 that of dd if=/dev/zero bs=64K.

#include <cassert>
#include <condition_variable>
#include <iostream>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>
#include <unistd.h>
#include <sys/sysinfo.h>
#include <sys/uio.h>
// Select a number of decimal digits to use. If we produce one billion
// numbers per second, then we'll finish all the 18-digit numbers in
// just 30 years. 24 digits should suffice until the next geological
// epoch, at least.
static constexpr int numlen = 25; // 24 digits plus newline
static constexpr int write_size = 0x10000; // this is fastest on my system
static_assert('0' - '\n' > 1, "Character coding incompatible with the arithmetic");
#define unlikely(e) __builtin_expect((e), 0)
struct worker
{
 // Storage for the character string we maintain
 std::string lines{""};
 // Iterators to each newline in 'lines'
 std::vector<std::string::iterator> nl{};
 int units_offset{}; // significant figures in the step
 char digit{}; // the single non-zero digit of step
 // We coordinate with the next and previous threads in the ring.
 // The iovec is used for writing a block of lines that straddles
 // the previous thread and this one.
 std::mutex mutex{};
 std::condition_variable cv{};
 struct iovec iov[2] = { { 0, 0 }, { 0, 0 } };
 worker *next{};
 // The functions
 worker(std::size_t first, std::size_t count, std::size_t step);
 worker(const worker&) = delete;
 worker& operator=(const worker&) = delete;
 void loop();
};
static constexpr auto buf_len(std::size_t digits, std::size_t count)
{
 // each group of 15 lines has 8 numbers and 39 chars of Fizz and Buzz
 return (8 * digits + 39) * count / 15;
}
static constexpr std::size_t optimal_step(long thread_count)
{
 // We need each thread to produce at least write_size each iteration
 for (std::size_t tens = 1000; tens < 10'000'000; tens *= 10) {
 auto const digits = std::snprintf(0, 0, "%zu", tens);
 for (std::size_t digit = 3; digit < 10; digit += 3) {
 if (buf_len(digits, digit * tens) / thread_count > write_size) {
 return digit * tens;
 }
 }
 }
 return 9'000'000; // fallback (perhaps we should limit thread count?)
}
int main()
{
 // How many threads will we have?
 auto const nprocs = get_nprocs();
 auto const step = optimal_step(nprocs);
 // Write output the slow way, until we have enough digits for the
 // format strings.
 auto n = 1;
 const auto step_len = std::to_string(step).size();
 // Finish the loop just before a FizzBuzz, so that the lines buffer
 // doesn't start with a number (we need a newline preceding for it
 // to overflow correctly).
 for (; std::to_string(n).size() <= step_len; ++n) {
 if ((n % 15) == 0) {
 std::cout << "FizzBuzz\n";
 } else if ((n % 5) == 0) {
 std::cout << "Buzz\n";
 } else if ((n % 3) == 0) {
 std::cout << "Fizz\n";
 } else {
 std::cout << n << '\n';
 }
 }
 std::cout.flush(); // use Unix write() from here on
 // create the workers
 auto workers = std::vector<std::unique_ptr<worker>>{};
 workers.reserve(nprocs);
 for (int i = 0; i < nprocs; ++i) {
 auto const a = i * step / nprocs;
 auto const z = (i+1) * step / nprocs;
 workers.emplace_back(std::make_unique<worker>(n + a, z - a, step));
 }
 // and connect them in a loop
 workers.back()->next = workers.front().get();
 for (int i = 1; i < nprocs; ++i) {
 workers[i-1]->next = workers[i].get();
 }
 // a thread for each worker
 auto threads = std::vector<std::unique_ptr<std::thread>>{};
 threads.reserve(nprocs);
 auto const loop = [](worker *w) { w->loop(); };
 for (auto const& w: workers) {
 threads.emplace_back(std::make_unique<std::thread>(loop, w.get()));
 }
 // start them writing
 auto &first = workers.front();
 {
 std::unique_lock lock{first->mutex};
 first->iov[0] = { &first->lines.front(), 0 };
 }
 first->cv.notify_one();
 threads.front()->join();
}
worker::worker(std::size_t first, std::size_t count, std::size_t step)
{
 auto s = std::to_string(step);
 units_offset = s.size() + 1;
 digit = s.front() - '0';
 lines.reserve(buf_len(numlen, count) + 1);
 nl.reserve(count);
 assert(lines.capacity() >= write_size);
 for (auto n = first; n < first + count; ++n) {
 if ((n % 15) == 0) {
 lines += "FizzBuzz\n";
 } else if ((n % 5) == 0) {
 lines += "Buzz\n";
 } else if ((n % 3) == 0) {
 lines += "Fizz\n";
 } else {
 lines += std::to_string(n) + '\n';
 }
 nl.push_back(lines.end());
 }
 assert(lines.size() >= write_size);
 // Second half of a straddle write is always from the beginning of our buffer.
 iov[1].iov_base = &lines.front();
}
void worker::loop()
{
 assert(next);
 for (;;) {
 {
 // We can write when the previous thread passes its buffer offcut.
 std::unique_lock lock{mutex};
 cv.wait(lock, [this]{ return iov[0].iov_base; });
 iov[1].iov_len = write_size - iov[0].iov_len;
 assert(iov[1].iov_len < lines.size());
 writev(1, iov, 2);
 // Tell the previous thread that we've finished with its buffer.
 iov[0].iov_base = 0;
 }
 cv.notify_one();
 auto p = lines.begin() + iov[1].iov_len;
 while (unlikely(p + write_size < lines.end())) {
 ::write(1, &*p, write_size);
 p += write_size;
 }
 {
 // Tell the next thread that we have leftover data for it to write.
 std::unique_lock lock{next->mutex};
 next->iov[0] = { &*p, static_cast<std::size_t>(lines.end() - p) };
 }
 next->cv.notify_one();
 {
 // Now wait until next worker has written, and released buffer back to us.
 std::unique_lock lock{next->mutex};
 next->cv.wait(lock, [this]{ return !next->iov[0].iov_base; });
 }
 // Update the numbers in the buffer.
 auto rollover = 0u;
 for (auto const& i: nl) {
 if (i[-2] == 'z') {
 // fizz and/or buzz - not a number
 continue;
 }
 // else add 'step' to the number
 auto p = i - units_offset;
 *p += digit;
 while (*p > '9') {
 *p-- -= 10;
 ++*p;
 }
 if (unlikely(*p == '\n'+1)) {
 // digit rollover
 ++rollover;
 }
 }
 if (unlikely(rollover)) {
 // Add a leading one to each overflowing number.
 auto nlp = nl.end();
 auto p = lines.end();
 assert(lines.size() + rollover < lines.capacity());
 lines.resize(lines.size() + rollover);
 auto dest = lines.end();
 while (--p < --dest) {
 if (*p == '\n'+1) {
 --*p;
 *dest-- = '1';
 }
 if (*p == '\n') {
 *nlp-- = dest + 1;
 }
 *dest = *p;
 }
 }
 }
}

Compile using g++-10 -std=c++2a -Wall -Wextra -Weffc++ -DNDEBUG -O3 -march=native -fno-exceptions -lpthread. No special arguments or environment are needed when running.

Question 13

Neil + Kamila generates fizzbuzz About 50% faster than the kernel generates 0's on my machine (9 vs 6 gib/s, approx), so I wouldn't consider /dev/zero an upper limit. An interesting reference point, maybe. In any case, tried running your new code - optimal when I fake nproc to be 4 - but it just yields results as fast as your previous submission with only 1 core.

Question 14

That's a shame - unmodified, it runs about 50% faster than my other one here. I really expected to come out ahead on your system too.

Question 15

I've now spent far longer than I should on this. Time to bow out gracefully.

Question 16

C++

I’m not sure how you did the benchmarking but would appreciate it if you can run my naïve solution through it.

PS, sorry for being late for the party.

#include <vector>
#include <thread>
#include <iostream>
#include <unistd.h>
class worker
{
public:
 worker(size_t size)
 {
 workersize = size;
 buf = new char[workersize * 40];
 }
 ~worker()
 {
 delete buf;
 }
 void run()
 {
 char thisMod = 0;
 char ibMod;
 char ib[20];
 size_t ibStart;
 size_t s;
 size_t last_ib = -100; 
 char* ptr = buf;
 for (size_t i = start; i < end; i++)
 {
 size_t m3 = i % 3;
 size_t m5 = i % 5;
 if (m3 == 0 && m5 == 0)
 {
 *ptr++ = 'F';
 *ptr++ = 'i';
 *ptr++ = 'z';
 *ptr++ = 'z';
 *ptr++ = 'B';
 *ptr++ = 'u';
 *ptr++ = 'z';
 *ptr++ = 'z';
 }
 else if (m3 == 0)
 {
 *ptr++ = 'F';
 *ptr++ = 'i';
 *ptr++ = 'z';
 *ptr++ = 'z';
 }
 else if (m5 == 0)
 {
 *ptr++ = 'B';
 *ptr++ = 'u';
 *ptr++ = 'z';
 *ptr++ = 'z';
 }
 else
 {
 if (i - last_ib < 10 && ibMod + thisMod < 10)
 {
 // uses previous ib
 for (s = ibStart; s <= 14; s++)
 *ptr++ = ib[s];
 *ptr++ = ib[s] + thisMod - ibMod;
 }
 else
 {
 // calc new ib
 size_t x = i;
 size_t s = 15;
 ibMod = x % 10;
 ib[s--] = ibMod + '0';
 x /= 10;
 while (x > 0)
 {
 ib[s--] = (x % 10) + '0';
 x /= 10;
 }
 ibStart = ++s;
 for (ibStart; s <= 15; s++)
 *ptr++ = ib[s];
 last_ib = i;
 }
 }
 *ptr++ = '\n';
 thisMod++;
 if (thisMod > 9)
 thisMod = 0;
 }
 buflen = ptr - buf;
 }
 size_t start;
 size_t end;
 size_t workersize;
 char* buf;
 size_t buflen;
};
void task(worker* w)
{
 w->run();
}
int main() 
{
 size_t workercount = std::thread::hardware_concurrency();
 size_t workersize = 10000000;
 // create workers
 std::vector<worker*> workers;
 for (size_t i = 0; i < workercount; i++)
 workers.push_back(new worker(workersize));
 
 // main loop
 size_t cur = 0;
 for (;;)
 {
 // init workers
 for (worker* worker : workers)
 {
 worker->start = cur;
 cur += workersize;
 worker->end = cur;
 }
 std::vector<std::thread> threads;
 for (worker* worker : workers)
 threads.emplace_back(task, worker);
 for (std::thread& thread : threads)
 thread.join();
 // write output
 for (worker* worker : workers)
 write(1, worker->buf, worker->buflen);
 // write progress to cerr
 std::cerr << cur << "\n";
 }
}

Question 17

Not bad. Seems to fluctuate between 4GiB/s to 5.5GiB/s. Compiled with g++ a.cpp -lpthread -o Johan -O3. When I run it all my 32 cores are about 20-40% busy

Question 18

Using clang++ a.cpp -lpthread -o Johanclang -O3 didn't make much of a difference

Question 19

@OmerTuchfeld, thank you very much for testing!! Can you perhaps run my latest code when you get a chance?

Question 20

No noticeable difference in throughput

Question 21

@OmerTuchfeld Oh well, thanks anyway! I think that is as far as I can push it. I got an average of about 4 GB/s on my 6 core i7-9850H with the following TIO and was hoping that running it on your 32-core monster would have made a big difference.

Question 22

There's plenty room for improvement here and I'll try to update when I can. (Most of the optimizations above will also be applicable):

A simple Powershell solution:

$i = 1
#While ($i -gt 0){
While ($i -lt 100000){ # Limiting to 100k for testing
if ((($i % 5) -eq 0) -and (($i % 3) -eq 0)){Write-Host "FizzBuzz"}
elseif (($i % 3) -eq 0) {Write-Host "Fizz"}
elseif (((($i -split "" | Select-Object -SkipLast 1) | 
 Select-Object -Last 1) - eq 0) -or ((($i -split "" | 
 Select-Object -SkipLast 1) | Select-Object -Last 1) -eq 5)){Write-Host "Buzz"}
else {Write-Host $i}
$i++}

Question 23

Tried running it with podman run -it --rm --name powershell --volume $PWD:/root mcr.microsoft.com/powershell /usr/bin/pwsh /root/fizz.ps1 without success:

Line | 7 | Select-Object -Last 1) - eq 0) -or ((($i -split "" | | ~ | You must provide a value expression following the '-' operator.

Question 24

I'm surprised there is no JavaScript answer here.
But when I try to write a solution, I think I understand why.
JavaScript's score are pretty bad, and the very uncertainty.
But anyway, I want to release my solution first.

It's AMD Ryzen 5 3400G (2.6GHz) using Node v16.13.1 on Ubuntu 18.04.5, Linux 5.6.19.

main.js

let buffer = "";
for (i = 0; i < 1e7; i += 15) {
 buffer += `${i+1}\n${i+2}\nfizz\n${i+4}\nbuzz\nfizz\n${i+7}\n${i+8}\nfizz\nbuzz\n${i+11}\nfizz\n${i+13}\n${i+14}\nfizzbuzz\n`;
 if(buffer.length > 40000) {
 process.stdout.write(buffer);
 buffer = "";
 }
}
process.stdout.write(buffer);

Because of the high uncertainty, I will test 10 times in a row.

run.sh

#!/bin/bash
for ((i=1; i<=10; i++)); do
 node main.js | pv -ab > /dev/null
done

And the result is:

64.9MiB [47.8MiB/s]
64.9MiB [ 150MiB/s]
64.9MiB [ 150MiB/s]
64.9MiB [ 149MiB/s]
64.9MiB [ 151MiB/s]
64.9MiB [ 149MiB/s]
64.9MiB [ 150MiB/s]
64.9MiB [ 153MiB/s]
64.9MiB [78.8MiB/s]
64.9MiB [ 149MiB/s]

Question 25

Thank you for your submission. Submissions must go at least as high as 2^63 to be considered correct

Question 26

Node JS (v18.1.0)

let i = 1;
let str = '';
const resume = () => {
 while (true) {
 if (i % (2 << 11) === 0) {
 const success = process.stdout.write(str);
 str = '';
 if (!success) {
 process.stdout.once('drain', resume);
 break;
 }
 }
 if (i % 3 === 0 && i % 5 === 0) {
 str += 'FizzBuzz\n'
 } else if (i % 3 === 0) {
 str += 'Fizz\n'
 } else if (i % 5 === 0) {
 str += 'Buzz\n'
 } else {
 str += i + '\n';
 }
 i++;
 }
}
resume();

The command node main.js | pv > /dev/null gave me about 150MiB/s (for comparison the example shown in the OP ran at about 200 MiB/s on my machine, an Intel i7-9700K)

Question 27

Welcome to Code Golf!

Question 28

Python

Here is my humble attempt. It seems to out-perform the other python alternatives on my machine. I'm getting a sustained ~=330-367 MiB/s.

Run with python3, and tested with pypy3 (which seems to be little bit slower 320-360 MiB/s).

from sys import stdout
from io import StringIO
fz = "Fizz"
bz = "Buzz"
nl = "\n"
def fizzbuzz():
 num = 1
 while True:
 nf = True
 if num % 3 == 0:
 nf = False
 yield fz
 if num % 5 == 0:
 nf = False
 yield bz
 if nf:
 yield str(num)
 yield nl
 num += 1
output = StringIO()
for a in fizzbuzz():
 output.write(a)
 stdout.write(output.getvalue())
 output.truncate(0)

It uses a StringIO object which, as far as I understand, is implemented in a more efficient way, together with stdout.write, which again, seems to perform better than print and os.write.

I'm also storing the string-parts as variables. It seems to perform better, but that might be just placebo.

I have made a few variations on this with bulking, and writing more numbers/fizzbuzz to the output-parameter before writing to stdout, but it only seems to slow it down. I'm not sure why that is, but perhaps I get more lucky with L1/L2 cache with the current approach.

Question 29

I see @janfrode 's answer now which, on my machine, runs at a similar speed, nice job!

Question 30

Just normal Go

package main
import (
 "bufio"
 "fmt"
 "os"
)
var writer *bufio.Writer = bufio.NewWriter(os.Stdout)
func printf(f string, a ...interface{}) { fmt.Fprintf(writer, f, a...) }
func main() {
 defer writer.Flush()
 for i := 1; i < 1000000000; i++ {
 if (i % 3 == 0) && (i % 5 == 0) {
 printf("FizzBuzz\n")
 } else if i % 3 == 0 {
 printf("Fizz\n")
 } else if i % 5 == 0 {
 printf("Buzz\n")
 } else {
 printf("%d\n",i)
 }
 }
}

Unrolled output version:

package main
import (
 "bufio"
 "fmt"
 "os"
)
var writer *bufio.Writer = bufio.NewWriter(os.Stdout)
// or: bufio.NewWriterSize(os.Stdout, 400 * 1024 * 1024)
func printf(f string, a ...interface{}) { fmt.Fprintf(writer, f, a...) }
func main() {
 const FMT = "%d\n%d\nFizz\n%d\nBuzz\nFizz\n%d\n%d\nFizz\nBuzz\n%d\nFizz\n%d\n%d\nFizzBuzz"
 defer writer.Flush()
 for i := 1; i < 1000000000; i += 15 {
 printf(FMT, i, i+1, i+3, i+6, i+7, i+10, i+12, i+13)
 }
}

Unbuffered output version:

package main
import "fmt"
func main() {
 for i := 1; i < 1000000000; i++ {
 if (i % 3 == 0) && (i % 5 == 0) {
 fmt.Println("FizzBuzz")
 } else if i % 3 == 0 {
 fmt.Println("Fizz")
 } else if i % 5 == 0 {
 fmt.Println("Buzz")
 } else {
 fmt.Printf("%d\n",i)
 }
 }
}

On my machine:

gcc version 11.3.0 (Ubuntu 11.3.0-1ubuntu1~22.04.1)
gcc a.c && ./a.out | pv > /dev/null
go version go1.20.5 linux/amd64
go build a.go && ./a | pv > /dev/null

C example: 7.33GiB 0:00:26 [ 282MiB/s]
Go: 7.33GiB 0:00:56 [ 133MiB/s]
Go (400MB buffer): 7.33GiB 0:00:49 [ 151MiB/s]
Unrolled Go: 7.27GiB 0:00:32 [ 229MiB/s]
Unrolled Go (400MB buffer): 7.27GiB 0:00:25 [ 291MiB/s]
unbuffered Go: 7.33GiB 0:05:50 [21.4MiB/s]

Question 31

This probably won't be the fastest, but I am curious how it performs compared to a naive stdio-based fizzbuzz. I created this to illustrate the approach I suggested in the comments of using an array of fixed-length regions to format the output, and filtering out null padding before printing the output. I didn't do any multithreading, either, this could probably be improved by using multiple buffers and doing the writing on another thread.

This code does nothing to prevent unusual output after reaching ULONG_MAX. It might be able to be made faster by only supporting up to UINT_MAX, which would also allow more data to fit in a given size of intermediate buffer.

// sample outputs [space is null]
// 1\n
// Fizz \n
// Buzz \n
// FizzBuzz \n
// 18446744073709551614\n ULLONG_MAX-1 [ULLONG_MAX is FizzBuzz]
#define LEN 21
//#define NUM (200*15)
#define NUM (200*15)
#if defined(_POSIX_C_SOURCE)
#include <unistd.h>
#define WRITE(b,l) write(1,b,l)
#elif defined(_MSC_VER)
#include <io.h>
#define WRITE(b,l) write(1,b,l)
#else
#include <stdio.h>
#define WRITE(b,l) fwrite(b,1,l,stdout)
#endif
char buf[NUM*LEN] = {0}; // 63000
char buf2[NUM*LEN];
int flags[NUM];
int main() {
 // pre-populate buffer
 for(int i=0; i<NUM; i++) {
 char *t = buf+i*LEN;
 int f = flags[i] = i%3==0 | ((i%5==0)<<1);
 if(f&1) { t[0]='F'; t[1]='i'; t[2]='z'; t[3]='z';}
 if(f&2) { t[4]='B'; t[5]='u'; t[6]='z'; t[7]='z';}
 t[LEN-1] = '\n';
 }
 int j=1; // position of current output within buffer
 for(unsigned long long i=1;;i++) {
 if(!flags[j]) {
 char *t = buf+j*LEN+(LEN-2); // position of last digit
 unsigned long long k = i;
 while(k) {
 *t-- = '0'+k%10;
 k/=10;
 }
 }
 if(++j == NUM) {
 j=0;
 char *p=buf;
 char *q=buf2;
 if(i == NUM-1) p+=LEN; // skip zero, there's probably a better way
 while(p < buf+NUM*LEN) {
 if(*p) *q++=*p;
 p++;
 }
 WRITE(buf2, q-buf2);
 }
 }
}

Question 32

I'm getting 0.5GiB/s

Question 33

yeah, once i posted it i realized that per-character branch inside the while loop is probably terrible for the performance, i'd been hoping it was optimized away to a conditional move or something - i can't find a way to get rid of it either. Ah well, it's at least useful as evidence this approach isn't worthwhile.

Question 34

C99 (gcc)

#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdbool.h>
#define SIZE (1 << 16)
int main(void) {
 
 char buffer[SIZE] = {};
 char *buf = buffer;
 unsigned long long int prev_n_3=0; 
 unsigned long long int prev_n_5=0; 
 for (unsigned long long int n=3; 3; n++) { 
 if ((buf - buffer) >= (SIZE-21)) {
 write(1, buffer, buf-(buffer+1));
 buf = buffer;
 }
 if (n-prev_n_3==3) {
 buf[0] = 'F';
 buf[1] = 'i';
 buf[2] = 'z';
 buf[3] = 'z';
 buf += 4; 
 prev_n_3=n;
 }
 if (n-prev_n_5==5) {
 buf[0] = 'B';
 buf[1] = 'u';
 buf[2] ='z';
 buf[3] = 'z';
 buf += 4;
 prev_n_5=n;
 }
 
 if (!(prev_n_5==n) && !(prev_n_3==n)) {
 buf += sprintf(buf,"%llu",n);
 }
 *buf='\n';
 buf++;
 }
 return 0;
}

About 140 MiBs by my measurement. I measured using gcc -flto -Ofast, on a 2015 MacBook Air running Big Sur. I deleted my previous answer a while ago. This is a partial rewrite.

Question 35

Outputs junk 0円

Question 36

(Around 12322)

Question 37

Fixed by using printf. Thanks!

Question 38

Fails pretty early (int overflow at ~2 billion)

Question 39

That's still very early, see the "very high astronomical number" requirement

Question 40

PHP (simple)

<?php
ob_start(null, 32000);
for ($i = 0; $i < PHP_INT_MAX; $i++) {
 $out = $i;
 if ($i % 3 === 0) $out = 'fizz';
 if ($i % 5 === 0) {
 $out = 'buzz';
 if ($i % 3 === 0) $out = 'fizzbuzz';
 }
 echo "$out\n";
}

PHP (optimized)

<?php
ob_start(null, 32000);
$a = [0, 1, 'fizz', 3, 'buzz', 'fizz',
 6, 7, 'fizz', 'buzz', 10, 'fizz',
 12, 13, 'fizzbuzz', ''];
for ($i = 1; $i < PHP_INT_MAX-15; $i+=15) {
 $a[0] = $i;
 $a[1] = $i+1;
 $a[3] = $i+3;
 $a[6] = $i+6;
 $a[7] = $i+7;
 $a[10] = $i+10;
 $a[12] = $i+12;
 $a[13] = $i+13;
 echo join("\n", $a);
}

PHP (multithreaded)

<?php
$nWorkers = preg_match_all('/^processor\s/m', file_get_contents('/proc/cpuinfo'), $discard);
$iterationsPerFlush = 20000;
define('CHILD_TOKEN_KEY', 0);
//1 byte SHM segment holding the worker ID whose turn it is to write 
//initialise to child 0.
$shm = shm_attach(1);
shm_put_var($shm, CHILD_TOKEN_KEY, 0);
//Semaphore to protect access to the SHM var
$key = ftok(__FILE__, 1);
$sem = sem_get($key);
$childPids = [];
pcntl_sigprocmask(SIG_BLOCK, [SIGTERM, SIGINT, SIGCHLD]);
for ($i=0; $i<$nWorkers; $i++){
 if (!$pid = pcntl_fork()){
 doWork($sem, $shm, $i, $nWorkers, $iterationsPerFlush);
 exit;
 }else{
 $childPids[] = $pid;
 }
}
//If we get killed or one of our children does, clean everything up
pcntl_sigwaitinfo([SIGTERM, SIGINT, SIGCHLD], $info);
foreach ($childPids as $pid){
 posix_kill($pid, SIGKILL);
}
sem_remove($sem);
shm_remove($shm);
function doWork($sem, $shm, $childId, $nWorkers, $iterationsPerFlush){
 $nextChildId = ($childId + 1) % $nWorkers;
 $increment = ($nWorkers - 1) * $iterationsPerFlush * 15;
 $startOffset = $childId * $iterationsPerFlush * 15;
 $a = [-14 + $startOffset, -13 + $startOffset, "fizz", -11 + $startOffset, "buzz\nfizz",
 -8 + $startOffset, -7 + $startOffset, "fizz\nbuzz", -4 + $startOffset, "fizz",
 -2 + $startOffset, -1 + $startOffset, "fizzbuzz\n"];
 ob_start();
 while (true){
 for ($i = 0; $i < $iterationsPerFlush; $i++){
 $a[0] += 15; 
 $a[1] += 15; 
 $a[3] += 15; 
 $a[5] += 15; 
 $a[6] += 15; 
 $a[8] += 15; 
 $a[10] += 15; 
 $a[11] += 15; 
 echo join("\n", $a);
 }
 //Acquire the semaphore and check if it's our turn to flush. If not, repeat.
 while (true){
 sem_acquire($sem);
 if (shm_get_var($shm, CHILD_TOKEN_KEY) === $childId){
 break;
 }
 sem_release($sem);
 }
 ob_flush();
 //Pass the token on to the next child and release the semaphore
 shm_put_var($shm, CHILD_TOKEN_KEY, $nextChildId);
 sem_release($sem);
 //Skip over the numbers the other workers already handled
 $a[0] += $increment; 
 $a[1] += $increment; 
 $a[3] += $increment; 
 $a[5] += $increment; 
 $a[6] += $increment; 
 $a[8] += $increment;
 $a[10] += $increment; 
 $a[11] += $increment; 
 }
}

The multithreaded version is the result of this thread on Reddit:

https://old.reddit.com/r/PHP/comments/v94ly2/php_fizzbuzz/

It was mostly done by therealgaxbo.

Question 41

Ends too quickly, see rules: "The output must go on forever (or a very high astronomical number)"

Question 42

What is the definition of "very high astronomical number"? I think it would be cool to have this simple approach in the race and see if PHP already outperforms the other interpreted languages like Python and Ruby, even if they use more optimized approaches. This would limit the loop to PHP_INT_MAX which is 9223372036854775807. Would that be sufficient?

Question 43

"What is the definition of "very high astronomical number"?" - I guess I would consider numbers 2^63 (PHP_INT_MAX) and higher astronomical. It would take the best submission here more than a hundred years to get to that number (357823770363079921209 digits between 1 and 2^63 with about 40% removed for fizzing and buzzing + 9223372036854775808 newlines is about 194 exabytes of characters at 50GiB/s). By the way PHP_INT_MAX is only 4 billion on 32-bit machines, but I guess it's fair to assume my machine / PHP executable is 64-bit. I'm getting 130MiB/s with your submission by the way

Question 44

Ah, 130MiB/s is not that much. Then let's go for an optimized version. I just added one. On my laptop, it runs 5x faster than the simple one.

Question 45

I see you added the results of the optimized version. Great. I now also added a multithreaded version. So far, it seems to scale about linearly with the number of cores.

Question 46

`awk`

If modulo (%) ops are considered high cost, then here's one way to do handle all output scenarios with just 1 modulo, and 1 single copy of combined string.

jot 36 | 
awk 'function ____(__,_,___){return(_^=_<_)-(__=int(___=__?__:$_)^++_^_++%(_++*++_))\
 ?substr("Fizz Buzz",++_^(_--<__),_*!__+--_):___}$NF=____()' ORS=', '

1, 2, Fizz, 4, Buzz, Fizz, 7, 8, Fizz, Buzz, 11, Fizz, 
13, 14, Fizz Buzz, 16, 17, Fizz, 19, Buzz, Fizz, 22, 23, Fizz,
Buzz, 26, Fizz, 28, 29, Fizz Buzz, 31, 32, Fizz, 34, Buzz, Fizz,

But if I stupidly hard-encode every scenario into an array,

( time ( mawk2 ' BEGIN {
 ______ = 16
 1ドル = (4ドル = 7ドル = 10ドル = 13ドル = "Fizz")" "(6ドル = 11ドル = "Buzz")
 for (_____ ^= _ = (ORS = ", ")*(___ = (_+=++_)^++_^_); _++ < ___;)
 $(_____ ^= ++_____ < ______) ? $_____ : _
 printf("%d\n", ___) }' ) )

( mawk2 ; ) 7.07s user 0.03s system 99% cpu 7.102 total
 1 134217728
 echo '1148700297/7.102/32^4' | bc -l 
 154.25034785935389293948

Cycling through 2^0 - 2^27, mawk2 had an internal throughput rate of 154.25 MiB/sec, which is quite insane for a single-threaded shell scripting language invented 47 years ago. In this variant, all the thresholds appear to be upshifted by one, and that's by design, since I'm using exponentiation to handle mod 15.

0xDEADFED5 212 bronze badges · Answer 1 · 2023-07-04 09:09:12Z

this is inspired by the previous Rust entry by aiden4 at https://codegolf.stackexchange.com/a/217455/118590

it's a threaded version of aiden4's code, with the macros replaced by inline functions, because i haven't learned macros yet =)

this will utilize all logical CPUs up to MAX_THREADS, and uses one thread to handle synchronization.

build with cargo build --release and run with ./target/release/fizz_buzz_threaded

/.cargo/config.toml (please adjust as necessary):

[target.x86_64-unknown-linux-gnu]
rustflags = ["-Ctarget-cpu=native"]
[target.x86_64-pc-windows-msvc]
rustflags = ["-Ctarget-cpu=native"]
[target.x86_64-pc-windows-gnu]
rustflags = ["-Ctarget-cpu=native"]

main.rs:

use std::io::*;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
extern crate num_cpus;
const FIZZ: *const u8 = "Fizz\n".as_ptr();
const BUZZ: *const u8 = "Buzz\n".as_ptr();
const FIZZBUZZ: *const u8 = "FizzBuzz\n".as_ptr();
const MAX_NUM: usize = 90_000; // results to cache before writing output
const MAX_THREADS: usize = 256;
const BUF_SIZE: usize = MAX_NUM * 207; // i think max size for one round of fizz buzz ANSI is 206 bytes?
const UATOMIC: AtomicUsize = AtomicUsize::new(0);
static mut BUFS: [[u8; BUF_SIZE]; MAX_THREADS] = [[0u8; BUF_SIZE]; MAX_THREADS];
static mut BUF_COUNTS: [usize; MAX_THREADS] = [0usize; MAX_THREADS];
static mut LOCKS: [AtomicUsize; MAX_THREADS] = [UATOMIC; MAX_THREADS];
unsafe fn sync_threads(threads: usize) {
 let mut synced = false;
 let mut val: usize;
 let mut out = stdout();
 loop {
 for x in 0..threads {
 val = LOCKS[x].load(Ordering::SeqCst);
 if val == 1 {
 synced = true;
 } else {
 synced = false;
 break;
 }
 }
 if synced {
 for x in 0..threads {
 let res = out.write_all(BUFS[x].get_unchecked(..BUF_COUNTS[x]));
 if res.is_err() {
 println!("{:?}", res.err());
 }
 LOCKS[x].store(0, Ordering::SeqCst);
 }
 }
 }
}
unsafe fn thread(id: usize, mut num: usize, worker_threads: usize) {
 // alas, i haven't learned macros yet
 unsafe fn itoap_write(id: usize, buf_count: &mut usize, num: usize) {
 *buf_count +=
 itoap::write_to_ptr(BUFS[id].get_unchecked_mut(*buf_count..).as_mut_ptr(), num);
 BUFS[id].as_mut_ptr().add(*buf_count).write(b'\n');
 *buf_count += 1;
 }
 unsafe fn str_write(id: usize, buf_count: &mut usize, string: *const u8, len: usize) {
 let ptr = BUFS[id].get_unchecked_mut(*buf_count..).as_mut_ptr();
 ptr.copy_from_nonoverlapping(string, len);
 *buf_count += len;
 }
 let mut val: usize;
 let mut count: usize = 0;
 num += 1;
 let number_skip = (worker_threads - 1) * MAX_NUM;
 loop {
 val = LOCKS[id].load(Ordering::SeqCst);
 if val == 0 {
 if count > MAX_NUM - 15 {
 count = 0;
 LOCKS[id].store(1, Ordering::SeqCst);
 loop {
 val = LOCKS[id].load(Ordering::SeqCst);
 if val == 0 {
 BUF_COUNTS[id] = 0;
 num += number_skip;
 break;
 }
 }
 continue;
 }
 itoap_write(id, &mut BUF_COUNTS[id], num);
 num += 1;
 itoap_write(id, &mut BUF_COUNTS[id], num);
 str_write(id, &mut BUF_COUNTS[id], FIZZ, 5);
 num += 2;
 itoap_write(id, &mut BUF_COUNTS[id], num);
 str_write(id, &mut BUF_COUNTS[id], BUZZ, 5);
 str_write(id, &mut BUF_COUNTS[id], FIZZ, 5);
 num += 3;
 itoap_write(id, &mut BUF_COUNTS[id], num);
 num += 1;
 itoap_write(id, &mut BUF_COUNTS[id], num);
 str_write(id, &mut BUF_COUNTS[id], FIZZ, 5);
 str_write(id, &mut BUF_COUNTS[id], BUZZ, 5);
 num += 3;
 itoap_write(id, &mut BUF_COUNTS[id], num);
 str_write(id, &mut BUF_COUNTS[id], FIZZ, 5);
 num += 2;
 itoap_write(id, &mut BUF_COUNTS[id], num);
 num += 1;
 itoap_write(id, &mut BUF_COUNTS[id], num);
 str_write(id, &mut BUF_COUNTS[id], FIZZBUZZ, 9);
 num += 2;
 count += 15;
 }
 }
}
fn main() {
 let max_threads = num_cpus::get();
 if max_threads < 2 {
 panic!("multi-cpu only!")
 }
 let mut thread_ids: Vec<usize> = Vec::with_capacity(max_threads);
 for x in 0..max_threads {
 thread_ids.push(x);
 }
 thread::scope(|s| {
 for x in thread_ids.iter_mut() {
 if *x == max_threads - 1 {
 let builder = thread::Builder::new();
 let res = builder.spawn_scoped(s, || unsafe { sync_threads(max_threads - 1) });
 if res.is_err() {
 println!("Error spinning up sync thread: {}", res.err().unwrap());
 }
 } else {
 let builder = thread::Builder::new();
 let res = builder
 .spawn_scoped(s, || unsafe { thread(*x, *x * MAX_NUM, max_threads - 1) });
 if res.is_err() {
 println!("Error spinning up worker thread: {}", res.err().unwrap());
 }
 }
 }
 });
}

Cargo.toml:

[package]
name = "fizz_buzz_threaded"
version = "0.1.0"
authors = ["0xDEADFED5","aiden4"]
edition = "2021"
[dependencies]
num_cpus = "1.15.0"
itoap = "0.1"
[profile.release]
lto = "fat"

also available at https://github.com/0xDEADFED5/fizz_buzz_threaded

It segfaults unless I adjust MAX_THREADS. 32 works. 64 works. 96 works. 231 to 256 segfaults at runtime. If I set it to anything between 116 and 230 inclusive the compiler itself crashes! Anyway the throughput seems to be 4.3GB/s-ish. I'll try to find some time to add it to the tables since it seems to be the fastest
i did fix a bug in the code shortly after posting. was hoping it would be faster on your machine since for some strange reason it runs faster on my laptop than that beautiful AVX2 code.
My comment was referring to the latest code on GitHub and seems to still be relevant. I will hold back from including this submission in the scores table until those issues are resolved

Dmitry 1211 bronze badge · Answer 2 · 2023-12-29 03:16:09Z

Ruby

Here's some refinements based on the initial solution by @lonelyelk. Tested using the now latest Ruby 3.3. His original five-liner gives ~70 MiB/s on my machine.

Step 1: Drop the each_slice iteration, using a simple loop:

# frozen_string_literal: true
FMT = "%d\n%d\nFizz\n%d\nBuzz\nFizz\n%d\n%d\nFizz\nBuzz\n%d\nFizz\n%d\n%d\nFizzBuzz\n"
i = 1
loop do
 printf(FMT, i, i + 1, i + 3, i + 6, i + 7, i + 10, i + 12, i + 13)
 i += 15
end

The result runs at 135 MiB/s, almost twice as fast.

Step 2: Create much longer (by x400) strings each iteration step, using some runtime-generated code.

# frozen_string_literal: true
MULT = 400
FMT = "%d\n%d\nFizz\n%d\nBuzz\nFizz\n%d\n%d\nFizz\nBuzz\n%d\nFizz\n%d\n%d\nFizzBuzz\n" * MULT
idxs = (0..MULT).flat_map do |i|
 [0, 1, 3, 6, 7, 10, 12, 13].map { |j| j + i * 15 }
end
eval <<EOS
 def run_loop
 i = 1
 loop do
 printf(FMT, #{idxs.map { |n| "i + #{n}" }.join(', ') })
 i += 15 * MULT
 end
 end
EOS
run_loop

This gets up to 180-200 MiB/s, overtaking the "naive C".

Finally, parallelize the computation using ractors (16 seems like the sweet spot on my 6-core machine). This chunk of work is big enough that message-passing overhead is not a problem.

# frozen_string_literal: true
MULT = 400
FMT = ("%d\n%d\nFizz\n%d\nBuzz\nFizz\n%d\n%d\nFizz\nBuzz\n%d\nFizz\n%d\n%d\nFizzBuzz\n" * MULT).freeze
WORKERS = 16
idxs = (0..MULT).flat_map do |i|
 [0, 1, 3, 6, 7, 10, 12, 13].map { |j| j + i * 15 }
end
eval <<EOS
 def spawn_worker
 Ractor.new do
 loop do
 i = Ractor.receive
 Ractor.yield format(FMT, #{idxs.map { |n| "i + #{n}" }.join(', ') }).freeze
 end
 end
 end
EOS
workers = (1...WORKERS).map { spawn_worker }
ii = 1
loop do
 workers.each do |worker|
 worker.send(ii)
 ii += 15 * MULT
 end
 workers.each { |worker| puts worker.take }
end

This gives ~400 MiB/s, with fluctuations 380-420.

Toby Speight 6,9941 gold badge30 silver badges43 bronze badges · Answer 3 · 2021-01-22 10:53:18Z

A C++ program for Linux. I use the same method of arithmetic as in my C answer, but here I create a team of threads by hand rather than using OpenMP.

We divide the problem into ranges of numbers, so that we don't have to touch the low-order digits each iteration.

The workers are arranged in a circular chain, and each is responsible for a subrange. We do all our addition while other threads are writing, then take a turn at writing. In order to write an exact number of pages with each write, there's generally one write that straddles two workers, so we combine writes using writev(), temporarily blocking the other worker from beginning its arithmetic. The timing diagram looks like this:

 +--------------------------------------------+
 | |
 | wait |
-------> |
 | write (end of prev, start of this) | +--------------------------------------------+
<------- | | |
 | write (just this one) | | wait |
 | --------------------> |
 | wait | | write (end of prev, start of this) |
 | <-------------------- |
 | update each number | | write (just this one) |
 | | | ------+----->
 +--------------------------------------------+ | wait |
 | <------------
 | update each number |
 | |
 +--------------------------------------------+

On my workstation (8 thread Ivybridge), I measured at around 90% the throughput of cat /dev/zero, or about 3⁄4 that of dd if=/dev/zero bs=64K.

#include <cassert>
#include <condition_variable>
#include <iostream>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>
#include <unistd.h>
#include <sys/sysinfo.h>
#include <sys/uio.h>
// Select a number of decimal digits to use. If we produce one billion
// numbers per second, then we'll finish all the 18-digit numbers in
// just 30 years. 24 digits should suffice until the next geological
// epoch, at least.
static constexpr int numlen = 25; // 24 digits plus newline
static constexpr int write_size = 0x10000; // this is fastest on my system
static_assert('0' - '\n' > 1, "Character coding incompatible with the arithmetic");
#define unlikely(e) __builtin_expect((e), 0)
struct worker
{
 // Storage for the character string we maintain
 std::string lines{""};
 // Iterators to each newline in 'lines'
 std::vector<std::string::iterator> nl{};
 int units_offset{}; // significant figures in the step
 char digit{}; // the single non-zero digit of step
 // We coordinate with the next and previous threads in the ring.
 // The iovec is used for writing a block of lines that straddles
 // the previous thread and this one.
 std::mutex mutex{};
 std::condition_variable cv{};
 struct iovec iov[2] = { { 0, 0 }, { 0, 0 } };
 worker *next{};
 // The functions
 worker(std::size_t first, std::size_t count, std::size_t step);
 worker(const worker&) = delete;
 worker& operator=(const worker&) = delete;
 void loop();
};
static constexpr auto buf_len(std::size_t digits, std::size_t count)
{
 // each group of 15 lines has 8 numbers and 39 chars of Fizz and Buzz
 return (8 * digits + 39) * count / 15;
}
static constexpr std::size_t optimal_step(long thread_count)
{
 // We need each thread to produce at least write_size each iteration
 for (std::size_t tens = 1000; tens < 10'000'000; tens *= 10) {
 auto const digits = std::snprintf(0, 0, "%zu", tens);
 for (std::size_t digit = 3; digit < 10; digit += 3) {
 if (buf_len(digits, digit * tens) / thread_count > write_size) {
 return digit * tens;
 }
 }
 }
 return 9'000'000; // fallback (perhaps we should limit thread count?)
}
int main()
{
 // How many threads will we have?
 auto const nprocs = get_nprocs();
 auto const step = optimal_step(nprocs);
 // Write output the slow way, until we have enough digits for the
 // format strings.
 auto n = 1;
 const auto step_len = std::to_string(step).size();
 // Finish the loop just before a FizzBuzz, so that the lines buffer
 // doesn't start with a number (we need a newline preceding for it
 // to overflow correctly).
 for (; std::to_string(n).size() <= step_len; ++n) {
 if ((n % 15) == 0) {
 std::cout << "FizzBuzz\n";
 } else if ((n % 5) == 0) {
 std::cout << "Buzz\n";
 } else if ((n % 3) == 0) {
 std::cout << "Fizz\n";
 } else {
 std::cout << n << '\n';
 }
 }
 std::cout.flush(); // use Unix write() from here on
 // create the workers
 auto workers = std::vector<std::unique_ptr<worker>>{};
 workers.reserve(nprocs);
 for (int i = 0; i < nprocs; ++i) {
 auto const a = i * step / nprocs;
 auto const z = (i+1) * step / nprocs;
 workers.emplace_back(std::make_unique<worker>(n + a, z - a, step));
 }
 // and connect them in a loop
 workers.back()->next = workers.front().get();
 for (int i = 1; i < nprocs; ++i) {
 workers[i-1]->next = workers[i].get();
 }
 // a thread for each worker
 auto threads = std::vector<std::unique_ptr<std::thread>>{};
 threads.reserve(nprocs);
 auto const loop = [](worker *w) { w->loop(); };
 for (auto const& w: workers) {
 threads.emplace_back(std::make_unique<std::thread>(loop, w.get()));
 }
 // start them writing
 auto &first = workers.front();
 {
 std::unique_lock lock{first->mutex};
 first->iov[0] = { &first->lines.front(), 0 };
 }
 first->cv.notify_one();
 threads.front()->join();
}
worker::worker(std::size_t first, std::size_t count, std::size_t step)
{
 auto s = std::to_string(step);
 units_offset = s.size() + 1;
 digit = s.front() - '0';
 lines.reserve(buf_len(numlen, count) + 1);
 nl.reserve(count);
 assert(lines.capacity() >= write_size);
 for (auto n = first; n < first + count; ++n) {
 if ((n % 15) == 0) {
 lines += "FizzBuzz\n";
 } else if ((n % 5) == 0) {
 lines += "Buzz\n";
 } else if ((n % 3) == 0) {
 lines += "Fizz\n";
 } else {
 lines += std::to_string(n) + '\n';
 }
 nl.push_back(lines.end());
 }
 assert(lines.size() >= write_size);
 // Second half of a straddle write is always from the beginning of our buffer.
 iov[1].iov_base = &lines.front();
}
void worker::loop()
{
 assert(next);
 for (;;) {
 {
 // We can write when the previous thread passes its buffer offcut.
 std::unique_lock lock{mutex};
 cv.wait(lock, [this]{ return iov[0].iov_base; });
 iov[1].iov_len = write_size - iov[0].iov_len;
 assert(iov[1].iov_len < lines.size());
 writev(1, iov, 2);
 // Tell the previous thread that we've finished with its buffer.
 iov[0].iov_base = 0;
 }
 cv.notify_one();
 auto p = lines.begin() + iov[1].iov_len;
 while (unlikely(p + write_size < lines.end())) {
 ::write(1, &*p, write_size);
 p += write_size;
 }
 {
 // Tell the next thread that we have leftover data for it to write.
 std::unique_lock lock{next->mutex};
 next->iov[0] = { &*p, static_cast<std::size_t>(lines.end() - p) };
 }
 next->cv.notify_one();
 {
 // Now wait until next worker has written, and released buffer back to us.
 std::unique_lock lock{next->mutex};
 next->cv.wait(lock, [this]{ return !next->iov[0].iov_base; });
 }
 // Update the numbers in the buffer.
 auto rollover = 0u;
 for (auto const& i: nl) {
 if (i[-2] == 'z') {
 // fizz and/or buzz - not a number
 continue;
 }
 // else add 'step' to the number
 auto p = i - units_offset;
 *p += digit;
 while (*p > '9') {
 *p-- -= 10;
 ++*p;
 }
 if (unlikely(*p == '\n'+1)) {
 // digit rollover
 ++rollover;
 }
 }
 if (unlikely(rollover)) {
 // Add a leading one to each overflowing number.
 auto nlp = nl.end();
 auto p = lines.end();
 assert(lines.size() + rollover < lines.capacity());
 lines.resize(lines.size() + rollover);
 auto dest = lines.end();
 while (--p < --dest) {
 if (*p == '\n'+1) {
 --*p;
 *dest-- = '1';
 }
 if (*p == '\n') {
 *nlp-- = dest + 1;
 }
 *dest = *p;
 }
 }
 }
}

Compile using g++-10 -std=c++2a -Wall -Wextra -Weffc++ -DNDEBUG -O3 -march=native -fno-exceptions -lpthread. No special arguments or environment are needed when running.

Neil + Kamila generates fizzbuzz About 50% faster than the kernel generates 0's on my machine (9 vs 6 gib/s, approx), so I wouldn't consider /dev/zero an upper limit. An interesting reference point, maybe. In any case, tried running your new code - optimal when I fake nproc to be 4 - but it just yields results as fast as your previous submission with only 1 core.
That's a shame - unmodified, it runs about 50% faster than my other one here. I really expected to come out ahead on your system too.
I've now spent far longer than I should on this. Time to bow out gracefully.

jdt 5,1141 gold badge16 silver badges21 bronze badges · Answer 4 · 2021-10-08 14:11:00Z

C++

I’m not sure how you did the benchmarking but would appreciate it if you can run my naïve solution through it.

PS, sorry for being late for the party.

#include <vector>
#include <thread>
#include <iostream>
#include <unistd.h>
class worker
{
public:
 worker(size_t size)
 {
 workersize = size;
 buf = new char[workersize * 40];
 }
 ~worker()
 {
 delete buf;
 }
 void run()
 {
 char thisMod = 0;
 char ibMod;
 char ib[20];
 size_t ibStart;
 size_t s;
 size_t last_ib = -100; 
 char* ptr = buf;
 for (size_t i = start; i < end; i++)
 {
 size_t m3 = i % 3;
 size_t m5 = i % 5;
 if (m3 == 0 && m5 == 0)
 {
 *ptr++ = 'F';
 *ptr++ = 'i';
 *ptr++ = 'z';
 *ptr++ = 'z';
 *ptr++ = 'B';
 *ptr++ = 'u';
 *ptr++ = 'z';
 *ptr++ = 'z';
 }
 else if (m3 == 0)
 {
 *ptr++ = 'F';
 *ptr++ = 'i';
 *ptr++ = 'z';
 *ptr++ = 'z';
 }
 else if (m5 == 0)
 {
 *ptr++ = 'B';
 *ptr++ = 'u';
 *ptr++ = 'z';
 *ptr++ = 'z';
 }
 else
 {
 if (i - last_ib < 10 && ibMod + thisMod < 10)
 {
 // uses previous ib
 for (s = ibStart; s <= 14; s++)
 *ptr++ = ib[s];
 *ptr++ = ib[s] + thisMod - ibMod;
 }
 else
 {
 // calc new ib
 size_t x = i;
 size_t s = 15;
 ibMod = x % 10;
 ib[s--] = ibMod + '0';
 x /= 10;
 while (x > 0)
 {
 ib[s--] = (x % 10) + '0';
 x /= 10;
 }
 ibStart = ++s;
 for (ibStart; s <= 15; s++)
 *ptr++ = ib[s];
 last_ib = i;
 }
 }
 *ptr++ = '\n';
 thisMod++;
 if (thisMod > 9)
 thisMod = 0;
 }
 buflen = ptr - buf;
 }
 size_t start;
 size_t end;
 size_t workersize;
 char* buf;
 size_t buflen;
};
void task(worker* w)
{
 w->run();
}
int main() 
{
 size_t workercount = std::thread::hardware_concurrency();
 size_t workersize = 10000000;
 // create workers
 std::vector<worker*> workers;
 for (size_t i = 0; i < workercount; i++)
 workers.push_back(new worker(workersize));
 
 // main loop
 size_t cur = 0;
 for (;;)
 {
 // init workers
 for (worker* worker : workers)
 {
 worker->start = cur;
 cur += workersize;
 worker->end = cur;
 }
 std::vector<std::thread> threads;
 for (worker* worker : workers)
 threads.emplace_back(task, worker);
 for (std::thread& thread : threads)
 thread.join();
 // write output
 for (worker* worker : workers)
 write(1, worker->buf, worker->buflen);
 // write progress to cerr
 std::cerr << cur << "\n";
 }
}

Not bad. Seems to fluctuate between 4GiB/s to 5.5GiB/s. Compiled with g++ a.cpp -lpthread -o Johan -O3. When I run it all my 32 cores are about 20-40% busy
Using clang++ a.cpp -lpthread -o Johanclang -O3 didn't make much of a difference
@OmerTuchfeld, thank you very much for testing!! Can you perhaps run my latest code when you get a chance?
@OmerTuchfeld Oh well, thanks anyway! I think that is as far as I can push it. I got an average of about 4 GB/s on my 6 core i7-9850H with the following TIO and was hoping that running it on your 32-core monster would have made a big difference.

DBADon 1511 silver badge5 bronze badges · Answer 5 · 2021-11-08 19:34:28Z

There's plenty room for improvement here and I'll try to update when I can. (Most of the optimizations above will also be applicable):

A simple Powershell solution:

$i = 1
#While ($i -gt 0){
While ($i -lt 100000){ # Limiting to 100k for testing
if ((($i % 5) -eq 0) -and (($i % 3) -eq 0)){Write-Host "FizzBuzz"}
elseif (($i % 3) -eq 0) {Write-Host "Fizz"}
elseif (((($i -split "" | Select-Object -SkipLast 1) | 
 Select-Object -Last 1) - eq 0) -or ((($i -split "" | 
 Select-Object -SkipLast 1) | Select-Object -Last 1) -eq 5)){Write-Host "Buzz"}
else {Write-Host $i}
$i++}

Tried running it with podman run -it --rm --name powershell --volume $PWD:/root mcr.microsoft.com/powershell /usr/bin/pwsh /root/fizz.ps1 without success: Line | 7 | Select-Object -Last 1) - eq 0) -or ((($i -split "" | | ~ | You must provide a value expression following the '-' operator.

LianSheng 1113 bronze badges · Answer 6 · 2022-06-17 16:39:18Z

I'm surprised there is no JavaScript answer here.
But when I try to write a solution, I think I understand why.
JavaScript's score are pretty bad, and the very uncertainty.
But anyway, I want to release my solution first.

It's AMD Ryzen 5 3400G (2.6GHz) using Node v16.13.1 on Ubuntu 18.04.5, Linux 5.6.19.

main.js

let buffer = "";
for (i = 0; i < 1e7; i += 15) {
 buffer += `${i+1}\n${i+2}\nfizz\n${i+4}\nbuzz\nfizz\n${i+7}\n${i+8}\nfizz\nbuzz\n${i+11}\nfizz\n${i+13}\n${i+14}\nfizzbuzz\n`;
 if(buffer.length > 40000) {
 process.stdout.write(buffer);
 buffer = "";
 }
}
process.stdout.write(buffer);

Because of the high uncertainty, I will test 10 times in a row.

run.sh

#!/bin/bash
for ((i=1; i<=10; i++)); do
 node main.js | pv -ab > /dev/null
done

And the result is:

64.9MiB [47.8MiB/s]
64.9MiB [ 150MiB/s]
64.9MiB [ 150MiB/s]
64.9MiB [ 149MiB/s]
64.9MiB [ 151MiB/s]
64.9MiB [ 149MiB/s]
64.9MiB [ 150MiB/s]
64.9MiB [ 153MiB/s]
64.9MiB [78.8MiB/s]
64.9MiB [ 149MiB/s]

Thank you for your submission. Submissions must go at least as high as 2^63 to be considered correct

anbcodes 1112 bronze badges · Answer 7 · 2022-06-23 17:28:40Z

Node JS (v18.1.0)

let i = 1;
let str = '';
const resume = () => {
 while (true) {
 if (i % (2 << 11) === 0) {
 const success = process.stdout.write(str);
 str = '';
 if (!success) {
 process.stdout.once('drain', resume);
 break;
 }
 }
 if (i % 3 === 0 && i % 5 === 0) {
 str += 'FizzBuzz\n'
 } else if (i % 3 === 0) {
 str += 'Fizz\n'
 } else if (i % 5 === 0) {
 str += 'Buzz\n'
 } else {
 str += i + '\n';
 }
 i++;
 }
}
resume();

The command node main.js | pv > /dev/null gave me about 150MiB/s (for comparison the example shown in the OP ran at about 200 MiB/s on my machine, an Intel i7-9700K)

1

\$\begingroup\$ Welcome to Code Golf! \$\endgroup\$

rydwolf
– rydwolf ♦

2022年06月23日 17:43:36 +00:00
Commented Jun 23, 2022 at 17:43

Automatico 1794 bronze badges · Answer 8 · 2023-07-03 11:11:14Z

Python

Here is my humble attempt. It seems to out-perform the other python alternatives on my machine. I'm getting a sustained ~=330-367 MiB/s.

Run with python3, and tested with pypy3 (which seems to be little bit slower 320-360 MiB/s).

from sys import stdout
from io import StringIO
fz = "Fizz"
bz = "Buzz"
nl = "\n"
def fizzbuzz():
 num = 1
 while True:
 nf = True
 if num % 3 == 0:
 nf = False
 yield fz
 if num % 5 == 0:
 nf = False
 yield bz
 if nf:
 yield str(num)
 yield nl
 num += 1
output = StringIO()
for a in fizzbuzz():
 output.write(a)
 stdout.write(output.getvalue())
 output.truncate(0)

It uses a StringIO object which, as far as I understand, is implemented in a more efficient way, together with stdout.write, which again, seems to perform better than print and os.write.

I'm also storing the string-parts as variables. It seems to perform better, but that might be just placebo.

I have made a few variations on this with bulking, and writing more numbers/fizzbuzz to the output-parameter before writing to stdout, but it only seems to slow it down. I'm not sure why that is, but perhaps I get more lucky with L1/L2 cache with the current approach.

I see @janfrode 's answer now which, on my machine, runs at a similar speed, nice job!

Kokizzu 1,54911 silver badges20 bronze badges · Answer 9 · 2022-10-18 08:19:16Z

Just normal Go

package main
import (
 "bufio"
 "fmt"
 "os"
)
var writer *bufio.Writer = bufio.NewWriter(os.Stdout)
func printf(f string, a ...interface{}) { fmt.Fprintf(writer, f, a...) }
func main() {
 defer writer.Flush()
 for i := 1; i < 1000000000; i++ {
 if (i % 3 == 0) && (i % 5 == 0) {
 printf("FizzBuzz\n")
 } else if i % 3 == 0 {
 printf("Fizz\n")
 } else if i % 5 == 0 {
 printf("Buzz\n")
 } else {
 printf("%d\n",i)
 }
 }
}

Unrolled output version:

package main
import (
 "bufio"
 "fmt"
 "os"
)
var writer *bufio.Writer = bufio.NewWriter(os.Stdout)
// or: bufio.NewWriterSize(os.Stdout, 400 * 1024 * 1024)
func printf(f string, a ...interface{}) { fmt.Fprintf(writer, f, a...) }
func main() {
 const FMT = "%d\n%d\nFizz\n%d\nBuzz\nFizz\n%d\n%d\nFizz\nBuzz\n%d\nFizz\n%d\n%d\nFizzBuzz"
 defer writer.Flush()
 for i := 1; i < 1000000000; i += 15 {
 printf(FMT, i, i+1, i+3, i+6, i+7, i+10, i+12, i+13)
 }
}

Unbuffered output version:

package main
import "fmt"
func main() {
 for i := 1; i < 1000000000; i++ {
 if (i % 3 == 0) && (i % 5 == 0) {
 fmt.Println("FizzBuzz")
 } else if i % 3 == 0 {
 fmt.Println("Fizz")
 } else if i % 5 == 0 {
 fmt.Println("Buzz")
 } else {
 fmt.Printf("%d\n",i)
 }
 }
}

On my machine:

gcc version 11.3.0 (Ubuntu 11.3.0-1ubuntu1~22.04.1)
gcc a.c && ./a.out | pv > /dev/null
go version go1.20.5 linux/amd64
go build a.go && ./a | pv > /dev/null

C example: 7.33GiB 0:00:26 [ 282MiB/s]
Go: 7.33GiB 0:00:56 [ 133MiB/s]
Go (400MB buffer): 7.33GiB 0:00:49 [ 151MiB/s]
Unrolled Go: 7.27GiB 0:00:32 [ 229MiB/s]
Unrolled Go (400MB buffer): 7.27GiB 0:00:25 [ 291MiB/s]
unbuffered Go: 7.33GiB 0:05:50 [21.4MiB/s]

Random832 8565 silver badges10 bronze badges · Answer 10 · 2021-10-29 21:56:16Z

This probably won't be the fastest, but I am curious how it performs compared to a naive stdio-based fizzbuzz. I created this to illustrate the approach I suggested in the comments of using an array of fixed-length regions to format the output, and filtering out null padding before printing the output. I didn't do any multithreading, either, this could probably be improved by using multiple buffers and doing the writing on another thread.

This code does nothing to prevent unusual output after reaching ULONG_MAX. It might be able to be made faster by only supporting up to UINT_MAX, which would also allow more data to fit in a given size of intermediate buffer.

// sample outputs [space is null]
// 1\n
// Fizz \n
// Buzz \n
// FizzBuzz \n
// 18446744073709551614\n ULLONG_MAX-1 [ULLONG_MAX is FizzBuzz]
#define LEN 21
//#define NUM (200*15)
#define NUM (200*15)
#if defined(_POSIX_C_SOURCE)
#include <unistd.h>
#define WRITE(b,l) write(1,b,l)
#elif defined(_MSC_VER)
#include <io.h>
#define WRITE(b,l) write(1,b,l)
#else
#include <stdio.h>
#define WRITE(b,l) fwrite(b,1,l,stdout)
#endif
char buf[NUM*LEN] = {0}; // 63000
char buf2[NUM*LEN];
int flags[NUM];
int main() {
 // pre-populate buffer
 for(int i=0; i<NUM; i++) {
 char *t = buf+i*LEN;
 int f = flags[i] = i%3==0 | ((i%5==0)<<1);
 if(f&1) { t[0]='F'; t[1]='i'; t[2]='z'; t[3]='z';}
 if(f&2) { t[4]='B'; t[5]='u'; t[6]='z'; t[7]='z';}
 t[LEN-1] = '\n';
 }
 int j=1; // position of current output within buffer
 for(unsigned long long i=1;;i++) {
 if(!flags[j]) {
 char *t = buf+j*LEN+(LEN-2); // position of last digit
 unsigned long long k = i;
 while(k) {
 *t-- = '0'+k%10;
 k/=10;
 }
 }
 if(++j == NUM) {
 j=0;
 char *p=buf;
 char *q=buf2;
 if(i == NUM-1) p+=LEN; // skip zero, there's probably a better way
 while(p < buf+NUM*LEN) {
 if(*p) *q++=*p;
 p++;
 }
 WRITE(buf2, q-buf2);
 }
 }
}

yeah, once i posted it i realized that per-character branch inside the while loop is probably terrible for the performance, i'd been hoping it was optimized away to a conditional move or something - i can't find a way to get rid of it either. Ah well, it's at least useful as evidence this approach isn't worthwhile.

Qaziquza 1,6932 gold badges11 silver badges34 bronze badges · Answer 11 · 2021-12-18 02:51:14Z

C99 (gcc)

#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdbool.h>
#define SIZE (1 << 16)
int main(void) {
 
 char buffer[SIZE] = {};
 char *buf = buffer;
 unsigned long long int prev_n_3=0; 
 unsigned long long int prev_n_5=0; 
 for (unsigned long long int n=3; 3; n++) { 
 if ((buf - buffer) >= (SIZE-21)) {
 write(1, buffer, buf-(buffer+1));
 buf = buffer;
 }
 if (n-prev_n_3==3) {
 buf[0] = 'F';
 buf[1] = 'i';
 buf[2] = 'z';
 buf[3] = 'z';
 buf += 4; 
 prev_n_3=n;
 }
 if (n-prev_n_5==5) {
 buf[0] = 'B';
 buf[1] = 'u';
 buf[2] ='z';
 buf[3] = 'z';
 buf += 4;
 prev_n_5=n;
 }
 
 if (!(prev_n_5==n) && !(prev_n_3==n)) {
 buf += sprintf(buf,"%llu",n);
 }
 *buf='\n';
 buf++;
 }
 return 0;
}

About 140 MiBs by my measurement. I measured using gcc -flto -Ofast, on a 2015 MacBook Air running Big Sur. I deleted my previous answer a while ago. This is a partial rewrite.

That's still very early, see the "very high astronomical number" requirement

no_gravity 1013 bronze badges · Answer 12 · 2022-06-04 18:03:36Z

PHP (simple)

<?php
ob_start(null, 32000);
for ($i = 0; $i < PHP_INT_MAX; $i++) {
 $out = $i;
 if ($i % 3 === 0) $out = 'fizz';
 if ($i % 5 === 0) {
 $out = 'buzz';
 if ($i % 3 === 0) $out = 'fizzbuzz';
 }
 echo "$out\n";
}

PHP (optimized)

<?php
ob_start(null, 32000);
$a = [0, 1, 'fizz', 3, 'buzz', 'fizz',
 6, 7, 'fizz', 'buzz', 10, 'fizz',
 12, 13, 'fizzbuzz', ''];
for ($i = 1; $i < PHP_INT_MAX-15; $i+=15) {
 $a[0] = $i;
 $a[1] = $i+1;
 $a[3] = $i+3;
 $a[6] = $i+6;
 $a[7] = $i+7;
 $a[10] = $i+10;
 $a[12] = $i+12;
 $a[13] = $i+13;
 echo join("\n", $a);
}

PHP (multithreaded)

<?php
$nWorkers = preg_match_all('/^processor\s/m', file_get_contents('/proc/cpuinfo'), $discard);
$iterationsPerFlush = 20000;
define('CHILD_TOKEN_KEY', 0);
//1 byte SHM segment holding the worker ID whose turn it is to write 
//initialise to child 0.
$shm = shm_attach(1);
shm_put_var($shm, CHILD_TOKEN_KEY, 0);
//Semaphore to protect access to the SHM var
$key = ftok(__FILE__, 1);
$sem = sem_get($key);
$childPids = [];
pcntl_sigprocmask(SIG_BLOCK, [SIGTERM, SIGINT, SIGCHLD]);
for ($i=0; $i<$nWorkers; $i++){
 if (!$pid = pcntl_fork()){
 doWork($sem, $shm, $i, $nWorkers, $iterationsPerFlush);
 exit;
 }else{
 $childPids[] = $pid;
 }
}
//If we get killed or one of our children does, clean everything up
pcntl_sigwaitinfo([SIGTERM, SIGINT, SIGCHLD], $info);
foreach ($childPids as $pid){
 posix_kill($pid, SIGKILL);
}
sem_remove($sem);
shm_remove($shm);
function doWork($sem, $shm, $childId, $nWorkers, $iterationsPerFlush){
 $nextChildId = ($childId + 1) % $nWorkers;
 $increment = ($nWorkers - 1) * $iterationsPerFlush * 15;
 $startOffset = $childId * $iterationsPerFlush * 15;
 $a = [-14 + $startOffset, -13 + $startOffset, "fizz", -11 + $startOffset, "buzz\nfizz",
 -8 + $startOffset, -7 + $startOffset, "fizz\nbuzz", -4 + $startOffset, "fizz",
 -2 + $startOffset, -1 + $startOffset, "fizzbuzz\n"];
 ob_start();
 while (true){
 for ($i = 0; $i < $iterationsPerFlush; $i++){
 $a[0] += 15; 
 $a[1] += 15; 
 $a[3] += 15; 
 $a[5] += 15; 
 $a[6] += 15; 
 $a[8] += 15; 
 $a[10] += 15; 
 $a[11] += 15; 
 echo join("\n", $a);
 }
 //Acquire the semaphore and check if it's our turn to flush. If not, repeat.
 while (true){
 sem_acquire($sem);
 if (shm_get_var($shm, CHILD_TOKEN_KEY) === $childId){
 break;
 }
 sem_release($sem);
 }
 ob_flush();
 //Pass the token on to the next child and release the semaphore
 shm_put_var($shm, CHILD_TOKEN_KEY, $nextChildId);
 sem_release($sem);
 //Skip over the numbers the other workers already handled
 $a[0] += $increment; 
 $a[1] += $increment; 
 $a[3] += $increment; 
 $a[5] += $increment; 
 $a[6] += $increment; 
 $a[8] += $increment;
 $a[10] += $increment; 
 $a[11] += $increment; 
 }
}

The multithreaded version is the result of this thread on Reddit:

https://old.reddit.com/r/PHP/comments/v94ly2/php_fizzbuzz/

It was mostly done by therealgaxbo.

Ends too quickly, see rules: "The output must go on forever (or a very high astronomical number)"
What is the definition of "very high astronomical number"? I think it would be cool to have this simple approach in the race and see if PHP already outperforms the other interpreted languages like Python and Ruby, even if they use more optimized approaches. This would limit the loop to PHP_INT_MAX which is 9223372036854775807. Would that be sufficient?
"What is the definition of "very high astronomical number"?" - I guess I would consider numbers 2^63 (PHP_INT_MAX) and higher astronomical. It would take the best submission here more than a hundred years to get to that number (357823770363079921209 digits between 1 and 2^63 with about 40% removed for fizzing and buzzing + 9223372036854775808 newlines is about 194 exabytes of characters at 50GiB/s). By the way PHP_INT_MAX is only 4 billion on 32-bit machines, but I guess it's fair to assume my machine / PHP executable is 64-bit. I'm getting 130MiB/s with your submission by the way
Ah, 130MiB/s is not that much. Then let's go for an optimized version. I just added one. On my laptop, it runs 5x faster than the simple one.
I see you added the results of the optimized version. Great. I now also added a multithreaded version. So far, it seems to scale about linearly with the number of cores.

RARE Kpop Manifesto 6534 silver badges8 bronze badges · Answer 13 · 2024-12-25 02:53:34Z

`awk`

If modulo (%) ops are considered high cost, then here's one way to do handle all output scenarios with just 1 modulo, and 1 single copy of combined string.

jot 36 | 
awk 'function ____(__,_,___){return(_^=_<_)-(__=int(___=__?__:$_)^++_^_++%(_++*++_))\
 ?substr("Fizz Buzz",++_^(_--<__),_*!__+--_):___}$NF=____()' ORS=', '

1, 2, Fizz, 4, Buzz, Fizz, 7, 8, Fizz, Buzz, 11, Fizz, 
13, 14, Fizz Buzz, 16, 17, Fizz, 19, Buzz, Fizz, 22, 23, Fizz,
Buzz, 26, Fizz, 28, 29, Fizz Buzz, 31, 32, Fizz, 34, Buzz, Fizz,

But if I stupidly hard-encode every scenario into an array,

( time ( mawk2 ' BEGIN {
 ______ = 16
 1ドル = (4ドル = 7ドル = 10ドル = 13ドル = "Fizz")" "(6ドル = 11ドル = "Buzz")
 for (_____ ^= _ = (ORS = ", ")*(___ = (_+=++_)^++_^_); _++ < ___;)
 $(_____ ^= ++_____ < ______) ? $_____ : _
 printf("%d\n", ___) }' ) )

( mawk2 ; ) 7.07s user 0.03s system 99% cpu 7.102 total
 1 134217728
 echo '1148700297/7.102/32^4' | bc -l 
 154.25034785935389293948

Cycling through 2^0 - 2^27, mawk2 had an internal throughput rate of 154.25 MiB/sec, which is quite insane for a single-threaded shell scripting language invented 47 years ago. In this variant, all the thresholds appear to be upshifted by one, and that's by design, since I'm using exponentiation to handle mod 15.

Stack Exchange Network

High throughput Fizz Buzz

Checking throughput

Example

Rules

Scores

Results for java:

Results for python:

Results for pypy:

Results for rust:

Results for ruby:

Results for asm:

Results for julia:

Results for c:

Results for cpp:

Results for numba:

Results for numpy:

Results for go:

Results for php:

Results for elixir:

Results for csharp:

43 Answers 43

Ruby

Node JS (v18.1.0)

Python

C99 (gcc)

`awk`

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

High throughput Fizz Buzz

Checking throughput

Example

Rules

Scores

Results for java:

Results for python:

Results for pypy:

Results for rust:

Results for ruby:

Results for asm:

Results for julia:

Results for c:

Results for cpp:

Results for numba:

Results for numpy:

Results for go:

Results for php:

Results for elixir:

Results for csharp:

43 Answers 43

Ruby

Node JS (v18.1.0)

Python

C99 (gcc)

awk

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions

`awk`