
Intro

(This post has a continuation Get histogram of bytes in any set of files in C++14 - take II.)

For the sake of practice, I wrote this short program. It takes any set of file names on the command line and produces a histogram of the bytes in all of the specified files (summing the counts across files).

Code

#include <algorithm>
#include <array>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

using std::array;
using std::cout;
using std::ifstream;
using std::string;
using std::vector;

static void printHelp(char* programName);
static vector<string> extractFileNames(const int argc,
                                       char* argv[]);
static void processFile(array<size_t, 0x100>& histogram,
                        string fileName);
static void printHistogram(array<size_t, 0x100>& histogram);

int main(int argc, char* argv[])
{
    if (argc == 1) {
        printHelp(argv[0]);
        return EXIT_SUCCESS;
    }

    vector<string> fileNames = extractFileNames(argc, argv);
    array<size_t, 0x100> histogram;
    std::fill(histogram.begin(),
              histogram.end(),
              0);

    for (auto it = fileNames.cbegin(); it != fileNames.cend(); it++) {
        processFile(histogram, *it);
    }

    printHistogram(histogram);
}

static void printHelp(char* programName) {
    string name = programName;
    cout << "usage: "
         << name.substr(name.find_last_of("/\\") + 1)
         << " FILE1 ... FILEN\n";
}

static vector<string> extractFileNames(const int argc,
                                       char* argv[]) {
    vector<string> fileNames(argc - 1);
    for (size_t i = 1; i < argc; i++) {
        fileNames.push_back(string(argv[i]));
    }
    return fileNames;
}

static void processFile(array<size_t, 0x100>& histogram,
                        string fileName) {
    ifstream stream(fileName, std::ios::binary);
    while (!stream.eof() && stream.good()) {
        unsigned char ch;
        stream >> ch;
        histogram[ch]++;
    }
    stream.close();
}

static void printHistogram(array<size_t, 0x100>& histogram) {
    for (size_t i = 0; i < 0x100; i++) {
        const unsigned int hch = static_cast<unsigned int>(i);
        cout << "0x"
             << std::hex
             << hch
             << ": "
             << std::dec
             << histogram[i]
             << "\n";
    }
}

Typical output

C:\Users\rodio\Documents\VSProjects\ByteHistogram.cpp\x64\Debug>ByteHistogram.cpp.exe ByteHistogram.cpp.pdb ByteHistogram.cpp.exe
0x0: 1050434
0x1: 36219
0x2: 29432
0x3: 23714
0x4: 5747
0x5: 6084
0x6: 7312
0x7: 5521
0x8: 7362
0x9: 0
0xa: 0
0xb: 0
0xc: 0
0xd: 0
0xe: 5843
0xf: 3296
0x10: 25008
0x11: 21996
0x12: 13495
0x13: 7415
0x14: 7473
0x15: 15738
0x16: 8175
0x17: 5275
0x18: 7068
0x19: 5976
0x1a: 8543
0x1b: 5076
0x1c: 6307
0x1d: 5445
0x1e: 6996
0x1f: 6105
0x20: 0
0x21: 3668
0x22: 4457
0x23: 3723
0x24: 10758
0x25: 3622
0x26: 3522
0x27: 2759
0x28: 4035
0x29: 3022
0x2a: 2862
0x2b: 2340
0x2c: 7176
0x2d: 2826
0x2e: 7294
0x2f: 1953
0x30: 10694
0x31: 7573
0x32: 7391
0x33: 5040
0x34: 6695
0x35: 4223
0x36: 5466
0x37: 2819
0x38: 4401
0x39: 4259
0x3a: 18402
0x3b: 2031
0x3c: 8432
0x3d: 1885
0x3e: 7356
0x3f: 9548
0x40: 26898
0x41: 12648
0x42: 4229
0x43: 10425
0x44: 10239
0x45: 7941
0x46: 5515
0x47: 3459
0x48: 11229
0x49: 7583
0x4a: 2387
0x4b: 3271
0x4c: 5935
0x4d: 8513
0x4e: 5940
0x4f: 3618
0x50: 8459
0x51: 2877
0x52: 6448
0x53: 6865
0x54: 5373
0x55: 4518
0x56: 9139
0x57: 3106
0x58: 3670
0x59: 2444
0x5a: 2411
0x5b: 1666
0x5c: 17223
0x5d: 2292
0x5e: 1490
0x5f: 39075
0x60: 4851
0x61: 32171
0x62: 7417
0x63: 25106
0x64: 19350
0x65: 24706
0x66: 6028
0x67: 6085
0x68: 10986
0x69: 23400
0x6a: 1981
0x6b: 3819
0x6c: 18780
0x6d: 9267
0x6e: 16417
0x6f: 22650
0x70: 10313
0x71: 2359
0x72: 32933
0x73: 31774
0x74: 43000
0x75: 9764
0x76: 5407
0x77: 4483
0x78: 4457
0x79: 5631
0x7a: 2222
0x7b: 1371
0x7c: 2255
0x7d: 1932
0x7e: 1446
0x7f: 1242
0x80: 8753
0x81: 2089
0x82: 1954
0x83: 1847
0x84: 2475
0x85: 2934
0x86: 1520
0x87: 1268
0x88: 2307
0x89: 2647
0x8a: 1547
0x8b: 3886
0x8c: 1890
0x8d: 3669
0x8e: 1468
0x8f: 1458
0x90: 3173
0x91: 1485
0x92: 1737
0x93: 1316
0x94: 2147
0x95: 1741
0x96: 1430
0x97: 1563
0x98: 2357
0x99: 1428
0x9a: 1520
0x9b: 1365
0x9c: 2133
0x9d: 1886
0x9e: 1183
0x9f: 1263
0xa0: 2320
0xa1: 1470
0xa2: 1495
0xa3: 990
0xa4: 2151
0xa5: 1997
0xa6: 942
0xa7: 1565
0xa8: 2112
0xa9: 1533
0xaa: 1403
0xab: 1575
0xac: 1787
0xad: 1484
0xae: 1284
0xaf: 1555
0xb0: 2792
0xb1: 1676
0xb2: 1523
0xb3: 1712
0xb4: 2489
0xb5: 1187
0xb6: 1883
0xb7: 1403
0xb8: 2309
0xb9: 1473
0xba: 1282
0xbb: 1468
0xbc: 2027
0xbd: 1917
0xbe: 1075
0xbf: 1325
0xc0: 4465
0xc1: 1883
0xc2: 1413
0xc3: 1651
0xc4: 2106
0xc5: 1520
0xc6: 1520
0xc7: 1567
0xc8: 2945
0xc9: 1349
0xca: 1536
0xcb: 1532
0xcc: 46475
0xcd: 1710
0xce: 1700
0xcf: 1313
0xd0: 2964
0xd1: 1767
0xd2: 1434
0xd3: 1483
0xd4: 1724
0xd5: 1076
0xd6: 1013
0xd7: 1424
0xd8: 1994
0xd9: 1486
0xda: 1205
0xdb: 1325
0xdc: 2246
0xdd: 1342
0xde: 1199
0xdf: 1595
0xe0: 3734
0xe1: 1630
0xe2: 1763
0xe3: 1951
0xe4: 2399
0xe5: 1805
0xe6: 1685
0xe7: 1977
0xe8: 4197
0xe9: 2599
0xea: 1752
0xeb: 1340
0xec: 2521
0xed: 1594
0xee: 1780
0xef: 1370
0xf0: 3064
0xf1: 9476
0xf2: 7373
0xf3: 3636
0xf4: 2268
0xf5: 1540
0xf6: 1406
0xf7: 1351
0xf8: 2248
0xf9: 1567
0xfa: 948
0xfb: 1517
0xfc: 2121
0xfd: 1869
0xfe: 1733
0xff: 13082

Critique request

Since I am not a professional C++ programmer, I would love to hear comments on how I could improve this program.

asked Nov 11, 2024 at 14:13

3 Answers


This key loop looks incorrect:

while (!stream.eof() && stream.good()) {
 unsigned char ch;
 stream >> ch;
 histogram[ch]++;
}

If stream >> ch fails (e.g. at end of file), then we increment the histogram anyway. That most likely leads to over-counting some arbitrary values.

There's also the issue that operator>>() is a formatted input function: it skips whitespace (likely accounting for the zero entries for 0x09, 0x0a, 0x0d and 0x20 in your results). I think you should be using the unformatted input function get() if you want to read every byte.
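A minimal sketch of what the loop could look like with the unformatted get(ch) overload, so nothing is counted unless the read actually succeeded (just one way to write it):

static void processFile(array<size_t, 0x100>& histogram,
                        const string& fileName) {
    ifstream stream(fileName, std::ios::binary);
    char ch;
    while (stream.get(ch)) {                          // loop body runs only if the read succeeded
        histogram[static_cast<unsigned char>(ch)]++;  // avoid negative indices for bytes >= 0x80 when char is signed
    }
}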


The type std::array<std::size_t, 0x100> is used in several places. It may be good to give it a name. That will make it easier to remove the assumption that UCHAR_MAX is 0xFF...
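Something along these lines, for example (a sketch; the name is up to you):

#include <array>
#include <climits>   // UCHAR_MAX
#include <cstddef>

using Histogram = std::array<std::size_t, UCHAR_MAX + 1>;  // one counter per possible byte value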


Consider defining a class to represent a histogram instead of passing (references to) arrays around. I think that will make the code clearer.
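A rough sketch of the kind of class I mean (member names are only illustrative):

#include <array>
#include <climits>
#include <cstddef>

class ByteHistogram {
public:
    void insert(unsigned char byte) { counts_[byte]++; }
    std::size_t count(unsigned char byte) const { return counts_[byte]; }
private:
    std::array<std::size_t, UCHAR_MAX + 1> counts_{};  // value-initialized to zero
};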


Instead of rejecting invocations with no arguments, a more useful alternative would be to read from standard input. Most users would expect that behaviour, just like standard tools such as cat, grep and sed.
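In main that could look something like this (a sketch; processStream() is a hypothetical refactoring of processFile() that takes any std::istream&):

// Sketch: fall back to standard input when no file names are given.
// processStream() is hypothetical: the current processFile() logic, but taking std::istream&.
if (argc == 1) {
    processStream(histogram, std::cin);
} else {
    for (const auto& fileName : extractFileNames(argc, argv)) {
        ifstream file(fileName, std::ios::binary);
        processStream(histogram, file);
    }
}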

answered Nov 11, 2024 at 14:30
  • Regarding UCHAR_MAX being 0xFF: open-std.org/jtc1/sc22/wg21/docs/papers/2024/p3477r0.html (Commented Nov 12, 2024 at 9:42)
  • Thanks for that, @Sebastian. Will be interesting to see how much traction that gets (I worry it closes options for future architectures, but that's just my opinion, and the author of that paper is well-known in C++ circles). (Commented Nov 12, 2024 at 10:21)
  • @Peilonrayz, using a class means less exposure of internals in interfaces etc. In Python, I would use a collections.Counter rather than a plain dict, but Python is such a different language that such comparisons aren't that enlightening anyway. However, look at the linked question and my answer to it to see how a class might make the histogram more reusable. N.B. that recommendation is "consider", not "you must"... (Commented Nov 12, 2024 at 16:24)
  • @Peilonrayz: Certainly a type name alias (such as using histogram = std::array<std::size_t, 0x100>;) makes it easier to pass around opaquely. A class takes that a bit further, because then all the functions we need are members of the class, making them easier to find and to reason about. E.g. we can provide an insert() member function for adding data and hide the underlying array so that we can't accidentally modify it outside of the published interface. Does that help with the motivation? (Commented Nov 12, 2024 at 16:55)
  • @TobySpeight Thank you. Yes, you've helped me understand your rationale. (Commented Nov 12, 2024 at 16:58)

Instead of this

array<size_t, 0x100> histogram;

I would use

vector<uint64_t> histogram(0x100, 0);

The difference is where the memory lives. The array will live on the stack, and while the size is only 2 KB (256 x 8 bytes per element), the stack has limited space. It's better to use the heap for larger data, and a vector's elements live on the heap. The good news is that the usage is the same, so you only need to change this one line. Also, by supplying the extra , 0 in the constructor, you can get rid of your std::fill.

--

The for loop should use range-based for notation:

Instead of

 for (auto it = fileNames.cbegin(); it != fileNames.cend(); it++) {

use

 for (const auto& fileName : fileNames) {

--

You can simplify the extractFileNames implementation. Instead of this

vector<string> fileNames(argc - 1);
for (size_t i = 1; i < argc; i++) {
 fileNames.push_back(string(argv[i]));
}
return fileNames;

you can just do

vector<string> fileNames(argv + 1, argv + argc);
return fileNames;
answered Nov 13, 2024 at 0:07
  • Is there much point in making a std::vector<string> of filenames? If we're just going to assume every command-line arg is a filename, argv to argv + argc is an array of char* elements we can just use directly without copying them anywhere. With C++20 you could wrap them in a std::span or something from std::ranges to make it easier to pass them to other functions with the pointer and length combined into one arg; see the sketch after these comments. (Commented Nov 13, 2024 at 5:03)
  • I respectfully disagree with the first recommendation. Generally, vectors have more costs in construction, access and destruction than arrays. Using a vector internally imposes those costs on all users whether they want them or not, whereas using an array allows users to choose whether to create the histogram in automatic storage or in dynamic storage (accessed via a smart pointer). That's more in keeping with C++'s philosophy of zero-cost abstraction. (Commented Nov 13, 2024 at 13:00)
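A sketch of the std::span idea from the first comment (C++20; assumes processFile() keeps its current signature, so each char* converts to std::string at the call site):

#include <span>

// Sketch: view the existing argv array instead of copying the names into strings.
std::array<std::size_t, 0x100> histogram{};
std::span<char*> fileNames(argv + 1, static_cast<std::size_t>(argc - 1));
for (char* fileName : fileNames) {
    processFile(histogram, fileName);   // char* converts implicitly to std::string here
}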

Was efficiency a goal here?

istream::operator>> (or .get()) has to do a fair bit of work to manage its buffer, check for a tied stream that might need flushing, and so on, and it typically doesn't fully inline so the compiler can't hoist / sink some of the work out of loops. Something like a 1K buffer would be appropriate, or 32K or 64K if the C++ library functions will read directly from the OS into your buffer instead of bouncing through the library's buffer for the stream.

(Unlike C, C++ iostreams are not thread-safe, so at least there's no lock/unlock overhead like you'd have with C fgetc, or avoid with C fgetc_unlocked())

To histogram efficiently, you might unroll with two or four arrays of counts (which you sum at the end), so a long run of the same byte will be incrementing two or four different memory locations, instead of just one. Repeated load/increment/store on the same location has about 5 cycle latency on typical modern x86, with store-forwarding from the store buffer to the next load being all but 1 cycle of that. English text usually doesn't have long runs of the same byte, but some binary files can, especially uncompressed images. (Out-of-order execution can deal with short runs of the same byte.) Throughput of incrementing a counter in memory is at least 1 per clock on modern CPUs if the locations are independent, so a long run of the same address can be about 5x slower.

8 bytes * 256 entries is only 2 KiB, so four separate count arrays would still only be 8 KiB, easily fitting in L1d cache on typical CPUs where L1d cache is at least 32K. (Although reading from files could evict data, especially into a 64K I/O buffer.)
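A sketch of that unrolling over an already-filled buffer (names are illustrative, and the leftover handling is the simplest possible):

#include <array>
#include <cstddef>

// Sketch: 4 count arrays so a run of identical bytes increments 4 different
// memory locations; the arrays are summed into the real histogram at the end.
static void countBytes(std::array<std::size_t, 0x100>& histogram,
                       const unsigned char* buf, std::size_t len) {
    std::array<std::array<std::size_t, 0x100>, 4> counts{};  // zero-initialized, 8 KiB total
    std::size_t i = 0;
    for (; i + 4 <= len; i += 4) {
        counts[0][buf[i + 0]]++;
        counts[1][buf[i + 1]]++;
        counts[2][buf[i + 2]]++;
        counts[3][buf[i + 3]]++;
    }
    for (; i < len; i++)        // 0..3 leftover bytes
        counts[0][buf[i]]++;
    for (std::size_t b = 0; b < 0x100; b++)
        histogram[b] += counts[0][b] + counts[1][b] + counts[2][b] + counts[3][b];
}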

You could consider using narrower counters like uint32_t that you sum into a uint64_t array before it's possible for one of them to wrap around. But with 4x uint64_t arrays still only having an 8K cache footprint (since the number of possible values is only 256), there probably isn't a performance benefit to that in this case. Unless you're on a 32-bit ISA, then 64-bit loads and increments cost more.
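If you did go that way, the bookkeeping is just a periodic flush, something like this (sketch; flushing once per buffer is enough as long as each buffer holds far fewer than 2^32 bytes):

#include <array>
#include <cstddef>
#include <cstdint>

// Sketch: 32-bit counters flushed into 64-bit totals before any counter can wrap.
static void flushCounts(std::array<std::uint64_t, 0x100>& totals,
                        std::array<std::uint32_t, 0x100>& counts) {
    for (std::size_t b = 0; b < 0x100; b++) {
        totals[b] += counts[b];
        counts[b] = 0;
    }
}

Count into counts in the inner loop, call flushCounts after each buffer, and once more at the end.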

Speaking of 32-bit ISAs and ABIs, people mentioned on your last version of the question that size_t will only be 32-bit on some platforms, but you might still have large files. uint64_t would be a more appropriate choice.


I profiled it on x86-64 Arch GNU/Linux, compiled with clang 18.1 with libstdc++ (not LLVM's libc++) -O3 -march=native -fno-plt on my Skylake desktop. Run as yes abcdefghijkl | head -c $((10240 * 32768)) | perf record ./a.out /dev/stdin

perf report shows 40% of CPU time was spent in std::istream::get, another 30% in std::istream::sentry::sentry(std::istream&, bool) (which did not actually do any locking, and skipped the call to flush any ostream). Only 25% of CPU time was spent in main actually doing the increments in a loop, checking stream status, and making an indirect call to the library get() function. At least the stream status checks for the loop condition inlined.

The remaining <5% of CPU time was spent in various kernel functions doing I/O. So over 70% of CPU time is basically wasted on istream function-call overhead. Not as bad as I'd feared if it had been doing locking (which would have destroyed memory-level parallelism between increments), but reading into a buffer and looping over that could give you speedups of close to a factor of 4. If not more, since some of the time spent in main was on function calls and other overhead, not just increments.

Multithreading would be the next step after that in gaining performance: each thread would use its own count array(s) that are combined later, which should work great since they're all small enough to fit in per-core-private L1d cache. You'd then need to have each thread grab a chunk of input to count, perhaps just letting them all read into 1K buffers and letting the library sort it out. With small inputs less than a few KiB, it doesn't really matter if one thread gets all the work, it'll still be done fast. Unless you were doing this as a function as part of a larger program that gets called many times, then you'd worry about distributing work evenly for small cases.
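Structurally, something like this (a sketch only; numThreads and countMyShare() are placeholders, and how the input gets split between threads is the part I'm hand-waving):

#include <array>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

void countMyShare(std::array<std::size_t, 0x100>& local, unsigned threadIndex);  // placeholder

void histogramThreaded(std::array<std::size_t, 0x100>& histogram, unsigned numThreads) {
    std::vector<std::array<std::size_t, 0x100>> perThread(numThreads);  // value-initialized to zero
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; t++)
        workers.emplace_back(countMyShare, std::ref(perThread[t]), t);  // each thread counts into its own array
    for (auto& w : workers)
        w.join();
    for (const auto& local : perThread)          // combine the small per-thread arrays once at the end
        for (std::size_t b = 0; b < 0x100; b++)
            histogram[b] += local[b];
}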


Reading a 32K buffer at a time is over 10x faster on large files

Passing std::string by reference is usually a good idea; it's typically a 32-byte object and may contain a pointer or have the string data embedded in it. That's negligible compared to the cost of opening a file, of course, but good practice anyway.

I used a 32K buffer on the stack (local variable not static). That's towards the upper end of what it's reasonable and sensible to allocate on the stack; 1MiB or 8MiB are typical total stack size limits for Windows and Linux user-space processes respectively.

.read (https://en.cppreference.com/w/cpp/io/basic_istream/read) is how you read a block of data into a buffer in C++. It takes a char* which can alias anything (so you can point it at a buffer of int or whatever elements without violating strict-aliasing), but in this case it was easy enough to just do implicit conversion to unsigned char when reading the array.

static void processFile(array<size_t, 0x100>& histogram,
                        const string &fileName) {
    ifstream input(fileName, std::ios::binary);
    char ibuf[32*1024];   // .read() takes a char* so unsigned char[] would be inconvenient
    do {
        input.read(ibuf, sizeof(ibuf));
        size_t len = input.gcount();    // returns 0 on an empty file
        for (size_t i = 0; i < len; i++) {
            unsigned char ch = ibuf[i]; // implicit conversion from char
            histogram[ch]++;
        }
    } while (input.good());  // EOF and error both set failbit
    // TODO: check for non-EOF, and report I/O error
    input.close();
}

I could have gotten fancy with a range-for on a span or something of the part of the array that has valid data, but that didn't seem more readable or obvious. The inner loop condition takes care of hitting EOF, even for empty files, so we don't increment the histogram for (unsigned char)Traits::eof() (0xff) like the original does.

The original (with ch = stream.get() instead of stream >> ch) runs in 2.04 seconds = 7.99 G clock cycles, executing 22.8 G instructions in user-space.

This new version runs in 0.183 seconds = 0.716 G clock cycles @ 3.9GHz, executing 0.802 G instructions in user-space, about 0.854 G overall including kernel mode.

I measured with perf stat -r10 ./histo-block32K input.txt > /dev/null, which averages over 10 runs. Redirecting output to a non-terminal makes the output-printing code only make one write system call confirmed by strace ./histo... > /dev/null. And removes any time spent waiting for the terminal to scroll although I think that's not much of a thing with KDE Konsole on modern Linux. strace also showed my version making read(3, buf, 32768) system calls, vs. the original making read(3, buf, 8191) system calls, so its reads aren't even page-aligned!

The input file was 320MB, prepared with yes abcdefghijkl | head -c $((10240 * 32768)) > input.txt on my Linux desktop. yes repeats the "a..l\n" string indefinitely, and head takes the first 10240x 32K blocks of it. Notice that all the characters are unique so I didn't have to unroll with multiple count arrays to get instruction-level parallelism with my increments.

I compiled with clang++ -O3 -march=native -mbranches-within-32B-boundaries -fno-plt with Clang 18.1.8 on Arch GNU/Linux, on my i7-6700k Skylake desktop at 3.9GHz (energy_performance_preference = balance_performance), with dual-channel DDR4-2666MHz. This uses libstdc++, not libc++.

The hot loop is the version inlined into main, from perf record / perf report. (I'd normally have used perf report -Mintel for Intel syntax instead of AT&T, but it omitted the addressing mode after inc QWORD.)

Clang unrolled the main loop by 8, which is good since the front-end (fetch/decode/issue/rename) is a bottleneck for this on Skylake (along with 2 loads + 1 store per clock) and -march=native implies -mtune=skylake. The issue stage is only 4 uops wide, and each load + memory-destination-increment is 1+3 uops, so any loop overhead costs some throughput assuming out-of-order exec can hide the latency of any cache misses.

left column: % of counts for <cycles> 
 | right column: disassembly with addresses for jump targets
 | And AT&T syntax for instructions
 1.14 │4c0: movzbl 0x840(%rsp,%rcx,1),%esi # load a byte with zero-extension, indexing a stack buffer
 14.21 │ incq 0x40(%rsp,%rsi,8) # increment that counter
 │ movzbl 0x841(%rsp,%rcx,1),%esi
 8.84 │ incq 0x40(%rsp,%rsi,8)
 │ movzbl 0x842(%rsp,%rcx,1),%esi
 11.26 │ incq 0x40(%rsp,%rsi,8)
 │ movzbl 0x843(%rsp,%rcx,1),%esi
 12.55 │ incq 0x40(%rsp,%rsi,8)
 │ movzbl 0x844(%rsp,%rcx,1),%esi
 13.67 │ incq 0x40(%rsp,%rsi,8)
 │ movzbl 0x845(%rsp,%rcx,1),%esi
 15.18 │ incq 0x40(%rsp,%rsi,8)
 0.16 │ movzbl 0x846(%rsp,%rcx,1),%esi
 11.99 │ incq 0x40(%rsp,%rsi,8)
 0.16 │ movzbl 0x847(%rsp,%rcx,1),%esi
 10.67 │ incq 0x40(%rsp,%rsi,8)
 │ add $0x8,%rcx
 │ cmp %rcx,%rdx
 │ ↑ jne 4c0

Depending on the CPU, this could have run a little faster if clang had used addq $1, (mem) instead of incq - with a memory destination, inc is 3 uops but add is only 2 on Intel. But apparently not with an indexed addressing mode: I get approximately equal counts for uops_issued.any either way. (Tested by compiling with -S and editing the resulting .s, then letting clang finish assembling + linking that into an executable.) So Clang's tuning choice here to use inc actually is optimal, unlike in some cases where it would be better to use add-immediate. Not that there's anything you can do about it with your C++ source, if tuning options don't correctly tune for the selected CPU.

perf record shows that 81% of the total cycles events were in main, the rest in various kernel functions. 0.13% was spent in libstdc++.so.6.0.33's std::basic_filebuf<char, std::char_traits<char> >::xsgetn(char*, long) function, presumably called from input.read(). And 0.26% in POSIX read(), the C system-call wrapper.

So C++ library overhead for reading input decreased from 70% to 0.13%, or 0.39% if we include the libc read wrapper function. And an inner loop over an array was optimized much better by clang, unrolling to amortize loop overhead for code that mostly hits in L1d cache (only a 0.8% miss rate according to perf stat --all-user -d).

Although it only runs at 1.36 instructions per cycle, not close to the 2 I was expecting, so maybe there's a uop-cache bottleneck from how 3 and 1 uop instructions pack into it. Anyway, that's getting into low-level tuning shenanigans that I find fun but aren't relevant in general.

(My choice of a..l and newline mean I'm only touching a few cache lines, not most of the array. So even after they do probably get evicted to L2 during the system-call for I/O where the kernel effectively does a memcpy from the pagecache to my buffer, only a couple cache misses happen, and they're spread out to make it even easier to OoO exec to hide them.)


The take-away here is that doing I/O in good-sized chunks like 32K and looping over that buffer can avoid a lot of overhead, and gives the compiler the opportunity to optimize the loop nicely. Even simple-seeming library functions like .get() have a lot of overhead compared to histogram[ch]++ when the count array hits in cache. And that's something which can't auto-vectorize, although it doesn't need to do any text parsing and string-to-int conversion (which actually can be vectorized with SIMD, but not easily or automatically).

answered Nov 12, 2024 at 18:12
  • I'm surprised the lock overhead is so high, especially on Linux where uncontended locks are supposed to be cheap. Quite an eye-opener! (Commented Nov 13, 2024 at 6:19)
  • @TobySpeight: There actually isn't any locking, just some load / compare / branch. No atomic RMW instructions execute, at least none in the hot code-paths that perf report highlighted. But if there were, it might make it another 2x or more slower. inc [mem] has 1/clock throughput on Intel, xchg m32, r32 has 18 or 17 cycle throughput, and is a full barrier. (7.5c on Zen 4 according to uops.info). Spinlock unlock just needs a plain mov store, but unlocking a mutex that can sleep (futex) when blocked will need another xchg RMW to not miss notifying a thread that just slept. (Commented Nov 13, 2024 at 6:26)
  • @TobySpeight: Anyway, if a lock/unlock required a system call (not "lightweight"), it would be at least 100x slower than that for the uncontended case. (With Spectre and Meltdown mitigation enabled, getting into the kernel and back costs at least a couple thousand cycles, probably more. Before that, at least a hundred.) A single histogram[ch]++ that hits in L1d (or even L2) cache is extremely cheap compared to even lightweight locking or even something like getchar_unlocked() or in this case istream::get() which has to check if there's a tied stream that needs flushing, manage buffer... (Commented Nov 13, 2024 at 6:29)
  • @TobySpeight: Now I understand why I didn't see any locking or a check for the program being single-threaded (like I've seen in some parts of glibc, e.g. by checking if libpthread was linked): iostreams aren't thread-safe; if you want locking you have to do it yourself. Which makes sense; cout << x << " " << y << '\n'; from multiple threads would be a total mess if each call locked separately, unlike with C stdout where a single printf can format a whole line of output and append it to the buffer with a lock held. (Commented Nov 13, 2024 at 7:36)
