C: function to read a token from stdin

Question 1

I recently started doing some competitive programming in C and one of my first requirements was a high-speed token reader (analogous to the java Scanner class' next() function). A few examples of input I am most likely to read are:

5
ccadd
bddcc

The integer/float inputs will be handled using atoi() and atof(), so all I need to develop is a function that will read words from stdin. Here's the first prototype:

#define BUF_SIZE (1 << 10) // approx 2 KiB or 1024 chars
char* next_token() {
 char* buf = malloc(BUF_SIZE * sizeof(char));
 char cc;
 // consume leading whitespaces
 while (isspace(cc=getchar())) ;
 buf[0] = cc;
 int i=1;
 int nofs = 1;
 while (!isspace(cc=getchar())) {
 if (i >= BUF_SIZE*nofs) {
 // gracefully extend buffer size
 nofs++;
 buf = realloc(buf, BUF_SIZE*nofs*sizeof(char));
 }
 buf[i] = cc;
 i++;
 }
 // trim buffer
 buf = realloc(buf, (i+1)*sizeof(char));
 buf[i] = '0円';
 return buf;
}
int main() {
 int T = atoi(next_token());
 while (T-- > 0) {
 char* word = next_token();
 // more logic here
 }
}

Two questions I had with this code are:

Is this fast enough? I think the major bottleneck lies with the realloc at the end, where I trim the length. If it's not fast enough, please suggest some optimizations.
Is this compliant with how C is generally written? I'm coming from Java and have little experience with C code. I write some embedded C but that's closer to assembly than it is to this type of code.

Any further improvements are welcome.

Question 2

How fast is fast enough? Do you have any measurable requirements?

Question 3

Can you quote the original programming challenge? Otherwise I don't consider this particularly off-topic, as long as this represents the solution in its entirety.

Question 4

Alignment

This will be an easy win - use aligned_alloc instead of malloc. This is only guaranteed to be available in the standard library as of C11, which you should be using anyway.

Exponential reallocation

This:

 // gracefully extend buffer size
 nofs++;
 buf = realloc(buf, BUF_SIZE*nofs*sizeof(char));

reallocates with linear growth. Memory is cheap and CPU time is expensive, so reallocate with exponential growth instead. Choosing a growth factor is a little more involved, but growth factors of 1.5 or 2 are not uncommon.

Inner assignment

Remove the assignment-in-condition from this:

while (isspace(cc=getchar())) ;

It does not make anything faster, and is a nasty bit of C syntax that makes code more difficult to read, maintain and debug.

Use a `for`

int i=1;
while (!isspace(cc=getchar())) {
 // ...
 i++;
}

can be

for (int i = 1; !isspace(cc); i++) {
 // ...
 cc = getchar();
}

noting that an initial getchar() would need to precede this loop.

Question 5

Thanks for the answer. Using C11 is not an option - I have to stick with C99. Also, why the preference for for over while? K&R seems to use inner assignment similar to what I have done.

Question 6

Using C11 is not an option - Fine; there are other options to do aligned allocation; I will leave this as an exercise to you. K&R seems to use inner assignment - That doesn't mean that it's a good idea :) Modern practice is to discourage such a pattern, and most languages do not support such a thing for good reason.

Question 7

why the preference for for over while? - Because in this case you have a loop counter, and a for is designed specifically to frame that situation.

Question 8

Why didn't you mention checking allocated memory for NULL?

Question 9

Plain old malloc is required to return memory aligned to alignof(max_align_t) even if the space requested is smaller. That should be plenty of alignment for a short string.

Question 10

It is doubtful that the program is so long that it can't all be included, but you have made an effort to comply with the Code Review guidelines. Just be aware that comments such as // more logic here or // ... will sometimes get the question votes to close.

Complexity
You're a Java programmer so I'm going to assume you understand object oriented programming principles. While the C programming language isn't object oriented, some of the principles can be applied such as the Single Responsibility Principle as applied to functions and modules. Therefore the current function is too complex because it does too much. Input should be in either in the calling function or next_token() should consist of 2 functions, one that does input and one that parses the input for tokens.

Error Handling
There are 2 kinds of errors that can occur in this program, the first is memory allocation errors and the second is input errors. The Xalloc() functions can fail if the system has insufficient memory, while this is rare on modern computers it can still happen, especially in an embedded environment with limited memory. A call to any of the memory allocation functions should always be followed by a test to see if the pointer to the memory is NULL or not. If the pointer is NULL then the memory allocation failed and somewhere in the code the program has to decide what to do, including reporting the memory allocation error.

char* next_token() {
 char* buf = malloc(BUF_SIZE * sizeof(*buf));
 if (buf == NULL)
 {
 fprintf(stderr, "Memory allocation failed in next_token");
 return buf;
 }
 char cc;
 // consume leading whitespaces
 while (isspace(cc=getchar())) ;
 buf[0] = cc;
 int i=1;
 int nofs = 1;
 while (!isspace(cc=getchar())) {
 if (i >= BUF_SIZE*nofs) {
 // gracefully extend buffer size
 nofs++;
 buf = realloc(buf, BUF_SIZE*nofs*sizeof(*buf));
 if (buf == NULL)
 {
 fprintf(stderr, "Memory allocation failed in next_token");
 return buf;
 }
 }
 buf[i] = cc;
 i++;
 }
 // trim buffer
 buf = realloc(buf, (i+1)*sizeof(*buf));
 if (buf == NULL)
 {
 fprintf(stderr, "Memory allocation failed in next_token");
 return buf;
 }
 buf[i] = '0円';
 return buf;
}

Please note that in the above code I changed sizeof(char) to sizeof(*buf). This makes the code more maintainable because the type of buf can be changed and the memory allocations don't require additional editing.

Input errors: If the user types in a CTRL-D on a Unix or Linux system the program will encounter an EOF (end of file) character. It currently can't handle that. This stackoverflow question covers that in more detail.

Character Input is Slow
Character input using getchar() is slower than using buffered input and processing character input rather than processing strings after they have been read is slower. Grab as many characters as you can using a fixed size buffer and a call to fgets(char *buffer, int buffer_size, FILE *stream). The function fgets() reads a line at a time buffer_size can be 1K, 2K or 4K or larger + 1 (most lines will be less than 1K). This reduces the memory allocation involved and reads the input faster. You will need a pointer that points to the string starting point after the token. Using fgets() in the main program or in a function that calls the tokenizer will also allow you to handle the EOF situation since fgets() only reads until the end of the file as well as the end of the line.

Question 11

I'll comment on C style:

#define BUF_SIZE (1 << 10) // approx 2 KiB or 1024 chars
```n
This comment makes no sense. A `char` in C is, by definition, 1 byte. `1 << 10` bytes is exactly 1024 `char`s. I suppose I can understand if you're coming from Java where `char` is a UTF-16 code unit.
```c
char* next_token() {
 char* buf = malloc(BUF_SIZE * sizeof(char));

Again, sizeof(char) is defined to be 1. malloc(BUF_SIZE) is sufficient. If you want your code to be robust against someday using, say, wchar_t instead of char, then idiomatic practice instead is to do char* buf = malloc(BUFSIZE * sizeof *buf);.

Also, you should verify that malloc succeeds.

 char cc;
 // consume leading whitespaces
 while (isspace(cc=getchar())) ;

Personally I'd break this up instead of embedding the assignment.

 int nofs = 1;

I can't decipher what this variable name means. "No filesytem"? "Number Fs"? "North of South"?

C is not so archaic that there is some tiny limit on lengths of variable names. Use descriptive names.

 buf = realloc(buf, BUF_SIZE*nofs*sizeof(char));

Others have already mentioned that you should grow your buffer exponentially.

x = realloc(x, ...) is an anti-pattern. Always assign to a temporary variable first; otherwise if realloc fails, you've lost your original pointer and will leak the memory.

As with malloc, sizeof(char) is useless, and you should check for realloc failure.

 // trim buffer
 buf = realloc(buf, (i+1)*sizeof(char));

Same thing here as before about realloc.

Reinderien Reinderien 70.9k5 gold badges76 silver badges256 bronze badges · Accepted Answer · 2020-06-13 15:31:57Z

Alignment

This will be an easy win - use aligned_alloc instead of malloc. This is only guaranteed to be available in the standard library as of C11, which you should be using anyway.

Exponential reallocation

This:

 // gracefully extend buffer size
 nofs++;
 buf = realloc(buf, BUF_SIZE*nofs*sizeof(char));

reallocates with linear growth. Memory is cheap and CPU time is expensive, so reallocate with exponential growth instead. Choosing a growth factor is a little more involved, but growth factors of 1.5 or 2 are not uncommon.

Inner assignment

Remove the assignment-in-condition from this:

while (isspace(cc=getchar())) ;

It does not make anything faster, and is a nasty bit of C syntax that makes code more difficult to read, maintain and debug.

Use a `for`

int i=1;
while (!isspace(cc=getchar())) {
 // ...
 i++;
}

can be

for (int i = 1; !isspace(cc); i++) {
 // ...
 cc = getchar();
}

noting that an initial getchar() would need to precede this loop.

Thanks for the answer. Using C11 is not an option - I have to stick with C99. Also, why the preference for for over while? K&R seems to use inner assignment similar to what I have done.
Using C11 is not an option - Fine; there are other options to do aligned allocation; I will leave this as an exercise to you. K&R seems to use inner assignment - That doesn't mean that it's a good idea :) Modern practice is to discourage such a pattern, and most languages do not support such a thing for good reason.
why the preference for for over while? - Because in this case you have a loop counter, and a for is designed specifically to frame that situation.
Why didn't you mention checking allocated memory for NULL?
Plain old malloc is required to return memory aligned to alignof(max_align_t) even if the space requested is smaller. That should be plenty of alignment for a short string.

Stack Exchange Network

C: function to read a token from stdin

3 Answers 3

Alignment

Exponential reallocation

Inner assignment

Use a `for`

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

C: function to read a token from stdin

3 Answers 3

Alignment

Exponential reallocation

Inner assignment

Use a for

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

Use a `for`