I recently started doing some competitive programming in C and one of my first requirements was a high-speed token reader (analogous to the java Scanner
class' next()
function). A few examples of input I am most likely to read are:
5
ccadd
bddcc
5 4 1
1 2 5
2 3 7
3 4 8
4 5 2
2 3
The integer/float inputs will be handled using atoi()
and atof()
, so all I need to develop is a function that will read words from stdin
. Here's the first prototype:
#define BUF_SIZE (1 << 10) // approx 2 KiB or 1024 chars
char* next_token() {
char* buf = malloc(BUF_SIZE * sizeof(char));
char cc;
// consume leading whitespaces
while (isspace(cc=getchar())) ;
buf[0] = cc;
int i=1;
int nofs = 1;
while (!isspace(cc=getchar())) {
if (i >= BUF_SIZE*nofs) {
// gracefully extend buffer size
nofs++;
buf = realloc(buf, BUF_SIZE*nofs*sizeof(char));
}
buf[i] = cc;
i++;
}
// trim buffer
buf = realloc(buf, (i+1)*sizeof(char));
buf[i] = '0円';
return buf;
}
int main() {
int T = atoi(next_token());
while (T-- > 0) {
char* word = next_token();
// more logic here
}
}
Two questions I had with this code are:
- Is this fast enough? I think the major bottleneck lies with the
realloc
at the end, where I trim the length. If it's not fast enough, please suggest some optimizations. - Is this compliant with how C is generally written? I'm coming from Java and have little experience with C code. I write some embedded C but that's closer to assembly than it is to this type of code.
Any further improvements are welcome.
-
1\$\begingroup\$ How fast is fast enough? Do you have any measurable requirements? \$\endgroup\$πάντα ῥεῖ– πάντα ῥεῖ2020年06月13日 13:36:13 +00:00Commented Jun 13, 2020 at 13:36
-
\$\begingroup\$ Can you quote the original programming challenge? Otherwise I don't consider this particularly off-topic, as long as this represents the solution in its entirety. \$\endgroup\$Reinderien– Reinderien2020年06月13日 15:10:37 +00:00Commented Jun 13, 2020 at 15:10
3 Answers 3
Alignment
This will be an easy win - use aligned_alloc
instead of malloc
. This is only guaranteed to be available in the standard library as of C11, which you should be using anyway.
Exponential reallocation
This:
// gracefully extend buffer size
nofs++;
buf = realloc(buf, BUF_SIZE*nofs*sizeof(char));
reallocates with linear growth. Memory is cheap and CPU time is expensive, so reallocate with exponential growth instead. Choosing a growth factor is a little more involved, but growth factors of 1.5 or 2 are not uncommon.
Inner assignment
Remove the assignment-in-condition from this:
while (isspace(cc=getchar())) ;
It does not make anything faster, and is a nasty bit of C syntax that makes code more difficult to read, maintain and debug.
Use a for
int i=1;
while (!isspace(cc=getchar())) {
// ...
i++;
}
can be
for (int i = 1; !isspace(cc); i++) {
// ...
cc = getchar();
}
noting that an initial getchar()
would need to precede this loop.
-
1\$\begingroup\$ Thanks for the answer. Using C11 is not an option - I have to stick with C99. Also, why the preference for
for
overwhile
? K&R seems to use inner assignment similar to what I have done. \$\endgroup\$Aniruddha Deb– Aniruddha Deb2020年06月13日 15:44:18 +00:00Commented Jun 13, 2020 at 15:44 -
\$\begingroup\$ Using C11 is not an option - Fine; there are other options to do aligned allocation; I will leave this as an exercise to you. K&R seems to use inner assignment - That doesn't mean that it's a good idea :) Modern practice is to discourage such a pattern, and most languages do not support such a thing for good reason. \$\endgroup\$Reinderien– Reinderien2020年06月13日 15:48:36 +00:00Commented Jun 13, 2020 at 15:48
-
\$\begingroup\$ why the preference for for over while? - Because in this case you have a loop counter, and a
for
is designed specifically to frame that situation. \$\endgroup\$Reinderien– Reinderien2020年06月13日 15:49:53 +00:00Commented Jun 13, 2020 at 15:49 -
\$\begingroup\$ Why didn't you mention checking allocated memory for
NULL
? \$\endgroup\$2020年06月13日 18:29:07 +00:00Commented Jun 13, 2020 at 18:29 -
1\$\begingroup\$ Plain old
malloc
is required to return memory aligned toalignof(max_align_t)
even if the space requested is smaller. That should be plenty of alignment for a short string. \$\endgroup\$zwol– zwol2020年06月14日 02:13:40 +00:00Commented Jun 14, 2020 at 2:13
It is doubtful that the program is so long that it can't all be included, but you have made an effort to comply with the Code Review guidelines. Just be aware that comments such as // more logic here
or // ...
will sometimes get the question votes to close.
Complexity
You're a Java programmer so I'm going to assume you understand object oriented programming principles. While the C programming language isn't object oriented, some of the principles can be applied such as the Single Responsibility Principle as applied to functions and modules. Therefore the current function is too complex because it does too much. Input should be in either in the calling function or next_token()
should consist of 2 functions, one that does input and one that parses the input for tokens.
Error Handling
There are 2 kinds of errors that can occur in this program, the first is memory allocation errors and the second is input errors. The Xalloc()
functions can fail if the system has insufficient memory, while this is rare on modern computers it can still happen, especially in an embedded environment with limited memory. A call to any of the memory allocation functions should always be followed by a test to see if the pointer to the memory is NULL
or not. If the pointer is NULL
then the memory allocation failed and somewhere in the code the program has to decide what to do, including reporting the memory allocation error.
char* next_token() {
char* buf = malloc(BUF_SIZE * sizeof(*buf));
if (buf == NULL)
{
fprintf(stderr, "Memory allocation failed in next_token");
return buf;
}
char cc;
// consume leading whitespaces
while (isspace(cc=getchar())) ;
buf[0] = cc;
int i=1;
int nofs = 1;
while (!isspace(cc=getchar())) {
if (i >= BUF_SIZE*nofs) {
// gracefully extend buffer size
nofs++;
buf = realloc(buf, BUF_SIZE*nofs*sizeof(*buf));
if (buf == NULL)
{
fprintf(stderr, "Memory allocation failed in next_token");
return buf;
}
}
buf[i] = cc;
i++;
}
// trim buffer
buf = realloc(buf, (i+1)*sizeof(*buf));
if (buf == NULL)
{
fprintf(stderr, "Memory allocation failed in next_token");
return buf;
}
buf[i] = '0円';
return buf;
}
Please note that in the above code I changed sizeof(char)
to sizeof(*buf)
. This makes the code more maintainable because the type of buf
can be changed and the memory allocations don't require additional editing.
Input errors: If the user types in a CTRL-D
on a Unix or Linux system the program will encounter an EOF (end of file) character. It currently can't handle that. This stackoverflow question covers that in more detail.
Character Input is Slow
Character input using getchar()
is slower than using buffered input and processing character input rather than processing strings after they have been read is slower. Grab as many characters as you can using a fixed size buffer and a call to fgets(char *buffer, int buffer_size, FILE *stream). The function fgets()
reads a line at a time buffer_size
can be 1K, 2K or 4K or larger + 1 (most lines will be less than 1K). This reduces the memory allocation involved and reads the input faster. You will need a pointer that points to the string starting point after the token. Using fgets()
in the main program or in a function that calls the tokenizer will also allow you to handle the EOF situation since fgets()
only reads until the end of the file as well as the end of the line.
I'll comment on C style:
#define BUF_SIZE (1 << 10) // approx 2 KiB or 1024 chars
```n
This comment makes no sense. A `char` in C is, by definition, 1 byte. `1 << 10` bytes is exactly 1024 `char`s. I suppose I can understand if you're coming from Java where `char` is a UTF-16 code unit.
```c
char* next_token() {
char* buf = malloc(BUF_SIZE * sizeof(char));
Again, sizeof(char)
is defined to be 1. malloc(BUF_SIZE)
is sufficient. If you want your code to be robust against someday using, say, wchar_t
instead of char
, then idiomatic practice instead is to do char* buf = malloc(BUFSIZE * sizeof *buf);
.
Also, you should verify that malloc
succeeds.
char cc;
// consume leading whitespaces
while (isspace(cc=getchar())) ;
Personally I'd break this up instead of embedding the assignment.
int nofs = 1;
I can't decipher what this variable name means. "No filesytem"? "Number Fs"? "North of South"?
C is not so archaic that there is some tiny limit on lengths of variable names. Use descriptive names.
buf = realloc(buf, BUF_SIZE*nofs*sizeof(char));
Others have already mentioned that you should grow your buffer exponentially.
x = realloc(x, ...)
is an anti-pattern. Always assign to a temporary variable first; otherwise if realloc
fails, you've lost your original pointer and will leak the memory.
As with malloc
, sizeof(char)
is useless, and you should check for realloc
failure.
// trim buffer
buf = realloc(buf, (i+1)*sizeof(char));
Same thing here as before about realloc
.