Assignment on reading from stdin in C

Question 1

I am an experienced Python developer learning C. I solved this assignement:

Write a program that reads a word and some sentences from stdin. The word is separated by space from a sentences. The sentences are separated by one of the following characters: .!?. Words in sentences are separated by any char for which one of library functions isspace and ispunct returns nonzero result.

Your program should output average usage of the word in the sentences (a number of matched words divided by a number of sentences).

Example input:

two one four two seven eight two four two. two, one, nine, two, two, three, two!

Example expected output: 3.5

Explanation: The word is "two". Then we have two sentences: "one four two seven eight two four two." and "two, one, nine, two, two, three, two!". The word occurs 7 times, so result is 7/2 = 3.5.

Here is my solution:

# include <ctype.h>
# include <stdbool.h>
# include <stdio.h>
# include <stdlib.h>
# include <unistd.h>
void check_malloc_result(void* malloc_result) {
 if (!malloc_result) {
 perror("malloc failed");
 exit(EXIT_FAILURE);
 }
}
int main(void) {
 char stdin_ch;
 size_t buff_size = 8;
 size_t word_len = 0;
 bool reading_word = false;
 char* word = malloc(buff_size);
 check_malloc_result(word);
 // reading the word
 while(read(STDIN_FILENO, &stdin_ch, 1) > 0) {
 if (isspace(stdin_ch)) {
 if (reading_word) {
 break;
 } else {
 continue;
 }
 }
 reading_word = true;
 ++word_len;
 if (word_len > buff_size) {
 buff_size *= 2;
 word = realloc(word, buff_size);
 check_malloc_result(word);
 }
 word[word_len - 1] = stdin_ch;
 }
 // counting matches
 size_t sen_count = 0, matched_count = 0, cur_word_pos = 0;
 bool matching = true;
 reading_word = false;
 while(word_len && read(STDIN_FILENO, &stdin_ch, 1) > 0) {
 if (ispunct(stdin_ch) || isspace(stdin_ch)) {
 if (reading_word) {
 // end of a word
 if (matching) {
 ++matched_count;
 }
 if (stdin_ch == '.' || stdin_ch == '?' || stdin_ch == '!') {
 ++sen_count;
 }
 reading_word = false;
 matching = true;
 cur_word_pos = 0;
 }
 continue;
 }
 reading_word = true;
 if (matching) {
 if (cur_word_pos >= word_len || word[cur_word_pos] != stdin_ch) {
 matching = false;
 }
 ++cur_word_pos;
 }
 }
 free(word);
 // printing the result
 double result = 0.0;
 if (sen_count) {
 result = (double)matched_count / (double)sen_count;
 }
 printf("%f", result);
 return EXIT_SUCCESS;
}

Some notes:

I don't want to use fixed max length for the word so I use dynamic allocation here.
The reading of sentences is done using two boolean flags. reading_word means are actualy reading a word, and if it's false, we are somewhere in between. It's used for dealing with consecutive separators. matching becomes false when we know that current scanned word doesn't match, so we just scan to the end of the word, not checking anything.

What could be better in this code? Is it "idiomatic" C, or is it obvious that it's written by someone without much language experience?

Question 2

First, some positive notes. I got no warnings when compiled with gcc using -Wall -Wextra -pedantic. The code seems to work correctly, although the user has to signal EOF from the keyboard, or use file redirection (some warning about this in the question would have been nice.) It is good to see you using size_t for array indices, good that you are checking the results of your functions (i.e., malloc and realloc), and good that you are remembering to free allocated memory.

I like that you do not cast the result of malloc, though there are those that would disagree on that point.

Using `realloc`

Concerning realloc, you are directly assigning the result of realloc to the pointer to the memory which is being reallocated. This is bad, since realloc can return a NULL pointer in the event of an error. This would cause a memory leak because you would no longer have a pointer to the previous allocation. You should instead store the result of realloc in a temporary variable, and test that before reassigning:

char *temp;
/* ... */
temp = realloc(word, buff_size);
check_malloc_result(temp);
word = temp;

I don't really understand the need for dynamic allocation here. You only allocate storage for one word; this is not a lot of memory. The method of allocating a reasonable space, and then growing it as needed is sound, but note that if the first word is 17 characters, then you will have to allocate for 32 characters. This is a lot of waste, if you are concerned about space at this scale. You could allocate for one extra character at a time, as needed. I would recommend allocating for a larger initial value (say, 100 chars); if you want to handle possible very large first words, then you can grow this allocation dynamically. Then, if you really want to trim the memory, you can realloc to the actual length of the word after you have read it.

Using `getchar`

It would be more idiomatic to use getchar instead of read to read from stdin. To use getchar here you need to change stdin_ch to int type because getchar returns an int. Specifically, getchar returns EOF in some circumstances, and EOF is a negative int value which may not fit in a char.

Since the specification says that the only separators are spaces and the three punctuation characters, I would consider using '\n' as a way to terminate input to make keyboard entry more friendly:

int stdin_ch;
/* ... */
while((stdin_ch = getchar()) != '\n' && stdin_ch != EOF) {
/* ... */
while(word_len && (stdin_ch = getchar()) != '\n' && stdin_ch != EOF) {

Without the test for newlines, this looks much simpler than the read version of the same code:

int stdin_ch;
/* ... */
while((stdin_ch = getchar()) != EOF) {
/* ... */
while(word_len && (stdin_ch = getchar()) != EOF) {

Using `ctype.h` Functions

The character classification functions in ctype.h take int arguments, but the C Standard requires that these int values be representable as unsigned char. Attempting to pass values outside of the unsigned char range leads to undefined behavior. To be safe and correct you need to ensure that values passed to isspace, ispunct, etc., are in this range. This can be accomplished by casting to unsigned char:

if (isspace((unsigned char)stdin_ch)) {
/* ... */
if (ispunct((unsigned char)stdin_ch) || isspace((unsigned char)stdin_ch)) {

Alternatively, you may find it convenient to do something like:

int uch = (unsigned char)stdin_ch;
if (ispunct(uch) || isspace(uch)) {

This is less noisy and documents clearly that an int value in the range of unsigned char is being passed to ispunct and isspace.

Other Miscellany

When you calculate result, you don't need to use two casts; the first will do:

result = (double)matched_count / sen_count;

Or, you can do away with the cast altogether by multiplying by 1.0; Note that a floating constant without a suffix is of type double:

result = (1.0 * matched_count) / sen_count;

I personally lean towards the second approach as I prefer to avoid casts whenever possible.

I also added a newline to the format string of your final printf statement, since this is nicer for the user; you might have a reason not to do this, though.

Finally, a note on the indirection operator: opinions vary, but it seems most common to use "right-leaning" asterisks. This is purely a matter of style, but I prefer:

int *var;

to the "left-leaning" version:

int* var;

The only place that I make an exception is in function declarations:

int * func(int *ptr);

There are good reasons for both styles; consistency trumps all in this case.

Question 3

The only people who would disagree about casting the result of malloc are C++ programmers wearing a C hat. :-) Otherwise, a fantastic answer that hits on a lot of important points. The right-leaning asterisks seem to be more popular with C programmers, while C++ programmers prefer to associate the asterisk with the type.

Question 4

@CodyGray-- Thanks; I sympathize with left-leaning asterisk folks because of the emphasis on type, and that actually sounds compelling to me. But it is just hard for me to look at C code that actually uses this style!

Question 5

Thanks a lot for the answer. Very good points about realloc, allocation strategy and getchar. I didn't use newlines, becase the assignement is automatically checked, I'd added their support otherwise. And I'll stick to right-leaning if it's more common in C code (and as I see it is). Thank you again.

Question 6

Allocation can get a bit tricky at times, so it is good to get the fundamentals right, like checking for errors. When you read about the functions that you use (any functions), pay special attention to what they return (or what flags they set) in the event of an error. Using this information in a program may seem like extra work, but it will save you from many headaches, and is the only way to write robust code.

Question 7

What you have looks pretty good, but a few additional suggestions:

Error Handling

Nice to see checks for the return value of malloc(). You put them in a function called check_malloc_result(), but you’d want to do this for every system call that can fail unrecoverably. Here is the boilerplate I’ve been using for years:

Minimal subset of the header file:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void fatal_error_helper( const char* msg, const char* sourcefile, int lineno, const char* syserr );
#define fatal_system_error(m) \
 fatal_error_helper( (m), __FILE__, __LINE__, strerror(errno) )

Then in another .c file:

void fatal_error_helper( const char* const msg,
 const char* const sourcefile,
 const int lineno,
 const char * const syserr )
{
 fflush(stdout); /* Don't cross the streams! */
 fprintf( stderr,
 "%s at %s:%d: %s. Program terminated.\n",
 msg, sourcefile, lineno, syserr
 );
 exit(EXIT_FAILURE);
}

You use it for any system call that sets errno, so for example:

void* const word = malloc(SIZE_MAX);
if (!word) {
 fatal_system_error("Failed to allocate buffer");
}

This flushes any output waiting to be sent to standard output, then terminates with an message to standard error that includes the line of code it was called from, in this case:

Failed to allocate buffer at /app/example.cpp:32: Cannot allocate memory.

You can use it to check your input stream for errors too, which you should also do.

The I/O Library

You use UNIX functions like read() rather than <stdio.h>. If this is meant to run on UNIX-like OSes, and does not need to be portable to anything else, POSIX, Linux and MacOS all support getline(), which allocates the dynamic input buffer for you.

However, you might prefer to used a fixed-size buffer, and copy words from that into a string table.

The Parsing

If you can put an upper limit on the length of a word, and you only need to support the default C locale, you might want to read in words all at once. (The input today is probably in UTF-8, which makes isspace() and ispunct() useless for non-ASCII characters anyway.) In traditional C, this might look like,

errno = 0; // So that any error later must have been set by scanf().
/* input_buf NUST have space for the terminating null character!
 * So it is an array of type char[4096], declared outside the while loop.
 */
const int chars_read = scanf(
 "%4095[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghiklmnopqrstuvwyz123456789_]", // Possibly not a complete set.
 input_buf);
if (ferror(stdin)) {
 fatal_system_error("Reading word from input failed");
}

Similarly, you can skip any substring of word separators with an %*[...] character-class argument to scanf().

You can then look up and increment the count of each word from input_buf in a hash map, in constant time.

In modern C, you could use regex.

Modularize as Appropriate

If you keep the current approach, I recommend splitting off the logic to resize the buffer into a different module, which has some kind of String data structure and a String_append() function to add more chunks of input to it. This function is the where you should call realloc().

The state machine also looks like something that could become a function that returns a data structure. Or even two: a tokenizer and an inserter.

Question 8

FWIW: Somewhere I read the longest "word" in English is 47 characters long. Were this assignment to work with German "words" also, 47 would be regarded as a good beginning, but ultimately woefully inadequate.

ad absurdum ad absurdum 8941 gold badge10 silver badges21 bronze badges · Accepted Answer · 2017-01-12 13:53:03Z

First, some positive notes. I got no warnings when compiled with gcc using -Wall -Wextra -pedantic. The code seems to work correctly, although the user has to signal EOF from the keyboard, or use file redirection (some warning about this in the question would have been nice.) It is good to see you using size_t for array indices, good that you are checking the results of your functions (i.e., malloc and realloc), and good that you are remembering to free allocated memory.

I like that you do not cast the result of malloc, though there are those that would disagree on that point.

Using `realloc`

Concerning realloc, you are directly assigning the result of realloc to the pointer to the memory which is being reallocated. This is bad, since realloc can return a NULL pointer in the event of an error. This would cause a memory leak because you would no longer have a pointer to the previous allocation. You should instead store the result of realloc in a temporary variable, and test that before reassigning:

char *temp;
/* ... */
temp = realloc(word, buff_size);
check_malloc_result(temp);
word = temp;

I don't really understand the need for dynamic allocation here. You only allocate storage for one word; this is not a lot of memory. The method of allocating a reasonable space, and then growing it as needed is sound, but note that if the first word is 17 characters, then you will have to allocate for 32 characters. This is a lot of waste, if you are concerned about space at this scale. You could allocate for one extra character at a time, as needed. I would recommend allocating for a larger initial value (say, 100 chars); if you want to handle possible very large first words, then you can grow this allocation dynamically. Then, if you really want to trim the memory, you can realloc to the actual length of the word after you have read it.

Using `getchar`

It would be more idiomatic to use getchar instead of read to read from stdin. To use getchar here you need to change stdin_ch to int type because getchar returns an int. Specifically, getchar returns EOF in some circumstances, and EOF is a negative int value which may not fit in a char.

Since the specification says that the only separators are spaces and the three punctuation characters, I would consider using '\n' as a way to terminate input to make keyboard entry more friendly:

int stdin_ch;
/* ... */
while((stdin_ch = getchar()) != '\n' && stdin_ch != EOF) {
/* ... */
while(word_len && (stdin_ch = getchar()) != '\n' && stdin_ch != EOF) {

Without the test for newlines, this looks much simpler than the read version of the same code:

int stdin_ch;
/* ... */
while((stdin_ch = getchar()) != EOF) {
/* ... */
while(word_len && (stdin_ch = getchar()) != EOF) {

Using `ctype.h` Functions

The character classification functions in ctype.h take int arguments, but the C Standard requires that these int values be representable as unsigned char. Attempting to pass values outside of the unsigned char range leads to undefined behavior. To be safe and correct you need to ensure that values passed to isspace, ispunct, etc., are in this range. This can be accomplished by casting to unsigned char:

if (isspace((unsigned char)stdin_ch)) {
/* ... */
if (ispunct((unsigned char)stdin_ch) || isspace((unsigned char)stdin_ch)) {

Alternatively, you may find it convenient to do something like:

int uch = (unsigned char)stdin_ch;
if (ispunct(uch) || isspace(uch)) {

This is less noisy and documents clearly that an int value in the range of unsigned char is being passed to ispunct and isspace.

Other Miscellany

When you calculate result, you don't need to use two casts; the first will do:

result = (double)matched_count / sen_count;

Or, you can do away with the cast altogether by multiplying by 1.0; Note that a floating constant without a suffix is of type double:

result = (1.0 * matched_count) / sen_count;

I personally lean towards the second approach as I prefer to avoid casts whenever possible.

I also added a newline to the format string of your final printf statement, since this is nicer for the user; you might have a reason not to do this, though.

Finally, a note on the indirection operator: opinions vary, but it seems most common to use "right-leaning" asterisks. This is purely a matter of style, but I prefer:

int *var;

to the "left-leaning" version:

int* var;

The only place that I make an exception is in function declarations:

int * func(int *ptr);

There are good reasons for both styles; consistency trumps all in this case.

The only people who would disagree about casting the result of malloc are C++ programmers wearing a C hat. :-) Otherwise, a fantastic answer that hits on a lot of important points. The right-leaning asterisks seem to be more popular with C programmers, while C++ programmers prefer to associate the asterisk with the type.
@CodyGray-- Thanks; I sympathize with left-leaning asterisk folks because of the emphasis on type, and that actually sounds compelling to me. But it is just hard for me to look at C code that actually uses this style!
Thanks a lot for the answer. Very good points about realloc, allocation strategy and getchar. I didn't use newlines, becase the assignement is automatically checked, I'd added their support otherwise. And I'll stick to right-leaning if it's more common in C code (and as I see it is). Thank you again.
Allocation can get a bit tricky at times, so it is good to get the fundamentals right, like checking for errors. When you read about the functions that you use (any functions), pay special attention to what they return (or what flags they set) in the event of an error. Using this information in a program may seem like extra work, but it will save you from many headaches, and is the only way to write robust code.

Stack Exchange Network

Assignment on reading from stdin in C

2 Answers 2

Using `realloc`

Using `getchar`

Using `ctype.h` Functions

Other Miscellany

Error Handling

The I/O Library

The Parsing

Modularize as Appropriate

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Assignment on reading from stdin in C

2 Answers 2

Using realloc

Using getchar

Using ctype.h Functions

Other Miscellany

Error Handling

The I/O Library

The Parsing

Modularize as Appropriate

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

Using `realloc`

Using `getchar`

Using `ctype.h` Functions