I am an experienced Python developer learning C. I solved this assignement:
Write a program that reads a word and some sentences from stdin. The word is separated by space from a sentences. The sentences are separated by one of the following characters:
.!?
. Words in sentences are separated by any char for which one of library functionsisspace
andispunct
returns nonzero result.Your program should output average usage of the word in the sentences (a number of matched words divided by a number of sentences).
Example input:
two one four two seven eight two four two. two, one, nine, two, two, three, two!
Example expected output:
3.5
Explanation: The word is "two". Then we have two sentences: "one four two seven eight two four two." and "two, one, nine, two, two, three, two!". The word occurs 7 times, so result is 7/2 = 3.5.
Here is my solution:
# include <ctype.h>
# include <stdbool.h>
# include <stdio.h>
# include <stdlib.h>
# include <unistd.h>
void check_malloc_result(void* malloc_result) {
if (!malloc_result) {
perror("malloc failed");
exit(EXIT_FAILURE);
}
}
int main(void) {
char stdin_ch;
size_t buff_size = 8;
size_t word_len = 0;
bool reading_word = false;
char* word = malloc(buff_size);
check_malloc_result(word);
// reading the word
while(read(STDIN_FILENO, &stdin_ch, 1) > 0) {
if (isspace(stdin_ch)) {
if (reading_word) {
break;
} else {
continue;
}
}
reading_word = true;
++word_len;
if (word_len > buff_size) {
buff_size *= 2;
word = realloc(word, buff_size);
check_malloc_result(word);
}
word[word_len - 1] = stdin_ch;
}
// counting matches
size_t sen_count = 0, matched_count = 0, cur_word_pos = 0;
bool matching = true;
reading_word = false;
while(word_len && read(STDIN_FILENO, &stdin_ch, 1) > 0) {
if (ispunct(stdin_ch) || isspace(stdin_ch)) {
if (reading_word) {
// end of a word
if (matching) {
++matched_count;
}
if (stdin_ch == '.' || stdin_ch == '?' || stdin_ch == '!') {
++sen_count;
}
reading_word = false;
matching = true;
cur_word_pos = 0;
}
continue;
}
reading_word = true;
if (matching) {
if (cur_word_pos >= word_len || word[cur_word_pos] != stdin_ch) {
matching = false;
}
++cur_word_pos;
}
}
free(word);
// printing the result
double result = 0.0;
if (sen_count) {
result = (double)matched_count / (double)sen_count;
}
printf("%f", result);
return EXIT_SUCCESS;
}
Some notes:
- I don't want to use fixed max length for the word so I use dynamic allocation here.
- The reading of sentences is done using two boolean flags.
reading_word
means are actualy reading a word, and if it's false, we are somewhere in between. It's used for dealing with consecutive separators.matching
becomes false when we know that current scanned word doesn't match, so we just scan to the end of the word, not checking anything.
What could be better in this code? Is it "idiomatic" C, or is it obvious that it's written by someone without much language experience?
2 Answers 2
First, some positive notes. I got no warnings when compiled with gcc using -Wall -Wextra -pedantic
. The code seems to work correctly, although the user has to signal EOF
from the keyboard, or use file redirection (some warning about this in the question would have been nice.) It is good to see you using size_t
for array indices, good that you are checking the results of your functions (i.e., malloc
and realloc
), and good that you are remembering to free allocated memory.
I like that you do not cast the result of malloc
, though there are those that would disagree on that point.
Using realloc
Concerning realloc
, you are directly assigning the result of realloc
to the pointer to the memory which is being reallocated. This is bad, since realloc
can return a NULL
pointer in the event of an error. This would cause a memory leak because you would no longer have a pointer to the previous allocation. You should instead store the result of realloc
in a temporary variable, and test that before reassigning:
char *temp;
/* ... */
temp = realloc(word, buff_size);
check_malloc_result(temp);
word = temp;
I don't really understand the need for dynamic allocation here. You only allocate storage for one word; this is not a lot of memory. The method of allocating a reasonable space, and then growing it as needed is sound, but note that if the first word is 17 characters, then you will have to allocate for 32 characters. This is a lot of waste, if you are concerned about space at this scale. You could allocate for one extra character at a time, as needed. I would recommend allocating for a larger initial value (say, 100 char
s); if you want to handle possible very large first words, then you can grow this allocation dynamically. Then, if you really want to trim the memory, you can realloc
to the actual length of the word after you have read it.
Using getchar
It would be more idiomatic to use getchar
instead of read
to read from stdin
. To use getchar
here you need to change stdin_ch
to int
type because getchar
returns an int
. Specifically, getchar
returns EOF
in some circumstances, and EOF
is a negative int
value which may not fit in a char
.
Since the specification says that the only separators are spaces and the three punctuation characters, I would consider using '\n'
as a way to terminate input to make keyboard entry more friendly:
int stdin_ch;
/* ... */
while((stdin_ch = getchar()) != '\n' && stdin_ch != EOF) {
/* ... */
while(word_len && (stdin_ch = getchar()) != '\n' && stdin_ch != EOF) {
Without the test for newlines, this looks much simpler than the read
version of the same code:
int stdin_ch;
/* ... */
while((stdin_ch = getchar()) != EOF) {
/* ... */
while(word_len && (stdin_ch = getchar()) != EOF) {
Using ctype.h
Functions
The character classification functions in ctype.h
take int
arguments, but the C Standard requires that these int
values be representable as unsigned char
. Attempting to pass values outside of the unsigned char
range leads to undefined behavior. To be safe and correct you need to ensure that values passed to isspace
, ispunct
, etc., are in this range. This can be accomplished by casting to unsigned char
:
if (isspace((unsigned char)stdin_ch)) {
/* ... */
if (ispunct((unsigned char)stdin_ch) || isspace((unsigned char)stdin_ch)) {
Alternatively, you may find it convenient to do something like:
int uch = (unsigned char)stdin_ch;
if (ispunct(uch) || isspace(uch)) {
This is less noisy and documents clearly that an int
value in the range of unsigned char
is being passed to ispunct
and isspace
.
Other Miscellany
When you calculate result
, you don't need to use two casts; the first will do:
result = (double)matched_count / sen_count;
Or, you can do away with the cast altogether by multiplying by 1.0
; Note that a floating constant without a suffix is of type double
:
result = (1.0 * matched_count) / sen_count;
I personally lean towards the second approach as I prefer to avoid casts whenever possible.
I also added a newline to the format string of your final printf
statement, since this is nicer for the user; you might have a reason not to do this, though.
Finally, a note on the indirection operator: opinions vary, but it seems most common to use "right-leaning" asterisks. This is purely a matter of style, but I prefer:
int *var;
to the "left-leaning" version:
int* var;
The only place that I make an exception is in function declarations:
int * func(int *ptr);
There are good reasons for both styles; consistency trumps all in this case.
-
1\$\begingroup\$ The only people who would disagree about casting the result of
malloc
are C++ programmers wearing a C hat. :-) Otherwise, a fantastic answer that hits on a lot of important points. The right-leaning asterisks seem to be more popular with C programmers, while C++ programmers prefer to associate the asterisk with the type. \$\endgroup\$Cody Gray– Cody Gray2017年01月12日 13:58:11 +00:00Commented Jan 12, 2017 at 13:58 -
\$\begingroup\$ @CodyGray-- Thanks; I sympathize with left-leaning asterisk folks because of the emphasis on type, and that actually sounds compelling to me. But it is just hard for me to look at C code that actually uses this style! \$\endgroup\$ad absurdum– ad absurdum2017年01月12日 14:10:57 +00:00Commented Jan 12, 2017 at 14:10
-
\$\begingroup\$ Thanks a lot for the answer. Very good points about
realloc
, allocation strategy andgetchar
. I didn't use newlines, becase the assignement is automatically checked, I'd added their support otherwise. And I'll stick to right-leaning if it's more common in C code (and as I see it is). Thank you again. \$\endgroup\$DrTyrsa– DrTyrsa2017年01月13日 07:45:12 +00:00Commented Jan 13, 2017 at 7:45 -
1\$\begingroup\$ Allocation can get a bit tricky at times, so it is good to get the fundamentals right, like checking for errors. When you read about the functions that you use (any functions), pay special attention to what they return (or what flags they set) in the event of an error. Using this information in a program may seem like extra work, but it will save you from many headaches, and is the only way to write robust code. \$\endgroup\$ad absurdum– ad absurdum2017年01月13日 19:41:19 +00:00Commented Jan 13, 2017 at 19:41
What you have looks pretty good, but a few additional suggestions:
Error Handling
Nice to see checks for the return value of malloc()
. You put them in a function called check_malloc_result()
, but you’d want to do this for every system call that can fail unrecoverably. Here is the boilerplate I’ve been using for years:
Minimal subset of the header file:
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void fatal_error_helper( const char* msg, const char* sourcefile, int lineno, const char* syserr );
#define fatal_system_error(m) \
fatal_error_helper( (m), __FILE__, __LINE__, strerror(errno) )
Then in another .c
file:
void fatal_error_helper( const char* const msg,
const char* const sourcefile,
const int lineno,
const char * const syserr )
{
fflush(stdout); /* Don't cross the streams! */
fprintf( stderr,
"%s at %s:%d: %s. Program terminated.\n",
msg, sourcefile, lineno, syserr
);
exit(EXIT_FAILURE);
}
You use it for any system call that sets errno, so for example:
void* const word = malloc(SIZE_MAX);
if (!word) {
fatal_system_error("Failed to allocate buffer");
}
This flushes any output waiting to be sent to standard output, then terminates with an message to standard error that includes the line of code it was called from, in this case:
Failed to allocate buffer at /app/example.cpp:32: Cannot allocate memory.
You can use it to check your input stream for errors too, which you should also do.
The I/O Library
You use UNIX functions like read()
rather than <stdio.h>
. If this is meant to run on UNIX-like OSes, and does not need to be portable to anything else, POSIX, Linux and MacOS all support getline()
, which allocates the dynamic input buffer for you.
However, you might prefer to used a fixed-size buffer, and copy words from that into a string table.
The Parsing
If you can put an upper limit on the length of a word, and you only need to support the default C locale, you might want to read in words all at once. (The input today is probably in UTF-8, which makes isspace()
and ispunct()
useless for non-ASCII characters anyway.) In traditional C, this might look like,
errno = 0; // So that any error later must have been set by scanf().
/* input_buf NUST have space for the terminating null character!
* So it is an array of type char[4096], declared outside the while loop.
*/
const int chars_read = scanf(
"%4095[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghiklmnopqrstuvwyz123456789_]", // Possibly not a complete set.
input_buf);
if (ferror(stdin)) {
fatal_system_error("Reading word from input failed");
}
Similarly, you can skip any substring of word separators with an %*[
...]
character-class argument to scanf()
.
You can then look up and increment the count of each word from input_buf
in a hash map, in constant time.
In modern C, you could use regex.
Modularize as Appropriate
If you keep the current approach, I recommend splitting off the logic to resize the buffer into a different module, which has some kind of String
data structure and a String_append()
function to add more chunks of input to it. This function is the where you should call realloc()
.
The state machine also looks like something that could become a function that returns a data structure. Or even two: a tokenizer and an inserter.
-
\$\begingroup\$ FWIW: Somewhere I read the longest "word" in English is 47 characters long. Were this assignment to work with German "words" also, 47 would be regarded as a good beginning, but ultimately woefully inadequate. \$\endgroup\$user272752– user2727522025年06月01日 03:14:09 +00:00Commented Jun 1 at 3:14