I am writing a function that reads all consecutive blanks in a string, moves the pointer to the next non-blank char, and return a struct "token" that contains the blanks it found.
The main function is defined as
int match_ws(const char** p, struct token* tok)
where p
points to the beginning of the strings to analyze, and tok
will contain the result after the function returns. Return value is either 1 (successful), or 0.
[EDIT]
match_ws
starts analyzing the string given as first input, and reads all consecutive whitespaces, while advancing the pointer too. It stores all blanks it found in the struct (second input) and returns 1. If it didn't find any blank, or too many, returns 0 without modifying tok
.
So, if the input is const char* p = " . +"
then after match_ws
ran the tok
structure will contain lexeme = " "
(the four spaces at the beginning of p
, and type = WS
(enum type for "whitespace").
If the input is const char* p = ". ; )"
then the function just returns 0
without modifying tok
, because **p
is not a whitespace (it's a dot).
Overall, this is part of a parser, and match_ws
is the function that "consumes" all whitespaces -- not relevant for the parser. I believe the function behaves correctly, but since I am not experienced with the C language, I would like to have reviews about how I handled allocation of memory, pointer, pointer to pointers, function definition, etc.
[/EDIT]
Overall, my questions are about correct usage of pointers, pointers-to-pointers, good choice in the type of arguments (for the function definitions), and memory management.
I wrote my questions in the comments, but I am not so used to memory management and coding standards in C, so would appreciate a code review and all suggestions welcome.
I also added some boilerplate code so the file can be compiled and executed. As you can guess is part of a larger code, but I reduced it to be a minimal example.
#include <stdio.h>
#include <stdlib.h>
#define MAX_LINE 100
/*
* Some boilerplate code
*/
enum token_type {
WS, // white spaces
Eof
};
struct token{
char* lexeme;
enum token_type type;
};
char consume(const char** p);
/*
* Main function to work on.
* Questions:
* - char ws[MAX_LINE] or using malloc?
* - is it OK to initialize char ws with "" ?
* - Am I appending char into ws in a optimal way?
* - is it OK to return int and modify input argument tok, or should create a new tok and return it?
* - makes sense to have pointer-to-pointer as input? Else what?
*/
int match_ws(const char** p, struct token* tok) {
/*
* Find all consecutives blanks starting from position p
* Modify the input struct token.
* Return 1 if execution was successful, 0 otherwise.
*/
if ( **p != ' ' &&
**p != '\t' &&
**p != '\r' &&
**p != '\n')
{
printf("Bad call to whitespace!!");
return 0;
}
int i = 0;
char c;
char ws[MAX_LINE] = "";
do {
if (i >= MAX_LINE) {
printf("Found more than %d white spaces", MAX_LINE);
return 0;
}
c = consume(p);
printf("WHITESPACE : Current char is |%c|\n", c);
ws[i++] = c;
}
// Here p has been already advanced by the call to consume fn
while (**p == ' ' || **p == '\t' || **p == '\r' || **p == '\n');
tok->lexeme = ws;
tok->type = WS;
return 1;
}
/*
* More boilerplate code to compile
*/
const char* type2char (enum token_type t) {
switch (t)
{
case WS: return "WS";
default: return "UNK";
}
}
void print_token(struct token* p) {
printf("<%s, %s>\n", p->lexeme, type2char(p->type));
}
char consume(const char** p) {
// Advances the pointer while reading the input
char c = **p;
(*p)++;
return c;
}
/*
* Main.
* Questions:
* - is it OK to pass arguments to match_ws via '&' notation?
* - Makes sense to have const char* line? Or just char* ?
* - Should tok be initialized somehow?
* - Is there anything that has to be free() eventually?
*/
int main() {
const char* line = " +";
struct token tok;
match_ws(&line, &tok);
print_token(&tok);
}
1 Answer 1
Answers to Questions
char ws[MAX_LINE]
or usingmalloc
?
The function match_ws()
should return a pointer to a struct token
. The token should be allocated by the match_ws()
function, the contents of the token should also use malloc()
to allocate the string that token->lexeme
points to. Right now the code returns the address of a variable on the stack, this is not safe and can lead to undefined behavior (UB).
- is it OK to initialize char
ws
with""
?
It is okay to initialize an array of char with ""
. To initialize a single character use single quotes instead.
- Am I appending
char
intows
in a optimal way?
Using pointers would be more optimal.
- is it OK to return
int
and modify input argumenttok
, or should create a newtok
and return it?
See the answer to the first question.
- makes sense to have pointer-to-pointer as input? Else what?
It would be better to use a single pointer rather than a double pointer. It is possible to corrupt the input when it is a double pointer.
- is it OK to pass arguments to
match_ws
via '&' notation?
It is OK to pass the the token argument using the &
notation if you don't want to allocate a token in the function, it would be better to pass line
without the &
notation.
- Makes sense to have
const char*
line? Or justchar*
?
The use of const
is appropriate here.
- Should
tok
be initialized somehow?
The C programming language is a very unforgiving programming language: all variables should be initialized.
- Is there anything that has to be
free()
eventually?
Not in the current implementation.
Don't Reinvent the Wheel
Checking the values of input characters is very common in the C programming language; a function is provided for checking whitespace, and another function is provided for checking alphanumeric characters. What you do need to do is include <ctype.h>
. The advantage of doing this is that the current code probably misses some white space characters that isspace()
would find.
Code Organization
Function prototypes are very useful in large programs that contain multiple source files, and that in case they will be in header files. In a single file program like this it is better to put the main()
function at the bottom of the file and all the functions that get used in the proper order above main()
. Keep in mind that every line of code written is another line of code where a bug can crawl into the code.
match_ws
interface to look deeper into the code. Possibly if you add some examples get a better grasp of the interface intention. \$\endgroup\$