I implemented a simple tokenizer. Would love to hear your feedback on code style, best practices:
#include <stdio.h>
#include "lexer.h"
#include "mem.h"
static inline int delimiter(char c);
char **lexer_tokenize(const char *s) {
char **rv, **tmp;
int i, j, k, len, cap;
cap = 4;
rv = mem_malloc(sizeof(char *) * cap);
for (i = 0, j = 0, len = 0;; j++) {
if (delimiter(s[j]) && i < j) {
if (len == cap - 1) {
cap <<= 1;
if ((tmp = mem_realloc(rv, sizeof(char *) * cap))) {
mem_free(rv);
rv = tmp;
} else {
mem_free(rv);
rv = 0;
fprintf(stderr, "tokenize: error resizing result array.\n");
break;
}
}
rv[len] = mem_malloc(sizeof(char) * (j - i + 1));
for (k = 0; (rv[len][k++] = s[i++]) != s[j];)
;
rv[len++][k] = 0;
if (!s[j])
break;
i = j + 1;
}
}
rv[len] = 0;
return rv;
}
static inline int delimiter(char c) {
return c == ' ' || c == '\n' || c == 0 || c == EOF;
}
Would also love to hear tips on performance improvements.
-
2\$\begingroup\$ Could you state what exactly constitutes a token in your language? \$\endgroup\$200_success– 200_success2022年04月30日 05:35:05 +00:00Commented Apr 30, 2022 at 5:35
-
\$\begingroup\$ @200_success An english word (excluding punctuation symbols). I'm working on a top-down parser as a learning exercise (trying to learn natural language understanding). \$\endgroup\$kovac– kovac2022年04月30日 07:20:47 +00:00Commented Apr 30, 2022 at 7:20
-
\$\begingroup\$ this is missing the contents of the two 'home grown' header files without the contents of the header files, the code is incomplete. \$\endgroup\$user3629249– user36292492022年05月02日 11:52:59 +00:00Commented May 2, 2022 at 11:52
1 Answer 1
Apparent uniform formatting - good
Hope OP is using an auto-formatter.
Bug?
mem_free(rv)
is suspicious. Is a free needed on successful re-allocation?
if ((tmp = mem_realloc(rv, sizeof(char *) * cap))) {
mem_free(rv);
Insufficient documentation
"a simple tokenizer" is insufficient to describe lexer_tokenize()
functionality. I'd hope to see that in the not-included "lexer.h".
Sample usage would have been informative.
OP comments about punctuation symbols yet nothing in code suggests anything dealing with punctuation symbols.
O(n*n) vs. O(n)
With the nested loops code looks O(n*n). I'd expect a tokenizer to be O(n). As is, code is unclear to me.
EOF
is not a character of a string
// return c == ' ' || c == '\n' || c == 0 || c == EOF;
return c == ' ' || c == '\n' || c == 0;
Consider any whitespace
'\t'
, '\r'
, '\f'
, '\v'
are white-spaces too.
#include <ctype.h>
static inline int delimiter(char c) {
// return c == ' ' || c == '\n' || c == 0;
return isspace((unsigned char) c) || c == 0;
}
No checking for allocation failure
As mem_malloc()
can return a failure indication, NULL
, check for that and handle appropriately.
Declare objects when needed
// char **rv;
// ...
// rv = mem_malloc(sizeof(char *) * cap);
char **rv = mem_malloc(sizeof(char *) * cap);
Allocate to the refenced object, not the type
It is easier to code right, review and maintain.
// rv = mem_malloc(sizeof(char *) * cap);
char **rv = mem_malloc(sizeof rv[0] * cap);
// rv[len] = mem_malloc(sizeof(char) * (j - i + 1));
rv[len] = mem_malloc(sizeof rv[len][0] * (j - i + 1));
Validate "lexer.h" independence
I assume this code is in "lexer.c" and the lexer_tokenize()
declaration is in "lexer.h".
Rather than include #include <stdio.h>
, then #include "lexer.h"
, reverse that to test "lexer.h"
has no need for prior includes.
int
vs. size_t
int
is not certainly wide enough. size_t
is the type for any sizeof of an object or indexing. Note that size_t
is an unsigned type.
// int i, j, k, len, cap;
size_t i, j, k, len, cap;
Informative names
i, j, k
are not informative object names other than they are indexes - but how to they differ? What are they for?
-
\$\begingroup\$ I didn't get the "Validate "lexer.h" independence" section? Should I always include the deps in header rather than implementation? \$\endgroup\$kovac– kovac2022年05月01日 03:17:12 +00:00Commented May 1, 2022 at 3:17
-
1\$\begingroup\$ @kovac A user
"xxx.h"
file should include all the included files that.h
file needs and no extra. Useful for the corresponding"xxx.c"
to test that by including it first. \$\endgroup\$chux– chux2022年05月01日 05:28:14 +00:00Commented May 1, 2022 at 5:28 -
\$\begingroup\$ I agree that the code should be commented more. \$\endgroup\$qwr– qwr2022年05月01日 08:36:12 +00:00Commented May 1, 2022 at 8:36