3
\$\begingroup\$

I implemented a simple tokenizer. Would love to hear your feedback on code style, best practices:

#include <stdio.h>
#include "lexer.h"
#include "mem.h"
static inline int delimiter(char c);
char **lexer_tokenize(const char *s) {
 char **rv, **tmp;
 int i, j, k, len, cap;
 cap = 4;
 rv = mem_malloc(sizeof(char *) * cap);
 for (i = 0, j = 0, len = 0;; j++) {
 if (delimiter(s[j]) && i < j) {
 if (len == cap - 1) {
 cap <<= 1;
 if ((tmp = mem_realloc(rv, sizeof(char *) * cap))) {
 mem_free(rv);
 rv = tmp;
 } else {
 mem_free(rv);
 rv = 0;
 fprintf(stderr, "tokenize: error resizing result array.\n");
 break;
 }
 }
 rv[len] = mem_malloc(sizeof(char) * (j - i + 1));
 for (k = 0; (rv[len][k++] = s[i++]) != s[j];)
 ;
 rv[len++][k] = 0;
 if (!s[j])
 break;
 i = j + 1;
 }
 }
 rv[len] = 0;
 return rv;
}
static inline int delimiter(char c) {
 return c == ' ' || c == '\n' || c == 0 || c == EOF;
}

Would also love to hear tips on performance improvements.

200_success
145k22 gold badges190 silver badges478 bronze badges
asked Apr 30, 2022 at 3:58
\$\endgroup\$
3
  • 2
    \$\begingroup\$ Could you state what exactly constitutes a token in your language? \$\endgroup\$ Commented Apr 30, 2022 at 5:35
  • \$\begingroup\$ @200_success An english word (excluding punctuation symbols). I'm working on a top-down parser as a learning exercise (trying to learn natural language understanding). \$\endgroup\$ Commented Apr 30, 2022 at 7:20
  • \$\begingroup\$ this is missing the contents of the two 'home grown' header files without the contents of the header files, the code is incomplete. \$\endgroup\$ Commented May 2, 2022 at 11:52

1 Answer 1

3
\$\begingroup\$

Apparent uniform formatting - good

Hope OP is using an auto-formatter.

Bug?

mem_free(rv) is suspicious. Is a free needed on successful re-allocation?

 if ((tmp = mem_realloc(rv, sizeof(char *) * cap))) {
 mem_free(rv);

Insufficient documentation

"a simple tokenizer" is insufficient to describe lexer_tokenize() functionality. I'd hope to see that in the not-included "lexer.h".

Sample usage would have been informative.

OP comments about punctuation symbols yet nothing in code suggests anything dealing with punctuation symbols.

O(n*n) vs. O(n)

With the nested loops code looks O(n*n). I'd expect a tokenizer to be O(n). As is, code is unclear to me.

EOF is not a character of a string

// return c == ' ' || c == '\n' || c == 0 || c == EOF;
return c == ' ' || c == '\n' || c == 0;

Consider any whitespace

'\t', '\r', '\f', '\v' are white-spaces too.

#include <ctype.h>
static inline int delimiter(char c) {
 // return c == ' ' || c == '\n' || c == 0;
 return isspace((unsigned char) c) || c == 0;
}

No checking for allocation failure

As mem_malloc() can return a failure indication, NULL, check for that and handle appropriately.

Declare objects when needed

// char **rv;
// ... 
// rv = mem_malloc(sizeof(char *) * cap);
char **rv = mem_malloc(sizeof(char *) * cap);

Allocate to the refenced object, not the type

It is easier to code right, review and maintain.

// rv = mem_malloc(sizeof(char *) * cap);
char **rv = mem_malloc(sizeof rv[0] * cap);
// rv[len] = mem_malloc(sizeof(char) * (j - i + 1));
rv[len] = mem_malloc(sizeof rv[len][0] * (j - i + 1));

Validate "lexer.h" independence

I assume this code is in "lexer.c" and the lexer_tokenize() declaration is in "lexer.h".

Rather than include #include <stdio.h>, then #include "lexer.h", reverse that to test "lexer.h" has no need for prior includes.

int vs. size_t

int is not certainly wide enough. size_t is the type for any sizeof of an object or indexing. Note that size_t is an unsigned type.

// int i, j, k, len, cap;
size_t i, j, k, len, cap;

Informative names

i, j, k are not informative object names other than they are indexes - but how to they differ? What are they for?

answered Apr 30, 2022 at 17:30
\$\endgroup\$
3
  • \$\begingroup\$ I didn't get the "Validate "lexer.h" independence" section? Should I always include the deps in header rather than implementation? \$\endgroup\$ Commented May 1, 2022 at 3:17
  • 1
    \$\begingroup\$ @kovac A user "xxx.h" file should include all the included files that .h file needs and no extra. Useful for the corresponding "xxx.c" to test that by including it first. \$\endgroup\$ Commented May 1, 2022 at 5:28
  • \$\begingroup\$ I agree that the code should be commented more. \$\endgroup\$ Commented May 1, 2022 at 8:36

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.