Simple tokenizer in C

Question 1

I implemented a simple tokenizer. Would love to hear your feedback on code style, best practices:

#include <stdio.h>
#include "lexer.h"
#include "mem.h"
static inline int delimiter(char c);
char **lexer_tokenize(const char *s) {
 char **rv, **tmp;
 int i, j, k, len, cap;
 cap = 4;
 rv = mem_malloc(sizeof(char *) * cap);
 for (i = 0, j = 0, len = 0;; j++) {
 if (delimiter(s[j]) && i < j) {
 if (len == cap - 1) {
 cap <<= 1;
 if ((tmp = mem_realloc(rv, sizeof(char *) * cap))) {
 mem_free(rv);
 rv = tmp;
 } else {
 mem_free(rv);
 rv = 0;
 fprintf(stderr, "tokenize: error resizing result array.\n");
 break;
 }
 }
 rv[len] = mem_malloc(sizeof(char) * (j - i + 1));
 for (k = 0; (rv[len][k++] = s[i++]) != s[j];)
 ;
 rv[len++][k] = 0;
 if (!s[j])
 break;
 i = j + 1;
 }
 }
 rv[len] = 0;
 return rv;
}
static inline int delimiter(char c) {
 return c == ' ' || c == '\n' || c == 0 || c == EOF;
}

Would also love to hear tips on performance improvements.

Question 2

Could you state what exactly constitutes a token in your language?

Question 3

@200_success An english word (excluding punctuation symbols). I'm working on a top-down parser as a learning exercise (trying to learn natural language understanding).

Question 4

this is missing the contents of the two 'home grown' header files without the contents of the header files, the code is incomplete.

Question 5

Apparent uniform formatting - good

Hope OP is using an auto-formatter.

Bug?

mem_free(rv) is suspicious. Is a free needed on successful re-allocation?

 if ((tmp = mem_realloc(rv, sizeof(char *) * cap))) {
 mem_free(rv);

Insufficient documentation

"a simple tokenizer" is insufficient to describe lexer_tokenize() functionality. I'd hope to see that in the not-included "lexer.h".

Sample usage would have been informative.

OP comments about punctuation symbols yet nothing in code suggests anything dealing with punctuation symbols.

O(n*n) vs. O(n)

With the nested loops code looks O(n*n). I'd expect a tokenizer to be O(n). As is, code is unclear to me.

EOF is not a character of a string

// return c == ' ' || c == '\n' || c == 0 || c == EOF;
return c == ' ' || c == '\n' || c == 0;

Consider any whitespace

'\t', '\r', '\f', '\v' are white-spaces too.

#include <ctype.h>
static inline int delimiter(char c) {
 // return c == ' ' || c == '\n' || c == 0;
 return isspace((unsigned char) c) || c == 0;
}

No checking for allocation failure

As mem_malloc() can return a failure indication, NULL, check for that and handle appropriately.

Declare objects when needed

// char **rv;
// ... 
// rv = mem_malloc(sizeof(char *) * cap);
char **rv = mem_malloc(sizeof(char *) * cap);

Allocate to the refenced object, not the type

It is easier to code right, review and maintain.

// rv = mem_malloc(sizeof(char *) * cap);
char **rv = mem_malloc(sizeof rv[0] * cap);
// rv[len] = mem_malloc(sizeof(char) * (j - i + 1));
rv[len] = mem_malloc(sizeof rv[len][0] * (j - i + 1));

Validate "lexer.h" independence

I assume this code is in "lexer.c" and the lexer_tokenize() declaration is in "lexer.h".

Rather than include #include <stdio.h>, then #include "lexer.h", reverse that to test "lexer.h" has no need for prior includes.

int vs. size_t

int is not certainly wide enough. size_t is the type for any sizeof of an object or indexing. Note that size_t is an unsigned type.

// int i, j, k, len, cap;
size_t i, j, k, len, cap;

Informative names

i, j, k are not informative object names other than they are indexes - but how to they differ? What are they for?

Question 6

I didn't get the "Validate "lexer.h" independence" section? Should I always include the deps in header rather than implementation?

Question 7

@kovac A user "xxx.h" file should include all the included files that .h file needs and no extra. Useful for the corresponding "xxx.c" to test that by including it first.

Question 8

I agree that the code should be commented more.

chux chux 36.4k2 gold badges43 silver badges96 bronze badges · Accepted Answer · 2022-04-30 17:30:41Z

Apparent uniform formatting - good

Hope OP is using an auto-formatter.

Bug?

mem_free(rv) is suspicious. Is a free needed on successful re-allocation?

 if ((tmp = mem_realloc(rv, sizeof(char *) * cap))) {
 mem_free(rv);

Insufficient documentation

"a simple tokenizer" is insufficient to describe lexer_tokenize() functionality. I'd hope to see that in the not-included "lexer.h".

Sample usage would have been informative.

OP comments about punctuation symbols yet nothing in code suggests anything dealing with punctuation symbols.

O(n*n) vs. O(n)

With the nested loops code looks O(n*n). I'd expect a tokenizer to be O(n). As is, code is unclear to me.

EOF is not a character of a string

// return c == ' ' || c == '\n' || c == 0 || c == EOF;
return c == ' ' || c == '\n' || c == 0;

Consider any whitespace

'\t', '\r', '\f', '\v' are white-spaces too.

#include <ctype.h>
static inline int delimiter(char c) {
 // return c == ' ' || c == '\n' || c == 0;
 return isspace((unsigned char) c) || c == 0;
}

No checking for allocation failure

As mem_malloc() can return a failure indication, NULL, check for that and handle appropriately.

Declare objects when needed

// char **rv;
// ... 
// rv = mem_malloc(sizeof(char *) * cap);
char **rv = mem_malloc(sizeof(char *) * cap);

Allocate to the refenced object, not the type

It is easier to code right, review and maintain.

// rv = mem_malloc(sizeof(char *) * cap);
char **rv = mem_malloc(sizeof rv[0] * cap);
// rv[len] = mem_malloc(sizeof(char) * (j - i + 1));
rv[len] = mem_malloc(sizeof rv[len][0] * (j - i + 1));

Validate "lexer.h" independence

I assume this code is in "lexer.c" and the lexer_tokenize() declaration is in "lexer.h".

Rather than include #include <stdio.h>, then #include "lexer.h", reverse that to test "lexer.h" has no need for prior includes.

int vs. size_t

int is not certainly wide enough. size_t is the type for any sizeof of an object or indexing. Note that size_t is an unsigned type.

// int i, j, k, len, cap;
size_t i, j, k, len, cap;

Informative names

i, j, k are not informative object names other than they are indexes - but how to they differ? What are they for?

I didn't get the "Validate "lexer.h" independence" section? Should I always include the deps in header rather than implementation?
@kovac A user "xxx.h" file should include all the included files that .h file needs and no extra. Useful for the corresponding "xxx.c" to test that by including it first.

Stack Exchange Network

Simple tokenizer in C

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Simple tokenizer in C

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions