Efficient data-driven finite state machine

Question 1

In creating this answer to a recent question, I thought I might write a fully worked example that implements a finite state machine in C.

The purpose is to implement the following function:

Given a const char *, search for a string conforming to the pattern WWWDD where WWW is one of { "Mon", "Tue", "Wed, "Thu", "Fri", "Sat", "Sun" } and DD is an integer in the range \$[1, 31]\$ (inclusive). If the pattern exists within the string, return the offset; otherwise return -1.

I implemented this state machine:finite state machine using simple table structure. This is the code, which includes a simple driver with test vectors.

#include <string.h>
#include <stdlib.h>
#include <ctype.h>
static const struct fsmNode {
 const char target;
 const struct fsmNode *next;
} fsm[] = {
 { 'T' , &fsm[6] }, // 0
 { 'M' , &fsm[9] }, // 1
 { 'S' , &fsm[11] }, // 2
 { 'W' , &fsm[12] }, // 3
 { 'F' , &fsm[14] }, // 4
 { '0円' , NULL }, // 5
 // second letter
 { 'h' , &fsm[18] }, // 6
 { 'u' , &fsm[20] }, // 7
 { '0円' , NULL }, // 8
 { 'o' , &fsm[22] }, // 9
 { '0円' , NULL }, // 10
 { 'a' , &fsm[24] }, // 11
 { 'u' , &fsm[22] }, // 12
 { '0円' , NULL }, // 13
 { 'e' , &fsm[26] }, // 14
 { '0円' , NULL }, // 15
 { 'r' , &fsm[27] }, // 16
 { '0円' , NULL }, // 17
 // third letter
 { 'u' , &fsm[30] }, // 18
 { '0円' , NULL }, // 19
 { 'e' , &fsm[30] }, // 20
 { '0円' , NULL }, // 21
 { 'n' , &fsm[30] }, // 22
 { '0円' , NULL }, // 23
 { 't' , &fsm[30] }, // 24
 { '0円' , NULL }, // 25
 { 'd' , &fsm[30] }, // 26
 { '0円' , NULL }, // 27
 { 'i' , &fsm[30] }, // 28
 { '0円' , NULL }, // 29
 // first digit
 { '0' , &fsm[36] }, // 30
 { '1' , &fsm[35] }, // 31
 { '2' , &fsm[35] }, // 32
 { '3' , &fsm[46] }, // 33
 { '0円' , NULL }, // 34
 // second digit
 { '0' , NULL }, // 35
 { '1' , NULL }, // 36
 { '2' , NULL }, // 37
 { '3' , NULL }, // 38
 { '4' , NULL }, // 39
 { '5' , NULL }, // 40
 { '6' , NULL }, // 41
 { '7' , NULL }, // 42
 { '8' , NULL }, // 43
 { '9' , NULL }, // 44
 { '0円' , NULL }, // 45
 { '0' , NULL }, // 46
 { '1' , NULL }, // 47
 { '0円' , NULL }, // 48
};
int findWeekdayDate(const char *str) {
 const struct fsmNode *state = fsm;
 const char *curr;
 for (curr = str ; *curr; ++curr) {
 for ( ; state->target && *curr != state->target; ++state) 
 { /* do nothing */ }
 if (state->next == NULL) {
 if (state->target) {
 // found a valid string
 return curr-str-4; 
 } else {
 state = fsm;
 }
 } else {
 state = state->next;
 }
 }
 return -1;
}
// Test code follows...
static const struct test_s {
 const char *string;
 int expected;
} test[] = {
 { "abcSun24Def", 3},
 { "abcMun24Def", -1},
 { "abcSun091", 3},
 { "MON31", -1},
 { "Mon31", 0},
 { NULL, -1 } // last item
};
#include <stdio.h>
#define RED "033円[31m"
#define GREEN "033円[32m"
#define WHITE "033円[39m"
int main() {
 for (size_t i=0; test[i].string != NULL; ++i) {
 int val = findWeekdayDate(test[i].string);
 printf("%s: \"%s\", expected %d, got %d\n", 
 (val == test[i].expected ? GREEN " OK" WHITE: RED "BAD" WHITE),
 test[i].string, test[i].expected, val
 );
 }
}

Comments on efficiency, style, performance, data structure choice and implementation are all of interest.

Question 2

Because it's been asked, the software I used to create the FSM diagram is graphviz.

Question 3

Nice work.

On findWeekdayDate() return, I'd expect a value that is enumerated like fWD_SUCCESS, fWD_FAIL, fWD_SUCCESS_for_foo_reason, fWD_FAIL_for_bar_reason, etc. Rather than magic numbers -1,0,3,.... It could end up being a simple bool.
As fsm[] is const struct fsmNode, I see no value in making a field const like const char target. char target is simpler and sufficient. OTOH, if there is some value, then next should also be const as in const struct fsmNode * const next.

Delay/remove #includes.

#include <string.h> // Needed?
#include <stdlib.h> // OK, for NULL
#include <ctype.h> // Needed?

Style inconsistency. Recommend empty for body to match other indentations.

// for ( ; state->target && *curr != state->target; ++state) 
//{ /* do nothing */ }
for ( ; state->target && *curr != state->target; ++state) {
 /* do nothing */
}

Use an auto formatter. The extra space betrays manual formatting as it is inconsistent with the rest of code. Life is too short for manual formatting.
```
// v
for (curr = str ; *curr; ++curr) {
```

Suggest adding link to software used to create FSM.

Question 4

Thanks for the review. On the first point, the return value can't be just a bool because it represents the offset within the string that the pattern was found. That's why -1 is useful as a "not found" indicator.

Question 5

I'm confused by your second point. The definition of the struct has both elements declared as const, so I'm not sure what you mean when you write "next should also be const" -- it seems to me that it already is. Can you clarify?

Question 6

@Edward The field next as const struct fsmNode *next is a pointer to const struct fsmNode, but is not itself const. Compare struct fsmNode * const next and const struct fsmNode * const next.

Question 7

Let me be more clear: what would adding an additional const keyword prevent that is currently possible but undesirable?

Question 8

Oh, I see now. You're advocating removing const (in front of target) rather than adding one.

Question 9

This was quite fun to read and to figure out how it works.

It took me a while to understand the control flow, which is my main point of critisism. I rearranged it a bit so that, at least in my opinion, the logic is easier to follow.

for ( ; state->target && *curr != state->target; ++state)
{ /* do nothing */ }

Here, instead of having to use a comment to "excuse" the empty body of the for loop, I would use a while loop and add a real comment explaining its purpose.

if (state->next == NULL) {
 if (state->target) {
 // found a valid string
 return curr-str-4; 
 } else {
 state = fsm;
 }
} else {
 state = state->next;
}

For this part, I would apply the following changes:

Consistently use implicit boolean evaluation (i.e. use !p instead of p == NULL).
Use continue, in order to
- reduce nesting,
- group each condition with its corresponding action,
- sort the cases by increasing success (no match → partial match → full match).
Add comments explaining each case.

This is what comes out:

int findWeekdayDate(const char *str) {
 const struct fsmNode *state = fsm;
 const char *curr;
 for (curr = str; *curr; ++curr) {
 // try to find a match
 while (state->target && *curr != state->target) {
 ++state;
 }
 if (!state->target) {
 // no match, retry with next character
 state = fsm;
 continue;
 }
 if (state->next) {
 // partial match, proceed with next character
 state = state->next;
 continue;
 }
 // full match
 return curr - str - 4;
 }
 return -1;
}

I hope this is still correct, at least all the provided test cases are passed.

Other remarks are:

I'm afraid this is quite an inflexible solution. Adding, removing or changing the patterns would probably require updating most of the next indexes (and the "line number" comments).
Assuming it was somewhat convenient to generate the fsm structure, it would make sense to make it a parameter of the matching function so that the function can be reused for different FSMs.
The magic number 4 (in curr-str-4) should be a named constant.
- Its value should actually by 5 (because it is the length of the WWWDD pattern),
- the result of the function should then be calculated as (curr - PATTERN_LENGTH + 1) - str in order to (1) make the awareness of the "fence post problem" apparent and (2) highlight the fact that the offset into str is calculated.
- Hard-coding it into the function makes it also impossible to support patterns where not all matches have the same length, e.g. the full weekday names (which would already be supported by the given FSM data structure).

Question 10

Perhaps if (state->next) { state = state->next; continue; } return curr - str - 4; --> if (!state->next) return curr - str - 4; state = state->next; ?

Question 11

Then the cases are not sorted by increasing success anymore, which was one of my arguments.

Question 12

Yes, the sort idea has merit - UV. // no match, go back to start threw me off a bit as the idea of going back to the beginning of the loop makes sense, but a continue is slightly incongruent with "go back to start". As it does more than go back to the loop beginning, it also does a ++curr, the iterative step between loops. More like a "next" than "go back".

Question 13

It means "continue with the next character, at the start of the pattern". I named it after the "start" node in the FSM drawing.

Question 14

This is a great example of a state machine, thanks for going to the extra effort of coding it up. I'm going to quote your own review of the previous implementation:

Check for a null pointer

Things do not go well if the routine is passed a NULL pointer. On my machine, I get a segmentation fault and a crash. You can eliminate this hole by adding these lines near the top of the routine:

if (str == NULL) {
 return -1;
}

You may have taken an intentional decision not to support this case, however it could be that you didn't notice because your test cases stop when they encounter a NULL as the test case:

for (i=0; test[i].string != NULL; ++i) {

If you want to support it, then a better running condition for your tests might be:

for (i=0; i < sizeof(test)/sizeof(test[0]); ++i) {

Maintainability

I had a similar concern to @mkrieger about the maintainability of the approach if you wanted to expand the state machine. On reflection however, you could simply move any nodes that you wanted to expand to the end of the list without having to update everything. This would create holes, but the memory impact would be small and your algorithm would cope with it. Where you're linking to a state multiple times however, I'd consider introducing a constant to represent the state. This would make the table easier to read + adapt. So, fsm[30] could be fsm[DAY_VALID] etc.

Question 15

Concerning the NULL check: In general, checking for NULL does have merit, yet this code's goal was "Given a const char *, search for a string conforming ...". As a string is an array and NULL never points to a valid string, checking for NULL is going beyond coding goals. Beware of creeping features.

Question 16

@chux I don't think that is clear from the spec at all. It says 'search for a string' not 'search a string'. The reference at the end is better 'If the pattern exists within the string' and implies that a string is passed in, but it is not a clear statement of expectation. Given the inconsistencies and Edward's previous answer having referred to a null check for essentially the same function, it seemed like a valid concern, particularly since there is also a test case that implies null should return -1, although the case isn't executed.

chux chuxchux 36.4k2 gold badges43 silver badges96 bronze badges · Accepted Answer · 2016-12-22 23:21:55Z

Nice work.

On findWeekdayDate() return, I'd expect a value that is enumerated like fWD_SUCCESS, fWD_FAIL, fWD_SUCCESS_for_foo_reason, fWD_FAIL_for_bar_reason, etc. Rather than magic numbers -1,0,3,.... It could end up being a simple bool.
As fsm[] is const struct fsmNode, I see no value in making a field const like const char target. char target is simpler and sufficient. OTOH, if there is some value, then next should also be const as in const struct fsmNode * const next.

Delay/remove #includes.

#include <string.h> // Needed?
#include <stdlib.h> // OK, for NULL
#include <ctype.h> // Needed?

Style inconsistency. Recommend empty for body to match other indentations.

// for ( ; state->target && *curr != state->target; ++state) 
//{ /* do nothing */ }
for ( ; state->target && *curr != state->target; ++state) {
 /* do nothing */
}

Use an auto formatter. The extra space betrays manual formatting as it is inconsistent with the rest of code. Life is too short for manual formatting.
```
// v
for (curr = str ; *curr; ++curr) {
```

Suggest adding link to software used to create FSM.

Thanks for the review. On the first point, the return value can't be just a bool because it represents the offset within the string that the pattern was found. That's why -1 is useful as a "not found" indicator.
I'm confused by your second point. The definition of the struct has both elements declared as const, so I'm not sure what you mean when you write "next should also be const" -- it seems to me that it already is. Can you clarify?
@Edward The field next as const struct fsmNode *next is a pointer to const struct fsmNode, but is not itself const. Compare struct fsmNode * const next and const struct fsmNode * const next.
Let me be more clear: what would adding an additional const keyword prevent that is currently possible but undesirable?
Oh, I see now. You're advocating removing const (in front of target) rather than adding one.

Stack Exchange Network

Efficient data-driven finite state machine

3 Answers 3

Check for a null pointer

Maintainability

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Efficient data-driven finite state machine

3 Answers 3

Check for a null pointer

Maintainability

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions