Splitting a command line in C

Question 1

Is this a good C program for splitting a command line without doing any expansion on it? Don't worry too much about main() and the output -- those are for testing.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <assert.h>
void pullwhitespace(char **input);
int main (int argc, char **argv)
{
 if (argc != 2) {
 fputs("Must have exactly one command line argument\n", stderr);
 return 64;
 }
 char *input = argv[1];
 char *outstring = malloc(strlen(argv[1]) + 1);
 char *outptr = outstring;
 char a;
 pullwhitespace(&input);
NORMAL:
 /* We are in an unquoted part of the input */
 switch ((a = *input++))
 {
 case ' ' :
 case '\t':
 *outptr++ = '0円';
 pullwhitespace(&input);
 goto NORMAL;
 case '\n':
 case '0円':
 goto DONE;
 case '\\':
 if (((a = *input++)) == '\n') {
 break;
 } else if (a) {
 *outptr++ = a;
 goto NORMAL;
 } else {
 goto FAIL;
 }
 case '\'':
 goto QUOTE;
 case '"':
 goto DQUOTE;
 default:
 *outptr++ = a;
 goto NORMAL;
 }
 assert(0);
QUOTE:
 /* We are in a single quoted string */ 
 while (((a = *input++)) != '\'') {
 if (a) {
 *outptr++ = a;
 } else {
 goto FAIL;
 }
 }
 goto NORMAL;
DQUOTE:
 /* We are in a double quoted string */
 switch ((a = *input++))
 {
 case '0円':
 goto FAIL;
 case '\\':
 {
 char b = *input++;
 switch (b)
 {
 case '0円':
 goto FAIL;
 default:
 *outptr++ = a; 
 case '"': 
 case '\\':
 *outptr++ = b;
 case '\n':
 break;
 }
 }
 goto DQUOTE;
 default:
 *outptr++ = a;
 goto DQUOTE;
 case '"':
 goto NORMAL;
 }
 assert(0);
FAIL:
 fprintf(stderr, "%s: Invalid string input\n", argv[0]);
 return 1;
 assert(0);
DONE:
 *outptr++ = '0円';
 *outptr++ = '0円';
 errno = 0;
 fwrite(outstring, 1, (size_t)(outptr - outstring), stdout);
 int error = errno;
 if(ferror(stdout)) {
 fprintf(stderr, "%s: write error: %s\n", argv[0], strerror(error));
 return 1;
 } 
 fflush(stdout);
 if ( errno != 0) { 
 fprintf(stderr, "%s: write error: %s\n", argv[0], strerror(error));
 return 1;
 }
 return 0;
}
void pullwhitespace (char **inputptr)
{
 char *input = *inputptr;
 char a;
 while (( a = *input)) {
 if (a != ' ' && a != '\t') {
 return;
 }
 input++;
 }
}

I know that it is FULL of goto statements (more than every other program I have ever made put together) -- is that okay?

The reason for the goto statements is that this is meant as a lightweight state machine. Intended applications include parsing command line arguments or (possibly - this is far fetched) for the parser (not supporting code) to be the basis of a kernel module to parse shebang lines, allowing multiple arguments to be passed to a shebang interpreter.

Question 2

No, it's a lexer :)

Question 3

What test cases have you tried this with? If you haven't got a set of tests for complex functions like this that it consistently passes, my default opinion is that it's not production ready.

Question 4

Why so many gotos? Those are frowned upon in C... in fact, I'm frowning right now. :-(

Question 5

That is because this is a state machine.

Question 6

gotos need not be feared irrationally, and I've seen them in in lexers before, including in the CPython lexer (tokenizer.c), that said, it's good to figure out the alternatives, which you may end up liking more.

Question 7

I doubt that this code actually works. Especially pullwhitespace doesn't do what you probably intended (hint: it doesn't modify the argument, it makes a copy of the pointer and advances it). Have you tested it before submitting it for review here?

Anyway: You might want to have a look here for alternative ways to implement a state machine. You approach is similar to Remo.D's answer in the linked question but I think his macros make it more readable.

A few things I noticed:

When you are in NORMAL and read a white space you do this *outptr++ = '0円'; Given that strings in C are NULL terminated character sequences this seems dodgy as it will terminate the output right there unless you tell someone to read further by supplying a length. If you intend to split on white space then you should store the results in a char ** like argv.
Again in NORMAL:
```
case '\\':
 if (((a = *input++)) == '\n') {
 break;
 } else if (a) {
 *outptr++ = a;
 goto NORMAL;
 } else {
 goto FAIL;
 }
```
So if you find a \ followed by a newline you break the switch which is followed by an assert(0). Normally assert should be used to state invariants as "this function expects this condition to be true and it is probably a bug in the program if it is not". Specifically assert(0) should be used as "should never execute this code path". It seems weird that external user input can trigger an assert just because the program doesn't like he input. Especially since you have a FAIL state anyway.

Also the else condition evaluation depends on the side effect of the if condition evaluation - not very nice. You should read the next character first in a separate line and then evaluate. This makes it much clearer. So this case could probably have been written as:
```
case '\\':
 a = *input++;
 if (a && a != '\n') {
 *outptr++ = a;
 goto NORMAL;
 }
 goto FAIL;
```
If you are in DQUOTE and encounter a \ then you read another character from the input. However that character simply gets swallowed because in the default case you do this: *outptr++ = a; which will write the \ to the output and skip the current character (held in b).

Question 8

Comments about pullwhitespace():

Code is broken. Needs work to update *inputptr.
Needs better name than inputptr. Maybe string_ptr. Hmm, no much better.
Rather than define "whitespace" as ' ' and '\t', use the C definition of whitesapce.

[Edit] Deleted const susggestion.

void pullwhitespace (char **inputptr) {
 char *input = *inputptr;
 char a;
 while ((a = *input) != '0円') {
 if (!isspace((unsigned char) a)) {
 break;
 }
 input++;
 }
 *inputptr = input;
}

If one accept using the C standard isspace() further simplification results in

 void pullwhitespace (char **inputptr) {
 char *input = *inputptr;
 while (isspace(*input)) input++;
 *inputptr = input;
 }

Question 9

Adding const to the definition of parameter inputptr doesn't work the way you'd expect (see c-faq.com/ansi/constmismatch.html). You have also modified the function in ways that make it worse than the original (first commented line).

Question 10

@William Morris I have seen the error of my ways concerning const and corrected while ((a = *(*inputptr)) != '0円'). Please detail about your 2nd point, what is worse?

Question 11

@William Morris I also see that isspace() needs an unsigned char or EOF. Is that your concern?

Question 12

The original code had char *input = *inputptr; and then de-referenced input, whereas your code omits that and uses eg *(*inputptr), which is more difficult to read (and hence worse). I would have made the OP's input a const, ie: const char *input = *inputptr;. Also it is arguable that the OP might have wanted just space and tab to be treated instead of all white-space (although I haven't read the code to see).

Question 13

BTW, when you use @ to quote a name, there should be no spaces in the name. With @WilliamMorris I get a notification, with the space, I don't :-)

Question 14

This parser is not that complicated, you don't need jumps. My example below ignores some corner cases, just showing the priciple. You can expand it to your needs.

#include <stdio.h>
#include <ctype.h>
int main(int argc, char **argv)
{
 const char *s, *e;
 for (e = argv[1]; *e; s = e) {
 while (isspace(*e)) e++;
 if (*e == '\"') {
 for (s = ++e; *e && *e != '\"'; e++);
 if (!*e) return 1; // missing closing quote
 } else if (*e == '\'') {
 for (s = ++e; *e && *e != '\''; e++);
 if (!*e) return 1; // missing closing quote
 } else {
 for (s = e; *e && !isspace(*e); e++);
 }
 printf("'%.*s'\n", (int)(e - s), s);
 if (*e == '\"' || *e == '\'')
 e++; // advance after closing quote
 }
 return 0;
}

Question 15

I think this answer would be a lot more helpful with some discussion on how this approach differs from the original, and why you think it's better.

Question 16

Your code doesn't cope with escape-sequences in double- or single-quoted strings like the original.

ChrisWue ChrisWue 20.6k4 gold badges43 silver badges107 bronze badges · Answer 1 · 2013-12-14 08:50:21Z

I doubt that this code actually works. Especially pullwhitespace doesn't do what you probably intended (hint: it doesn't modify the argument, it makes a copy of the pointer and advances it). Have you tested it before submitting it for review here?

Anyway: You might want to have a look here for alternative ways to implement a state machine. You approach is similar to Remo.D's answer in the linked question but I think his macros make it more readable.

A few things I noticed:

When you are in NORMAL and read a white space you do this *outptr++ = '0円'; Given that strings in C are NULL terminated character sequences this seems dodgy as it will terminate the output right there unless you tell someone to read further by supplying a length. If you intend to split on white space then you should store the results in a char ** like argv.
Again in NORMAL:
```
case '\\':
 if (((a = *input++)) == '\n') {
 break;
 } else if (a) {
 *outptr++ = a;
 goto NORMAL;
 } else {
 goto FAIL;
 }
```
So if you find a \ followed by a newline you break the switch which is followed by an assert(0). Normally assert should be used to state invariants as "this function expects this condition to be true and it is probably a bug in the program if it is not". Specifically assert(0) should be used as "should never execute this code path". It seems weird that external user input can trigger an assert just because the program doesn't like he input. Especially since you have a FAIL state anyway.

Also the else condition evaluation depends on the side effect of the if condition evaluation - not very nice. You should read the next character first in a separate line and then evaluate. This makes it much clearer. So this case could probably have been written as:
```
case '\\':
 a = *input++;
 if (a && a != '\n') {
 *outptr++ = a;
 goto NORMAL;
 }
 goto FAIL;
```
If you are in DQUOTE and encounter a \ then you read another character from the input. However that character simply gets swallowed because in the default case you do this: *outptr++ = a; which will write the \ to the output and skip the current character (held in b).

chux chux 36.2k2 gold badges43 silver badges96 bronze badges · Answer 2 · 2013-12-15 01:46:03Z

1

\$\begingroup\$

Comments about pullwhitespace():

Code is broken. Needs work to update *inputptr.
Needs better name than inputptr. Maybe string_ptr. Hmm, no much better.
Rather than define "whitespace" as ' ' and '\t', use the C definition of whitesapce.

[Edit] Deleted const susggestion.

void pullwhitespace (char **inputptr) {
 char *input = *inputptr;
 char a;
 while ((a = *input) != '0円') {
 if (!isspace((unsigned char) a)) {
 break;
 }
 input++;
 }
 *inputptr = input;
}

If one accept using the C standard isspace() further simplification results in

 void pullwhitespace (char **inputptr) {
 char *input = *inputptr;
 while (isspace(*input)) input++;
 *inputptr = input;
 }

Share

edited Dec 18, 2013 at 0:27

answered Dec 15, 2013 at 1:46

chux's user avatar

chux chux

36.2k2 gold badges43 silver badges96 bronze badges

\$\endgroup\$

8

\$\begingroup\$ Adding const to the definition of parameter inputptr doesn't work the way you'd expect (see c-faq.com/ansi/constmismatch.html). You have also modified the function in ways that make it worse than the original (first commented line). \$\endgroup\$

William Morris
– William Morris

2013年12月17日 20:29:03 +00:00
Commented Dec 17, 2013 at 20:29
\$\begingroup\$ @William Morris I have seen the error of my ways concerning const and corrected while ((a = *(*inputptr)) != '0円'). Please detail about your 2nd point, what is worse? \$\endgroup\$

chux
– chux

2013年12月17日 20:51:48 +00:00
Commented Dec 17, 2013 at 20:51
\$\begingroup\$ @William Morris I also see that isspace() needs an unsigned char or EOF. Is that your concern? \$\endgroup\$

chux
– chux

2013年12月17日 21:25:40 +00:00
Commented Dec 17, 2013 at 21:25
\$\begingroup\$ The original code had char *input = *inputptr; and then de-referenced input, whereas your code omits that and uses eg *(*inputptr), which is more difficult to read (and hence worse). I would have made the OP's input a const, ie: const char *input = *inputptr;. Also it is arguable that the OP might have wanted just space and tab to be treated instead of all white-space (although I haven't read the code to see). \$\endgroup\$

William Morris
– William Morris

2013年12月17日 22:56:13 +00:00
Commented Dec 17, 2013 at 22:56
\$\begingroup\$ BTW, when you use @ to quote a name, there should be no spaces in the name. With @WilliamMorris I get a notification, with the space, I don't :-) \$\endgroup\$

William Morris
– William Morris

2013年12月17日 22:58:23 +00:00
Commented Dec 17, 2013 at 22:58

| Show 3 more comments

user33516user33516 · Answer 3 · 2013-12-14 10:04:31Z

This parser is not that complicated, you don't need jumps. My example below ignores some corner cases, just showing the priciple. You can expand it to your needs.

#include <stdio.h>
#include <ctype.h>
int main(int argc, char **argv)
{
 const char *s, *e;
 for (e = argv[1]; *e; s = e) {
 while (isspace(*e)) e++;
 if (*e == '\"') {
 for (s = ++e; *e && *e != '\"'; e++);
 if (!*e) return 1; // missing closing quote
 } else if (*e == '\'') {
 for (s = ++e; *e && *e != '\''; e++);
 if (!*e) return 1; // missing closing quote
 } else {
 for (s = e; *e && !isspace(*e); e++);
 }
 printf("'%.*s'\n", (int)(e - s), s);
 if (*e == '\"' || *e == '\'')
 e++; // advance after closing quote
 }
 return 0;
}

I think this answer would be a lot more helpful with some discussion on how this approach differs from the original, and why you think it's better.
Your code doesn't cope with escape-sequences in double- or single-quoted strings like the original.

Stack Exchange Network

Splitting a command line in C

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Splitting a command line in C

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions