Is this a good C program for splitting a command line without doing any expansion on it? Don't worry too much about main()
and the output -- those are for testing.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <assert.h>
void pullwhitespace(char **input);
int main (int argc, char **argv)
{
if (argc != 2) {
fputs("Must have exactly one command line argument\n", stderr);
return 64;
}
char *input = argv[1];
char *outstring = malloc(strlen(argv[1]) + 1);
char *outptr = outstring;
char a;
pullwhitespace(&input);
NORMAL:
/* We are in an unquoted part of the input */
switch ((a = *input++))
{
case ' ' :
case '\t':
*outptr++ = '0円';
pullwhitespace(&input);
goto NORMAL;
case '\n':
case '0円':
goto DONE;
case '\\':
if (((a = *input++)) == '\n') {
break;
} else if (a) {
*outptr++ = a;
goto NORMAL;
} else {
goto FAIL;
}
case '\'':
goto QUOTE;
case '"':
goto DQUOTE;
default:
*outptr++ = a;
goto NORMAL;
}
assert(0);
QUOTE:
/* We are in a single quoted string */
while (((a = *input++)) != '\'') {
if (a) {
*outptr++ = a;
} else {
goto FAIL;
}
}
goto NORMAL;
DQUOTE:
/* We are in a double quoted string */
switch ((a = *input++))
{
case '0円':
goto FAIL;
case '\\':
{
char b = *input++;
switch (b)
{
case '0円':
goto FAIL;
default:
*outptr++ = a;
case '"':
case '\\':
*outptr++ = b;
case '\n':
break;
}
}
goto DQUOTE;
default:
*outptr++ = a;
goto DQUOTE;
case '"':
goto NORMAL;
}
assert(0);
FAIL:
fprintf(stderr, "%s: Invalid string input\n", argv[0]);
return 1;
assert(0);
DONE:
*outptr++ = '0円';
*outptr++ = '0円';
errno = 0;
fwrite(outstring, 1, (size_t)(outptr - outstring), stdout);
int error = errno;
if(ferror(stdout)) {
fprintf(stderr, "%s: write error: %s\n", argv[0], strerror(error));
return 1;
}
fflush(stdout);
if ( errno != 0) {
fprintf(stderr, "%s: write error: %s\n", argv[0], strerror(error));
return 1;
}
return 0;
}
void pullwhitespace (char **inputptr)
{
char *input = *inputptr;
char a;
while (( a = *input)) {
if (a != ' ' && a != '\t') {
return;
}
input++;
}
}
I know that it is FULL of goto statements (more than every other program I have ever made put together) -- is that okay?
The reason for the goto
statements is that this is meant as a lightweight state machine. Intended applications include parsing command line arguments or (possibly - this is far fetched) for the parser (not supporting code) to be the basis of a kernel module to parse shebang lines, allowing multiple arguments to be passed to a shebang interpreter.
3 Answers 3
I doubt that this code actually works. Especially pullwhitespace
doesn't do what you probably intended (hint: it doesn't modify the argument, it makes a copy of the pointer and advances it). Have you tested it before submitting it for review here?
Anyway: You might want to have a look here for alternative ways to implement a state machine. You approach is similar to Remo.D's answer in the linked question but I think his macros make it more readable.
A few things I noticed:
When you are in
NORMAL
and read a white space you do this*outptr++ = '0円';
Given that strings in C are NULL terminated character sequences this seems dodgy as it will terminate the output right there unless you tell someone to read further by supplying a length. If you intend to split on white space then you should store the results in achar **
likeargv
.Again in
NORMAL
:case '\\': if (((a = *input++)) == '\n') { break; } else if (a) { *outptr++ = a; goto NORMAL; } else { goto FAIL; }
So if you find a
\
followed by a newline youbreak
theswitch
which is followed by anassert(0)
. Normallyassert
should be used to state invariants as "this function expects this condition to be true and it is probably a bug in the program if it is not". Specificallyassert(0)
should be used as "should never execute this code path". It seems weird that external user input can trigger anassert
just because the program doesn't like he input. Especially since you have aFAIL
state anyway.Also the
else
condition evaluation depends on the side effect of theif
condition evaluation - not very nice. You should read the next character first in a separate line and then evaluate. This makes it much clearer. So this case could probably have been written as:case '\\': a = *input++; if (a && a != '\n') { *outptr++ = a; goto NORMAL; } goto FAIL;
If you are in
DQUOTE
and encounter a\
then you read another character from the input. However that character simply gets swallowed because in the default case you do this:*outptr++ = a;
which will write the\
to the output and skip the current character (held inb
).
Comments about pullwhitespace()
:
Code is broken. Needs work to update
*inputptr
.Needs better name than
inputptr
. Maybestring_ptr
. Hmm, no much better.Rather than define "whitespace" as
' '
and'\t'
, use the C definition of whitesapce.[Edit] Deleted
const
susggestion.void pullwhitespace (char **inputptr) { char *input = *inputptr; char a; while ((a = *input) != '0円') { if (!isspace((unsigned char) a)) { break; } input++; } *inputptr = input; }
If one accept using the C standard isspace()
further simplification results in
void pullwhitespace (char **inputptr) {
char *input = *inputptr;
while (isspace(*input)) input++;
*inputptr = input;
}
-
\$\begingroup\$ Adding
const
to the definition of parameterinputptr
doesn't work the way you'd expect (see c-faq.com/ansi/constmismatch.html). You have also modified the function in ways that make it worse than the original (first commented line). \$\endgroup\$William Morris– William Morris2013年12月17日 20:29:03 +00:00Commented Dec 17, 2013 at 20:29 -
\$\begingroup\$ @William Morris I have seen the error of my ways concerning
const
and correctedwhile ((a = *(*inputptr)) != '0円')
. Please detail about your 2nd point, what is worse? \$\endgroup\$chux– chux2013年12月17日 20:51:48 +00:00Commented Dec 17, 2013 at 20:51 -
\$\begingroup\$ @William Morris I also see that
isspace()
needs anunsigned char
orEOF
. Is that your concern? \$\endgroup\$chux– chux2013年12月17日 21:25:40 +00:00Commented Dec 17, 2013 at 21:25 -
\$\begingroup\$ The original code had
char *input = *inputptr;
and then de-referencedinput
, whereas your code omits that and uses eg*(*inputptr)
, which is more difficult to read (and hence worse). I would have made the OP'sinput
aconst
, ie:const char *input = *inputptr;
. Also it is arguable that the OP might have wanted just space and tab to be treated instead of all white-space (although I haven't read the code to see). \$\endgroup\$William Morris– William Morris2013年12月17日 22:56:13 +00:00Commented Dec 17, 2013 at 22:56 -
\$\begingroup\$ BTW, when you use @ to quote a name, there should be no spaces in the name. With @WilliamMorris I get a notification, with the space, I don't :-) \$\endgroup\$William Morris– William Morris2013年12月17日 22:58:23 +00:00Commented Dec 17, 2013 at 22:58
This parser is not that complicated, you don't need jumps. My example below ignores some corner cases, just showing the priciple. You can expand it to your needs.
#include <stdio.h>
#include <ctype.h>
int main(int argc, char **argv)
{
const char *s, *e;
for (e = argv[1]; *e; s = e) {
while (isspace(*e)) e++;
if (*e == '\"') {
for (s = ++e; *e && *e != '\"'; e++);
if (!*e) return 1; // missing closing quote
} else if (*e == '\'') {
for (s = ++e; *e && *e != '\''; e++);
if (!*e) return 1; // missing closing quote
} else {
for (s = e; *e && !isspace(*e); e++);
}
printf("'%.*s'\n", (int)(e - s), s);
if (*e == '\"' || *e == '\'')
e++; // advance after closing quote
}
return 0;
}
-
2\$\begingroup\$ I think this answer would be a lot more helpful with some discussion on how this approach differs from the original, and why you think it's better. \$\endgroup\$Michael Urman– Michael Urman2013年12月14日 15:42:18 +00:00Commented Dec 14, 2013 at 15:42
-
\$\begingroup\$ Your code doesn't cope with escape-sequences in double- or single-quoted strings like the original. \$\endgroup\$Matt– Matt2013年12月14日 16:20:36 +00:00Commented Dec 14, 2013 at 16:20
goto
s? Those are frowned upon in C... in fact, I'm frowning right now. :-( \$\endgroup\$goto
s need not be feared irrationally, and I've seen them in in lexers before, including in the CPython lexer (tokenizer.c), that said, it's good to figure out the alternatives, which you may end up liking more. \$\endgroup\$