I want to handle input like the following:

```
ls -al | awk '{print 1ドル}'
```
To parse it, I couldn't simply tokenize at whitespace because of the quotation marks, so I made a special case for `awk` instead. The code below successfully parses this command-line input. Is there a better way?
```
int handleToken(int awk, char *token, char *params[100], int i) {
    while (token != NULL) {
        if (awk == 1) {
            /* token was pulled out up to the closing single quote */
            params[i++] = token;
            token = strtok(NULL, " ");
            awk = 0;
            continue;
        }
        if (strcmp(token, "awk") == 0) {
            /* special case: grab everything up to the next single quote */
            params[i++] = token;
            awk = 1;
            token = strtok(NULL, "\'");
            continue;
        }
        params[i++] = token;
        token = strtok(NULL, " ");
    }
    params[i] = NULL;
    return i;
}
```
1 Answer
When you say

> I want to handle input like the following [...]

I take you to mean that you want to tokenize a language similar to the one recognized by the standard POSIX shell (since you tagged [posix] and presented an example), or maybe even that exact language.
You posited in comments that perhaps you needed a parser generator such as `yacc` or `bison` for this job. Although these tools can indeed do such a job, code for such a subsystem -- a "lexical analyzer" or "scanner" -- is more often generated via a different kind of code generator; `lex` and its GNU variant `flex` are the canonical tools for this purpose. These two particular tools allow you to describe your tokens (and separators) via regular expressions, and from such descriptions they generate C code for a table-based DFA that processes the input character by character and splits it into tokens. This is certainly one plausible way you could proceed.
On the other hand, depending on the language you want to recognize, it's not necessarily unreasonable to write your own lexical analyzer from scratch, or in favorable cases to use an available function such as `strtok()` to do the job. This affords the possibility of a better-tuned implementation than a general-purpose tool would produce (or not), it does not bring an additional language into the project, and the source might even be smaller.
You ask,

> Is there a better way?

and the answer is certainly "yes". Beyond the broad generalities given so far, however, Code Review is not really a good platform for what essentially boils down to designing a complete replacement for your present code. From here on, therefore, I focus on the code you actually presented:
- The name of your function is poorly chosen, as it does not describe its behavior very well. For what it seems meant to do, `tokenize`, `find_tokens`, `scan`, `analyze`, or a similar name would be more fitting.
- The name of the `token` parameter seems poorly chosen. Tokens (or in shell terminology, "words") are the function's output, not a good description of its likely input. Maybe this parameter should be a "line", or some such.
- On the other hand, such code would be more flexible if it could read directly from the input, instead of being presented with a pre-read string.
- That your function accepts parameters `awk` and `i` suggests that it is intended to be used multiple times, incrementally, to build up one `params` array, but if so then it has no mechanism to return its present value of `awk` to the caller, which it seems would be necessary.
- You have no protection against overrunning the bounds of your `params` array. It would probably be better to pass a double pointer, and let the function (re)allocate space as needed (a sketch of such a growable array follows this list).
- When you identify tokens / words, you should consider making copies instead of using pointers into the original string (the latter being what `strtok()` provides), as there are all kinds of unpleasant surprises that could result from using the space in the original string as storage for the tokenized words of the command.
- And of course, I presume you know that `strtok()` consumes / modifies / destroys the input string.
- Calling out `awk` as a special case is untenable for parsing shell input, as you cannot enumerate all the possible commands, nor know what form of argument or arguments to expect. Even `awk` itself does not necessarily take a single-quoted (or even double-quoted) argument. (Consider the perfectly viable command `awk {print}`.)
- The standard shell does not organize input into bare words and quoted strings. Single- and double-quoted subsequences can be concatenated with other quoted or unquoted sequences to form a single token. For example, these mean the same thing to the shell: `'one shell word'`, `one' 'shell" "word`, and `o"ne"' shel'l" "wo'r'"d"''`.
- If your shell language is simpler, so that you don't need to worry about internal quotes, then you probably still should not use the preceding token to predict quoting status. Instead, look at the first character of each substring itself to see whether it is a single or double quote (the scanner sketch after this list takes this approach).
- The shell recognizes escape sequences inside double-quoted strings; sorting those out requires a smarter scanner than `strtok()`.
- Although the standard shell does not split words at quotes, there are several characters other than whitespace where it does split them (when they appear unquoted): `|`, `&`, `;`, `(`, `)`, `<`, and `>`. So, for example, the shell will interpret the input `ls -al|awk '{print 1ドル}'` exactly the same as it does your example command, even though there is no whitespace around the `|`.
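To illustrate the double-pointer / reallocation and token-copying points above, here is a minimal sketch (all of the names are mine, not from your code): a small structure that grows its array on demand and stores a copy of each word, so the caller's input buffer is never used as token storage.

```
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable, NULL-terminated array of copied words. */
struct word_list {
    char **words;
    size_t count;
    size_t capacity;
};

/* Append a copy of the first `len` characters of `s`.
 * Returns 0 on success, -1 on allocation failure. */
static int word_list_append(struct word_list *wl, const char *s, size_t len) {
    if (wl->count + 2 > wl->capacity) {      /* +2: new word plus NULL slot */
        size_t cap = wl->capacity ? wl->capacity * 2 : 8;
        char **tmp = realloc(wl->words, cap * sizeof *tmp);
        if (tmp == NULL)
            return -1;
        wl->words = tmp;
        wl->capacity = cap;
    }
    char *copy = malloc(len + 1);            /* copy, so the original line
                                                may be modified or freed */
    if (copy == NULL)
        return -1;
    memcpy(copy, s, len);
    copy[len] = '0円';
    wl->words[wl->count++] = copy;
    wl->words[wl->count] = NULL;             /* keep the array terminated */
    return 0;
}
```

A caller would free each stored word, then the array itself, when finished with the list.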
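And to illustrate scanning character by character rather than with `strtok()`, here is a sketch of a word scanner under the simplifying assumptions already noted (it honors single- and double-quoted spans, including concatenated pieces, but not escape sequences), splitting at the unquoted operator characters listed above. Each word it finds could be handed to something like the hypothetical `word_list_append()` from the previous sketch.

```
#include <ctype.h>
#include <string.h>

/* Scan one word starting at *cursor; returns a pointer to the start of
 * the word and sets *len to its length, or returns NULL at end of input.
 * Quotes are left in place here; a real shell would strip them. */
static const char *next_word(const char **cursor, size_t *len) {
    const char *p = *cursor;

    while (*p != '0円' && isspace((unsigned char)*p))  /* skip blanks */
        p++;
    if (*p == '0円')
        return NULL;

    const char *start = p;
    if (strchr("|&;()<>", *p) != NULL) {              /* operator: one word */
        p++;
    } else {
        /* A word runs until unquoted whitespace or an operator character;
         * quoted spans may be concatenated, as in o"ne"' shel'l. */
        while (*p != '0円') {
            if (*p == '\'' || *p == '"') {
                char quote = *p++;
                while (*p != '0円' && *p != quote)     /* span to closing quote */
                    p++;
                if (*p == quote)
                    p++;
            } else if (isspace((unsigned char)*p)
                       || strchr("|&;()<>", *p) != NULL) {
                break;
            } else {
                p++;
            }
        }
    }
    *len = (size_t)(p - start);
    *cursor = p;
    return start;
}
```

Fed the input `ls -al|awk '{print 1ドル}'`, this yields the words `ls`, `-al`, `|`, `awk`, and `'{print 1ドル}'` (quotes still attached), exactly as for the whitespace-separated spelling.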
Comments:

- Niklas Rosencrantz (Apr 17, 2016 at 17:47): Thank you very much! I've proceeded to ask more questions about the same project here if you would like to see how I could improve it. I think that perhaps I won't need flex/bison or the lemon parser and that I can write my own parser / tokenizer / scanner this way. My goal is to make a good POSIX shell that is at least as good as dash.
- [...] `echo 'hello world' | awk '{print 1ドル}'` that is not trivial to tokenize.