Tokenizing a shell command

Question 1

I want to handle input like the following

ls -al | awk '{print 1ドル}'

Now to parse it I couldn't tokenize at whitespace because of the quotations and I had to make a special case for awk instead. The code below successfully parses the command-line input. Is there a better way?

int handleToken(int awk, char *token, char *params[100], int i) {
 while (token != NULL) {
 if (awk == 1) {
 params[i++] = token;
 token = strtok(NULL, " ");
 awk = 0;
 continue;
 }
 if (strcmp(token, "awk") == 0) {
 params[i++] = token;
 awk = 1;
 token = strtok(NULL, "\'");
 continue;
 }
 params[i++] = token;
 token = strtok(NULL, " ");
 }
 params[i] = NULL;
 return i;
}

Question 2

Could you expand a little on exactly what you mean by "handle"?

Question 3

@Mat'sMug I wonder if I must know yacc or bison to proceed. I want to "handle" input like echo 'hello world' | awk '{print 1ドル}' that is not trivial to tokenize.

Question 4

Could you give us more context about why you want to tokenize a shell command? Are you writing your own shell?

Question 5

@200_success Yes, I am writing my own shell and if you follow the link you can find the whole project. I'm looking at the source of the Almquist shell to learn how shells are made.

Question 6

If you want a revision of your code reviewed, you should ask another question. By editing your question, you are probably invalidating existing answers.

Question 7

When you say

I want to handle input like the following [...]

I take you to mean that you want to tokenize a language similar to the one recognized by the standard POSIX shell (since you tagged [posix] and presented an example), or maybe even that exact language.

You posited in comments that perhaps you needed a parser generator such as yacc or bison for this job. Although these tools can indeed do such a job, code for such a subsystem -- a "lexical analyzer" or "scanner" -- is more often generated via a different kind of code generator; lex and its GNU variant flex are the canonical tools for this purpose. These two particular tools allow you to describe your tokens (and separators) via regular expressions, and from such descriptions they generate C code for a table-based DFA that processes the language character by character and splits it into tokens. This is certainly one plausible way you could proceed.

On the other hand, depending on the language you want to recognize, it's not necessarily unreasonable to write your own lexical analyzer from scratch, or in favorable cases to use an available function such as strtok() to do the job. This affords the possibility of a better-tuned implementation than can be produced via a general-purpose tool (or not), it does not incorporate an additional language into the project, and the source might even be smaller.

You ask,

Is there a better way?

and the answer is certainly "yes". Other than broad generalities such as I have given so far, however, code review is not really a good platform for what essentially boils down to designing a complete replacement for your present code. From here out, therefore, I focus on the code you actually presented:

The name of your function is poorly chosen, as it does not describe its behavior very well. For what it seems meant to do, "tokenize", "find_tokens", "scan", "analyze", or a similar name would be more fitting.
The name of the token parameter seems poorly chosen. Tokens (or in shell terminology, "words") are the function's output, not a good description of its likely input. Maybe this parameter should be a "line", or such.
On the other hand, such code would be more flexible if it could read directly from the input, instead of being presented a pre-read string.
That your function accepts parameters awk and i suggests that it is intended to be used multiple times, incrementally, to build up one params array, but if so then it has no mechanism to return its present value of awk to the caller, which it seems would be necessary.
You have no protection against overrunning the bounds of your params array. It would probably be better to pass a double pointer, and let the function (re)allocate space as needed.
When you identify tokens / words, you should consider making copies instead of using pointers into the original string (the latter being what strtok() provides) as there are all kinds of unpleasant surprises that could result from using the space in the original string as storage for the tokenized words of the command.
And of course, I presume you know that strtok() consumes / modifies / destroys the input string.
Calling out awk as a special case is untenable for parsing shell input, as you cannot enumerate all the possible commands, nor know what form of argument or arguments to expect. Even awk itself does not necessarily take a single-quoted (or even double-quoted) argument. (Consider the perfectly viable command awk {print}.)
The standard shell does not organize input into bare words and quoted strings. Single- and double-quoted subsequences can be concatenated with other quoted or unquoted sequences to form a single token. For example, these mean the same thing to the shell: 'one shell word', one' 'shell" "word, and o"ne"' shel'l" "wo'r'"d"''.
If your shell language is simpler, so that you don't need to worry about internal quotes, then you probably still should not use the preceding token to predict quoting status. Instead, look at the first character of each substring itself to see whether it is a single or double quote.
The shell recognizes escape sequences inside double-quoted strings; this requires a smarter scanner than strtok() to sort out.
Although the standard shell does not split words at quotes, there are several characters other than whitespace where it does split them (when they appear unquoted): | & ; ( ) < >. So for example, the shell will interpret the input ls -al|awk '{print 1ドル}' exactly the same as it does your example command, even though there is no whitespace around the |.

Question 8

Thank you very much! I've proceeded asking more question about the same project here if you would like to see how I could improve it. I think that perhaps I won't need flex/bison or the lemon parser and that I can write my own parser / tokenizer / scanner this way. My goal is to make a good POSIX shell that is at least as good as dash.

PellMel PellMel 9314 silver badges5 bronze badges · Accepted Answer · 2016-04-15 22:22:44Z

When you say

I want to handle input like the following [...]

I take you to mean that you want to tokenize a language similar to the one recognized by the standard POSIX shell (since you tagged [posix] and presented an example), or maybe even that exact language.

You posited in comments that perhaps you needed a parser generator such as yacc or bison for this job. Although these tools can indeed do such a job, code for such a subsystem -- a "lexical analyzer" or "scanner" -- is more often generated via a different kind of code generator; lex and its GNU variant flex are the canonical tools for this purpose. These two particular tools allow you to describe your tokens (and separators) via regular expressions, and from such descriptions they generate C code for a table-based DFA that processes the language character by character and splits it into tokens. This is certainly one plausible way you could proceed.

On the other hand, depending on the language you want to recognize, it's not necessarily unreasonable to write your own lexical analyzer from scratch, or in favorable cases to use an available function such as strtok() to do the job. This affords the possibility of a better-tuned implementation than can be produced via a general-purpose tool (or not), it does not incorporate an additional language into the project, and the source might even be smaller.

You ask,

Is there a better way?

and the answer is certainly "yes". Other than broad generalities such as I have given so far, however, code review is not really a good platform for what essentially boils down to designing a complete replacement for your present code. From here out, therefore, I focus on the code you actually presented:

The name of your function is poorly chosen, as it does not describe its behavior very well. For what it seems meant to do, "tokenize", "find_tokens", "scan", "analyze", or a similar name would be more fitting.
The name of the token parameter seems poorly chosen. Tokens (or in shell terminology, "words") are the function's output, not a good description of its likely input. Maybe this parameter should be a "line", or such.
On the other hand, such code would be more flexible if it could read directly from the input, instead of being presented a pre-read string.
That your function accepts parameters awk and i suggests that it is intended to be used multiple times, incrementally, to build up one params array, but if so then it has no mechanism to return its present value of awk to the caller, which it seems would be necessary.
You have no protection against overrunning the bounds of your params array. It would probably be better to pass a double pointer, and let the function (re)allocate space as needed.
When you identify tokens / words, you should consider making copies instead of using pointers into the original string (the latter being what strtok() provides) as there are all kinds of unpleasant surprises that could result from using the space in the original string as storage for the tokenized words of the command.
And of course, I presume you know that strtok() consumes / modifies / destroys the input string.
Calling out awk as a special case is untenable for parsing shell input, as you cannot enumerate all the possible commands, nor know what form of argument or arguments to expect. Even awk itself does not necessarily take a single-quoted (or even double-quoted) argument. (Consider the perfectly viable command awk {print}.)
The standard shell does not organize input into bare words and quoted strings. Single- and double-quoted subsequences can be concatenated with other quoted or unquoted sequences to form a single token. For example, these mean the same thing to the shell: 'one shell word', one' 'shell" "word, and o"ne"' shel'l" "wo'r'"d"''.
If your shell language is simpler, so that you don't need to worry about internal quotes, then you probably still should not use the preceding token to predict quoting status. Instead, look at the first character of each substring itself to see whether it is a single or double quote.
The shell recognizes escape sequences inside double-quoted strings; this requires a smarter scanner than strtok() to sort out.
Although the standard shell does not split words at quotes, there are several characters other than whitespace where it does split them (when they appear unquoted): | & ; ( ) < >. So for example, the shell will interpret the input ls -al|awk '{print 1ドル}' exactly the same as it does your example command, even though there is no whitespace around the |.

Thank you very much! I've proceeded asking more question about the same project here if you would like to see how I could improve it. I think that perhaps I won't need flex/bison or the lemon parser and that I can write my own parser / tokenizer / scanner this way. My goal is to make a good POSIX shell that is at least as good as dash.

Stack Exchange Network

Tokenizing a shell command

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Tokenizing a shell command

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions