This is my second attempt at K&R exercise 1-23:
Write a program to remove all comments from a C program. Don't forget to handle quoted strings and character constants properly. C comments don't nest.
I was previously packing characters into various I/O buffers, and someone suggested this was not a good choice, so I thought I'd try something more along the lines of a state machine:
#include <stdio.h>
#define NORMAL 0
#define SINGLE_QUOTE 1
#define DOUBLE_QUOTE 2
#define SLASH 3
#define MULTI_COMMENT 4
#define INLINE_COMMENT 5
#define STAR 6
int state_from_normal(char prev_symbol, char symbol)
{
    int state = NORMAL;
    if (symbol == '\'' && prev_symbol != '\\') {
        state = SINGLE_QUOTE;
    } else if (symbol == '"') {
        state = DOUBLE_QUOTE;
    } else if (symbol == '/') {
        state = SLASH;
    }
    return state;
}

int state_from_single_quote(char prev_symbol, char symbol)
{
    int state = SINGLE_QUOTE;
    if (symbol == '\'' && prev_symbol != '\\') {
        state = NORMAL;
    }
    return state;
}

int state_from_double_quote(char prev_symbol, char symbol)
{
    int state = DOUBLE_QUOTE;
    if (symbol == '"' && prev_symbol != '\\') {
        state = NORMAL;
    }
    return state;
}

int state_from_slash(char symbol)
{
    int state = SLASH;
    if (symbol == '*') {
        state = MULTI_COMMENT;
    } else if (symbol == '/') {
        state = INLINE_COMMENT;
    } else {
        state = NORMAL;
    }
    return state;
}

int state_from_multi_comment(char symbol)
{
    int state = MULTI_COMMENT;
    if (symbol == '*') {
        state = STAR;
    }
    return state;
}

int state_from_star(char symbol)
{
    int state = STAR;
    if (symbol == '/') {
        state = NORMAL;
    } else {
        state = MULTI_COMMENT;
    }
    return state;
}

int state_from_inline_comment(char symbol)
{
    int state = INLINE_COMMENT;
    if (symbol == '\n') {
        state = NORMAL;
    }
    return state;
}

int state_from(int prev_state, char prev_symbol, char symbol)
{
    switch(prev_state) {
        case NORMAL :
            return state_from_normal(prev_symbol, symbol);
        case SINGLE_QUOTE :
            return state_from_single_quote(prev_symbol, symbol);
        case DOUBLE_QUOTE :
            return state_from_double_quote(prev_symbol, symbol);
        case SLASH :
            return state_from_slash(symbol);
        case MULTI_COMMENT :
            return state_from_multi_comment(symbol);
        case INLINE_COMMENT :
            return state_from_inline_comment(symbol);
        case STAR :
            return state_from_star(symbol);
        default :
            return -1;
    }
}

int main(void)
{
    char input;
    char symbol = '\0';
    char prev_symbol;
    int state = NORMAL;
    int prev_state;
    while ((input = getchar()) != EOF) {
        prev_symbol = symbol;
        prev_state = state;
        symbol = input;
        state = state_from(prev_state, prev_symbol, symbol);
        if (prev_state == SLASH && state == NORMAL) {
            putchar(prev_symbol);
        }
        if (prev_state != STAR && state < SLASH) {
            putchar(symbol);
        }
    }
}
2 Answers
Bugs
Here are three examples that will cause problems with your state machine:
'\\'
"\\"
/* comment **/
In the first two examples, your state machine doesn't recognize the end quotes because of the preceding backslash characters, even though the backslashes were already "consumed" by the other backslashes.
In the third example, the state machine fails to recognize the end of comment. The problem is that the double star should cause the state to remain in the STAR state but it instead reverts to the MULTI_COMMENT state.
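One possible way to handle all three cases, sketched here under assumptions rather than as a definitive fix: give each literal its own escape state, so a backslash simply consumes the next character, and let a repeated * stay in STAR. The names SQ_ESCAPE, DQ_ESCAPE, next_state and end_state are hypothetical and not part of the original code, and the output logic is left out.

```c
/* Original states plus two hypothetical escape states. */
enum state {
    NORMAL, SINGLE_QUOTE, DOUBLE_QUOTE, SLASH,
    MULTI_COMMENT, INLINE_COMMENT, STAR,
    SQ_ESCAPE,   /* just saw a backslash inside a character constant */
    DQ_ESCAPE    /* just saw a backslash inside a string literal */
};

/* Transition function that needs only the current state and symbol,
   so no previous character has to be remembered. */
enum state next_state(enum state s, int c)
{
    switch (s) {
    case NORMAL:
        if (c == '\'') return SINGLE_QUOTE;
        if (c == '"')  return DOUBLE_QUOTE;
        if (c == '/')  return SLASH;
        return NORMAL;
    case SINGLE_QUOTE:
        if (c == '\\') return SQ_ESCAPE;
        return c == '\'' ? NORMAL : SINGLE_QUOTE;
    case SQ_ESCAPE:
        return SINGLE_QUOTE;          /* escaped character is consumed */
    case DOUBLE_QUOTE:
        if (c == '\\') return DQ_ESCAPE;
        return c == '"' ? NORMAL : DOUBLE_QUOTE;
    case DQ_ESCAPE:
        return DOUBLE_QUOTE;          /* escaped character is consumed */
    case SLASH:
        if (c == '*') return MULTI_COMMENT;
        if (c == '/') return INLINE_COMMENT;
        return next_state(NORMAL, c); /* a lone slash; c may open a literal */
    case MULTI_COMMENT:
        return c == '*' ? STAR : MULTI_COMMENT;
    case STAR:
        if (c == '/') return NORMAL;
        if (c == '*') return STAR;    /* a second star keeps us in STAR */
        return MULTI_COMMENT;
    case INLINE_COMMENT:
        return c == '\n' ? NORMAL : INLINE_COMMENT;
    }
    return NORMAL;
}

/* Run the machine over a whole string and report the final state. */
enum state end_state(const char *text)
{
    enum state s = NORMAL;
    for (; *text != '\0'; ++text) {
        s = next_state(s, (unsigned char)*text);
    }
    return s;
}
```

With this shape, each of the three inputs above should leave the machine back in NORMAL, and the "past two characters" bookkeeping is no longer needed.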
- These bugs also apply to the OP's other question on the same subject. – holroy, Dec 27, 2015 at 13:47
- Good catches! I actually ran into the first two bugs soon after "completing" this. The fix I came up with requires keeping track of the past two characters rather than just the past one. I'm not super happy about that, but it was the best I could come up with. That third bug is something I hadn't thought of. I guess I'll need to tweak my state_from_star function a little. – ivan, Dec 27, 2015 at 17:53
- @holroy Ah I see. I didn't look at that other question because by the time I logged on, it had already been answered multiple times and accepted. – JS1, Dec 27, 2015 at 18:37
- @ivan, The quoting is a nut to crack. Try whether you can handle the following: " ... ", "... \" ...", ".... \"", or what about "... \\" or ... \\\"". All of these should be legal, and could be extended... – holroy, Dec 27, 2015 at 18:56
In all, this looks like pretty solid code. I have just a few suggestions that may help you improve your code.
Use an enum for related constants
The states are all related and not just standalone constants. For that reason, I'd recommend encapsulating them all in an enum:
typedef enum { NORMAL, SINGLE_QUOTE, DOUBLE_QUOTE, SLASH, MULTI_COMMENT, INLINE_COMMENT, STAR } state_e;
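As a rough illustration of what a named enum buys you: when a switch is written over an enum type without a default label, GCC and Clang can warn about a missing enumerator (-Wswitch, which -Wall enables). The helper state_name below is a hypothetical debugging aid, not something from the original program.

```c
enum state {
    NORMAL, SINGLE_QUOTE, DOUBLE_QUOTE, SLASH,
    MULTI_COMMENT, INLINE_COMMENT, STAR
};

/* Hypothetical helper: map a state to a printable name.
   There is deliberately no default label, so the compiler can
   flag an enumerator that is added later but not handled here. */
const char *state_name(enum state s)
{
    switch (s) {
    case NORMAL:         return "NORMAL";
    case SINGLE_QUOTE:   return "SINGLE_QUOTE";
    case DOUBLE_QUOTE:   return "DOUBLE_QUOTE";
    case SLASH:          return "SLASH";
    case MULTI_COMMENT:  return "MULTI_COMMENT";
    case INLINE_COMMENT: return "INLINE_COMMENT";
    case STAR:           return "STAR";
    }
    return "UNKNOWN";    /* reached only for an out-of-range value */
}
```

A helper like this also makes it cheap to trace transitions while debugging, for example by printing state_name(prev_state) and state_name(state) for each input character.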
Restructure or comment to make the code easier to read
The most significant feature in the code is a state machine. In order to understand a state machine, I typically need to know what's being processed (an input C program), what the states are (which are enumerated) and how the code transitions from state to state. The last bit is the part that is a little tough to decipher. It's probably mostly right, but it's hard to decode. For instance, see how long it takes you to answer the questions, "How does the code enter the SLASH state?", "How does it leave the SLASH state?"
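One restructuring that can make those questions quick to answer is a transition table, where every row is one (from, input, to) rule. The sketch below is only an assumption about how that might look: it covers just the comment-related states, leaves out quote handling and output, and uses hypothetical names (struct transition, ANY, step). Note that it also includes a STAR-on-star row, which the original code does not have.

```c
#include <stddef.h>

enum state { NORMAL, SLASH, MULTI_COMMENT, INLINE_COMMENT, STAR };

#define ANY (-1)   /* wildcard: matches any input character */

struct transition {
    enum state from;
    int on;          /* input character, or ANY */
    enum state to;
};

/* Rows are tried top to bottom; the first match wins. */
static const struct transition table[] = {
    { NORMAL,         '/',  SLASH          },
    { SLASH,          '*',  MULTI_COMMENT  },
    { SLASH,          '/',  INLINE_COMMENT },
    { SLASH,          ANY,  NORMAL         },
    { MULTI_COMMENT,  '*',  STAR           },
    { STAR,           '/',  NORMAL         },
    { STAR,           '*',  STAR           },
    { STAR,           ANY,  MULTI_COMMENT  },
    { INLINE_COMMENT, '\n', NORMAL         },
};

/* Look up the next state; if no row matches, stay where we are. */
enum state step(enum state s, int c)
{
    size_t i;
    for (i = 0; i < sizeof table / sizeof table[0]; ++i) {
        if (table[i].from == s &&
            (table[i].on == ANY || table[i].on == c)) {
            return table[i].to;
        }
    }
    return s;
}
```

Reading the table, SLASH is entered only from NORMAL on '/', and it is always left after exactly one more character.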
Don't forget about line continuation
A \ immediately followed by a newline is a line continuation in C. One effect relevant to this program is that it can continue a single-line comment:
// BAD is never defined \
#define BAD 1
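The compiler splices such lines (translation phase 2) before it recognizes comments (phase 3). A minimal sketch of doing the same ahead of the state machine is shown below; splice_lines is a hypothetical helper working on an in-memory string, and it assumes the caller provides a large enough output buffer. One trade-off to be aware of is that dropping the pairs changes the line structure of whatever non-comment text is passed through.

```c
#include <stddef.h>

/* Copy src to dst, dropping every backslash-newline pair.
   dst must have room for strlen(src) + 1 bytes.
   Returns the number of characters written, excluding the terminator. */
size_t splice_lines(const char *src, char *dst)
{
    size_t n = 0;
    while (*src != '\0') {
        if (src[0] == '\\' && src[1] == '\n') {
            src += 2;              /* skip the continuation */
        } else {
            dst[n++] = *src++;     /* ordinary character */
        }
    }
    dst[n] = '\0';
    return n;
}
```

After splicing, the two lines of the example above become a single line, so a machine that ends INLINE_COMMENT at '\n' would drop the #define along with the comment, which matches what a compiler does.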
Don't forget about trigraphs
Many people either don't use or don't know about trigraphs, but they exist and, for better or worse, are still part of the language (up to C17; C23 removes them). This affects this particular program because ??/ is the trigraph for \, which is the line continuation character. Related to the preceding point, BAD is never defined in this code fragment:
// are you surprised??/
#define BAD 1
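If trigraphs are to be honoured, they have to be replaced before line splicing, since that is the order the compiler uses (phase 1, then phase 2). The following is a sketch under the same assumptions as a string-based prefilter: trigraph_char and replace_trigraphs are hypothetical names, and the caller is assumed to supply a sufficiently large buffer. The nine mappings are the ones defined by the C standard.

```c
#include <stddef.h>

/* Given the third character of a candidate trigraph, return the
   character it stands for, or 0 if it is not a trigraph. */
int trigraph_char(int third)
{
    switch (third) {
    case '=':  return '#';
    case '/':  return '\\';
    case '\'': return '^';
    case '(':  return '[';
    case ')':  return ']';
    case '!':  return '|';
    case '<':  return '{';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Copy src to dst, replacing each trigraph with its single character.
   dst must have room for strlen(src) + 1 bytes. Returns the output length. */
size_t replace_trigraphs(const char *src, char *dst)
{
    size_t n = 0;
    while (*src != '\0') {
        int r = 0;
        if (src[0] == '?' && src[1] == '?') {
            r = trigraph_char((unsigned char)src[2]);
        }
        if (r != 0) {
            dst[n++] = (char)r;    /* three input characters become one */
            src += 3;
        } else {
            dst[n++] = *src++;
        }
    }
    dst[n] = '\0';
    return n;
}
```

When writing test strings for this in C source, it is safer to spell the second question mark as the escape \? so the compiler itself does not translate the trigraph first.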
- Trigraphs? I thought those were only used for tricking people on PPCG ;) – SirPython, Jan 1, 2016 at 23:19
- Given that the context of the exercise pre-dates the introduction of trigraphs by some margin, I'm not sure it's strictly necessary to consider them. But perhaps then one should refrain from using prototypes or single-line comments in the parser itself? P.S. yes, I was being facetious - the exercise has most benefit when the most rigour is applied. – Toby Speight, Sep 30, 2016 at 8:54
- [...] #define's with an enum.
- getchar returns an int, not a char, and that's real important. See the "Application Usage" note in the POSIX spec: pubs.opengroup.org/onlinepubs/9699919799/functions/getchar.html
- [...] input to int, problem solved.