In one of my homework projects, I have to parse a matrix definition into a matrix data structure in C. The elements of the matrix will be given in a list of double. I need a simple tokenizer and parser for the task naturally, but I have not much experience in parsing. I implemented the tokenizer and parser. The tokenizer produces a few types of tokens. The list of token types includes tokEot
(ie. End of tokens), tokId
, tokNumber
, tokOparen
, tokCparen
, tokComma
, tokAssign
. The matrix definition will be in the form of identifier(Integer, Integer) = ( Real Number, Real Number, ... , Real Number)
. The input can have more than one matrix definition. The name of the data structure for the tokenizer is scanner
. I have no data structure for the parser, implemented it in a single function(parse_matrix
). I need also error checking in the parsing stage, so if the order in the definition of a matrix is not be followed, then I must report an error and exit the program. You can look at the supplementary code for more details. The program works without any problems. However, since I need a code review specifically for the function parse_matrix
and it might be redundant, I don't want to include the source code of the tokenizer (maybe in another question). Here is the code with the all necessary parts.
typedef struct matrix {
char* name; // The name of the matrix, the parsed identifiers will be stored here.
int m; // The number of rows
int n; // The number of cols
double* elements; // A chunk of doubles
}matrix;
// The enum for the token types
typedef enum token_type
{
tokEot, // End of tokens
tokId, // Identifier
tokNumber, // Either int or double
tokOparen, // (
tokCparen, // )
tokComma, // ,
tokAssign, // =
tokCount
}token_type;
typedef struct token_t
{
token_type type;
union
{
double data_double;
char* data_string;
int data_int;
};
}token_t;
typedef struct scanner
{
token_t current_token; // Last generated token
char* input_string; // Input string that contains matrix definitions
char* current_char; // Current character to be processed.
}scanner;
token_t next_token(scanner* pthis);
matrix* parse_matrix(scanner* pscanner);
void handle_error(const char* error_text);
matrix* new_matrix(char* name, int m, int n);
int main()
{
char* input = "mat1(3, 3) = ( 1.0, -3.0, 5.0, "
"2.0, -1.0, 5.0, "
"1.0, -3.0, 5.0 )";
scanner* float_scanner = new_scanner(input);
matrix* mat1 = parse_matrix(float_scanner);
system("pause");
}
matrix* new_matrix(char* name, int m, int n)
{
matrix* mat = malloc(sizeof(matrix));
mat->m = m;
mat->n = n;
mat->name = name;
mat->elements = malloc(sizeof(double) * m * n);
return mat;
}
// tokId --> ( --> Int --> , --> Int --> ) --> = --> ( --> Double --> , --> Double ... --> )
matrix* parse_matrix(scanner* pthis)
{
matrix* mat;
int count = 0;
char* id_name = NULL;
int m = 0;
int n = 0;
token_t pre_token;
token_t token;
do
{
token = next_token(pthis);
//If token type is a identifier, then we will try to parse a matrix declaration.
if (token.type == tokId)
{
// Duplicate the token data
id_name = strdup(token.data_string);
token = next_token(pthis);
// I must have an open parenthesis after an identifier
if (token.type == tokOparen)
{
token = next_token(pthis);
// Parsing the dimensions of the matrix
if (token.type == tokNumber)
{
m = token.data_int;
token = next_token(pthis);
// There is a comma between the dimensions of matrix
if (token.type == tokComma)
{
token = next_token(pthis);
// A number must follow the comma
if (token.type == tokNumber)
{
n = token.data_int;
token = next_token(pthis);
if (token.type == tokCparen)
{
// Allocate the matrix with the specified dimensions.
mat = new_matrix(id_name, m, n);
token = next_token(pthis);
if (token.type == tokAssign)
{
token = next_token(pthis);
if (token.type == tokOparen)
{
token = next_token(pthis);
// Finally, parse the elements.
if (token.type == tokNumber)
*(mat->elements + count++) = token.data_double;
else
handle_error("Expected a number after open paranthesis.");
do
{
token = next_token(pthis);
if (token.type == tokCparen)
break;
if (token.type == tokComma)
{
token = next_token(pthis);
if (token.type == tokNumber)
*(mat->elements + count++) = token.data_double;
else
handle_error("Expected a number after a comma.");
}
else
handle_error("Expected a comma after a number.");
} while (token.type != tokCparen);
// If the user didn't specify enough elements, we should
// fill the rest with zero.
if (count < m * n)
for (int i = count; i < m * n; i++)
*(mat->elements + i) = 0.0;
return mat;
}
else
handle_error("Expected an open parenthesis after an assignment.");
}
else
handle_error("Expected an assigment.");
}
else
handle_error("Expected a close parenthesis.");
}
else
handle_error("Expected a integer number.");
}
else
handle_error("Expected a comma.");
}
else
handle_error("Expected a integer number.");
}
else
handle_error("Expected an open parenthesis.");
}
} while (token.type != tokEot);
}
void handle_error(const char* error_text)
{
perror(error_text);
exit(-1);
}
```
1 Answer 1
Code looks pretty good. I prefer some other style guide, but this looks consistent, and consistency is important too.
Indentation
Arbitrarily, you can spell the string constant like this:
char* input =
"mat1(3, 3) = "
"( 1.0, -3.0, 5.0, "
"2.0, -1.0, 5.0, "
"1.0, -3.0, 5.0 )";
Placing the similar tokens directly one under another can find errors in those tokens in a moment.
Unit tests
You should test your code. There are literally dozens of unit test frameworks for C; and using unit tests will make you immediately find if new changes would break something later. Just do it. Besides, you're already testing the code, but in a much less efficient way - this code has one test in a main()
function.
Mixing model and presentation
Data transformations should be some inner job, usually called model, while reporting errors to user is for presentation. The parser would be much more portable and reusable if it wouldn't rely on a specific output system, maybe returning NULL on errors.
system("pause");
The worst thing in this code. It relies on specific OS and calls a far more complex program than this one for just one discarded input. Your IDE probably has some tools to capture the output after the program exit, just look for it.
if-else
staircase
I think this is why you've posted the code here. Well, it may be a good habit to put more likely option into the if
branch and less likely into the else
; but here the execution stops on handle_error
, so it can be rewritten in a much more linear way:
token = next_token(pthis);
if (token.type != tokOparen)
handle_error("Expected an open parenthesis.");
token = next_token(pthis);
if (token.type != tokNumber)
handle_error("Expected a integer number.");
etc. This will make the code not so structured (though it isn't structured now), but much more readable - except for one detail. It's not clear for readers that handle_error
breaks the execution. So, maybe, it should be something like
token = next_token(pthis);
if (token.type != tokOparen)
return handle_error("Expected an open parenthesis.");
token = next_token(pthis);
if (token.type != tokNumber)
return handle_error("Expected a integer number.");
This looks much better. Of course, handle_error
now should return NULL
after exit(-1)
. Maybe commented out to avoid compiler warnings.
One other option for such situations is a do-while(false);
construction:
do { //while(false) - you should always comment this
do_some_work();
if(something_wrong())
break;
do_some_more_work();
if(one thing more wrong)
break;
do_even_more_work();
if(still wrong)
break;
do_final_work();
} while(false);
-
\$\begingroup\$ +1 for your answer. But I cannot upvote since I have not enough reputation. That linear way sounds better! \$\endgroup\$jtxkopt - STOP GENOCIDE– jtxkopt - STOP GENOCIDE2021年12月05日 07:20:43 +00:00Commented Dec 5, 2021 at 7:20
-
\$\begingroup\$ I needed really such a style guide, +1 also for that link. \$\endgroup\$jtxkopt - STOP GENOCIDE– jtxkopt - STOP GENOCIDE2021年12月05日 07:29:35 +00:00Commented Dec 5, 2021 at 7:29