I know there are probably thousands of INI parser implementations in C.
Nevertheless I want to implement one to learn the art of parsing grammar.
I just wrote a small piece of the INI parser and wanted to ask if I am on the right track:
void parse_line(const char *str, char **name) {
size_t length = strlen(str);
bool found_section = false;
for (size_t pos = 0; pos < length; ++pos) {
char current = str[pos];
bool is_last_ch = (pos + 1) >= length;
//dbg("Processing %c", current);
if (current == '[') {
size_t updated_pos = pos + 1;
ssize_t match_pos = -1;
if (is_last_ch) {
dbg("Reached end of line without closing ] (1)!");
return;
}
if (found_section) {
dbg("Already found a section!");
return;
}
//dbg("Found a [ at position %zu, looking for ] now", pos);
for (size_t x = updated_pos; x < length; ++x) {
if (str[x] == ';') {
break;
} else if (str[x] == ']') {
match_pos = x;
break;
}
}
if (match_pos == -1) {
dbg("Reached end of line without closing ] (2)!");
return;
}
//dbg("Found matching ] at position %zd", match_pos);
assert(match_pos >= updated_pos);
size_t length_of_section = match_pos - updated_pos;
//dbg("Length of name is %zu", length_of_section);
char *section_name = malloc(length_of_section + 1);
assert(section_name);
for (size_t x = 0; x < length_of_section; ++x) {
size_t i = x + updated_pos;
section_name[x] = str[i];
}
section_name[length_of_section] = 0;
// utilize `section_name` in some way...
//dbg("Section name is: %s", section_name);
*name = section_name;
// and then free() it somewhere
pos += length_of_section + 1;
assert(pos < length);
found_section = true;
} else if (current == ']') {
dbg("Found unmatched ] at position %zu", pos);
return;
} else if (current == ';') {
//dbg("End of line, bye!");
return;
} else {
dbg("Unexpected character %c", current);
return;
}
}
}
At the moment it only parses name of sections (like [name]
).
I want to focus on detecting malformed grammar:
[section] comment
-> invalid
comment [section]
-> invalid
etc.
Example:
parse_line("[name]", ...); // gives "name"
parse_line("[name];Comment", ...); // gives "name"
parse_line(";[name];Comment", ...); // gives (null)
parse_line("[name[]", ...); // works... but should not (i already have a fix for this in mind -> check for "[" in section name fetching ; but is this good?)
Is the approach to ''wait'' for a character and then process from there on a robust way to parse things like INI files?
-
1\$\begingroup\$ I guess you want to write it by hand to learn what compilers are really doing. So I'd recommend to try writing a recursive descent parser. State machines are something for code generators. After that you could try to have a look at the boost library which features a parser combinator and implement one of those on your own. That should give you a deeper insight on what you're actually trying to do. \$\endgroup\$bash0r– bash0r2016年11月02日 00:44:06 +00:00Commented Nov 2, 2016 at 0:44
2 Answers 2
Generally a parser can be modeled as a state machine. Think about what different expressions your grammar consists of.
I don't know the INI spec but here is a suggestion for a section header:
maybe_whitespace
- zero or more tabs and/or spacesidentifier
- one or more (is there a maximum?) alphanumeric ASCII charactersopen_section_header
- a "["close_section_header
- a "]"section_header
- this is composed of other expressions (in order):maybe_whitespace
,open_section_header
,maybe_whitespace
,identifier
,maybe_whitespace
,close_section_header
,maybe_whitespace
line
-section_header
followed by a newline, this will include more expressions when you later implement the full file parsingini_file
- zero or moreline
s
When you've written down all the valid expressions you can get to work on implementing the parser. Start from the root expression ini_file
and create a function to parse each expression. Each of these functions represent a state and either return some value or an error if the expression doesn't match.
Below is an example implementation of some functions in pseudo code.
fn parse_ini_file(*head) {
for (i = 0; *head != '0円';i++) {
parse_line(&head)
if error {
printf("error at line %d\n", i)
break
}
}
}
fn parse_line(**head) {
section = parse_section_header(head)
if error return error
// ... do something with section
if **head == '\n' *head++
else return error
}
fn parse_section_header(**head) {
parse_maybe_whitespace(head)
if error return error
// ...
identifier = parse_identifier(head)
// ...
return identifier
}
// ...
This way you keep things simple and small. Each function only does one thing and is easy to modify.
-
\$\begingroup\$ Good point about breaking things up into small functions. It also makes things much easier to test. \$\endgroup\$Edward– Edward2016年11月01日 22:45:36 +00:00Commented Nov 1, 2016 at 22:45
Here are some things that may help you improve your program.
Use a state machine
The approach you currently have uses bool
variables keeping track things such as whether a section has been found, whether a character is the last one and weaves the interpretation int a long series of nested if
statements. That approach may work for small, simple grammars such as this, but it's not going to scale well as you move on to a larger set of things your program can parse.
For that reason, I'd recommend a more data-driven approach in which you write out a state machine and have the code execute according to the data within it rather than via a code-driven approach. This makes things more compact and easier to read.
An example of such an implementation for a parser is in this answer.
Consider using regular expressions
I've given this advice before. It's a lot easier in C++ because there's library support within the standard, but since it appears you're using POSIX-based libraries, you may already have a regular expression library available to you.
Consider using compiler tools
Another approach to consider is to use standard tools such as lex
and yacc
(or flex
and bison
) to create a much more readable and maintainable version of the parser. These tools have a steep learning curve, but once you've mastered them, you'll be better able to see and understand how to write unambiguous parsers from scratch as well. IMHO, this is time well spent.
Don't comment out debug statements
Just as with assert
, I'd suggest that creating a version of dbg
that either expands to nothing (for production code) or to a routine that prints debugging messages (for development) may be very useful. It's much easier than commenting and uncommenting such debugging statements during the debugging process.
Don't leak memory
There is, potentially, a malloc
within the code but no corresponding free
. I assume that's because it's intended that's because the caller is supposed to free it if it's allocated, but the problem is that the routine doesn't set name
to NULL
if it doesn't happen to allocate memory, making it incumbent on the caller to also pre-set the pointer to NULL
. This would be easier to use if, instead, the pointer were a return value that would either be NULL
or a pointer that the caller is expected to free.
Avoid ssize_t
if possible
Because you're initializing match_pos
with -1, I see why you've chosen to make it of type ssize_t
(which is signed) versus size_t
which might not be signed. However, this limits portability and there's no reason you couldn't use another sentinel value instead, such as SIZE_MAX