Small part of an INI parser

Question 1

I know there are probably thousands of INI parser implementations in C.

Nevertheless I want to implement one to learn the art of parsing grammar.

I just wrote a small piece of the INI parser and wanted to ask if I am on the right track:

void parse_line(const char *str, char **name) {
 size_t length = strlen(str);
 bool found_section = false;
 for (size_t pos = 0; pos < length; ++pos) {
 char current = str[pos];
 bool is_last_ch = (pos + 1) >= length;
 //dbg("Processing %c", current);
 if (current == '[') {
 size_t updated_pos = pos + 1;
 ssize_t match_pos = -1;
 if (is_last_ch) {
 dbg("Reached end of line without closing ] (1)!");
 return;
 }
 if (found_section) {
 dbg("Already found a section!");
 return;
 }
 //dbg("Found a [ at position %zu, looking for ] now", pos);
 for (size_t x = updated_pos; x < length; ++x) {
 if (str[x] == ';') {
 break;
 } else if (str[x] == ']') {
 match_pos = x;
 break;
 }
 }
 if (match_pos == -1) {
 dbg("Reached end of line without closing ] (2)!");
 return;
 }
 //dbg("Found matching ] at position %zd", match_pos);
 assert(match_pos >= updated_pos);
 size_t length_of_section = match_pos - updated_pos;
 //dbg("Length of name is %zu", length_of_section);
 char *section_name = malloc(length_of_section + 1);
 assert(section_name);
 for (size_t x = 0; x < length_of_section; ++x) {
 size_t i = x + updated_pos;
 section_name[x] = str[i];
 }
 section_name[length_of_section] = 0;
 // utilize `section_name` in some way...
 //dbg("Section name is: %s", section_name);
 *name = section_name;
 // and then free() it somewhere
 pos += length_of_section + 1;
 assert(pos < length);
 found_section = true;
 } else if (current == ']') {
 dbg("Found unmatched ] at position %zu", pos);
 return;
 } else if (current == ';') {
 //dbg("End of line, bye!");
 return;
 } else {
 dbg("Unexpected character %c", current);
 return;
 }
 }
}

At the moment it only parses name of sections (like [name]).

I want to focus on detecting malformed grammar:

[section] comment -> invalid comment [section] -> invalid

etc.

Example:

parse_line("[name]", ...); // gives "name"
parse_line("[name];Comment", ...); // gives "name"
parse_line(";[name];Comment", ...); // gives (null)
parse_line("[name[]", ...); // works... but should not (i already have a fix for this in mind -> check for "[" in section name fetching ; but is this good?)

Is the approach to ''wait'' for a character and then process from there on a robust way to parse things like INI files?

Question 2

I guess you want to write it by hand to learn what compilers are really doing. So I'd recommend to try writing a recursive descent parser. State machines are something for code generators. After that you could try to have a look at the boost library which features a parser combinator and implement one of those on your own. That should give you a deeper insight on what you're actually trying to do.

Question 3

Generally a parser can be modeled as a state machine. Think about what different expressions your grammar consists of.

I don't know the INI spec but here is a suggestion for a section header:

maybe_whitespace - zero or more tabs and/or spaces
identifier - one or more (is there a maximum?) alphanumeric ASCII characters
open_section_header - a "["
close_section_header - a "]"
section_header - this is composed of other expressions (in order): maybe_whitespace, open_section_header, maybe_whitespace, identifier, maybe_whitespace, close_section_header, maybe_whitespace
line - section_header followed by a newline, this will include more expressions when you later implement the full file parsing
ini_file - zero or more lines

When you've written down all the valid expressions you can get to work on implementing the parser. Start from the root expression ini_file and create a function to parse each expression. Each of these functions represent a state and either return some value or an error if the expression doesn't match.

Below is an example implementation of some functions in pseudo code.

fn parse_ini_file(*head) {
 for (i = 0; *head != '0円';i++) {
 parse_line(&head)
 if error {
 printf("error at line %d\n", i)
 break
 }
 }
}
fn parse_line(**head) {
 section = parse_section_header(head)
 if error return error
 // ... do something with section
 if **head == '\n' *head++
 else return error
}
fn parse_section_header(**head) {
 parse_maybe_whitespace(head)
 if error return error
 // ...
 identifier = parse_identifier(head)
 // ...
 return identifier
}
// ...

This way you keep things simple and small. Each function only does one thing and is easy to modify.

Question 4

Good point about breaking things up into small functions. It also makes things much easier to test.

Question 5

Here are some things that may help you improve your program.

Use a state machine

The approach you currently have uses bool variables keeping track things such as whether a section has been found, whether a character is the last one and weaves the interpretation int a long series of nested if statements. That approach may work for small, simple grammars such as this, but it's not going to scale well as you move on to a larger set of things your program can parse.

For that reason, I'd recommend a more data-driven approach in which you write out a state machine and have the code execute according to the data within it rather than via a code-driven approach. This makes things more compact and easier to read.

An example of such an implementation for a parser is in this answer.

Consider using regular expressions

I've given this advice before. It's a lot easier in C++ because there's library support within the standard, but since it appears you're using POSIX-based libraries, you may already have a regular expression library available to you.

Consider using compiler tools

Another approach to consider is to use standard tools such as lex and yacc (or flex and bison) to create a much more readable and maintainable version of the parser. These tools have a steep learning curve, but once you've mastered them, you'll be better able to see and understand how to write unambiguous parsers from scratch as well. IMHO, this is time well spent.

Don't comment out debug statements

Just as with assert, I'd suggest that creating a version of dbg that either expands to nothing (for production code) or to a routine that prints debugging messages (for development) may be very useful. It's much easier than commenting and uncommenting such debugging statements during the debugging process.

Don't leak memory

There is, potentially, a malloc within the code but no corresponding free. I assume that's because it's intended that's because the caller is supposed to free it if it's allocated, but the problem is that the routine doesn't set name to NULL if it doesn't happen to allocate memory, making it incumbent on the caller to also pre-set the pointer to NULL. This would be easier to use if, instead, the pointer were a return value that would either be NULL or a pointer that the caller is expected to free.

Avoid `ssize_t` if possible

Because you're initializing match_pos with -1, I see why you've chosen to make it of type ssize_t (which is signed) versus size_t which might not be signed. However, this limits portability and there's no reason you couldn't use another sentinel value instead, such as SIZE_MAX

jacwah jacwah 2,69118 silver badges42 bronze badges · Accepted Answer · 2016-11-01 22:40:10Z

Generally a parser can be modeled as a state machine. Think about what different expressions your grammar consists of.

I don't know the INI spec but here is a suggestion for a section header:

maybe_whitespace - zero or more tabs and/or spaces
identifier - one or more (is there a maximum?) alphanumeric ASCII characters
open_section_header - a "["
close_section_header - a "]"
section_header - this is composed of other expressions (in order): maybe_whitespace, open_section_header, maybe_whitespace, identifier, maybe_whitespace, close_section_header, maybe_whitespace
line - section_header followed by a newline, this will include more expressions when you later implement the full file parsing
ini_file - zero or more lines

When you've written down all the valid expressions you can get to work on implementing the parser. Start from the root expression ini_file and create a function to parse each expression. Each of these functions represent a state and either return some value or an error if the expression doesn't match.

Below is an example implementation of some functions in pseudo code.

fn parse_ini_file(*head) {
 for (i = 0; *head != '0円';i++) {
 parse_line(&head)
 if error {
 printf("error at line %d\n", i)
 break
 }
 }
}
fn parse_line(**head) {
 section = parse_section_header(head)
 if error return error
 // ... do something with section
 if **head == '\n' *head++
 else return error
}
fn parse_section_header(**head) {
 parse_maybe_whitespace(head)
 if error return error
 // ...
 identifier = parse_identifier(head)
 // ...
 return identifier
}
// ...

This way you keep things simple and small. Each function only does one thing and is easy to modify.

Good point about breaking things up into small functions. It also makes things much easier to test.

Stack Exchange Network

Small part of an INI parser

2 Answers 2

Use a state machine

Consider using regular expressions

Consider using compiler tools

Don't comment out debug statements

Don't leak memory

Avoid `ssize_t` if possible

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Small part of an INI parser

2 Answers 2

Use a state machine

Consider using regular expressions

Consider using compiler tools

Don't comment out debug statements

Don't leak memory

Avoid ssize_t if possible

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions

Avoid `ssize_t` if possible