File parsing and data management - follow-up

Question 1

This is part of a two-part post (Part 1).

Here, I have two recent projects that parse a file. The first uses a loop that's kind of hard to follow, and the second uses "modes" to decide what to do.

Project 2: A working map renderer that I plan to implement the game of Picross with.

This one uses "modes" to control the flow of the loop. I feel it is much more elegant, but wonder if this is the best way to do it.

The project is nowhere near finished, but I think this is in-line with the posting guidelines, since the part I care about is fully functional.

The data structures:

struct picross_data {
 char *name,
 *map;
 int width,
 height;
};

The parser:

enum pfmode {
 mode_find_name,
 mode_get_name,
 mode_find_width,
 mode_get_width,
 mode_find_height,
 mode_get_height,
 mode_get_map,
 mode_finish,
};
void
parse_file(char *fname, struct picross_data *pd)
{
 FILE *f = fopen(fname, "r");
 char fbuf[10000] = "",
 *fbp = fbuf,
 *nbuf = calloc(sizeof(char), 100),
 *nbp = nbuf,
 numbuf[10] = "",
 *nmbp = numbuf,
 *mapbuf,
 *mbp,
 c;
 enum pfmode m = mode_find_name;
 /* This function uses four buffers for parsing the file. 
 * fbuf[10000] for the file contents,
 * nbuf[100] for the name,
 * numbuf[10] for numbers,
 * mapbuf[w*h] for the map
 * The modes are self-explanatory. They make it easy to focus on one section at a time without an over-complicated hacked-together while loop
 */
 while((c = fgetc(f)) != EOF) *(fbp++) = c;
 fclose(f);
 fbp = fbuf;
 while(m != mode_finish){
 switch(m){
 case mode_find_name:
 if(*fbp == '"'){
 m = mode_get_name;
 }
 fbp++;
 break;
 case mode_get_name:
 if(*fbp == '"'){
 m = mode_find_width;
 pd->name = nbuf;
 } else {
 *nbp = *fbp;
 nbp++;
 }
 fbp++;
 break;
 case mode_find_width:
 if(*fbp == '['){
 m = mode_get_width;
 }
 fbp++;
 break;
 case mode_get_width:
 if(*fbp == ','){
 pd->width = atoi(numbuf);
 nmbp = numbuf;
 m = mode_find_height;
 } else {
 *nmbp = *fbp;
 nmbp++;
 }
 fbp++;
 break;
 case mode_find_height:
 while(*fbp == ' ') fbp++;
 m = mode_get_height;
 break;
 case mode_get_height:
 if(*fbp == ']'){
 pd->height = atoi(numbuf);
 mapbuf = calloc(sizeof(char), pd->width * pd->height);
 mbp = mapbuf;
 m = mode_get_map;
 } else {
 *nmbp = *fbp;
 nmbp++;
 }
 fbp++;
 break;
 case mode_get_map:
 if(mbp-mapbuf >= pd->width * pd->height){
 pd->map = mapbuf;
 m = mode_finish;
 }
 if(*fbp == '.'){
 *mbp = 0;
 mbp++;
 }
 if(*fbp == '*'){
 *mbp = 1;
 mbp++;
 }
 fbp++;
 break;
 }
 }
}

I feel that while this is definitely more verbose, it is far easier to follow.

Example File:

$ cat puppy.pc
"Puppy" [10, 10]
* * . . . . . . * *
* . * * . . * * . *
. . * * . . * * . .
. . . . . . . . . .
. . . . * * . . . .
* . . * * * * . . *
* . . . * * . . . *
. * . . * * . . * .
. . * * . . * * . .
. . . . * * . . . .

Result

$ ./a.out puppy.pc
Puppy (10 x 10)
* * * * 
* * * * * * 
 * * * * 
 * * 
* * * * * * 
* * * * 
 * * * * 
 * * * * 
 * *

Another example:

$ cat test
"Smiley"[5,5]......*.*......*...******

Result:

$ ./a.out test
Smiley (5 x 5)
 * * 
* * 
* * * * *

Question 2

Thanks for the edit, @Jamal, looks much better without the erroneous syntax highlighting.

Question 3

Strictly speaking, sscanf will do most of the job in one line of code. I assume you want to avoid it for learning purposes.
As in the part 1, the malformed input may result in the infinite loop.
It doesn't look like the state machine is warranted here: the flow is strictly linear. A set of functions (get_name(), get_int(), get_map()) returning (along with the actual result) the breaking point is much more readable:
```
fbp = get_name(fbp, nbuf);
while (*fbp++ != '[')
 fbp++;
fbp = get_int(fbp, &pd->width);
while (*fbp++ != ',')
 fbp++;
```

etc.

Question 4

Good idea on the functions! And thank you for addressing the mode system. In practice, though, I don't think a malformed file will result in an infinite loop, as eventually, fbp will have to reach the end of the buffer, and the eventual attempt to access a char that's out-of-bounds will (should) result in a Segmentation fault. Tested it with a file missing a ], and confirmed it crashes. Though the error is *** stack smashing detected *** ... Aborted (core dumped). I suppose I could protect against that by adding a function to validate that the file is correctly formatted.

Question 5

Abstain from implementing the Loop-Switch-Antipattern.

No, seriously.
If you want to add comments, you can do so.
Test all functions which can fail for error, and handle that, for example by notifying the caller: malloc / calloc, fopen, fgets.
Avoid large stack-buffers. Also, use dynamic memory-allocation to avoid arbitrary limitations.
Actually check the above-mentioned limitations.
Don't zero memory just for the heck of it. There are more amusing ways to waste time.
Dito for reading a file: Use fread to read whole blocks instead of fgetc to read a single byte, thus incurring unavoidable overhead multiple times.

As an aside, a char is the wrong data-type for saving the return-value of fgetc, use an int so EOF can be differentiated from (char)-1.
I would read the file in binary mode, meaning you can use ftell and fseek to get the size of the file, and restart from the beginning.

Question 6

See? I knew there was something wrong with that. Something about it felt dirty, but I couldn't quite place it. Thank you for the source. On #5, is there a specific problem with calloc? I've learned to use it over malloc because it ensures that the data will be zero'd out, avoiding unexpected garbage. While malloc does tend to zero out memory (rather, the OS kernel zeros out the memory for security reasons), the C standard doesn't explicitly state that malloc has to zero out memory.

Question 7

As for the aside...oops! I could have sworn I declared that as an int. I know exactly what you mean, because I've tried to use char for reading a binary file (e.g. implementing xxd for learning purposes), and for certain character values, the results are...rather unexpected.

Question 8

Is there anything better about expected garbage? No (though debug-runtimes like to fill with a pattern likely making errors obvious, thus not zero). The rule is, don't initialize to some dummy value, that just defeats any chance for automatic tools to catch logic errors, assuming it can be optimized out. If it cannot, it also costs performance.

Question 9

Naming

Please, use adequate names. (e.g: who reads your code doesn't know what fbuf means).

Parser

Any reason why you can't use fgets()? If you need every char in order to inspect it or modify or whatever else then use fgets().
You're risking buffer overflow if the user inputs 10000 characters or more. You need to dynamically allocate memory, using malloc() and realloc().
Use sscanf. Checking the return result of sscanf would be a good idea as well. The *scanf() functions return the number of items read. Don't compare the return value with EOF.
You can use fseek and ftell to calculate the file size for you without having to read every character.

Fair example:

long fsize(FILE *fp) {
 fseek(fp, 0, SEEK_END);
 long bytes = ftell(fp);
 rewind(fp);
 return bytes;
}

These functions are part of standard C (stdio.h), so this is portable.

Take care of pre/post fixes for variables.
In the last case, you don't need to use break;.

Question 10

As for the break;, it's a habit of mine to be explicit and leave room for additions. The last time I aggressively byte-counted like that, I spent the next three hours debugging what ended up being a missing comma. So thanks, but I'm not too concerned about the size of my source code.

Question 11

As for fgets, I considered it, but when I realized I'd have to calculate the size of the file to fgets it into a buffer, it kind of seemed redundant, as fgetc will do the job without needing to read the file twice.

Question 12

As for the limited file size, I totally forgot about that. I'll definitely calloc/free it with a larger size. Thanks for pointing that out!

Question 13

@B1KMusic what I mean is that it's good to use break for the cases, but not in the last one: there won't be any necessity. The fgets function reads at most one less than the number of characters specified by n from the stream pointed to by stream into the array pointed to by s. No additional characters are read after a new-line character (which is retained) or after EOF.

Question 14

Yes, I am aware that break isn't needed by the compiler, and, as you pointed out at the end, one can put the break anyway (and the compiler will optimize it out) if they prefer. And, I prefer to remain as explicit as possible to avoid bug-prone code, which is why I leave the break in there. That's just defensive coding practice that I've learned over the years.

vnp vnp 58.6k4 gold badges55 silver badges144 bronze badges · Answer 1 · 2015-12-01 19:54:51Z

Strictly speaking, sscanf will do most of the job in one line of code. I assume you want to avoid it for learning purposes.
As in the part 1, the malformed input may result in the infinite loop.
It doesn't look like the state machine is warranted here: the flow is strictly linear. A set of functions (get_name(), get_int(), get_map()) returning (along with the actual result) the breaking point is much more readable:
```
fbp = get_name(fbp, nbuf);
while (*fbp++ != '[')
 fbp++;
fbp = get_int(fbp, &pd->width);
while (*fbp++ != ',')
 fbp++;
```

etc.

Good idea on the functions! And thank you for addressing the mode system. In practice, though, I don't think a malformed file will result in an infinite loop, as eventually, fbp will have to reach the end of the buffer, and the eventual attempt to access a char that's out-of-bounds will (should) result in a Segmentation fault. Tested it with a file missing a ], and confirmed it crashes. Though the error is *** stack smashing detected *** ... Aborted (core dumped). I suppose I could protect against that by adding a function to validate that the file is correctly formatted.

Deduplicator Deduplicator 19.7k1 gold badge32 silver badges65 bronze badges · Answer 2 · 2015-12-01 20:26:51Z

Abstain from implementing the Loop-Switch-Antipattern.

No, seriously.
If you want to add comments, you can do so.
Test all functions which can fail for error, and handle that, for example by notifying the caller: malloc / calloc, fopen, fgets.
Avoid large stack-buffers. Also, use dynamic memory-allocation to avoid arbitrary limitations.
Actually check the above-mentioned limitations.
Don't zero memory just for the heck of it. There are more amusing ways to waste time.
Dito for reading a file: Use fread to read whole blocks instead of fgetc to read a single byte, thus incurring unavoidable overhead multiple times.

As an aside, a char is the wrong data-type for saving the return-value of fgetc, use an int so EOF can be differentiated from (char)-1.
I would read the file in binary mode, meaning you can use ftell and fseek to get the size of the file, and restart from the beginning.

See? I knew there was something wrong with that. Something about it felt dirty, but I couldn't quite place it. Thank you for the source. On #5, is there a specific problem with calloc? I've learned to use it over malloc because it ensures that the data will be zero'd out, avoiding unexpected garbage. While malloc does tend to zero out memory (rather, the OS kernel zeros out the memory for security reasons), the C standard doesn't explicitly state that malloc has to zero out memory.
As for the aside...oops! I could have sworn I declared that as an int. I know exactly what you mean, because I've tried to use char for reading a binary file (e.g. implementing xxd for learning purposes), and for certain character values, the results are...rather unexpected.
Is there anything better about expected garbage? No (though debug-runtimes like to fill with a pattern likely making errors obvious, thus not zero). The rule is, don't initialize to some dummy value, that just defeats any chance for automatic tools to catch logic errors, assuming it can be optimized out. If it cannot, it also costs performance.

user91060user91060 · Answer 3 · 2015-12-01 19:35:15Z

2

\$\begingroup\$

Naming

Please, use adequate names. (e.g: who reads your code doesn't know what fbuf means).

Parser

Any reason why you can't use fgets()? If you need every char in order to inspect it or modify or whatever else then use fgets().
You're risking buffer overflow if the user inputs 10000 characters or more. You need to dynamically allocate memory, using malloc() and realloc().
Use sscanf. Checking the return result of sscanf would be a good idea as well. The *scanf() functions return the number of items read. Don't compare the return value with EOF.
You can use fseek and ftell to calculate the file size for you without having to read every character.

Fair example:

long fsize(FILE *fp) {
 fseek(fp, 0, SEEK_END);
 long bytes = ftell(fp);
 rewind(fp);
 return bytes;
}

These functions are part of standard C (stdio.h), so this is portable.

Take care of pre/post fixes for variables.
In the last case, you don't need to use break;.

Share

edited Jun 10, 2020 at 13:24

Community's user avatar

Community Bot

1

answered Dec 1, 2015 at 19:35

user91060user91060

\$\endgroup\$

14

1

\$\begingroup\$ As for the break;, it's a habit of mine to be explicit and leave room for additions. The last time I aggressively byte-counted like that, I spent the next three hours debugging what ended up being a missing comma. So thanks, but I'm not too concerned about the size of my source code. \$\endgroup\$

Braden Best
– Braden Best

2015年12月01日 19:47:14 +00:00
Commented Dec 1, 2015 at 19:47
\$\begingroup\$ As for fgets, I considered it, but when I realized I'd have to calculate the size of the file to fgets it into a buffer, it kind of seemed redundant, as fgetc will do the job without needing to read the file twice. \$\endgroup\$

Braden Best
– Braden Best

2015年12月01日 19:50:55 +00:00
Commented Dec 1, 2015 at 19:50
\$\begingroup\$ As for the limited file size, I totally forgot about that. I'll definitely calloc/free it with a larger size. Thanks for pointing that out! \$\endgroup\$

Braden Best
– Braden Best

2015年12月01日 19:52:43 +00:00
Commented Dec 1, 2015 at 19:52
\$\begingroup\$ @B1KMusic what I mean is that it's good to use break for the cases, but not in the last one: there won't be any necessity. The fgets function reads at most one less than the number of characters specified by n from the stream pointed to by stream into the array pointed to by s. No additional characters are read after a new-line character (which is retained) or after EOF. \$\endgroup\$

user91060
– user91060

2015年12月01日 20:02:51 +00:00
Commented Dec 1, 2015 at 20:02
1

\$\begingroup\$ Yes, I am aware that break isn't needed by the compiler, and, as you pointed out at the end, one can put the break anyway (and the compiler will optimize it out) if they prefer. And, I prefer to remain as explicit as possible to avoid bug-prone code, which is why I leave the break in there. That's just defensive coding practice that I've learned over the years. \$\endgroup\$

Braden Best
– Braden Best

2015年12月01日 20:30:13 +00:00
Commented Dec 1, 2015 at 20:30

| Show 9 more comments

Stack Exchange Network

File parsing and data management - follow-up

Project 2: A working map renderer that I plan to implement the game of Picross with.

The data structures:

The parser:

Example File:

Result

Another example:

Result:

3 Answers 3

Naming

Parser

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

File parsing and data management - follow-up

Project 2: A working map renderer that I plan to implement the game of Picross with.

The data structures:

The parser:

Example File:

Result

Another example:

Result:

3 Answers 3

Naming

Parser

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions