My goal is to write a smaller compiler-like program, which allows me to draw geometric shapes into a 3D Diagram. There is no need for turing-completeness and the program should only be viewed as an exercise and not as a program used by anybody. However, I want the program to have a compiler-like nature.
At the moment the user provides a text file like this:
(1,45,6)
(7,8,5)
(10,77,88)
(99999,1,1)
(5,7,6)
(1,2,3)
(4,5,6)
These points will be translated into a python file, which will later display all points in a 3D-Diagram when executed. For the moment, I just want it to print out a list of points when executed.
--> [(1, 45, 6), (7, 8, 5), (10, 77, 88), (99999, 1, 1), (5, 7, 6), (1, 2, 3), (4, 5, 6)]
The python file looks like this:
list = []
list.append((1,45,6))
list.append((7,8,5))
list.append((10,77,88))
list.append((99999,1,1))
list.append((5,7,6))
list.append((1,2,3))
list.append((4,5,6))
print(list)
Therefore, I build the following code using C (just to improve C-skills, I am aware that writing it in python would be more applicable)
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
size_t seek(char* buffer, size_t start, const size_t end, char to_be_seeked);
int translateString2Number(char* c, long length);
int power(int base, int exponent);
int main(int argc, const char * argv[]) {
if(argc <= 2)return -1;
char file_name[100];
strncpy(file_name, argv[1], 100);
FILE* fp = fopen(file_name, "read");
if(!fp)return -1;
fseek(fp, 0, SEEK_END);
const size_t elements_num = ftell(fp);
rewind(fp);
remove("translation.py");
FILE * python_file_pointer = fopen("translation.py", "ab+");
fprintf(python_file_pointer, "list = []\n");
//Do parsing
char* buffer = malloc(sizeof(char) * elements_num);
fread(buffer, elements_num, 1, fp);
size_t start = 0;
while(start < elements_num){
if(buffer[start] != '(') return -1;
size_t end = seek(buffer, start, elements_num, ')');
if(end == -1) return -1;
size_t comma_pos[2];
comma_pos[0] = seek(buffer, start, end, ',');
comma_pos[1] = seek(buffer, comma_pos[0]+1, end, ',');
if(comma_pos[0] == -1 || comma_pos[1] == -1 )return -1;
char first_number_size = comma_pos[0]-start-1;
char first_number[first_number_size];
for(size_t i = 0; i < first_number_size; i++){
first_number[i] = buffer[start+1+i];
}
char second_number_size = comma_pos[1]-comma_pos[0]-1;
char second_number[second_number_size];
for(size_t i = 0; i < second_number_size; i++){
second_number[i] = buffer[comma_pos[0]+1+i];
}
char third_number_size = end - comma_pos[1]-1;
char third_number[third_number_size];
for(size_t i = 0; i < third_number_size; i++){
third_number[i] = buffer[comma_pos[1]+1+i];
}
if( (first_number_size < 0) || second_number_size < 0|| third_number_size < 0){
return -1;
}
if( (first_number_size > 11) || second_number_size > 11|| third_number_size > 11){ //Avoid potential overflow
return -1;
}
int first = translateString2Number(first_number, first_number_size);
int second = translateString2Number(second_number, second_number_size);
int third = translateString2Number(third_number, third_number_size);
fprintf(python_file_pointer, "list.append((%d,%d,%d))\n", first,second,third);
const size_t value = seek(buffer, end, elements_num, '\n');
if(value == -1)break;
start = value+1;
}
fprintf(python_file_pointer, "print(list)\n");
fclose(python_file_pointer);
system("python3 translation.py");
fclose(fp);
}
int power(int base, int exponent){
int result = 1;
for(int i = 0; i < exponent; i++){
result *= base;
}
return result;
}
int translateString2Number(char* c, long length){
int res = 0;
for(int i = 0; i < length; i++){
res += (c[i]-'0')*power(10, (int)(length-i-1));
//printf("\n%d", res);
}
return res;
}
size_t seek(char* buffer, size_t start, const size_t end, char to_be_seeked){
do{
if(buffer[start] == to_be_seeked)return start;
} while(++start < end);
return -1;
}
3 Answers 3
Allocations can fail
Don't use buffer
until we know it's not null. (And no need to multiply by sizeof (char)
, since that's automatically a no-op).
Please remember to free()
it too (at least as an option). That reduces false positives with Valgrind's memory checker.
I/O operations can fail
Always check that your I/O succeeds.
Consider this scenario: we run the program in a directory containing a (hostile) translation.py
. The directory and the file are both read-only, so the remove()
and the fopen()
both fail, as do all the fprintf()
calls using the invalid file descriptor.
Then we get to the call to system()
. What's the Python code that's executed?
Use a temporary file
Instead of assuming it's possible and desirable to overwrite translation.py
in the current directory, perhaps we should mktemp()
or similar, and remove the temporary file when we exit?
Python source files are text
It makes no difference on a POSIX system, but it's misleading to use "b"
in the fopen()
call. We don't ever read from it, so don't need the "+"
, and want to replace any existing file, not append (so we wouldn't need to remove()
), so the open mode really should be plain "w"
.
Use the standard library
translateString2Number
(and therefore also power()
) can be replaced with a simple call to sscanf
(since we know the numbers are all terminated by a non-digit).
In fact, if we can rely on the input being correctly formatted (and simply error out if it's wrong), we can just read all the input using scanf()
, rather than allocating buffer
to hold the entire input stream.
-
\$\begingroup\$ Consider this scenario: we run the program in a directory containing a (hostile) translation.py. The directory and the file are both read-only, so the remove() and the fopen() both fail, as do all the fprintf() calls using the invalid file descriptor. Then we get to the call to system(). What's the Python code that's executed? Are you a security engineer? WTF! Hella Smart! Blow my mind. Thanks for your answer,. \$\endgroup\$TVSuchty– TVSuchty2019年06月18日 09:39:38 +00:00Commented Jun 18, 2019 at 9:39
-
\$\begingroup\$ So could you further explain how you would read without a buffer and how to assure the allocation has happened correctly? \$\endgroup\$TVSuchty– TVSuchty2019年06月18日 09:51:01 +00:00Commented Jun 18, 2019 at 9:51
-
\$\begingroup\$ Well, the simple way to read without allocating our own buffer is to
fscanf()
directly from the input stream. That would greatly limit our options for error reporting, so I'd likely recommend using a fixed-size line buffer instead: read a line at a time, fail if line is too long or iffscanf(buffer, " (%d,%d,%d)", &a, &b, &c) != 3)
, maintain a line counter (for the error reporting). Oh, and I do have some experience in security analysis on various Unix platforms, but it's not my current main job. \$\endgroup\$Toby Speight– Toby Speight2019年06月18日 11:54:18 +00:00Commented Jun 18, 2019 at 11:54 -
\$\begingroup\$ You further said that you suggest making a temporary file. However, I have not figured out what that accomplishes? What do I do with the temp file afterwards? \$\endgroup\$TVSuchty– TVSuchty2019年06月19日 04:53:48 +00:00Commented Jun 19, 2019 at 4:53
-
\$\begingroup\$
mktemp()
will give you a file that's writeable and not already in use, so it reduces the likelihood of a few foreseeable failure cases. After use, it's considered polite to remove the temporary file (many installations have a "tmpreaper" daemon to clean up old files from/tmp
or equivalent, but treat that as a last resort). \$\endgroup\$Toby Speight– Toby Speight2019年06月19日 07:21:27 +00:00Commented Jun 19, 2019 at 7:21
I see a number of things that may help you improve your program. Since the existing review covered a lot of good points, this review will cover the parts not already mentioned.
Use the correct form for main
There are exactly two allowed version of main
, according to the standard, and yours isn't one of them. This code has this:
int main(int argc, const char * argv[]) {
But we need to remove the const
here. See this question for details.
Use whitespace to make code more readable
Code lines like this:
if(argc <= 2)return -1;
are generally more readable if they include a little more whitespace. I'd write that like this:
if(argc < 2) {
return -1;
}
Note that we only need argc
to be at least two -- exactly 2 arguments is just fine and not an error.
Don't make pointless copies
The first few lines of the code are these:
int main(int argc, const char * argv[]) {
if(argc <= 2)return -1;
char file_name[100];
strncpy(file_name, argv[1], 100);
FILE* fp = fopen(file_name, "read");
if(!fp)return -1;
First, 100 is an awfully arbitrary limit that might not be an entire path. Second, and most importantly, there's no need for the copy at all. This could all be reduced to this:
int main(int argc, char * argv[]) {
if(argc < 2) {
return -1;
}
FILE *in = fopen(argv[1], "r");
if (!in) {
return errno;
}
The read mode is "r" and not "read". Note that we return errno
(which is set by fopen
) on error to give a slightly higher chance that the user can figure out what went wrong.
Don't do more work than needed
There's no real reason to seek to the end of the file to find out how big it is. Instead, one could parse the file character at a time and just look for the special EOF
(end of file) token while parsing.
Don't leak memory
The buffer is allocated with this line
char* buffer = malloc(sizeof(char) * elements_num);
But there is no corresponding call to free()
so this creates a memory leak. Also sizeof(char)
is defined by the standard to be 1, so multiplying it here is pointless.
Write more concise Python
One could write this, as the current program does:
list = []
list.append((1,1,1))
list.append((2,2,2))
Or it could be written instead like this:
list = [(1,1,1), (2,2,2)]
I'd prefer the latter form, perhaps limiting the output line length to no more than 70 or so characters.
Don't convert numbers from text only to convert them back
There's no need to convert the input text to a number only to then re-convert to text on output. Instead, write each character directly as a character.
Use a state machine for parsing
A parser can often be implemented as an explicit state machine. Such parsers are often easier to reason about and to debug and augment. For that reason, I'd suggest writing this as a state machine.
Don't hardcode file names
Since there's only one output file, why not let the user specify its name instead of hardcoding it? Even better, don't use file names or handlers at all. Simply read from stdin
and write to stdout
and let the user redirect the files as needed. This gives the user complete control and allows you to simplify the code.
Eliminate "magic numbers"
There are a few numbers in the code, such as 2
and 100
that have a specific meaning in their particular context. By using named constants instead, the program becomes easier to read and maintain. For cases in which the constant is not used to size a static array, use #define
; otherwise use const
.
An example
Here's one alternative using all of these suggestions:
#include <stdio.h>
#include <ctype.h>
int main(void) {
printf("list = [");
enum { openparen, num, comma, closeparen, error } state = openparen;
// expected number of additional numbers beyond the first
const int expected = 2;
int numbers = expected;
for (char ch = getchar(); ch != EOF; ch = getchar()) {
if (isspace(ch)) {
continue;
}
switch (state) {
case openparen:
if (ch == '(') {
putchar(ch);
state = num;
} else {
state = error;
}
break;
case num:
if (isdigit(ch)) {
putchar(ch);
if (numbers == 0) {
numbers = expected;
state = closeparen;
} else {
state = comma;
}
} else {
state = error;
}
break;
case comma:
if (isdigit(ch)) {
putchar(ch);
} else if (ch == ',' && numbers) {
putchar(ch);
--numbers;
state = num;
} else {
state = error;
}
break;
case closeparen:
if (isdigit(ch)) {
putchar(ch);
} else if (ch == ')') {
putchar(ch);
putchar(',');
state = openparen;
} else {
state = error;
}
break;
default:
fprintf(stderr, "Error in input data.\n");
return 1;
break;
}
}
printf("]\n");
return 0;
}
-
6\$\begingroup\$ @TVSuchty These are 2 separate answers by 2 separate members of Code Review, there is no way to merge the answers. Be happy because 2 of the best C/C++ experts on code review answered your question. \$\endgroup\$2019年06月18日 13:33:37 +00:00Commented Jun 18, 2019 at 13:33
-
2\$\begingroup\$ @wizzwizz4: your instincts are good -- if you had defined your own function like this, it would indeed make sense to have it
const
, but there are a number of C language rules that makemain
unique and that is one. \$\endgroup\$Edward– Edward2019年06月18日 18:43:05 +00:00Commented Jun 18, 2019 at 18:43 -
1\$\begingroup\$ @Edward They are modifiable, but the program simply chooses not to. The prototype does not have
const
, but the implementation does. \$\endgroup\$wizzwizz4– wizzwizz42019年06月18日 20:13:00 +00:00Commented Jun 18, 2019 at 20:13 -
2\$\begingroup\$ I ‘ll note that the linked answer got the same question (as a comment) and the same answer:
const argv
does not seem to be allowed by the standard. \$\endgroup\$Edward– Edward2019年06月18日 20:56:14 +00:00Commented Jun 18, 2019 at 20:56 -
1\$\begingroup\$ Actually, this code does handle
\n
and spaces and tabs by skipping all of them in usingif (isspace(ch)) { continue; }
This may or may not be what you want, because it would accept input such as "( 9 9 9, 3 2 1, 1 1 1 )" as "(999,321,111)". Also any of those spaces could also be newline characters or formfeed characters, making for some really ugly input! On output, only a single newline is issued with the closing bracket. \$\endgroup\$Edward– Edward2019年06月19日 11:40:50 +00:00Commented Jun 19, 2019 at 11:40
Another point that hasn't been fleshed out by other reviewers is the generated python code.
list
is a built-in function in python - by calling your list list
you are overriding it. That's generally considered bad form in the python community, mostly because someone could be stung if they try to use the list
constructor later in the code.
So instead, let's name the list after what it contains. I'm thinking points
.
Creating a list and then manually appending every item could be rather slow for a large list - so lets create the list in one go. You could do this as a one liner - that way it's all in the first line and (assuming you don't word wrap) you can skip past it to the flesh of the program. But if we're going for neatness - I'd arrange it like this;
points = [
(1,45,6),
(7,8,5),
(10,77,88),
(99999,1,1),
(5,7,6),
(1,2,3),
(4,5,6),
]
This is pretty easy to generate - as all you need to do is:
- Write the header (
points = [
) - The leading indentation, the value, then a trailing comma (
{line},
) - Then the footer (
]
). - Then you can write the rest of the program as you were planning to anyway (In this case,
print(points)
).
Note that trailing commas on the last item is accepted in python (some even encourage it, like myself) so you don't need to worry about detecting where you are in the file.
Lastly, if you want to keep your main python code separate to your list - consider using imports. Assuming you call your generated file points.py
, your main code could start with the following:
from points import points
print(points)
This has the advantage of not having to write your python code in a large C string.
system("python3 translation.py");
to generate the output. \$\endgroup\$size_t seek(const char *buffer, size_t start, size_t end, char to_be_seeked);
\$\endgroup\$