Heavily limited premature compiler translates text into excecutable python code

Question 1

My goal is to write a smaller compiler-like program, which allows me to draw geometric shapes into a 3D Diagram. There is no need for turing-completeness and the program should only be viewed as an exercise and not as a program used by anybody. However, I want the program to have a compiler-like nature.

At the moment the user provides a text file like this:

(1,45,6)
(7,8,5)
(10,77,88)
(99999,1,1)
(5,7,6)
(1,2,3)
(4,5,6)

These points will be translated into a python file, which will later display all points in a 3D-Diagram when executed. For the moment, I just want it to print out a list of points when executed.

--> [(1, 45, 6), (7, 8, 5), (10, 77, 88), (99999, 1, 1), (5, 7, 6), (1, 2, 3), (4, 5, 6)]

The python file looks like this:

list = []
list.append((1,45,6))
list.append((7,8,5))
list.append((10,77,88))
list.append((99999,1,1))
list.append((5,7,6))
list.append((1,2,3))
list.append((4,5,6))
print(list)

Therefore, I build the following code using C (just to improve C-skills, I am aware that writing it in python would be more applicable)

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
size_t seek(char* buffer, size_t start, const size_t end, char to_be_seeked);
int translateString2Number(char* c, long length);
int power(int base, int exponent);
int main(int argc, const char * argv[]) {
 if(argc <= 2)return -1;
 char file_name[100];
 strncpy(file_name, argv[1], 100);
 FILE* fp = fopen(file_name, "read");
 if(!fp)return -1;
 fseek(fp, 0, SEEK_END);
 const size_t elements_num = ftell(fp);
 rewind(fp);
 remove("translation.py");
 FILE * python_file_pointer = fopen("translation.py", "ab+");
 fprintf(python_file_pointer, "list = []\n");
 //Do parsing
 char* buffer = malloc(sizeof(char) * elements_num);
 fread(buffer, elements_num, 1, fp);
 size_t start = 0;
 while(start < elements_num){
 if(buffer[start] != '(') return -1;
 size_t end = seek(buffer, start, elements_num, ')');
 if(end == -1) return -1;
 size_t comma_pos[2];
 comma_pos[0] = seek(buffer, start, end, ',');
 comma_pos[1] = seek(buffer, comma_pos[0]+1, end, ',');
 if(comma_pos[0] == -1 || comma_pos[1] == -1 )return -1;
 char first_number_size = comma_pos[0]-start-1;
 char first_number[first_number_size];
 for(size_t i = 0; i < first_number_size; i++){
 first_number[i] = buffer[start+1+i];
 }
 char second_number_size = comma_pos[1]-comma_pos[0]-1;
 char second_number[second_number_size];
 for(size_t i = 0; i < second_number_size; i++){
 second_number[i] = buffer[comma_pos[0]+1+i];
 }
 char third_number_size = end - comma_pos[1]-1;
 char third_number[third_number_size];
 for(size_t i = 0; i < third_number_size; i++){
 third_number[i] = buffer[comma_pos[1]+1+i];
 }
 if( (first_number_size < 0) || second_number_size < 0|| third_number_size < 0){
 return -1;
 }
 if( (first_number_size > 11) || second_number_size > 11|| third_number_size > 11){ //Avoid potential overflow
 return -1;
 }
 int first = translateString2Number(first_number, first_number_size);
 int second = translateString2Number(second_number, second_number_size);
 int third = translateString2Number(third_number, third_number_size);
 fprintf(python_file_pointer, "list.append((%d,%d,%d))\n", first,second,third);
 const size_t value = seek(buffer, end, elements_num, '\n');
 if(value == -1)break;
 start = value+1;
 }
 fprintf(python_file_pointer, "print(list)\n");
 fclose(python_file_pointer);
 system("python3 translation.py");
 fclose(fp);
}
int power(int base, int exponent){
 int result = 1;
 for(int i = 0; i < exponent; i++){
 result *= base;
 }
 return result;
}
int translateString2Number(char* c, long length){
 int res = 0;
 for(int i = 0; i < length; i++){
 res += (c[i]-'0')*power(10, (int)(length-i-1));
 //printf("\n%d", res);
 }
 return res;
}
size_t seek(char* buffer, size_t start, const size_t end, char to_be_seeked){
 do{
 if(buffer[start] == to_be_seeked)return start;
 } while(++start < end);
 return -1;
}

Question 2

Sure, you are right. But actually, this is happening already only in a redundant form. How would you call it?

Question 3

@dfhwze: To be fair, the code actually seems to emit executable Python code, since the OP actually does system("python3 translation.py"); to generate the output.

Question 4

@AlexV You guys are right :)

Question 5

Very good example for code review, thanks for sharing and thanks to all these good anseers

Question 6

@TVSuchty That's true for pointers, because if not, the caller doesn't know if the function changed the value pointed by the pointer. But variables can't be changed by a function, so the caller knows it will be constant. And for the function itself, the compiler is smart enough to read the whole function and know if it is being used in the function or not. In the specific case of this function it would make sense to make the pointer const (actually, not the pointer, but the type pointed by the pointer): size_t seek(const char *buffer, size_t start, size_t end, char to_be_seeked);

Question 7

Allocations can fail

Don't use buffer until we know it's not null. (And no need to multiply by sizeof (char), since that's automatically a no-op).

Please remember to free() it too (at least as an option). That reduces false positives with Valgrind's memory checker.

I/O operations can fail

Always check that your I/O succeeds.

Consider this scenario: we run the program in a directory containing a (hostile) translation.py. The directory and the file are both read-only, so the remove() and the fopen() both fail, as do all the fprintf() calls using the invalid file descriptor.

Then we get to the call to system(). What's the Python code that's executed?

Use a temporary file

Instead of assuming it's possible and desirable to overwrite translation.py in the current directory, perhaps we should mktemp() or similar, and remove the temporary file when we exit?

Python source files are text

It makes no difference on a POSIX system, but it's misleading to use "b" in the fopen() call. We don't ever read from it, so don't need the "+", and want to replace any existing file, not append (so we wouldn't need to remove()), so the open mode really should be plain "w".

Use the standard library

translateString2Number (and therefore also power()) can be replaced with a simple call to sscanf (since we know the numbers are all terminated by a non-digit).

In fact, if we can rely on the input being correctly formatted (and simply error out if it's wrong), we can just read all the input using scanf(), rather than allocating buffer to hold the entire input stream.

Question 8

Consider this scenario: we run the program in a directory containing a (hostile) translation.py. The directory and the file are both read-only, so the remove() and the fopen() both fail, as do all the fprintf() calls using the invalid file descriptor. Then we get to the call to system(). What's the Python code that's executed? Are you a security engineer? WTF! Hella Smart! Blow my mind. Thanks for your answer,.

Question 9

So could you further explain how you would read without a buffer and how to assure the allocation has happened correctly?

Question 10

Well, the simple way to read without allocating our own buffer is to fscanf() directly from the input stream. That would greatly limit our options for error reporting, so I'd likely recommend using a fixed-size line buffer instead: read a line at a time, fail if line is too long or if fscanf(buffer, " (%d,%d,%d)", &a, &b, &c) != 3), maintain a line counter (for the error reporting). Oh, and I do have some experience in security analysis on various Unix platforms, but it's not my current main job.

Question 11

You further said that you suggest making a temporary file. However, I have not figured out what that accomplishes? What do I do with the temp file afterwards?

Question 12

mktemp() will give you a file that's writeable and not already in use, so it reduces the likelihood of a few foreseeable failure cases. After use, it's considered polite to remove the temporary file (many installations have a "tmpreaper" daemon to clean up old files from /tmp or equivalent, but treat that as a last resort).

Question 13

I see a number of things that may help you improve your program. Since the existing review covered a lot of good points, this review will cover the parts not already mentioned.

Use the correct form for `main`

There are exactly two allowed version of main, according to the standard, and yours isn't one of them. This code has this:

int main(int argc, const char * argv[]) {

But we need to remove the const here. See this question for details.

Use whitespace to make code more readable

Code lines like this:

if(argc <= 2)return -1;

are generally more readable if they include a little more whitespace. I'd write that like this:

if(argc < 2) {
 return -1;
}

Note that we only need argc to be at least two -- exactly 2 arguments is just fine and not an error.

Don't make pointless copies

The first few lines of the code are these:

int main(int argc, const char * argv[]) {
 if(argc <= 2)return -1;
 char file_name[100];
 strncpy(file_name, argv[1], 100);
 FILE* fp = fopen(file_name, "read");
 if(!fp)return -1;

First, 100 is an awfully arbitrary limit that might not be an entire path. Second, and most importantly, there's no need for the copy at all. This could all be reduced to this:

int main(int argc, char * argv[]) {
 if(argc < 2) {
 return -1;
 }
 FILE *in = fopen(argv[1], "r");
 if (!in) {
 return errno;
 }

The read mode is "r" and not "read". Note that we return errno (which is set by fopen) on error to give a slightly higher chance that the user can figure out what went wrong.

Don't do more work than needed

There's no real reason to seek to the end of the file to find out how big it is. Instead, one could parse the file character at a time and just look for the special EOF (end of file) token while parsing.

Don't leak memory

The buffer is allocated with this line

char* buffer = malloc(sizeof(char) * elements_num);

But there is no corresponding call to free() so this creates a memory leak. Also sizeof(char) is defined by the standard to be 1, so multiplying it here is pointless.

Write more concise Python

One could write this, as the current program does:

list = []
list.append((1,1,1))
list.append((2,2,2))

Or it could be written instead like this:

list = [(1,1,1), (2,2,2)]

I'd prefer the latter form, perhaps limiting the output line length to no more than 70 or so characters.

Don't convert numbers from text only to convert them back

There's no need to convert the input text to a number only to then re-convert to text on output. Instead, write each character directly as a character.

Use a state machine for parsing

A parser can often be implemented as an explicit state machine. Such parsers are often easier to reason about and to debug and augment. For that reason, I'd suggest writing this as a state machine.

Don't hardcode file names

Since there's only one output file, why not let the user specify its name instead of hardcoding it? Even better, don't use file names or handlers at all. Simply read from stdin and write to stdout and let the user redirect the files as needed. This gives the user complete control and allows you to simplify the code.

Eliminate "magic numbers"

There are a few numbers in the code, such as 2 and 100 that have a specific meaning in their particular context. By using named constants instead, the program becomes easier to read and maintain. For cases in which the constant is not used to size a static array, use #define; otherwise use const.

An example

Here's one alternative using all of these suggestions:

#include <stdio.h>
#include <ctype.h>
int main(void) {
 printf("list = [");
 enum { openparen, num, comma, closeparen, error } state = openparen;
 // expected number of additional numbers beyond the first
 const int expected = 2;
 int numbers = expected;
 for (char ch = getchar(); ch != EOF; ch = getchar()) {
 if (isspace(ch)) {
 continue;
 }
 switch (state) {
 case openparen:
 if (ch == '(') {
 putchar(ch);
 state = num;
 } else {
 state = error;
 }
 break;
 case num:
 if (isdigit(ch)) {
 putchar(ch);
 if (numbers == 0) {
 numbers = expected;
 state = closeparen;
 } else {
 state = comma;
 }
 } else {
 state = error;
 }
 break;
 case comma:
 if (isdigit(ch)) {
 putchar(ch);
 } else if (ch == ',' && numbers) {
 putchar(ch);
 --numbers;
 state = num;
 } else {
 state = error;
 }
 break;
 case closeparen:
 if (isdigit(ch)) {
 putchar(ch);
 } else if (ch == ')') {
 putchar(ch);
 putchar(',');
 state = openparen;
 } else {
 state = error;
 }
 break;
 default:
 fprintf(stderr, "Error in input data.\n");
 return 1;
 break;
 }
 }
 printf("]\n");
 return 0;
}

Question 14

@TVSuchty These are 2 separate answers by 2 separate members of Code Review, there is no way to merge the answers. Be happy because 2 of the best C/C++ experts on code review answered your question.

Question 15

@wizzwizz4: your instincts are good -- if you had defined your own function like this, it would indeed make sense to have it const, but there are a number of C language rules that make main unique and that is one.

Question 16

@Edward They are modifiable, but the program simply chooses not to. The prototype does not have const, but the implementation does.

Question 17

I ‘ll note that the linked answer got the same question (as a comment) and the same answer: const argv does not seem to be allowed by the standard.

Question 18

Actually, this code does handle \n and spaces and tabs by skipping all of them in using if (isspace(ch)) { continue; } This may or may not be what you want, because it would accept input such as "( 9 9 9, 3 2 1, 1 1 1 )" as "(999,321,111)". Also any of those spaces could also be newline characters or formfeed characters, making for some really ugly input! On output, only a single newline is issued with the closing bracket.

Question 19

Another point that hasn't been fleshed out by other reviewers is the generated python code.

list is a built-in function in python - by calling your list list you are overriding it. That's generally considered bad form in the python community, mostly because someone could be stung if they try to use the list constructor later in the code.

So instead, let's name the list after what it contains. I'm thinking points.

Creating a list and then manually appending every item could be rather slow for a large list - so lets create the list in one go. You could do this as a one liner - that way it's all in the first line and (assuming you don't word wrap) you can skip past it to the flesh of the program. But if we're going for neatness - I'd arrange it like this;

points = [
 (1,45,6),
 (7,8,5),
 (10,77,88),
 (99999,1,1),
 (5,7,6),
 (1,2,3),
 (4,5,6),
]

This is pretty easy to generate - as all you need to do is:

Write the header (points = [)
The leading indentation, the value, then a trailing comma ( {line},)
Then the footer (]).
Then you can write the rest of the program as you were planning to anyway (In this case, print(points)).

Note that trailing commas on the last item is accepted in python (some even encourage it, like myself) so you don't need to worry about detecting where you are in the file.

Lastly, if you want to keep your main python code separate to your list - consider using imports. Assuming you call your generated file points.py, your main code could start with the following:

from points import points
print(points)

This has the advantage of not having to write your python code in a large C string.

Toby Speight Toby Speight 87.3k14 gold badges104 silver badges322 bronze badges · Accepted Answer · 2019-06-18 09:34:13Z

Allocations can fail

Don't use buffer until we know it's not null. (And no need to multiply by sizeof (char), since that's automatically a no-op).

Please remember to free() it too (at least as an option). That reduces false positives with Valgrind's memory checker.

I/O operations can fail

Always check that your I/O succeeds.

Consider this scenario: we run the program in a directory containing a (hostile) translation.py. The directory and the file are both read-only, so the remove() and the fopen() both fail, as do all the fprintf() calls using the invalid file descriptor.

Then we get to the call to system(). What's the Python code that's executed?

Use a temporary file

Instead of assuming it's possible and desirable to overwrite translation.py in the current directory, perhaps we should mktemp() or similar, and remove the temporary file when we exit?

Python source files are text

It makes no difference on a POSIX system, but it's misleading to use "b" in the fopen() call. We don't ever read from it, so don't need the "+", and want to replace any existing file, not append (so we wouldn't need to remove()), so the open mode really should be plain "w".

Use the standard library

translateString2Number (and therefore also power()) can be replaced with a simple call to sscanf (since we know the numbers are all terminated by a non-digit).

In fact, if we can rely on the input being correctly formatted (and simply error out if it's wrong), we can just read all the input using scanf(), rather than allocating buffer to hold the entire input stream.

Consider this scenario: we run the program in a directory containing a (hostile) translation.py. The directory and the file are both read-only, so the remove() and the fopen() both fail, as do all the fprintf() calls using the invalid file descriptor. Then we get to the call to system(). What's the Python code that's executed? Are you a security engineer? WTF! Hella Smart! Blow my mind. Thanks for your answer,.
So could you further explain how you would read without a buffer and how to assure the allocation has happened correctly?
Well, the simple way to read without allocating our own buffer is to fscanf() directly from the input stream. That would greatly limit our options for error reporting, so I'd likely recommend using a fixed-size line buffer instead: read a line at a time, fail if line is too long or if fscanf(buffer, " (%d,%d,%d)", &a, &b, &c) != 3), maintain a line counter (for the error reporting). Oh, and I do have some experience in security analysis on various Unix platforms, but it's not my current main job.
You further said that you suggest making a temporary file. However, I have not figured out what that accomplishes? What do I do with the temp file afterwards?
mktemp() will give you a file that's writeable and not already in use, so it reduces the likelihood of a few foreseeable failure cases. After use, it's considered polite to remove the temporary file (many installations have a "tmpreaper" daemon to clean up old files from /tmp or equivalent, but treat that as a last resort).

Stack Exchange Network

Heavily limited premature compiler translates text into excecutable python code

3 Answers 3

Allocations can fail

I/O operations can fail

Use a temporary file

Python source files are text

Use the standard library

Use the correct form for `main`

Use whitespace to make code more readable

Don't make pointless copies

Don't do more work than needed

Don't leak memory

Write more concise Python

Don't convert numbers from text only to convert them back

Use a state machine for parsing

Don't hardcode file names

Eliminate "magic numbers"

An example

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Heavily limited premature compiler translates text into excecutable python code

3 Answers 3

Allocations can fail

I/O operations can fail

Use a temporary file

Python source files are text

Use the standard library

Use the correct form for main

Use whitespace to make code more readable

Don't make pointless copies

Don't do more work than needed

Don't leak memory

Write more concise Python

Don't convert numbers from text only to convert them back

Use a state machine for parsing

Don't hardcode file names

Eliminate "magic numbers"

An example

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions

Use the correct form for `main`