5
\$\begingroup\$

I implemented a compiler that compiles a dialect of BASIC known as SimpleBASIC into Simpletron Machine Language to be run by the Simpletron simulator.

I wrote the manual for the SimpleBASIC compiler in troff, it contains the commands of the SimpleBASIC language and some example programs.

simple(6) Games Manual simple(6)
NAME
 simple - compiles SimpleBASIC into Simple Machine Language
SYNOPSIS
 simple [-O] [-o outfile] infile
DESCRIPTION
 simple compiles a source code written in a dialect of BASIC known as
 SimpleBASIC into a program in Simple Machine Language (SML) ready to
 be run by the Simpletron simulator.
 A file containing a SimpleBASIC program is read by the compiler and
 converted to SML code. The SML code is output to a file on disk, in
 which SML instructions appear one per line. This file is by default
 a.out, but can be set by the option -o. The SML file can then be
 loaded into the Simpletron simulator.
 The options are as follows:
 -O Optimize the compilation. Makes the compiler do another pass
 to get rid of redundant instructions.
 -o outfile
 Put executable program into output file outfile.
THE SIMPLEBASIC COMPILATOR
 The symbol table
 The symbol table is a table that contains each symbol (label or
 variable) in the program and its corresponding location in the Sim‐
 pletron's memory. For a label, the location is the position in the
 Simpletron memory where the SML instruction generated by the Simple‐
 BASIC statement begin. For a variable, the location is the position
 in the Simpletron memory where the variable is stored.
 Each symbol (variable or label) is a single-character identifier.
 The SML array
 The SML array contains the Simpletron Machine Language (SML) in‐
 structions generated by the SimpleBASIC commands. An SML instruc‐
 tion is a four-digit integer that comprises two parts: the operation
 code (opcode) and the operand.
 The opcode is determined by commands in SimpleBASIC. For example,
 the SimpleBASIC command input correspond to SML opcode 11 (write).
 The operand is a immediate value or a memory location containing the
 data on which the operation code performs its task. For example,
 the opcode 10 (input) reads a value from keyboard and stores it in
 the memory location specified by the operand. The compiler searchs
 the symbol table to determine the Simpletron memory location for
 each symbol so the corresponding location can be used to complete
 the SML instructions.
 Generaly, each SimpleBASIC command generates a single SML instruc‐
 tion. But compilation of IF...GOTO and LET statements is more com‐
 plicated than other statements: they are the only statements that
 produce more than one SML instruction.
 For an IF...GOTO statement, the compiler produces code to test the
 condition and to branch to another line if necessary. The result of
 the branch could be an unresolved reference. Each of the relational
 and quality operators can be simulated using SML's branch zero and
 branch negative instructions (or possibly a combination of both).
 For a LET statement, the compiler produces code to evaluate an arbi‐
 trarily complex arithmetic expression consisting of integer vari‐
 ables and/or constants. When a compiler encounters an expression,
 it converts the expression from infix notation to postfix notation,
 then evaluates the postfix expression.
 The flag array
 When a GOTO statement is compiled with an unresolved reference (ie',
 it refers to a location in the code that has not been read yet), the
 SML instruction must be flagged to indicate that the second pass of
 the compiler must complete the instruction. The flags are stored in
 an array of type char in which each element is initialized to zero.
 If the memory location to which a line number in the Simple program
 refers is not yet known (ie', it's not in the symbol table), the
 line number is stored in the flag array in the element with the same
 subscript as the incomplete instruction. The operand of the incom‐
 plete instruction is set to 00 temporarily.
 For example, an unconditional branch instruction (making a forward
 reference) is left as +5000 until the resolve pass of the compiler.
 Memory size
 The Simpletron machine contains only 100 locations of memory, and a
 SML program loaded into the Simpletron's machine occupies the whole
 memory. Both data and instructions are located in memory. The be‐
 ginning of the memory is used to store instructions, while the end
 of the memory is used to store data. The position of the memory
 where the next instruction and data should be placed is determined
 by the instruction counter and the the data counter, respectively.
 Counters
 It's necessary to keep track of the next instruction location in the
 SML array because there is not a one-to-one correspondence between
 SimpleBASIC statements and SML instructions.
 Each time an instruction is produced, the instruction counter is in‐
 cremented to the next location in the SML array.
 The size of Simpletron's memory could present a problem for Simple
 programs with many statements, variables and constants. It's con‐
 ceivable that the compiler will run out of memory. To test for this
 case, the program contains a data counter to keep track of the loca‐
 tion at which the next variable or constant will be stored in the
 SML array.
 If the value of the instruction counter is larger than the data
 counter, the SML array is full. In this case, the compilation
 process terminates and the compiler prints an error message indicat‐
 ing that it ran out of memory during compilation.
 Passes
 The compiler performs four passes (five, if optimization is set) to
 convert a SimpleBASIC program to SML.
 Initialize
 The zeroth pass initializates the compilation variables and
 arrays. The symbol table, the SML array and the flag array
 are allocated and initialized. The compilation counters are
 zeroed.
 Populate
 The first actual pass populate the symbol table with the
 variable names and label names from the source code, populate
 the SML array with instructions generated by the SimpleBASIC
 statements, and flag any instruction that is incomplete and
 needs another pass to be fully completed.
 Optimization
 This pass only occurs if the -O option is used. It walks
 through the SML array in order to find sets of redundant in‐
 structions and replace them with a single instruction. When‐
 ever a redundant set of instructions is found, the locations
 in the symbol table and the flags are ajusted to point to the
 new instruction.
 Resolve
 This pass resolves any incomplete instruction. Incomplete
 instructions have only the opcode, but does not have a oper‐
 and, and are flagged as incomplete by the populate pass.
 When an instruction flagged as incomplete is found, this pass
 locate the symbol refered to by the flag array and insert the
 memory location from the symbol into the instruction with the
 unresolved reference.
 Assemble
 This pass walks through the SML array in order to write the
 instructions to a file. This file is ready to be executed by
 the Simpletron simulator. By default, this file is created
 as a.out, but this can be changed by the -o outfile option.
THE SIMPLEBASIC LANGUAGE
 SimpleBASIC is a simple, yet powerful, high level language similar
 to early versions of the popular language BASIC.
 Each line is a SimpleBASIC statement that consists of a command and
 its arguments. Each command begins with one of the following key‐
 words.
 input
 let
 print
 goto
 if
 end
 Variable and label names are single-letter names. SimpleBASIC does
 not allow descriptive variable names, so variables should be ex‐
 plained in comments to indicate their use in the program.
 SimpleBASIC uses only integer variables. Simple does not have vari‐
 able declarations, merely mentioning a variable name in a program
 causes the variable to be declared and initialized to zero automati‐
 cally.
 The syntax of SimpleBASIC does not allow string manipulation. If a
 string is encountered in a SimpleBASIC command, the compiler gener‐
 ates a syntax error.
 SimpleBASIC uses the conditional IF...GOTO statement and the uncon‐
 ditional GOTO statement to alter the flow of control during program
 execution. If the condition in the IF...GOTO statement is true,
 control is transferred to a specific line of the program.
 A comment in SimpleBASIC beggins with ; and go through the end of
 line.
 Labels
 A label is any letter followed by a colon. A label can occur before
 any SimpleBASIC statement. Labels are used by transfer of control
 statements to implement flow of control and loops during process ex‐
 ecution.
 SimpleBASIC commands
 Commands in SimpleBASIC are case insensitive. Both INPUT and input
 are the same command.
 INPUT A input statement prompts the user to enter an integer and
 assign it to a variable. For example, the following state‐
 ment reads an integer from the keyboard and stores that inte‐
 ger in x.
 INPUT x
 LET A let statement assign the value of an expression to a vari‐
 able. For example, the following statement assign u the
 value of 4 * (j - 56). An arbitarily complex expression can
 appear to the right of the equal sign. SimpleBASIC evaluates
 only integer expressions using the +, -, *, / and % opera‐
 tors. These operators have the same precedence as in C.
 Parentheses can be used to change the order of evaluation of
 an expression.
 LET u = 4 * (j - 56)
 PRINT A print statement display the value of a previously defined
 variable. For example, the following statement display the
 value of w.
 PRINT w
 GOTO A goto statement transfer program control to the statement
 after a specified label in the source code. For example, the
 following statement transfer control to the statement after
 the label a.
 GOTO a
 IF ... GOTO
 A if...goto statement compare two variables and transfer pro‐
 gram control to the label specified after GOTO if the condi‐
 tion is true; otherwise, continue execution with the next
 statement. The following relational and equality operators
 are valid in an IF...GOTO statement: <, >, <=, >=, == or !=.
 For example, the following statement compare i and z for
 equality and transfer program control to label j if they are
 equal.
 IF i == z GOTO j
 end A end statement terminate program execution. For example
 END
EXAMPLES
 add.basic
 The following program reads two integers from the keyboard, stores
 the values in variables a and b, and computes and prints their sum
 (stored in variable c).
 ; determine and print the sum of two integers
 ; input two integers
 INPUT a
 INPUT b
 ; add integers and store result in c
 LET c = a + b
 ; print the result
 PRINT c
 ; terminate program execution
 END
 larger.basic
 The following program determines and prints the larger of two inte‐
 gers. The integers are input from the keyboard and stored in s and
 t. The IF...GOTO statement tests the condition s >= t. If the con‐
 dition is true, control is transferred to label a and s is output;
 otherwise, t is output and control is transferred to the END state‐
 ment in label
 ; determine the larger of two integers
 INPUT s
 INPUT t
 ; test if s >= t
 IF s >= t GOTO a
 ; t is greater than s, so print t
 PRINT t
 GOTO b
 ; s is greater than or equal to t, so print s
 a:
 PRINT s
 b:
 END
 square.basic
 SimpleBASIC does not provide a repetition structure (such as C's
 for, while or do...while). However, SimpleBASIC can simulate each
 of C's repetition structures using the IF...GOTO and GOTO state‐
 ments.
 The following program uses a sentinel-controlled loop to calculate
 the squares of several integers. Each integer is input from the
 keyboard and stored in variable j. If the value entered is the sen‐
 tinel +0000, control is transfered to END, where the program termi‐
 nates. Otherwise, k is assigned the square of j, k is output to the
 screen and control is passed to where the next integer is input.
 ; Calculate the squares of several integers
 a:
 INPUT j
 ; set i as the sentinel value
 LET i = +0000
 ; test for sentinel value
 IF j == i GOTO b
 ; calculate square of j and assign result to k
 LET k = j * j
 PRINT k
 ; loop to get next j
 GOTO a
 b:
 END
EXIT STATUS
 0 Success.
 >0 Error occurred.
HISTORY
 This version of simple, the SimpleBASIC compiler, is based on the
 exercises 12.26~12.28 from the [Build Your Own Compiler] pdf pro‐
 vided by Deitel.
 The line label system is unique to this implementation, since the
 exercise use line number system as the target of GOTO statements.
 For more information, see the Wikipedia pages on [Line Number] and
 [Line Label].
CAVEATS
 This version of simple supports only variables in IF statements. It
 does not support expressions or constants in IF statements.
 This version of simple only supports single-letter symbols. In a
 next version, I will replace the symbol table by a binary search
 tree and implement multi-letter symbols.
 It also does not support immediate operands for instructions.
SEE ALSO
 simpletron(6)
 [Build Your Own Compiler]
 https://web.archive.org/web/20190819021934/http://www.deitel.com/bookresources/chtp8/CompilerExercises.pdf
 [Line number]
 https://en.wikipedia.org/wiki/Line_number
 [Line label]
 https://en.wikipedia.org/wiki/Line_label
 [Deitel & Deitel]
 C: How to Program (8th edition), Paul Deitel and Harvey Dei‐
 tel
 simple(6)

Here is simple.h

#define MEMSIZE 100
#define TOKENSIZE 63
typedef int16_t memtype;
enum tokentype {
 COMMENT,
 RELATIONAL,
 ARITHMETIC,
 ASSIGNMENT,
 GOTOKEYWRD,
 VARIABLE,
 EXPRESSION,
 LABEL,
 NEWLINE,
 COMMAND
};
enum operation {
 READ = 10,
 WRITE = 11,
 LOAD = 20,
 STORE = 21,
 ADD = 30,
 SUBTRACT = 31,
 DIVIDE = 32,
 MULTIPLY = 33,
 REMINDER = 34,
 ADD_I = 40,
 SUBTRACT_I = 41,
 DIVIDE_I = 42,
 MULTIPLY_I = 43,
 REMINDER_I = 44,
 BRANCH = 50,
 BRANCHNEG = 51,
 BRANCHZERO = 52,
 HALT = 53
};
struct symbol {
 enum {
 label,
 variable,
 none
 } type;
 size_t location;
};
struct expression {
 enum {
 symb,
 num,
 op
 } type;
 union {
 memtype n;
 char c;
 } u;
 struct expression *next;
};
struct compiler {
 size_t memsize;
 struct symbol *symtable; /* the symbol table */
 memtype *sml; /* the sml instructions */
 char *flag; /* the flag array */
 char *file; /* name of file to be compiled*/
 size_t ln; /* current line of file to be compiled */
 size_t inscount;
 size_t datacount;
};
typedef void (*instruction)(struct compiler *);

Here is simple.c

#include <err.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>
#include <limits.h>
#include "simple.h"
static char *tokenstring[] = {
 [COMMENT] = "comment",
 [RELATIONAL] = "relational operator",
 [ARITHMETIC] = "arithmetic operator",
 [ASSIGNMENT] = "equal sign",
 [GOTOKEYWRD] = "goto keyword",
 [VARIABLE] = "variable",
 [EXPRESSION] = "expression",
 [LABEL] = "label",
 [NEWLINE] = "newline",
 [COMMAND] = "command"
};
/* compilation passes */
static void initialize(struct compiler *, char *);
static void populate(struct compiler *);
static void optimize(struct compiler *);
static void resolve(struct compiler *);
static void assemble(struct compiler, char *);
/* grammatical functions */
static char *gettoken(struct compiler *, enum tokentype, enum tokentype *);
/* functions to generate machine instructions */
static void command_input(struct compiler *);
static void command_let(struct compiler *);
static void command_print(struct compiler *);
static void command_goto(struct compiler *);
static void command_if(struct compiler *);
static void command_end(struct compiler *);
static instruction iscommand(char *);
/* symbol table manipulation functions */
static struct symbol *inserttable(struct compiler *, char, int, size_t);
static struct symbol *searchtable(struct compiler *, char);
/* expression handler functions */
static struct expression *getexpr(struct compiler *, char *s);
static int isoperator(char c);
static void enqueue(struct expression **head, struct expression **tail, struct expression operand);
/* usage */
static void usage(void);
/* simpleBASIC to Simpletron Machine Language compiler */
int
main(int argc, char *argv[])
{
 struct compiler comp;
 int c;
 char *out = "a.out";
 bool dooptimize;
 dooptimize = false;
 while ((c = getopt(argc, argv, "Oo:")) != -1) {
 switch (c) {
 case 'O':
 dooptimize = true;
 break;
 case 'o':
 out = optarg;
 break;
 default:
 usage();
 break;
 }
 }
 argc -= optind;
 argv += optind;
 if (argc != 1)
 usage();
 initialize(&comp, *argv);
 populate(&comp);
 if (dooptimize)
 optimize(&comp);
 resolve(&comp);
 assemble(comp, out);
 return EXIT_SUCCESS;
}
/* initialize symbol table, instruction array, flag array and counters */
static void
initialize(struct compiler *comp, char *filename)
{
 static struct symbol symtable['z' - 'A'];
 static memtype sml[MEMSIZE];
 static char flag[MEMSIZE];
 size_t i;
 comp->symtable = symtable;
 comp->sml = sml;
 comp->flag = flag;
 comp->memsize = MEMSIZE;
 comp->file = filename;
 comp->ln = 1;
 comp->inscount = 0;
 comp->datacount = comp->memsize - 1;
 /* initialize arrays */
 for (i = 'A'; i < 'z'; i++)
 symtable[i - 'A'].type = none;
 for (i = 0; i < comp->memsize; i++)
 flag[i] = '0円';
}
/* populate symbol table, instruction array, and flag array */
static void
populate(struct compiler *comp)
{
 enum tokentype toktype;
 char *tok;
 /* while there is a command, generate instructions for it */
 while ((tok = gettoken(comp, COMMAND, &toktype)) != NULL) {
 /* ignore comments */
 if (toktype == COMMENT)
 continue;
 /* if command is a label declaration, put it in symbol table */
 if (toktype == LABEL) {
 if (searchtable(comp, *tok) != NULL) {
 fprintf(stderr, "%s:%lu: error: redeclaration of '%c'\n",
 comp->file, comp->ln, *tok);
 exit(EXIT_FAILURE);
 }
 inserttable(comp, *tok, label, comp->inscount);
 continue;
 }
 /* generate instructions for statement */
 (*iscommand(tok))(comp);
 /* get obligatory newline after statement */
 gettoken(comp, NEWLINE, &toktype);
 /* if instructions overlap data, compilation run out of memory */
 if (comp->inscount > comp->datacount)
 errx(1, "compilation ran out of memory");
 }
}
/* optimize compilation by getting rid of redundant instructions */
static void
optimize(struct compiler *comp)
{
 memtype opcode0, opcode1, opcode2; /* current, next and nextnext opcode */
 memtype operand0, operand1; /* current and next operand */
 size_t i, j;
 for (i = 0; i < comp->inscount; i++) {
 opcode0 = comp->sml[i] / comp->memsize;
 operand0 = comp->sml[i] % comp->memsize;
 opcode1 = comp->sml[i+1] / comp->memsize;
 operand1 = comp->sml[i+1] % comp->memsize;
 opcode2 = comp->sml[i+2] / comp->memsize;
 if (operand0 == operand1 && opcode2 == STORE
 && opcode0 == STORE && opcode1 == LOAD) {
 for (j = 0; j < 'z' - 'A'; j++)
 if (comp->symtable[j].type == label && comp->symtable[j].location > i)
 comp->symtable[j].location -= 2;
 for (j = i; j < comp->inscount; j++) {
 comp->sml[j] = comp->sml[j + 2];
 comp->flag[j] = comp->flag[j + 2];
 }
 }
 }
}
/* resolve memory locations flagged as incomplete */
static void
resolve(struct compiler *comp)
{
 struct symbol *sym;
 size_t i;
 /* traverse flag array looking for flagged memory locations */
 for (i = 0; i < comp->memsize; i++) {
 if (comp->flag[i] != '0円') {
 sym = searchtable(comp, comp->flag[i]);
 if (sym == NULL)
 errx(1, "failed to find label %c", comp->flag[i]);
 if (sym->type != label)
 errx(1, "failed to find label %c", comp->flag[i]);
 comp->sml[i] += sym->location;
 }
 }
}
/* write machine code in memory to file */
static void
assemble(struct compiler comp, char *file)
{
 FILE *fp;
 size_t i;
 if ((fp = fopen(file, "w")) == NULL)
 err(EXIT_FAILURE, "%s", file);
 for (i = 0; i < comp.memsize; i++)
 fprintf(fp, "%+05d\n", comp.sml[i]);
 if (ferror(fp))
 err(1, "%s", file);
 fclose(fp);
}
/*
 * Get next token from fp, and return a pointer to it (or NULL if EOF is found).
 * expected is the type of expected token, *got is the type of got token.
 * If expected and *got are different, exit with error, except when expected a
 * command and got a label definition.
 */
static char *
gettoken(struct compiler *comp, enum tokentype expected, enum tokentype *got)
{
 instruction inst;
 static char s[TOKENSIZE];
 static FILE *fp = NULL;
 int c;
 size_t i;
 if (fp == NULL) {
 fp = fopen(comp->file, "r");
 if (fp == NULL)
 err(EXIT_FAILURE, "%s", comp->file);
 }
 /* get blank on the beginning */
 while (isblank(c = getc(fp)))
 ;
 i = 0;
 /* consider blank lines as remarks */
 if (expected == COMMAND && c == '\n') {
 s[i] = ';';
 s[i++] = '0円';
 comp->ln++;
 *got = COMMENT;
 } else if (expected == EXPRESSION) { /* if expected expression, get remaining of line */
 while (c != '\n' && i < TOKENSIZE - 1) {
 s[i++] = c;
 c = getc(fp);
 }
 if (c != '\n') {
 fprintf(stderr, "%s:%lu: error: expression too long\n", comp->file, comp->ln);
 exit(EXIT_FAILURE);
 }
 ungetc(c, fp);
 s[i] = '0円';
 *got = EXPRESSION;
 } else if (c == '\n') { /* because newline is a token */
 s[i++] = c;
 s[i] = '0円';
 *got = NEWLINE;
 } else if (c == ';') { /* comment */
 s[i++] = c;
 s[i] = '0円';
 while (c != '\n')
 c = getc(fp);
 comp->ln++;
 *got = COMMENT;
 } else if (c == '=') { /* test if it's = or == */
 s[i++] = c;
 c = getc(fp);
 if (c == '=') {
 s[i++] = c;
 *got = RELATIONAL;
 } else {
 ungetc(c, fp);
 *got = ASSIGNMENT;
 }
 s[i] = '0円';
 } else if (c == '!') { /* token is != */
 s[i++] = c;
 c = getc(fp);
 if (c != '=') {
 fprintf(stderr, "%s:%lu: error: unexpected character '%c'\n",
 comp->file, comp->ln, c);
 exit(EXIT_FAILURE);
 }
 s[i++] = c;
 s[i] = '0円';
 *got = RELATIONAL;
 } else if (c == '<' || c == '>') { /* token is <, >, <= or >= */
 s[i++] = c;
 c = getc(fp);
 if (c == '=')
 s[i++] = c;
 else
 ungetc(c, fp);
 s[i] = '0円';
 *got = RELATIONAL;
 } else if (isoperator(c)) { /* token is operator */
 s[i++] = c;
 s[i] = '0円';
 *got = ARITHMETIC;
 } else if (isalpha(c)) { /* token is command, variable or label */
 do {
 s[i++] = c;
 c = getc(fp);
 } while (isalpha(c) && i < TOKENSIZE - 1);
 s[i] = '0円';
 if (isalpha(c)) {
 fprintf(stderr, "%s:%lu: error: token too long\n", comp->file, comp->ln);
 exit(EXIT_FAILURE);
 }
 if (c == ':') {
 *got = LABEL;
 while (isblank(c = getc(fp)))
 ;
 if (c != '\n')
 ungetc(c, fp);
 } else if ((inst = iscommand(s)) != NULL) {
 ungetc(c, fp);
 if (expected == GOTOKEYWRD && inst == command_goto)
 *got = GOTOKEYWRD;
 else
 *got = COMMAND;
 } else {
 ungetc(c, fp);
 if (expected == LABEL)
 *got = LABEL;
 else
 *got = VARIABLE;
 }
 } else if (c == EOF) { /* close file after reading it */
 fclose(fp);
 return NULL;
 } else {
 fprintf(stderr, "%s:%lu: error: unexpected character '%c'\n", comp->file, comp->ln, c);
 exit(EXIT_FAILURE);
 }
 if (ferror(fp))
 err(1, "%s", comp->file);
 /* if got a newline, increment line count */
 if (*s == '\n')
 comp->ln++;
 /* test whether you got what you expected (except for labels) */
 if (expected != *got) {
 if (expected != COMMAND || (*got != LABEL && *got != COMMENT)) {
 fprintf(stderr, "%s:%lu: error: %s expected (got %s)\n",
 comp->file, comp->ln, tokenstring[expected], tokenstring[*got]);
 exit(EXIT_FAILURE);
 }
 }
 return s;
}
/* generate machine instructions for let statement */
static void
command_let(struct compiler *comp)
{
 struct symbol *sym;
 char *tok;
 enum tokentype toktype;
 size_t *stack, stacksize;
 size_t var, i;
 struct expression *expr;
 /* get location of variable to be assigned by let statement */
 tok = gettoken(comp, VARIABLE, &toktype); /* get variable symbol */
 sym = searchtable(comp, *tok); /* search in symbol table*/
 if (sym == NULL) /* if not found, insert it*/
 sym = inserttable(comp, *tok, variable, comp->datacount--);
 var = sym->location;
 /* get equal sign */
 tok = gettoken(comp, ASSIGNMENT, &toktype);
 /* get expression to assign and stack of addresses */
 tok = gettoken(comp, EXPRESSION, &toktype);
 expr = getexpr(comp, tok);
 stacksize = strlen(tok);
 stack = calloc(stacksize, sizeof *stack);
 /* generate machine instructions */
 i = 0;
 while (i < stacksize && expr != NULL) {
 size_t op1, op2;
 struct expression *tmp;
 if (expr->type == num) {
 comp->sml[comp->datacount] = expr->u.n;
 stack[i++] = comp->datacount--;
 } else if (expr->type == symb) {
 if ((sym = searchtable(comp, expr->u.c)) == NULL) /* if symbol not found, insert it*/
 sym = inserttable(comp, expr->u.c, variable, comp->datacount--);
 stack[i++] = sym->location;
 } else {
 if (i < 2) {
 fprintf(stderr, "%s:%lu error improper expression '%s'\n",
 comp->file, comp->ln, tok);
 exit(EXIT_FAILURE);
 }
 op2 = stack[--i];
 op1 = stack[--i];
 comp->sml[comp->inscount++] = LOAD * MEMSIZE + op1;
 if (expr->u.c == '+')
 comp->sml[comp->inscount++] = ADD * MEMSIZE + op2;
 if (expr->u.c == '-')
 comp->sml[comp->inscount++] = SUBTRACT * MEMSIZE + op2;
 if (expr->u.c == '*')
 comp->sml[comp->inscount++] = MULTIPLY * MEMSIZE + op2;
 if (expr->u.c == '/')
 comp->sml[comp->inscount++] = DIVIDE * MEMSIZE + op2;
 if (expr->u.c == '%')
 comp->sml[comp->inscount++] = REMINDER * MEMSIZE + op2;
 comp->sml[comp->inscount++] = STORE * MEMSIZE + comp->datacount;
 stack[i++] = comp->datacount--;
 }
 tmp = expr;
 expr = expr->next;
 free(tmp);
 }
 if (i == 0) {
 fprintf(stderr, "%s:%lu error improper expression '%s'\n",
 comp->file, i, tok);
 exit(EXIT_FAILURE);
 }
 comp->sml[comp->inscount++] = LOAD * MEMSIZE + stack[--i];
 comp->sml[comp->inscount++] = STORE * MEMSIZE + var;
}
/* generate machine instructions for input statement */
static void
command_input(struct compiler *comp)
{
 char *tok;
 enum tokentype toktype;
 struct symbol *sym;
 /* get location of variable to be input */
 tok = gettoken(comp, VARIABLE, &toktype); /* get variable name */
 sym = searchtable(comp, *tok); /* search in symbol table */
 if (sym == NULL) /* if not found, insert it*/
 sym = inserttable(comp, *tok, variable, comp->datacount--);
 /* generate machine instruction */
 comp->sml[comp->inscount++] = READ * MEMSIZE + sym->location;
}
/* generate machine instructions for print statement */
static void
command_print(struct compiler *comp)
{
 char *tok;
 enum tokentype toktype;
 struct symbol *sym;
 /* get location of variable to print */
 tok = gettoken(comp, VARIABLE, &toktype); /* get variable name */
 sym = searchtable(comp, *tok); /* search in symbol table */
 if (sym == NULL) { /* if not found, it's an error */
 fprintf(stderr, "%s:%lu: error: '%c' undeclared\n",
 comp->file, comp->ln, *tok);
 exit(EXIT_FAILURE);
 }
 /* generate machine instruction */
 comp->sml[comp->inscount++] = WRITE * MEMSIZE + sym->location;
}
/* generate machine instructions for goto statement */
static void
command_goto(struct compiler *comp)
{
 char *tok;
 enum tokentype toktype;
 struct symbol *sym;
 size_t labeladdr;
 /* get location to go to */
 labeladdr = 0;
 tok = gettoken(comp, LABEL, &toktype); /* get label name */
 sym = searchtable(comp, *tok); /* search in symbol table */
 if (sym == NULL) /* if label not found, flag it*/
 comp->flag[comp->inscount] = *tok;
 else /* if label is found, set it */
 labeladdr = sym->location;
 /* generate machine instruction */
 comp->sml[comp->inscount++] = BRANCH * MEMSIZE + labeladdr;
}
/* generate machine instructions for if statement */
static void
command_if(struct compiler *comp)
{
 char *tok;
 char relop[3]; /* relational operator */
 enum tokentype toktype;
 struct symbol *sym;
 size_t op1, op2, labeladdr;
 /* get location of first variable */
 tok = gettoken(comp, VARIABLE, &toktype); /* get variable */
 sym = searchtable(comp, *tok); /* search in symbol table */
 if (sym == NULL) /* if not found, insert it*/
 sym = inserttable(comp, *tok, variable, comp->datacount--);
 op1 = sym->location;
 /* get relational operator */
 tok = gettoken(comp, RELATIONAL, &toktype);
 strncpy(relop, tok, 2);
 relop[3] = '0円';
 /* get location of second variable */
 tok = gettoken(comp, VARIABLE, &toktype); /* get variable */
 sym = searchtable(comp, *tok); /* search in symbol table */
 if (sym == NULL) /* if not found, insert it*/
 sym = inserttable(comp, *tok, variable, comp->datacount--);
 op2 = sym->location;
 /* get obligatory 'goto' keyword */
 tok = gettoken(comp, GOTOKEYWRD, &toktype);
 /* get address of label to go to */
 labeladdr = 0;
 tok = gettoken(comp, LABEL, &toktype); /* get label */
 sym = searchtable(comp, *tok); /* search in symbol table */
 if (sym != NULL) /* if label is found, set it */
 labeladdr = sym->location;
 /* generate instructions based on branch type */
 if (strcmp(relop, "==") == 0) {
 comp->sml[comp->inscount++] = LOAD * MEMSIZE + op1;
 comp->sml[comp->inscount++] = SUBTRACT * MEMSIZE + op2;
 comp->sml[comp->inscount++] = BRANCHZERO * MEMSIZE + labeladdr;
 } else if (strcmp(relop, "!=") == 0) {
 comp->sml[comp->inscount++] = LOAD * MEMSIZE + op1;
 comp->sml[comp->inscount++] = SUBTRACT * MEMSIZE + op2;
 comp->sml[comp->inscount++] = BRANCHZERO * MEMSIZE + 2;
 comp->sml[comp->inscount++] = BRANCH * MEMSIZE + labeladdr;
 } else if (strcmp(relop, "<") == 0) {
 comp->sml[comp->inscount++] = LOAD * MEMSIZE + op1;
 comp->sml[comp->inscount++] = SUBTRACT * MEMSIZE + op2;
 comp->sml[comp->inscount++] = BRANCHNEG * MEMSIZE + labeladdr;
 } else if (strcmp(relop, ">") == 0) {
 comp->sml[comp->inscount++] = LOAD * MEMSIZE + op2;
 comp->sml[comp->inscount++] = SUBTRACT * MEMSIZE + op1;
 comp->sml[comp->inscount++] = BRANCHNEG * MEMSIZE + labeladdr;
 } else if (strcmp(relop, "<=") == 0) {
 comp->sml[comp->inscount++] = LOAD * MEMSIZE + op1;
 comp->sml[comp->inscount++] = SUBTRACT * MEMSIZE + op2;
 comp->sml[comp->inscount++] = BRANCHNEG * MEMSIZE + labeladdr;
 comp->sml[comp->inscount++] = BRANCHZERO * MEMSIZE + labeladdr;
 } else if (strcmp(relop, ">=") == 0) {
 comp->sml[comp->inscount++] = LOAD * MEMSIZE + op2;
 comp->sml[comp->inscount++] = SUBTRACT * MEMSIZE + op1;
 comp->sml[comp->inscount++] = BRANCHNEG * MEMSIZE + labeladdr;
 comp->sml[comp->inscount++] = BRANCHZERO * MEMSIZE + labeladdr;
 } else {
 exit(EXIT_FAILURE);
 }
 /*
 * this checking must go after generating the instructions,
 * for it needs the variable inscount to be modified.
 */
 if (sym == NULL) /* if label not found, flag it */
 comp->flag[comp->inscount-1] = *tok;
}
/* generate machine instructions for end statement */
static void
command_end(struct compiler *comp)
{
 comp->sml[comp->inscount++] = HALT * MEMSIZE + 0;
}
/* return type of command s */
static instruction
iscommand(char *s)
{
 if (strcasecmp(s, "INPUT") == 0)
 return command_input;
 if (strcasecmp(s, "PRINT") == 0)
 return command_print;
 if (strcasecmp(s, "LET") == 0)
 return command_let;
 if (strcasecmp(s, "GOTO") == 0)
 return command_goto;
 if (strcasecmp(s, "IF") == 0)
 return command_if;
 if (strcasecmp(s, "END") == 0)
 return command_end;
 return NULL;
}
/* search for symbol in symbol table */
static struct symbol *
searchtable(struct compiler *comp, char c)
{
 if (comp->symtable[c - 'A'].type != none)
 return comp->symtable + c - 'A';
 return NULL;
}
/* insert symbol in symbol table */
static struct symbol *
inserttable(struct compiler *comp, char c, int type, size_t loc)
{
 if (comp->symtable[c - 'A'].type != none)
 return NULL;
 comp->symtable[c - 'A'].type = type;
 comp->symtable[c - 'A'].location = loc;
 return comp->symtable + c - 'A';
}
/* Convert infix expression in s into postfix expression in *expr list */
struct expression *
getexpr(struct compiler *comp, char *s)
{
 struct expression *head, *tail;
 char *stack, *endp;
 long n;
 int i;
 head = tail = NULL;
 stack = endp = NULL;
 i = 0;
 if ((stack = malloc(strlen(s) + 1)) == NULL)
 err(1, NULL);
 while (*s != '0円') {
 while (isspace(*s))
 s++;
 if (isalpha(*s)) {
 enqueue(&head, &tail, (struct expression) {.type = symb, .u.c = *s});
 } else if (isdigit(*s) || ((*s == '+' || *s == '-') && isdigit(s[1]))) {
 n = strtol(s, &endp, 10);
 if (n > INT_MAX || n < INT_MIN || endp == s) {
 fprintf(stderr, "%s:%lu: error: integer to big\n",
 comp->file, comp->ln);
 exit(EXIT_FAILURE);
 }
 enqueue(&head, &tail, (struct expression) {.type = num, .u.n = n});
 } else if (*s == '(') {
 stack[i++] = '(';
 } else if (*s == ')') {
 while (i > 0 && isoperator(stack[i-1])) {
 i--;
 enqueue(&head, &tail, (struct expression) {.type = op, .u.c = stack[i]});
 }
 if (i > 0 && stack[i-1] == '(')
 i--;
 } else if (isoperator(*s)) {
 while (i > 0 && isoperator(stack[i-1])
 && isoperator(stack[i-1]) >= isoperator(*s)) {
 i--;
 enqueue(&head, &tail, (struct expression) {.type = op, .u.c = stack[i]});
 }
 stack[i++] = *s;
 } else {
 fprintf(stderr, "%s:%lu: error: unexpected character '%c'\n",
 comp->file, comp->ln, *s);
 exit(EXIT_FAILURE);
 }
 s++;
 }
 while (i > 0 && isoperator(stack[i-1])) {
 i--;
 enqueue(&head, &tail, (struct expression) {.type = op, .u.c = stack[i]});
 }
 free(stack);
 return head;
}
/* return precedence of operator c, or 0 if it is not an operator */
static int
isoperator(char c)
{
 switch (c) {
 case '+': case '-':
 return 1;
 case '/': case '*': case '%':
 return 2;
 default:
 break;
 }
 return 0;
}
/* enqueue an operand or operator (op) into the expression list */
static void
enqueue(struct expression **head, struct expression **tail, struct expression op)
{
 struct expression *p;
 if ((p = malloc(sizeof *p)) == NULL)
 err(1, NULL);
 p->type = op.type;
 if (p->type == num)
 p->u.n = op.u.n;
 else
 p->u.c = op.u.c;
 p->next = NULL;
 if (*head == NULL)
 *head = p;
 else
 (*tail)->next = p;
 *tail = p;
}
static void
usage(void)
{
 (void) fprintf(stderr, "usage: simple [-O] [-o file.sml] file.simp\n");
 exit(EXIT_FAILURE);
}

Here is a sample program in the SimpleBASIC language, sum.basic, it receives an integer (in the format of a Simpletron instruction) and outputs the sum of each number from 1 to the value entered.

; Sum 1 to x
INPUT x
; check y == x
a: IF y == x GOTO e
; increment y
LET y = y + 1
; ADD y to total
LET t = t + y
; loop y
GOTO a
; output result
e:
PRINT t
END

The input of a Simpletron program must be a signed four-digit integer, like +0007 or -0001.

asked Mar 29, 2020 at 15:55
\$\endgroup\$
5
  • \$\begingroup\$ [COMMENT] = "comment" - what the heck is happening here? Is this some way of specifying the index of assignment for an array initializer? \$\endgroup\$ Commented Mar 30, 2020 at 1:13
  • \$\begingroup\$ COMMENT is 0, it is defined in enum tokentype. The array tokenstring is initialized with the values in enum tokentype. The array is used by the function gettoken to generate error strings when the got token is different from the expected token. For example, when the compiler expects a variable name but gets a comment, it prints "%s expected (got %s)", tokenstring[VARIABLE], tokenstring[COMMENT] \$\endgroup\$ Commented Mar 30, 2020 at 1:19
  • \$\begingroup\$ The array tokenstring is initialized with the values in enum tokentype - right; specifically I'm unfamiliar with the brackets-on-the-LHS syntax. \$\endgroup\$ Commented Mar 30, 2020 at 1:21
  • 1
    \$\begingroup\$ An array can be initialized with the elements 2 and 4 specified as int my_array[5] = { [2] = 5, [4] = 9 };. It is equivalent to int my_array[5] = { 0, 0, 5, 0, 9 };. That's a C99 thing, I think. \$\endgroup\$ Commented Mar 30, 2020 at 1:24
  • \$\begingroup\$ Neat :) I don't pretend to know the whole standard; that's useful. \$\endgroup\$ Commented Mar 30, 2020 at 1:34

1 Answer 1

5
\$\begingroup\$

C99, assignment-in-conditions

At the risk of sounding like a broken record - I'll make the same recommendations as in Simpletron simulator in C . Consider moving your variable declarations closer to where they're used, and expanding out your assignment-in-condition statements.

There's another benefit to C99: this -

comp->symtable = symtable;
comp->sml = sml;
comp->flag = flag;
comp->memsize = MEMSIZE;
comp->file = filename;
comp->ln = 1;
comp->inscount = 0;
comp->datacount = comp->memsize - 1;

could be converted to a const static structure instance using C99 standard section 6.7.8 Initialization ("designated initializers"), like:

const static struct compiler default = {
 .ln = 1,
 // ...
};

and then all of the defaults assigned in a single statement.

Nasty side-effects

This:

static struct symbol symtable['z' - 'A'];
static memtype sml[MEMSIZE];
static char flag[MEMSIZE];
comp->symtable = symtable;
comp->sml = sml;
comp->flag = flag;

is nasty. From the outside it looks like compiler is a re-entrant structure, but in actuality the memory is shared. A naive caller would pass in two different compiler instances, and then be surprised when data leaks from one to the other.

Surprising function names

I would assume that this:

iscommand(tok)

returns a boolean, but it actually returns a pointer?

(*iscommand(tok))(comp);

It should be named something like getinstruction.

Helper pointers

This:

 opcode0 = comp->sml[i] / comp->memsize;
 operand0 = comp->sml[i] % comp->memsize;
 opcode1 = comp->sml[i+1] / comp->memsize;
 operand1 = comp->sml[i+1] % comp->memsize;
 opcode2 = comp->sml[i+2] / comp->memsize;

can be somewhat abbreviated by the creation of a temporary pointer equal to comp->sml + i.

Switch

This is the perfect use-case for a switch:

 if (expr->u.c == '+')
 comp->sml[comp->inscount++] = ADD * MEMSIZE + op2;
 if (expr->u.c == '-')
 comp->sml[comp->inscount++] = SUBTRACT * MEMSIZE + op2;
 if (expr->u.c == '*')
 comp->sml[comp->inscount++] = MULTIPLY * MEMSIZE + op2;
 if (expr->u.c == '/')
 comp->sml[comp->inscount++] = DIVIDE * MEMSIZE + op2;
 if (expr->u.c == '%')
 comp->sml[comp->inscount++] = REMINDER * MEMSIZE + op2;
answered Mar 30, 2020 at 1:32
\$\endgroup\$
5
  • \$\begingroup\$ iscommand actually returned int in a previous version. It returned a number representing the command, or zero, if the string is not a command. But then I needed a function that returned a pointer to the function related to a string, then reused iscommand for that. \$\endgroup\$ Commented Mar 30, 2020 at 1:37
  • \$\begingroup\$ From the outside it looks like compiler is a re-entrant structure, but in actuality the memory is shared. Should I define the arrays inside the structure, then? It would be a humongous array, so I should switch the pass-by-value from function assemble to a pass-by-reference. \$\endgroup\$ Commented Mar 30, 2020 at 1:49
  • 1
    \$\begingroup\$ I think your structure can basically remain as-is; just use malloc instead of assigning a reference to a static variable. \$\endgroup\$ Commented Mar 30, 2020 at 1:49
  • \$\begingroup\$ Consider moving your variable declarations closer to where they're used. I have adopted a coding style that explicitly specifies to not mix declaration and code and to put declarations at the beginning of the function. On assignment-in-conditions, I watched out to not use them, but I think there is a few I wrote on automatic. \$\endgroup\$ Commented Mar 30, 2020 at 2:25
  • \$\begingroup\$ I agree with Reinderien - that coding style is 40 years obsolete. The original reason (in the 1970s) to put declarations first was to accommodate memory limitations of hardware at that time (e.g. my first computer in 1978 only had 8K of RAM). In more modern renditions of C, the computer accommodates the human instead of the human having to accommodate the machine. Unless you're going to code exclusively for a PDP-8, sticking with that limitation is going to impede your progress to more complex projects. \$\endgroup\$ Commented Mar 30, 2020 at 2:46

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.