Reverse Polish notation based compiler

Question 1

Description

Very small subset of Forth
This is a proof of concept level compiler, no optimizations or over/underflow checking
See the embedded POD for more information
NASM is used as assembler
gcc is used to link with glibc
32bit ELF Binary is generated

bhathiforth.pl

#!/usr/bin/perl
use strict;
use warnings;
use feature qw(say);
sub tokenize {
 my $fullcode = shift;
 if ( not defined $fullcode ) {
 die "Invalid Arguments";
 }
 my @tokens;
 while ( $fullcode =~ /([0-9]+|\+|\-|\*|\/|\.)/g ) {
 push @tokens, $1;
 }
 return @tokens;
}
sub generate_assembly {
 my @tokens = @{ $_[0] };
 if ( not @tokens ) {
 die "Invalid Arguments";
 }
 my $assembly = "section .text\nglobal main\nextern printf\nmain:\n";
 say "Tokens";
 say "==================";
 foreach (@tokens) {
 say "<$_>";
 if ( $_ =~ /[0-9]+/ ) {
 $assembly .= "push $_\n";
 }
 elsif ( $_ eq "+" ) {
 $assembly .= "pop ebx\npop eax\nadd eax,ebx\npush eax\n";
 }
 elsif ( $_ eq "-" ) {
 $assembly .= "pop ebx\npop eax\nsub eax,ebx\npush eax\n";
 }
 elsif ( $_ eq "/" ) {
 $assembly .= "mov edx,0\npop ecx\npop eax\ndiv ecx\npush eax\n";
 }
 elsif ( $_ eq "*" ) {
 $assembly .= "mov edx,0\npop ecx\npop eax\nmul ecx\npush eax\n";
 }
 elsif ( $_ eq "." ) {
 $assembly .= "push message\ncall printf\nadd esp, 8\n";
 }
 }
 $assembly .= "ret\nmessage db \"%d\", 10, 0;";
 say "==================";
 return $assembly;
}
my $version = "0.1";
say "Welcome to BhathiFoth compiler v$version";
say "========================================";
my $source = shift @ARGV;
my $output = shift @ARGV;
if ( not defined $source or not defined $output ) {
 say
"Invalid Commandline arguments.\n\nUSAGE:\n% ./bhathiforth.pl <source> <output>";
 exit;
}
open my $CODE, "<", $source or die "Cannot open file '$source'\n:$!";
my ( $line, $fullcode );
$fullcode = "";
while ( $line = <$CODE> ) {
 $fullcode .= $line;
}
close $CODE or die "Cannot close file '$source'\n$!";
my @tokens = tokenize($fullcode);
my $assembly = generate_assembly( \@tokens );
say "Assembly Code";
say "==================";
say $assembly;
say "==================";
open my $ASM, '>', "$output.asm"
 or die "Cannot open file to write '$output.asm'\n:$!";
print $ASM $assembly;
close $ASM or die "Cannot close file '$output.asm'\n$!";
say "Building Executable...";
system("nasm -f elf $output.asm && gcc -m32 -o $output $output.o");
exit;
# ---------------------------------
# BhathiForth compiler Documentation
# ---------------------------------
=head1 NAME
BhathiForth
=head1 SYNOPSIS
BhathiForth
a reverse polish notation based compiler
(very small subset of forth)
(proof of concept level compiler, no optimizations or over/underflow checking)
=head2 instructions:
 451 [0-9]+ push a number (integers only) to stack
 + add
 - minus
 * multiply
 / devide
 . print (adds newline automatically, this will pop the value)
 all other characters are ignored
=head2 information
uses nasm as assembler and gcc to link (depend on c library-glibc)
works only in linux (ELF binary format is used)
=head2 usage
 chmod a+x ./bhathiforth.pl
 ./bhathiforth.pl <source> <output>
=cut 
# --------------------------------

test.b4

10 push ten
10 push ten
* pop twice, multiply and push
. print top value : this prints hundred
200
56 push twenty eight
+
. prints two five six
256
2
/ 
. prints one two eight
500
50
-
. prints four five zero

Output

When executed as ./bhathiforth.pl test.b4 test:

Welcome to BhathiFoth compiler v0.1
========================================
Tokens
==================
<10>
<10>
<*>
<.>
<200>
<56>
<+>
<.>
<256>
<2>
</>
<.>
<500>
<50>
<->
<.>
==================
Assembly Code
==================
section .text
global main
extern printf
main:
push 10
push 10
mov edx,0
pop ecx
pop eax
mul ecx
push eax
push message
call printf
add esp, 8
push 200
push 56
pop ebx
pop eax
add eax,ebx
push eax
push message
call printf
add esp, 8
push 256
push 2
mov edx,0
pop ecx
pop eax
div ecx
push eax
push message
call printf
add esp, 8
push 500
push 50
pop ebx
pop eax
sub eax,ebx
push eax
push message
call printf
add esp, 8
ret
message db "%d", 10, 0;
==================
Building Executable...

Output ELF

Executed as ./test:

Reference

Question 2

`tokenize`

The tokenize subroutine could be simplified:

sub tokenize {
 my ($code) = @_;
 die "Invalid Arguments" unless defined $code;
 return $code =~ m!\d+|[-+*/.]!g;
}

Changes include:

Shorter parameter name
One-line validation
Use global match in list context to produce a list of all matches
Simpler regex that avoids leaning toothpick syndrome

Note that any unrecognized token is treated as a comment, which is quite lenient.

`generate_assembly`

For readability, I would just pass the tokens as a list rather than as a reference to a list.

I don't recommend printing output as a side effect: it hinders code reuse.

The assembly code for the operators could be produced by a hash lookup.

main

A convention for declaring version numbers is

our $VERSION = 0.1;

A double_underline() subroutine could be useful.

sub double_underline {
 my ($text) = @_;
 return $text . "\n" . ('=' x length($text));
}
say double_underline("Welcome to BhathiForth compiler v$VERSION"); # Fixed typo "Foth"

To read a file fully, you don't need a loop. Use "slurp mode":

local $/ = undef;
my $code = <$CODE>;
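
Because `local $/` stays in effect until the enclosing scope ends, it is usually wrapped in a small block so that later reads are unaffected. A minimal sketch of that idiom; the `slurp_file` helper name is hypothetical and not part of the original script:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the whole file as one string.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open file '$path': $!";
    # 'local $/' undefines the input record separator only inside this
    # do-block, so <$fh> returns the entire file in a single read.
    my $content = do { local $/; <$fh> };
    close $fh or die "Cannot close file '$path': $!";
    return $content;
}
```

In the compiler this could replace the read loop with something like `my $code = slurp_file($source);`.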

Question 3

Just one small point that @200_success probably left out so I can say something too: when matching something against $_, for example in if ($_ =~ /[0-9]+/) { ... }, you can simply omit the $_:

if (/[0-9]+/) {
 # ...
}

It's "the default input and pattern-searching space". Read more about the $_ variable in man perlvar or on perldoc.perl.org.
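
A small illustration of that default, not taken from the original script: in a `for` loop without a named loop variable, each element is aliased to `$_`, and a bare match tests it. The `classify` helper is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper that labels each token. Both the loop and the bare
# regex match operate on $_ implicitly.
sub classify {
    my @labels;
    for (@_) {
        push @labels, /^[0-9]+$/ ? "number" : "operator";
    }
    return @labels;
}

print join( ", ", classify(qw(10 + 256)) ), "\n";    # prints "number, operator, number"
```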

The main loop in generate_assembly rewritten to use this (+ a hashmap for the operators):

my %ops = (
 '+' => "pop ebx\npop eax\nadd eax,ebx\npush eax\n",
 '-' => "pop ebx\npop eax\nsub eax,ebx\npush eax\n",
 '/' => "mov edx,0\npop ecx\npop eax\ndiv ecx\npush eax\n",
 '*' => "mov edx,0\npop ecx\npop eax\nmul ecx\npush eax\n",
 '.' => "push message\ncall printf\nadd esp, 8\n",
);
foreach (@tokens) {
 say "<$_>";
 if (/[0-9]+/) {
 $assembly .= "push $_\n";
 } else {
 $assembly .= $ops{$_};
 }
}

This made me notice one more point: if I add garbage in the input file like this:

256
2
garbage
/

The tokenizer will wipe out "garbage" without a warning. This could hide some bugs. I think it would be better to use a more sophisticated parser that can detect syntax errors.

In other words: the language is underspecified. (Thanks @200_success!)
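
One way to tighten the specification, sketched under an assumption the original language does not make: comments must start with a backslash and run to the end of the line (borrowed from standard Forth), tokens must be separated by whitespace, and any other unrecognized word is a fatal error. The `tokenize_strict` name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical strict tokenizer: dies on anything that is neither a
# number, an operator, nor part of a backslash comment.
sub tokenize_strict {
    my ($code) = @_;
    die "Invalid Arguments" unless defined $code;

    my @tokens;
    my $line_no = 0;
    for my $line ( split /\n/, $code ) {
        $line_no++;
        $line =~ s/\\.*//;    # assumed comment syntax: backslash to end of line
        for my $word ( split ' ', $line ) {
            die "Syntax error on line $line_no: unexpected '$word'\n"
                unless $word =~ m{^(?:\d+|[-+*/.])$};
            push @tokens, $word;
        }
    }
    return @tokens;
}
```

With this version the "garbage" input above stops compilation with a line number instead of silently producing a different program. The cost is that the free-form comments in test.b4 would need a leading backslash.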

Question 4

One programmer's comment is another programmer's garbage.

Question 5

@200_success I updated a bit to clarify the real garbage

Question 6

Can you imagine how much code golfers would like this? For example, the non-tokens could be code in another language, so you could write a program that is valid for two compilers.

Question 7

I'm not a huge fan of reading all the input in one go, transforming it in one go, and then writing it in one go. It seems clumsy and will not scale well (although in the given case it's probably not going to be a problem).

The basic structure I'd go for is this:

open my $ASM, '>', "$output.asm"
 or die "Cannot open file to write '$output.asm'\n:$!";
while ( my $line = <$CODE> ) {
 my @tokens = tokenize($line);
 foreach my $token (@tokens) {
 print $ASM generate_assembly($token);
 }
}
close $ASM or die "Cannot close file '$output.asm'\n$!";

generate_assembly should just return the assembler output for the given token.
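
A sketch of what the per-token version might look like. The `section .text` preamble and the trailing `ret` and `message` data, which the original sub wrapped around its token loop, then have to be written by the caller. The names `assembly_prologue` and `assembly_epilogue` are hypothetical, and in-memory filehandles stand in for the real source and `.asm` files.

```perl
use strict;
use warnings;

# Operator => assembly snippet, copied from the original script.
my %ops = (
    '+' => "pop ebx\npop eax\nadd eax,ebx\npush eax\n",
    '-' => "pop ebx\npop eax\nsub eax,ebx\npush eax\n",
    '/' => "mov edx,0\npop ecx\npop eax\ndiv ecx\npush eax\n",
    '*' => "mov edx,0\npop ecx\npop eax\nmul ecx\npush eax\n",
    '.' => "push message\ncall printf\nadd esp, 8\n",
);

# Returns the assembly for exactly one token and prints nothing.
sub generate_assembly {
    my ($token) = @_;
    return "push $token\n" if $token =~ /^[0-9]+$/;
    return $ops{$token} // die "Unknown token '$token'\n";
}

# Hypothetical helpers for the fixed text around the generated body.
sub assembly_prologue { return "section .text\nglobal main\nextern printf\nmain:\n" }
sub assembly_epilogue { return "ret\nmessage db \"%d\", 10, 0;" }

# Stream line by line; in-memory handles stand in for the real files.
my $source = "10 10\n* .\n";
open my $CODE, '<', \$source      or die "Cannot open source: $!";
open my $ASM,  '>', \my $asm_text or die "Cannot open output: $!";

print {$ASM} assembly_prologue();
while ( my $line = <$CODE> ) {
    print {$ASM} generate_assembly($_) for $line =~ /[0-9]+|[-+*\/.]/g;
}
print {$ASM} assembly_epilogue();
close $ASM or die "Cannot close output: $!";

print $asm_text, "\n";
```

Only one line of source is held in memory at a time, at the price of splitting the fixed assembly text out of `generate_assembly`.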

Question 8

Note that generating code as you go will break once the language has to deal with forward references or jump targets, although right now that's obviously not a problem.

Accepted Answer by 200_success · 2014-10-22 03:05:43Z

