Description
- Very small subset of Forth
- This is a proof of concept level compiler, no optimizations or over/underflow checking
- See the embedded POD for more information
- NASM is used as the assembler
- gcc is used to link against glibc
- A 32-bit ELF binary is generated
bhathiforth.pl
#!/usr/bin/perl
use strict;
use warnings;
use feature qw(say);
sub tokenize {
    my $fullcode = shift;
    if ( not defined $fullcode ) {
        die "Invalid Arguments";
    }
    my @tokens;
    while ( $fullcode =~ /([0-9]+|\+|\-|\*|\/|\.)/g ) {
        push @tokens, $1;
    }
    return @tokens;
}
sub generate_assembly {
    my @tokens = @{ $_[0] };
    if ( not @tokens ) {
        die "Invalid Arguments";
    }
    my $assembly = "section .text\nglobal main\nextern printf\nmain:\n";
    say "Tokens";
    say "==================";
    foreach (@tokens) {
        say "<$_>";
        if ( $_ =~ /[0-9]+/ ) {
            $assembly .= "push $_\n";
        }
        elsif ( $_ eq "+" ) {
            $assembly .= "pop ebx\npop eax\nadd eax,ebx\npush eax\n";
        }
        elsif ( $_ eq "-" ) {
            $assembly .= "pop ebx\npop eax\nsub eax,ebx\npush eax\n";
        }
        elsif ( $_ eq "/" ) {
            $assembly .= "mov edx,0\npop ecx\npop eax\ndiv ecx\npush eax\n";
        }
        elsif ( $_ eq "*" ) {
            $assembly .= "mov edx,0\npop ecx\npop eax\nmul ecx\npush eax\n";
        }
        elsif ( $_ eq "." ) {
            $assembly .= "push message\ncall printf\nadd esp, 8\n";
        }
    }
    $assembly .= "ret\nmessage db \"%d\", 10, 0;";
    say "==================";
    return $assembly;
}
my $version = "0.1";
say "Welcome to BhathiFoth compiler v$version";
say "========================================";
my $source = shift @ARGV;
my $output = shift @ARGV;
if ( not defined $source or not defined $output ) {
    say
"Invalid Commandline arguments.\n\nUSAGE:\n% ./bhathiforth.pl <source> <output>";
    exit;
}
open my $CODE, "<", $source or die "Cannot open file '$source'\n:$!";
my ( $line, $fullcode );
$fullcode = "";
while ( $line = <$CODE> ) {
    $fullcode .= $line;
}
close $CODE or die "Cannot close file '$source'\n$!";
my @tokens = tokenize($fullcode);
my $assembly = generate_assembly( \@tokens );
say "Assembly Code";
say "==================";
say $assembly;
say "==================";
open my $ASM, '>', "$output.asm"
  or die "Cannot open file to write '$output.asm'\n:$!";
print $ASM $assembly;
close $ASM or die "Cannot close file '$output.asm'\n$!";
say "Building Executable...";
system("nasm -f elf $output.asm && gcc -m32 -o $output $output.o");
exit;
# ---------------------------------
# BhathiForth compiler Documentation
# ---------------------------------
=head1 NAME
BhathiForth
=head1 SYNOPSIS
BhathiForth
a reverse polish notation based compiler
(very small subset of forth)
(proof of concept level compiler, no optimizations or over/underflow checking)
=head2 instructions:
451 [0-9]+ push a number (integers only) to stack
+ add
- minus
* multiply
/ devide
. print (adds newline automatically, this will pop the value)
all other characters are ignored
=head2 information
uses nasm as assembler and gcc to link (depend on c library-glibc)
works only in linux (ELF binary format is used)
=head2 usage
chmod a+x ./bhathiforth.pl
./bhathiforth.pl <source> <output>
=cut
# --------------------------------
test.b4
10 push ten
10 push ten
* pop twice, multiply and push
. print top value : this prints hundred
200 56 push twenty eight
+ . prints two five six
256 2 / . prints one two eight
500 50 - . prints four five zero
Output
When executed as ./bhathiforth.pl test.b4 test:
Welcome to BhathiFoth compiler v0.1
========================================
Tokens
==================
<10>
<10>
<*>
<.>
<200>
<56>
<+>
<.>
<256>
<2>
</>
<.>
<500>
<50>
<->
<.>
==================
Assembly Code
==================
section .text
global main
extern printf
main:
push 10
push 10
mov edx,0
pop ecx
pop eax
mul ecx
push eax
push message
call printf
add esp, 8
push 200
push 56
pop ebx
pop eax
add eax,ebx
push eax
push message
call printf
add esp, 8
push 256
push 2
mov edx,0
pop ecx
pop eax
div ecx
push eax
push message
call printf
add esp, 8
push 500
push 50
pop ebx
pop eax
sub eax,ebx
push eax
push message
call printf
add esp, 8
ret
message db "%d", 10, 0;
==================
Building Executable...
Output ELF
Executed as ./test:
100
256
128
450
3 Answers
tokenize
The tokenize subroutine could be simplified:
sub tokenize {
    my ($code) = @_;
    die "Invalid Arguments" unless defined $code;
    return $code =~ m!\d+|[-+*/.]!g;
}
Changes include:
- Shorter parameter name
- One-line validation
- Use global match in list context to produce a list of all matches
- Simpler regex that avoids leaning toothpick syndrome
Note that any unrecognized token is treated as a comment, which is quite lenient.
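As a quick check of the list-context behaviour, the sketch below runs the simplified tokenize on a short input. The sample string is only an illustration and is not taken from test.b4.

```perl
use strict;
use warnings;
use feature qw(say);

sub tokenize {
    my ($code) = @_;
    die "Invalid Arguments" unless defined $code;

    # In list context, a /g match without capture groups returns every match.
    return $code =~ m!\d+|[-+*/.]!g;
}

my @tokens = tokenize("10 push ten 10 push ten * multiply . print");
say join ' ', @tokens;    # prints: 10 10 * .
```

Two caveats follow from this design. Called in scalar context, the same expression only reports whether a match succeeded, so the caller has to assign to a list. And a comment that contains digits or operator characters, such as "twenty-eight", still yields tokens (here a stray "-").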
generate_assembly
For readability, I would just pass the tokens as a list rather than as a reference to a list.
I don't recommend printing output as a side effect: it hinders code reuse.
The assembly code for the operators could be produced by a hash lookup.
main
A convention for declaring version numbers is
our $VERSION = 0.1;
A double_underline() subroutine could be useful.
sub double_underline {
    my ($text) = @_;
    return $text . "\n" . ('=' x length($text));
}
say double_underline("Welcome to BhathiForth compiler v$VERSION"); # Fixed typo "Foth"
To read a file fully, you don't need a loop. Use "slurp mode":
local $/ = undef;
my $code = <$CODE>;
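Because local $/ stays in effect until the end of the enclosing block, it is common to confine it to a small do block so that later line-by-line reads are unaffected. A minimal sketch, using an in-memory filehandle so that it runs without a source file (the sample text is made up):

```perl
use strict;
use warnings;

my $text = "10 10 * .\n200 56 + .\n";

# Opening a reference to a scalar gives an in-memory filehandle.
open my $CODE, '<', \$text or die "Cannot open in-memory file: $!";

my $code = do {
    local $/;    # undefined input record separator: read to end of file
    <$CODE>;
};
close $CODE or die "Cannot close in-memory file: $!";

print $code;
```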
Just one small point that @200_success probably left out so I can say something too:
when matching something against $_, for example in if ($_ =~ /[0-9]+/) { ... }, you can simply omit the $_:
if (/[0-9]+/) {
    # ...
}
It's "the default input and pattern-searching space". Read more about the $_ variable in man perlvar or on perldoc.perl.org.
The main loop in generate_assembly rewritten to use this (+ a hashmap for the operators):
my %ops = (
    '+' => "pop ebx\npop eax\nadd eax,ebx\npush eax\n",
    '-' => "pop ebx\npop eax\nsub eax,ebx\npush eax\n",
    '/' => "mov edx,0\npop ecx\npop eax\ndiv ecx\npush eax\n",
    '*' => "mov edx,0\npop ecx\npop eax\nmul ecx\npush eax\n",
    '.' => "push message\ncall printf\nadd esp, 8\n",
);
foreach (@tokens) {
    say "<$_>";
    if (/[0-9]+/) {
        $assembly .= "push $_\n";
    } else {
        $assembly .= $ops{$_};
    }
}
This made me notice one more point: if I add garbage in the input file like this:
256 2 garbage /
The tokenizer will wipe out "garbage" without a warning. This could hide some bugs. I think it would be better to use a more sophisticated parser that can detect syntax errors.
In other words: the language is underspecified. (Thanks @200_success!)
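One way to surface such mistakes is to validate every word. The sketch below assumes that tokens are separated by whitespace and that free-form inline comments are no longer allowed. Both are changes to the language, not something the original compiler does, and the name tokenize_strict is hypothetical.

```perl
use strict;
use warnings;
use feature qw(say);

# Hypothetical strict variant: every whitespace-separated word must be
# a non-negative integer or one of the five operators.
sub tokenize_strict {
    my ($code) = @_;
    die "Invalid Arguments" unless defined $code;

    my @tokens;
    for my $word ( split ' ', $code ) {
        die "Syntax error: unexpected token '$word'\n"
            unless $word =~ m!\A(?:\d+|[-+*/.])\z!;
        push @tokens, $word;
    }
    return @tokens;
}

say join ' ', tokenize_strict("256 2 / .");    # prints: 256 2 / .

# The stray word is now reported instead of being dropped silently.
my @tokens = eval { tokenize_strict("256 2 garbage /") };
print "Rejected: $@" if $@;
```

A stricter tokenizer like this would need an explicit comment syntax to remain usable. Standard Forth uses \ for line comments and ( ... ) for inline comments, which would be one option.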
- One programmer's comment is another programmer's garbage. (200_success, Oct 22, 2014 at 7:01)
- @200_success I updated a bit to clarify the real garbage. (janos, Oct 22, 2014 at 7:08)
- Can you imagine how much code golfers would like to use this? For example, non-tokens can be code in another language, so one could write code that is valid for two compilers. (JaDogg, Oct 22, 2014 at 8:27)
I'm not a huge fan of reading all the input in one go, transforming it in one go, and then writing it out in one go. It seems clumsy and will not scale well (although in the given case that is probably not going to be a problem).
The basic structure I'd go for is this:
open my $ASM, '>', "$output.asm"
  or die "Cannot open file to write '$output.asm'\n:$!";
while ( $line = <$CODE> ) {
    @tokens = tokenize($line);
    foreach (@tokens) {
        $assembly = generate_assembly($_);
        print $ASM $assembly;
    }
}
close $ASM or die "Cannot close file '$output.asm'\n$!";
generate_assembly should just return the assembler output for the given token.
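A minimal sketch of that structure, with in-memory filehandles standing in for the source and .asm files so that it runs on its own. The names %ASM_FOR and assembly_for_token are hypothetical, and the operator snippets are copied from the question. The fixed header and footer are written outside the loop, since they are no longer part of the per-token output.

```perl
use strict;
use warnings;

# Operator snippets as in the original generate_assembly.
my %ASM_FOR = (
    '+' => "pop ebx\npop eax\nadd eax,ebx\npush eax\n",
    '-' => "pop ebx\npop eax\nsub eax,ebx\npush eax\n",
    '/' => "mov edx,0\npop ecx\npop eax\ndiv ecx\npush eax\n",
    '*' => "mov edx,0\npop ecx\npop eax\nmul ecx\npush eax\n",
    '.' => "push message\ncall printf\nadd esp, 8\n",
);

# Hypothetical per-token generator: returns the assembly for one token.
sub assembly_for_token {
    my ($token) = @_;
    return "push $token\n" if $token =~ /\A\d+\z/;
    my $asm = $ASM_FOR{$token};
    die "Unknown token '$token'\n" unless defined $asm;
    return $asm;
}

my $source = "10 10 *\n. prints hundred\n";
my $output = '';
open my $CODE, '<', \$source or die "Cannot open source: $!";
open my $ASM,  '>', \$output or die "Cannot open output: $!";

print $ASM "section .text\nglobal main\nextern printf\nmain:\n";
while ( my $line = <$CODE> ) {
    # Same token pattern as the question; other characters are ignored.
    for my $token ( $line =~ m!\d+|[-+*/.]!g ) {
        print $ASM assembly_for_token($token);
    }
}
print $ASM "ret\nmessage db \"%d\", 10, 0;";

close $CODE or die "Cannot close source: $!";
close $ASM  or die "Cannot close output: $!";
print $output;
```

Only one line of source and the assembly for one token are held at a time, which is the scaling benefit this answer is after.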
- Note that generating code as you go will break once the language has to deal with forward references or jump targets, although right now that's obviously not a problem. (user56992, Oct 28, 2014 at 8:07)