I was/am working on an interpreter for a scheme-like language. Just some time back I shifted my implementation from C++ to Rust, which I just started learning. I know there are parser libraries like LALRPROP and nom in Rust that could parse easily for me, but I wanted to try and write it in Rust from scratch. Any comments and help would be much appreciated so that I can write more idiomatic Rust.
object.rs
: has the definition of the AST and functions for display and conversion of the AST to strings.
use crate::util;
use std::fmt;
#[derive(Debug)]
pub enum Object {
Integer(i64),
Boolean(bool),
Character(char),
String(String),
Symbol(String),
Cons { car: Box<Object>, cdr: Box<Object> },
Nil,
}
pub type ObjectBox = Box<Object>;
pub mod constants {
use super::Object;
pub const TRUE: Object = Object::Boolean(true);
pub const FALSE: Object = Object::Boolean(false);
pub const NIL: Object = Object::Nil;
}
impl Object {
fn pair_to_string(&self) -> String {
let mut result = String::from("");
if let Object::Cons { car, cdr } = &*self {
result.push_str(&car.to_string());
match **cdr {
Object::Cons { car: _, cdr: _ } => {
result.push(' ');
result.push_str(&cdr.pair_to_string());
}
Object::Nil => result.push_str(""),
_ => {
result.push_str(" . ");
result.push_str(&cdr.to_string());
}
};
return result;
} else {
panic!("I shouldnt be here!");
}
}
pub fn to_string(&self) -> String {
match &*self {
Object::Integer(i) => i.to_string(),
Object::Boolean(i) => {
if *i {
"#t".to_string()
} else {
"#f".to_string()
}
}
Object::Character(i) => match i {
'\n' => "$\\n".to_string(),
'\t' => "$\\t".to_string(),
'\r' => "$\\r".to_string(),
'0円' => "$\0円".to_string(),
'\\' => "$\\\\".to_string(),
' ' => "$space".to_string(),
_ => format!("${}", i),
},
Object::String(i) => format!("\"{}\"", util::raw_string(i)),
Object::Symbol(i) => i.to_string(),
cons @ Object::Cons { car: _, cdr: _ } => format!("({})", cons.pair_to_string()),
Object::Nil => "()".to_string(),
}
}
}
impl fmt::Display for Object {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(f, "{}", self.to_string())
}
}
parser.rs
: The parser, it tokenizes and parses at the same time. Meaningfull error messages still need to be added, I am planning on making a custom error type, so i can also pass the error position.
use super::object::*;
pub struct Parser<'a> {
source: &'a str,
start: usize,
current: usize,
}
impl<'a> Parser<'a> {
pub fn new(source: &'a str) -> Self {
Parser {
source: source,
start: 0,
current: 0,
}
}
pub fn set_source(&mut self, new_source: &'a str) {
self.source = new_source;
}
fn is_at_end(&self) -> bool {
self.current - 1 == self.source.chars().count()
}
fn advance(&mut self) -> Option<char> {
self.current += 1;
self.source.chars().nth(self.current - 1)
}
fn peek(&self) -> Option<char> {
self.source.chars().nth(self.current)
}
fn is_delimiter(&self, ch: char) -> bool {
ch.is_ascii_whitespace() || matches!(ch, '(' | ')' | ';' | '"' | '\'') || self.is_at_end()
}
fn peek_delimiter(&self) -> bool {
if let Some(ch) = self.peek() {
self.is_delimiter(ch)
} else {
true
}
}
fn is_symbol_initial(&self, ch: char) -> bool {
!matches!(ch, '(' | ')' | ';' | '$' | '#' | '"' | '-' | '\'') && !ch.is_ascii_digit()
}
fn is_symbol_character(&self, ch: char) -> bool {
!matches!(ch, '(' | ')' | ';' | '\'') || ch.is_ascii_whitespace()
}
fn skip_whitespace(&mut self) {
while let Some(ch) = self.peek() {
if ch.is_whitespace() {
self.advance();
continue;
} else if ch == ';' {
while let Some(ch) = self.advance() {
if ch == '\n' {
break;
}
}
continue;
}
break;
}
}
fn peek_number(&self, ch: char) -> bool {
ch.is_ascii_digit() || (ch == '-' && self.peek().unwrap().is_ascii_digit())
}
fn parse_number(&mut self) -> Result<ObjectBox, String> {
while let Some(ch) = self.peek() {
if ch.is_ascii_digit() {
self.advance();
} else {
break;
}
}
Ok(ObjectBox::new(Object::Integer(
self.source
.chars()
.skip(self.start)
.take(self.current - self.start)
.collect::<String>()
.parse::<i64>()
.unwrap(),
)))
}
fn parse_boolean(&mut self) -> Result<ObjectBox, String> {
if let Some(ch) = self.advance() {
match ch {
't' => Ok(ObjectBox::new(constants::TRUE)),
'f' => Ok(ObjectBox::new(constants::FALSE)),
_ => Err(String::from("")),
}
} else {
Err(String::from(""))
}
}
fn parse_character(&mut self) -> Result<ObjectBox, String> {
// TODO: Add $\x80 and $\d97 style characters
if let Some(ch) = self.advance() {
if ch == '\\' {
if let Some(ch) = self.advance() {
if !self.peek_delimiter() {
return Err(String::from(""));
}
match ch {
'n' => return Ok(ObjectBox::new(Object::Character('\n'))),
'r' => return Ok(ObjectBox::new(Object::Character('\r'))),
't' => return Ok(ObjectBox::new(Object::Character('\t'))),
'\\' => return Ok(ObjectBox::new(Object::Character('\\'))),
'0' => return Ok(ObjectBox::new(Object::Character('0円'))),
' ' => return Ok(ObjectBox::new(Object::Character(' '))),
_ => return Err(String::from("")),
}
} else {
return Err(String::from(""));
}
} else {
if !self.peek_delimiter() {
return Err(String::from(""));
}
return Ok(ObjectBox::new(Object::Character(ch)));
}
} else {
return Err(String::from(""));
}
}
fn parse_string(&mut self) -> Result<ObjectBox, String> {
let mut result = String::new();
while let Some(ch) = self.advance() {
if self.is_at_end() {
return Err(String::from("at end"));
}
if ch == '"' {
return Ok(ObjectBox::new(Object::String(result)));
}
if ch == '\\' {
if let Some(ch) = self.advance() {
match ch {
'n' => result.push('\n'),
'r' => result.push('\r'),
't' => result.push('\t'),
'\\' => result.push('\\'),
'0' => result.push('0円'),
_ => result.push(ch),
}
continue;
} else {
return Err(String::from("empty escape sequence"));
}
}
result.push(ch);
}
Err(String::from("no ending quote"))
}
fn parse_pair(&mut self) -> Result<ObjectBox, String> {
let car: ObjectBox;
let cdr: ObjectBox;
self.skip_whitespace();
if let Some(ch) = self.peek() {
if ch == ')' {
self.advance();
return Ok(ObjectBox::new(constants::NIL));
}
match self.parse() {
Ok(object_box) => car = object_box,
a @ Err(_) => return a,
}
self.skip_whitespace();
if let Some(ch) = self.peek() {
if ch == '.' {
self.advance();
if !self.peek_delimiter() {
return Err(String::from(""));
}
match self.parse() {
Ok(object_box) => cdr = object_box,
a @ Err(_) => return a,
}
self.skip_whitespace();
if let Some(ch) = self.advance() {
if ch == ')' {
return Ok(ObjectBox::new(Object::Cons { car, cdr }));
} else {
return Err(String::from(""));
}
} else {
return Err(String::from(""));
}
} else {
match self.parse_pair() {
Ok(object_box) => cdr = object_box,
a @ Err(_) => return a,
}
return Ok(ObjectBox::new(Object::Cons { car, cdr }));
}
}
}
Err(String::from("here"))
}
fn parse_symbol(&mut self) -> Result<ObjectBox, String> {
while let Some(ch) = self.peek() {
if self.is_symbol_character(ch) {
self.advance();
} else {
break;
}
}
return Ok(ObjectBox::new(Object::Symbol(
self.source
.chars()
.skip(self.start)
.take(self.current - self.start)
.collect::<String>(),
)));
}
pub fn parse(&mut self) -> Result<ObjectBox, String> {
self.skip_whitespace();
self.start = self.current;
let ch = self.advance().unwrap();
match ch {
ch if self.peek_number(ch) => self.parse_number(),
'#' => self.parse_boolean(),
'$' => self.parse_character(),
'"' => self.parse_string(),
'(' => self.parse_pair(),
ch if self.is_symbol_initial(ch) => self.parse_symbol(),
_ => Err(String::from("Unknown Character")),
}
}
}
util.rs
: a utility module, for now just has a function that converts strings like "Hello \n World!"
to "Hello \\n World"
pub fn raw_string(string: &str) -> String {
let mut result = String::from("");
for ch in string.chars() {
match ch {
'\n' => result.push_str("\\n"),
'\t' => result.push_str("\\t"),
'\r' => result.push_str("\\r"),
'\\' => result.push_str("\\\\"),
'0円' => result.push_str("\0円"),
_ => result.push(ch),
}
}
result
}
main.rs
: The driver code, for now I am hard coding the input, the plan is to make a interactive REPL.
#![allow(dead_code)]
mod core;
mod util;
use self::core::parser::*;
fn main() {
let mut scanner = Parser::new("\"Hello \\nWorld\"");
match scanner.parse() {
Ok(object) => println!("{}", object),
Err(message) => println!("{}", message),
}
}
Any tips to make the code more Rust-like will be very helpful!
1 Answer 1
Clippy
First, always, always run cargo clippy
. Clippy gives you lots of helpful tips to make your code more idiomatic and simpler, and sometimes catches more serious missteps. From Clippy:
Lots of unnecessary
return
statements. To return a value on the last line of a function (or the last line of each if branch), in Rust it is preferred to omit the return statement. In your case, the biggest offender isfn parse_character
where Clippy finds that most of yourreturn
s can be omitted (however, I would note that this function honestly looks pretty gnarly regardless and should be broken up into simpler more understandable functions).In constructing a
struct
, you can use the field name directly if it matches the local variable name:Parser { source, start: 0, current: 0 }
Finally, Clippy finds a potentially more serious problem, and links you to this page: the problem is that
.to_string()
is automatically implemented for any type that implements Display (via the ToString trait), so there are actually two definitions of.to_string()
for your struct, one of them being overwritten. Easiest fix is to rename your publicpub fn to_string(&self) -> String
to a private function likefn to_string_core(&self) -> String
and use it only internally. More idiomatic fix would be to put this logic directly in theDisplay
implementation and omitto_string_core
.
That's all for Clippy fixes. Other than what Clippy finds, your code is generally pretty good, though a bit complicated in places (but this is maybe to be expected with parsing code).
Additional comments
In object.rs
:
The function fn pair_to_string
is undesirable: it has an unstated precondition, namely that the object is a Cons
case, and it panics if this is not satisfied. Additionally, everywhere this is used, you actually already know that the input is a cons, so the correct design is to take as input the branches of the cons directly. Here is the improved function:
fn pair_to_string(&self, other: &Self) -> String {
let mut result = self.to_string();
match other {
Object::Cons { car, cdr } => {
result.push(' ');
result.push_str(&car.pair_to_string(&cdr));
}
Object::Nil => result.push_str(""),
_ => {
result.push_str(" . ");
result.push_str(&other.to_string());
}
};
result
}
now the code is more self-explanatory (it's clear that it takes a pair of Object
s and prints them), simpler, and doesn't have a hidden assumption.
In general, the Cons
object constructor and its usage should be documented and explained: a reader unfamiliar with Scheme is highly unlikely to know what this is used for (a list vs a tree, etc.)
In main.rs
:
Blanket #![allow(dead_code)]
statements are counterproductive. If you intend core
as a public API, the easiest fix is to do pub mod core
. If you just temporarily want to ignore the dead code warning on set_source
, use a method-level annotation:
#[allow(dead_code)]
pub fn set_source(&mut self, new_source: &'a str) {
In parser.rs
:
The use of self.source.chars().nth
to keep getting the nth character is problematic. Note that .chars()
will be a new iterator over the source
string each time and nth
will get the nth character; so you probably have a very inefficient algorithm (at least n^2) for parsing large strings. There are two possible fixes here.
First, the easy way: Strings are a pain because character boundaries are of different lengths, making indexing into them difficult. So to completely avoid string issues, get rid of &str
in the beginning and instead convert it to an array of characters: &[char]
. Converting a string s
to a vector of characters is easy: let char_vec: Vec<char> = s.chars().collect()
. Then use &char_vec
to get a &[char]
.
The harder way: to work with the string directly, what you "really" want is a pointer into the string, not an index. So start
and current
should really be iterators with lifetime &'a
which are pointing into the string source
. Like this:
use std::str::Chars;
pub struct ParserBetter<'a> {
source: &'a str,
start: Chars<'a>,
current: Chars<'a>,
}
This will involve a more substantial code rewrite, but it might be a good learning exercise for how to properly work with strings.
The takeaway message is don't index into strings with an integer. If you need to index, use a vector of characters. Indexing into strings is either highly inefficient (if done with .chars().nth(i)
) or can fall in the middle of character boundaries (if done with [i]
).
-
\$\begingroup\$ Thanks for the comments, I didn't know
clippy
existed! Just a few more questions, Is using an iterator over strings the preferred way to index into strings, or can I get away with a vector ofchars
? How would you break up theparse_characters
function? The reason I didn't move the functionality ofto_string
directly into theDisplay
trait is becauseto_string
is mutually recursive withpair_to_string
, how do I move it to theDisplay
trait? \$\endgroup\$Bhargav Kulkarni– Bhargav Kulkarni2022年01月02日 04:07:35 +00:00Commented Jan 2, 2022 at 4:07 -
\$\begingroup\$ @BhargavKulkarni I honestly recommend a vector of chars as it is easier to work with. I'm not completely sure about
parse_characters
. For the mutually recursive issue, that's fine -- one can be part of Display and one outside Display (still mutually recursive), something likefmt_pair
for the format version ofpair_to_string
\$\endgroup\$Caleb Stanford– Caleb Stanford2022年01月02日 05:10:27 +00:00Commented Jan 2, 2022 at 5:10 -
\$\begingroup\$ Multiple design choices will work. Your way works as long as you rename
to_string
toto_string_core
\$\endgroup\$Caleb Stanford– Caleb Stanford2022年01月02日 05:11:06 +00:00Commented Jan 2, 2022 at 5:11
util.rs
? \$\endgroup\$core
directory defining a module (mod.rs
) and autil.rs
at the top level. \$\endgroup\$pub fn raw_string(i: &str) -> String
and it compiles now. :) Reviewing your code now! \$\endgroup\$