
I've written an interpreter for a simple assembly-like language and it's performing slower than I would like.

It's split into three files: the Parser, which converts the source to a vector of ints; the VM, which actually runs the bytecode; and Tests, which contains a bubble sort written in the language.

It sorts 100 numbers in about 6 seconds in GHCi. The profiler doesn't tell me much, except that most of the time is spent inside the step function, as expected.

The Parser file isn't really relevant here, because it only runs once and so doesn't affect performance.

Another thing to note is that the sort takes around 250,000 ticks (instructions executed), so I'm pretty sure it could run much faster than 6 seconds.

Is there anything obvious that I could improve?

Parser

module Parser where

import Data.Vector (Vector, fromList)
import Data.Char (toUpper)
import Data.List (sort)

type ByteCode = [Int]

data OpCode = Push | Pop | Add | Sub | Mult | Div | Store | Load | Jmp | Cmp | Not | Br | Dup | Inc | Dec | Swp
    deriving (Enum, Read, Show, Ord, Eq)

arity :: Vector Int
arity = (fromList . map snd . sort) $ zip [Push, Store, Load] [1, 1..] ++ zip [Pop, Add, Sub, Mult, Div] [0, 0..]

charIsNumeric :: Char -> Bool
charIsNumeric c = '0' <= c && '9' >= c

stringIsNumeric :: String -> Bool
stringIsNumeric ('-' : s) = all charIsNumeric s
stringIsNumeric s = all charIsNumeric s

capitalize :: String -> String
capitalize [] = []
capitalize (x : xs) = toUpper x : xs

wordToByteCode :: String -> Int
wordToByteCode str = if stringIsNumeric str then read str else fromEnum opCodeEnum
  where
    opCodeEnum :: OpCode
    opCodeEnum = read $ capitalize str

stringToByteCode :: String -> ByteCode
stringToByteCode = map wordToByteCode . words

sourceToByteCode :: String -> ByteCode
sourceToByteCode = map wordToByteCode . concatMap words . lines
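
As a quick illustration (a sketch, not part of the original post): the derived Enum instance maps Push to 0, Pop to 1, Add to 2, and so on, so the parser flattens a program into opcodes interleaved with their operands.

import Parser (sourceToByteCode)

-- Hypothetical check: with Push = 0 and Add = 2,
-- "push 1 / push 2 / add" becomes [0, 1, 0, 2, 2].
demoParse :: Bool
demoParse = sourceToByteCode "push 1\npush 2\nadd" == [0, 1, 0, 2, 2]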

VM

module VM where

import Parser (ByteCode, OpCode(..), arity)
import qualified Data.IntMap as IM
import Data.Vector (Vector, (!))
import qualified Data.Vector as Vector
import Data.List (intercalate)
import Utility

data VM = VM {
    byteCode :: Vector Int,
    programCounter :: Int,
    stack :: [Int],
    memory :: IM.IntMap Int
    }
    deriving (Show)

fromCode :: ByteCode -> VM
fromCode code = VM { byteCode = Vector.fromList code, programCounter = 0, stack = [], memory = IM.empty }

step :: VM -> VM
step vm = next
  where
    bc = byteCode vm
    pc = programCounter vm
    st = stack vm
    mm = memory vm
    inst = toEnum $ bc ! pc
    pop1 = tail st
    pop2 = tail pop1
    top1 = head st
    top2 = head pop1
    nextPc = pc + 1
    next = case inst of
        Pop   -> vm { stack = pop1, programCounter = nextPc }
        Push  -> vm { stack = bc ! nextPc : st, programCounter = pc + 2 }
        Add   -> vm { stack = (top1 + top2) : pop2, programCounter = nextPc }
        Sub   -> vm { stack = (top2 - top1) : pop2, programCounter = nextPc }
        Mult  -> vm { stack = (top1 * top2) : pop2, programCounter = nextPc }
        Div   -> vm { stack = (top2 `div` top1) : pop2, programCounter = nextPc }
        Store -> vm { stack = pop2, programCounter = nextPc, memory = IM.insert top1 top2 mm }
        Load  -> vm { stack = mm IM.! top1 : pop1, programCounter = nextPc }
        Jmp   -> vm { stack = pop1, programCounter = top1 }
        Cmp   -> vm { stack = signum (top2 - top1) : pop2, programCounter = nextPc }
        Not   -> vm { stack = (if top1 > 0 then -1 else 1) : pop1, programCounter = nextPc }
        Br    -> vm { stack = pop2, programCounter = if top2 > 0 then top1 else nextPc }
        Dup   -> vm { stack = top1 : st, programCounter = nextPc }
        Inc   -> vm { stack = (top1 + 1) : pop1, programCounter = nextPc }
        Dec   -> vm { stack = (top1 - 1) : pop1, programCounter = nextPc }
        Swp   -> vm { stack = top2 : top1 : pop2, programCounter = nextPc }

endState :: VM -> Bool
endState vm = programCounter vm == Vector.length (byteCode vm)

run :: VM -> VM
run = until endState step

runCount :: VM -> (Int, VM)
runCount = untilCount endState step

debug :: (VM -> String) -> VM -> (VM, [String])
debug watch vm = if endState vm then (vm, []) else (nextVm, watch vm : logs)
  where
    (nextVm, logs) = debug watch (step vm)

instructionLogger :: VM -> String
instructionLogger vm = show (toEnum $ byteCode vm ! programCounter vm :: OpCode)

watch :: Int -> VM -> String
watch n vm = case IM.lookup n (memory vm) of
    Nothing -> "undefined"
    Just a  -> show a

composeLoggers :: [VM -> String] -> VM -> String
composeLoggers loggers vm = (intercalate " " . map ($ vm)) loggers

printDebug :: (VM -> String) -> VM -> IO ()
printDebug f v = putStr $ unlines $ snd $ debug f v

Tests

module Tests where

import qualified Parser as Parser
import qualified VM as VM
import qualified Data.IntMap as IM

bubble = unlines [
    "push 0", "push 1000", "store",
    "push 0", "push 1001", "store",
    "push 1000", "load", "load",
    "push 1001", "load", "load",
    "cmp",
    "push 38", "br",
    "push 1000", "load", "load",
    "push 1001", "load", "load",
    "push 1000", "load", "store",
    "push 1001", "load", "store",
    "push 1001", "load", "inc", "dup", "push 1001", "store",
    "push 100",
    "cmp", "not",
    "push 10", "br",
    "push 0", "push 1001", "store",
    "push 1000", "load", "inc", "dup", "push 1000", "store",
    "push 100",
    "cmp", "not",
    "push 10", "br"
    ]

vm = VM.fromCode $ Parser.sourceToByteCode bubble

vmWithData = vm { VM.memory = IM.fromList $ zip [0..100] [100, 99..0] }

main = print $ VM.run $ vmWithData

dbg = VM.printDebug (VM.composeLoggers [VM.instructionLogger, VM.watch 101, show . VM.programCounter]) vmWithData
asked Jul 21, 2014 at 19:02
  • I don't see anything obviously bad, except that you're benchmarking ghci's interpreter. It's not designed to be efficient. Try compiling with optimizations. Commented Jul 22, 2014 at 5:25
  • I did, to profile it. It's obviously much faster, but it still hangs at around 500, taking around 4 seconds. Commented Jul 22, 2014 at 8:13
  • I'll take a closer look, then. Commented Jul 22, 2014 at 15:05
  • I can't actually compile and test because the Utility module is absent. Commented Jul 22, 2014 at 15:12
  • Ok, Utility was only used for untilCount, which I reimplemented. Compiled with -O2, this runs basically instantly. Commented Jul 22, 2014 at 15:58

1 Answer


Based on my investigations, I'm going to say that your problem is exactly what I said in the comments: performance testing with ghci.

I modified VM.hs a bit, to get it to build:

{-# LANGUAGE BangPatterns #-}
module VM where

import Parser (ByteCode, OpCode(..), arity)
import qualified Data.IntMap as IM
import Data.Vector (Vector, (!))
import qualified Data.Vector as Vector
import Data.List (intercalate)

data VM = VM {
    byteCode :: Vector Int,
    programCounter :: Int,
    stack :: [Int],
    memory :: IM.IntMap Int
    }
    deriving (Show)

fromCode :: ByteCode -> VM
fromCode code = VM { byteCode = Vector.fromList code, programCounter = 0, stack = [], memory = IM.empty }

step :: VM -> VM
step vm = next
  where
    bc = byteCode vm
    pc = programCounter vm
    st = stack vm
    mm = memory vm
    inst = toEnum $ bc ! pc
    pop1 = tail st
    pop2 = tail pop1
    top1 = head st
    top2 = head pop1
    nextPc = pc + 1
    next = case inst of
        Pop   -> vm { stack = pop1, programCounter = nextPc }
        Push  -> vm { stack = bc ! nextPc : st, programCounter = pc + 2 }
        Add   -> vm { stack = (top1 + top2) : pop2, programCounter = nextPc }
        Sub   -> vm { stack = (top2 - top1) : pop2, programCounter = nextPc }
        Mult  -> vm { stack = (top1 * top2) : pop2, programCounter = nextPc }
        Div   -> vm { stack = (top2 `div` top1) : pop2, programCounter = nextPc }
        Store -> vm { stack = pop2, programCounter = nextPc, memory = IM.insert top1 top2 mm }
        Load  -> vm { stack = mm IM.! top1 : pop1, programCounter = nextPc }
        Jmp   -> vm { stack = pop1, programCounter = top1 }
        Cmp   -> vm { stack = signum (top2 - top1) : pop2, programCounter = nextPc }
        Not   -> vm { stack = (if top1 > 0 then -1 else 1) : pop1, programCounter = nextPc }
        Br    -> vm { stack = pop2, programCounter = if top2 > 0 then top1 else nextPc }
        Dup   -> vm { stack = top1 : st, programCounter = nextPc }
        Inc   -> vm { stack = (top1 + 1) : pop1, programCounter = nextPc }
        Dec   -> vm { stack = (top1 - 1) : pop1, programCounter = nextPc }
        Swp   -> vm { stack = top2 : top1 : pop2, programCounter = nextPc }

endState :: VM -> Bool
endState vm = programCounter vm == Vector.length (byteCode vm)

run :: VM -> VM
run = until endState step

runCount :: VM -> (Int, VM)
runCount = untilCount endState step
  where
    untilCount f g = go 0
      where
        go !n x | f x = (n, x)
                | otherwise = go (n + 1) (g x)

debug :: (VM -> String) -> VM -> (VM, [String])
debug watch vm = if endState vm then (vm, []) else (nextVm, watch vm : logs)
  where
    (nextVm, logs) = debug watch (step vm)

instructionLogger :: VM -> String
instructionLogger vm = show (toEnum $ byteCode vm ! programCounter vm :: OpCode)

watch :: Int -> VM -> String
watch n vm = case IM.lookup n (memory vm) of
    Nothing -> "undefined"
    Just a  -> show a

composeLoggers :: [VM -> String] -> VM -> String
composeLoggers loggers vm = (intercalate " " . map ($ vm)) loggers

printDebug :: (VM -> String) -> VM -> IO ()
printDebug f v = putStr $ unlines $ snd $ debug f v

My changes were:

  1. Enable the BangPatterns extension to make it easy to write untilCount efficiently.
  2. Remove the import of Utility.
  3. Add a local definition of untilCount inside runCount.

I also changed Tests to use runCount just to be sure I was getting the same operation count as you.
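
That change isn't shown in the answer; a minimal sketch of what it might look like in Tests:

main = print $ VM.runCount vmWithData
-- prints a (stepCount, finalVM) pair, matching the "(267252,VM {...})" line in the session below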

After those changes, this is a sample session:

carl@debian:~/hask/codereview/stackint$ ghc -O2 -main-is Tests Tests.hs 
[1 of 3] Compiling Parser ( Parser.hs, Parser.o )
[2 of 3] Compiling VM ( VM.hs, VM.o )
[3 of 3] Compiling Tests ( Tests.hs, Tests.o )
Linking Tests ...
carl@debian:~/hask/codereview/stackint$ time ./Tests 
(267252,VM {byteCode = fromList [0,0,0,1000,6,0,0,0,1001,6,0,1000,7,7,0,1001,7,7,9,0,38,11,0,1000,7,7,0,1001,7,7,0,1000,7,6,0,1001,7,6,0,1001,7,13,12,0,1001,6,0,100,9,10,0,10,11,0,0,0,1001,6,0,1000,7,13,12,0,1000,6,0,100,9,10,0,10,11], programCounter = 73, stack = [], memory = fromList [(0,0),(1,1),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,8),(9,9),(10,10),(11,11),(12,12),(13,13),(14,14),(15,15),(16,16),(17,17),(18,18),(19,19),(20,20),(21,21),(22,22),(23,23),(24,24),(25,25),(26,26),(27,27),(28,28),(29,29),(30,30),(31,31),(32,32),(33,33),(34,34),(35,35),(36,36),(37,37),(38,38),(39,39),(40,40),(41,41),(42,42),(43,43),(44,44),(45,45),(46,46),(47,47),(48,48),(49,49),(50,50),(51,51),(52,52),(53,53),(54,54),(55,55),(56,56),(57,57),(58,58),(59,59),(60,60),(61,61),(62,62),(63,63),(64,64),(65,65),(66,66),(67,67),(68,68),(69,69),(70,70),(71,71),(72,72),(73,73),(74,74),(75,75),(76,76),(77,77),(78,78),(79,79),(80,80),(81,81),(82,82),(83,83),(84,84),(85,85),(86,86),(87,87),(88,88),(89,89),(90,90),(91,91),(92,92),(93,93),(94,94),(95,95),(96,96),(97,97),(98,98),(99,99),(100,100),(1000,101),(1001,0)]})
real 0m0.155s
user 0m0.012s
sys 0m0.116s

Nearly all of that time is sys time, which usually means doing IO. Let me do some proper benchmarking. Proper benchmarking in Haskell involves using the criterion package. I added a new file, Main.hs, to contain the criterion code:

import Criterion.Main
import qualified VM
import qualified Tests
main :: IO ()
main = defaultMain [bench "sort" $ whnf (fst . VM.runCount) Tests.vmWithData]

As a quick explanation of Criterion in general: benchmarking in a lazy language can be tricky, and Criterion provides tools to make sure you're doing it right. I used the whnf function to benchmark counting the number of steps the program runs. Since it's impossible to determine how many steps it runs without actually running it, that ensures the benchmarking isn't fooled by laziness. And here's another terminal log:

carl@debian:~/hask/codereview/stackint$ ghc -O2 Main.hs 
[3 of 4] Compiling Tests ( Tests.hs, Tests.o ) [flags changed]
[4 of 4] Compiling Main ( Main.hs, Main.o )
Linking Main ...
carl@debian:~/hask/codereview/stackint$ ./Main 
warming up
estimating clock resolution...
mean is 21.01945 us (40001 iterations)
found 2284 outliers among 39999 samples (5.7%)
 678 (1.7%) low severe
 1345 (3.4%) high severe
estimating cost of a clock call...
mean is 16.59960 us (6 iterations)
benchmarking sort
collecting 100 samples, 1 iterations each, in estimated 6.015491 s
mean: 52.26690 ms, lb 50.39930 ms, ub 57.08317 ms, ci 0.950
std dev: 14.30133 ms, lb 6.851806 ms, ub 29.74848 ms, ci 0.950
found 7 outliers among 100 samples (7.0%)
 3 (3.0%) high mild
 4 (4.0%) high severe
variance introduced by outliers: 96.804%
variance is severely inflated by outliers

Criterion gives you a bunch of statistical analysis of its results. It tells me, among other things, that benchmarking in a VirtualBox VM introduces a lot of jitter. That's what all the stuff about variance and outliers is about. However, if you look at the absolute timings, that doesn't matter too much. Even with the inflated variance, the timing ranges from about 50ms to 57ms. In other words, your code is pretty darn fast already.

But if you're going to benchmark, do it properly.
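
To make the laziness pitfall concrete, here is a hypothetical benchmark (my sketch, not from the answer) that would be fooled, because nothing it forces depends on actually running the VM:

import Criterion.Main
import qualified Tests

-- const () is already in weak head normal form without stepping the VM at all,
-- so criterion would time little more than call overhead here.
badMain :: IO ()
badMain = defaultMain [bench "sort/lazy" $ whnf (const ()) Tests.vmWithData]

Forcing fst . VM.runCount instead, as in Main.hs above, makes the timed expression depend on the whole run.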


Now, it is possible to improve upon this code a bit. It suffers from some minor excessive laziness.

  1. Change the import of IntMap to Data.IntMap.Strict (the one-line change is sketched just after this list). This will keep unevaluated expressions from building up in the values in the IntMap.
  2. Add strictness annotations to the fields that would benefit from them in the VM record.
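
The first change is just the import line (a sketch; Data.IntMap.Strict exports the same operations used here, but its insert evaluates the stored value):

import qualified Data.IntMap.Strict as IM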

With those two changes, I cut the time spent in the criterion benchmark in half. Here's what I settled on for the definition of VM:

data VM = VM {
    byteCode :: Vector Int,
    programCounter :: !Int,
    stack :: [Int],
    memory :: !(IM.IntMap Int)
    }
    deriving (Show)

Note that this is a valid data definition without any extensions. It's basic Haskell that putting a ! on a field in a data declaration marks that field as strict. More precisely, it means "when the constructor of this type is evaluated, also evaluate this field to whnf".
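
A tiny illustration of that rule (my example, not from the answer):

data Lazy   = Lazy   Int
data Strict = Strict !Int

lazyOk :: String
lazyOk = Lazy undefined `seq` "ok"        -- "ok": the unmarked field's thunk is never forced

strictBoom :: String
strictBoom = Strict undefined `seq` "ok"  -- throws when evaluated: the bang forces the field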

The byteCode field never changes during execution, so it's not necessary to mark it as strict. Once it's evaluated, it stays evaluated. The stack field is a recursive data type that is always built directly from constructors. Making it strict doesn't help anything; it's always already in whnf.
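
For instance (again my sketch), pattern matching only needs the outermost (:) constructor, so a cons cell with an unevaluated tail is already in whnf:

stackIsWhnf :: Bool
stackIsWhnf = case 1 : undefined of   -- the tail is an error thunk
    (_ : _) -> True                   -- matches without ever touching the thunk
    []      -> False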

Of the two fields that strictness annotations do help on, I was very surprised that it's actually the programCounter field that gets a huge benefit from becoming strict. In retrospect, that's probably because I'm working with GHC 7.8, which automatically unboxes strict "small" fields in data types, and Int is small. The auto unboxing significantly reduces the amount of pointer chasing during the loop, so it does make sense that it would improve things.

To get that same improvement on older versions of GHC, you would have to define the data type as:

data VM = VM {
    byteCode :: Vector Int,
    programCounter :: {-# UNPACK #-} !Int,
    stack :: [Int],
    memory :: !(IM.IntMap Int)
    }
    deriving (Show)

The UNPACK pragma indicates to ghc that it should unbox the next field in a data declaration, assuming it is strict. (If it isn't strict, unboxing it would change the semantics, so the pragma is ignored.)

answered Jul 22, 2014 at 16:49
  • Thank you so much for taking the time to review my code. Commented Jul 22, 2014 at 17:00
