= Outcoding UNIX geniuses =

Static types prevent disasters by catching bugs at compile time. Yet many languages have self-defeating types. Type annotation can be so laborious that weary programmers give up and switch to unsafe languages. Adding insult to injury, some of these prolix type systems are simultaneously inexpressive. Some lack https://en.wikipedia.org/wiki/Parametric_polymorphism[parametric polymorphism], forcing programmers to duplicate code or subvert static typing with casts. A purist might escape this dilemma by writing code to generate code, but this leads to another set of problems.

Type theory shows how to avoid these pitfalls, but many programmers seem unaware:

 * Popular authors Bruce Eckel and Robert C. Martin mistakenly believe https://docs.google.com/document/d/1aXs1tpwzPjW9MdsG5dI7clNFyYayFBkcXwRDo-qvbIk/preview[strong typing implies verbosity], and worse still, https://blog.cleancoder.com/uncle-bob/2017/01/11/TheDarkPath.html[testing conquers all]. Tests are undoubtedly invaluable, but at best they https://en.wikipedia.org/wiki/Proof_by_example["prove" by example]. Testing can be used https://www.cs.utexas.edu/~EWD/transcriptions/EWD03xx/EWD303.html["...to show the presence of bugs, but never to show their absence"]. One could even argue a test-heavy approach helps attackers find exploits, as untested cases may hint at overlooked bugs. Instead of testing, we need strong static types so we can prove correctness of programs.

 * The designers of the Go language, including famed former Bell Labs researchers, have been stumped by polymorphism for years. To be fair, parametric polymorphism is only half the story. The other half is ad hoc polymorphism, which Haskell researchers only neatly solved in the late 1980s with type classes. Practical Haskell compilers also need more type trickery for unboxing.

Why is this so? Perhaps people think the theory is arcane, dry, and impractical?
By working through some programming interview questions, we'll find the relevant theory is surprisingly accessible. We quickly arrive at a simple _type inference_ or _type reconstruction_ algorithm that seems too good to be true: it powers strongly typed languages that support parametric polymorphism without requiring any type declarations.

We'll write code that figures out that the function:

----------------------------------------------------------------
\x y z -> x z (y z)
----------------------------------------------------------------

has type:

----------------------------------------------------------------
(a -> b -> c) -> (a -> b) -> a -> c
----------------------------------------------------------------

And given that `+++(+)+++` has type `+Int -> Int -> Int+`, our code determines that the function:

----------------------------------------------------------------
\x -> (+) (x 42)
----------------------------------------------------------------

has type:

----------------------------------------------------------------
(Int -> Int) -> Int -> Int
----------------------------------------------------------------

== 1. Identifying twins ==

Determine if two binary trees storing integers in their leaf nodes are equal.

*Solution:* Two words: `deriving Eq`.

------------------------------------------------------------------------------
data Tree a = Leaf a | Branch (Tree a) (Tree a) deriving Eq
------------------------------------------------------------------------------

https://www.haskell.org/onlinereport/derived.html[Haskell's derived instance feature] automatically works on any algebraic data type built on any type for which equality makes sense.
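For instance, the derived instance can be exercised directly. A quick standalone sketch (redefining `Tree` with `Show` added so results can be printed; the `main` wrapper is only for demonstration):

```haskell
-- Same Tree as above, with Show added so we can print comparisons.
data Tree a = Leaf a | Branch (Tree a) (Tree a) deriving (Eq, Show)

main :: IO ()
main = do
  -- Structural equality comes for free from `deriving Eq`.
  print $ Branch (Leaf 1) (Leaf 2) == Branch (Leaf 1) (Leaf 2)  -- True
  print $ Branch (Leaf 1) (Leaf 2) == Leaf (1 :: Int)           -- False
```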
It even works for mutually recursive data types (https://hackage.haskell.org/package/containers/docs/Data-Tree.html[see `Data.Tree`]):

------------------------------------------------------------------------------
data Tree a = Node a (Forest a) deriving Eq
data Forest a = Forest [Tree a] deriving Eq
------------------------------------------------------------------------------

Perhaps my interviewer would ask me to explain what `deriving Eq` does. Roughly speaking, it generates code like the following, saving the programmer from stating the obvious:

------------------------------------------------------------------------------
data Tree a = Leaf a | Branch (Tree a) (Tree a)

eq (Leaf x) (Leaf y) = x == y
eq (Branch xl xr) (Branch yl yr) = eq xl yl && eq xr yr
eq _ _ = False
------------------------------------------------------------------------------

== 2. On assignment ==

This time, the first tree may contain variables in place of integers. Can we _unify_ the trees, that is, assign integers to all variables so the trees are equal? The same variable may appear more than once.

*Solution:* We extend our data structure to hold variables:

\begin{code}
data Tree a = Var String | Leaf a | Tree a :. Tree a deriving Show
\end{code}

As before, we traverse both trees and look for nodes that differ in value or type. If the first is a variable, then we record a constraint, that is, a variable assignment that is required for the trees to be equal, such as `a = 4` or `b = 2`. If there are conflicting values for the same variable, then we indicate failure by returning `Nothing`. Otherwise we return `Just` the list of assignments found.

\begin{code}
solve1 = go [] where
  go as = \cases
    (xl :. xr) (yl :.
      yr) -> do
        as0 <- go as xl yl
        go as0 xr yr
    (Leaf x) (Leaf y)
      | x == y -> Just as
      | otherwise -> Nothing
    (Var v) (Leaf x) -> case lookup v as of
      Nothing -> Just $ (v, x):as
      Just x'
        | x == x' -> Just as
        | otherwise -> Nothing
    _ _ -> Nothing

solve1 (Leaf 1) (Leaf 2)
solve1 (Leaf 3) (Leaf 3)
solve1 (Leaf 4) (Var "x")
solve1 (Var "x" :. Var "y") (Leaf 5 :. Leaf 6)
solve1 (Var "x" :. Var "x") (Leaf 5 :. Leaf 6)
\end{code}

== 3. Both Sides, Now ==

Now suppose leaf nodes in both trees can hold integer variables. Can two trees be made equal by assigning certain integers to the variables? If so, find the most general solution.

*Solution:* We proceed as before, but now we may encounter constraints such as `a = b`, which equate two variables. To handle such a constraint, we pick one of the variables, such as `a`, and replace all occurrences of `a` with the other side, which is `b` in our example. This eliminates `a` from all constraints. Eventually, all our constraints have an integer on at least one side, which we check for consistency.

We write a function that substitutes a given tree for a given variable. This is more general than necessary, since we only need to substitute variables for variables, but we'll soon need its full power.

\begin{code}
sub x t = \case
  Var x' | x == x' -> t
  l :. r -> sub x t l :. sub x t r
  a -> a

sub "x" (Leaf 1 :. Leaf 2) ((Var "x" :. (Leaf 3 :. Var "y")) :. Var "x")
\end{code}

We discard redundant constraints where the same variable appears on both sides, such as `a = a`. Thus a variable may wind up with no integer assigned to it, which means if a solution exists, it can take any value.

Our code is inefficient. Rather than repeat the same chain of substitutions to reach a representative variable, we should use a more efficient https://en.wikipedia.org/wiki/Disjoint-set_data_structure[disjoint-set data structure].
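For the curious, here is a minimal sketch of that idea, assuming a `Data.Map`-backed parent table (the `find`/`union` names and the representation are our own invention; real implementations add union by rank and path compression):

```haskell
import qualified Data.Map as M

-- Each variable optionally points at a parent; following parent
-- links reaches the representative of its equivalence class.
type UF = M.Map String String

-- Walk parent links to the representative of a variable.
find :: UF -> String -> String
find uf v = maybe v (find uf) (M.lookup v uf)

-- Merge two classes by linking one representative under the other.
union :: String -> String -> UF -> UF
union a b uf
  | ra == rb = uf
  | otherwise = M.insert ra rb uf
  where
  ra = find uf a
  rb = find uf b

main :: IO ()
main = do
  -- Constraints a = b and b = c put all three in one class.
  let uf = union "a" "b" (union "b" "c" M.empty)
  print $ find uf "a" == find uf "c"  -- True
```

With this, a chain of constraints such as `a = b`, `b = c` resolves `a` and `c` to the same representative without re-substituting through every recorded constraint.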
\begin{code}
foldSub = foldr $ uncurry sub

solve2 = go [] where
  go as = \cases
    (Leaf x) (Leaf y)
      | x == y -> Just as
      | otherwise -> Nothing
    (xl :. xr) (yl :. yr) -> go as xl yl >>= \as0 ->
      go as0 (foldSub xr as0) (foldSub yr as0)
    (Var _) (_ :. _) -> Nothing
    (Var v) (Var w) | v == w -> Just as
    (Var v) t -> Just $ (v, t):as
    x y@(Var _) -> go as y x
    _ _ -> Nothing

solve2 ((Var "x" :. Leaf 1) :. Var "z") ((Var "y" :. Var "y") :. Leaf 2)
solve2 ((Var "x" :. Leaf 1) :. Var "z") ((Var "y" :. Var "z") :. Leaf 2)
solve2 ((Var "x" :. Leaf 1) :. Var "x") ((Var "y" :. Var "z") :. Leaf 2)
\end{code}

== 4. Once more, with subtrees ==

What if variables can represent subtrees?

*Solution:* Although we've significantly generalized the problem, our answer remains almost the same. When unifying a variable `v` against a tree `t`, to avoid infinite recursion, we perform an _occurs check_: the variable `v` must not occur in `t`, except in the trivial case when `v == t`. If we pass this check, then we record that in future, we should replace `v` with the tree resulting from applying all substitutions found so far to `t`.

\begin{code}
treeUnify :: (Show a, Eq a) => Tree a -> Tree a -> Maybe [(String, Tree a)]
treeUnify = go [] where
  go as = \cases
    (Leaf x) (Leaf y)
      | x == y -> Just as
      | otherwise -> Nothing
    (xl :. xr) (yl :. yr) -> do
      as0 <- go as xl yl
      go as0 (foldSub xr as0) (foldSub yr as0)
    (Var v) (Var w) | v == w -> Just as
    (Var v) t
      | occurs t -> Nothing
      | otherwise -> Just $ (v, foldSub t as):as
      where
      occurs = \case
        l :. r -> occurs l || occurs r
        Var w | v == w -> True
        _ -> False
    x y@(Var _) -> go as y x
    _ _ -> Nothing

treeUnify (Var "x") (Leaf 5)
treeUnify (Var "x" :. Leaf 5) (Leaf 4 :. Var "y")
treeUnify ((Var "x" :. Var "x") :. Var "w")
  (((Leaf 3 :. Leaf 5) :. (Var "y" :. Var "z")) :. Var "w")
\end{code}

== 5. Type inference! ==

Design a language based on lambda calculus where integers and strings are primitive types, and where we can deduce whether a given expression is _typable_, that is, whether types can be assigned to the untyped bindings so that the expression is well-typed. If so, find the most general type.

For example, the expression `+\f -> f 2+`, which takes its first argument `f` and applies it to the integer 2, must have type `+(Int -> u) -> u+`. Here, `u` is a _type variable_, that is, we can substitute `u` with any type. This is known as _parametric polymorphism_.

More precisely, the inferred type is most general, or _principal_, if:

1. Substituting types such as `Int` or `+(Int -> Int) -> Int+` (sometimes called _type constants_ for clarity) for all the type variables results in a well-typed closed term.
2. There are no other ways of typing the given expression.

*Solution:* We define a tree to represent types. Like our previous trees, leaf nodes can be variables (`TV` for _type variable_) or constants (`TC` for _type constant_). The `+(:->)+` constructor plays the same role as `(:.)` above, but this time it also represents functions.

\begin{code}
infixr 5 :->
data Type = TC String | Type :-> Type | TV String

instance Show Type where
  showsPrec d = \case
    TC s -> (s++)
    TV s -> (s++)
    a :-> b -> showParen (d > 0) $ showsPrec 1 a . (" -> "++) .
      shows b
\end{code}

We write a function `tsub` that substitutes a given tree for a given variable, analogous to `sub` above:

\begin{code}
tsub (x, t) = \case
  TV x' | x == x' -> t
  a :-> b -> tsub (x, t) a :-> tsub (x, t) b
  y -> y
\end{code}

Tree unification is almost the same, though this time we switch from `Maybe` to `Either` so we can add error messages:

\begin{code}
unify :: Type -> Type -> Either String [(String, Type)]
unify = go [] where
  go as = \cases
    (TC x) (TC y) | x == y -> Right as
    (xl :-> xr) (yl :-> yr) -> do
      as0 <- go as xl yl
      go as0 (foldr tsub xr as0) (foldr tsub yr as0)
    (TV v) (TV w) | v == w -> Right as
    (TV v) t
      | occurs t -> Left "occurs check"
      | otherwise -> Right $ (v, foldr tsub t as):as
      where
      occurs = \case
        l :-> r -> occurs l || occurs r
        TV w | v == w -> True
        _ -> False
    x y@(TV _) -> go as y x
    x y -> Left $ show x ++ " versus " ++ show y
\end{code}

Next, we define a data structure to hold an expression in our language: applications, lambda abstractions, variables, integers, and strings.

\begin{code}
infixl 5 :@
data Expr = Expr :@ Expr | Lam String Expr | V String | I Int | S String

instance Show Expr where
  showsPrec d = \case
    a :@ b -> showParen (d > 0) $ shows a . (' ':) . showsPrec 1 b
    Lam v t -> ('\\':) . (v++) . (" -> "++) .
      shows t
    V s -> showParen (s == "+") $ (s++)
    I i -> shows i
    S s -> shows s
\end{code}

Unfortunately my base library lacks the state monad transformer, so we swipe some code from GHC:

\begin{code}
data StateT s m a = StateT { runStateT :: s -> m (a, s) }

instance (Monad m) => Monad (StateT s m) where
  return a = StateT $ \s -> return (a, s)
  m >>= k = StateT $ \s -> do
    (a, s') <- runStateT m s
    runStateT (k a) s'

instance (Functor m) => Functor (StateT s m) where
  fmap f m = StateT $ \s ->
    fmap (\(a, s') -> (f a, s')) $ runStateT m s

instance (Functor m, Monad m) => Applicative (StateT s m) where
  pure a = StateT $ \s -> return (a, s)
  StateT mf <*> StateT mx = StateT $ \s -> do
    (f, s') <- mf s
    (x, s'') <- mx s'
    return (f x, s'')
  m *> k = m >>= \_ -> k

state f = StateT (return . f)
getT = state $ \s -> (s, s)
putT s = state $ \_ -> ((), s)
lift m = StateT $ \s -> do
  a <- m
  return (a, s)
\end{code}

To finish off, we traverse the expression and unify types along the way to enforce the rules:

* The type of a primitive value is its corresponding type; for example, 5 has type `TC "Int"` and "Hello, World" has type `TC "String"`.
* For an application `f x`, we recursively determine the type `tf` of `f` and `tx` of `x`, possibly requiring new assignments. We return a new type variable `tfx`, and unify `tf` with `+tx :-> tfx+` after applying any new assignments.
* For a lambda abstraction `\x.t`, we generate a new type variable `tx` to represent the type of `x`. Then we recursively find the type `tt` of `t`, during which we assign the type `tx` to any free occurrence of `x`, then return the type `+tx :-> tt+`.

Let's walk through an example. The expression `+\f -> f 2+` would be represented as the `Expr` tree `Lam "f" (V "f" :@ I 2)`. To type this expression:

1. At the top level we have a lambda, so begin by generating a new type variable `t0`.
2. Recursively type the lambda body, with the additional context that the symbol `f` has type `t0`.
.. Generate a new type variable `t1`.
.. Recursively type the left child of the `(:@)` node:
... We have the symbol `f`, which has type `t0` from the context.
.. Recursively type the right child of the `(:@)` node:
... We have the integer constant 2, so we return the type `TC "Int"`.
.. Unify `t0` with `+TC "Int" -> t1+`, which assigns the latter to the former.
.. Return the type `t1`.
3. Return the type `+t0 -> t1+`.

Thus `+\f -> f 2+` has type `+t0 -> t1+`, which expands to `+(Int -> t1) -> t1+` if we apply all the assignments we found.

We use `StateT` to help with bookkeeping. To guarantee a unique name for each type variable, we maintain a counter which we increment for each new name; it is stored along with the type variable assignments collected so far. The environment `gamma` holds the types of any predefined functions, and also records the types of variables in lambda abstractions. We ultimately return the type of an expression, or an error message explaining why it is ill-typed.

We name the function `gather` because it accumulates more and more type variable assignments over time, but also because "to gather" can mean "to infer".

\begin{code}
gather :: [(String, Type)] -> Expr -> StateT ([(String, Type)], Int) (Either String) Type
gather gamma = \case
  I _ -> pure $ TC "Int"
  S _ -> pure $ TC "String"
  f :@ x -> do
    tf <- gather gamma f
    tx <- gather gamma x
    tfx <- newTV
    (cs, i) <- getT
    let rewrite t = foldr tsub t cs
    cs' <- lift $ unify (rewrite tf) (rewrite $ tx :-> tfx)
    putT ((second rewrite <$> cs') ++ cs, i)
    pure tfx
  V s -> lift $ maybe (Left $ "missing " ++ s) Right $ lookup s gamma
  Lam x t -> do
    tx <- newTV
    (tx :->) <$> gather ((x, tx):gamma) t
 where
  newTV = do
    (cs, i) <- getT
    putT (cs, i + 1)
    pure $ TV $ 't':show i
\end{code}

This algorithm is the heart of the link:pcf.html[_Hindley-Milner type system_], or HM for short. See https://web.cecs.pdx.edu/~mpj/thih/thih.pdf[Mark P. Jones, _Typing Haskell in Haskell_].
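The key step of the walkthrough, applying the assignment found for `t0` to the lambda's type `t0 -> t1`, can be replayed in isolation. A stripped-down sketch (its own simplified `Type` and single-substitution `tsub`, not the definitions above):

```haskell
infixr 5 :->
-- A simplified mirror of the article's Type, with derived instances.
data Type = TC String | TV String | Type :-> Type deriving (Eq, Show)

-- Substitute a type for one type variable, as `tsub` does above.
tsub :: (String, Type) -> Type -> Type
tsub (x, t) ty = case ty of
  TV x' | x == x' -> t
  a :-> b -> tsub (x, t) a :-> tsub (x, t) b
  _ -> ty

main :: IO ()
main = do
  -- Unifying t0 with Int -> t1 yields this assignment; applying it
  -- to t0 -> t1 expands the lambda's type to (Int -> t1) -> t1.
  let assignment = ("t0", TC "Int" :-> TV "t1")
  print $ tsub assignment (TV "t0" :-> TV "t1")
```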
The `infer` wrapper shows the type returned by `gather` with all assignments applied, or returns an error message:

\begin{code}
infer gamma x = case runStateT (gather gamma x) ([], 0) of
  Right (ty, (cs, _)) -> show $ foldr tsub ty cs
  Left e -> "FAIL: " ++ e

infer [] (Lam "x" (V "x" :@ I 2))
\end{code}

We build a demo with a couple of predefined functions with certain types:

\begin{code}
prelude :: [(String, Type)]
prelude =
  [ ("+", TC "Int" :-> TC "Int" :-> TC "Int")
  , ("length", TC "String" :-> TC "Int")]

demo expr = do
  putStrLn $ "type of `" ++ show expr ++ "`:"
  putStrLn $ " " ++ infer prelude expr ++ "\n"

demo $ V "length"
demo $ V "length" :@ S "hello"
demo $ V "length" :@ V "length"
demo $ V "+" :@ I 1 :@ I 1
demo $ V "+" :@ I 1 :@ S "hello"
demo $ Lam "x" (V "x" :@ I 2)
demo $ Lam "x" (V "x" :@ V "x")
demo $ Lam "x" ((V "+" :@ V "x") :@ I 42)
demo $ Lam "x" (V "+" :@ (V "x" :@ I 42))
demo $ Lam "x" (V "x")
demo $ Lam "x" (Lam "y" (V "x"))
demo $ Lam "x" (Lam "y" (Lam "z" ((V "x" :@ V "z") :@ (V "y" :@ V "z"))))
\end{code}

== Theorems for free ==

Parametric polymorphism not only aids correctness, but also allows easier manipulation of code. Suppose we infer that a function `f` has type:

----------------------------------------------------------------
f :: [a] -> [a]
----------------------------------------------------------------

Then link:../compiler/para.html[we can mechanically deduce] that:

----------------------------------------------------------------
f (map r xs) = map r (f xs)
----------------------------------------------------------------

for any function `r`. This rewrite rule could be applied automatically to help optimize code.
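The free theorem is easy to spot-check on a concrete instance (a sanity check, not a proof); here with `f = reverse` and `r = (* 2)`:

```haskell
main :: IO ()
main = do
  -- reverse :: [a] -> [a], so the theorem promises
  -- reverse (map r xs) == map r (reverse xs) for any r.
  let r = (* 2) :: Int -> Int
      xs = [1, 2, 3]
  print $ reverse (map r xs) == map r (reverse xs)  -- True
```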