4
\$\begingroup\$

I've been trying to get better at Haskell for a while, and have recently been working on a lot of small projects with it. This constructs a binary decision tree.

The command to run it is:

stack exec decision-tree-exe <threshold> <training file> <testing file>

where threshold is in the range (0,1].

I think I've gotten a lot better, but I'm still having problems, especially with performance and readability. For this project, I took a more top down approach, implementing functions after using them. src/DecisionTree.hs is where the bulk of the logic is, and the file is pretty much in order of writing. I would love to get some feedback from some more experienced people on where I might improve.

module DecisionTree where
import Data.List (genericLength, maximumBy, nub)
import Data.Map (elemAt, foldlWithKey', fromListWith)
import Data.Ord
data DecisionTree a b
 = Node ([a] -> Bool) (DecisionTree a b) (DecisionTree a b)
 | Leaf b
type Dataset cat attrs = [(cat, [attrs])]
type Threshold = Double
type Splitter c a = ([a] -> Bool, Dataset c a, Dataset c a)
apply :: DecisionTree a b -> [a] -> b
apply (Leaf b) _ = b
apply (Node f l r) a =
 case f a of
 False -> apply l a
 True -> apply r a
train ::
 (Ord c)
 => (Dataset c a -> Maybe (Splitter c a))
 -> Dataset c a
 -> DecisionTree a c
train splitter dataset =
 case splitter dataset of
 Just (partitioner, left, right) ->
 Node partitioner (train splitter left) (train splitter right)
 Nothing -> Leaf majority
 where
 classCounts = fromListWith (+) $ map (\k -> (fst k, 1)) dataset
 majority = fst $ foldlWithKey' max (elemAt 0 classCounts) classCounts
 max acc k v
 | v > snd acc = (k, v)
 | otherwise = acc
giniSplitter ::
 (Ord a, Ord c) => Threshold -> Dataset c a -> Maybe (Splitter c a)
giniSplitter threshold dataset =
 case fst maxDelta > threshold of
 True -> Just $ snd maxDelta
 False -> Nothing
 where
 attrs = nub . concat . snd . unzip $ dataset
 partitioner a = (a `elem`)
 delta a = giniDelta (partitioner a) dataset
 maxDelta = maximumBy (comparing fst) $ map delta attrs
giniDelta :: (Eq c) => ([a] -> Bool) -> Dataset c a -> (Double, Splitter c a)
giniDelta partitioner dataset =
 ( gini dataset - (d1 / d * gini left + d2 / d * gini right)
 , (partitioner, left, right))
 where
 left = filter (not . partitioner . snd) dataset
 right = filter (partitioner . snd) dataset
 d1 = genericLength left
 d2 = genericLength right
 d = genericLength dataset
gini :: (Eq c) => Dataset c a -> Double
gini d = 1 - sum [(pj c) ** 2 | c <- nub . fst . unzip $ d]
 where
 pj c = genericLength (filter ((== c) . fst) d) / genericLength d
asked May 2, 2019 at 22:28
\$\endgroup\$

1 Answer 1

2
\$\begingroup\$

Just a few random comments:

  1. Both elemAt and maximumBy give hints that you're expecting to operate on non-empty structures. Maybe give Data.List.NonEmpty a try.

  2. A few places could be more clear with more pattern matching. E.g. max (k1, v1) k2 v2 instead of max acc k v. Or (maxDelta, splitter) = maximumBy ...

  3. map snd is more conventional than snd . unzip. I suspect it would be more efficient too but I might be wrong.

  4. In several places you're traversing the same list multiple times. In general, it's better to avoid this as it's likely to force the spine of the (potentially large) list in memory. You might be able to merge these multiple traversals into one (e.g. using the foldl package). More likely, you should simply use vector.

  5. In giniDelta you could use Data.List.partition to construct left and right.

  6. Apply top-down ordering in your where-clauses. E.g. in train, majority should come first as it is the declaration that is referenced from the main function body.

EDIT: All in all I think readability is actually pretty good!

answered May 22, 2019 at 1:28
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.