I've written a function in Haskell for calculating the entropy of a collection. I'd like feedback on how the function could be rewritten to be more flexible/reusable, and also how to profile the function and how it could be tuned and/or modified for better performance.
import Data.List (foldl1')
entropy :: [Int] -> Int -> Int -> Double
entropy itemFrequencies totalElements logarithmicBase =
-(foldl1' (+) $ map (\p -> p * (logBase b p)) probabilities)
where
is = map fromIntegral itemFrequencies
l = fromIntegral totalElements
b = fromIntegral logarithmicBase
probabilities = map (\i -> i / l) $ is
For some background, the Entropy calculation is a core function used in building decision trees. For more complex datasets and decision trees, this function would get called very often. I'm working on a sequential implementation of the ID3 algorithm, which this entropy function is a part of, that I will later make parallel/concurrent as a separate exercise, and then I will eventually also write implementations of ID3's descendants: C4.5 and C5.0.
Please respect my desire to struggle on my own with the concurrency/parallelism aspect of rewriting this code, I'm only interested in any sequential improvements I could make for performance and any refactoring that could be done to make this code easier to maintain and reuse.
-
\$\begingroup\$ Since it's been a week, I'm going to go ahead and select cole's answer, even though I would have liked for someone to have provided an answer that made an attempt at looking at the potential performance improvements. \$\endgroup\$Josiah– Josiah2019年06月09日 22:03:49 +00:00Commented Jun 9, 2019 at 22:03
1 Answer 1
General Code Review
First, your function and my proposed changes (entropy'
) side-by-side.
import Data.List (foldl1', foldl')
entropy :: [Int] -> Int -> Int -> Double
entropy itemFrequencies totalElements logarithmicBase =
-(foldl1' (+) $ map (\p -> p * (logBase b p)) probabilities)
where
is = map fromIntegral itemFrequencies
l = fromIntegral totalElements
b = fromIntegral logarithmicBase
probabilities = map (\i -> i / l) $ is
entropy' :: (Foldable f, Integral a, Floating b) => a -> a -> f a -> b
entropy' totalElems base =
negate . foldl' (\ent f2 -> ent + freqEntropy f2) 0
where
freqEntropy f = let p = (fromIntegral f) / l
in p * logBase b p
l = fromIntegral totalElems
b = fromIntegral base
My comments, in an arbitrary order:
- Why are you using
foldl1'
? It makes sense to usefoldl'
from a performance perspective, but it isn't clear to me why you require a nonempty list. Perhaps use aMaybe
to encapsulate this possibility of failure, or outline why you expect a nonempty list in a comment. It's a good idea to keep tabs on where your partial functions are to avoid surprises at runtime. My function just returns 0 for a nullFoldable
. - Your types can be generalized more than
Int
andDouble
, if you want this to be more flexible/reusable. What I did to find these types was track down what functions you were using and figure out what their types were (which were more general thanInt
orDouble
or[]
). Then I resolved the overall function to its most general type. Whether this is necessary or useful depends on your application. I think the most useful generalization here is toFoldable
in case you wanted to calculate entropy of things that were not lists. - When I changed to generalize to
Foldable
, I rolled all of themap
s into thefoldl'
. This may be more performant if the compiler doesn't combinemap
s, but it's also a tad bit more complicated to understand. - I moved the
itemFrequencies
to be the last argument so I could write the function pointfree. Pointfree is kind of cute, but if you think it's more readable you can change the order and/or put the explicititemFrequencies
back in. - I added an explicit call to
negate
(I didn't get what was going on at first with the-(foldl1' ...)
). - I shortened the names so that things are not overwhelmingly long or verbose. This is just my personal taste. I think if you're going to have descriptively long variable names, you shouldn't skimp on the description for your temporary variables. I think
b
is OK since it's a common variable for base, but I would recommend using something likelen
instead ofl
anditemFreqs'
instead ofis
.
Performance and Profiling
I can't really help you on this front. GHC does have a profiler, which might be useful if you want to do some serious profiling.
-
1\$\begingroup\$ I thought about and researched this for a while, and I'm not sure what it means to calculate the entropy of a set with no items, or none of the items you're asking about... On the one hand, you could say, the probability of choosing something that isn't in the set is 0, and so the entropy must be 0. On the other hand, I could also see the argument for infinity or undefined since division by 0 (which you would have to do in a set of 0 items to get the probability) is undefined. That's why I'm not permitting an empty list of itemFrequencies: entropy doesn't make sense for null sets. \$\endgroup\$Josiah– Josiah2019年06月03日 22:51:25 +00:00Commented Jun 3, 2019 at 22:51