Entropy calculation function in Haskell

Question 1

I've written a function in Haskell for calculating the entropy of a collection. I'd like feedback on how the function could be rewritten to be more flexible/reusable, and also how to profile the function and how it could be tuned and/or modified for better performance.

import Data.List (foldl1')
entropy :: [Int] -> Int -> Int -> Double
entropy itemFrequencies totalElements logarithmicBase =
 -(foldl1' (+) $ map (\p -> p * (logBase b p)) probabilities)
 where
 is = map fromIntegral itemFrequencies
 l = fromIntegral totalElements
 b = fromIntegral logarithmicBase
 probabilities = map (\i -> i / l) $ is

For some background, the Entropy calculation is a core function used in building decision trees. For more complex datasets and decision trees, this function would get called very often. I'm working on a sequential implementation of the ID3 algorithm, which this entropy function is a part of, that I will later make parallel/concurrent as a separate exercise, and then I will eventually also write implementations of ID3's descendants: C4.5 and C5.0.

Please respect my desire to struggle on my own with the concurrency/parallelism aspect of rewriting this code, I'm only interested in any sequential improvements I could make for performance and any refactoring that could be done to make this code easier to maintain and reuse.

Question 2

Since it's been a week, I'm going to go ahead and select cole's answer, even though I would have liked for someone to have provided an answer that made an attempt at looking at the potential performance improvements.

Question 3

General Code Review

First, your function and my proposed changes (entropy') side-by-side.

import Data.List (foldl1', foldl')
entropy :: [Int] -> Int -> Int -> Double
entropy itemFrequencies totalElements logarithmicBase =
 -(foldl1' (+) $ map (\p -> p * (logBase b p)) probabilities)
 where
 is = map fromIntegral itemFrequencies
 l = fromIntegral totalElements
 b = fromIntegral logarithmicBase
 probabilities = map (\i -> i / l) $ is
entropy' :: (Foldable f, Integral a, Floating b) => a -> a -> f a -> b
entropy' totalElems base =
 negate . foldl' (\ent f2 -> ent + freqEntropy f2) 0
 where
 freqEntropy f = let p = (fromIntegral f) / l
 in p * logBase b p
 l = fromIntegral totalElems
 b = fromIntegral base

My comments, in an arbitrary order:

Why are you using foldl1'? It makes sense to use foldl' from a performance perspective, but it isn't clear to me why you require a nonempty list. Perhaps use a Maybe to encapsulate this possibility of failure, or outline why you expect a nonempty list in a comment. It's a good idea to keep tabs on where your partial functions are to avoid surprises at runtime. My function just returns 0 for a null Foldable.
Your types can be generalized more than Int and Double, if you want this to be more flexible/reusable. What I did to find these types was track down what functions you were using and figure out what their types were (which were more general than Int or Double or []). Then I resolved the overall function to its most general type. Whether this is necessary or useful depends on your application. I think the most useful generalization here is to Foldable in case you wanted to calculate entropy of things that were not lists.
When I changed to generalize to Foldable, I rolled all of the maps into the foldl'. This may be more performant if the compiler doesn't combine maps, but it's also a tad bit more complicated to understand.
I moved the itemFrequencies to be the last argument so I could write the function pointfree. Pointfree is kind of cute, but if you think it's more readable you can change the order and/or put the explicit itemFrequencies back in.
I added an explicit call to negate (I didn't get what was going on at first with the -(foldl1' ...)).
I shortened the names so that things are not overwhelmingly long or verbose. This is just my personal taste. I think if you're going to have descriptively long variable names, you shouldn't skimp on the description for your temporary variables. I think b is OK since it's a common variable for base, but I would recommend using something like len instead of l and itemFreqs' instead of is.

Performance and Profiling

I can't really help you on this front. GHC does have a profiler, which might be useful if you want to do some serious profiling.

Question 4

I thought about and researched this for a while, and I'm not sure what it means to calculate the entropy of a set with no items, or none of the items you're asking about... On the one hand, you could say, the probability of choosing something that isn't in the set is 0, and so the entropy must be 0. On the other hand, I could also see the argument for infinity or undefined since division by 0 (which you would have to do in a set of 0 items to get the probability) is undefined. That's why I'm not permitting an empty list of itemFrequencies: entropy doesn't make sense for null sets.

cole cole 6263 silver badges7 bronze badges · Accepted Answer · 2019-06-01 22:26:52Z

General Code Review

First, your function and my proposed changes (entropy') side-by-side.

import Data.List (foldl1', foldl')
entropy :: [Int] -> Int -> Int -> Double
entropy itemFrequencies totalElements logarithmicBase =
 -(foldl1' (+) $ map (\p -> p * (logBase b p)) probabilities)
 where
 is = map fromIntegral itemFrequencies
 l = fromIntegral totalElements
 b = fromIntegral logarithmicBase
 probabilities = map (\i -> i / l) $ is
entropy' :: (Foldable f, Integral a, Floating b) => a -> a -> f a -> b
entropy' totalElems base =
 negate . foldl' (\ent f2 -> ent + freqEntropy f2) 0
 where
 freqEntropy f = let p = (fromIntegral f) / l
 in p * logBase b p
 l = fromIntegral totalElems
 b = fromIntegral base

My comments, in an arbitrary order:

Why are you using foldl1'? It makes sense to use foldl' from a performance perspective, but it isn't clear to me why you require a nonempty list. Perhaps use a Maybe to encapsulate this possibility of failure, or outline why you expect a nonempty list in a comment. It's a good idea to keep tabs on where your partial functions are to avoid surprises at runtime. My function just returns 0 for a null Foldable.
Your types can be generalized more than Int and Double, if you want this to be more flexible/reusable. What I did to find these types was track down what functions you were using and figure out what their types were (which were more general than Int or Double or []). Then I resolved the overall function to its most general type. Whether this is necessary or useful depends on your application. I think the most useful generalization here is to Foldable in case you wanted to calculate entropy of things that were not lists.
When I changed to generalize to Foldable, I rolled all of the maps into the foldl'. This may be more performant if the compiler doesn't combine maps, but it's also a tad bit more complicated to understand.
I moved the itemFrequencies to be the last argument so I could write the function pointfree. Pointfree is kind of cute, but if you think it's more readable you can change the order and/or put the explicit itemFrequencies back in.
I added an explicit call to negate (I didn't get what was going on at first with the -(foldl1' ...)).
I shortened the names so that things are not overwhelmingly long or verbose. This is just my personal taste. I think if you're going to have descriptively long variable names, you shouldn't skimp on the description for your temporary variables. I think b is OK since it's a common variable for base, but I would recommend using something like len instead of l and itemFreqs' instead of is.

Performance and Profiling

I can't really help you on this front. GHC does have a profiler, which might be useful if you want to do some serious profiling.

I thought about and researched this for a while, and I'm not sure what it means to calculate the entropy of a set with no items, or none of the items you're asking about... On the one hand, you could say, the probability of choosing something that isn't in the set is 0, and so the entropy must be 0. On the other hand, I could also see the argument for infinity or undefined since division by 0 (which you would have to do in a set of 0 items to get the probability) is undefined. That's why I'm not permitting an empty list of itemFrequencies: entropy doesn't make sense for null sets.

Stack Exchange Network

Entropy calculation function in Haskell

1 Answer 1

General Code Review

Performance and Profiling

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Entropy calculation function in Haskell

1 Answer 1

General Code Review

Performance and Profiling

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions