I was recently given a programming assignment by a prospective employer. Unfortunately, I didn't pass the assignment and didn't receive any feedback on what I could have done better. I would appreciate any and all feedback on how I could make the program below better.
Task:
Write a class or module that has a function or method that returns a random number chosen from a set of numbers where each has a specified probability. For example, the input could be an associative array consisting of:
1 => 0.25
2 => 0.5
7 => 0.25
When the function or method is called repeatedly, the above input might yield 2, 7, 1, 1, 2, 2, 2, 7, 2, etc.
Additional requirements:
The random number generator will be called billions of times, so performance is important.
The probabilities will be updated very infrequently, if ever. But neither the probabilities nor their distribution is known before processing.
The size of the set of numbers will typically be at least a thousand, possibly well into the millions, each with its own probability. No other number should be returned except for those specified in the input.
The class or module should support the ability to replay the same sequence of random numbers regardless of platform it is compiled on and regardless of whether the program is directly translated to another programming language. The class does not need to read or write to any device, but it is desirable to minimize the amount of data stored to reproduce the replay.
The code must be written such that a competent developer could translate your code directly into another common programming language without having to retrieve any code not written by you.
You can use whatever programming language you like as long as it is suitable for practical use (e.g., Befunge and INTERCAL are not acceptable but Python, C++, or Haskell is), has a visible community online, and is one we can freely and easily download and setup to interpret or run the code (e.g., MATLAB is not acceptable, but Octave is).
Code:
package main

import (
    "errors"
    "math/rand"
    "sort"
)
/* WeightedProbability is the class to handle all of the weights etc.
 * There will be a values and weights slice, think ArrayList in Java.
 * Total will be the total value of the weights.
 * ReplayValues is a slice containing the replay sequence.
 */
type WeightedProbability struct {
    ValueWeights []vw      // assume float64 because with such a large input size you would need decimal percentages, i.e. .25%
    Total        float64   // total value of the weights
    ReplayValues []float64
    Initialized  bool
}
/* vw is a struct used to map the key/value pairs of the map to a slice.
 */
type vw struct {
    Value  float64
    Weight float64
}
/* sortByWeight sorts a map by weight and returns a slice of struct vw and the total.
 * We can then perform binary search on a slice of structs; we can't perform binary search on a map.
 * @m map[float64]float64
 */
func sortByWeight(m map[float64]float64) ([]vw, float64) {
    var mkv []vw
    total := 0.0
    for value, weight := range m {
        total += weight
        mkv = append(mkv, vw{value, total})
    }
    sort.Slice(mkv, func(i, j int) bool {
        return mkv[i].Weight < mkv[j].Weight
    })
    return mkv, total
}
/* Init
 * @numPool map[float64]float64
 * numPool is of the form [value] => [weight]
 * Ex: 7 => .25, so 7 will appear with 25% frequency
 * Total probability can't be over 100%
 * Algorithm takes O(N) to create the weights and values
 * Since we are using a map there should be no duplicates except ones of the form 7 vs 7.00
 */
func (wp *WeightedProbability) Init(numPool map[float64]float64) error {
    if numPool == nil {
        return errors.New("Number Pool is not initialized!")
    }
    valueWeights, total := sortByWeight(numPool)
    if total > 100.00 {
        return errors.New("Total is greater than 100!")
    }
    replayValues := []float64{}
    wp.ValueWeights = valueWeights
    wp.Total = total
    wp.ReplayValues = replayValues
    wp.Initialized = true
    return nil
}
/* GenerateRandomNumber
 * Returns an error if the struct is not initialized or if the struct is unable to generate a random number.
 * sort.Search uses binary search to find the index of the first weight >= x.
 */
func (wp *WeightedProbability) GenerateRandomNumber() (float64, error) {
    if !wp.Initialized {
        return 0, errors.New("Not initialized")
    }
    x := rand.Float64() * wp.Total
    // search the distribution, essentially the same as Python's bisect_left
    i := sort.Search(len(wp.ValueWeights), func(i int) bool {
        return wp.ValueWeights[i].Weight >= x
    })
    if i >= len(wp.ValueWeights) {
        return 0, errors.New("Index too big")
    }
    wp.ReplayValues = append(wp.ReplayValues, wp.ValueWeights[i].Value)
    return wp.ValueWeights[i].Value, nil
}
/* Replay
 * Since we were told we shouldn't write to a file, just store and return the slice.
 */
func (wp *WeightedProbability) Replay() ([]float64, error) {
    if !wp.Initialized {
        return nil, errors.New("Not initialized")
    }
    return wp.ReplayValues, nil
}
- By the way, I don't know what job it was you were applying for, but unless it's specifically in the domain of data science / statistical analysis, be glad you aren't working for a company that's fishing for obscure algorithms. – Alex Reinking, May 15, 2018 at 16:45
- Hello @AlexReinking, thank you for the feedback and the excellent answer! I was applying as a regular React/Express software engineer, so I'm pretty glad things didn't work out at this point in time. – Turtle, May 15, 2018 at 19:33
- That is absolutely embarrassing for the company. Front end work has nothing in common with this. – Alex Reinking, May 15, 2018 at 19:44
3 Answers
You are looking for the Alias method as outlined in this StackOverflow question. The basic idea is to precompute some tables that turn a random number in [0, 1) into a value in your distribution. It works by creating a number of binary distributions such that when you select one of them uniformly at random, and then select between its two symbols, you recover the original probabilities.
There is a detailed explanation and a C++ implementation available here.
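To make the idea concrete, here is a minimal sketch of the alias method (Vose's variant) in Go. The names, the seeding, and the package layout are illustrative assumptions, not part of the original answer; it only aims to show the table construction and the O(1) sampling step.

package alias

import "math/rand"

// AliasTable supports O(1) sampling after O(n) setup.
type AliasTable struct {
    prob  []float64 // scaled probability kept in each column
    alias []int     // alias index for each column
    vals  []float64 // the values to return
    rng   *rand.Rand
}

// NewAliasTable builds the table from parallel values/weights slices.
func NewAliasTable(values, weights []float64, seed int64) *AliasTable {
    n := len(weights)
    prob := make([]float64, n)
    alias := make([]int, n)

    // Scale weights so they average to 1.
    total := 0.0
    for _, w := range weights {
        total += w
    }
    scaled := make([]float64, n)
    for i, w := range weights {
        scaled[i] = w * float64(n) / total
    }

    // Partition columns into "small" (< 1) and "large" (>= 1).
    var small, large []int
    for i, p := range scaled {
        if p < 1 {
            small = append(small, i)
        } else {
            large = append(large, i)
        }
    }

    // Pair each small column with a large one that donates the remainder.
    for len(small) > 0 && len(large) > 0 {
        s := small[len(small)-1]
        small = small[:len(small)-1]
        l := large[len(large)-1]
        large = large[:len(large)-1]

        prob[s] = scaled[s]
        alias[s] = l
        scaled[l] -= 1 - scaled[s]
        if scaled[l] < 1 {
            small = append(small, l)
        } else {
            large = append(large, l)
        }
    }
    // Leftover columns (numerical residue) get probability 1.
    for _, i := range append(small, large...) {
        prob[i] = 1
    }

    return &AliasTable{prob: prob, alias: alias, vals: values, rng: rand.New(rand.NewSource(seed))}
}

// Sample draws one value in O(1): pick a column uniformly, then flip a biased coin.
func (t *AliasTable) Sample() float64 {
    i := t.rng.Intn(len(t.prob))
    if t.rng.Float64() < t.prob[i] {
        return t.vals[i]
    }
    return t.vals[t.alias[i]]
}

Once the table is built, each draw costs one uniform index and one comparison, independent of the size of the number set.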
- That is a great algorithm I'd never heard of before. Thanks. – Josiah, May 15, 2018 at 13:02
- @Josiah - You're welcome! It's a good one with mostly hand-rolled implementations. The constant is kind of high, but the asymptotic complexity is better. If you need to sample tons of numbers from the distribution, or there are a lot of different options, then this is best. Both are the case in this question. – Alex Reinking, May 15, 2018 at 16:40
- C++ implementation link is broken – Tony, Aug 11, 2021 at 18:31
- @Tony - still working for me. – Alex Reinking, Aug 11, 2021 at 21:41
- I guess I got an intermittent error. Please ignore. Meanwhile I found a C implementation: github.com/Tecnarca/Vose-Alias-Method/blob/master/extractor.c and a Java implementation: keithschwarz.com/interesting/code/?dir=alias-method I'm thinking of writing a Go implementation. – Tony, Aug 11, 2021 at 22:52
I see two main issues with this piece of code:
- the code is hard to read, and not idiomatic
- there are no tests or benchmarks
1. Readability
It's always a good idea to run golint and go vet on your code to detect common style mistakes, here mainly comment and error message formatting.
The name WeightedProbability might not be clear enough. Maybe Generator is enough?
The Init() function is not a good way to initialize a WeightedProbability.
Instead of
wp := &WeightedProbability{}
err := wp.Init(map)
if err != nil {
    ...
}
we could add a NewGenerator() function doing the same thing, which is the common way of doing this in Go:
g, err := NewGenerator(map)
if err != nil {
    ...
}
The Replay() function should not exist either. Instead, as @Josiah already said, instantiate the generator with a seed:
func NewGenerator(seed int64, numPool map[float64]float64) (*Generator, error) {
    ...
    return &Generator{
        randSource: rand.NewSource(seed),
        ...
    }, nil
}
Don't use a map[float64]float64 to get the number set, because iteration order is not guaranteed from one iteration to the next (see go maps in action for details).
This can be a problem if some values have the same weight: the order of ValueWeights would randomly change from one run to another. Instead, use two []float64 slices, as in the sketch below.
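A quick, self-contained illustration of that point (the map literal here is arbitrary): two range loops over the same map may visit the keys in different orders.

package main

import "fmt"

func main() {
    m := map[float64]float64{1: 0.25, 2: 0.5, 7: 0.25}

    // Go deliberately randomizes map iteration order, so these two passes
    // may print the keys in different orders.
    for k := range m {
        fmt.Print(k, " ")
    }
    fmt.Println()
    for k := range m {
        fmt.Print(k, " ")
    }
    fmt.Println()
}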
A new version of the code could look like this:
package generator

import (
    "fmt"
    "math/rand"
    "sort"
)
type numberSet struct {
    values []float64
    bounds []float64
}

func newNumberSet(values []float64, weight []float64) (*numberSet, error) {
    if len(values) != len(weight) {
        return nil, fmt.Errorf("values and weight should have the same length")
    }
    s := &numberSet{
        values: values,
        bounds: weight,
    }
    sort.Sort(s)
    sum := float64(0)
    for i, weight := range s.bounds {
        sum += weight
        s.bounds[i] = sum
    }
    if sum-1 > 1e-9 || 1-sum > 1e-9 {
        return nil, fmt.Errorf("sum of weight should be 1, but was %f", sum)
    }
    return s, nil
}
func (s *numberSet) Len() int { return len(s.values) }

func (s *numberSet) Swap(i, j int) {
    s.values[i], s.values[j] = s.values[j], s.values[i]
    s.bounds[i], s.bounds[j] = s.bounds[j], s.bounds[i]
}

func (s *numberSet) Less(i, j int) bool { return s.bounds[i] < s.bounds[j] }
// Generator is a struct that returns a random number chosen from a set
// of numbers where each has a specified probability.
type Generator struct {
    randSource rand.Source
    size       int
    numberSet
}

// NewGenerator returns a Generator. It returns an error if len(weight) != len(values),
// or if the sum of weights is != 1.
// Two Generators with the same seed, values and weights will always produce the same
// sequence of random numbers.
func NewGenerator(seed int64, values []float64, weight []float64) (*Generator, error) {
    s, err := newNumberSet(values, weight)
    if err != nil {
        return nil, err
    }
    return &Generator{
        randSource: rand.NewSource(seed),
        size:       len(values),
        numberSet:  *s,
    }, nil
}
// Random returns a random number from the generator number set.
func (g *Generator) Random() float64 {
    r := float64(g.randSource.Int63()) / (1 << 63)
    i := sort.Search(g.size, func(i int) bool {
        return g.bounds[i] >= r
    })
    return g.values[i]
}
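For completeness, here is a small usage sketch of the generator above. The import path (example.com/generator) and the seed value are placeholders for illustration, not part of the original answer.

package main

import (
    "fmt"

    "example.com/generator" // assumed import path for the package above
)

func main() {
    values := []float64{1, 2, 7}
    weights := []float64{0.25, 0.5, 0.25}

    // The seed (42 here) is all that needs to be stored to replay the sequence.
    g, err := generator.NewGenerator(42, values, weights)
    if err != nil {
        panic(err)
    }
    for i := 0; i < 10; i++ {
        fmt.Print(g.Random(), " ")
    }
    fmt.Println()
}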
2. Tests / benchmark
The first thing to do is to make sure that the code works as expected. Tests should check that the value weights are correctly taken into account, and that two generators with the same seed and the same values always generate the same sequence of values.
After that, we can add a benchmark to see how our solution performs, and more importantly how it scales:
package generator

import (
    "fmt"
    "testing"
    "time"
)

const (
    nbGeneration = 100000
    setSize      = 10
)
func TestGeneratorDistribution(t *testing.T) {
    g := newGenerator(t, time.Now().Unix(), setSize)
    numbers := map[float64]float64{}
    for i := 0; i < nbGeneration; i++ {
        r := g.Random()
        if _, ok := numbers[r]; !ok {
            numbers[r] = 0
        }
        numbers[r]++
    }
    startBound := float64(0)
    for index, value := range g.values {
        got := numbers[value]
        want := (g.bounds[index] - startBound) * nbGeneration
        delta := want * 5 / 100
        if got > want+delta || got < want-delta {
            t.Errorf("distribution not correct for value %f, expected %f +/- 5%%, but got %f", value, want, got)
        }
        startBound = g.bounds[index]
    }
}
func TestGeneratorSeeding(t *testing.T) {
    seed := int64(1)
    firstRun := generate(t, seed, nbGeneration)
    secondRun := generate(t, seed, nbGeneration)
    for i := 0; i < nbGeneration; i++ {
        if firstRun[i] != secondRun[i] {
            t.Errorf("expected same sequence of number, but at pos %d, got %f and %f", i, firstRun[i], secondRun[i])
        }
    }
}

func generate(t *testing.T, seed int64, size int) []float64 {
    g := newGenerator(t, seed, setSize)
    runVal := make([]float64, 0, size)
    for i := 0; i < size; i++ {
        runVal = append(runVal, g.Random())
    }
    return runVal
}
func newGenerator(t testing.TB, seed int64, size int) *Generator {
    values := make([]float64, 0, size)
    weight := make([]float64, 0, size)
    p := float64(1) / float64(size)
    for i := 0; i < size; i++ {
        values = append(values, float64(i))
        weight = append(weight, p)
    }
    g, err := NewGenerator(seed, values, weight)
    if err != nil {
        t.Error(err)
    }
    return g
}
func BenchmarkScaling(b *testing.B) {
    for size := 2; size <= 1024; size = size * 2 {
        name := fmt.Sprintf("numberSet_size_%d", size)
        b.Run(name, func(b *testing.B) {
            g := newGenerator(b, 1, size)
            b.ResetTimer()
            for n := 0; n < b.N; n++ {
                g.Random()
            }
        })
    }
}
From now on, we can check that we don't break anything when modifying the code, and we can accurately measure any performance gain.
3. Results
We can see a clear performance improvement with the new code:
old code:
BenchmarkScaling/numberSet_size_2-2 20000000 112 ns/op 41 B/op 0 allocs/op
BenchmarkScaling/numberSet_size_4-2 20000000 109 ns/op 41 B/op 0 allocs/op
BenchmarkScaling/numberSet_size_8-2 10000000 119 ns/op 42 B/op 0 allocs/op
BenchmarkScaling/numberSet_size_16-2 10000000 130 ns/op 42 B/op 0 allocs/op
BenchmarkScaling/numberSet_size_32-2 10000000 140 ns/op 42 B/op 0 allocs/op
BenchmarkScaling/numberSet_size_64-2 10000000 157 ns/op 42 B/op 0 allocs/op
BenchmarkScaling/numberSet_size_128-2 10000000 174 ns/op 42 B/op 0 allocs/op
BenchmarkScaling/numberSet_size_256-2 10000000 185 ns/op 42 B/op 0 allocs/op
BenchmarkScaling/numberSet_size_512-2 10000000 194 ns/op 42 B/op 0 allocs/op
BenchmarkScaling/numberSet_size_1024-2 10000000 207 ns/op 42 B/op 0 allocs/op
new code:
BenchmarkScaling/numberSet_size_2-2 50000000 33.8 ns/op 0 B/op 0 allocs/op
BenchmarkScaling/numberSet_size_4-2 30000000 41.4 ns/op 0 B/op 0 allocs/op
BenchmarkScaling/numberSet_size_8-2 30000000 51.7 ns/op 0 B/op 0 allocs/op
BenchmarkScaling/numberSet_size_16-2 20000000 64.4 ns/op 0 B/op 0 allocs/op
BenchmarkScaling/numberSet_size_32-2 20000000 76.7 ns/op 0 B/op 0 allocs/op
BenchmarkScaling/numberSet_size_64-2 20000000 91.3 ns/op 0 B/op 0 allocs/op
BenchmarkScaling/numberSet_size_128-2 20000000 106 ns/op 0 B/op 0 allocs/op
BenchmarkScaling/numberSet_size_256-2 10000000 116 ns/op 0 B/op 0 allocs/op
BenchmarkScaling/numberSet_size_512-2 10000000 126 ns/op 0 B/op 0 allocs/op
BenchmarkScaling/numberSet_size_1024-2 10000000 137 ns/op 0 B/op 0 allocs/op
There's still lots of room for improvement, but it should be a good start!
- Thank you so much for writing this all up! Quick question, why use fmt.Errorf rather than errors.New()? – Turtle, May 15, 2018 at 15:08
- +1 - I don't speak Go, but it's good to have a perspective on implementation details in the language as an answer. – Alex Reinking, May 15, 2018 at 16:16
- @Turtle It's really handy for error messages with dynamic values, and I personally find it easier to read, but that's just my opinion! – felix, May 16, 2018 at 9:41
- @felix you learn something new every day! Thank you so much for sharing! – Turtle, May 16, 2018 at 12:14
The general approach seems sensible here, of generating the CDF and selecting a uniform number to do the look up.
I also like the prevalence of comments.
The approach to allowing replays is probably what catches you out most. Storing the list is space and time inefficient. Instead, the conventional approach with a PRNG is to remember a fixed seed, and reseed with it if needed.
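A minimal sketch of that idea in Go (the seed value and the number of draws are illustrative): the seed is the only thing that has to be stored, and replay is simply reseeding and drawing again.

package main

import (
    "fmt"
    "math/rand"
)

func main() {
    const seed int64 = 42 // the seed is the only state stored for replay

    first := rand.New(rand.NewSource(seed))
    second := rand.New(rand.NewSource(seed))

    // Two generators seeded identically produce the identical sequence.
    for i := 0; i < 5; i++ {
        fmt.Println(first.Float64() == second.Float64()) // always true
    }
}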
Aside from that, tiny acronym variable names are generally discouraged, with preference for descriptive names.
if total > 100.00 {
That is fine, but I would suggest seeking consistency. You are not enforcing that the distribution sums all the way up to 100, so forbidding it from going over may be an unnecessary restriction. Note also that very slight floating point errors could trigger that failure case.
For a very slight micro-optimisation, consider pre-dividing the thresholds by the total rather than multiplying by the total for each random number generated. A sketch of that change follows.
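Here is a small, self-contained sketch of that idea; the type and function names are illustrative rather than a drop-in patch to the original struct. Normalization happens once at setup, so the hot path only calls rand.Float64() and does a binary search.

package main

import (
    "fmt"
    "math/rand"
    "sort"
)

// normalized holds cumulative weights pre-divided by the total, so sampling
// needs no multiplication by the total on every draw.
type normalized struct {
    values []float64
    bounds []float64 // cumulative weights, already scaled into [0, 1]
}

func newNormalized(values, weights []float64) normalized {
    total := 0.0
    for _, w := range weights {
        total += w
    }
    bounds := make([]float64, len(weights))
    sum := 0.0
    for i, w := range weights {
        sum += w
        bounds[i] = sum / total // divide once here, not on every draw
    }
    return normalized{values: values, bounds: bounds}
}

func (n normalized) pick(rng *rand.Rand) float64 {
    x := rng.Float64() // already in [0, 1), directly comparable to bounds
    i := sort.Search(len(n.bounds), func(i int) bool { return n.bounds[i] >= x })
    return n.values[i]
}

func main() {
    n := newNormalized([]float64{1, 2, 7}, []float64{0.25, 0.5, 0.25})
    rng := rand.New(rand.NewSource(1))
    fmt.Println(n.pick(rng), n.pick(rng), n.pick(rng))
}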
- I love your idea to avoid storing the replays! Thank you so much for the feedback! – Turtle, May 14, 2018 at 21:44
- Storing the PRNG seed is important, yes, but generating a CDF and doing binary search is extremely slow when billions of queries will be made to the generator. An O(1) approach should be used instead. – Alex Reinking, May 14, 2018 at 21:58