The Discrete Guide Table method samples from arbitrary, but finite,
probability vectors. It uses either a probability vector of size \(N\) or a
probability mass function with finite support to generate random
numbers from the distribution. The Discrete Guide Table method has a slow
setup (linear in the vector length) but provides very fast sampling.
Parameters:
dist : array_like or object, optional
Probability vector (PV) of the distribution. If a PV isn't available,
an instance of a class with a pmf method is expected. The signature
of the PMF is expected to be: def pmf(self, k: int) -> float, i.e. it
should accept a Python integer and return a Python float.
domain : int, optional
Support of the PMF. If a probability vector (pv) is not available, a
finite domain must be given, i.e. the PMF must have a finite support.
Default is None. When None:
If a support method is provided by the distribution object
dist, it is used to set the domain of the distribution.
Otherwise, the support is assumed to be (0,len(pv)). When this
parameter is passed in combination with a probability vector, domain[0]
is used to relocate the distribution from (0,len(pv)) to
(domain[0],domain[0]+len(pv)) and domain[1] is ignored. See Notes
and tutorial for a more detailed explanation.
guide_factor : int, optional
Size of the guide table relative to length of PV. Larger guide tables
result in faster generation time but require a more expensive setup.
Sizes larger than 3 are not recommended. If the relative size is set to
0, sequential search is used. Default is 1.
random_state : {None, int, numpy.random.Generator, numpy.random.RandomState}, optional
A NumPy random number generator or seed for the underlying NumPy random
number generator used to generate the stream of uniform random numbers.
If random_state is None (or np.random), the numpy.random.RandomState
singleton is used.
If random_state is an int, a new RandomState instance is used,
seeded with random_state.
If random_state is already a Generator or RandomState instance then
that instance is used.
Notes
This method works when either a finite probability vector is available or
the PMF of the distribution is available. If only the PMF is available,
the finite support (domain) of the PMF must also be given. It is
recommended to first obtain the probability vector by evaluating the PMF
at each point in the support and then use that vector instead.
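As an illustrative sketch of that recommendation, the probability vector can be precomputed by evaluating a PMF at each point of its finite support and normalizing (the binomial PMF and its parameters below are made-up example choices, written out with the standard library only):

```python
# Sketch: precompute a probability vector by evaluating a PMF on its
# finite support, then normalize. The binomial(10, 0.3) PMF here is an
# illustrative choice, not part of the method itself.
from math import comb

def binom_pmf(k: int, n: int = 10, p: float = 0.3) -> float:
    # Binomial PMF: C(n, k) * p^k * (1-p)^(n-k)
    return comb(n, k) * p**k * (1 - p) ** (n - k)

support = range(0, 11)                  # finite support of the PMF
pv = [binom_pmf(k) for k in support]    # evaluate PMF at each point
total = sum(pv)
pv = [x / total for x in pv]            # normalized probability vector
```

The resulting pv could then be passed directly, avoiding repeated PMF evaluations during setup.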
DGT samples from arbitrary but finite probability vectors. Random numbers
are generated by the inversion method, i.e.
1. Generate a random number U ~ U(0, 1).
2. Find the smallest integer I such that F(I) = P(X <= I) >= U.
Step (2) is the crucial step. Using sequential search requires O(E(X))
comparisons, where E(X) is the expectation of the distribution. Indexed
search, however, uses a guide table to jump to some I’ <= I near I to find
X in constant time. Indeed the expected number of comparisons is reduced to
2, when the guide table has the same size as the probability vector
(this is the default). For larger guide tables this number becomes smaller
(but is always larger than 1), for smaller tables it becomes larger. For the
limit case of table size 1 the algorithm simply does sequential search.
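The inversion and guide-table lookup described above can be sketched in pure Python (this is an illustrative reimplementation of the idea, not SciPy's actual C implementation; the variable names are made up):

```python
# Illustrative sketch of indexed (guide-table) search for discrete
# inversion. Not SciPy's implementation.
import random

pv = [0.1, 0.2, 0.4, 0.2, 0.1]   # probability vector
cdf = []                          # running sums F(I)
acc = 0.0
for p in pv:
    acc += p
    cdf.append(acc)
cdf[-1] = 1.0                     # guard against floating-point round-off

# Guide table of the same size as pv (i.e. guide_factor = 1): entry j
# stores the smallest I with F(I) >= j / size, so a lookup for
# u in [j/size, (j+1)/size) can start at or below the answer.
size = len(pv)
guide = []
i = 0
for j in range(size):
    while cdf[i] < j / size:
        i += 1
    guide.append(i)

def sample(u: float) -> int:
    """Inversion: smallest I with F(I) >= u, starting from the guide."""
    i = guide[min(int(u * size), size - 1)]
    while cdf[i] < u:
        i += 1
    return i

x = sample(random.random())
```

Because the guide entry is always at or below the target index, the trailing scan stays correct while usually needing only a comparison or two.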
On the other hand the setup time for guide table is O(N), where N denotes
the length of the probability vector (for size 1 no preprocessing is
required). Moreover, for very large guide tables memory effects might even
reduce the speed of the algorithm. So we do not recommend using guide
tables that are more than three times larger than the given probability
vector. If only a few random numbers have to be generated, (much) smaller
table sizes are better. The size of the guide table relative to the length
of the given probability vector can be set by the guide_factor parameter.
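The trade-off can be made concrete with an illustrative pure-Python measurement (again a sketch, not SciPy's implementation): count the average number of CDF comparisons per lookup for different relative table sizes on a fixed probability vector.

```python
# Illustrative sketch: average number of comparisons per lookup as a
# function of the guide-table size. Not SciPy's implementation.
def avg_comparisons(pv, guide_factor, us):
    cdf, acc = [], 0.0
    for p in pv:
        acc += p
        cdf.append(acc)
    cdf[-1] = 1.0                       # round-off guard
    size = max(1, int(guide_factor * len(pv)))
    guide, i = [], 0
    for j in range(size):               # O(N) setup
        while cdf[i] < j / size:
            i += 1
        guide.append(i)
    total = 0
    for u in us:
        i = guide[min(int(u * size), size - 1)]
        total += 1                      # comparison: cdf[i] >= u?
        while cdf[i] < u:
            i += 1
            total += 1
    return total / len(us)

pv = [1 / 64] * 64                      # uniform toy distribution
us = [(k + 0.5) / 1000 for k in range(1000)]
avg1 = avg_comparisons(pv, 1, us)       # table size == len(pv): ~2
avg3 = avg_comparisons(pv, 3, us)       # larger table: fewer comparisons
```

With the table the same size as the vector, the average sits near the expected value of 2 quoted above; tripling the table lowers it toward 1.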
If a probability vector is given, it must be a 1-dimensional array of
non-negative floats without any inf or nan values. Also, there
must be at least one non-zero entry; otherwise an exception is raised.
By default, the probability vector is indexed starting at 0. However, this
can be changed by passing a domain parameter. When domain is given
in combination with the PV, it has the effect of relocating the
distribution from (0,len(pv)) to (domain[0],domain[0]+len(pv)).
domain[1] is ignored in this case.
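A minimal sketch of the relocation effect (plain Python, with illustrative values):

```python
# Illustrative relocation: with a PV indexed from 0, a given domain
# shifts the support by domain[0]; domain[1] is ignored in this case.
pv = [0.25, 0.5, 0.25]
domain = (10, None)                    # only domain[0] matters with a PV

def relocate(i: int) -> int:
    # sampled index i in range(len(pv)) maps onto the shifted support
    return domain[0] + i

relocated_support = [relocate(i) for i in range(len(pv))]
# relocated_support == [10, 11, 12]
```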
In the chi-square test below, the p-value is very high, so we fail to
reject the null hypothesis that the observed frequencies are the same as
the expected frequencies. Hence, we can safely assume that the variates
have been generated from the given distribution. Note that this only
confirms the correctness of the algorithm, not the quality of the samples.
If a PV is not available, an instance of a class with a PMF method and a
finite domain can also be passed.
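A minimal sketch of such a distribution object (the class name, the pmf, and its parameter are illustrative; the interface follows the Parameters section above):

```python
# Illustrative distribution object exposing the expected interface:
# a pmf method taking a Python integer and returning a float, plus an
# optional support method returning the finite domain.
class Distribution:
    def __init__(self, c: int):
        self.c = c

    def pmf(self, k: int) -> float:
        # The PMF need not be normalized here; the probability vector
        # built from it is normalized over the support during setup.
        return float(k ** self.c)

    def support(self):
        # Finite support: integers 0 through 10 inclusive.
        return 0, 10

dist = Distribution(2)
```

Such an instance would then be passed as the dist parameter in place of a probability vector.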
Now, we can sample from the distribution using the rvs method
and also measure the goodness-of-fit of the samples:
>>> rvs = rng.rvs(1000)
>>> _, freqs = np.unique(rvs, return_counts=True)
>>> freqs = freqs / np.sum(freqs)
>>> obs_freqs = np.zeros(11)  # some frequencies may be zero.
>>> obs_freqs[:freqs.size] = freqs
>>> pv = [dist.pmf(i) for i in range(0, 11)]
>>> pv = np.asarray(pv) / np.sum(pv)
>>> chisquare(obs_freqs, pv).pvalue
0.9999999999999989
To check that the samples have been drawn from the correct distribution,
we can visualize the histogram of the samples: