org.apache.commons.math3.random
Class EmpiricalDistribution

java.lang.Object
  extended by org.apache.commons.math3.distribution.AbstractRealDistribution
      extended by org.apache.commons.math3.random.EmpiricalDistribution
All Implemented Interfaces:
Serializable, RealDistribution

public class EmpiricalDistribution
extends AbstractRealDistribution

Represents an empirical probability distribution -- a probability distribution derived from observed data without making any assumptions about the functional form of the population distribution that the data come from.

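An EmpiricalDistribution maintains data structures, called distribution digests, that describe empirical distributions and support the following operations: loading data from an array, a file, or a URL; reporting bin counts and per-bin statistics; reporting statistics for the full sample; and generating random values from the distribution.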

Applications can use EmpiricalDistribution to build grouped frequency histograms representing the input data or to generate random values "like" those in the input file -- i.e., the values generated will follow the distribution of the values in the file.

The implementation uses what amounts to the Variable Kernel Method with Gaussian smoothing:

Digesting the input file

  1. Pass the file once to compute min and max.
  2. Divide the range from min to max into binCount "bins."
  3. Pass the data file again, computing bin counts and univariate statistics (mean, std dev.) for each of the bins.
  4. Divide the interval (0,1) into subintervals associated with the bins, with the length of a bin's subinterval proportional to its count.

Generating random values from the distribution

  1. Generate a uniformly distributed value in (0,1).
  2. Select the subinterval to which the value belongs.
  3. Generate a random Gaussian value with mean = mean of the associated bin and std dev = std dev of associated bin.
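
The digest and generation steps above can be sketched in plain Java. This is an illustrative sketch only, not the library's source: the class name EmpiricalSketch, its fields, and the use of java.util.Random are assumptions made for the example, and it assumes equal-width bins and a non-empty in-memory array rather than a file.

```java
import java.util.Random;

/** Illustrative two-pass digest and sampler; not the library's source. */
final class EmpiricalSketch {
    final double min;
    final double delta;             // bin width (assumes equal-width bins)
    final long[] counts;            // observations per bin
    final double[] means;           // within-bin means
    final double[] stdDevs;         // within-bin sample standard deviations
    final double[] generatorBounds; // upper bounds of the (0,1) subintervals

    EmpiricalSketch(double[] data, int binCount) {
        // Pass 1: find min and max.
        double lo = Double.POSITIVE_INFINITY;
        double hi = Double.NEGATIVE_INFINITY;
        for (double v : data) {
            lo = Math.min(lo, v);
            hi = Math.max(hi, v);
        }
        min = lo;
        delta = (hi - lo) / binCount;

        // Pass 2: bin counts plus running mean and sum of squared
        // deviations (Welford's update) for each bin.
        counts = new long[binCount];
        means = new double[binCount];
        stdDevs = new double[binCount];
        double[] m2 = new double[binCount];
        for (double v : data) {
            int b = binOf(v);
            counts[b]++;
            double d = v - means[b];
            means[b] += d / counts[b];
            m2[b] += d * (v - means[b]);
        }

        // Subintervals of (0,1) with lengths proportional to the bin counts.
        generatorBounds = new double[binCount];
        long running = 0;
        for (int i = 0; i < binCount; i++) {
            stdDevs[i] = counts[i] > 1 ? Math.sqrt(m2[i] / (counts[i] - 1)) : 0.0;
            running += counts[i];
            generatorBounds[i] = (double) running / data.length;
        }
        generatorBounds[binCount - 1] = 1.0;
    }

    /** Index of the bin containing v; bins are closed on the right. */
    int binOf(double v) {
        int b = (int) Math.ceil((v - min) / delta) - 1;
        return Math.min(Math.max(b, 0), counts.length - 1);
    }

    /** Picks a bin with a uniform draw, then draws a Gaussian within it. */
    double sample(Random rng) {
        double u = rng.nextDouble();
        int i = 0;
        while (u > generatorBounds[i]) {
            i++;
        }
        return means[i] + stdDevs[i] * rng.nextGaussian();
    }
}
```

A bin holding a single observation has a standard deviation of zero in this sketch, so sampling from it returns the bin mean.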

EmpiricalDistribution implements the RealDistribution interface as follows. Given x within the range of values in the dataset, let B be the bin containing x and let K be the within-bin kernel for B. Let P(B-) be the sum of the probabilities of the bins below B and let K(B) be the mass of B under K (i.e., the integral of the kernel density over B). Then set P(X < x) = P(B-) + P(B) * [K(x) - K(B-)] / K(B), where K(x) is the kernel distribution function evaluated at x and K(B-) is its value at the lower endpoint of B. This results in a cdf that matches the grouped frequency distribution at the bin endpoints and interpolates within bins using within-bin kernels.
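
As a worked check of the endpoint claim, the cdf can be written in the form used by the cumulativeProbability algorithm description. The symbols \ell_B and u_B for the lower and upper endpoints of B are notation assumed here for illustration; they do not appear in the class documentation.

```latex
\[
P(X \le x) = P(B^-) + P(B)\,\frac{K(x) - K(B^-)}{K(B)},
\qquad K(B^-) = K(\ell_B), \quad K(B) = K(u_B) - K(\ell_B)
\]
\[
x = \ell_B:\; P(X \le \ell_B) = P(B^-),
\qquad
x = u_B:\; P(X \le u_B) = P(B^-) + P(B)
\]
```

At the lower endpoint the fraction is 0 and at the upper endpoint it is 1, so the cdf takes the grouped-frequency values at both ends of every bin.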

USAGE NOTES:

Version:
$Id: EmpiricalDistribution.java 1457372 2013-03-17 04:28:04Z psteitz $
See Also:
Serialized Form

Nested Class Summary
private  class EmpiricalDistribution.ArrayDataAdapter
          DataAdapter for data provided as array of doubles.
private  class EmpiricalDistribution.DataAdapter
          Provides methods for computing sampleStats and binStats abstracting the source of data.
private  class EmpiricalDistribution.StreamDataAdapter
          DataAdapter for data provided through some input stream
 
Field Summary
private  int binCount
          number of bins
private  List<SummaryStatistics> binStats
          List of SummaryStatistics objects characterizing the bins
static int DEFAULT_BIN_COUNT
          Default bin count
private  double delta
          Grid size
private static String FILE_CHARSET
          Character set for file input
private  boolean loaded
          is the distribution loaded?
private  double max
          Max loaded value
private  double min
          Min loaded value
protected  RandomDataGenerator randomData
          RandomDataGenerator instance to use in repeated calls to getNextValue()
private  SummaryStatistics sampleStats
          Sample statistics
private static long serialVersionUID
          Serializable version identifier
private  double[] upperBounds
          upper bounds of subintervals in (0,1) "belonging" to the bins
 
Fields inherited from class org.apache.commons.math3.distribution.AbstractRealDistribution
random, SOLVER_DEFAULT_ABSOLUTE_ACCURACY
 
Constructor Summary
  EmpiricalDistribution()
          Creates a new EmpiricalDistribution with the default bin count.
  EmpiricalDistribution(int binCount)
          Creates a new EmpiricalDistribution with the specified bin count.
private EmpiricalDistribution(int binCount, RandomDataGenerator randomData)
          Private constructor to allow lazy initialisation of the RNG contained in the randomData instance variable.
  EmpiricalDistribution(int binCount, RandomDataImpl randomData)
          Deprecated. As of 3.1. Please use EmpiricalDistribution(int,RandomGenerator) instead.
  EmpiricalDistribution(int binCount, RandomGenerator generator)
          Creates a new EmpiricalDistribution with the specified bin count using the provided RandomGenerator as the source of random data.
  EmpiricalDistribution(RandomDataImpl randomData)
          Deprecated. As of 3.1. Please use EmpiricalDistribution(RandomGenerator) instead.
  EmpiricalDistribution(RandomGenerator generator)
          Creates a new EmpiricalDistribution with default bin count using the provided RandomGenerator as the source of random data.
 
Method Summary
private  double cumBinP(int binIndex)
          The combined probability of the bins up to and including binIndex.
 double cumulativeProbability(double x)
          For a random variable X whose values are distributed according to this distribution, this method returns P(X <= x).
 double density(double x)
          Returns the probability density function (PDF) of this distribution evaluated at the specified point x.
private  void fillBinStats(EmpiricalDistribution.DataAdapter da)
          Fills binStats array (second pass through data file).
private  int findBin(double value)
          Returns the index of the bin to which the given value belongs
 int getBinCount()
          Returns the number of bins.
 List<SummaryStatistics> getBinStats()
          Returns a List of SummaryStatistics instances containing statistics describing the values in each of the bins.
 double[] getGeneratorUpperBounds()
          Returns a fresh copy of the array of upper bounds of the subintervals of [0,1] used in generating data from the empirical distribution.
protected  RealDistribution getKernel(SummaryStatistics bStats)
          The within-bin smoothing kernel.
 double getNextValue()
          Generates a random value from this distribution.
 double getNumericalMean()
          Use this method to get the numerical value of the mean of this distribution.
 double getNumericalVariance()
          Use this method to get the numerical value of the variance of this distribution.
 StatisticalSummary getSampleStats()
          Returns a StatisticalSummary describing this distribution.
 double getSupportLowerBound()
          Access the lower bound of the support.
 double getSupportUpperBound()
          Access the upper bound of the support.
 double[] getUpperBounds()
          Returns a fresh copy of the array of upper bounds for the bins.
 double inverseCumulativeProbability(double p)
          Computes the quantile function of this distribution.
 boolean isLoaded()
          Property indicating whether or not the distribution has been loaded.
 boolean isSupportConnected()
          Use this method to get information about whether the support is connected, i.e. whether all values between the lower and upper bound of the support are included in the support.
 boolean isSupportLowerBoundInclusive()
          Whether or not the lower bound of support is in the domain of the density function.
 boolean isSupportUpperBoundInclusive()
          Whether or not the upper bound of support is in the domain of the density function.
private  RealDistribution k(double x)
          The within-bin kernel of the bin that x belongs to.
private  double kB(int i)
          Mass of bin i under the within-bin kernel of the bin.
 void load(double[] in)
          Computes the empirical distribution from the provided array of numbers.
 void load(File file)
          Computes the empirical distribution from the input file.
 void load(URL url)
          Computes the empirical distribution using data read from a URL.
private  double pB(int i)
          The probability of bin i.
private  double pBminus(int i)
          The combined probability of the bins up to but not including bin i.
 double probability(double x)
          For a random variable X whose values are distributed according to this distribution, this method returns P(X = x).
 void reSeed(long seed)
          Reseeds the random number generator used by getNextValue().
 void reseedRandomGenerator(long seed)
          Reseed the random generator used to generate samples.
 double sample()
          Generate a random value sampled from this distribution.
 
Methods inherited from class org.apache.commons.math3.distribution.AbstractRealDistribution
cumulativeProbability, getSolverAbsoluteAccuracy, probability, sample
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_BIN_COUNT

public static final int DEFAULT_BIN_COUNT
Default bin count

See Also:
Constant Field Values

FILE_CHARSET

private static final String FILE_CHARSET
Character set for file input

See Also:
Constant Field Values

serialVersionUID

private static final long serialVersionUID
Serializable version identifier

See Also:
Constant Field Values

randomData

protected final RandomDataGenerator randomData
RandomDataGenerator instance to use in repeated calls to getNextValue()


binStats

private final List<SummaryStatistics> binStats
List of SummaryStatistics objects characterizing the bins


sampleStats

private SummaryStatistics sampleStats
Sample statistics


max

private double max
Max loaded value


min

private double min
Min loaded value


delta

private double delta
Grid size


binCount

private final int binCount
number of bins


loaded

private boolean loaded
is the distribution loaded?


upperBounds

private double[] upperBounds
upper bounds of subintervals in (0,1) "belonging" to the bins

Constructor Detail

EmpiricalDistribution

public EmpiricalDistribution()
Creates a new EmpiricalDistribution with the default bin count.


EmpiricalDistribution

public EmpiricalDistribution(int binCount)
Creates a new EmpiricalDistribution with the specified bin count.

Parameters:
binCount - number of bins

EmpiricalDistribution

public EmpiricalDistribution(int binCount,
                             RandomGenerator generator)
Creates a new EmpiricalDistribution with the specified bin count using the provided RandomGenerator as the source of random data.

Parameters:
binCount - number of bins
generator - random data generator (may be null, resulting in default JDK generator)
Since:
3.0

EmpiricalDistribution

public EmpiricalDistribution(RandomGenerator generator)
Creates a new EmpiricalDistribution with default bin count using the provided RandomGenerator as the source of random data.

Parameters:
generator - random data generator (may be null, resulting in default JDK generator)
Since:
3.0

EmpiricalDistribution

@Deprecated
public EmpiricalDistribution(int binCount,
                                        RandomDataImpl randomData)
Deprecated. As of 3.1. Please use EmpiricalDistribution(int,RandomGenerator) instead.

Creates a new EmpiricalDistribution with the specified bin count using the provided RandomDataImpl instance as the source of random data.

Parameters:
binCount - number of bins
randomData - random data generator (may be null, resulting in default JDK generator)
Since:
3.0

EmpiricalDistribution

@Deprecated
public EmpiricalDistribution(RandomDataImpl randomData)
Deprecated. As of 3.1. Please use EmpiricalDistribution(RandomGenerator) instead.

Creates a new EmpiricalDistribution with default bin count using the provided RandomDataImpl as the source of random data.

Parameters:
randomData - random data generator (may be null, resulting in default JDK generator)
Since:
3.0

EmpiricalDistribution

private EmpiricalDistribution(int binCount,
                              RandomDataGenerator randomData)
Private constructor to allow lazy initialisation of the RNG contained in the randomData instance variable.

Parameters:
binCount - number of bins
randomData - Random data generator.
Method Detail

load

public void load(double[] in)
          throws NullArgumentException
Computes the empirical distribution from the provided array of numbers.

Parameters:
in - the input data array
Throws:
NullArgumentException - if in is null

load

public void load(URL url)
          throws IOException,
                 NullArgumentException,
                 ZeroException
Computes the empirical distribution using data read from a URL.

The input file must be an ASCII text file containing one valid numeric entry per line.

Parameters:
url - url of the input file
Throws:
IOException - if an IO error occurs
NullArgumentException - if url is null
ZeroException - if URL contains no data

load

public void load(File file)
          throws IOException,
                 NullArgumentException
Computes the empirical distribution from the input file.

The input file must be an ASCII text file containing one valid numeric entry per line.

Parameters:
file - the input file
Throws:
IOException - if an IO error occurs
NullArgumentException - if file is null

fillBinStats

private void fillBinStats(EmpiricalDistribution.DataAdapter da)
                   throws IOException
Fills binStats array (second pass through data file).

Parameters:
da - object providing access to the data
Throws:
IOException - if an IO error occurs

findBin

private int findBin(double value)
Returns the index of the bin to which the given value belongs

Parameters:
value - the value whose bin we are trying to find
Returns:
the index of the bin containing the value

getNextValue

public double getNextValue()
                    throws MathIllegalStateException
Generates a random value from this distribution. Precondition: the distribution must be loaded before invoking this method.

Returns:
the random value.
Throws:
MathIllegalStateException - if the distribution has not been loaded

getSampleStats

public StatisticalSummary getSampleStats()
Returns a StatisticalSummary describing this distribution. Precondition: the distribution must be loaded before invoking this method.

Returns:
the sample statistics
Throws:
IllegalStateException - if the distribution has not been loaded

getBinCount

public int getBinCount()
Returns the number of bins.

Returns:
the number of bins.

getBinStats

public List<SummaryStatistics> getBinStats()
Returns a List of SummaryStatistics instances containing statistics describing the values in each of the bins. The list is indexed on the bin number.

Returns:
List of bin statistics.

getUpperBounds

public double[] getUpperBounds()

Returns a fresh copy of the array of upper bounds for the bins. Bins are:
[min,upperBounds[0]],(upperBounds[0],upperBounds[1]],..., (upperBounds[binCount-2], upperBounds[binCount-1] = max].

Note: In versions 1.0-2.0 of commons-math, this method incorrectly returned the array of probability generator upper bounds now returned by getGeneratorUpperBounds().
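
The bin layout described above can be sketched as follows. This is a minimal illustration that assumes equal-width bins; the class BinBounds and its method names are hypothetical and are not part of the library.

```java
/** Illustrative bin-boundary helpers; names are hypothetical. */
final class BinBounds {
    /** Upper bounds of binCount equal-width bins spanning [min, max]. */
    static double[] upperBounds(double min, double max, int binCount) {
        double delta = (max - min) / binCount;
        double[] ub = new double[binCount];
        for (int i = 0; i < binCount - 1; i++) {
            ub[i] = min + delta * (i + 1);
        }
        ub[binCount - 1] = max; // last bound is exactly max
        return ub;
    }

    /** Index of the first bin whose upper bound is at least value. */
    static int findBin(double value, double[] ub) {
        for (int i = 0; i < ub.length; i++) {
            if (value <= ub[i]) {
                return i;
            }
        }
        return ub.length - 1; // values above max fall in the last bin
    }
}
```

Because each bin is closed on the right, a value equal to upperBounds[i] belongs to bin i rather than bin i+1.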

Returns:
array of bin upper bounds
Since:
2.1

getGeneratorUpperBounds

public double[] getGeneratorUpperBounds()

Returns a fresh copy of the array of upper bounds of the subintervals of [0,1] used in generating data from the empirical distribution. Subintervals correspond to bins with lengths proportional to bin counts.

In versions 1.0-2.0 of commons-math, this array was (incorrectly) returned by getUpperBounds().

Returns:
array of upper bounds of subintervals used in data generation
Since:
2.1

isLoaded

public boolean isLoaded()
Property indicating whether or not the distribution has been loaded.

Returns:
true if the distribution has been loaded

reSeed

public void reSeed(long seed)
Reseeds the random number generator used by getNextValue().

Parameters:
seed - random generator seed
Since:
3.0

probability

public double probability(double x)
For a random variable X whose values are distributed according to this distribution, this method returns P(X = x). In other words, this method represents the probability mass function (PMF) for the distribution.

Specified by:
probability in interface RealDistribution
Overrides:
probability in class AbstractRealDistribution
Parameters:
x - the point at which the PMF is evaluated
Returns:
zero.
Since:
3.1

density

public double density(double x)
Returns the probability density function (PDF) of this distribution evaluated at the specified point x. In general, the PDF is the derivative of the CDF. If the derivative does not exist at x, then an appropriate replacement should be returned, e.g. Double.POSITIVE_INFINITY, Double.NaN, or the limit inferior or limit superior of the difference quotient.

Returns the kernel density normalized so that its integral over each bin equals the bin mass.

Algorithm description:

  1. Find the bin B that x belongs to.
  2. Compute K(B) = the mass of B with respect to the within-bin kernel (i.e., the integral of the kernel density over B).
  3. Return k(x) * P(B) / K(B), where k is the within-bin kernel density and P(B) is the mass of B.
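
The three steps can be sketched for a single bin once the bin has been located. This is an illustration under stated assumptions, not the library's code: the class BinDensity is hypothetical, and the kernel density and kernel distribution function are passed in as plain functions so that the sketch does not depend on any particular kernel.

```java
import java.util.function.DoubleUnaryOperator;

/** Illustrative within-bin density; names are hypothetical. */
final class BinDensity {
    /**
     * @param x         point inside the bin (lower, upper]
     * @param lower     lower endpoint of the bin B
     * @param upper     upper endpoint of the bin B
     * @param pBin      P(B), the probability mass of the bin
     * @param kernelPdf k, the within-bin kernel density
     * @param kernelCdf K, the within-bin kernel distribution function
     */
    static double density(double x, double lower, double upper, double pBin,
                          DoubleUnaryOperator kernelPdf,
                          DoubleUnaryOperator kernelCdf) {
        // K(B): mass of the bin under the kernel.
        double kB = kernelCdf.applyAsDouble(upper) - kernelCdf.applyAsDouble(lower);
        // Rescale the kernel density so it integrates to P(B) over the bin.
        return kernelPdf.applyAsDouble(x) * pBin / kB;
    }
}
```

For example, with a uniform kernel on [-1, 3] (density 1/4), the bin [0, 2] has kernel mass 1/2, so a bin with P(B) = 0.3 gets density 0.25 * 0.3 / 0.5 = 0.15, which integrates to 0.3 over the bin.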

Parameters:
x - the point at which the PDF is evaluated
Returns:
the value of the probability density function at point x
Since:
3.1

cumulativeProbability

public double cumulativeProbability(double x)
For a random variable X whose values are distributed according to this distribution, this method returns P(X <= x). In other words, this method represents the (cumulative) distribution function (CDF) for this distribution.

Algorithm description:

  1. Find the bin B that x belongs to.
  2. Compute P(B) = the mass of B and P(B-) = the combined mass of the bins below B.
  3. Compute K(B) = the probability mass of B with respect to the within-bin kernel and K(B-) = the kernel distribution evaluated at the lower endpoint of B
  4. Return P(B-) + P(B) * [K(x) - K(B-)] / K(B) where K(x) is the within-bin kernel distribution function evaluated at x.
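
Steps 2 through 4 can be sketched for one bin as below. The class BinCdf and its parameter names are hypothetical, and the kernel distribution function is supplied by the caller; this is a sketch of the formula, not the library's implementation.

```java
import java.util.function.DoubleUnaryOperator;

/** Illustrative within-bin cdf interpolation; names are hypothetical. */
final class BinCdf {
    /**
     * @param x         point inside the bin (lower, upper]
     * @param lower     lower endpoint of the bin B
     * @param upper     upper endpoint of the bin B
     * @param pBelow    P(B-), combined mass of the bins below B
     * @param pBin      P(B), mass of the bin B
     * @param kernelCdf K, the within-bin kernel distribution function
     */
    static double cdf(double x, double lower, double upper,
                      double pBelow, double pBin,
                      DoubleUnaryOperator kernelCdf) {
        double kBminus = kernelCdf.applyAsDouble(lower);      // K(B-)
        double kB = kernelCdf.applyAsDouble(upper) - kBminus; // K(B)
        return pBelow + pBin * (kernelCdf.applyAsDouble(x) - kBminus) / kB;
    }
}
```

Whatever kernel is supplied, the result equals P(B-) at the lower endpoint and P(B-) + P(B) at the upper endpoint, so the interpolated cdf agrees with the grouped frequency distribution at bin boundaries.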

Parameters:
x - the point at which the CDF is evaluated
Returns:
the probability that a random variable with this distribution takes a value less than or equal to x
Since:
3.1

inverseCumulativeProbability

public double inverseCumulativeProbability(double p)
                                    throws OutOfRangeException
Computes the quantile function of this distribution. For a random variable X distributed according to this distribution, the returned value is

inf {x in R | P(X <= x) >= p} for 0 < p <= 1,
inf {x in R | P(X <= x) > 0} for p = 0.

Algorithm description:

  1. Find the smallest i such that the sum of the masses of the bins through i is at least p.
  2. Let B be bin i and let K be the within-bin kernel distribution for B.
    Let K(B) be the mass of B under K.
    Let K(B-) be K evaluated at the lower endpoint of B (the combined mass of the bins below B under K).
    Let P(B) be the probability of bin i.
    Let P(B-) be the sum of the bin masses below bin i.
    Let pCrit = p - P(B-)
  3. Return the inverse of K evaluated at
    K(B-) + pCrit * K(B) / P(B)
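
The quantile steps can be sketched as follows. The class BinQuantile is hypothetical, and the kernel distribution function and its inverse are passed in by the caller rather than taken from the library, so this shows only the arithmetic of the algorithm description.

```java
import java.util.function.DoubleUnaryOperator;

/** Illustrative quantile computation; names are hypothetical. */
final class BinQuantile {
    /** Step 1: smallest index i whose cumulative bin mass is at least p. */
    static int findBin(double p, double[] binProbs) {
        double cum = 0.0;
        for (int i = 0; i < binProbs.length; i++) {
            cum += binProbs[i];
            if (cum >= p) {
                return i;
            }
        }
        return binProbs.length - 1; // guards against rounding when p is 1
    }

    /** Steps 2 and 3 for the bin found in step 1. */
    static double quantileInBin(double p, double lower, double upper,
                                double pBelow, double pBin,
                                DoubleUnaryOperator kernelCdf,
                                DoubleUnaryOperator kernelInverseCdf) {
        double kBminus = kernelCdf.applyAsDouble(lower);      // K(B-)
        double kB = kernelCdf.applyAsDouble(upper) - kBminus; // K(B)
        double pCrit = p - pBelow;                            // mass needed inside B
        return kernelInverseCdf.applyAsDouble(kBminus + pCrit * kB / pBin);
    }
}
```

When p equals P(B-) the argument reduces to K(B-) and the result is the lower endpoint of the bin; when p equals P(B-) + P(B) it reduces to the kernel cdf at the upper endpoint, which makes this the inverse of the within-bin cdf interpolation.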

Specified by:
inverseCumulativeProbability in interface RealDistribution
Overrides:
inverseCumulativeProbability in class AbstractRealDistribution
Parameters:
p - the cumulative probability
Returns:
the smallest p-quantile of this distribution (largest 0-quantile for p = 0)
Throws:
OutOfRangeException - if p < 0 or p > 1
Since:
3.1

getNumericalMean

public double getNumericalMean()
Use this method to get the numerical value of the mean of this distribution.

Returns:
the mean or Double.NaN if it is not defined
Since:
3.1

getNumericalVariance

public double getNumericalVariance()
Use this method to get the numerical value of the variance of this distribution.

Returns:
the variance (possibly Double.POSITIVE_INFINITY as for certain cases in TDistribution) or Double.NaN if it is not defined
Since:
3.1

getSupportLowerBound

public double getSupportLowerBound()
Access the lower bound of the support. This method must return the same value as inverseCumulativeProbability(0). In other words, this method must return

inf {x in R | P(X <= x) > 0}.

Returns:
lower bound of the support (might be Double.NEGATIVE_INFINITY)
Since:
3.1

getSupportUpperBound

public double getSupportUpperBound()
Access the upper bound of the support. This method must return the same value as inverseCumulativeProbability(1). In other words, this method must return

inf {x in R | P(X <= x) = 1}.

Returns:
upper bound of the support (might be Double.POSITIVE_INFINITY)
Since:
3.1

isSupportLowerBoundInclusive

public boolean isSupportLowerBoundInclusive()
Whether or not the lower bound of support is in the domain of the density function. Returns true iff getSupportLowerBound() is finite and density(getSupportLowerBound()) returns a non-NaN, non-infinite value.

Returns:
true if the lower bound of support is finite and the density function returns a non-NaN, non-infinite value there
Since:
3.1

isSupportUpperBoundInclusive

public boolean isSupportUpperBoundInclusive()
Whether or not the upper bound of support is in the domain of the density function. Returns true iff getSupportUpperBound() is finite and density(getSupportUpperBound()) returns a non-NaN, non-infinite value.

Returns:
true if the upper bound of support is finite and the density function returns a non-NaN, non-infinite value there
Since:
3.1

isSupportConnected

public boolean isSupportConnected()
Use this method to get information about whether the support is connected, i.e. whether all values between the lower and upper bound of the support are included in the support.

Returns:
whether the support is connected or not
Since:
3.1

sample

public double sample()
Generate a random value sampled from this distribution. The default implementation uses the inversion method.

Specified by:
sample in interface RealDistribution
Overrides:
sample in class AbstractRealDistribution
Returns:
a random value.
Since:
3.1

reseedRandomGenerator

public void reseedRandomGenerator(long seed)
Reseed the random generator used to generate samples.

Specified by:
reseedRandomGenerator in interface RealDistribution
Overrides:
reseedRandomGenerator in class AbstractRealDistribution
Parameters:
seed - the new seed
Since:
3.1

pB

private double pB(int i)
The probability of bin i.

Parameters:
i - the index of the bin
Returns:
the probability that selection begins in bin i

pBminus

private double pBminus(int i)
The combined probability of the bins up to but not including bin i.

Parameters:
i - the index of the bin
Returns:
the probability that selection begins in a bin below bin i.

kB

private double kB(int i)
Mass of bin i under the within-bin kernel of the bin.

Parameters:
i - index of the bin
Returns:
the difference in the within-bin kernel cdf between the upper and lower endpoints of bin i

k

private RealDistribution k(double x)
The within-bin kernel of the bin that x belongs to.

Parameters:
x - the value to locate within a bin
Returns:
the within-bin kernel of the bin containing x

cumBinP

private double cumBinP(int binIndex)
The combined probability of the bins up to and including binIndex.

Parameters:
binIndex - maximum bin index
Returns:
sum of the probabilities of bins through binIndex

getKernel

protected RealDistribution getKernel(SummaryStatistics bStats)
The within-bin smoothing kernel.

Parameters:
bStats - summary statistics for the bin
Returns:
within-bin kernel parameterized by bStats


Copyright (c) 2003-2013 Apache Software Foundation