distribution_fit

hana_ml.algorithms.pal.stats.distribution_fit(data, distr_type, optimal_method=None, censored=False)

This function fits a probability distribution to a variable, based on a series of measurements of that variable. Many probability distributions exist, and some fit the observed data more closely than others.

Parameters
data : DataFrame

DataFrame containing the data.

distr_type : {'exponential', 'gamma', 'normal', 'poisson', 'uniform', 'weibull'}

Specifies the type of distribution to fit.

optimal_method : {'maximum_likelihood', 'median_rank'}, optional

Specifies the estimation method.

Defaults to 'median_rank' when distr_type is 'weibull', 'maximum_likelihood' otherwise.

censored : bool, optional

Specifies whether the data is censored.

Only valid when distr_type is 'weibull'.

Defaults to False.
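
For intuition about the 'median_rank' option, the following is a minimal pure-Python sketch of median rank regression for a Weibull distribution. It is not PAL's implementation: the helper name weibull_median_rank_fit, the use of Bernard's approximation (i - 0.3) / (n + 0.4) for the median ranks, and the choice of regressing the transformed ranks on ln(x) are assumptions made for illustration, so its output need not match what PAL returns.

```python
import math


def weibull_median_rank_fit(samples):
    """Estimate Weibull (shape, scale) by median rank regression.

    Hypothetical helper for illustration only. Assumes uncensored data,
    Bernard's approximation for the median ranks, and ordinary least squares.
    """
    xs = sorted(float(x) for x in samples)
    n = len(xs)
    if n < 2 or xs[0] <= 0:
        raise ValueError("need at least two positive observations")

    # Linearised Weibull CDF:
    #   ln(-ln(1 - F(x))) = shape * ln(x) - shape * ln(scale)
    u = [math.log(x) for x in xs]
    v = [
        math.log(-math.log(1.0 - (i - 0.3) / (n + 0.4)))
        for i in range(1, n + 1)
    ]

    u_mean = sum(u) / n
    v_mean = sum(v) / n
    sxx = sum((a - u_mean) ** 2 for a in u)
    if sxx == 0:
        raise ValueError("observations must not all be equal")
    sxy = sum((a - u_mean) * (b - v_mean) for a, b in zip(u, v))

    shape = sxy / sxx                      # slope of the fitted line
    intercept = v_mean - shape * u_mean    # equals -shape * ln(scale)
    scale = math.exp(-intercept / shape)
    return shape, scale


# Illustrative uncensored sample.
sample = [71, 83, 92, 104, 120, 134, 138, 146, 181, 191,
          206, 226, 276, 283, 291, 332, 351, 401, 466]
mr_shape, mr_scale = weibull_median_rank_fit(sample)
print(f"shape={mr_shape:.4f}, scale={mr_scale:.2f}")
```

The slope of the fitted line is the shape estimate, and the scale follows from the intercept. Censored data would need adjusted ranks, which this sketch does not cover.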

Returns
DataFrames

Two DataFrames are returned. The first holds the fitting results, structured as follows:

  • NAME: name of distribution parameters.

  • VALUE: value of distribution parameters.

Fitting statistics (the second DataFrame), structured as follows:

  • STAT_NAME: name of statistics.

  • STAT_VALUE: value of statistics.
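
Because the fitting results arrive as NAME/VALUE rows, it can be convenient to turn them into a dictionary. The sketch below assumes that VALUE is delivered as text (it holds both a distribution name and numbers) and uses a hypothetical helper, params_to_dict, which is not part of hana_ml. With a collected pandas DataFrame, the rows could be obtained with res.collect().itertuples(index=False).

```python
def params_to_dict(rows):
    """Turn (NAME, VALUE) rows into a dict, converting numeric values to float.

    Hypothetical helper for illustration; not part of hana_ml.
    """
    params = {}
    for name, value in rows:
        try:
            params[name] = float(value)
        except (TypeError, ValueError):
            # Non-numeric entries, such as the distribution name, stay as-is.
            params[name] = value
    return params


# Rows shaped like the fitting results (NAME, VALUE).
rows = [
    ("DISTRIBUTIONNAME", "WEIBULL"),
    ("SCALE", "244.4"),
    ("SHAPE", "2.06698"),
]
fitted = params_to_dict(rows)
print(fitted["DISTRIBUTIONNAME"], fitted["SCALE"], fitted["SHAPE"])
```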

Examples

Original data:

>>> df.collect()
     DATA
0    71.0
1    83.0
2    92.0
3   104.0
4   120.0
5   134.0
6   138.0
7   146.0
8   181.0
9   191.0
10  206.0
11  226.0
12  276.0
13  283.0
14  291.0
15  332.0
16  351.0
17  401.0
18  466.0

Fit a Weibull distribution using maximum likelihood estimation:

>>> res, stats = distribution_fit(data=df, distr_type='weibull', optimal_method='maximum_likelihood')
>>> res.collect()
               NAME    VALUE
0  DISTRIBUTIONNAME  WEIBULL
1             SCALE    244.4
2             SHAPE  2.06698
>>> stats.collect()
Empty DataFrame
Columns: [STAT_NAME, STAT_VALUE]
Index: []
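
To see what a maximum likelihood Weibull fit computes, the following pure-Python sketch solves the standard likelihood equation for the shape by bisection and then derives the scale. It is an illustration under stated assumptions, not PAL's implementation: the helper name weibull_mle_fit, the bisection solver, and the tolerance are all choices made here, and only uncensored data is handled. Its output can be compared with the SCALE and SHAPE rows above.

```python
import math


def weibull_mle_fit(samples, tol=1e-12):
    """Maximum likelihood estimate of Weibull (shape, scale), uncensored data.

    Hypothetical helper for illustration only.
    """
    xs = [float(x) for x in samples]
    n = len(xs)
    if n < 2 or min(xs) <= 0 or min(xs) == max(xs):
        raise ValueError("need at least two distinct positive observations")

    x_max = max(xs)
    logs = [math.log(x) for x in xs]
    mean_log = sum(logs) / n

    def score(k):
        # Likelihood equation for the shape k, after eliminating the scale:
        #   sum(x^k ln x) / sum(x^k) - 1/k - mean(ln x) = 0
        # Dividing x by x_max leaves the ratio unchanged and avoids overflow.
        w = [(x / x_max) ** k for x in xs]
        return sum(wi * li for wi, li in zip(w, logs)) / sum(w) - 1.0 / k - mean_log

    # score() increases with k, so bracket the root and bisect.
    lo, hi = 1e-3, 1.0
    while score(hi) < 0:
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if score(mid) < 0:
            lo = mid
        else:
            hi = mid

    shape = 0.5 * (lo + hi)
    scale = x_max * (sum((x / x_max) ** shape for x in xs) / n) ** (1.0 / shape)
    return shape, scale


# Same values as the DATA column in the example.
data = [71, 83, 92, 104, 120, 134, 138, 146, 181, 191,
        206, 226, 276, 283, 291, 332, 351, 401, 466]
shape, scale = weibull_mle_fit(data)
print(f"shape={shape:.5f}, scale={scale:.1f}")
```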