distribution_fit

hana_ml.algorithms.pal.stats.distribution_fit(data, distr_type, optimal_method=None, censored=False)

This algorithm fits a probability distribution to a variable, based on a series of measurements of that variable. Many probability distributions exist, and some fit the observed data more closely than others.

Parameters:
data : DataFrame

DataFrame containing the data.

distr_type : {'exponential', 'gamma', 'normal', 'poisson', 'uniform', 'weibull'}

Specify the type of distribution to fit.

optimal_method : {'maximum_likelihood', 'median_rank'}, optional

Specifies the estimation method.

Defaults to 'median_rank' when distr_type is 'weibull', 'maximum_likelihood' otherwise.

censored : bool, optional

Specifies whether the data is censored or not.

Only valid when distr_type is 'weibull'.

Defaults to False.
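
For intuition about the 'median_rank' option, the following is a minimal sketch of textbook median rank regression for a Weibull distribution. It is not PAL's implementation, and PAL may differ in detail. The function name weibull_median_rank and the use of Benard's approximation for the median ranks are assumptions made for illustration only.

```python
import math


def weibull_median_rank(samples):
    """Textbook median rank regression for a two-parameter Weibull.

    Illustrative sketch only; not the PAL implementation.
    """
    xs = sorted(float(x) for x in samples)
    n = len(xs)

    # Benard's approximation to the median rank of the i-th ordered sample.
    us, vs = [], []
    for i, x in enumerate(xs, start=1):
        f = (i - 0.3) / (n + 0.4)
        us.append(math.log(x))                        # ln(x)
        vs.append(math.log(-math.log(1.0 - f)))       # ln(-ln(1 - F))

    # The Weibull CDF linearizes to v = shape * u - shape * ln(scale),
    # so an ordinary least-squares line of v on u gives both parameters.
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    sxy = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    sxx = sum((u - mean_u) ** 2 for u in us)
    slope = sxy / sxx
    intercept = mean_v - slope * mean_u

    shape = slope
    scale = math.exp(-intercept / slope)
    return shape, scale


# Synthetic check: points placed exactly on Weibull(shape=2, scale=100)
# quantiles at the Benard ranks are recovered up to floating-point rounding.
n = 10
exact = [100.0 * (-math.log(1.0 - (i - 0.3) / (n + 0.4))) ** 0.5
         for i in range(1, n + 1)]
print(weibull_median_rank(exact))
```

Regressing ln(-ln(1 - F)) on ln(x) is only one common convention; some references regress in the other direction, which gives slightly different estimates on real data.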

Returns:
DataFrames

Fitting results, structured as follows:

  • NAME: name of distribution parameters.

  • VALUE: value of distribution parameters.

Fitting statistics, structured as follows:

  • STAT_NAME: name of statistics.

  • STAT_VALUE: value of statistics.

Examples

Original data:

>>> df.collect()
     DATA
0    71.0
1    83.0
2    92.0
3   104.0
4   120.0
5   134.0
6   138.0
7   146.0
8   181.0
9   191.0
10  206.0
11  226.0
12  276.0
13  283.0
14  291.0
15  332.0
16  351.0
17  401.0
18  466.0

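Fit a Weibull distribution using maximum likelihood estimation:

>>> from hana_ml.algorithms.pal.stats import distribution_fit
>>> res, stats = distribution_fit(df, distr_type='weibull', optimal_method='maximum_likelihood')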
>>> res.collect()
               NAME    VALUE
0  DISTRIBUTIONNAME  WEIBULL
1             SCALE    244.4
2             SHAPE  2.06698
>>> stats.collect()
Empty DataFrame
Columns: [STAT_NAME, STAT_VALUE]
Index: []
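
As a cross-check that does not require a HANA connection, the following minimal sketch computes the standard maximum-likelihood estimate for an uncensored two-parameter Weibull in plain Python, using the same measurements as the example. It is not PAL's implementation. The function name weibull_mle and the bisection solver are assumptions chosen for illustration. If PAL computes the standard estimate, the printed values should be close to the SHAPE and SCALE shown above.

```python
import math


def weibull_mle(samples, tol=1e-10):
    """Standard maximum-likelihood fit of a two-parameter Weibull.

    Illustrative sketch only; assumes positive, uncensored samples
    that are not all identical.
    """
    xs = [float(x) for x in samples]
    n = len(xs)
    x_max = max(xs)
    logs = [math.log(x) for x in xs]
    mean_log = sum(logs) / n
    # Work with x / max(x) so that powers never overflow.
    ratios = [x / x_max for x in xs]

    def score(k):
        # Profile likelihood equation for the shape k:
        # sum(x^k ln x) / sum(x^k) - 1/k - mean(ln x) = 0
        w = [r ** k for r in ratios]
        weighted = sum(wi * li for wi, li in zip(w, logs)) / sum(w)
        return weighted - 1.0 / k - mean_log

    # score(k) increases with k, so plain bisection finds the root.
    lo, hi = 1e-3, 1e3
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    shape = 0.5 * (lo + hi)

    # Given the shape, the scale has a closed form.
    scale = x_max * (sum(r ** shape for r in ratios) / n) ** (1.0 / shape)
    return shape, scale


data = [71, 83, 92, 104, 120, 134, 138, 146, 181, 191,
        206, 226, 276, 283, 291, 332, 351, 401, 466]
shape, scale = weibull_mle(data)
print("SHAPE", shape)
print("SCALE", scale)
```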